# Parallel Computing

##### Experimental html version of downloadable textbook, see https://theartofhpc.com
$% mathjax inclusion. \newcommand\inv{^{-1}}\newcommand\invt{^{-t}} \newcommand\bbP{\mathbb{P}} \newcommand\bbR{\mathbb{R}} \newcommand\defined{ \mathrel{\lower 5pt \hbox{{\equiv\atop\mathrm{\scriptstyle D}}}}} \newcommand\macro[1]{\langle#1\rangle} \newcommand\dtdxx{\frac{\alpha\Delta t}{\Delta x^2}} \newcommand\toprule{\hline} \newcommand\midrule{\hline} \newcommand\bottomrule{\hline} \newcommand\nobreak{} \newcommand\multicolumn[3]{#3}$ 2.1 : Introduction
2.1.1 : Functional parallelism versus data parallelism
2.1.2 : Parallelism in the algorithm versus in the code
2.2 : Theoretical concepts
2.2.1 : Definitions
2.2.1.1 : Speedup and efficiency
2.2.1.2 : Cost-optimality
2.2.2 : Asymptotics
2.2.3 : Amdahl's law
2.2.3.1 : Amdahl's law with communication overhead
2.2.3.2 : Gustafson's law
2.2.3.3 : Amdahl's law and hybrid programming
2.2.4 : Critical path and Brent's theorem
2.2.5 : Scalability
2.2.5.1 : Iso-efficiency
2.2.5.2 : Precisely what do you mean by scalable?
2.2.6 : Simulation scaling
2.2.7 : Other scaling measures
2.2.8 : Concurrency; asynchronous and distributed computing
2.3 : Parallel Computers Architectures
2.3.1 : SIMD
2.3.1.1 : Pipelining and pipeline processors
2.3.1.2 : True SIMD in CPUs and GPUs
2.3.2 : MIMD / SPMD computers
2.3.3 : The commoditization of supercomputers
2.4 : Different types of memory access
2.4.1 : Symmetric Multi-Processors: Uniform Memory Access
2.4.2 : Non-Uniform Memory Access
2.4.2.1 : Affinity
2.4.2.2 : Coherence
2.4.3 : Logically and physically distributed memory
2.5 : Granularity of parallelism
2.5.1 : Data parallelism
2.5.2 : Instruction-level parallelism
2.5.3 : Task-level parallelism
2.5.4 : Conveniently parallel computing
2.5.5 : Medium-grain data parallelism
2.5.6 : Task granularity
2.6 : Parallel programming
2.6.1 : Thread parallelism
2.6.1.1 : The fork-join mechanism
2.6.1.2 : Hardware support for threads
2.6.1.3 : Threads example
2.6.1.4 : Contexts
2.6.1.5 : Race conditions, thread safety, and atomic operations
2.6.1.6 : Memory models and sequential consistency
2.6.1.7 : Affinity
2.6.1.8 : Cilk Plus
2.6.2 : OpenMP
2.6.2.1 : OpenMP examples
2.6.3 : Distributed memory programming through message passing
2.6.3.1 : The global versus the local view in distributed programming
2.6.3.2 : Blocking and non-blocking communication
2.6.3.3 : The MPI library
2.6.3.4 : Blocking
2.6.3.5 : Collective operations
2.6.3.6 : Non-blocking communication
2.6.3.7 : MPI version 1 and 2 and 3
2.6.3.8 : One-sided communication
2.6.4 : Hybrid shared/distributed memory computing
2.6.5 : Parallel languages
2.6.5.1 : Discussion
2.6.5.2 : Unified Parallel C
2.6.5.3 : High Performance Fortran
2.6.5.4 : Co-array Fortran
2.6.5.5 : Chapel
2.6.5.6 : Fortress
2.6.5.7 : X10
2.6.5.8 : Linda
2.6.5.9 : The Global Arrays library
2.6.6 : OS-based approaches
2.6.7 : Active messages
2.6.8 : Bulk synchronous parallelism
2.6.9 : Data dependencies
2.6.9.1 : Types of data dependencies
2.6.9.2 : Parallelizing nested loops
2.6.10 : Program design for parallelism
2.6.10.1 : Parallel data structures
2.6.10.2 : Latency hiding
2.7 : Topologies
2.7.1 : Some graph theory
2.7.2 : Busses
2.7.3 : Linear arrays and rings
2.7.4 : 2D and 3D arrays
2.7.5 : Hypercubes
2.7.5.1 : Embedding grids in a hypercube
2.7.6 : Switched networks
2.7.6.1 : Cross bar
2.7.6.2 : Butterfly exchange
2.7.6.3 : Fat-trees
2.7.6.4 : Over-subscription and contention
2.7.7 : Cluster networks
2.7.7.1 : Case study: Stampede
2.7.7.2 : Case study: Cray Dragonfly networks
2.7.8 : Bandwidth and latency
2.7.9 : Locality in parallel computing
2.8 : Multi-threaded architectures
2.9 : Co-processors, including GPUs
2.9.1 : A little history
2.9.2 : Bottlenecks
2.9.3 : GPU computing
2.9.3.1 : SIMD-type programming with kernels
2.9.3.2 : GPUs versus CPUs
2.9.3.3 : Expected benefit from GPUs
2.9.4 : Intel Xeon Phi
2.10 : Load balancing
2.10.1 : Load balancing versus data distribution
2.10.2 : Load scheduling
2.10.3 : Load balancing of independent tasks
2.10.4 : Load balancing as graph problem
2.10.5 : Load redistributing
2.10.5.1 : Diffusion load balancing
2.10.5.2 : Load balancing with space-filling curves
2.11 : Remaining topics
2.11.1 : Distributed computing, grid computing, cloud computing
2.11.2 : Usage scenarios
2.11.3 : Characterization
2.11.4 : Capability versus capacity computing
2.11.5 : MapReduce
2.11.5.1 : Expressive power of the MapReduce model
2.11.5.2 : Mapreduce software
2.11.5.3 : Implementation issues
2.11.5.4 : Functional programming
2.11.6 : The top500 list
2.11.6.1 : The top500 list as a recent history of supercomputing
2.11.7 : Heterogeneous computing

# 2 Parallel Computing

The largest and most powerful computers are sometimes called supercomputers'. For the last two decades, this has, without exception, referred to parallel computers: machines with more than one CPU that can be set to work on the same problem.

Parallelism is hard to define precisely, since it can appear on several levels. In the previous chapter you already saw how inside a CPU several instructions can be in flight' simultaneously. This is called instruction-level parallelism  , and it is outside explicit user control: it derives from the compiler and the CPU deciding which instructions, out of a single instruction stream, can be processed simultaneously. At the other extreme is the sort of parallelism where more than one instruction stream is handled by multiple processors, often each on their own circuit board. This type of parallelism is typically explicitly scheduled by the user.

In this chapter, we will analyze this more explicit type of parallelism, the hardware that supports it, the programming that enables it, and the concepts that analyze it.

## 2.1 Introduction

crumb trail: > parallel > Introduction

In scientific codes, there is often a large amount of work to be done, and it is often regular to some extent, with the same operation being performed on many data. The question is then whether this work can be sped up by use of a parallel computer. If there are $n$ operations to be done, and they would take time $t$ on a single processor, can they be done in time $t/p$ on $p$ processors?

Let us start with a very simple example. Adding two vectors of length $n$


for (i=0; i


can be done with up to $n$ processors. In the idealized case with $n$ processors, each processor has local scalars a,b,c and executes

FIGURE 2.1: Parallelization of a vector addition.

the single instruction a=b+c  . This is depicted in figure  2.1  .

In the general case, where each processor executes something like


for (i=my_low; i


execution time is linearly reduced with the number of processors. If each operation takes a unit time, the original algorithm takes time $n$, and the parallel execution on $p$ processors $n/p$. The parallel algorithm is faster by a factor of $p$%

(Note: {Here we ignore lower order errors in this result when $p$ does not divide perfectly in $n$. We will also, in general, ignore matters of loop overhead.}  )

.

Next, let us consider summing the elements of a vector. (An operation that has a vector as input but only a scalar as output is often called a reduction  .) We again assume that each processor contains just a single array element. The sequential code:


s = 0;
for (i=0; i


is no longer obviously parallel, but if we recode the loop as


for (s=2; s<2*n; s*=2)
for (i=0; i


there is a way to parallelize it: every iteration of the outer loop is now a loop that can be done by $n/s$ processors in parallel. Since the

FIGURE 2.2: Parallelization of a vector reduction.

outer loop will go through $\log_2n$ iterations, we see that the new algorithm has a reduced runtime of $n/p\cdot\log_2 n$. The parallel algorithm is now faster by a factor of $p/\log_2n$. This is depicted in figure  2.2  .

Even from these two simple examples we can see some of the characteristics of parallel computing:

• Sometimes algorithms need to be rewritten slightly to make them parallel.

• A parallel algorithm may not show perfect speedup.

There are other things to remark on. In the first case, if each processors has its $x_i,y_i$ in a local store the algorithm can be executed without further complications. In the second case, processors need to communicate data among each other and we haven't assigned a cost to that yet.

First let us look systematically at communication. We can take the parallel algorithm in the right half of figure  2.2 and turn it into a tree graph (see Appendix  app:graph  ) by defining the inputs as leave nodes, all partial sums as interior nodes, and the root as the total sum. There is an edge from one node to another if the first is input to the (partial) sum in the other. This is illustrated in figure  2.3  . In this figure nodes are horizontally aligned with other computations that can be performed simultaneously; each level is sometimes called a superstep in the computation. Nodes are vertically aligned if they are computed on the same processors, and an arrow corresponds to a communication if it goes from one processor to another.

FIGURE 2.3: Communication structure of a parallel vector reduction.

The vertical alignment in figure  2.3 is not the only one possible. If nodes are shuffled within a superstep or horizontal level, a different communication pattern arises.

Exercise Consider placing the nodes within a superstep on random processors. Show that, if no two nodes wind up on the same processor, at most twice the number of communications is performed from the case in figure  2.3  .
End of exercise

Exercise Can you draw the graph of a computation that leaves the sum result on each processor? There is a solution that takes twice the number of supersteps, and there is one that takes the same number. In both cases the graph is no longer a tree, but a more general DAG  .
End of exercise

Processors are often connected through a network, and moving data through this network takes time. This introduces a concept of distance between the processors. In figure  2.3  , where the processors are linearly ordered, this is related to their rank in the ordering. If the network only connects a processor with its immediate neighbors, each iteration of the outer loop increases the distance over which communication takes place.

Exercise

Assume that an addition takes a certain unit time, and that moving a number from one processor to another takes that same unit time. Show that the communication time equals the computation time.

Now assume that sending a number from processor $p$ to $p\pm k$ takes time $k$. Show that the execution time of the parallel algorithm now is of the same order as the sequential time.
End of exercise

The summing example made the unrealistic assumption that every processor initially stored just one vector element: in practice we will have $p Exercise Consider the case of summing 8 elements with 4 processors. Show that some of the edges in the graph of figure 2.3 no longer correspond to actual communications. Now consider summing 16 elements with, again, 4 processors. What is the number of communication edges this time? End of exercise These matters of algorithm adaptation, efficiency, and communication, are crucial to all of parallel computing. We will return to these issues in various guises throughout this chapter. ### 2.1.1 Functional parallelism versus data parallelism crumb trail: > parallel > Introduction > Functional parallelism versus data parallelism From the above introduction we can describe parallelism as finding independent operations in the execution of a program. In all of the examples these independent operations were in fact identical operations, but applied to different data items. We could call this data parallelism : the same operation is applied in parallel to many data elements. This is in fact a common scenario in scientific computing: parallelism often stems from the fact that a data set (vector, matrix, graph,\ldots) is spread over many processors, each working on its part of the data. The term data parallelism is traditionally mostly applied if the operation is a single instruction; in the case of a subprogram it is often called task parallelism . It is also possible to find independence, not based on data elements, but based on the instructions themselves. Traditionally, compilers analyze code in terms of ILP : independent instructions can be given to separate floating point units, or reordered, for instance to optimize register usage (see also section 2.5.2 ). ILP is one case of functional parallelism ; on a higher level, functional parallelism can be obtained by considering independent subprograms, often called task parallelism ; see section 2.5.3 . Some examples of functional parallelism are Monte Carlo simulations, and other algorithms that traverse a parametrized search space, such as boolean satisfyability problems. ### 2.1.2 Parallelism in the algorithm versus in the code crumb trail: > parallel > Introduction > Parallelism in the algorithm versus in the code Often we are in the situation that we want to parallelize an algorithm that has a common expression in sequential form. In some cases, this sequential form is straightforward to parallelize, such as in the vector addition discussed above. In other cases there is no simple way to parallelize the algorithm; we will discuss linear recurrences in section 6.9.2 . And in yet another case the sequential code may look not parallel, but the algorithm actually has parallelism. Exercise  for i in [1:N]: x[0,i] = some_function_of(i) x[i,0] = some_function_of(i) for i in [1:N]: for j in [1:N]: x[i,j] = x[i-1,j]+x[i,j-1]  Answer the following questions about the double i,j loop: 1. Are the iterations of the inner loop independent, that is, could they be executed simultaneously? 2. Are the iterations of the outer loop independent? 3. If x[1,1] is known, show that x[2,1] and x[1,2] can be computed independently. 4. Does this give you an idea for a parallelization strategy? End of exercise We will discuss the solution to this conundrum in section 6.9.1 . In general, the whole of chapter High performance linear algebra will be about the amount of parallelism intrinsic in scientific computing algorithms. ## 2.2 Theoretical concepts crumb trail: > parallel > Theoretical concepts There are two important reasons for using a parallel computer: to have access to more memory or to obtain higher performance. It is easy to characterize the gain in memory, as the total memory is the sum of the individual memories. The speed of a parallel computer is harder to characterize. This section will have an extended discussion on theoretical measures for expressing and judging the gain in execution speed from going to a parallel architecture. ### 2.2.1 Definitions crumb trail: > parallel > Theoretical concepts > Definitions #### 2.2.1.1 Speedup and efficiency crumb trail: > parallel > Theoretical concepts > Definitions > Speedup and efficiency A simple approach to defining speedup is to let the same program run on a single processor, and on a parallel machine with$p$processors, and to compare runtimes. With$T_1$the execution time on a single processor and$T_p$the time on$p$processors, we define the speedup as$S_p=T_1/T_p$. (Sometimes$T_1$is defined as the best time to solve the problem on a single processor', which allows for using a different algorithm on a single processor than in parallel.) In the ideal case,$T_p=T_1/p$, but in practice we don't expect to attain that, so$S_p\leq p$. To measure how far we are from the ideal speedup, we introduce the efficiency$E_p=S_p/p$. Clearly,$0< E_p\leq 1$. There is a practical problem with the above definitions: a problem that can be solved on a parallel machine may be too large to fit on any single processor. Conversely, distributing a single processor problem over many processors may give a distorted picture since very little data will wind up on each processor. Below we will discuss more realistic measures of speed-up. There are various reasons why the actual speed is less than$p$. For one, using more than one processor necessitates communication and synchronization, which is overhead that was not part of the original computation. Secondly, if the processors do not have exactly the same amount of work to do, they may be idle part of the time (this is known as load unbalance ), again lowering the actually attained speedup. Finally, code may have sections that are inherently sequential. Communication between processors is an important source of a loss of efficiency. Clearly, a problem that can be solved without communication will be very efficient. Such problems, in effect consisting of a number of completely independent calculations, is called embarrassingly parallel (or conveniently parallel ; see section 2.5.4 ); it will have close to a perfect speedup and efficiency. Exercise The case of speedup larger than the number of processors is called superlinear speedup . Give a theoretical argument why this can never happen. End of exercise In practice, superlinear speedup can happen. For instance, suppose a problem is too large to fit in memory, and a single processor can only solve it by swapping data to disc. If the same problem fits in the memory of two processors, the speedup may well be larger than$2$since disc swapping no longer occurs. Having less, or more localized, data may also improve the cache behavior of a code. A form of superlinear speedup can also happen in search algorithms. Imagine that each processor starts in a different location of the search space; now it can happen that, say, processor 3 immediately finds the solution. Sequentially, all the possibilities for processors 1 and 2 would have had to be traversed. The speedup is much greater than 3 in this case. #### 2.2.1.2 Cost-optimality crumb trail: > parallel > Theoretical concepts > Definitions > Cost-optimality In cases where the speedup is not perfect we can define overhead as the difference $$T_o = pT_p-T1.$$ We can also interpret this as the difference between simulating the parallel algorithm on a single processor, and the actual best sequential algorithm. We will later see two different types of overhead: 1. The parallel algorithm can be essentially different from the sequential one. For instance, sorting algorithms have a complexity$O(n\log n)$, but the parallel bitonic sort (section 8.6 ) has complexity$O(n\log^2n)$. 2. The parallel algorithm can have overhead derived from the process or parallelizing, such as the cost of sending messages. As an example, section 6.2.2 analyzes the communication overhead in the matrix-vector product. A parallel algorithm is called cost-optimal if the overhead is at most of the order of the running time of the sequential algorithm. Exercise The definition of overhead above implicitly assumes that overhead is not parallelizable. Discuss this assumption in the context of the two examples above. End of exercise ### 2.2.2 Asymptotics crumb trail: > parallel > Theoretical concepts > Asymptotics If we ignore limitations such as that the number of processors has to be finite, or the physicalities of the interconnect between them, we can derive theoretical results on the limits of parallel computing. This section will give a brief introduction to such results, and discuss their connection to real life high performance computing. Consider for instance the matrix-matrix multiplication$C=AB$, which takes$2N^3$operations where$N$is the matrix size. Since there are no dependencies between the operations for the elements of~$C$, we can perform them all in parallel. If we had$N^2$processors, we could assign each to an$(i,j)$coordinate in~$C$, and have it compute$c_{ij}$in$2N$time. Thus, this parallel operation has efficiency~$1$, which is optimal. Exercise Show that this algorithm ignores some serious issues about memory usage: • If the matrix is kept in shared memory, how many simultaneous reads from each memory locations are performed? • If the processors keep the input and output to the local computations in local storage, how much duplication is there of the matrix elements? End of exercise Adding$N$numbers$\{x_i\}_{i=1\ldots N}$can be performed in$\log_2 N$time with$N/2$processors. If we have$n/2$processors we could compute: 1. Define$s^{(0)}_i = x_i$. 2. Iterate with$j=1,\ldots,\log_2 n$: 3. Compute$n/2^j$partial sums$s^{(j)}_i=s^{(j-1)}_{2i}+s^{(j-1)}_{2i+1}$We see that the$n/2$processors perform a total of$n$operations (as they should) in$\log_2n$time. The efficiency of this parallel scheme is~$O(1/\log_2n)$, a slowly decreasing function of~$n$. Exercise Show that, with the scheme for parallel addition just outlined, you can multiply two matrices in$\log_2 N$time with$N^3/2$processors. What is the resulting efficiency? End of exercise It is now a legitimate theoretical question to ask • If we had infinitely many processors, what is the lowest possible time complexity for matrix-matrix multiplication, or • Are there faster algorithms that still have$O(1)$efficiency? Such questions have been researched (see for instance~ [He:surveyparallel] ), but they have little bearing on high performance computing. A first objection to these kinds of theoretical bounds is that they implicitly assume some form of shared memory. In fact, the formal model for these algorithms is called a PRAM , where the assumption is that every memory location is accessible to any processor. Often an additional assumption is made that multiple simultaneous accesses to the same location are in fact possible. Since write and write accesses have a different behavior in practice, there is the concept of CREW-PRAM, for Concurrent Read, Exclusive Write PRAM. The basic assumptions of the PRAM model are unrealistic in practice, especially in the context of scaling up the problem size and the number of processors. A~further objection to the PRAM model is that even on a single processor it ignores the memory hierarchy; section~ 1.3 . But even if we take distributed memory into account, theoretical results can still be unrealistic. The above summation algorithm can indeed work unchanged in distributed memory, except that we have to worry about the distance between active processors increasing as we iterate further. If the processors are connected by a linear array, the number of hops' between active processors doubles, and with that, asymptotically, the computation time of the iteration. The total execution time then becomes~$n/2$, a disappointing result given that we throw so many processors at the problem. What if the processors are connected with a hypercube topology (section~ 2.7.5 )? It is not hard to see that the summation algorithm can then indeed work in$\log_2n$time. However, as$n\rightarrow\infty$, can we physically construct a sequence of hypercubes of$n$nodes and keep the communication time between two connected constant? Since communication time depends on latency, which partly depends on the length of the wires, we have to worry about the physical distance between nearest neighbors. The crucial question here is whether the hypercube (an$n$-dimensional object) can be embedded in 3-dimensional space, while keeping the distance (measured in meters) constant between connected neighbors. It is easy to see that a 3-dimensional grid can be scaled up arbitrarily while maintaining a unit wire length, but the question is not clear for a hypercube. There, the length of the wires may have to increase as$n$grows, which runs afoul of the finite speed of electrons. We sketch a proof (see~ [Fisher:fastparallel] for more details) that, in our three dimensional world and with a finite speed of light, speedup is limited to$\sqrt[4]{n}$for a problem on$n$processors, no matter the interconnect. The argument goes as follows. Consider an operation that involves collecting a final result on one processor. Assume that each processor takes a unit volume of space, produces one result per unit time, and can send one data item per unit time. Then, in an amount of time~$t$, at most the processors in a ball with radius~$t$, that is,$O(t^3)$processors total, can contribute to the final result; all others are too far away. In time~$T$, then, the number of operations that can contribute to the final result is$\int_0^T t^3dt=O(T^4)$. This means that the maximum achievable speedup is the fourth root of the sequential time. Finally, the question what if we had infinitely many processors' is not realistic as such, but we will allow it in the sense that we will ask the weak scaling question (section~ 2.2.5 ) what if we let the problem size and the number of processors grow proportional to each other'. This question is legitimate, since it corresponds to the very practical deliberation whether buying more processors will allow one to run larger problems, and if so, with what bang for the buck'. ### 2.2.3 Amdahl's law crumb trail: > parallel > Theoretical concepts > Amdahl's law One reason for less than perfect speedup is that parts of a code can be inherently sequential. This limits the parallel efficiency as follows. Suppose that$5\%$of a code is sequential, then the time for that part can not be reduced, no matter how many processors are available. Thus, the speedup on that code is limited to a factor Law} [amd:law] , which we will now formulate. FIGURE 2.4: Sequential and parallel time in Amdahl's analysis. Let$F_s$be the sequential fraction and$F_p$be the parallel fraction (or more strictly: the parallelizable' fraction) of a code, respectively. Then$F_p+F_s=1$. The parallel execution time$T_p$on$p$processors is the sum of the part that is sequential$T_1F_s$and the part that can be parallelized$T_1F_p/P$: $$T_P=T_1(F_s+F_p/P). \label{eq:amdahl}$$ (see figure 2.4 ) As the number of processors grows$P\rightarrow\infty$, the parallel execution time now approaches that of the sequential fraction of the code:$T_P\downarrow T_1F_s$. We conclude that speedup is limited by$S_P\leq 1/F_s$and efficiency is a decreasing function$E\sim 1/P$. The sequential fraction of a code can consist of things such as I/O operations. However, there are also parts of a code that in effect act as sequential. Consider a program that executes a single loop, where all iterations can be computed independently. Clearly, this code offers no obstacles to parallelization. However by splitting the loop in a number of parts, one per processor, each processor now has to deal with loop overhead: calculation of bounds, and the test for completion. This overhead is replicated as many times as there are processors. In effect, loop overhead acts as a sequential part of the code. Exercise Let's do a specific example. Assume that a code has a setup that takes 1 second and a parallelizable section that takes 1000 seconds on one processor. What are the speedup and efficiency if the code is executed with 100 processors? What are they for 500 processors? Express your answer to at most two significant digits. End of exercise Exercise Investigate the implications of Amdahl's law: if the number of processors$P$increases, how does the parallel fraction of a code have to increase to maintain a fixed efficiency? End of exercise #### 2.2.3.1 Amdahl's law with communication overhead In a way, Amdahl's law, sobering as it is, is even optimistic. Parallelizing a code will give a certain speedup, but it also introduces communication overhead that will lower the speedup attained. Let us refine our model of 2.4 \cite[p. 367]{Landau:comp-phys}): $$T_p= T_1(F_s+F_p/P) +T_c,$$ where$T_c$is a fixed communication time. To assess the influence of this communication overhead, we assume that the code is fully parallelizable, that is,$F_p=1$. We then find that $$S_p=\frac{T_1}{T_1/p+T_c}. \label{eq:amdahl-comm}$$ For this to be close to$p$, we need$T_c\ll T_1/p$or$p\ll T_1/T_c$. In other words, the number of processors should not grow beyond the ratio of scalar execution time and communication overhead. #### 2.2.3.2 Gustafson's law crumb trail: > parallel > Theoretical concepts > Amdahl's law > Gustafson's law Amdahl's law was thought to show that large numbers of processors would never pay off. However, the implicit assumption in Amdahl's law is that there is a fixed computation which gets executed on more and more processors. In practice this is not the case: typically there is a way of scaling up a problem (in chapter Numerical treatment of differential equations you will learn the concept of discretization'), and one tailors the size of the problem to the number of available processors. A more realistic assumption would be to say that there is a sequential fraction independent of the problem size, and a parallel fraction that can be arbitrarily replicated. To formalize this, instead of starting with the execution time of the sequential program, let us start with the execution time of the parallel program, and say that $$T_p=T(F_s+F_p) \qquad\hbox{with F_s+F_p=1}.$$ Now we have two possible definitions of$T_1$. First of all, there is the$T_1$you get from setting$p=1$in$T_p$. (Convince yourself that that is actually the same as$T_p$.) However, what we need is$T_1$describing the time to do all the operations of the parallel program. FIGURE 2.5: Sequential and parallel time in Gustafson's analysis. (See figure 2.5 .) This is: $$T_1=F_sT+p\cdot F_pT.$$ This gives us a speedup of $$S_p=\frac{T_1}{T_p}=\frac{F_s+p\cdot F_p}{F_s+F_p} = F_s+p\cdot F_p = p-(p-1)\cdot F_s. \label{eq:gustaf-s}$$ From this formula we see that: • Speedup is still bounded by$p$; • \ldots but it's a positive number; • for a given$p$it's again a decreasing function of the sequential fraction. Exercise 2.5 and$F_p$. What is the asymptotic behavior of the efficiency$E_p$? End of exercise As with Amdahl's law, we can investigate the behavior of Gustafson's law if we include communication overhead. Let's go back to 2.2.3.1 problem, and approximate it as $$S_p = p(1-\frac{T_c}{T_1}p).$$ Now, under the assumption of a problem that is gradually being scaled up,$T_c$and$T_1$become functions of$p$. We see that, if$T_1(p)\sim pT_c(p)$, we get linear speedup that is a constant fraction away from$1$. As a general discussion we can not take this analysis further; in section 6.2.2 you'll see a detailed analysis of an example. #### 2.2.3.3 Amdahl's law and hybrid programming Above, you learned about hybrid programming, a mix between distributed and shared memory programming. This leads to a new form of Amdahl's law. Suppose we have$p$nodes with$c$cores each, and$F_p$describes the fraction of the code that uses$c$-way thread parallelism. We assume that the whole code is fully parallel over the$p$nodes. The ideal speed up would be$p c$, and the ideal parallel running time$T_1/(pc)$, but the actual running time is $$T_{p,c} = T_1 \left(\frac {F_s}{p} + \frac{F_p}{p c}\right) = \frac{T_1}{pc}\left( F_sc+F_p\right) = \frac{T_1}{pc}\left( 1+ F_s(c-1)\right).$$ Exercise Show that the speedup$T_1/T_{p,c}$can be approximated by$p/F_s$. End of exercise In the original Amdahl's law, speedup was limited by the sequential portion to a fixed number$1/F_s$, in hybrid programming it is limited by the task parallel portion to$p/F_s$. ### 2.2.4 Critical path and Brent's theorem crumb trail: > parallel > Theoretical concepts > Critical path and Brent's theorem The above definitions of speedup and efficiency, and the discussion of Amdahl's law and Gustafson's law, made an implicit assumption that parallel work can be arbitrarily subdivided. As you saw in the summing example in section 2.1 , this may not always be the case: there can be dependencies between operations, meaning that one operation depends on an earlier in the sense of needing its result as input. Dependent operations can not be executed simultaneously, so they limit the amount of parallelism that can be employed. We define the critical path as a (possibly non-unique) chain of dependencies of maximum length. (This length is sometimes known as span .) Since the tasks on a critical path need to be executed one after another, the length of the critical path is a lower bound on parallel execution time. To make these notions precise, we define the following concepts: Definition $$\begin{array}{l@{\colon}l} T_1&\hbox{the time the computation takes on a single processor}\\ T_p&\hbox{the time the computation takes with p processors}\\ T_\infty&\hbox{the time the computation takes if unlimited processors are available}\\ P_\infty&\hbox{the value of p for which T_p=T_\infty} \end{array}$$ End of definition With these concepts, we can define the average parallelism of an algorithm as$T_1/T_\infty$, and the length of the critical path is$T_\infty$. We will now give a few illustrations by showing a graph of tasks and their dependencies. We assume for simplicity that each node is a unit time task. {\textwidth} {.25\textwidth} {.75\textwidth} The maximum number of processors that can be used is 2 and the average parallelism is$4/3$: $$\begin{array}{l} T_1=4,\quad T_\infty=3 \quad\Rightarrow T_1/T_\infty=4/3\\ T_2=3,\quad S_2=4/3,\quad E_2=2/3\\ P_\infty=2 \end{array}$$ {\textwidth} {.25\textwidth} {.75\textwidth} The maximum number of processors that can be used is 3 and the average parallelism is$9/5$; efficiency is maximal for$p=2$: $$\begin{array}{l} T_1=9,\quad T_\infty=5 \quad\Rightarrow T_1/T_\infty=9/5\\ T_2=6,\quad S_2=3/2,\quad E_2=3/4\\ T_3=5,\quad S_3=9/5,\quad E_3=3/5\\ P_\infty=3 \end{array}$$ {\textwidth} {.4\textwidth} {.6\textwidth} The maximum number of processors that can be used is 4 and that is also the average parallelism; the figure illustrates a parallelization with$P=3$that has efficiency$\equiv1$: $$\begin{array}{l} T_1=12,\quad T_\infty=4 \quad\Rightarrow T_1/T_\infty=3\\ T_2=6,\quad S_2=2,\quad E_2=1\\ T_3=4,\quad S_3=3,\quad E_3=1\\ T_4=3,\quad S_4=4,\quad E_4=1\\ P_\infty=4 \end{array}$$ Based on these examples, you probably see that there are two extreme cases: • If every task depends on precisely on other, you get a chain of dependencies, and$T_p=T_1$for any$p$. • On the other hand, if all tasks are independent (and$p$divides their number) you get$T_p=T_1/p$for any$p$. • In a slightly less trivial scenario than the previous, consider the case where the critical path is of length$m$, and in each of these$m$steps there are$p-1$independent tasks, or at least: dependent only on tasks in the previous step. There will then be perfect parallelism in each of the$m$steps, and we can express$T_p = T_1/p$or$T_p= m+ (T_1-m)/p$. That last statement actually holds in general. This is known as Brent's theorem : Theorem Let$m$be the total number of tasks,$p$the number of processors, and$t$the length of a critical path . Then the computation can be done in $$T_p \leq t +\frac{m-t}{p}.$$ End of theorem Proof Divide the computation in steps, such that tasks in step$i+1$are independent of each other, and only dependent on step$i$. Let$s_i$be the number of tasks in step$i$, then the time for that step is$\lceil \frac{s_i}{p} \rceil$. Summing over$i$gives $$T_p = \sum_i^t \lceil \frac{s_i}{p} \rceil \leq \sum_i^t \frac{s_i+p-1}{p} = t + \sum_i^t \frac{s_i-1}{p} = t+\frac{m-t}{p}.$$ End of proof Exercise Consider a tree of depth$d$, that is, with$2^d-1$nodes, and a search $$\max_{n\in\mathrm{nodes}} f(n).$$ Assume that all nodes need to be visited: we have no knowledge or any ordering on their values. Analyze the parallel running time on$p$processors, where you may assume that$p=2^q$, with$q
End of exercise

Exercise Apply Brent's theorem to Gaussian elimination, assuming that add/multiply/division all take one unit time.

Describe the critical path and give the length. What is the resulting upper bound on the parallel runtime?

How many processors could you theoretically use? What speedup and efficiency does that give?
End of exercise

### 2.2.5 Scalability

crumb trail: > parallel > Theoretical concepts > Scalability

Above, we remarked that splitting a given problem over more and more processors does not make sense: at a certain point there is just not enough work for each processor to operate efficiently. Instead, in practice users of a parallel code will either choose the number of processors to match the problem size, or they will solve a series of increasingly larger problems on correspondingly growing numbers of processors. In both cases it is hard to talk about speedup. Instead, the concept of scalability is used.

We distinguish two types of scalability. So-called strong scalability is in effect the same as speedup as discussed above. We say that a problem shows strong scalability if, partitioned over more and more processors, it shows perfect or near perfect speedup, that is, the execution time goes down linearly with the number of processors. In terms of efficiency we can describe this as:

$$\left. \begin{array}{l} N\equiv\mathrm{constant}\\ P\rightarrow\infty \end{array} \right\} \Rightarrow E_P\approx\mathrm{constant}$$

Typically, one encounters statements like this problem scales up to 500 processors', meaning that up to 500 processors the speedup will not noticeably decrease from optimal. It is not necessary for this problem to fit on a single processor: often a smaller number such as 64 processors is used as the baseline from which scalability is judged.

Exercise We can formulate strong scaling as a runtime that is inversely proportional to the number of processors:

$$t=c/p.$$

Show that on a log-log plot, that is, you plot the logarithm of the runtime against the logarithm of the number of processors, you will get a straight line with slope $-1$.

Can you suggest a way of dealing with a non-parallelizable section, that is, with a runtime $t=c_1+c_2/p$?
End of exercise

More interestingly, weak scalability describes the behavior of execution as problem size and number of processors both grow, but in such a way that the amount of work per processor stays constant. The term work' here is ambiguous: sometimes weak scaling is interpreted as keeping the amount of data constant, in other cases it's the number of operations that stays constant.

Measures such as speedup are somewhat hard to report, since the relation between the number of operations and the amount of data can be complicated. If this relation is linear, one could state that the amount of data per processor is kept constant, and report that parallel execution time is constant as the number of processors grows. (Can you think of applications where the relation between work and data is linear? Where it is not?)

In terms of efficiency:

$$\left. \begin{array}{l} N\rightarrow\infty\\ P\rightarrow\infty\\ M=N/P\equiv\mathrm{constant} \end{array} \right\} \Rightarrow E_P\approx\mathrm{constant}$$

Exercise Suppose you are investigating the weak scalability of a code. After running it for a couple of sizes and corresponding numbers of processes, you find that in each case the flop rate is roughly the same. Argue that the code is indeed weakly scalable.
End of exercise

Exercise In the above discussion we always implicitly compared a sequential algorithm and the parallel form of that same algorithm. However, in section  2.2.1 we noted that sometimes speedup is defined as a comparison of a parallel algorithm with the \textbf{best} sequential algorithm for the same problem. With that in mind, compare a parallel sorting algorithm with runtime $(\log n)^2$ (for instance, bitonic sort ; section  8.6  ) to the best serial algorithm, which has a running time of $n\log n$.

Show that in the weak scaling case of $n=p$ speedup is $p/\log p$. Show that in the strong scaling case speedup is a descending function of $n$.
End of exercise

Remark A historical anecdote.


Message:  1023110, 88 lines Posted:  5:34pm EST, Mon Nov
25/85, imported:  ....  Subject:  Challenge from Alan Karp
To:  Numerical-Analysis, ...  From GOLUB@SU-SCORE.ARPA

I have just returned from the Second SIAM Conference on
Parallel Processing for Scientific Computing in Norfolk,
Virginia.  There I heard about 1,000 processor systems,
4,000 processor systems, and even a proposed 1,000,000
processor system.  Since I wonder if such systems are the
best way to do general purpose, scientific computing, I am
making the following offer.

I will pay $100 to the first person to demonstrate a speedup of at least 200 on a general purpose, MIMD computer used for scientific computing. This offer will be withdrawn at 11:59 PM on 31 December 1995.  This was satisfied by scaling up the problem. End of remark #### 2.2.5.1 Iso-efficiency crumb trail: > parallel > Theoretical concepts > Scalability > Iso-efficiency In the definition of weak scalability above, we stated that, under some relation between problem size$N$and number of processors$P$, efficiency will stay constant. We can make this precise and define the iso-efficiency curve as the relation between$N,P$that gives constant efficiency [Grama:1993:isoefficiency] . #### 2.2.5.2 Precisely what do you mean by scalable? In scientific computing scalability is a property of an algorithm and the way it is parallelized on an architecture, in particular noting the way data is distributed. In computer industry parlance the term scalability' is sometimes applied to architectures or whole computer systems: A scalable computer is a computer designed from a small number of basic components, without a single bottleneck component, so that the computer can be incrementally expanded over its designed scaling range, delivering linear incremental performance for a well-defined set of scalable applications. General-purpose scalable computers provide a wide range of processing, memory size, and I/O resources. Scalability is the degree to which performance increments of a scalable computer are linear'' [Bell:outlook] . In particular, • horizontal scaling is adding more hardware components, such as adding nodes to a cluster; • vertical scaling corresponds to using more powerful hardware, for instance by doing a hardware upgrade. ### 2.2.6 Simulation scaling crumb trail: > parallel > Theoretical concepts > Simulation scaling In most discussions of weak scaling we assume that the amount of work and the amount of storage are linearly related. This is not always the case; for instance the operation complexity of a matrix-matrix product is$N^3$for$N^2$data. If you linearly increase the number of processors, and keep the data per process constant, the work may go up with a higher power. A similar effect comes into play if you simulate time-dependent PDEs . (This uses concepts from chapter Numerical treatment of differential equations .) Here, the total work is a product of the work per time step and the number of time steps. These two numbers are related; in section 4.1.2 you will see that the time step has a certain minimum size as a function of the space discretization. Thus, the number of time steps will go up as the work per time step goes up. Consider now applications such as weather prediction , and say that currently you need 4 hours of compute time to predict the next 24 hours of weather. The question is now: if you are happy with these numbers, and you buy a bigger computer to get a more accurate prediction, what can you say about the new computer? In other words, rather than investigating scalability from the point of the running of an algorithm, in this section we will look at the case where the simulated time$S$and the running time$T$are constant. Since the new computer is presumably faster, we can do more operations in the same amount of running time. However, it is not clear what will happen to the amount of data involved, and hence the memory required. To analyze this we have to use some knowledge of the math of the applications. Let$m$be the memory per processor, and$P$the number of processors, giving: $$M=Pm\qquad\hbox{total memory.}$$ If$d$is the number of space dimensions of the problem, typically 2 or 3, we get $$\Delta x = 1/M^{1/d}\qquad\hbox{grid spacing.}$$ For stability this limits the time step$\Delta t$to $$\Delta t= \begin{cases} \Delta x=1\bigm/M^{1/d}&\hbox{hyperbolic case}\\ \Delta x^2=1\bigm/M^{2/d}&\hbox{parabolic case} \end{cases}$$ (noting that the hyperbolic case was not discussed in chapter Numerical treatment of differential equations .) With a simulated time$S$, we find $$k=S/\Delta t\qquad \hbox{time steps.}$$ If we assume that the individual time steps are perfectly parallelizable, that is, we use explicit methods, or implicit methods with optimal solvers, we find a running time $$T=kM/P=\frac{S}{\Delta t}m.$$ Setting$T/S=C$, we find $$m=C\Delta t,$$ that is, the amount of memory per processor goes down as we increase the processor count. (What is the missing step in that last sentence?) Further analyzing this result, we find $$m=C\Delta t = c \begin{cases} 1\bigm/M^{1/d}&\hbox{hyperbolic case}\\ 1\bigm/M^{2/d}&\hbox{parabolic case} \end{cases}$$ Substituting$M=Pm$, we find ultimately $$m = C \begin{cases} 1\bigm/P^{1/(d+1)}&\hbox{hyperbolic}\\ 1\bigm/P^{2/(d+2)}&\hbox{parabolic} \end{cases}$$ that is, the memory per processor that we can use goes down as a higher power of the number of processors. Exercise Explore simulation scaling in the context of the Linpack benchmark , that is, Gaussian elimination. Ignore the system solving part and only consider the factorization part; assume that it can be perfectly parallelized. 1. Suppose you have a single core machine, and your benchmark run takes time$T$with$M$words of memory. Now you buy a processor twice as fast, and you want to do a benchmark run that again takes time$T$. How much memory do you need? 2. Now suppose you have a machine with$P$processors, each with$M$memory, and your benchmark run takes time$T$. You buy a machine with$2P$processors, of the same clock speed and core count, and you want to do a benchmark run, again taking time$T$. How much memory does each node take? End of exercise ### 2.2.7 Other scaling measures crumb trail: > parallel > Theoretical concepts > Other scaling measures Amdahl's law above was formulated in terms of the execution time on one processor. In many practical situations this is unrealistic, since the problems executed in parallel would be too large to fit on any single processor. Some formula manipulation gives us quantities that are to an extent equivalent, but that do not rely on this single-processor number [Moreland:formalmetrics2015] . For starters, applying the definition$S_p(n) = \frac{ T_1(n) }{ T_p(n) }$to strong scaling, we observe that$T_1(n)/n$is the sequential time per operation, and its inverse$n/T_1(n)$can be called the sequential computational rate , denoted$R_1(n)$. Similarly defining a parallel computational rate' $$R_p(n) = n/T_p(n)$$ we find that $$S_p(n) = R_p(n)/R_1(n)$$ In strong scaling$R_1(n)$will be a constant, so we make a logarithmic plot of speedup, purely based on measuring$T_p(n)$. ### 2.2.8 Concurrency; asynchronous and distributed computing Even on computers that are not parallel there is a question of the execution of multiple simultaneous processes. Operating systems typically have a concept of time slicing , where all active process are given command of the CPU for a small slice of time in rotation. In this way, a sequential can emulate a parallel machine; of course, without the efficiency. However, time slicing is useful even when not running a parallel application: OSs will have independent processes (your editor, something monitoring your incoming mail, et cetera) that all need to stay active and run more or less often. The difficulty with such independent processes arises from the fact that they sometimes need access to the same resources. The situation where two processes both need the same two resources, each getting hold of one, is called deadlock . A famous formalization of resource contention is known as the dining philosophers problem. The field that studies such as independent processes is variously known as concurrency , asynchronous computing , or distributed computing . The term concurrency describes that we are dealing with tasks that are simultaneously active, with no temporal ordering between their actions. The term distributed computing derives from such applications as database systems, where multiple independent clients need to access a shared database. We will not discuss this topic much in this book. Section 2.6.1 discusses the thread mechanism that supports time slicing; on modern multicore processors threads can be used to implement shared memory parallel computing. The book Communicating Sequential Processes' offers an analysis of the interaction between concurrent processes [Hoare:CSP] . Other authors use topology to analyze asynchronous computing [Herlihy:1999:topological] . ## 2.3 Parallel Computers Architectures crumb trail: > parallel > Parallel Computers Architectures For quite a while now, the top computers have been some sort of parallel computer, that is, an architecture that allows the simultaneous execution of multiple instructions or instruction sequences. One way of characterizing the various forms this can take is due to Flynn [flynn:taxonomy] . Flynn's taxonomy characterizes architectures by whether the data flow and control flow are shared or independent. The following four types result (see also figure 2.6 ): FIGURE 2.6: The four classes of the Flynn's taxonomy. • [SISD] Single Instruction Single Data: this is the traditional CPU architecture: at any one time only a single instruction is executed, operating on a single data item. • [SIMD] Single Instruction Multiple Data: in this computer type there can be multiple processors, each operating on its own data item, but they are all executing the same instruction on that data item. Vector computers (section 2.3.1.1 ) are typically also characterized as SIMD. • [MISD] Multiple Instruction Single Data. No architectures answering to this description exist; one could argue that redundant computations for safety-critical applications are an example of MISD. • [MIMD] Multiple Instruction Multiple Data: here multiple CPUs operate on multiple data items, each executing independent instructions. Most current parallel computers are of this type. We will now discuss SIMD and MIMD architectures in more detail. \newpage ### 2.3.1 SIMD crumb trail: > parallel > Parallel Computers Architectures > SIMD {r}{2in} WRAPFIGURE 2.7: Architecture of the MasPar 2 array processors. Parallel computers of the SIMD type apply the same operation simultaneously to a number of data items. The design of the CPUs of such a computer can be quite simple, since the arithmetic unit does not need separate logic and instruction decoding units: all CPUs execute the same operation in lock step. This makes SIMD computers excel at operations on arrays, such as  for (i=0; i  and, for this reason, they are also often called \indexterm{array processors}. Scientific codes can often be written so that a large fraction of the time is spent in array operations. On the other hand, there are operations that can not can be executed efficiently on an array processor. For instance, evaluating a number of terms of a recurrence$x_{i+1}=ax_i+b_iinvolves that many additions and multiplications, but they alternate, so only one operation of each type can be processed at any one time. There are no arrays of numbers here that are simultaneously the input of an addition or multiplication. In order to allow for different instruction streams on different parts of the data, the processor would have a mask bit' that could be set to prevent execution of instructions. In code, this typically looks like  where (x>0) { x[i] = sqrt(x[i])  The programming model where identical operations are applied to a number of data items simultaneously, is known as data parallelism . Such array operations can occur in the context of physics simulations, but another important source is graphics applications. For this application, the processors in an array processor can be much weaker than the processor in a PC: often they are in fact bit processors, capable of operating on only a single bit at a time. Along these lines, ICL had the 4096 processor DAP [DAP:79a] in the 1980s, and Goodyear MPP [Batcher:85a] in the 1970s. Later, the Connection Machine (CM-1, CM-2, CM-5) were quite popular. While the first Connection Machine had bit processors (16 to a chip), the later models had traditional processors capable of floating point arithmetic, and were not true SIMD architectures. All were based on a hyper-cube interconnection network; see section 2.7.5 . Another manufacturer that had a commercially successful array processor was MasPar ; figure 2.7 illustrates the architecture. You clearly see the single control unit for a square array of processors, plus a network for doing global operations. Supercomputers based on array processing do not exist anymore, but the notion of SIMD lives on in various guises. For instance, GPUs are SIMD-based, enforced through their CUDA programming language. Also, the Intel Xeon Phi has a strong SIMD component. While early SIMD architectures were motivated by minimizing the number of transistors necessary, these modern co-processors are motivated by power efficiency considerations. Processing instructions (known as instruction issue ) is actually expensive compared to a floating point operation, in time, energy, and chip real estate needed. Using SIMD is then a way to economize on the last two measures. #### 2.3.1.1 Pipelining and pipeline processors A number of computers have been based on a \indexterm{vector processor} or pipeline processor design. The first commercially successful supercomputers, the Cray-1 and the Cyber-205 were of this type. In recent times, the Cray-X1 and the NEC SX series have featured vector pipes. The Earth Simulator' computer [Sato2004] , which led the TOP500 (section 2.11.6 ) for 3 years, was based on NEC SX processors. The general idea behind pipelining was described in section 1.2.1.3 . While supercomputers based on pipeline processors are in a distinct minority, pipelining is now mainstream in the superscalar CPUs that are the basis for clusters . A typical CPU has pipelined floating point units, often with separate units for addition and multiplication; see section 1.2.1.3 . However, there are some important differences between pipelining in a modern superscalar CPU and in, more old-fashioned, vector units. The pipeline units in these vector computers are not integrated floating point units in the CPU, but can better be considered as attached vector units to a CPU that itself has a floating point unit. The vector unit has vector registers (Note: {The Cyber205 was an exception, with direct-to-memory architecture.} ) with a typical length of 64 floating point numbers; there is typically no vector cache'. The logic in vector units is also simpler, often addressable by explicit vector instructions. Superscalar CPUs, on the other hand, are fully integrated in the CPU and geared towards exploiting data streams in unstructured code. #### 2.3.1.2 True SIMD in CPUs and GPUs crumb trail: > parallel > Parallel Computers Architectures > SIMD > True SIMD in CPUs and GPUs True SIMD array processing can be found in modern CPUs and GPUs, in both cases inspired by the parallelism that is needed in graphics applications. Modern CPUs from Intel and AMD, as well as PowerPC chips, have \indextermbusdef{vector}{instructions} that can perform multiple instances of an operation simultaneously. On Intel processors this is known as SSE or AVX . These extensions were originally intended for graphics processing, where often the same operation needs to be performed on a large number of pixels. Often, the data has to be a total of, say, 128 bits, and this can be divided into two 64-bit reals, four 32-bit reals, or a larger number of even smaller chunks such as 4 bits. The AVX instructions are based on up to 512-bit wide SIMD, that is, eight double precision floating point numbers can be processed simultaneously. Just as single floating point operations operate on data in registers (section 1.3.3 ), vector operations use vector registers . The locations in a vector register are sometimes referred to as \indextermbusdef{SIMD}{lanes}. The use of SIMD is mostly motivated by power considerations. Decoding instructions can be more power consuming than executing them, so SIMD parallelism is a way to save power. Current compilers can generate SSE or AVX instructions automatically; sometimes it is also possible for the user to insert pragmas, for instance with the Intel compiler:  void func(float *restrict c, float *restrict a, float *restrict b, int n) { #pragma vector always for (int i=0; i  Use of these extensions often requires data to be aligned with cache line boundaries (section 1.3.4.7 ), so there are special allocate and free calls that return aligned memory. Version 4 of OpenMP also has directives for indicating SIMD parallelism. Array processing on a larger scale can be found in GPU s. A GPU contains a large number of simple processors, ordered in groups of 32, typically. Each processor group is limited to executing the same instruction. Thus, this is true example of SIMD processing. For further discussion, see section 2.9.3 . ### 2.3.2 MIMD / SPMD computers crumb trail: > parallel > Parallel Computers Architectures > MIMD / SPMD computers By far the most common parallel computer architecture these days is called MIMD : the processors execute multiple, possibly differing instructions, each on their own data. Saying that the instructions differ does not mean that the processors actually run different programs: most of these machines operate in SPMD mode, where the programmer starts up the same executable on the parallel processors. Since the different instances of the executable can take differing paths through conditional statements, or execute differing numbers of iterations of loops, they will in general not be completely in sync as they were on SIMD machines. If this lack of synchronization is due to processors working on different amounts of data, it is called load unbalance , and it is a major source of less than perfect speedup ; see section 2.10 . There is a great variety in MIMD computers. Some of the aspects concern the way memory is organized, and the network that connects the processors. Apart from these hardware aspects, there are also differing ways of programming these machines. We will see all these aspects below. Many machines these days are called clusters . They can be built out of custom or commodity processors (if they consist of PCs, running Linux, and connected with Ethernet , they are referred to as \indextermsub{Beowulf} {clusters} [Gropp:BeowulfBook] ); since the processors are independent they are examples of the MIMD or SPMD model. ### 2.3.3 The commoditization of supercomputers In the 1980s and 1990s supercomputers were radically different from personal computer and mini or super-mini computers such as the DEC PDP and VAX series. The SIMD vector computers had one ( CDC Cyber205 or Cray-1 ), or at most a few ( ETA-10 , Cray-2 , Cray X/MP , Cray Y/MP ), extremely powerful processors, often a vector processor. Around the mid-1990s clusters with thousands of simpler (micro) processors started taking over from the machines with relative small numbers of vector pipes (see http://www.top500.org/lists/1994/11 ). At first these microprocessors ( IBM Power series , Intel i860 , MIPS , DEC Alpha ) were still much more powerful than home computer' processors, but later this distinction also faded to an extent. Currently, many of the most powerful clusters are powered by essentially the same Intel Xeon and AMD Opteron chips that are available on the consumer market. Others use IBM Power Series or other server' chips. See section 2.11.6 for illustrations of this history since 1993. ## 2.4 Different types of memory access crumb trail: > parallel > Different types of memory access In the introduction we defined a parallel computer as a setup where multiple processors work together on the same problem. In all but the simplest cases this means that these processors need access to a joint pool of data. In the previous chapter you saw how, even on a single processor, memory can have a hard time keeping up with processor demands. For parallel machines, where potentially several processors want to access the same memory location, this problem becomes even worse. We can characterize parallel machines by the approach they take to the problem of reconciling multiple accesses, by multiple processes, to a joint pool of data. The main distinction here is between distributed memory and shared memory . With distributed memory, each processor has its own physical memory, and more importantly its own address space . \caption{References to identically named variables in the distributed and shared memory case.} Thus, if two processors refer to a variable x , they access a variable in their own local memory. This is an instance of the SPMD model. On the other hand, with shared memory, all processors access the same memory; we also say that they have a shared address space . See figure 2.4 . ### 2.4.1 Symmetric Multi-Processors: Uniform Memory Access Parallel programming is fairly simple if any processor can access any memory location. For this reason, there is a strong incentive for manufacturers to make architectures where processors see no difference between one memory location and another: any memory location is accessible to every processor, and the access times do not differ. This is called UMA , and the programming model for architectures on this principle is often called SMP . There are a few ways to realize an SMP architecture. Current desktop computers can have a few processors accessing a shared memory through a single memory bus; for instance Apple markets a model with 2 six-core processors. Having a memory bus that is shared between processors works only for small numbers of processors; for larger numbers one can use a crossbar that connects multiple processors to multiple memory banks; see section 2.7.6 . 2.21 shows 2.7.6.2 show butterfly exchange , which is built up out of simple On multicore processors there is uniform memory access of a different type: the cores typically have a shared cache , typically the L3 or L2 cache. ### 2.4.2 Non-Uniform Memory Access crumb trail: > parallel > Different types of memory access > Non-Uniform Memory Access The UMA approach based on shared memory is obviously limited to a small number of processors. The crossbar networks are expandable, so they would seem the best choice. However, in practice one puts processors with a local memory in a configuration with an exchange network. This leads to a situation where a processor can access its own memory fast, and other processors' memory slower. This is one case of so-called NUMA : a strategy that uses physically distributed memory, abandoning the uniform access time, but maintaining the logically shared address space: each processor can still access any memory location. #### 2.4.2.1 Affinity When we have NUMA , the question of where to place data, in relation to the process or thread that will access it, becomes important. This is known as affinity ; if we look at it from the point of view of placing the processes or threads, it is called process affinity . QUOTE 2.8: Non-uniform memory access in a four-socket motherboard. Figure 2.8 illustrates NUMA in the case of the four-socket motherboard of the TACC Ranger cluster . Each chip has its own memory (8Gb) but the motherboard acts as if the processors have access to a shared pool of 32Gb. Obviously, accessing the memory of another processor is slower than accessing local memory. In addition, note that each processor has three connections that could be used to access other memory, but the rightmost two chips use one connection to connect to the network. This means that accessing each other's memory can only happen through an intermediate processor, slowing down the transfer, and tying up that processor's connections. #### 2.4.2.2 Coherence While the NUMA approach is convenient for the programmer, it offers some challenges for the system. Imagine that two different processors each have a copy of a memory location in their local (cache) memory. If one processor alters the content of this location, this change has to be propagated to the other processors. If both processors try to alter the content of the one memory location, the behavior of the program can become undetermined. Keeping copies of a memory location synchronized is known as cache coherence (see section 1.4.1 for further details); a multi-processor system using it is sometimes called a cache-coherent NUMA' or ccNUMA architecture. Taking NUMA to its extreme, it is possible to have a software layer that makes network-connected processors appear to operate on shared memory. This is known as distributed shared memory or virtual shared memory . In this approach a hypervisor offers a shared memory API, by translating system calls to distributed memory management. This shared memory API can be utilized by the Linux kernel , which can support 4096 threads. Among current vendors only SGI (the UV line) and Cray (the XE6 ) market products with large scale NUMA. Both offer strong support for PGAS languages; see section 2.6.5 . There are vendors, such as ScaleMP , that offer a software solution to distributed shared memory on regular clusters. ### 2.4.3 Logically and physically distributed memory The most extreme solution to the memory access problem is to offer memory that is not just physically, but that is also logically distributed: the processors have their own address space, and can not directly see another processor's memory. This approach is often called distributed memory', but this term is unfortunate, since we really have to consider the questions separately whether memory is distributed and whether is appears distributed. Note that NUMA also has physically distributed memory; the distributed nature of it is just not apparent to the programmer. With logically and physically distributed memory, the only way one processor can exchange information with another is through passing information explicitly through the network. You will see more about this in section 2.6.3.3 . This type of architecture has the significant advantage that it can scale up to large numbers of processors: the IBM BlueGene has been built with over 200,000 processors. On the other hand, this is also the hardest kind of parallel system to program. Various kinds of hybrids between the above types exist. In fact, most modern clusters will have NUMA nodes, but a distributed memory network between nodes. ## 2.5 Granularity of parallelism crumb trail: > parallel > Granularity of parallelism In this section we look at parallelism from a point of how far to subdivide the work over processing elements. The concept we explore here is that of granularity : the balance between the amount of independent work per processing element, and how often processing elements need to synchronize. We talk of large grain parallelism' if there is a lot of work in between synchronization points, and small grain parallelism' if that amount of work is small. Obviously, in order for small grain parallelism to be profitable, the synchronization needs to be fast; with large grain parallelism we can tolerate most costly synchronization. The discussion in this section will be mostly on a conceptual level; in section~ 2.6 we will go into some detail on actual parallel programming. ### 2.5.1 Data parallelism crumb trail: > parallel > Granularity of parallelism > Data parallelism It is fairly common for a program to have loops with a simple body that gets executed for all elements in a large data set:  for (i=0; i<1000000; i++) a[i] = 2*b[i];  Such code is considered an instance of data parallelism or fine-grained parallelism . If you had as many processors as array elements, this code would look very simple: each processor would execute the statement  a = 2*b  on its local data. If your code consists predominantly of such loops over arrays, it can be executed efficiently with all processors in lockstep. Architectures based on this idea, where the processors can in fact only work in lockstep, have existed, see section~ 2.3.1 . Such fully parallel operations on arrays appear in computer graphics, where every pixel of an image is processed independently. For this reason, GPUs (section~ 2.9.3 ) are strongly based on data parallelism. The CUDA language, invented by NVidia , allows for elegant expression of data parallelism. Later developed languages, such as Sycl , or libraries such as Kokkos , aim at similar expression, but are more geared towards heterogeneous parallelism. Continuing the above example for a little bit, consider the operation \For{0\leq i<\mathrm{max}$}{$i_{\mathrm{left}}=\mod(i-1,\mathrm{max})$\\$i_{\mathrm{right}}=\mod(i+1,\mathrm{max})$\\$a_i = (b_{i_{\mathrm{left}}}+b_{i_{\mathrm{right}}})/2$} On a data parallel machine, that could be implemented as \SetKw{shiftleft}{shiftleft} \SetKw{shiftright}{shiftright}$ bleft \leftarrow \shiftright( b  )$\\$ bright \leftarrow \shiftleft( b  )$\\$ a \leftarrow ( bleft + bright  )/2$where the shiftleft/right instructions cause a data item to be sent to the processor with a number lower or higher by 1. For this second example to be efficient, it is necessary that each processor can communicate quickly with its immediate neighbors, and the first and last processor with each other. In various contexts such a blur' operations in graphics, it makes sense to have operations on 2D data: \For{$0

and consequently processors have be able to move data to neighbors in a 2D grid.

### 2.5.2 Instruction-level parallelism

crumb trail: > parallel > Granularity of parallelism > Instruction-level parallelism

In ILP  , the parallelism is still on the level of individual instructions, but these need not be similar. For instance, in

$a\leftarrow b+c$\\ $d\leftarrow e*f$

the two assignments are independent, and can therefore be executed simultaneously. This kind of parallelism is too cumbersome for humans to identify, but compilers are very good at this. In fact, identifying ILP is crucial for getting good performance out of modern superscalar CPUs.

### 2.5.3 Task-level parallelism

crumb trail: > parallel > Granularity of parallelism > Task-level parallelism

At the other extreme from data and instruction-level parallelism, task parallelism that can be executed in parallel. As an example, a search in a tree data structure could be implemented as follows:

{SearchInTree}{root} \SetKw{optimal}{optimal}\SetKw{exit}{exit}\SetKw{search}{SearchInTree}\SetKw{parl}{parallel} \eIf{\optimal(root)}{\exit} {\parl: \search(leftchild),\search(rightchild)}

The search tasks in this example are not synchronized, and the number of tasks is not fixed: it can grow arbitrarily. In practice, having too many tasks is not a good idea, since processors are most efficient if they work on just a single task. Tasks can then be scheduled as follows:

\While{there are tasks left}{ wait until a processor becomes inactive;\\ spawn a new task on it}

(There is a subtle distinction between the two previous pseudo-codes. In the first, tasks were self-scheduling: each task spawned off two new ones. The second code is an example of the manager-worker paradigm : there is one central task which lives for the duration of the code, and which spawns and assigns the worker tasks.)

Unlike in the data parallel example above, the assignment of data to processor is not determined in advance in such a scheme. Therefore, this mode of parallelism is most suited for thread-programming, for instance through the OpenMP library; section  2.6.2  .

Let us consider a more serious example of task-level parallelism.

A finite element mesh is, in the simplest case, a collection of triangles that covers a 2D object. Since angles that are too acute should be avoided, the Delauney mesh refinement process can take certain triangles, and replace them by better shaped ones. This is illustrated in figure  2.9 : the black triangles violate some angle condition, so either they themselves get subdivided, or they are joined with some neighboring ones (rendered in grey) and then jointly redivided.

FIGURE 2.9: A mesh before and after refinement.

In pseudo-code, this can be implemented as in figure  2.10  .

Mesh m = /* read in initial mesh */ \\ WorkList wl; \\ wl.add(mesh.badTriangles()); \\ \While { (wl.size() != 0) } { Element e = wl.get(); //get bad triangle \\ if (e no longer in mesh) continue; \\ Cavity c = new Cavity(e); \\ c.expand(); \\ c.retriangulate(); \\ mesh.update(c); \\ wl.add(c.badTriangles()); \\ }

DISPLAYALGORITHM 2.10: Task queue implementation of Delauney refinement.

(This figure and code are to be found in  [Kulkami:howmuch]  , which also contains a more detailed discussion.)

It is clear that this algorithm is driven by a worklist (or task queue  ) data structure that has to be shared between all processes. Together with the dynamic assignment of data to processes, this implies that this type of irregular parallelism is suited to shared memory programming, and is much harder to do with distributed memory.

### 2.5.4 Conveniently parallel computing

crumb trail: > parallel > Granularity of parallelism > Conveniently parallel computing

In certain contexts, a simple, often single processor, calculation needs to be performed on many different inputs. Since the computations have no data dependencies and need not be done in any particular sequence, this is often called embarrassingly parallel or \indexterm{conveniently parallel} computing. This sort of parallelism can happen at several levels. In examples such as calculation of the Mandelbrot set or evaluating moves in a chess game, a subroutine-level computation is invoked for many parameter values. On a coarser level it can be the case that a simple program needs to be run for many inputs. In this case, the overall calculation is referred to as a parameter sweep  .

### 2.5.5 Medium-grain data parallelism

crumb trail: > parallel > Granularity of parallelism > Medium-grain data parallelism

The above strict realization of data parallelism assumes that there are as many processors as data elements. In practice, processors will have much more memory than that, and the number of data elements is likely to be far larger than the processor count of even the largest computers. Therefore, arrays are grouped onto processors in subarrays. The code then looks like this:


my_lower_bound = // some processor-dependent number
my_upper_bound = // some processor-dependent number
for (i=my_lower_bound; i


This model has some characteristics of data parallelism, since the operation performed is identical on a large number of data items. It can also be viewed as task parallelism, since each processor executes a larger section of code, and does not necessarily operate on equal sized chunks of data.

### 2.5.6 Task granularity

crumb trail: > parallel > Granularity of parallelism > Task granularity

In the previous subsections we considered different level of finding parallel work, or different ways of dividing up work so as to find parallelism. There is another way of looking at this: we define the granularity of a parallel scheme as the amount of work (or the task size) that a processing element can perform before having to communicate or synchronize with other processing elements.

In ILP we are dealing with very fine-grained parallelism, on the order of a single instruction or just a few instructions. In true task parallelism the granularity is much coarser.

The interesting case here is data parallelism, where we have the freedom to choose the task sizes. On SIMD machines we can choose a granularity of a single instruction, but, as you saw in section  2.5.5  , operations can be grouped into medium-sized tasks. Thus, operations that are data parallel can be executed on distributed memory clusters, given the right balance between the number of processors and total problem size.

Exercise Discuss choosing the right granularity for a data parallel operation such as averaging on a two-dimensional grid. Show that there is a

surface-to-volume effect: the amount of communication is of a lower order than the computation. This means that, even if communication is much slower than computation, increasing the task size will still give a balanced execution.
End of exercise

Unfortunately, choosing a large task size to overcome slow communication may aggravate another problem: aggregating these operations may give tasks with varying running time, causing load imbalance  . One solution here is to use an overdecomposition of the problem: create more tasks then there are processing elements, and assign multiple tasks to a processor (or assign tasks dynamically) to even out irregular running times. This is known as dynamic scheduling  , and the examples in section  2.5.3 illustrate this; see also section  2.6.2.1  . An example of overdecomposition in linear algebra is discussed in section  6.3.2  .

## 2.6 Parallel programming

crumb trail: > parallel > Parallel programming

Parallel programming is more complicated than sequential programming. While for sequential programming most programming languages operate on similar principles (some exceptions such as functional or logic languages aside), there is a variety of ways of tackling parallelism. Let's explore some of the concepts and practical aspects. There are various approaches to parallel programming. First of all, there does not seem to be any hope of a parallelizing compiler that can automagically transform a sequential program into a parallel one. Apart from the problem of figuring out which operations are independent, the main problem is that the problem of locating data in a parallel context is very hard. A~compiler would need to consider the whole code, rather than a subroutine at a time. Even then, results have been disappointing. More productive is the approach where the user writes mostly a sequential program, but gives some indications about what computations can be parallelized, and how data should be distributed. Indicating parallelism of operations explicitly is done in OpenMP (section~ 2.6.2  ); indicating the data distribution and leaving parallelism to the compiler and runtime is the basis for PGAS languages (section~ 2.6.5  ). Such approaches work best with shared memory. By far the hardest way to program in parallel, but with the best results in practice, is to expose the parallelism to the programmer and let the programmer manage everything explicitly. This approach is necessary in the case of distributed memory programming. We will have a general discussion of distributed programming in section~ 2.6.3.1 ; section~ 2.6.3.3 will discuss the MPI library.

### 2.6.1 Thread parallelism

crumb trail: > parallel > Parallel programming > Thread parallelism

As a preliminary to OpenMP (section~ 2.6.2  ), we will briefly go into threads'. To explain what a thread is, we first need to get technical about what a process is. A~unix process corresponds to the execution of a single program. Thus, it has in memory:
• The program code, in the form of machine language instructions;
• A heap  , containing for instance arrays that were created with malloc ;
• A stack with quick-changing information, such as the program counter that indicates what instruction is currently being executed, and data items with local scope, as well as intermediate results from computations.
This process can have multiple threads; these are similar in that they see the same program code and heap, but they have their own stack. Thus, a thread is an independent strand' of execution through a process. Processes can belong to different users, or be different programs that a single user is running concurrently, so they have their own data space. On the other hand, threads are part of one process and therefore share the process heap. Threads can have some private data, for instance by have their own data stack, but their main characteristic is that they can collaborate on the same data.

#### 2.6.1.1 The fork-join mechanism

crumb trail: > parallel > Parallel programming > Thread parallelism > The fork-join mechanism

Threads are dynamic, in the sense that they can be created during program execution. (This is different from the MPI model, where every processor run one process, and they are all created and destroyed at the same time.) When a program starts, there is one thread active: the main thread  . Other threads are created by \indextermbusdef{thread}{spawning}, and the main thread can wait for their completion.

FIGURE 2.11: Thread creation and deletion during parallel execution.

This is known as the fork-join model; it is illustrated in figure  2.11  . A group of threads that is forked from the same thread and active simultaneously is known as a thread team  .

#### 2.6.1.2 Hardware support for threads

Threads as they were described above are a software construct. Threading was possible before parallel computers existed; they were for instance used to handle independent activities in an OS  . In the absence of parallel hardware, the OS would handle the threads through multitasking or \indextermdef{time slicing}: each thread would regularly get to use the CPU for a fraction of a second. (Technically, the Linux kernel treads processes and threads though the task concept; tasks are kept in a list, and are regularly activated or de-activated.)

This can lead to higher processor utilization, since the instructions of one thread can be processed while another thread is waiting for data. (On traditional CPUs, switching between threads is somewhat expensive (an exception is the hyperthreading mechanism) but on GPUs it is not, and in fact they need many threads to attain high performance.)

On modern multicore processors there is an obvious way of supporting threads: having one thread per core gives a parallel execution that uses your hardware efficiently. The shared memory allows the threads to all see the same data. This can also lead to problems; see section  2.6.1.5  .

#### 2.6.1.3 Threads example

crumb trail: > parallel > Parallel programming > Thread parallelism > Threads example

The following example, which is strictly Unix-centric and will not work on Windows, is a clear illustration of the fork-join model. It uses the pthreads library to spawn a number of tasks that all update a global counter. Since threads share the same memory space, they indeed see and update the same memory location.


#include
#include

int sum=0;

sum = sum+1;
return;
}

int main() {
int i;
printf("forking\n");
for (i=0; i

return 0;
}



The fact that this code gives the right result is a coincidence: it only happens because updating the variable is so much quicker than creating the thread. (On a multicore processor the chance of errors will greatly increase.) If we artificially increase the time for the update, we will no longer get the right result:


int t = sum; sleep(1); sum = t+1;
return;
}



Now all threads read out the value of sum  , wait a while (presumably calculating something) and then update.

This can be fixed by having a lock on the code region that should be mutually exclusive':



int t;
t = sum; sleep(1); sum = t+1;
return;
}

int main() {
....



The lock and unlock commands guarantee that no two threads can interfere with each other's update.

#### 2.6.1.4 Contexts

crumb trail: > parallel > Parallel programming > Thread parallelism > Contexts

In the above example and its version with the sleep command we glanced over the fact that there were two types of data involved. First of all, the variable  s was created outside the thread spawning part. Thus, this variable was shared  .

On the other hand, the variable  t was created once in each spawned thread. We call this private data.

The totality of all data that a thread can access is called its context  . It contains private and shared data, as well as temporary results of computations that the thread is working on. (It also contains the program counter and stack pointer. If you don't know what those are, don't worry.)

It is quite possible to create more threads than a processor has cores, so a processor may need to switch between the execution of different threads. This is known as a \indextermbusdef{context}{switch}.

Context switches are not for free on regular CPUs, so they only pay off if the granularity of the threaded work is high enough. The exceptions to this story are:

• CPUs that have hardware support for multiple threads, for instance through hyperthreading (section  2.6.1.9  ), or as in the

Intel Xeon Phi (section  2.9  );

• GPUs  , which in fact rely on fast context switching (section  2.9.3  );

• certain other exotic' architectures such as the Cray XMT (section  2.8  ).

#### 2.6.1.5 Race conditions, thread safety, and atomic operations

Shared memory makes life easy for the programmer, since every processor has access to all of the data: no explicit data traffic between the processor is needed. On the other hand, multiple processes/processors can also write to the same variable, which is a source of potential problems.

Suppose that two processes both try to increment an integer variable  I :

Init:
I=0

process 1:
I=I+2

process 2:
I=I+3

This is a legitimate activity if the variable is an accumulator for values computed by independent processes. The result of these two updates depends on the sequence in which the processors read and write the variable.

 \toprule scenario 1. scenario 1. scenario 3.\ \midrule \multicolumn{6}{c}{$I =0$}\ \midrule read $I =0$ read $I =0$ read $I =0$ read $I =0$ read $I =0$ compute $I =2$ compute $I =3$ compute $I =2$ compute $I =3$ compute $I =2$ write $I =2$ write $I =3$ write $I =2$ write $I =3$ write $I =2$ read $I =2$ compute $I =5$ write $I =5$ \midrule \multicolumn{2}{c}{$I =3$} \multicolumn{2}{c}{$I =2$} \multicolumn{2}{c}{$I =5$} \ \bottomrule

TABULAR 2.12: Three executions of a data race scenario.

Figure  2.12 illustrates three scenarios. Such a scenario, where the final result depends on which thread executes first, is known as a race condition or data race A formal definition would be:

We talk of a a data race if there are two statements $S_1,S_2$,

• that are not causally related;

• that both access a location $L$; and

• at least one access is a write.

A very practical example of such conflicting updates is the inner product calculation:


for (i=0; i<1000; i++)
sum = sum+a[i]*b[i];



Here the products are truly independent, so we could choose to have the loop iterations do them in parallel, for instance by their own threads. However, all threads need to update the same variable  sum  . This particular case of a data conflict is called reduction  , and it is common enough that many threading systems have a dedicated mechanism for it.

Code that behaves the same whether it's executed sequentially or threaded is called \indextermbusdef{thread}{safe}. As you can see from the above examples, a lack of thread safety is typically due to the treatment of shared data. This implies that the more your program uses local data, the higher the chance that it is thread safe. Unfortunately, sometimes the threads need to write to shared/global data.

There are essentially two ways of solving this problem. One is that we declare such updates of a shared variable a critical section of code. This means that the instructions in the critical section (in the inner product example read sum

from memory, update it, write back to memory') can be executed by only one thread at a time. In particular, they need to be executed entirely by one thread before any other thread can start them so the ambiguity problem above will not arise. Of course, the above code fragment is so common that systems like OpenMP (section  2.6.2  ) have a dedicated mechanism for it, by declaring it a reduction

operation.

Critical sections can for instance be implemented through the semaphore mechanism  [Dijkstra:semaphores]  . Surrounding each critical section there will be two atomic operations controlling a semaphore, a sign post. The first process to encounter the semaphore will lower it, and start executing the critical section. Other processes see the lowered semaphore, and wait. When the first process finishes the critical section, it executes the second instruction which raises the semaphore, allowing one of the waiting processes to enter the critical section.

The other way to resolve common access to shared data is to set a temporary lock on certain memory areas. This solution may be preferable, if common execution of the critical section is likely, for instance if it implements writing to a database or hash table. In this case, one process entering a critical section would prevent any other process from writing to the data, even if they might be writing to different locations; locking the specific data item being accessed is then a better solution.

The problem with locks is that they typically exist on the operating system level. This means that they are relatively slow. Since we hope that iterations of the inner product loop above would be executed at the speed of the floating point unit, or at least that of the memory bus, this is unacceptable.

One implementation of this is transactional memory  , where the hardware itself supports atomic operations; the term derives from database transactions, which have a similar integrity problem. In transactional memory, a process will perform a normal memory update, unless the processor detects a conflict with an update from another process. In that case, the updates (transactions') are canceled and retried with one processor locking the memory and the other waiting for the lock. This is an elegant solution; however, canceling the transaction may carry a certain cost of pipeline flushing (section  1.2.5  ) and cache line invalidation (section  1.4.1  ).

#### 2.6.1.6 Memory models and sequential consistency

The above signaled phenomenon of a race condition means that the result of some programs can be non-deterministic, depending on the sequence in which instructions are executed. There is a further factor that comes into play, and which is called the memory model that a processor and/or a language uses  [AdveBoehm:memorymodels]  . The memory model controls how the activity of one thread or core is seen by other threads or cores.

As an example, consider

initially:
A=B=0;
, then
process 1:
A=1; x = B;

process 2:
B=1; y = A;

As above, we have three scenarios, which we describe by giving a global sequence of statements:

 \toprule scenario 1. scenario 2. scenario 3. \midrule $A \leftarrow 1$ $A \leftarrow 1$ $B \leftarrow 1$ $x \leftarrow B$ $B \leftarrow 1$ $y \leftarrow A$ $B \leftarrow 1$ $x \leftarrow B$ $A \leftarrow 1$ $y \leftarrow A$ $y \leftarrow A$ $x \leftarrow B$ \midrule $x=0, y=1$ $x=1,y=1$ $x=1,y=0$ \bottomrule

(In the second scenario, statements 1,2 can be reversed, as can 3,4, without change in outcome.)

The three different outcomes can be characterized as being computed by a global ordering on the statements that respects the local orderings. This is known as \indexterm{sequential consistency}: the parallel outcome is consistent with a sequential execution that interleaves the parallel computations, respecting their local statement orderings.

Maintaining sequential consistency is expensive: it means that any change to a variable immediately needs to be visible on all other threads, or that any access to a variable on a thread needs to consult all other threads. We discussed this in section  1.4.1  .

In a relaxed memory model it is possible to get a result that is not sequentially consistent. Suppose, in the above example, that the compiler decides to reorder the statements for the two processes, since the read and write are independent. In effect we get a fourth scenario:

 \toprule scenario 4. \midrule $x \leftarrow B$ $y \leftarrow A$ $A \leftarrow 1$ $B \leftarrow 1$ \midrule $x=0, y=0$ \bottomrule

leading to the result $x=0,y=0$, which was not possible under the sequentially consistent model above. (There are algorithms for finding such dependencies  [KrishnaYelick:cycledetect]  .)

Sequential consistency implies that


integer n
n = 0
!$omp parallel shared(n) n = n + 1 !$omp end parallel



should have the same effect as


n = 0
n = n+1 ! for processor 0
n = n+1 ! for processor 1
! et cetera



With sequential consistency it is no longer necessary to declare atomic operations or critical sections; however, this puts strong demands on the implementation of the model, so it may lead to inefficient code.

#### 2.6.1.7 Affinity

crumb trail: > parallel > Parallel programming > Thread parallelism > Affinity

Thread programming is very flexible, effectively creating parallelism as needed. However, a large part of this book is about the importance of data movement in scientific computations, and that aspect can not be ignored in thread programming.

In the context of a multicore processor, any thread can be scheduled to any core, and there is no immediate problem with this. However, if you care about high performance, this flexibility can have unexpected costs. There are various reasons why you want to certain threads to run only on certain cores. Since the OS is allowed to migrate threads threads to stay in place.

• If a thread migrates to a different core, and that core has its own cache, you lose the contents of the original cache, and unnecessary memory transfers will occur.

• If a thread migrates, there is nothing to prevent the OS from putting two threads on one core, and leaving another core completely unused. This obviously leads to less than perfect speedup, even if the number of threads equals the number of cores.

We call affinity the mapping between threads ( thread affinity  ) affinity}) and cores. Affinity is usually expressed as a mask : a description of the locations where a thread is allowed to run.

As an example, consider a two-socket node, where each socket has four cores.

With two threads and socket affinity we have the following affinity mask:

 \toprule thread socket 0 socket 1 \midrule 0 0-1-2-3 1 4-5-6-7 \bottomrule

With core affinity the mask depends on the affinity type. The typical strategies are close' and spread'. With close affinity  , the mask could be:

 \toprule thread socket 0 socket 1 \midrule 0 0 1 \hphantom{0-}1 \bottomrule

Having two threads on the same socket means that they probably share an L2 cache, so this strategy is appropriate if they share data.

On the other hand, with spread affinity the threads are placed further apart:

 \toprule thread socket 0 socket 1 \midrule 0 0 1 4 \bottomrule

This strategy is better for bandwidth-bound applications, since now each thread has the bandwidth of a socket, rather than having to share it in the close' case.

If you assign all cores, the close and spread strategies lead to different arrangements:

Close

 \toprule socket 0 socket 1 \midrule 0-1-2-3 4-5-6-7 \bottomrule

 \toprule socket 0 socket 1 \midrule 0-2-4-6 1-3-5-7 \bottomrule

Affinity and data access patterns

Affinity can also be considered as a strategy of binding execution to data.

Consider this code:


for (i=0; i


The first loop, by accessing elements of $x$, bring memory into cache or page table. The second loop accesses elements in the same order, so having a fixed affinity is the right decision for performance.

In other cases a fixed mapping is not the right solution:


for (i=0; i


In this second example, either the program has to be transformed, or the programmer has to maintain in effect a task queue  .

First touch

It is natural to think of affinity in terms of put the execution where the data is'. However, in practice the opposite view sometimes makes sense. For instance, figure  2.8 showed how the shared memory of a cluster node can actually be distributed. Thus, a thread can be attached to a socket, but data can be allocated by the OS on any of the sockets. The mechanism that is often used by the OS is called the first-touch policy:

• When the program allocates data, the OS does not actually create it;

• instead, the memory area for the data is created the first time a thread accesses it;

• thus, the first thread to touch the area in effect causes the data to be allocated on the memory of its socket.

Exercise Explain the problem with the following code:


// serial initialization
for (i=0; i


End of exercise

For an in-depth discussion of memory policies, see  [Lameter:NUMAq]  .

#### 2.6.1.8 Cilk Plus

crumb trail: > parallel > Parallel programming > Thread parallelism > Cilk Plus

Other programming models based on threads exist. For instance, Intel Cilk Plus ( http://www.cilkplus.org/  ) is a set of extensions of C/C++ with which a programmer can create threads.

\tt

Sequential code:

int fib(int n)\{
if (n<2) return 1;
else \{
int rst=0;
rst += fib(n-1);
rst += fib(n-2);
return rst;
\}
\}

Cilk code:

cilk int fib(int n)\{
if (n<2) return 1;
else \{
int rst=0;
rst += cilk\_spawn fib(n-1);
rst += cilk\_spawn fib(n-2);
cilk\_sync;
return rst;
\}
\}

MULTICOLS 2.13: Sequential and Cilk code for Fibonacci numbers.

Figure  2.13 shows the Fibonacci numbers calculation, sequentially and threaded with Cilk. In this example, the variable rst is updated by two, potentially independent threads. The semantics of this update, that is, the precise definition of how conflicts such as simultaneous writes are resolved, is defined by sequential consistency ; see section  2.6.1.6  .

In the above examples you saw that the threads that are spawned during one program run essentially execute the same code, and have access to the same data. Thus, at a hardware level, a thread is uniquely determined by a small number of local variables, such as its location in the code (the program counter  ) and intermediate results of the current computation it is engaged in.

Hyperthreading is an Intel technology to let multiple threads use the processor truly simultaneously, so that part of the processor would be optimally used. 2.6.1.9  .

If a processor switches between executing one thread and another, it saves this local information of the one thread, and loads the information of the other. The cost of doing this is modest compared to running a whole program, but can be expensive compared to the cost of a single instruction. Thus, hyperthreading may not always give a performance improvement.

Certain architectures have support for multi-threading  . This means that the hardware actually has explicit storage for the local information of multiple threads, and switching between the threads can be very fast. This is the case on GPUs (section  2.9.3  ), and on the Intel Xeon Phi architecture, where each core can support up to four threads.

However, the hyperthreads share the functional units of the core, that is, the arithmetic processing. Therefore, multiple active threads will not give a proportional increase in performance for computation-dominated codes. Most gain is expected in the case where the threads are heterogeneous in character.

### 2.6.2 OpenMP

crumb trail: > parallel > Parallel programming > OpenMP

OpenMP is an extension to the programming languages C and Fortran. Its main approach to parallelism is the parallel execution of loops: based on compiler directives  , a preprocessor can schedule the parallel execution of the loop iterations.

Since OpenMP is based on threads  , it features dynamic parallelism : the number of execution streams operating in parallel can vary from one part of the code to another. Parallelism is declared by creating parallel regions, for instance indicating that all iterations of a loop nest are independent, and the runtime system will then use whatever resources are available.

OpenMP is not a language, but an extension to the existing C and Fortran languages. It mostly operates by inserting directives into source code, which are interpreted by the compiler. It also has a modest number of library calls, but these are not the main point, unlike in MPI (section  2.6.3.3  ). Finally, there is a runtime system that manages the parallel execution.

OpenMP has an important advantage over MPI in its programmability: it is possible to start with a sequential code and transform it by incremental parallelization  . By contrast, turning a sequential code into a distributed memory MPI program is an all-or-nothing affair.

Many compilers, such as gcc or the Intel compiler, support the OpenMP extensions. In Fortran, OpenMP directives are placed in comment statements; in C, they are placed in #pragma CPP directives, which indicate compiler specific extensions. As a result, OpenMP code still looks like legal C or Fortran to a compiler that does not support OpenMP. Programs need to be linked to an OpenMP runtime library, and their behavior can be controlled through environment variables.

For more information about OpenMP, see  [Chapman2008:OpenMPbook] and http://openmp.org/wp/  .

#### 2.6.2.1 OpenMP examples

crumb trail: > parallel > Parallel programming > OpenMP > OpenMP examples

The simplest example of OpenMP use is the parallel loop. \lstset{language=C}


#pragma omp parallel for
for (i=0; i


Clearly, all iterations can be executed independently and in any order. The pragma CPP directive then conveys this fact to the compiler.

Some loops are fully parallel conceptually, but not in implementation: \lstset{language=C}


for (i=0; i


Here it looks as if each iteration writes to, and reads from, a shared variable  t  . However, t  is really a temporary variable, local to each iteration. Code that should be parallelizable, but is not due to such constructs, is called not thread safe  .

OpenMP indicates that the temporary is private to each iteration as follows: \lstset{language=C}


#pragma omp parallel for shared(a,b), private(t)
for (i=0; i


If a scalar is indeed shared, OpenMP has various mechanisms for dealing with that. For instance, shared variables commonly occur in reduction operations : \lstset{language=C}


sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i=0; i


As you see, a sequential code can be parallelized with minimal effort.

The assignment of iterations to threads is done by the runtime system, but the user can guide this assignment. We are mostly concerned with the case where there are more iterations than threads: if there are $P$ threads and $N$ iterations and $N>P$, how is iteration $i$ going to be assigned to a thread?

The simplest assignment uses round-robin task scheduling  , a static scheduling strategy where thread $p$ gets iterations $p\times(N/P),\ldots,(p+1)\times (N/P)-1$. This has the advantage that if some data is reused between iterations, it will stay in the data cache of the processor executing that thread. On the other hand, if the iterations differ in the amount of work involved, the process may suffer from load unbalance with static scheduling. In that case, a dynamic scheduling strategy would work better, where each thread starts work on the next unprocessed iteration as soon as it finishes its current iteration. See the example in section  2.10.2  .

You can control OpenMP scheduling of loop iterations with the schedule

keyword; its values include static and dynamic  . It is also possible to indicate a chunksize  , which controls the size of the block of iterations that gets assigned together to a thread. If you omit the chunksize, OpenMP will divide the iterations into as many blocks as there are threads.

Exercise Let's say there are $t$ threads, and your code looks like \lstset{language=C}


for (i=0; i


If you specify a chunksize of 1, iterations $0,t,2t,\ldots$ go to the first thread, $1,1+t,1+2t,\ldots$ to the second, et cetera. Discuss why this is a bad strategy from a performance point of view. Hint: look up the definition of false sharing  . What would be a good chunksize?
End of exercise

### 2.6.3 Distributed memory programming through message passing

While OpenMP programs, and programs written using other shared memory paradigms, still look very much like sequential programs, this does not hold true for message passing code. Before we discuss the MPI library in some detail, we will take a look at this shift the way parallel code is written.

#### 2.6.3.1 The global versus the local view in distributed programming

There can be a marked difference between how a parallel algorithm looks to an observer, and how it is actually programmed. Consider the case where we have an array of processors~$\{P_i\}_{i=0..p-1}$, each containing one element of the arrays $x$~and~$y$, and $P_i$~computes $$\begin{cases} y_i\leftarrow y_i+x_{i-1}&i>0\\ \mbox{y_i unchanged}&i=0 \end{cases} \label{eq:mpi-send-left}$$

The global description of this could be

• Every processor $P_i$ except the last sends its $x$ element to $P_{i+1}$;

• every processor $P_i$ except the first receive an $x$ element from their neighbor $P_{i-1}$, and

• they add it to their $y$ element.

However, in general we can not code in these global terms. In the SPMD model (section  2.3.2  ) each processor executes the same code, and the overall algorithm is the result of these individual behaviors. The local program has access only to local data -- everything else needs to be communicated with send and receive operations -- and the processor knows its own number.

One possible way of writing this would be

• If I am processor 0 do nothing. Otherwise receive a $y$ element from the left, add it to my $x$ element.

• If I am the last processor do nothing. Otherwise send my $y$ element to the right.

At first we look at the case where sends and receives are so-called blocking communication instructions: a send instruction does not finish until the sent item is actually received, and a receive instruction waits for the corresponding send. This means that sends and receives between processors have to be carefully paired. We will now see that this can lead to various problems on the way to an efficient code.

The above solution is illustrated in figure  2.6.3.1  , where we show

\caption{Local and resulting global view of an algorithm for sending data to the right.}

the local timelines depicting the local processor code, and the resulting global behavior. You see that the processors are not working at the same time: we get serialized execution  .

What if we reverse the send and receive operations?

• If I am not the last processor, send my $x$ element to the right;

• If I am not the first processor, receive an $x$ element from the left and add it to my $y$ element.

This is illustrated in figure  2.6.3.1 and you see that

\caption{Local and resulting global view of an algorithm for sending data to the right.}

again we get a serialized execution, except that now the processors are activated right to left.

If the algorithm in equation  2.6.3.1 had been cyclic:

$$\begin{cases} y_i\leftarrow y_i+x_{i-1}&i=1… n-1\\ y_0\leftarrow y_0+x_{n-1}&i=0 \end{cases} \label{eq:cyclic-add}$$

the problem would be even worse. Now the last processor can not start its receive since it is blocked sending $x_{n-1}$ to processor 0. This situation, where the program can not progress because every processor is waiting for another, is called deadlock  .

The solution to getting an efficient code is to make as much of the communication happen simultaneously as possible. After all, there are no serial dependencies in the algorithm. Thus we program the algorithm as follows:

• If I am an odd numbered processor, I send first, then receive;

• If I am an even numbered processor, I receive first, then send.

This is illustrated in figure  2.6.3.1  , and we see that

\caption{Local and resulting global view of an algorithm for sending data to the right.}

the execution is now parallel.

Exercise Take another look at figure  2.3 of a parallel reduction. The basic actions are:

• receive data from a neighbor

• add it to your own data

• send the result on.

As you see in the diagram, there is at least one processor who does not send data on, and others may do a variable number of receives before they send their result on.

Write node code so that an SPMD program realizes the distributed reduction. Hint: write each processor number in binary. The algorithm uses a number of steps that is equal to the length of this bitstring.

• Assuming that a processor receives a message, express the distance to the origin of that message in the step number.

• Every processor sends at most one message. Express the step where this happens in terms of the binary processor number.

End of exercise

#### 2.6.3.2 Blocking and non-blocking communication

The reason for blocking instructions is to prevent accumulation of data in the network. If a send instruction were to complete before the corresponding receive started, the network would have to store the data somewhere in the mean time. Consider a simple example:


buffer = ... ;  // generate some data
send(buffer,0); // send to processor 0
buffer = ... ;  // generate more data
send(buffer,1); // send to processor 1



After the first send, we start overwriting the buffer. If the data in it hasn't been received, the first set of values would have to be buffered somewhere in the network, which is not realistic. By having the send operation block, the data stays in the sender's buffer until it is guaranteed to have been copied to the recipient's buffer.

One way out of the problem of sequentialization or deadlock that arises from blocking instruction is the use of non-blocking communication instructions, which include explicit buffers for the data. With non-blocking send instruction, the user needs to allocate a buffer for each send, and check when it is safe to overwrite the buffer.


buffer0 = ... ;   // data for processor 0
send(buffer0,0);  // send to processor 0
buffer1 = ... ;   // data for processor 1
send(buffer1,1);  // send to processor 1
...
// wait for completion of all send operations.



#### 2.6.3.3 The MPI library

If OpenMP is the way to program shared memory, MPI ~ [mpi-reference] is the standard solution for programming distributed memory. MPI (Message Passing Interface') is a specification for a library interface for moving data between processes that do not otherwise share data. The MPI routines can be divided roughly in the following categories:
• Process management. This includes querying the parallel environment and constructing subsets of processors.
• Point-to-point communication \index{point-to-point communication}. This is a set of calls where two processes interact. These are mostly variants of the send and receive calls.
• Collective calls. In these routines, all processors (or the whole of a specified subset) are involved. Examples are the broadcast call, where one processor shares its data with every other processor, or the gather call, where one processor collects data from all participating processors.
Let us consider how the OpenMP examples can be coded in MPI (Note: {This is not a course in MPI programming, and consequently the examples will leave out many details of the MPI calls. If you want to learn MPI programming, consult for instance~ [Gropp:UsingMPI1,Gropp:UsingMPI2,Gropp:UsingAdvancedMPI]  .}  ) . First of all, we no longer allocate

double a[ProblemSize];


but

double a[LocalProblemSize];


where the local size is roughly a $1/P$ fraction of the global size. (Practical considerations dictate whether you want this distribution to be as evenly as possible, or rather biased in some way.) The parallel loop is trivially parallel, with the only difference that it now operates on a fraction of the arrays:

for (i=0; i

However, if the loop involves a calculation based on the iteration number, we need to map that to the global value:

for (i=0; i

(We will assume that each process has somehow calculated the values of LocalProblemSize and MyFirstVariable  .) Local variables are now automatically local, because each process has its own instance:

for (i=0; i

However, shared variables are harder to implement. Since each process has its own data, the local accumulation has to be explicitly assembled:

for (i=0; i

The reduce' operation sums together all local values~ s into a variable globals that receives an identical value on each processor. This is known as a collective operation  . Let us make the example slightly more complicated:

for (i=0; i

If we had shared memory, we could write the following parallel code:

for (i=0; i

To turn this into valid distributed memory code, first we account for the fact that bleft and bright need to be obtained from a different processor for i==0 ( bleft  ), and for i==LocalProblemSize-1 ( bright  ). We do this with a exchange operation with our left and right neighbor processor:

// get bfromleft and bfromright from neighbor processors, then
for (i=0; i

Obtaining the neighbor values is done as follows. First we need to ask our processor number, so that we can start a communication with the processor with a number one higher and lower.

MPI_Sendrecv
(/* to be sent:  */ &b[LocalProblemSize-1],
/* destination  */ myTaskID+1,
/* to be recvd: */ &bfromleft,
/* source:      */ myTaskID-1,
/* some parameters omitted */
);
&bfromright, /* ... */ );


There are still two problems with this code. First, the sendrecv operations need exceptions for the first and last processors. This can be done elegantly as follows:

if (myTaskID==0) leftproc = MPI_PROC_NULL;
else leftproc = myTaskID-1;
else rightproc = myTaskID+1;
MPI_Sendrecv( &b[LocalProblemSize-1], &bfromleft,  rightproc );
MPI_Sendrecv( &b[0],                  &bfromright, leftproc);


Exercise There is still a problem left with this code: the boundary conditions from the original, global, version have not been taken into account. Give code that solves that problem.
End of exercise MPI gets complicated if different processes need to take different actions, for example, if one needs to send data to another. The problem here is that each process executes the same executable, so it needs to contain both the send and the receive instruction, to be executed depending on what the rank of the process is.

MPI_Send(myInfo,1,MPI_INT,/* to: */ 1,/* labeled: */,0,
MPI_COMM_WORLD);
} else {
MPI_Recv(myInfo,1,MPI_INT,/* from: */ 0,/* labeled: */,0,
/* not explained here: */&status,MPI_COMM_WORLD);
}



#### 2.6.3.4 Blocking

Although MPI is sometimes called the assembly language of parallel programming', for its perceived difficulty and level of explicitness, it is not all that hard to learn, as evinced by the large number of scientific codes that use it. The main issues that make MPI somewhat intricate to use are buffer management and blocking semantics. These issues are related, and stem from the fact that, ideally, data should not be in two places at the same time. Let us briefly consider what happens if processor~1 sends data to processor~2. The safest strategy is for processor~1 to execute the send instruction, and then wait until processor~2 acknowledges that the data was successfully received. This means that processor~1 is temporarily blocked until processor~2 actually executes its receive instruction, and the data has made its way through the network. This is the standard behavior of the MPI_Send and MPI_Recv calls, which are said to use blocking communication  . Alternatively, processor~1 could put its data in a buffer, tell the system to make sure that it gets sent at some point, and later checks to see that the buffer is safe to reuse. This second strategy is called non-blocking communication  , and it requires the use of a temporary buffer.

#### 2.6.3.5 Collective operations

In the above examples, you saw the MPI_Allreduce call, which computed a global sum and left the result on each processor. There is also a local version MPI_Reduce which computes the result only on operations} or collectives. The collectives are:
• [reduction]: each processor has a data item, and these items need to be combined arithmetically with an addition, multiplication, max, or min operation. The result can be left on one processor, or on all, in which case we call this an {\bf allreduce} operation.
• [broadcast]: one processor has a data item that all processors need to receive.
• [gather]: each processor has a data item, and these items need to be collected in an array, without combining them in an operations such as an addition. The result can be left on one processor, or on all, in which case we call this an {\bf allgather}.
• [scatter]: one processor has an array of data items, and each processor receives one element of that array.
• [all-to-all]: each processor has an array of items, to be scattered to all other processors.
Collective operations are blocking (see section~ 2.6.3.4  ), although MPI~3.0 (which is currently only a draft) will have non-blocking collectives. We will analyze the cost of collective operations in detail in section~ 6.1  .

#### 2.6.3.6 Non-blocking communication

In a simple computer program, each instruction takes some time to execute, in a way that depends on what goes on in the processor. In parallel programs the situation is more complicated. A~send operation, in its simplest form, declares that a certain buffer of data needs to be sent, and program execution will then stop until that buffer has been safely sent and received by another processor. This sort of operation is called a non-local operation since it depends on the actions of other processes, and a \indexterm{blocking communication} operation since execution will halt until a certain event takes place. Blocking operations have the disadvantage that they can lead to deadlock  . In the context of message passing this describes the situation that a process is waiting for an event that never happens; for instance, it can be waiting to receive a message and the sender of that message is waiting for something else. Deadlock occurs if two processes are waiting for each other, or more generally, if you have a cycle of processes where each is waiting for the next process in the cycle. Example:

if ( /* this is process 0 */ )
// wait for message from 1
else if ( /* this is process 1 */ )
// wait for message from 0


A block receive here leads to deadlock. Even without deadlock, they can lead to considerable \indexterm{idle time} in the processors, as they wait without performing any useful work. On the other hand, they have the advantage that it is clear when the buffer can be reused: after the operation completes, there is a guarantee that the data has been safely received at the other end. The blocking behavior can be avoided, at the cost of complicating the buffer semantics, by using non-blocking communication operations. A non-blocking send ( MPI_Isend  ) declares that a data buffer needs to be sent, but then does not wait for the completion of the corresponding receive. There is a second operation MPI_Wait that will actually block until the receive has been completed. The advantage of this decoupling of sending and blocking is that it now becomes possible to write:

MPI_ISend(somebuffer,&handle); // start sending, and
// get a handle to this particular communication
{ ... }  // do useful work on local data
MPI_Wait(handle); // block until the communication is completed;
{ ... }  // do useful work on incoming data


With a little luck, the local operations take more time than the communication, and you have completely eliminated the communication time. In addition to non-blocking sends, there are non-blocking receives. A typical piece of code then looks like

MPI_ISend(sendbuffer,&sendhandle);
{ ... }  // do useful work on local data
MPI_Wait(sendhandle); Wait(recvhandle);
{ ... }  // do useful work on incoming data


Exercise 2.6.3.1 solves the problem using non-blocking sends and receives. What is the disadvantage of this code over a blocking solution?
End of exercise

#### 2.6.3.7 MPI version 1 and 2 and 3

The first MPI standard~ [mpi-ref] had a number of notable omissions, which are included in the MPI~2 standard~ [mpi-2-reference]  . One of these concerned parallel input/output: there was no facility for multiple processes to access the same file, even if the underlying hardware would allow that. A~separate project MPI-I/O has now been rolled into the MPI-2 standard. We will discuss parallel I/O in this book. A second facility missing in MPI, though it was present in PVM ~ [pvm-1,pvm-2] which predates MPI, is process management: there is no way to create new processes and have them be part of the parallel run. Finally, MPI-2 has support for one-sided communication: one process puts data into the memory of another, without the receiving process doing an actual receive instruction. We will have a short discussion in section~ 2.6.3.8 below. With MPI-3 the standard has gained a number of new features, such as non-blocking collectives, neighborhood collectives, and a profiling interface. The one-sided mechanisms have also been updated.

#### 2.6.3.8 One-sided communication

The MPI way of writing matching send and receive instructions is not ideal for a number of reasons. First of all, it requires the programmer to give the same data description twice, once in the send and once in the receive call. Secondly, it requires a rather precise orchestration of communication if deadlock is to be avoided; the alternative of using asynchronous calls is tedious to program, requiring the program to manage a lot of buffers. Lastly, it requires a receiving processor to know how many incoming messages to expect, which can be tricky in irregular applications. Life would be so much easier if it was possible to pull data from another processor, or conversely to put it on another processor, without that other processor being explicitly involved. This style of programming is further encouraged by the existence of RDMA support on some hardware. An early example was the Cray T3E  . These days, one-sided communication is widely available through its incorporation in the MPI-2 library; section~ 2.6.3.7  . Let us take a brief look at one-sided communication in MPI-2, using averaging of array values as an example: $$\forall_i\colon a_i\leftarrow (a_i+a_{i-1}+a_{i+1})/3.$$

The MPI parallel code will look like


// do data transfer
a_local = (a_local+left+right)/3



It is clear what the transfer has to accomplish: the a_local

variable needs to become the left variable on the processor with the next higher rank, and the right variable on the one with the next lower rank.

First of all, processors need to declare explicitly what memory area is available for one-sided transfer, the so-called window'. In this example, that consists of the a_local  , left  , and right

variables on the processors:


MPI_Win_create(&a_local,...,&data_window);
MPI_Win_create(&left,....,&left_window);
MPI_Win_create(&right,....,&right_window);



The code now has two options: it is possible to push data out


target = my_tid-1;
MPI_Put(&a_local,...,target,right_window);
target = my_tid+1;
MPI_Put(&a_local,...,target,left_window);



or to pull it in


data_window = a_local;
source = my_tid-1;
MPI_Get(&right,...,data_window);
source = my_tid+1;
MPI_Get(&left,...,data_window);



The above code will have the right semantics if the Put and Get calls are blocking; see section  2.6.3.4  . However, part of the attraction of one-sided communication is that it makes it easier to express communication, and for this, a non-blocking semantics is assumed.

The problem with non-blocking one-sided calls is that it becomes necessary to ensure explicitly that communication is successfully completed. For instance, if one processor does a one-sided put

operation on another, the other processor has no way of checking that the data has arrived, or indeed that transfer has begun at all. Therefore it is necessary to insert a global barrier in the program, for which every package has its own implementation. In MPI-2 the relevant call is the

MPI_Win_fence routine. These barriers in effect divide the program execution in supersteps ; see section  2.6.8  .

Another form of one-sided communication is used in the Charm++ package; see section  2.6.7  .

### 2.6.4 Hybrid shared/distributed memory computing

Modern architectures are often a mix of shared and distributed memory. For instance, a cluster will be distributed on the level of the nodes, but sockets and cores on a node will have shared memory. One level up, each socket can have a shared L3 cache but separate L2 and L1 caches. Intuitively it seems clear that a mix of shared and distributed programming techniques would give code that is optimally matched to the architecture. In this section we will discuss such hybrid programming models, and discuss their efficacy. A common setup of clusters uses distributed memory nodes sockets using MPI to communicate between the nodes ( inter-node communication  ) and OpenMP for parallelism on the node ( intra-node communication  ). In practice this is realized as follows:

• On each node a single MPI process is started (rather than one per core);
• This one MPI process then uses OpenMP (or another threading protocol) to spawn as many threads are there are independent sockets or cores on the node.
• The OpenMP threads can then access the shared memory of the node.
The alternative would be to have an MPI process on each core or socket and do all communication through message passing, even between processes that can see the same shared memory. Remark For reasons of affinity it may be desirable to start one MPI process per socket, rather than per node. This does not materially alter the above argument.
End of remark This hybrid strategy may sound like a good idea but the truth is complicated.
• Message passing between MPI processes sounds like it's more expensive than communicating through shared memory. However, optimized versions of MPI can typically detect when processes are on the same node, and they will replace the message passing by a simple data copy. The only argument against using MPI is then that each process has its own data space, so there is memory overhead because each process has to allocate space for buffers and duplicates of the data that is copied.
• Threading is more flexible: if a certain part of the code needs more memory per process, an OpenMP approach could limit the number of threads on that part. On the other hand, flexible handling of threads incurs a certain amount of OS overhead that MPI does not have with its fixed processes.
• Shared memory programming is conceptually simple, but there can be unexpected performance pitfalls. For instance, the performance of two processes can now be impeded by the need for maintaining cache coherence and by false sharing  .
On the other hand, the hybrid approach offers some advantage since it bundles messages. For instance, if two MPI processes on one node send messages to each of two processes on another node there would be four messages; in the hybrid model these would be bundled into one message. Exercise Analyze the discussion in the last item above. Assume that the bandwidth between the two nodes is only enough to sustain one message at a time. What is the cost savings of the hybrid model over the purely distributed model? Hint: consider bandwidth and latency separately.
End of exercise This bundling of MPI processes may have an advantage for a deeper technical reason. In order to support a handshake protocol  , each MPI process needs a small amount of buffer space for each other process. With a larger number of processes this can be a limitation, so bundling is attractive on high core count processors such as the Intel Xeon Phi  . The MPI library is explicit about what sort of threading it supports: you can query whether multi-threading is supported at all, whether all MPI calls have to originate from one thread or one thread at-a-time, or whether there is complete freedom in making MPI calls from threads.

### 2.6.5 Parallel languages

crumb trail: > parallel > Parallel programming > Parallel languages

One approach to mitigating the difficulty of parallel programming is the design of languages that offer explicit support for parallelism. There are several approaches, and we will see some examples.
• Some languages reflect the fact that many operations in scientific computing are data parallel (section~ 2.5.1  ). Languages such as HPF (section~ 2.6.5.3  ) have an array syntax  , where operations such as addition of arrays can be expressed as A = B+C  . This syntax simplifies programming, but more importantly, it specifies operations at an abstract level, so that a lower level can make specific decision about how to handle parallelism. However, the data parallelism expressed in HPF is only of the simplest sort, where the data are contained in regular arrays. Irregular data parallelism is harder; the Chapel language (section~ 2.6.5.5  ) makes an attempt at addressing this.
• Another concept in parallel languages, not necessarily orthogonal to the previous, is that of PGAS model: there is only one address space (unlike in the MPI model), but this address space is partitioned, and each partition has affinity with a thread or process. Thus, this model encompasses both SMP and distributed shared memory. A~typical PGAS language, UPC  , allows you to write programs that for the most part looks like regular C code. However, by indicating how the major arrays are distributed over processors, the program can be executed in parallel.

#### 2.6.5.1 Discussion

crumb trail: > parallel > Parallel programming > Parallel languages > Discussion

Parallel languages hold the promise of making parallel programming easier, since they make communication operations appear as simple copies or arithmetic operations. However, by doing so they invite the user to write code that may not be efficient, for instance by inducing many small messages.

FIGURE 2.14: Data shift that requires communication.

As an example, consider arrays a,b that have been horizontally partitioned over the processors, and that are shifted (see figure  2.14  ):


for (i=0; i


If this code is executed on a shared memory machine, it will be efficient, but a naive translation in the distributed case will have a single number being communicated in each iteration of the

i  loop. Clearly, these can be combined in a single buffer send/receive operation, but compilers are usually unable to make this transformation. As a result, the user is forced to, in effect, re-implement the blocking that needs to be done in an MPI implementation:


for (i=0; i


On the other hand, certain machines support direct memory copies through global memory hardware. In that case, PGAS languages can be more efficient than explicit message passing, even with physically distributed memory.

#### 2.6.5.2 Unified Parallel C

crumb trail: > parallel > Parallel programming > Parallel languages > Unified Parallel C

UPC   [UPC:homepage] is an extension to the C language. Its main source of parallelism is data parallelism  , where the compiler discovers independence of operations on arrays, and assigns them to separate processors. The language has an extended array declaration, which allows the user to specify whether the array is partitioned by blocks, or in a round-robin fashion.

The following program in UPC performs a vector-vector addition.


#include
shared int v1[N], v2[N], v1plusv2[N];
void main() {
int i;


The same program with an explicitly parallel loop construct:


#include
shared int v1[N], v2[N], v1plusv2[N];
void main()
{
int i;
upc_forall(i=0; i


is comparable to UPC in spirit, but based on Java rather than on C.

#### 2.6.5.3 High Performance Fortran

crumb trail: > parallel > Parallel programming > Parallel languages > High Performance Fortran

High Performance Fortran

(Note: {This section quoted from Wikipedia}  )

(HPF) is an extension of Fortran 90 with constructs that support parallel computing, published by the High Performance Fortran Forum (HPFF). The HPFF was convened and chaired by Ken Kennedy of Rice University. The first version of the HPF Report was published in 1993.

Building on the array syntax introduced in Fortran 90, HPF uses a data parallel model of computation to support spreading the work of a single array computation over multiple processors. This allows efficient implementation on both SIMD and MIMD style architectures. HPF features included:

• New Fortran statements, such as FORALL, and the ability to create PURE (side effect free) procedures;

• The use of compiler directives for recommended distributions of array data;

• Extrinsic procedure interface for interfacing to non-HPF parallel procedures such as those using message passing;

• Additional library routines, including environmental inquiry, parallel prefix/suffix (e.g., 'scan'), data scattering, and sorting operations.

Fortran 95 incorporated several HPF capabilities. While some vendors did incorporate HPF into their compilers in the 1990s, some aspects proved difficult to implement and of questionable use. Since then, most vendors and users have moved to OpenMP-based parallel processing. However, HPF continues to have influence. For example the proposed BIT data type for the upcoming Fortran-2008 standard contains a number of new intrinsic functions taken directly from HPF.

#### 2.6.5.4 Co-array Fortran

crumb trail: > parallel > Parallel programming > Parallel languages > Co-array Fortran

CAF is an extension to the Fortran 95/2003 language. The main mechanism to support parallelism is an extension to the array declaration syntax, where an extra dimension indicates the parallel distribution. For instance, in


Real,dimension(100),codimension[*] :: X
Real :: Y(100)[*]
Real :: Z(100,200)[10,0:9,*]



arrays X,Y have 100 elements on each processor. Array Z behaves as if the available processors are on a three-dimensional grid, with two sides specified and the third adjustable to accommodate the available processors.

Communication between processors is now done through copies along the (co-)dimensions that describe the processor grid.





The Fortran 2008 standard includes co-arrays.

#### 2.6.5.5 Chapel

crumb trail: > parallel > Parallel programming > Parallel languages > Chapel

Chapel  [Chapel:homepage] is a new parallel programming language

(Note: {This section quoted from the Chapel homepage.}  )

being developed by Cray Inc. as part of the DARPA-led High Productivity Computing Systems program (HPCS). Chapel is designed to improve the productivity of high-end computer users while also serving as a portable parallel programming model that can be used on commodity clusters or desktop multicore systems. Chapel strives to vastly improve the programmability of large-scale parallel computers while matching or beating the performance and portability of current programming models like MPI.

Chapel supports a multithreaded execution model via high-level abstractions for data parallelism, task parallelism, concurrency, and nested parallelism. Chapel's locale type enables users to specify and reason about the placement of data and tasks on a target architecture in order to tune for locality. Chapel supports global-view data aggregates with user-defined implementations, permitting operations on distributed data structures to be expressed in a natural manner. In contrast to many previous higher-level parallel languages, Chapel is designed around a multiresolution philosophy, permitting users to initially write very abstract code and then incrementally add more detail until they are as close to the machine as their needs require. Chapel supports code reuse and rapid prototyping via object-oriented design, type inference, and features for generic programming.

Chapel was designed from first principles rather than by extending an existing language. It is an imperative block-structured language, designed to be easy to learn for users of C, C++, Fortran, Java, Perl, Matlab, and other popular languages. While Chapel builds on concepts and syntax from many previous languages, its parallel features are most directly influenced by ZPL, High-Performance Fortran (HPF), and the Cray MTA's extensions to C and Fortran.

Here is vector-vector addition in Chapel:


const BlockDist= newBlock1D(bbox=[1..m], tasksPerLocale=...);
const ProblemSpace: domain(1, 64)) distributed BlockDist = [1..m];
var A, B, C: [ProblemSpace] real;
forall(a, b, c) in(A, B, C) do
a = b + alpha * c;



#### 2.6.5.6 Fortress

crumb trail: > parallel > Parallel programming > Parallel languages > Fortress

Fortress  [Fortress:homepage] is a programming language developed by Sun Microsystems. Fortress

(Note: {This section quoted from the Fortress homepage.}  )

aims to make parallelism more tractable in several ways. First, parallelism is the default. This is intended to push tool design, library design, and programmer skills in the direction of parallelism. Second, the language is designed to be more friendly to parallelism. Side-effects are discouraged because side-effects require synchronization to avoid bugs. Fortress provides transactions, so that programmers are not faced with the task of determining lock orders, or tuning their locking code so that there is enough for correctness, but not so much that performance is impeded. The Fortress looping constructions, together with the library, turns "iteration" inside out; instead of the loop specifying how the data is accessed, the data structures specify how the loop is run, and aggregate data structures are designed to break into large parts that can be effectively scheduled for parallel execution. Fortress also includes features from other languages intended to generally help productivity -- test code and methods, tied to the code under test; contracts that can optionally be checked when the code is run; and properties, that might be too expensive to run, but can be fed to a theorem prover or model checker. In addition, Fortress includes safe-language features like checked array bounds, type checking, and garbage collection that have been proven-useful in Java. Fortress syntax is designed to resemble mathematical syntax as much as possible, so that anyone solving a problem with math in its specification can write a program that is visibly related to its original specification.

#### 2.6.5.7 X10

crumb trail: > parallel > Parallel programming > Parallel languages > X10

X10 is an experimental new language currently under development at IBM in collaboration with academic partners. The X10 effort is part of the IBM PERCS project (Productive Easy-to-use Reliable Computer Systems) in the DARPA program on High Productivity Computer Systems. The PERCS project is focused on a hardware-software co-design methodology to integrate advances in chip technology, architecture, operating systems, compilers, programming language and programming tools to deliver new adaptable, scalable systems that will provide an order-of-magnitude improvement in development productivity for parallel applications by 2010.

X10 aims to contribute to this productivity improvement by developing a new programming model, combined with a new set of tools integrated into Eclipse and new implementation techniques for delivering optimized scalable parallelism in a managed runtime environment. X10 is a type-safe, modern, parallel, distributed object-oriented language intended to be accessible to Java(TM) programmers. It is targeted to future low-end and high-end systems with nodes that are built out of multi-core SMP chips with non-uniform memory hierarchies, and interconnected in scalable cluster configurations. A member of the Partitioned Global Address Space (PGAS) family of languages, X10 highlights the explicit reification of locality in the form of places; lightweight activities embodied in async, future, foreach, and ateach constructs; constructs for termination detection (finish) and phased computation (clocks); the use of lock-free synchronization (atomic blocks); and the manipulation of global arrays and data structures.

#### 2.6.5.8 Linda

crumb trail: > parallel > Parallel programming > Parallel languages > Linda

As should be clear by now, the treatment of data is by far the most important aspect of parallel programming, far more important than algorithmic considerations. The programming system Linda   [Gelernter85generativecommunication,Linda-CACM]  , also called a coordination language  , is designed to address the data handling explicitly. Linda is not a language as such, but can, and has been, incorporated into other languages.

The basic concept of Linda is the tuple space : data is added to a pool of globally accessible information by adding a label to it. Processes then retrieve data by the label value, and without needing to know which processes added the data to the tuple space.

Linda is aimed primarily at a different computation model than is relevant for HPC : it addresses the needs of asynchronous communicating processes. However, is has been used for scientific computation  [Deshpande92efficientparallel]  . For instance, in parallel simulations of the heat equation (section  4.3  ), processors can write their data into tuple space, and neighboring processes can retrieve their \indexterm{ghost region} without having to know its provenance. Thus, Linda becomes one way of implementing one-sided communication  .

#### 2.6.5.9 The Global Arrays library

crumb trail: > parallel > Parallel programming > Parallel languages > The Global Arrays library

The Global Arrays library ( http://www.emsl.pnl.gov/docs/global/  ) is another example of one-sided communication  , and in fact it predates MPI. This library has as its prime data structure Cartesian product arrays

(Note: {This means that if the array is three-dimensional, it can be described by three integers $n_1,n_2,n_3$, and each point has a coordinate $(i_1,i_2,i_3)$ with $1\leq i_1\leq n_1$ et cetera.}  )

, distributed over a processor grid of the same or lower dimension. Through library calls, any processor can access any sub-brick out of the array in either a put or

get operation. These operations are non-collective. As with any one-sided protocol, a barrier sync is necessary to ensure completion of the sends/receives.

### 2.6.6 OS-based approaches

crumb trail: > parallel > Parallel programming > OS-based approaches

It is possible to design an architecture with a shared address space, and let the data movement be handled by the operating system. The Kendall Square computer  [KSRallcache] had an architecture name all-cache', where no data was directly associated with any processor. Instead, all data was considered to be cached on a processor, and moved through the network on demand, much like data is moved from main memory to cache in a regular CPU. This idea is analogous to the NUMA support in current SGI architectures.

### 2.6.7 Active messages

crumb trail: > parallel > Parallel programming > Active messages

The MPI paradigm (section  2.6.3.3  ) is traditionally based on two-sided operations: each data transfer requires an explicit send and receive operation. This approach works well with relatively simple codes, but for complicated problems it becomes hard to orchestrate all the data movement. One of the ways to simplify consists of using active messages  . This is used in the package Charm++ [charmpp]  .

With active messages, one processor can send data to another, without that second processor doing an explicit receive operation. Instead, the recipient declares code that handles the incoming data, a method' in objective orientation parlance, and the sending processor calls this method with the data that it wants to send. Since the sending processor in effect activates code on the other processor, this is also known as remote method invocation  . A big advantage of this method is that overlap of communication and computation becomes easier to realize.

As an example, consider the matrix-vector multiplication with a tridiagonal matrix

$$\forall_i\colon y_i\leftarrow 2x_i-x_{i+1}-x_{i-1}.$$

See section  4.2.2 for an explanation of the origin of this problem in PDEs  . Assuming that each processor has exactly one index $i$, the MPI code could look like: \lstset{language=C}


if ( /* I am the first or last processor */ )
n_neighbors = 1;
else
n_neighbors = 2;

/* do the MPI_Isend operations on my local data */

sum = 2*local_x_data;
for (neighbor=0; neighbor


With active messages this looks like \lstset{language=C}


void incorporate_neighbor_data(x) {
sum = sum-x;
local_y_data = sum
}
sum = 2*local_xdata;
all_processors[myid+1].incorporate_neighbor_data(local_x_data);
all_processors[myid-1].incorporate_neighbor_data(local_x_data);



### 2.6.8 Bulk synchronous parallelism

crumb trail: > parallel > Parallel programming > Bulk synchronous parallelism

The MPI library (section  2.6.3.3  ) can lead to very efficient code. The price for this is that the programmer needs to spell out the communication in great detail. On the other end of the spectrum, PGAS languages (section  2.6.5  ) ask very little of the programmer, but give not much performance in return. One attempt to find a middle ground is the BSP model  [Valiant:1990:BSP,Skillicorn96questionsand]  . Here the programmer needs to spell out the communications, but not their ordering.

The BSP model orders the program into a sequence of supersteps  , each of which ends with a barrier synchronization. The communications that are started in one superstep are all asynchronous and rely on the barrier for their completion. This makes programming easier and removes the possibility of deadlock. Moreover, all communication are of the one-sided communication type.

Exercise Consider the parallel summing example in section  2.1  . Argue that a BSP implementation needs $\log_2n$ supersteps.
End of exercise

Because of its synchronization of the processors through the barriers concluding the supersteps the BSP model can do a simple cost analysis of parallel algorithms.

Another aspect of the BSP model is its use of overdecomposition of the problem, where multiple processes are assigned to each processor, as well as random placement of data and tasks. This is motivated with a statistical argument that shows it can remedy load imbalance  . If there are $p$ processors and if in a superstep $p$ remote accesses are made, with high likelihood some processor receives $\log p/\log \log p$ accesses, while others receive none. Thus, we have a load imbalance that worsens with increasing processor count. On the other hand, if $p\log p$ accesses are made, for instance because there are $\log p$ processes on each processor, the maximum number of accesses is $3\log p$ with high probability. This means the load balance is within a constant factor of perfect.

The BSP model is implemented in BSPlib  [BSPlib]  . Other system can be said to be BSP-like in that they use the concept of supersteps; for instance Google's Pregel   [Pregel:podc2009]  .

### 2.6.9 Data dependencies

crumb trail: > parallel > Parallel programming > Data dependencies

If two statements refer to the same data item, we say that there is a data dependency between the statements. Such dependencies limit the extent to which the execution of the statements can be rearranged. The study of this topic probably started in the 1960s, when processors could execute statements out of order

to increase throughput. The re-ordering of statements was limited by the fact that the execution had to obey the program order semantics: the result had to be as if the statements were executed strictly in the order in which they appear in the program.

These issues of statement ordering, and therefore of data dependencies, arise in several ways:

• A parallelizing compiler has to analyze the source to determine what transformations are allowed;

• if you parallelize a sequential code with OpenMP directives, you have to perform such an analysis yourself.

Here are two types of activity that require such an analysis:

• When a loop is parallelized, the iterations are no longer executed in their program order, so we have to check for dependencies.

• The introduction of tasks also means that parts of a program can be executed in a different order from in which they appear in a sequential execution.

The easiest case of dependency analysis is that of detecting that loop iterations can be executed independently. Iterations are of course independent if a data item is read in two different iterations, but if the same item is read in one iteration and written in another, or written in two different iterations, we need to do further analysis.

Analysis of data dependencies can be performed by a compiler, but compilers take, of necessity, a conservative approach. This means that iterations may be independent, but can not be recognized as such by a compiler. Therefore, OpenMP shifts this responsibility to the programmer.

We will now discuss data depencies in some detail.

#### 2.6.9.1 Types of data dependencies

crumb trail: > parallel > Parallel programming > Data dependencies > Types of data dependencies

The three types of dependencies are:

• flow dependencies, or read-after-write';

• anti dependencies, or write-after-read'; and

• output dependencies, or write-after-write'.

These dependencies can be studied in scalar code, and in fact compilers do this to determine whether statements can be rearranged, but we will mostly be concerned with their appearance in loops, since in scientific computation much of the work appears there.

Flow dependencies

Flow dependencies are not a problem if the read and write occur in the same loop iteration: \lstset{language=C}


for (i=0; i


On the other hand, if the read happens in a later iteration, there is no simple way to parallelize or vectorize the loop: \lstset{language=C}


for (i=0; i


This usually requires rewriting the code.

Exercise Consider the code \lstset{language=C}


for (i=0; i


where f() and g() denote arithmetical expressions with out further dependencies on x or  i  . Show that this loop can be parallelized/vectorized if you are allowed to use a temporary array.
End of exercise

The simplest case of an anti dependency or write-after-read is a reduction: \lstset{language=C}


for (i=0; i


This can be dealt with by explicit declaring the loop to be a reduction, or to use any of the other strategies in section  6.1.2  .

If the read and write are on an array the situation is more complicated. The iterations in this fragment \lstset{language=C}


for (i=0; i


can not be executed in arbitrary order as such. However, conceptually there is no dependency. We can solve this by introducing a temporary array: \lstset{language=C}


for (i=0; i


This is an example of a transformation that a compiler is unlikely to perform, since it can greatly affect the memory demands of the program. Thus, this is left to the programmer.

The case of an output dependency or write-after-write does not occur by itself: if a variable is written twice in sequence without an intervening read, the first write can be removed without changing the meaning of the program. Thus, this case reduces to a flow dependency.

Other output dependencies can also be removed. In the following code, t  can be declared private, thereby removing the dependency. \lstset{language=C}


for (i=0; i


If the final value of t is wanted, the lastprivate can be used in OpenMP.

#### 2.6.9.2 Parallelizing nested loops

crumb trail: > parallel > Parallel programming > Data dependencies > Parallelizing nested loops

In the above examples, data dependencies were non-trivial if in iteration $i$ of a loop different indices appeared, such as $i$ and $i+1$. Conversely, loops such as \lstset{language=C}


for (int i=0; i


are simple to parallelize. Nested loops, however, take more thought. OpenMP has a collapse' directive for loops such as \lstset{language=C}


for (int i=0; i


Here, the whole $i,j$ iteration space is parallel.

How is that with: \lstset{language=C}


for (n = 0; n < NN; n++)
for (i = 0; i < N; i++)
for (j = 0; j < N; j++)
a[i] += B[i][j]*c[j] + d[n];



Exercise

Do a reuse analysis on this loop. Assume that a,b,c do not all fit in cache together.

Now assume that c and one row of  b fit in cache, with a little room to spare. Can you find a loop interchange that will greatly benefit performance? Write a test to confirm this.
End of exercise

Analyzing this loop nest for parallelism, you see that the j -loop is a reduction, and the n -loop has flow dependencies: every

a[i] is updated in every n -iteration. The conclusion is that you can only reasonable parallelize the i -loop.

Exercise How does this parallelism analysis relate to the loop exchange from exercise  2.6.9.2 ? Is the loop after exchange still parallelizable?

If you speak OpenMP, confirm your answer by writing code that adds up the elements of  a  . You should get the same answer no matter the exchanges and the introduction of OpenMP parallelism.
End of exercise

### 2.6.10 Program design for parallelism

crumb trail: > parallel > Parallel programming > Program design for parallelism

A long time ago it was thought that some magic combination of compiler and runtime system could transform an existing sequential program into a parallel one. That hope has long evaporated, so these days a parallel program will have been written from the ground up as parallel. Of course there are different types of parallelism, and they each have their own implications for precisely how you design your parallel program. In this section we will briefly look into some of the issues.

#### 2.6.10.1 Parallel data structures

One of the issues in parallel program design is the use of AOS vs SOA  . In normal program design you often define a structure \lstset{language=C}


struct { int number; double xcoord,ycoord; } _Node;
struct { double xtrans,ytrans} _Vector;
typedef struct _Node* Node;
typedef struct _Vector* Vector;



and if you need a number of them you create an array of such structures. \lstset{language=C}


Node *nodes = (Node*) malloc( n_nodes*sizeof(struct _Node) );



This is the AOS design.

Now suppose that you want to parallelize an operation \lstset{language=C}


void shift(Node the_point,Vector by) {
the_point->xcoord += by->xtrans;
the_point->ycoord += by->ytrans;
}



which is done in a loop \lstset{language=C}


for (i=0; i


This code has the right structure for MPI programming (section  2.6.3.3  ), where every processor has its own local array of nodes. This loop is also readily parallelizable with OpenMP (section  2.6.2  ).

However, in the 1980s codes had to be substantially rewritten as it was realized that the AOS design was not good for vector computers. In that case you operands need to be contiguous, and so codes had to go to a SOA design: \lstset{language=C}


node_numbers = (int*) malloc( n_nodes*sizeof(int) );
node_xcoords = // et cetera
node_ycoords = // et cetera



and you would iterate \lstset{language=C}


for (i=0; ixtrans;
node_yoords[i] += shift_vector->ytrans;
}



Oh, did I just say that the original SOA design was best for distributed memory programming? That meant that 10 years after the vector computer era everyone had to rewrite their codes again for clusters. And of course nowadays, with increasing SIMD width  , we need to go part way back to the AOS design. (There is some experimental software support for this transformation in the Intel ispc project, http://ispc.github.io/  , which translates SPMD code to SIMD  .)

#### 2.6.10.2 Latency hiding

crumb trail: > parallel > Parallel programming > Program design for parallelism > Latency hiding

Communication between processors is typically slow, slower than data transfer from memory on a single processor, and much slower than operating on data. For this reason, it is good to think about the relative volumes of network traffic versus useful' operations when designing a parallel program. There has to be enough work per processor to offset the communication.

Another way of coping with the relative slowness of communication is to arrange the program so that the communication actually happens while some computation is going on. This is referred to as overlapping computation with communication or latency hiding  .

For example, consider the parallel execution of a matrix-vector product $y=Ax$ (there will be further discussion of this operation in section  6.2.1  ). Assume that the vectors are distributed, so each processor $p$ executes

$$\forall_{i\in I_p}\colon y_i=\sum_j a_{ij}x_j.$$

Since $x$ is also distributed, we can write this as

$$\forall_{i\in I_p}\colon y_i= \left(\sum_{\mbox{\small j local}} +\sum_{\mbox{\small j not local}} \right) a_{ij}x_j.$$

This scheme is illustrated in figure  2.6.10.2  .

\caption{The parallel matrix-vector product with a blockrow distribution.}

We can now proceed as follows:

• Start the transfer of non-local elements of $x$;

• Operate on the local elements of $x$ while data transfer is going on;

• Make sure that the transfers are finished;

• Operate on the non-local elements of $x$.

Exercise How much can you gain from overlapping computation and communication? Hint: consider the border cases where computation takes zero time and and there is only communication, and the reverse. Now consider the general case.
End of exercise

Of course, this scenario presupposes that there is software and hardware support for this overlap. MPI allows for this (see section  2.6.3.6  ), through so-called asynchronous communication or \indexterm{non-blocking communication} routines. This does not immediately imply that overlap will actually happen, since hardware support is an entirely separate question.

## 2.7 Topologies

crumb trail: > parallel > Topologies

If a number of processors are working together on a single task, most likely they need to communicate data. For this reason there needs to be a way for data to make it from any processor to any other. In this section we will discuss some of the possible schemes to connect the processors in a parallel machine. Such a scheme is called a (processor) topology  .

In order to get an appreciation for the fact that there is a genuine problem here, consider two simple schemes that do not scale up':

• Ethernet is a connection scheme where all machines on a network are on a single cable (see remark below). If one machine puts a signal on the wire to send a message, and another also wants to send a message, the latter will detect that the sole available communication channel is occupied, and it will wait some time before retrying its send operation. Receiving data on ethernet is simple: messages contain the address of the intended recipient, so a processor only has to check whether the signal on the wire is intended for it.

The problems with this scheme should be clear. The capacity of the communication channel is finite, so as more processors are connected to it, the capacity available to each will go down. Because of the scheme for resolving conflicts, the average delay before a message can be started will also increase.

• In a fully connected configuration, each processor has one wire for the communications with each other processor. This scheme is perfect in the sense that messages can be sent in the minimum amount of time, and two messages will never interfere with each other. The amount of data that can be sent from one processor is no longer a decreasing function of the number of processors; it is in fact an increasing function, and if the network controller can handle it, a processor can even engage in multiple simultaneous communications.

The problem with this scheme is of course that the design of the network interface of a processor is no longer fixed: as more processors are added to the parallel machine, the network interface gets more connecting wires. The network controller similarly becomes more complicated, and the cost of the machine increases faster than linearly in the number of processors.

Remark The above description of Ethernet is of the original design. With the use of switches, especially in an HPC context, this description does not really apply anymore.

It was initially thought that message collisions implied that ethernet would be inferior to other solutions such as IBM's token ring network, which explicitly prevents collisions. It takes fairly sophisticated statistical analysis to prove that Ethernet works a lot better than was naively expected.
End of remark

In this section we will see a number of schemes that can be increased to large numbers of processors.

### 2.7.1 Some graph theory

crumb trail: > parallel > Topologies > Some graph theory

The network that connects the processors in a parallel computer can theory} concepts. We describe the parallel machine with a graph where each processor is a node, and two nodes are connected if there is a direct connection between them. (We assume that connections are symmetric, so that the network is an

undirected graph  .)

We can then analyze two important concepts of this graph.

First of all, the degree of a node in a graph is the number of other nodes it is connected to. With the nodes representing processors, and the edges the wires, it is clear that a high degree is not just desirable for efficiency of computing, but also costly from an engineering point of view. We assume that all processors have the same degree.

Secondly, a message traveling from one processor to another, through one or more intermediate nodes, will most likely incur some delay at each stage of the path between the nodes. For this reason, the diameter of the graph is important. The diameter is defined as the maximum shortest distance, counting numbers of links, between any two nodes:

$$d(G) = \max_{i,j}|\hbox{shortest path between i and j}|.$$

If $d$ is the diameter, and if sending a message over one wire takes unit time, this means a message will always arrive in at most time $d$.

Exercise

Find a relation between the number of processors, their degree, and the diameter of the connectivity graph.
End of exercise

#### 2.11.6.1 The top500 list as a recent history of supercomputing

The top500 list offers a history of almost 20 years of supercomputing. In this section we will take a brief look at historical developments (Note: {The graphs contain John McCalpin's analysis of the top500 data.}  )  . First of all, figure~ 2.35 shows the evolution of architecture types

FIGURE 2.35: Evolution of the architecture types on the top500 list.

by charting what portion of the aggregate peak performance of the whole list is due to each type.

• Vector machines feature a relatively small number of very powerful vector pipeline processors (section  2.3.1.1  ). This type of architecture has largely disappeared; the last major machine of this type was the Japanese Earth Simulator which is seen as the spike in the graph around 2002, and which was at the top of the list for two years.

• Micro-processor based architectures get their power from the large number of processors in one machine. The graph distinguishes between x86 ( Intel and AMD processors with the exception of the Intel Itanium  ) processors and others; see also the next graph.

• A number of systems were designed as highly scalable architectures: these are denoted MPP for massively parallel processor'. In the early part of the timeline this includes architectures such as the Connection Machine  , later it is almost exclusively the IBM BlueGene  .

• In recent years accelerated systems' are the upcoming trend. Here, a processing unit such as a GPU is attached to the networked main processor.

Next, figure  2.36 shows the dominance of the x86 processor type relative to other micro-processors.

FIGURE 2.36: Evolution of the architecture types on the top500 list.

(Since we classified the IBM BlueGene as an MPP, its processors are not in the Power' category here.)

Finally, figure  2.37 shows the gradual increase in core count. Here we can make the following observations:

FIGURE 2.37: Evolution of the architecture types on the top500 list.

• In the 1990s many processors consisted of more than one chip. In the rest of the graph, we count the number of cores per package', that is, per socket  . In some cases a socket can actually contain two separate dies.

• With the advent of multi-core processors, it is remarkable how close to vertical the section in the graph are. This means that new processor types are very quickly adopted, and the lower core counts equally quickly completely disappear.

• For accelerated systems (mostly systems with GPUs  ) the concept of core count' is harder to define; the graph merely shows the increasing importance of this type of architecture.

### 2.11.7 Heterogeneous computing

crumb trail: > parallel > Remaining topics > Heterogeneous computing

You have now seen several computing models: single core, shared memory multicore, distributed memory clusters, GPUs. These models all have in common that, if there is more than one instruction stream active, all streams are interchangeable. With regard to GPUs we need to refine this statement: all instruction stream on the GPU are interchangeable. However, a GPU is not a standalone device, but can be considered a co-processor to a \indexterm{host processor}. If we want to let the host perform useful work while the co-processor is active, we now have two different instruction streams or types of streams. This situation is known as heterogeneous computing  . In the GPU case, these instruction streams are even programmed by a slightly different mechanisms --~using CUDA for the GPU~-- but this need not be the case: the Intel MIC architecture is programmed in ordinary~C.