# High performance linear algebra

$% mathjax inclusion. \newcommand\bbP{\mathbb{P}} \newcommand\bbR{\mathbb{R}} \newcommand\becomes{\mathop{:=}} \newcommand\dtdxx{\frac{\alpha\Delta t}{\Delta x^2}} \newcommand\defined{ \mathrel{\lower 5pt \hbox{{\equiv\atop\mathrm{\scriptstyle D}}}}} \newcommand\fp[2]{#1\cdot10^{#2}} \newcommand\inv{^{-1}}\newcommand\invt{^{-t}} \newcommand\macro[1]{\langle#1\rangle} \newcommand\nobreak{} \newcommand\Rn{{\cal R}^n} \newcommand\Rnxn{{\cal R}^{n\times x}} \newcommand\sublocal{_{\mathrm\scriptstyle local}} \newcommand\toprule{\hline} \newcommand\midrule{\hline} \newcommand\bottomrule{\hline} \newcommand\multicolumn[3]{#3} \newcommand\defcolvector[2]{\begin{pmatrix} #1_0 #1_1 vdots #1_{#2-1} \end{pmatrix}} % { left( begin{array}{c} #1_0 #1_1 vdots #1_{#2-1} end{array} right) }$ 6.1 : Collective operations
6.1.2 : Reduction
6.1.3 : Allreduce
6.1.4 : Allgather
6.1.5 : Reduce-scatter
6.2 : Parallel dense matrix-vector product
6.2.1 : Implementing the block-row case
6.2.2 : The block-column case
6.2.3 : Scalability of the dense matrix-vector product
6.2.3.1 : Matrix-vector product, partitioning by rows
6.2.3.1.1 : An optimist's view
6.2.3.1.2 : A pessimist's view
6.2.3.1.3 : A realist's view
6.2.3.2 : Matrix-vector product, partitioning by columns
6.2.3.2.1 : Cost analysis
6.2.3.3 : Two-dimensional partitioning
6.2.3.3.1 : Algorithm
6.2.3.3.2 : Cost analysis
6.3 : LU factorization in parallel
6.3.1 : Solving a triangular system
6.3.2 : Factorization, dense case
6.3.3 : Factorization, sparse case
6.4 : Matrix-matrix product
6.4.1 : Goto matrix-matrix product
6.4.2 : Cannon's algorithm for the distributed memory matrix-matrix product
6.5 : Sparse matrix-vector product
6.5.1 : The single-processor sparse matrix-vector product
6.5.2 : The parallel sparse matrix-vector product
6.5.3 : Parallel efficiency of the sparse matrix-vector product
6.5.4 : Memory behavior of the sparse matrix-vector product
6.5.5 : The transpose product
6.5.6 : Setup of the sparse matrix-vector product
6.6 : Computational aspects of iterative methods
6.6.1 : Vector operations
6.6.1.2 : Inner products
6.6.2 : Finite element matrix construction
6.6.3 : A simple model for iterative method performance
6.7 : Parallel preconditioners
6.7.1 : Jacobi preconditioning
6.7.2 : Gauss-Seidel and SOR
6.7.3 : The trouble with ILU in parallel
6.7.4 : Block Jacobi methods
6.7.5 : Parallel ILU
6.8 : Ordering strategies and parallelism
6.8.1 : Nested dissection
6.8.1.1 : Domain decomposition
6.8.1.2 : Complexity
6.8.1.3 : Irregular domains
6.8.1.4 : Parallelism
6.8.1.5 : Preconditioning
6.8.2 : Variable reordering and coloring: independent sets
6.8.2.1 : Red-black coloring
6.8.2.2 : General coloring
6.8.2.3 : Multi-color parallel ILU
6.8.3 : Irregular iteration spaces
6.8.4 : Ordering for cache efficiency
6.8.5 : Operator splitting
6.9 : Parallelism and implicit operations
6.9.1 : Wavefronts
6.9.2 : Recursive doubling
6.9.3 : Approximating implicit by explicit operations, series expansion
6.10.1 : Analysis
6.10.2 : Communication and work minimizing strategy
6.11 : Block algorithms on multicore architectures

# 6 High performance linear algebra

In this section we will discuss a number of issues pertaining to linear algebra on parallel computers. We will take a realistic view of this topic, assuming that the number of processors is finite, and that the problem data is always large, relative to the number of processors. We will also pay attention to the physical aspects of the communication network between the processors.

We will analyze various linear algebra operations, including iterative methods for the solution of linear systems of equation (if we sometimes just refer to `iterative methods', the qualification to systems of linear equations is implicit), and their behavior in the presence of a network with finite bandwidth and finite connectivity.

## 6.1 Collective operations

crumb trail: > parallellinear > Collective operations

Collective operations are an artifact of distributed memory parallelism; see section  2.4 for a discussion. Mathematically they describe extremely simple operations such as summing an array of numbers. However, since these numbers can be on different processors, each with their own memory space, this operation becomes computationally non-trivial and worthy of study.

Collective operations play an important part in linear algebra operations. In fact, the scalability of even simple operations such as a matrix-vector product can depend on the cost of these collectives as you will see below. We start with a short discussion of the essential ideas behind collectives; see  [Chan2007Collective] for details. We will then see an important application in section  6.2  , and various other places.

In computing the cost of a collective operation, three architectural constants are enough to give lower bounds: $\alpha$, the latency of sending a single message, $\beta$, the inverse of the bandwidth for sending data (see section  1.3.2  ), and $\gamma$, the inverse of the computation rate  , the time for performing an arithmetic operation. Sending $n$ data items then takes time $\alpha +\beta n$. We further assume that a processor can only send one message at a time. We make no assumptions about the connectivity of the processors (see section  2.7 for this); thus, the lower bounds derived here will hold for a wide range of architectures.

The main implication of the architectural model above is that the number of active processors can only double in each step of an algorithm. For instance, to do a broadcast, first processor 0 sends to 1, then 0 and 1 can send to 2 and 3, then 0--3 send to 4--7, et cetera. This cascade of messages is called a minimum spanning tree of the processor network, and it follows that any collective algorithm has at least $\alpha\log_2p$ cost associated with the accumulated latencies.

crumb trail: > parallellinear > Collective operations > Broadcast

In a broadcast operation, a single processor has $n$ data elements that it needs to send to all others: the other processors need a full copy of all $n$ elements. By the above doubling argument, we conclude that a broadcast to $p$ processors takes time at least $\lceil\log_2 p\rceil$ steps with a total latency of $\lceil\log_2 p\rceil \alpha$. Since $n$ elements are sent, this adds at least a time $n\beta$ for all elements to leave the sending processor, giving a total cost lower bound of $$\lceil\log_2 p\rceil \alpha+n\beta.$$

We can illustrate the spanning tree method as follows: $$\begin{array}{|c|ccc|} \hline &t=1&t=2&t=3\\ \hline p_0&x_0\downarrow,x_1\downarrow,x_2\downarrow,x_3\downarrow &x_0\downarrow,x_1\downarrow,x_2\downarrow,x_3\downarrow &x_0,x_1,x_2,x_3\\ p_1&&x_0\downarrow,x_1\downarrow,x_2\downarrow,x_3\downarrow &x_0,x_1,x_2,x_3\\ p_2&&&x_0,x_1,x_2,x_3\\ p_3&&&x_0,x_1,x_2,x_3\\ \hline \end{array}$$ (On $t=1$, $p_0$ sends to $p_1$; on $t=2$ $p_0,p_1$ send to $p_2,p_3$.) This algorithm has the correct $\log_2p\cdot\alpha$ term, but processor 0 repeatedly sends the whole vector, so the bandwidth cost is $\log_2p\cdot n\beta$. If $n$ is small, the latency cost dominates, so we may characterize this as a short vector collective operation

The following algorithm implements the broadcast as a combination of a scatter and a bucket brigade algorithm  . First the scatter: $$\begin{array}{|c|cccc|} \hline &t=0&t=1&t=2&t=3\\ \hline p_0&x_0\downarrow,x_1,x_2,x_3 % t=0 &x_0,x_1\downarrow,x_2,x_3 % t=1 &x_0,x_1,x_2\downarrow,x_3 % t=2 &x_0,x_1,x_2,x_3\downarrow % t=3 \\ p_1& % t=0 &x_1&&\\ p_2&& % t=0,1 &x_2&\\ p_3&&& % t=0,1,2 &x_3\\ \hline \end{array}$$ This involves $p-1$ messages of size $N/p$, for a total time of $$T_{\scriptstyle\mathrm{scatter}}(N,P) = (p-1)\alpha + (p-1)\cdot\frac{N}{p}\cdot \beta.$$

Then the bucket brigade has each processor active in every step, accepting a partial message (except in the first step), and passing it on to the next processor. $$\begin{array}{|c|lll|} \hline &t=0&t=1&et cetera\\ \hline p_0&x_0\downarrow\hphantom{,x_1\downarrow,x_2\downarrow,x_3\downarrow} &x_0\hphantom{\downarrow,x_1\downarrow,x_2\downarrow,}x_3\downarrow &x_0,\hphantom{x_1,}x_2,x_3\\ p_1&\hphantom{x_0\downarrow,}x_1\downarrow\hphantom{,x_2\downarrow,x_3\downarrow} &x_0\downarrow,x_1\hphantom{\downarrow,x_2\downarrow,x_3\downarrow} &x_0,x_1,\hphantom{x_2,}x_3\\ p_2&\hphantom{x_0\downarrow,x_1\downarrow,}x_2\downarrow\hphantom{,x_3\downarrow} &\hphantom{x_0\downarrow,}x_1\downarrow,x_2\hphantom{\downarrow,x_3\downarrow} &x_0,x_1,x_2\hphantom{,x_3}\\ p_3&\hphantom{x_0\downarrow,x_1\downarrow,x_2\downarrow,}x_3\downarrow &\hphantom{x_0\downarrow,x_1\downarrow,}x_2\downarrow,x_3\hphantom{\downarrow} &\hphantom{x_0,}x_1,x_2,x_3\\ \hline \end{array}$$ Each partial message gets sent $p-1$ times, so this stage also has a complexity of $$T_{\scriptsize\mathrm{bucket}}(N,P) = (p-1)\alpha + (p-1)\cdot\frac{N}{p}\cdot beta.$$

The complexity now becomes $$2(p-1)\alpha+2\beta n(p-1)/p$$ which is not optimal in latency, but is a better algorithm if $n$ is large, making this a long vector collective operation  .

### 6.1.2 Reduction

crumb trail: > parallellinear > Collective operations > Reduction

In a reduction operation, each processor has $n$ data elements, and one processor needs to combine them elementwise, for instance computing $n$ sums or products.

By running the broadcast backwards in time, we see that a reduction operation has the same lower bound on the communication of $\lceil\log_2 p\rceil \alpha+n\beta$. A reduction operation also involves computation, which would take a total time of $(p-1)\gamma n$ sequentially: each of $n$ items gets reduced over $p$ processors. Since these operations can potentially be parallelized, the lower bound on the computation is $\frac{p-1}p \gamma n$, giving a total of $$\lceil\log_2 p\rceil \alpha+n\beta +\frac{p-1}p \gamma n.$$

We illustrate the spanning tree algorithm, using the notation $x_i^{(j)}$ for the data item $i$ that was originally on processor $j$, and $x_i^{(j:k)}$ for the sum of the items $i$ of processors $j\ldots k$.