\[ % mathjax inclusion.
\newcommand\bbP{\mathbb{P}}
\newcommand\bbR{\mathbb{R}}
\newcommand\becomes{\mathop{:=}}
\newcommand\dtdxx{\frac{\alpha\Delta t}{\Delta x^2}}
\newcommand\defined{
\mathrel{\lower 5pt \hbox{${\equiv\atop\mathrm{\scriptstyle D}}$}}}
\newcommand\fp[2]{#1\cdot10^{#2}}
\newcommand\inv{^{-1}}\newcommand\invt{^{-t}}
\newcommand\macro[1]{$\langle$#1$\rangle$}
\newcommand\nobreak{}
\newcommand\Rn{{\cal R}^n}
\newcommand\Rnxn{{\cal R}^{n\times n}}
\newcommand\sublocal{_{\mathrm{\scriptstyle local}}}
\newcommand\toprule{\hline}
\newcommand\midrule{\hline}
\newcommand\bottomrule{\hline}
\newcommand\multicolumn[3]{#3}
\newcommand\defcolvector[2]{\begin{pmatrix}
#1_0\\ #1_1\\ \vdots\\ #1_{#2-1}
\end{pmatrix}}
\]

6.1 : Collective operations

6.1.1 : Broadcast

6.1.2 : Reduction

6.1.3 : Allreduce

6.1.4 : Allgather

6.1.5 : Reduce-scatter

6.2 : Parallel dense matrix-vector product

6.2.1 : Implementing the block-row case

6.2.2 : The block-column case

6.2.3 : Scalability of the dense matrix-vector product

6.2.3.1 : Matrix-vector product, partitioning by rows

6.2.3.1.1 : An optimist's view

6.2.3.1.2 : A pessimist's view

6.2.3.1.3 : A realist's view

6.2.3.2 : Matrix-vector product, partitioning by columns

6.2.3.2.1 : Cost analysis

6.2.3.3 : Two-dimensional partitioning

6.2.3.3.1 : Algorithm

6.2.3.3.2 : Cost analysis

6.3 : LU factorization in parallel

6.3.1 : Solving a triangular system

6.3.2 : Factorization, dense case

6.3.3 : Factorization, sparse case

6.4 : Matrix-matrix product

6.4.1 : Goto matrix-matrix product

6.4.2 : Cannon's algorithm for the distributed memory matrix-matrix product

6.5 : Sparse matrix-vector product

6.5.1 : The single-processor sparse matrix-vector product

6.5.2 : The parallel sparse matrix-vector product

6.5.3 : Parallel efficiency of the sparse matrix-vector product

6.5.4 : Memory behavior of the sparse matrix-vector product

6.5.5 : The transpose product

6.5.6 : Setup of the sparse matrix-vector product

6.6 : Computational aspects of iterative methods

6.6.1 : Vector operations

6.6.1.1 : Vector additions

6.6.1.2 : Inner products

6.6.2 : Finite element matrix construction

6.6.3 : A simple model for iterative method performance

6.7 : Parallel preconditioners

6.7.1 : Jacobi preconditioning

6.7.2 : Gauss-Seidel and SOR

6.7.3 : The trouble with ILU in parallel

6.7.4 : Block Jacobi methods

6.7.5 : Parallel ILU

6.8 : Ordering strategies and parallelism

6.8.1 : Nested dissection

6.8.1.1 : Domain decomposition

6.8.1.2 : Complexity

6.8.1.3 : Irregular domains

6.8.1.4 : Parallelism

6.8.1.5 : Preconditioning

6.8.2 : Variable reordering and coloring: independent sets

6.8.2.1 : Red-black coloring

6.8.2.2 : General coloring

6.8.2.3 : Multi-color parallel ILU

6.8.3 : Irregular iteration spaces

6.8.4 : Ordering for cache efficiency

6.8.5 : Operator splitting

6.9 : Parallelism and implicit operations

6.9.1 : Wavefronts

6.9.2 : Recursive doubling

6.9.3 : Approximating implicit by explicit operations, series expansion

6.10 : Grid updates

6.10.1 : Analysis

6.10.2 : Communication and work minimizing strategy

6.11 : Block algorithms on multicore architectures

# 6 High performance linear algebra



In this section we will discuss a number of issues pertaining to linear algebra on parallel computers. We will take a realistic view of this topic, assuming that the number of processors is finite, and that the problem data is always large relative to the number of processors. We will also pay attention to the physical aspects of the communication network between the processors.

We will analyze various linear algebra operations, including iterative methods for the solution of linear systems of equations (when we refer simply to `iterative methods', the restriction to systems of linear equations is implicit), and their behavior in the presence of a network with finite bandwidth and finite connectivity.

## 6.1 Collective operations

Collective operations are an artifact of distributed memory parallelism; see section 2.4 for a discussion. Mathematically they describe extremely simple operations such as summing an array of numbers. However, since these numbers can be on different processors, each with their own memory space, this operation becomes computationally non-trivial and worthy of study.

Collective operations play an important part in linear algebra operations. In fact, the scalability of even simple operations such as a matrix-vector product can depend on the cost of these collectives, as you will see below. We start with a short discussion of the essential ideas behind collectives; see [Chan2007Collective] for details. We will then see an important application in section 6.2, and in various other places.

In computing the cost of a collective operation, three architectural constants are enough to give lower bounds: $\alpha$, the *latency* of sending a single message; $\beta$, the inverse of the *bandwidth* for sending data (see section 1.3.2); and $\gamma$, the inverse of the *computation rate*, the time for performing an arithmetic operation. Sending $n$ data items then takes time $\alpha+\beta n$. We further assume that a processor can only send one message at a time. We make no assumptions about the connectivity of the processors (see section 2.7 for this); thus, the lower bounds derived here will hold for a wide range of architectures.

The main implication of the architectural model above is that the number of active processors can only double in each step of an algorithm. For instance, to do a broadcast, first processor 0 sends to 1, then 0 and 1 can send to 2 and 3, then 0--3 send to 4--7, et cetera. This cascade of messages is called a *minimum spanning tree* of the processor network, and it follows that any collective algorithm has at least $\alpha\log_2 p$ cost associated with the accumulated latencies.

### 6.1.1 Broadcast

In a *broadcast* operation, a single processor has $n$ data elements that it needs to send to all others: the other processors need a full copy of all $n$ elements. By the above doubling argument, we conclude that a broadcast to $p$ processors takes at least $\lceil\log_2 p\rceil$ steps, with a total latency of $\lceil\log_2 p\rceil \alpha$. Since $n$ elements are sent, this adds at least a time $n\beta$ for all elements to leave the sending processor, giving a total cost lower bound of
\begin{equation}
\lceil\log_2 p\rceil \alpha+n\beta.
\end{equation}

We can illustrate the spanning tree method as follows:
\begin{equation}
\begin{array}{|c|ccc|}
\hline
&t=1&t=2&t=3\\ \hline
p_0&x_0\downarrow,x_1\downarrow,x_2\downarrow,x_3\downarrow
&x_0\downarrow,x_1\downarrow,x_2\downarrow,x_3\downarrow
&x_0,x_1,x_2,x_3\\
p_1&&x_0\downarrow,x_1\downarrow,x_2\downarrow,x_3\downarrow
&x_0,x_1,x_2,x_3\\
p_2&&&x_0,x_1,x_2,x_3\\
p_3&&&x_0,x_1,x_2,x_3\\
\hline
\end{array}
\end{equation}
(On $t=1$, $p_0$ sends to $p_1$; on $t=2$ $p_0,p_1$ send to $p_2,p_3$.)
This algorithm has the correct $\log_2p\cdot\alpha$ latency term, but processor 0 repeatedly sends the whole vector, so the bandwidth cost is $\log_2p\cdot n\beta$. If $n$ is small, the latency cost dominates, so we may characterize this as a *short vector collective operation*.

The following algorithm implements the broadcast as a combination of a scatter and a *bucket brigade algorithm*. First the scatter:
\begin{equation}
\begin{array}{|c|cccc|}
\hline
&t=0&t=1&t=2&t=3\\ \hline
p_0&x_0\downarrow,x_1,x_2,x_3 % t=0
&x_0,x_1\downarrow,x_2,x_3 % t=1
&x_0,x_1,x_2\downarrow,x_3 % t=2
&x_0,x_1,x_2,x_3\downarrow % t=3
\\
p_1& % t=0
&x_1&&\\
p_2&& % t=0,1
&x_2&\\
p_3&&& % t=0,1,2
&x_3\\
\hline
\end{array}
\end{equation}
This involves $p-1$ messages of size $n/p$, for a total time of
\begin{equation}
T_{\scriptstyle\mathrm{scatter}}(n,p) = (p-1)\alpha +
(p-1)\cdot\frac{n}{p}\cdot \beta.
\end{equation}

Then the bucket brigade has each processor active in every step, accepting a partial message (except in the first step) and passing it on to the next processor.
\begin{equation}
\begin{array}{|c|lll|}
\hline
&t=0&t=1&\mbox{et cetera}\\ \hline
p_0&x_0\downarrow\hphantom{,x_1\downarrow,x_2\downarrow,x_3\downarrow}
&x_0\hphantom{\downarrow,x_1\downarrow,x_2\downarrow,}x_3\downarrow
&x_0,\hphantom{x_1,}x_2,x_3\\
p_1&\hphantom{x_0\downarrow,}x_1\downarrow\hphantom{,x_2\downarrow,x_3\downarrow}
&x_0\downarrow,x_1\hphantom{\downarrow,x_2\downarrow,x_3\downarrow}
&x_0,x_1,\hphantom{x_2,}x_3\\
p_2&\hphantom{x_0\downarrow,x_1\downarrow,}x_2\downarrow\hphantom{,x_3\downarrow}
&\hphantom{x_0\downarrow,}x_1\downarrow,x_2\hphantom{\downarrow,x_3\downarrow}
&x_0,x_1,x_2\hphantom{,x_3}\\
p_3&\hphantom{x_0\downarrow,x_1\downarrow,x_2\downarrow,}x_3\downarrow
&\hphantom{x_0\downarrow,x_1\downarrow,}x_2\downarrow,x_3\hphantom{\downarrow}
&\hphantom{x_0,}x_1,x_2,x_3\\
\hline
\end{array}
\end{equation}
Each partial message gets sent $p-1$ times, so this stage also has a complexity of
\begin{equation}
T_{\scriptstyle\mathrm{bucket}}(n,p) = (p-1)\alpha +
(p-1)\cdot\frac{n}{p}\cdot \beta.
\end{equation}

The complexity now becomes
\begin{equation}
2(p-1)\alpha+2\beta n(p-1)/p,
\end{equation}
which is not optimal in latency, but is a better algorithm if $n$ is large, making this a *long vector collective operation*.
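The trade-off between the two broadcast algorithms can be illustrated numerically. A minimal sketch, where the machine parameters $\alpha$ and $\beta$ are made-up illustrative values, not taken from the text:

```python
from math import ceil, log2

# Illustrative (made-up) machine parameters.
alpha = 1e-6   # latency: seconds per message
beta = 1e-9    # inverse bandwidth: seconds per data item

def t_spanning_tree(p, n):
    """Spanning-tree broadcast: ceil(log2 p) steps, and processor 0
    sends the whole n-element vector in every step."""
    return ceil(log2(p)) * (alpha + n * beta)

def t_bucket(p, n):
    """Scatter followed by bucket brigade: 2(p-1) messages of n/p items."""
    return 2 * (p - 1) * alpha + 2 * beta * n * (p - 1) / p

p = 64
for n in (10, 10**4, 10**7):
    print(f"n={n:>9}: tree {t_spanning_tree(p, n):.3e}s, "
          f"bucket {t_bucket(p, n):.3e}s")
```

For small $n$ the spanning tree wins on latency; for large $n$ the bucket brigade wins on bandwidth, in line with the short-vector/long-vector distinction above.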

### 6.1.2 Reduction

In a *reduction* operation, each processor has $n$ data elements, and one processor needs to combine them elementwise, for instance computing $n$ sums or products.

By running the broadcast backwards in time, we see that a reduction operation has the same lower bound on the communication of $\lceil\log_2 p\rceil \alpha+n\beta$. A reduction operation also involves computation, which would take a total time of $(p-1)\gamma n$ sequentially: each of $n$ items gets reduced over $p$ processors. Since these operations can potentially be parallelized, the lower bound on the computation is $\frac{p-1}{p}\gamma n$, giving a total of
\begin{equation}
\lceil\log_2 p\rceil \alpha+n\beta +\frac{p-1}{p} \gamma n.
\end{equation}
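A spanning-tree reduction can be sketched as follows (my own minimal simulation, not from the text): in step $k$, each processor whose index is a multiple of $2^{k+1}$ receives the partial result of the processor $2^k$ positions away and combines it elementwise, so after $\lceil\log_2 p\rceil$ steps processor 0 holds the full result.

```python
def tree_reduce(data):
    """Elementwise sum-reduction of per-processor vectors via a
    spanning tree; data[j] is the vector on processor j."""
    partial = [list(v) for v in data]   # local partial results
    stride = 1
    while stride < len(data):
        # processors j (multiples of 2*stride) absorb processor j+stride
        for j in range(0, len(data) - stride, 2 * stride):
            partial[j] = [a + b for a, b in zip(partial[j], partial[j + stride])]
        stride *= 2
    return partial[0]

vectors = [[i, 10 * i] for i in range(5)]   # p = 5 processors, n = 2 items
print(tree_reduce(vectors))                 # → [10, 100]
```

The result on processor 0 matches the sequential elementwise sum, while only $\lceil\log_2 p\rceil$ communication steps are used.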

We illustrate the spanning tree algorithm, using the notation $x_i^{(j)}$ for the data item $i$ that was originally on processor $j$, and $x_i^{(j:k)}$ for the sum of the items $i$ of processors $j\ldots k$.