In this section we will discuss a number of issues pertaining to linear algebra on parallel computers. We will take a realistic view of this topic, assuming that the number of processors is finite, and that the problem data is always large, relative to the number of processors. We will also pay attention to the physical aspects of the communication network between the processors.
We will analyze various linear algebra operations, including iterative methods for the solution of linear systems of equation (if we sometimes just refer to `iterative methods', the qualification to systems of linear equations is implicit), and their behavior in the presence of a network with finite bandwidth and finite connectivity.
crumb trail: > parallellinear > Collective operations
Collective operations are an artifact of distributed memory parallelism; see section 2.4 for a discussion. Mathematically they describe extremely simple operations such as summing an array of numbers. However, since these numbers can be on different processors, each with their own memory space, this operation becomes computationally non-trivial and worthy of study.
Collective operations play an important part in linear algebra operations. In fact, the scalability of even simple operations such as a matrix-vector product can depend on the cost of these collectives as you will see below. We start with a short discussion of the essential ideas behind collectives; see [Chan2007Collective] for details. We will then see an important application in section 6.2 , and various other places.
In computing the cost of a collective operation, three architectural constants are enough to give lower bounds: $\alpha$, the latency of sending a single message, $\beta$, the inverse of the bandwidth for sending data (see section 1.3.2 ), and $\gamma$, the inverse of the computation rate , the time for performing an arithmetic operation. Sending $n$ data items then takes time $\alpha +\beta n$. We further assume that a processor can only send one message at a time. We make no assumptions about the connectivity of the processors (see section 2.7 for this); thus, the lower bounds derived here will hold for a wide range of architectures.
The main implication of the architectural model above is that the number of active processors can only double in each step of an algorithm. For instance, to do a broadcast, first processor 0 sends to 1, then 0 and 1 can send to 2 and 3, then 0--3 send to 4--7, et cetera. This cascade of messages is called a minimum spanning tree of the processor network, and it follows that any collective algorithm has at least $\alpha\log_2p$ cost associated with the accumulated latencies.
crumb trail: > parallellinear > Collective operations > Broadcast
In a broadcast operation, a single processor has $n$ data elements that it needs to send to all others: the other processors need a full copy of all $n$ elements. By the above doubling argument, we conclude that a broadcast to $p$ processors takes time at least $\lceil\log_2 p\rceil$ steps with a total latency of $\lceil\log_2 p\rceil \alpha$. Since $n$ elements are sent, this adds at least a time $n\beta$ for all elements to leave the sending processor, giving a total cost lower bound of
\begin{equation} \lceil\log_2 p\rceil \alpha+n\beta. \end{equation}
We can illustrate the spanning tree method as follows:
\begin{equation} \begin{array}{|c|ccc|} \hline &t=1&t=2&t=3\\ \hline p_0&x_0\downarrow,x_1\downarrow,x_2\downarrow,x_3\downarrow &x_0\downarrow,x_1\downarrow,x_2\downarrow,x_3\downarrow &x_0,x_1,x_2,x_3\\ p_1&&x_0\downarrow,x_1\downarrow,x_2\downarrow,x_3\downarrow &x_0,x_1,x_2,x_3\\ p_2&&&x_0,x_1,x_2,x_3\\ p_3&&&x_0,x_1,x_2,x_3\\ \hline \end{array} \end{equation}
(On $t=1$, $p_0$ sends to $p_1$; on $t=2$ $p_0,p_1$ send to $p_2,p_3$.) This algorithm has the correct $\log_2p\cdot\alpha$ term, but processor 0 repeatedly sends the whole vector, so the bandwidth cost is $\log_2p\cdot n\beta$. If $n$ is small, the latency cost dominates, so we may characterize this as a short vector collective operationThe following algorithm implements the broadcast as a combination of a scatter and a bucket brigade algorithm . First the scatter:
\begin{equation} \begin{array}{|c|cccc|} \hline &t=0&t=1&t=2&t=3\\ \hline p_0&x_0\downarrow,x_1,x_2,x_3 % t=0 &x_0,x_1\downarrow,x_2,x_3 % t=1 &x_0,x_1,x_2\downarrow,x_3 % t=2 &x_0,x_1,x_2,x_3\downarrow % t=3 \\ p_1& % t=0 &x_1&&\\ p_2&& % t=0,1 &x_2&\\ p_3&&& % t=0,1,2 &x_3\\ \hline \end{array} \end{equation}
This involves $p-1$ messages of size $N/p$, for a total time of\begin{equation} T_{\scriptstyle\mathrm{scatter}}(N,P) = (p-1)\alpha + (p-1)\cdot\frac{N}{p}\cdot \beta. \end{equation}
Then the bucket brigade has each processor active in every step, accepting a partial message (except in the first step), and passing it on to the next processor.
\begin{equation} \begin{array}{|c|lll|} \hline &t=0&t=1&et cetera\\ \hline p_0&x_0\downarrow\hphantom{,x_1\downarrow,x_2\downarrow,x_3\downarrow} &x_0\hphantom{\downarrow,x_1\downarrow,x_2\downarrow,}x_3\downarrow &x_0,\hphantom{x_1,}x_2,x_3\\ p_1&\hphantom{x_0\downarrow,}x_1\downarrow\hphantom{,x_2\downarrow,x_3\downarrow} &x_0\downarrow,x_1\hphantom{\downarrow,x_2\downarrow,x_3\downarrow} &x_0,x_1,\hphantom{x_2,}x_3\\ p_2&\hphantom{x_0\downarrow,x_1\downarrow,}x_2\downarrow\hphantom{,x_3\downarrow} &\hphantom{x_0\downarrow,}x_1\downarrow,x_2\hphantom{\downarrow,x_3\downarrow} &x_0,x_1,x_2\hphantom{,x_3}\\ p_3&\hphantom{x_0\downarrow,x_1\downarrow,x_2\downarrow,}x_3\downarrow &\hphantom{x_0\downarrow,x_1\downarrow,}x_2\downarrow,x_3\hphantom{\downarrow} &\hphantom{x_0,}x_1,x_2,x_3\\ \hline \end{array} \end{equation}
Each partial message gets sent $p-1$ times, so this stage also has a complexity of\begin{equation} T_{\scriptsize\mathrm{bucket}}(N,P) = (p-1)\alpha + (p-1)\cdot\frac{N}{p}\cdot beta. \end{equation}
The complexity now becomes
\begin{equation} 2(p-1)\alpha+2\beta n(p-1)/p \end{equation}
which is not optimal in latency, but is a better algorithm if $n$ is large, making this a long vector collective operation .
crumb trail: > parallellinear > Collective operations > Reduction
In a reduction operation, each processor has $n$ data elements, and one processor needs to combine them elementwise, for instance computing $n$ sums or products.
By running the broadcast backwards in time, we see that a reduction operation has the same lower bound on the communication of $\lceil\log_2 p\rceil \alpha+n\beta$. A reduction operation also involves computation, which would take a total time of $(p-1)\gamma n$ sequentially: each of $n$ items gets reduced over $p$ processors. Since these operations can potentially be parallelized, the lower bound on the computation is $\frac{p-1}p \gamma n$, giving a total of
\begin{equation} \lceil\log_2 p\rceil \alpha+n\beta +\frac{p-1}p \gamma n. \end{equation}
We illustrate the spanning tree algorithm, using the notation $x_i^{(j)}$ for the data item $i$ that was originally on processor $j$, and $x_i^{(j:k)}$ for the sum of the items $i$ of processors $j\ldots k$.
\begin{equation} \begin{array}{|c|ccc|} \hline &t=1&t=2&t=3\\ \hline p_0&x_0^{(0)},x_1^{(0)},x_2^{(0)},x_3^{(0)} &x_0^{(0:1)},x_1^{(0:1)},x_2^{(0:1)},x_3^{(0:1)} &x_0^{(0:3)},x_1^{(0:3)},x_2^{(0:3)},x_3^{(0:3)}\\ p_1&x_0^{(1)}\uparrow,x_1^{(1)}\uparrow,x_2^{(1)}\uparrow,x_3^{(1)}\uparrow &&\\ p_2&x_0^{(2)},x_1^{(2)},x_2^{(2)},x_3^{(2)} &x_0^{(2:3)}\uparrow,x_1^{(2:3)}\uparrow,x_2^{(2:3)}\uparrow,x_3^{(2:3)}\uparrow &\\ p_3&x_0^{(3)}\uparrow,x_1^{(3)}\uparrow,x_2^{(3)}\uparrow,x_3^{(3)}\uparrow &&\\ \hline \end{array} \end{equation}
On time $t=1$ processors $p_0,p_2$ receive from $p_1,p_3$, and on $t=2$ $p_0$ receives from $p_2$.As with the broadcast above, this algorithm does not achieve the lower bound; instead it has a complexity
\begin{equation} \lceil\log_2 p\rceil (\alpha+n\beta +\frac{p-1}p \gamma n). \end{equation}
For short vectors the $\alpha$ term dominates, so this algorithm is sufficient. For long vectors one can, as above, use other algorithms [Chan2007Collective] .A long vector reduction can be done using a bucket brigade followed by a gather. The complexity is as above, except that the bucket brigade performs partial reductions, for a time of $\gamma(p-1)N/p$. The gather performs no further operations.
crumb trail: > parallellinear > Collective operations > Allreduce
An allreduce operation computes the same elementwise reduction of $n$ elements on each processor, but leaves the result on each processor, rather than just on the root of the spanning tree. This could be implemented as a reduction followed by a broadcast, but more clever algorithms exist.
The lower bound on the cost of an allreduce is, somewhat remarkably, almost the same as of a simple reduction: since in a reduction not all processors are active at the same time, we assume that the extra work can be spread out perfectly. This means that the lower bound on the latency and computation stays the same. For the bandwidth we reason as follows: in order for the communication to be perfectly parallelized, $\frac{p-1}p n$ items have to arrive at, and leave each processor. Thus we have a total time of
\begin{equation} \lceil \log_2 p\rceil\alpha +2\frac{p-1}pn\beta +\frac{p-1}pn\gamma. \end{equation}
The allreduce is an important operation in iterative methods for linear systems; see for instance section 5.5.9 . There, the reduction is typically applied to a single scalar, meaning that latency is relatively important.
crumb trail: > parallellinear > Collective operations > Allgather
In a gather operation on $n$ elements, each processor has $n/p$ elements, and one processor collects them all, without combining them as in a reduction. The allgather computes the same gather, but leaves the result on all processors. You will see two important applications of this in the dense matrix-vector product (section 6.2.2.3 ), and the setup of the sparse matrix-vector product (section 6.5.6 ).
Again we assume that gathers with multiple targets are active simultaneously. Since every processor originates a minimum spanning tree, we have $\log_2p\alpha$ latency; since each processor receives $n/p$ elements from $p-1$ processors, there is $(p-1)\times(n/p)\beta$ bandwidth cost. The total cost for constructing a length $n$ vector by allgather is then
\begin{equation} \lceil \log_2 p\rceil\alpha +\frac{p-1}pn\beta. \end{equation}
We illustrate this:\begin{equation} \begin{array}{|c|ccc|} \hline &t=1&t=2&t=3\\ \hline p_0&x_0\downarrow&x_0x_1\downarrow&x_0x_1x_2x_3\\ p_1&x_1\uparrow&x_0x_1\downarrow&x_0x_1x_2x_3\\ p_2&x_2\downarrow&x_2x_3\uparrow&x_0x_1x_2x_3\\ p_3&x_3\uparrow&x_2x_3\uparrow&x_0x_1x_2x_3\\ \hline \end{array} \end{equation}
At time $t=1$, there is an exchange between neighbors $p_0,p_1$ and likewise $p_2,p_3$; at $t=2$ there is an exchange over distance two between $p_0,p_2$ and likewise $p_1,p_3$.
crumb trail: > parallellinear > Collective operations > Reduce-scatter
In a reduce-scatter operation, each processor has $n$ elements, and an $n$-way reduction is done on them. Unlike in the reduce or allreduce, the result is then broken up, and distributed as in a scatter operation.
Formally, processor $i$ has an item $x_i^{(i)}$, and it needs $\sum_j x_i^{(j)}$. We could implement this by doing a size $p$ reduction, collecting the vector $(\sum_i x_0^{(i)},\sum_i x_1^{(i)},\ldots)$ on one processor, and scattering the results. However it is possible to combine these operations in a so-called bidirectional exchange algorithm:
\begin{equation} \begin{array}{|c|ccc|} \hline &t=1&t=2&t=3\\ \hline p_0&x_0^{(0)},x_1^{(0)},x_2^{(0)}\downarrow,x_3^{(0)}\downarrow &x_0^{(0:2:2)},x_1^{(0:2:2)}\downarrow \hphantom{x_0^{(0:2:2)},x_1^{(0:2:2)}\downarrow} &x_0^{(0:3)} \hphantom{x_3^{(0:3)}x_3^{(0:3)}x_3^{(0:3)}}\\ p_1&x_0^{(1)},x_1^{(1)},x_2^{(1)}\downarrow,x_3^{(1)}\downarrow &x_0^{(1:3:2)}\uparrow,x_1^{(1:3:2)} \hphantom{x_0^{(0:2:2)},x_1^{(0:2:2)}\downarrow} &\hphantom{x_3^{(0:3)}} x_1^{(0:3)} \hphantom{x_3^{(0:3)}x_3^{(0:3)}} \\ p_2&x_0^{(2)}\uparrow,x_1^{(2)}\uparrow,x_2^{(2)},x_3^{(2)} &\hphantom{x_0^{(0:2:2)},x_1^{(0:2:2)}\downarrow} x_2^{(0:2:2)},x_3^{(0:2:2)}\downarrow &\hphantom{x_3^{(0:3)}x_3^{(0:3)}} x_2^{(0:3)} \hphantom{x_3^{(0:3)}}\\ p_3&x_0^{(3)}\uparrow,x_1^{(3)}\uparrow,x_2^{(3)},x_3^{(3)} &\hphantom{x_0^{(0:2:2)},x_1^{(0:2:2)}\downarrow} x_0^{(1:3:2)}\uparrow,x_1^{(1:3:2)} &\hphantom{x_3^{(0:3)}x_3^{(0:3)}x_3^{(0:3)}} x_3^{(0:3)}\\ \hline \end{array} \end{equation}
The reduce-scatter can be considered as a allgather run in reverse, with arithmetic added, so the cost is
\begin{equation} \lceil \log_2 p\rceil\alpha +\frac{p-1}pn(\beta+\gamma). \end{equation}
crumb trail: > parallellinear > Parallel dense matrix-vector product
In this section we will go into great detail into the performance, and in particular the scalability, of the parallel dense matrix-vector product. First we will consider a simple case, and discuss the parallelism aspects in some amount of detail.
crumb trail: > parallellinear > Parallel dense matrix-vector product > Implementing the block-row case
In designing a parallel version of an algorithm, one often proceeds by making a \indexterm {data decomposition} of the objects involved. In the case of a matrix-vector operations such as the product $y=Ax$, we have the choice of starting with a vector decomposition, and exploring its ramifications on how the matrix can be decomposed, or rather to start with the matrix, and deriving the vector decomposition from it. In this case, it seems natural to start with decomposing the matrix rather than the vector, since it will be most likely of larger computational significance. We now have two choices:
We make a one-dimensional decomposition of the matrix, splitting it in block rows or block columns, and assigning each of these -- or groups of them -- to a processor.
Alternatively, we can make a two-dimensional decomposition, assigning to each processor one or more general submatrices.
We start by considering the decomposition in block rows. Consider a processor $p$ and the set $I_p$ of indices of rows that it owns
(Note: {For ease of exposition we will let $I_p$ be a contiguous range of indices, but any general subset is allowed.} )
, and let $i\in I_p$ be a row that is assigned to this processor. The elements in row $i$ are used in the operation\begin{equation} y_i=\sum_ja_{ij}x_j \end{equation}
We now reason:
If processor $p$ has all $x_j$ values, the matrix-vector product can trivially be executed, and upon completion, the processor has the correct values $y_j$ for $j\in I_p$.
This means that every processor needs to have a copy of $x$, which is wasteful. Also it raises the question of data integrity: you need to make sure that each processor has the correct value of $x$.
In certain practical applications (for instance iterative methods, as you have seen before), the output of the matrix-vector product is, directly or indirectly, the input for a next matrix-vector operation. This is certainly the case for the power method which computes $x, Ax, A^2x,\ldots$. Since our operation started with each processor having the whole of $x$, but ended with it owning only the local part of $Ax$, we have a mismatch.
Maybe it is better to assume that each processor, at the start of the operation, has only the local part of $x$, that is, those $x_i$ where $i\in I_p$, so that the start state and end state of the algorithm are the same. This means we have to change the algorithm to include some communication that allows each processor to obtain those values $x_i$ where $i\not\in\nobreak I_p$.
Exercise
Go through a similar reasoning for the case where the matrix is
decomposed in block columns. Describe the parallel algorithm in
detail, like above, but without giving pseudo code.
End of exercise
Let us now look at the communication in detail: we will consider a fixed processor $p$ and consider the operations it performs and the communication that necessitates. According to the above analysis, in executing the statement $y_i=\sum_ja_{ij}x_j$ we have to be aware what processor the $j$ values `belong to'. To acknowledge this, we write
\begin{equation} y_i=\sum_{j\in I_p}a_{ij}x_j+\sum_{j\not\in I_p}a_{ij}x_j \label{eq:yi=sum-in-and-not} \end{equation}
If $j\in I_p$, the instruction $y_i \leftarrow y_i + a_{aij} x_j$ involves only quantities that are already local to the processor. Let us therefore concentrate on the case $j\not\in I_p$. It would be nice if we could just write the statement
y(i) = y(i) + a(i,j)*x(j)and some lower layer would automatically transfer x(j) from whatever processor it is stored on to a local register. (The PGAS languages of section 2.6.5 aim to do this, but their efficiency is far from guaranteed.) An implementation, based on this optimistic view of parallelism, is given in figure 6.1 .
{Naive Parallel MVP}{$A,x\sublocal,y\sublocal,p$} \KwIn{Processor number~$p$; the elements $x_i$ with $i\in I_p$; matrix elements $A_{ij}$ with $i\in I_p$.} \KwOut{The elements $y_i$ with $i\in I_p$} \For{$i\in I_p$}{$s\leftarrow0$\; \For{$j\in I_p$}{$s\leftarrow s+a_{ij}x_{j}$} \For{$j\not\in I_p$}{send $x_j$ from the processor that owns it to the current one, then\; $s\leftarrow s+a_{ij}x_{j}$} $y_i\leftarrow s$ }
DISPLAYPROCEDURE 6.1: A na\"\i vely coded parallel matrix-vector product.
The immediate problem with such a `local' approach is that too much communication will take place.
If the matrix $A$ is dense, the element $x_j$ is necessary once for each row $i\in I_p$, and it will thus be fetched once for every row $i\in I_p$.
For each processor $q\not=p$, there will be (large) number of elements $x_j$ with $j\in I_q$ that need to be transferred from processor $q$ to $p$. Doing this in separate messages, rather than one bulk transfer, is very wasteful.
Instead of communicating individual elements of $x$, we use a local buffer $B_{pq}$ for each processor $q\not=p$ where we collect the elements from $q$ that are needed to perform the product on $p$. (See
\caption{The parallel matrix-vector product with a blockrow distribution.}
figure 6.1 for an illustration.) The parallel algorithm is given in figure 6.1 .
{Parallel MVP}{$A,x\sublocal,y\sublocal,p$} \KwIn{Processor number~$p$; the elements $x_i$ with $i\in I_p$; matrix elements $A_{ij}$ with $i\in I_p$.} \KwOut{The elements $y_i$ with $i\in I_p$} \For{$q\not=p$}{Send elements of~$x$ from processor $q$ to~$p$, receive in buffer~$B_{pq}$.} $y\sublocal\leftarrow A x\sublocal$\\ \For{$q\not=p$}{$y\sublocal\leftarrow y\sublocal+A_{pq}B_q$}
\caption{A buffered implementation of the parallel matrix-vector product.}
In addition to preventing an element from being fetched more than once, this also combines many small messages into one large message, which is usually more efficient; recall our discussion of bandwidth and latency in section 2.7.8 .
Exercise
Give pseudocode for the matrix-vector product using
nonblocking operations (section
2.6.3.6
)
End of exercise
Above we said that having a copy of the whole of $x$ on each processor was wasteful in space. The implicit argument here is that, in general, we do not want local storage to be function of the number of processors: ideally it should be only a function of the local data. (This is related to weak scaling; section 2.2.5 .)
You see that, because of communication considerations, we have actually decided that it is unavoidable, or at least preferable, for each processor to store the whole input vector. Such trade-offs between space and time efficiency are fairly common in parallel programming. For the dense matrix-vector product we can actually defend this overhead, since the vector storage is of lower order than the matrix storage, so our over-allocation is small by ratio. Below (section 6.5 ), we will see that for the sparse matrix-vector product the overhead can be much less.
It is easy to see that the parallel dense matrix-vector product, as the time for communication}. In the next couple of sections you will see that the block row implementation above is not optimal if we take communication into account. For scalability we need a two-dimensional decomposition. We start with a discussion of collectives.
crumb trail: > parallellinear > Parallel dense matrix-vector product > Scalability of the dense matrix-vector product
\newenvironment{boxedexample}{
\noindent
\newline
\begin{boxedminipage}
[t]{\textwidth}
\addtocounter{ourexample}{1}%
{\bf Example \theourexample.}
\em
}
{
\end{boxedminipage}
}
\newenvironment{boxedexercise}{
\noindent
\newline
\begin{boxedminipage}
[t]{\textwidth}
\addtocounter{exercise}{1}%
{\bf Exercise \theexercise.}
\em
}
{
\end{boxedminipage}
}
\newenvironment{boxit}{
\noindent
\newline
\begin{boxedminipage}
[t]{\textwidth}
}
{
\end{boxedminipage}
}
\begin{boxedminipage}
[t]{\textwidth}
Remark
End of remark
\end{boxedminipage}
\newenvironment{boxedremark}{%
\noindent
\newline
\begin{boxedminipage}
[t]{\textwidth}
\addtocounter{remark}{1}%
{\bf Remark \theremark.}
\em
}
{
\end{boxedminipage}
}
\newenvironment{notetoexpert}{%
\noindent
\newline
\begin{boxedminipage}
[t]{\textwidth}
\addtocounter{remark}{1}%
{\bf Note to the skeptical expert.}
\em
}
{
\end{boxedminipage}
}
\newenvironment{boxeddefinition}{%
\noindent
\newline
\begin{boxedminipage}
[t]{\textwidth}
\addtocounter{mydefinition}{1}%
{\bf Definition \themydefinition.}
\em
}
{
\end{boxedminipage}
}
\newcommand{\eop}{
{\bf endofproof}
}
\renewcommand{\S}{\mathbb{S}}
\newcommand{\N}{\mathbb{N}}
\newcommand{\Rm}{\bbR^{m}}
\newcommand{\Rn}{\bbR^{n}}
\newcommand{\Rk}{\bbR^{k}}
\newcommand{\Rmxm}{\bbR^{m \times m}}
\newcommand{\Rnxn}{\bbR^{n \times n}}
\newcommand{\Rmxn}{\bbR^{m \times n}}
\newcommand{\Rnxm}{\bbR^{n \times m}}
\newcommand{\Rmxk}{\bbR^{m \times k}}
\newcommand{\Rkxn}{\bbR^{k \times n}}
\newcommand{\Rnxk}{\bbR^{n \times k}}
\newcommand{\Rkxk}{\bbR^{k \times k}}
\newcommand{\Sm}{\S^{m}}
\newcommand{\Sn}{\S^{n}}
\newcommand{\Sk}{\S^{k}}
\newcommand{\Sonexm}{\S^{1 \times m}}
\newcommand{\Smxm}{\S^{m \times m}}
\newcommand{\Snxn}{\S^{n \times n}}
\newcommand{\Smxn}{\S^{m \times n}}
\newcommand{\Snxm}{\S^{n \times m}}
\newcommand{\Smxk}{\S^{m \times k}}
\newcommand{\Skxn}{\S^{k \times n}}
\newcommand{\Snxk}{\S^{n \times k}}
\newcommand{\alert}{!!!}
\newcommand{\row}[1]{\check{#1}}
\newcommand{\tr}[1]{{#1}^{\mathrm{T}}}
\newcommand{\tc}[1]{{#1}^{\mathrm{H}}}
\newcommand{\TrLw}[1]{\mathrm{TrLw}(#1)}
\newcommand{\TrUp}[1]{\mathrm{TrUp}(#1)}
\newcommand{\SyLw}[1]{\mathrm{SyLw}(#1)}
\newcommand{\SyUp}[1]{\mathrm{SyUp}(#1)}
\newcommand{\HeLw}[1]{\mathrm{HeLw}(#1)}
\newcommand{\HeUp}[1]{\mathrm{HeUp}(#1)}
\newcommand{\defcolvector}[2]{
\left(
\begin{array}{c}
#1_0 \\
#1_1 \\
\vdots \\
#1_{#2-1}
\end{array}
\right)
}
\newcommand{\defrowvector}[2]{
\left(#1_0, #1_1, \ldots, #1_{#2-1}\right)
}
\begin{array}{c c c c}
\end{array}
\newcommand{\defmatrix}[3]{
\left(
\begin{array}{c c c c}
#1_{00} & #1_{01} & \ldots & #1_{0,#3-1} \\
#1_{10} & #1_{11} & \ldots & #1_{1,#3-1} \\
\vdots & \vdots & \ddots & \vdots \\
#1_{#2-1,0} & #1_{#2-1,0} & \ldots & #1_{#2-1,#3-1} \\
\end{array}
\right)
}
\newcommand{\deflowtrmatrix}[3]{
\left(
\begin{array}{c c c c}
#1_{00} & 0 & \ldots & 0 \\
#1_{10} & #1_{11} & \ldots & 0 \\
\vdots & \vdots & \ddots & \vdots \\
#1_{#2-1,0} & #1_{#2-1,0} & \ldots & #1_{#2-1,#3-1} \\
\end{array}
\right)
}
\newcommand{\Chol}[1]{\mbox{\sc Chol}(#1)}
\newcommand{\tril}[1]{\mbox{\sc tril}(#1)}
\newcommand{\trilu}[1]{\mbox{\sc trilu}(#1)}
\newcommand{\triu}[1]{\mbox{\sc triu}(#1)}
\let\becomes\leftarrow
\let\whline\hline
In this section, we will give a full analysis of the parallel computation of $ y \becomes A x $, where $ x, y \in \Rn $ and $ A \in \Rnxn $. We will assume that $ p $ processes will be used, but we make no assumptions on their connectivity. We will see that the way the matrix is distributed makes a big difference for the scaling of the algorithm; for the original research see [HeWo:94,Schreiber:scalability92,Stewart90] , and see section 2.2.5 for the definitions of the various forms of scaling.
crumb trail: > parallellinear > Parallel dense matrix-vector product > Scalability of the dense matrix-vector product > Matrix-vector product, partitioning by rows
Partition
\begin{equation} A \rightarrow \defcolvector{A}{p} \quad x \rightarrow \defcolvector{x}{p} , \quad \mbox{and} \quad y \rightarrow \defcolvector{y}{p} , \end{equation}
where $ A_i \in \bbR^{m_i \times n} $ and $ x_i, y_i \in \bbR^{m_i} $ with $ \sum_{i=0}^{p-1} m_i = n $ and $ m_i \approx n / p $. We will start by assuming that $ A_i $, $ x_i $, and $ y_i $ are originally assigned to $ {\cal P}_i $.The computation is characterized by the fact that each processor needs the whole vector $x$, but owns only an $n/p$ fraction of it. Thus, we execute an allgather of $x$. After this, the processor can execute the local product $y_i\becomes A_ix$; no further communication is needed after that.
An algorithm with cost computation for $ y = A x $ in parallel is then given by
\begin{equation} \vcenter{\hskip\unitindent \begin{tabular}{ p{2in} l }\toprule Step & Cost (lower bound) \\ \midrule Allgather $ x_i $ so that $ x $ is available on all nodes & $ \lceil \log_2(p)\rceil \alpha + \frac{p-1}{p} n \beta \approx \log_2(p) \alpha + n \beta $ \\ Locally compute $ y_i = A_i x $ & $ \approx 2 \frac{n^2}{p} \gamma $ \\ \bottomrule \end{tabular} } \end{equation}
\paragraph*{Cost analysis}
The total cost of the algorithm is given by, approximately,
\begin{equation} T_p(n) = T_p^{\mbox{1D-row}}(n) = 2 \frac{n^2}{p} \gamma + \begin{array}[t]{c} \underbrace{\log_2(p) \alpha + n \beta.} \\ \mbox{Overhead} \end{array} \end{equation}
Since the sequential cost is $ T_1(n) = 2 n^2 \gamma $, the speedup is given by\begin{equation} S_p^{\mbox{1D-row}}(n) = \frac{T_1(n)} {T_p^{\mbox{1D-row}}(n)} = \frac{2 n^2 \gamma} { 2 \frac{n^2}{p} \gamma + \log_2(p) \alpha + n \beta} = \frac{p} { 1 + \frac{p \log_2(p)}{2 n^2} \frac{\alpha}{\gamma} + \frac{p}{2 n} \frac{\beta}{\gamma} } \end{equation}
and the parallel efficiency by\begin{equation} E_p^{\mbox{1D-row}}(n) = \frac{S_p^{\mbox{1D-row}}(n)}{p} = \frac{1} { 1 + \frac{p \log_2(p)}{2 n^2} \frac{\alpha}{\gamma} + \frac{p}{2 n} \frac{\beta}{\gamma} }. \end{equation}
crumb trail: > parallellinear > Parallel dense matrix-vector product > Scalability of the dense matrix-vector product > Matrix-vector product, partitioning by rows > An optimist's view
Now, if one fixes $ p $ and lets $ n $ get large,
\begin{equation} \lim_{n \rightarrow \infty} E_p( n ) = \lim_{n \rightarrow \infty} \left[ \frac{1} { 1 + \frac{p \log_2(p)}{2 n^2} \frac{\alpha}{\gamma} + \frac{p}{2 n} \frac{\beta}{\gamma} } \right] = 1. \end{equation}
Thus, if one can make the problem large enough, eventually the parallel efficiency is nearly perfect. However, this assumes unlimited memory, so this analysis is not practical.
crumb trail: > parallellinear > Parallel dense matrix-vector product > Scalability of the dense matrix-vector product > Matrix-vector product, partitioning by rows > A pessimist's view
In a strong scalability analysis, one fixes $ n $ and lets $ p $ get large, to get
\begin{equation} \lim_{p \rightarrow \infty} E_p( n ) = \lim_{p \rightarrow \infty} \left[ \frac{1} {1 + \frac{p \log_2(p)}{2 n^2} \frac{\alpha}{\gamma} + \frac{p}{2 n} \frac{\beta}{\gamma} } \right] \sim \frac{1}{p}\downarrow 0. \end{equation}
Thus, eventually the parallel efficiency becomes nearly nonexistent.
crumb trail: > parallellinear > Parallel dense matrix-vector product > Scalability of the dense matrix-vector product > Matrix-vector product, partitioning by rows > A realist's view
In a more realistic view we increase the number of processors with the amount of data. This is called weak scalability , and it makes the amount of memory that is available to store the problem scale linearly with $ p $.
Let $ M $ equal the number of floating point numbers that can be stored in a single node's memory. Then the aggregate memory is given by $ M p $. Let $ n_{\max}(p) $ equal the largest problem size that can be stored in the aggregate memory of $ p $ nodes. Then, if {\em all} memory can be used for the matrix,
\begin{equation} (n_{\max}(p))^2 = M p \quad \mbox{or} \quad n_{\max}(p) = \sqrt{Mp}. \end{equation}
The question now becomes what the parallel efficiency for the largest problem that can be stored on $ p $ nodes:\begin{equation} \begin{array}{r@{{}={}}l} E_p^{\mbox{1D-row}}(n_{\max}(p)) & \frac{1} {1 + \frac{p \log_2(p)}{2 (n_{\max}(p))^2} \frac{\alpha}{\gamma} + \frac{p}{2 n_{\max}(p)} \frac{\beta}{\gamma} } \\ & \frac{1} { 1 + \frac{\log_2(p)}{2 M} \frac{\alpha}{\gamma} + \frac{\sqrt{p}}{2 \sqrt{M}} \frac{\beta}{\gamma} }. \end{array} \end{equation}
Now, if one analyzes what happens when the number of nodes becomes large, one finds that\begin{equation} \lim_{p \rightarrow \infty} E_p( n_{\max}(p) ) = \lim_{p \rightarrow \infty} \left[ \frac{1} {1 + \frac{\log_2(p)}{2 M} \frac{\alpha}{\gamma} + \frac{\sqrt{p}}{2 \sqrt{M}} \frac{\beta}{\gamma} } \right] \sim \frac{1}{\sqrt p} \downarrow 0. \end{equation}
Thus, this parallel algorithm for matrix-vector multiplication does not scale either.If you take a close look at this expression for efficiency, you'll see that the main problem is the $1/\sqrt p$ part of the expression. This terms involves a factor $\beta$, and if you follow the derivation backward you see that it comes from the time to send data between the processors. Informally this can be described as saying that the message size is too large to make the problem scalable. In fact, the message size is constant $n$, regardless the number of processors, while for scalability it probably needs to go down.
Alternatively, a realist realizes that there is a limited amount of time, $ T_{\max} $, to get a computation done. Under the best of circumstances, that is, with zero communication overhead, the largest problem that we can solve in time $ T_{\max} $ is given by
\begin{equation} T_p(n_{\max}(p)) = 2 \frac{(n_{\max}(p))^2}{p} \gamma = T_{\max} . \end{equation}
Thus\begin{equation} (n_{\max}(p))^2 = \frac{T_{\max} p}{2 \gamma} \quad \mbox{or} \quad n_{\max}(p) = \frac{\sqrt{T_{\max}} \sqrt{p}}{\sqrt{2 \gamma}}. \end{equation}
Then the parallel efficiency that is attained by the algorithm for the largest problem that can be solved in time $ T_{\max} $ is given by\begin{equation} E_{p,n_{\max}}=\frac1 {1+\frac{\log_2p}T\alpha+\sqrt{\frac pT\frac \beta\gamma}} \end{equation}
and the parallel efficiency as the number of nodes becomes large approaches\begin{equation} \lim_{p\rightarrow\infty}E_p= \sqrt{\frac{T\gamma}{p\beta}}. \end{equation}
Again, efficiency cannot be maintained as the number of processors increases and the execution time is capped.We can also compute the iso-efficiency curve for this operation, that is, the relationship between $n,p$ for which the efficiency stays constant (see section 2.2.5.1 ). If we simplify the efficiency above as $E(n,p)=\frac{2\gamma}{\beta}\frac{n}{p}$, then $E\equiv c$ is equivalent to $n=O(p)$ and therefore
\begin{equation} M=O\left(\frac{n^2}{p}\right)=O(p). \end{equation}
Thus, in order to maintain efficiency we need to increase the memory per processor pretty quickly. This makes sense, since that downplays the importance of the communication.
crumb trail: > parallellinear > Parallel dense matrix-vector product > Scalability of the dense matrix-vector product > Matrix-vector product, partitioning by columns
Partition
\begin{equation} A \rightarrow \defrowvector{A}{p} \quad x \rightarrow \defcolvector{x}{p} , \quad \mbox{and} \quad y \rightarrow \defcolvector{y}{p} , \end{equation}
where $ A_j \in \bbR^{n \times n_j} $ and $ x_j, y_j \in \bbR^{n_j} $ with $ \sum_{j=0}^{p-1} n_j = n $ and $ n_j \approx n / p $.We will start by assuming that $ A_j $, $ x_j $, and $ y_j $ are originally assigned to $ {\cal P}_j $ (but now $ A_i $ is a block of columns). In this algorithm by columns, processor $i$ can compute the length $n$ vector $A_ix_i$ without prior communication. These partial results then have to be added together
\begin{equation} y\leftarrow \sum_i A_ix_i \end{equation}
in a reduce-scatter operation: each processor $i$ scatters a part $(A_ix_i)_j$ of its result to processor $j$. The receiving processors then perform a reduction, adding all these fragments:\begin{equation} y_j = \sum_i (A_ix_i)_j. \end{equation}
The algorithm with costs is then given by:
\begin{equation} \vcenter{\hskip\unitindent \setbox0=\hbox{Reduce-scatter the $ y^{(j)} $s so that $ y_i = \sum_{j=0}^{p-1} y_i^{(j)} $ is on $ {\cal P}_i $ } \dimen0=\wd0 \setbox1=\hbox{$ \lceil \log_2(p)\rceil \alpha + \frac{p-1}{p} n \beta + \frac{p-1}{p} n \gamma $ } \dimen1=\wd1 \begin{tabular}{ p{\dimen0} p{\dimen1} }\toprule Step & Cost (lower bound) \\ \midrule Locally compute $ y^{(j)} = A_j x_j $ & $ \approx 2 \frac{n^2}{p} \gamma $ \\ Reduce-scatter the $ y^{(j)} $s so that $ y_i = \sum_{j=0}^{p-1} y_i^{(j)} $ is on $ {\cal P}_i $ & $ \lceil \log_2(p)\rceil \alpha + \frac{p-1}{p} n \beta + \frac{p-1}{p} n \gamma $\\ & $ \approx \log_2(p) \alpha + n ( \beta + \gamma ) $ \\ \bottomrule \end{tabular} } \end{equation}
crumb trail: > parallellinear > Parallel dense matrix-vector product > Scalability of the dense matrix-vector product > Matrix-vector product, partitioning by columns > Cost analysis
The total cost of the algorithm is given by, approximately,
\begin{equation} T_p^{\mbox{1D-col}}(n) = 2 \frac{n^2}{p} \gamma + \begin{array}[t]{c} \underbrace{\log_2(p) \alpha + n ( \beta + \gamma ).} \\ \mbox{Overhead} \end{array} \end{equation}
Notice that this is identical to the cost $ T_p^{\mbox{1D-row}}(n) $, except with $ \beta $ replaced by $ (\beta + \gamma )$. It is not hard to see that the conclusions about scalability are the same.
crumb trail: > parallellinear > Parallel dense matrix-vector product > Scalability of the dense matrix-vector product > Two-dimensional partitioning
Clearly, a one-dimensional partitioning, whether by rows or columns, is a badly scaling solution. The choice left to us is to use a two-dimensional partitioning. Here we order the processes in a two-dimensional grid, whether that makes sense or not in a network sense, and give each process a submatrix that spans less than a full row or column. We will only consider square matrices, but we allow for the process grid not to be perfectly square.
\begin{equation} A \rightarrow \defmatrix{A}{r}{c} \quad x \rightarrow \defcolvector{x}{c} , \quad \mbox{and} \quad y \rightarrow \defcolvector{y}{r} , \end{equation}
where $ A_{ij} \in \bbR^{n_i \times n_j} $ and $ x_i, y_i \in \bbR^{n_i} $ with $ \sum_{i=0}^{p-1} n_i = N $ and $ n_i \approx N / \sqrt P $.We will view the processes as an $ r \times c $ mesh, with $ P = r c $, and index them as $p_{ij}$, with $ i=0, \ldots, r-1 $ and $ j = 0,\ldots, c-1 $. Figure 6.2 , for a $12\times12$ matrix on a $3\times4$ processor grid, illustrates the assignment of data to nodes, where the $ i,j$ `cell' shows the matrix and vector elements owned by $ p_{ij} $.
{\footnotesize \begin{equation} \begin{array}{| c | c | c | c |} \hline % 0,0 \begin{array}{c c c c} x_0\\ a_{00} & a_{01} &a_{02} & y_0\\ a_{10} & a_{11} &a_{12} & \\ a_{20} & a_{21} &a_{22} & \\ a_{30} & a_{31} &a_{32} & \\ \end{array} & % 0,1 \begin{array}{c c c c} x_3\\ a_{03} & a_{04} &a_{05} & \\ a_{13} & a_{14} &a_{15} & y_1\\ a_{23} & a_{24} &a_{25} & \\ a_{33} & a_{34} &a_{35} & \\ \end{array} & % 0,2 \begin{array}{c c c c} x_6\\ a_{06} & a_{07} &a_{08} & \\ a_{16} & a_{17} &a_{18} & \\ a_{26} & a_{27} &a_{28} & y_2 \\ a_{37} & a_{37} &a_{38} & \\ \end{array} & % 0,3 \begin{array}{c c c c} x_9\\ a_{09} & a_{0,10} & a_{0,11} & \\ a_{19} & a_{1,10} & a_{1,11} & \\ a_{29} & a_{2,10} & a_{2,11} & \\ a_{39} & a_{3,10} & a_{3,11} & y_3 \\ \end{array} \\ \hline % 1,0 \begin{array}{c c c c} & x_1\\ a_{40} & a_{41} &a_{42} & y_4\\ a_{50} & a_{51} &a_{52} & \\ a_{60} & a_{61} &a_{62} & \\ a_{70} & a_{71} &a_{72} & \\ \end{array} & % 1,1 \begin{array}{c c c c} & x_4\\ a_{43} & a_{44} &a_{45} & \\ a_{53} & a_{54} &a_{55} & y_5\\ a_{63} & a_{64} &a_{65} & \\ a_{73} & a_{74} &a_{75} & \\ \end{array} & % 1,2 \begin{array}{c c c c} & x_7\\ a_{46} & a_{47} &a_{48} & \\ a_{56} & a_{57} &a_{58} & \\ a_{66} & a_{67} &a_{68} & y_6 \\ a_{77} & a_{77} &a_{78} & \\ \end{array} & % 1,3 \begin{array}{c c c c} & x_{10}\\ a_{49} & a_{4,10} & a_{4,11} & \\ a_{59} & a_{5,10} & a_{5,11} & \\ a_{69} & a_{6,10} & a_{6,11} & \\ a_{79} & a_{7,10} & a_{7,11} & y_7 \\ \end{array} \\ \hline % 0,0 \begin{array}{c c c c} &&x_2\\ a_{80} & a_{81} & a_{82} & y_8\\ a_{90} & a_{91} &a_{92} & \\ a_{10,0} &a_{10,1} &a_{10,2} & \\ a_{11,0} &a_{11,1} &a_{11,2} & \\ \end{array} & % 0,1 \begin{array}{c c c c} &&x_5\\ a_{83} & a_{84} & a_{85} & \\ a_{93} & a_{94} & a_{95} & y_9\\ a_{10,3} & a_{10,4} &a_{10,5} & \\ a_{11,3} & a_{11,4} &a_{11,5} & \\ \end{array} & % 0,2 \begin{array}{c c c c} &&x_8\\ a_{86} & a_{87} & a_{88} & \\ a_{96} & a_{97} & a_{98} & \\ a_{10,6} & a_{10,7} &a_{10,8} & y_{10} \\ a_{11,7} & a_{11,7} &a_{11,8} & \\ \end{array} & % 0,3 \begin{array}{c c c c} &&x_{11}\\ a_{89} & a_{8,10} & a_{8,11} & \\ a_{99} & a_{9,10} & a_{9,11} & \\ a_{10,9} & a_{10,10} & a_{10,11} & \\ a_{11,9} & a_{11,10} & a_{11,11} & y_{11} \\ \end{array} \\ \hline \end{array} \end{equation}
}FIGURE 6.2: Distribution of matrix and vector elements for a problem of size 12 on a $4\times 3$ processor grid.
In other words, $ p_{ij} $ owns the matrix block $A_{ij}$ and parts of $x$ and $y$. This makes possible the following algorithm
(Note: {This figure shows a partitioning of the matrix into contiguous blocks, and the vector distribution seem to be what is necessary to work with this matrix distribution. You could also look at this story the other way: start with a distribution of input and output vector, and then decide what that implies for the matrix distribution. For instance, if you distributed $x$ and $y$ the same way, you would arrive at a different matrix distribution, but otherwise the product algorithm would be much the same; see [Flame:PBMD-report] .} )
:
Since $x_j$ is distributed over the $j$th column, the algorithm starts by collecting $x_j$ on each processor $p_{ij}$ by an
allgather inside the processor columns.
Each processor $p_{ij}$ then computes $y_{ij} = A_{ij}x_j$. This involves no further communication.
The result $y_i$ is then collected by gathering together the pieces $y_{ij}$ in each processor row to form $y_i$, and this is then distributed over the processor row. These two operations are in fact combined to form a reduce-scatter .
If $r=c$, we can transpose the $y$ data over the processors, so that it can function as the input for a subsequent matrix-vector product. If, on the other hand, we are computing $A^tAx$, then $y$ is now correctly distributed for the $A^t$ product.
crumb trail: > parallellinear > Parallel dense matrix-vector product > Scalability of the dense matrix-vector product > Two-dimensional partitioning > Algorithm
The algorithm with cost analysis is
\begin{equation} \vcenter{\hskip\unitindent \setbox0=\hbox{Perform local matrix-vector multiply } \dimen0=\wd0 \setbox1=\hbox{$ \lceil \log_2(c)\rceil \alpha + \frac{c-1}{p} n \beta + \frac{c-1}{p} n \gamma $ } \dimen1=\wd1 \begin{tabular}{ p{\dimen0} p{\dimen1} }\toprule Step & Cost (lower bound) \\ \midrule Allgather $ x_i $'s within columns & $ \lceil \log_2(r)\rceil \alpha + \frac{r-1}{p} n \beta$\\ & $ \approx \log_2(r) \alpha + \frac{n}{c} \beta $ \\ Perform local matrix-vector multiply & $ \approx 2 \frac{n^2}{p} \gamma $ \\ Reduce-scatter $ y_i $'s within rows & $ \lceil \log_2(c)\rceil \alpha + \frac{c-1}{p} n \beta + \frac{c-1}{p} n \gamma $\\ & $ \approx \log_2(c) \alpha + \frac{n}{r} \beta + \frac{n}{r} \gamma $ \\ \bottomrule \end{tabular} } \end{equation}
crumb trail: > parallellinear > Parallel dense matrix-vector product > Scalability of the dense matrix-vector product > Two-dimensional partitioning > Cost analysis
The total cost of the algorithm is given by, approximately,
\begin{equation} T_p^{r \times c}(n) = T_p^{r \times c}( n) = 2 \frac{n^2}{p} \gamma + \begin{array}[t]{c} \underbrace{\log_2(p) \alpha + \left( \frac{n}{c} + \frac{n}{r} \right) \beta + \frac{n}{r} \gamma.} \\ \mbox{Overhead} \end{array} \end{equation}
We will now make the simplification that $ r = c = \sqrt{p} $ so that\begin{equation} T_p^{\sqrt{p} \times \sqrt{p}}(n) = T_p^{\sqrt{p} \times \sqrt{p}}( n) = 2 \frac{n^2}{p} \gamma + \begin{array}[t]{c} \underbrace{\log_2(p) \alpha + \frac{n}{\sqrt{p}}\left( 2 \beta+ \gamma \right) } \\ \mbox{Overhead} \end{array} \end{equation}
Since the sequential cost is $ T_1(n) = 2 n^2 \gamma $, the speedup is given by
\begin{equation} S_p^{\sqrt{p} \times \sqrt{p}}(n) = \frac{T_1(n)} {T_p^{\sqrt{p} \times \sqrt{p}}(n)} = \frac{2 n^2 \gamma} { 2 \frac{n^2}{p} \gamma + \frac{n}{\sqrt{p}} \left( 2 \beta + \gamma \right)} = \frac{p} { 1 + \frac{p \log_2(p)}{2 n^2} \frac{\alpha}{\gamma} + \frac{\sqrt{p}}{2n}\frac{ \left( 2 \beta + \gamma \right)}{\gamma}} \end{equation}
and the parallel efficiency by\begin{equation} E_p^{\sqrt{p} \times \sqrt{p}}(n) = \frac{1} { 1 + \frac{p \log_2(p)}{2 n^2} \frac{\alpha}{\gamma} + \frac{\sqrt{p}}{2n}\frac{ \left( 2 \beta + \gamma \right)}{\gamma}} \end{equation}
We again ask the question what the parallel
efficiency for the largest problem that can be stored on $ p $ nodes is.
\begin{eqnarray*}
E_p^{\sqrt{p} \times \sqrt{p}}(n_{\max}(p)) &=&
\frac{1}
{ 1 + \frac{p \log_2(p)}{2 n^2} \frac{\alpha}{\gamma}
+ \frac{\sqrt{p}}{2n}\frac{
\left( 2 \beta + \gamma \right)}{\gamma}}
&=&
\frac{1}
{ 1 + \frac{\log_2(p)}{2 M} \frac{\alpha}{\gamma}
+ \frac{1}{2\sqrt{M}}\frac{
\left( 2 \beta + \gamma \right)}{\gamma}}
\end{eqnarray*}
so that still
\begin{eqnarray*}
\lim_{p \rightarrow \infty}
E_p^{\sqrt{p} \times \sqrt{p}}(n_{\max}(p)) &=&
\lim_{p \rightarrow \infty}
\frac{1}
{ 1 + \frac{\log_2(p)}{2 M} \frac{\alpha}{\gamma}
+ \frac{1}{2\sqrt{M}}\frac{
\left( 2 \beta + \gamma \right)}{\gamma}}
=
0.
\end{eqnarray*}
However, $ \log_2{p} $ grows very slowly with $ p $ and is therefore
considered to act much like a constant. In this case
$ E_p^{\sqrt{p}\times \sqrt{p}}( n_{\rm max}(p) ) $ decreases very slowly and the algorithm is considered to be scalable for practical purposes.
Note that when $ r = p $ the 2D algorithm becomes the "partitioned by rows" algorithm and when $ c = p $ it becomes the "partitioned by columns" algorithm. It is not hard to show that the 2D algorithm is scalable in the sense of the above analysis as long as $r/c$ is kept approximately constant.
Exercise
Compute the iso-efficiency curve for this operation.
End of exercise
crumb trail: > parallellinear > LU factorization in parallel
The matrix-vector and matrix-product are easy to parallelize in one sense. The elements of the output can all be computed independently and in any order, so we have many degrees of freedom in parallelizing the algorithm. This does not hold for computing an LU factorization, or solving a linear system with the factored matrix.
crumb trail: > parallellinear > LU factorization in parallel > Solving a triangular system
The solution of a triangular system $y=L\inv x$ (with $L$ is lower triangular) is a matrix-vector operation, so it has its $O(N^2)$ complexity in common with the matrix-vector product. However, unlike the product operation, this solution process contains a recurrence relation between the output elements:
\begin{equation} y_{i} = \ell_{ii}\inv ( x_i-\sum_{j<i} \ell_{ij}x_j ). \end{equation}
This means that parallelization is not trivial. In the case of a sparse matrix special strategies may be possible; see section 6.9 . Here we will make a few remarks about general, dense case.Let us assume for simplicity that communication takes no time, and that all arithmetic operations take the same unit time. First we consider the matrix distribution by rows, meaning that processor $p$ stores the elements $\ell_{p*}$. With this we can implement the triangular solution as:
Processor 1 solves $y_1=\ell_{11}\inv x_1$ and sends its value to the next processor.
In general, processor $p$ gets the values $y_1,\ldots,y_{p-1}$ from processor $p-\nobreak1$, and computes $y_p$;
Each processor $p$ then sends $y_1,\ldots,y_p$ to $p+1$.
Exercise
Show that this algorithm takes time $2N^2$, just like the sequential algorithm.
End of exercise
This algorithm has each processor passing all computed $y_i$ values to its successor, in a pipeline fashion. However, this means that processor $p$ receives $y_1$ only at the last moment, whereas that value was computed already in the first step. We can formulate the solution algorithm in such a way that computed elements are made available as soon as possible:
Processor 1 solve $y_1$, and sends it to all later processors.
In general, processor $p$ waits for individual messages with values $y_q$ for $q
Processor $p$ then computes $y_p$ and sends it to processors $q$ with $q>p$.
Under the assumption that communication time is negligible, this algorithm can be much faster. For instance, all processors $p>1$ receive $y_1$ simultaneously, and can compute $\ell_{p1}y_1$ simultaneously.
Exercise
Show that this algorithm variant takes a time $O(N)$,
if we ignore communication.
What is the cost if we incorporate communication in the cost?
End of exercise
Exercise
Now consider the matrix distribution by columns: processor $p$ stores $\ell_{*p}$.
Outline the triangular solution algorithm with this distribution, and show that the
parallel solve time is $O(N)$.
End of exercise
crumb trail: > parallellinear > LU factorization in parallel > Factorization, dense case
A full analysis of the scalability of parallel dense LU factorization is quite involved, so we will state without further proof that as in the matrix-vector case a two-dimensional distribution is needed. However, we can identify a further complication. Since factorizations of any type
(Note: {Gaussian elimination can be performed in right-looking, left-looking and Crout variants; see [TSoPMC] .} )
progress through a matrix, processors will be inactive for part of the time.
Exercise Consider the regular right-looking Gaussian elimination
for k=1..n p = 1/a(k,k) for i=k+1,n for j=k+1,n a(i,j) = a(i,j)-a(i,k)*p*a(k,j)Analyze the running time, speedup, and efficiency as a function of $N$, if we assume a one-dimensional distribution, and enough processors to store one column per processor. Show that speedup is limited.
Also perform this analysis for a two-dimensional decomposition where
each processor stores one element.
End of exercise
For this reason, an overdecomposition is used, where the matrix is divided in more blocks than there are processors, and each processor stores several, non-contiguous, sub-matrices.
\caption{One-dimensional cyclic distribution: assignment of four matrix columns to two processors, and the resulting mapping of storage to matrix columns.}
We illustrate this in figure 6.3.2 where we divide four block columns of a matrix to two processors: each processor stores in a contiguous block of memory two non-contiguous matrix columns.
Next, we illustrate in figure 6.3
FIGURE 6.3: Matrix-vector multiplication with a cyclicly distributed matrix.
that a matrix-vector product with such a matrix can be performed without knowing that the processors store non-contiguous parts of the matrix. All that is needed is that the input vector is also cyclicly distributed.
Exercise
Now consider a $4\times4$ matrix and a $2\times2$ processor
grid. Distribute the matrix cyclicly both by rows and columns. Show
how the matrix-vector product can again be performed using the
contiguous matrix storage, as long as the input is distributed
correctly. How is the output distributed? Show that more
communication is needed than the reduction of the one-dimensional
example.
End of exercise
Specifically,
with $P
Exercise Consider a square $n\times n$ matrix, and a square $p\times p$ processor grid, where $p$ divides $n$ without remainder. Consider the overdecomposition outlined above, and make a sketch of matrix element assignment for the specific case $n=6,p=2$. That is, draw an $n\times n$ table where location $(i,j)$ contains the processor number that stores the corresponding matrix element. Also make a table for each of the processors describing the local to global mapping, that is, giving the global $(i,j)$ coordinates of the elements in the local matrix. (You will find this task facilitated by using zero-based numbering.)
Now write functions $P,Q,I,J$ of $i,j$ that describe the global to
local mapping, that is, matrix element $a_{ij}$ is stored in
location $(I(i,j),J(i,j))$ on processor $(P(i,j),Q(i,j))$.
End of exercise
crumb trail: > parallellinear > LU factorization in parallel > Factorization, sparse case
Sparse matrix LU factorization in parallel difficult topic. Any type of factorization involves a sequential component, making it non-trivial to begin with. To see the problem with the sparse case in particular, suppose you process the matrix rows sequentially. The dense case has enough elements per row to derive parallelism there, but in the sparse case that number may be very low.
The way out of this problem is to realize that we are not interested in the factorization as such, but rather that we can use it to solve a linear system. Since permuting the matrix gives the same solution, maybe itself permuted, we can explore permutations that have a higher degree of parallelism.
The topic of matrix orderings already came up in section 5.4.3.5 , motivated by fill-in reduction. We will consider orderings with favorable parallelism properties below: nested dissection in section 6.8.1 , and multi-color orderings in section 6.8.2 .
crumb trail: > parallellinear > Matrix-matrix product
In this section we consider both the sequential and parallel matrix-matrix product, both of which offer some computational challenges.
The matrix-matrix product $C\leftarrow A\cdot B$ (or $C\leftarrow A\cdot B+\gamma C$, as it is used in the BLAS ) has a simple parallel structure. Assuming all square matrices of size $N\times N$
the $N^3$ products $a_{ik}b_{kj}$ can be computed independently, after which
the $N^2$ elements $c_{ij}$ are formed by independent sum reductions, each of which take a time $\log_2 N$.
With $N^3$ operations on $O(N^2)$ elements, there is considerable opportunity for
data reuse %
(section 1.6.1 ). We will explore the `Goto algorithm' in section 6.4.1 .
Depending on the way the matrices are traversed, TLB reuse also needs to be considered (section 1.3.8.2 ).
The distributed memory version of the matrix-matrix product is especially tricky. Assuming $A,B,C$ are all distributed over a processor grid, elements of $A$ have to travel through a processor row, and elements of $B$ through a processor columns. We discuss `Cannon's algorithm' and the outer-product method below.
crumb trail: > parallellinear > Matrix-matrix product > Goto matrix-matrix product
In section~ 1.6.1 we argued that the matrix matrix product (or dgemm in BLAS terms) has a large amount of possible data reuse: there are $O(n^3)$ operations on $O(n^2)$ data. We will now consider an implementation, due to Kazushige Goto ~ [GotoGeijn:2008:Anatomy] , that indeed achieves close to peak performance. The matrix-matrix algorithm has three loops, each of which we can block, giving a six-way nested loop. Since there are no recurrences on the output elements, all resulting loop exchanges are legal. Combine this with the fact that the loop blocking introduces three blocking parameters, and you'll see that the number of potential implementations is enormous. Here we present the global reasoning that underlies the Goto implementation; for a detailed discussion see the paper cited. We start by writing the product $C\leftarrow A\cdot B$ (or $C\leftarrow C+AB$ according to the Blas standard) as a sequence of low-rank updates: \begin{equation} C_{**} = \sum_k A_{*k}B_{k*} \end{equation}
FIGURE 6.4: Matrix-matrix multiplication as a sequence of low-rank updates.
See figure 6.4 . Next we derive the `block-panel' multiplication by multiplying a block of $A$ by a `sliver' of $B$; see figure 6.5 .
FIGURE 6.5: The block-panel multiplication in the matrix-matrix algorithm
Finally, the inner algorithm accumulates a small row $C_{i,*}$, typically of small size such as 4, by accumulating:
// compute C[i,*] : for k: C[i,*] = A[i,k] * B[k,*]See figure 6.6 .
FIGURE 6.6: The register-resident kernel of the matrix-matrix multiply
Now this algorithm is tuned.
We need enough registers for C[i,*] , A[i,k] and
B[k,*] . On current processors that means that we accumulate four elements of $C$.
Those elements of $C$ are accumulated, so they stay in register and the only data transfer is loading of the elements of $A$ and $B$; there are no stores!
The elements A[i,k] and B[k,*] stream from L1.
Since the same block of $A$ is used for many successive slivers of $B$, we want it to stay resident; we choose the blocksize of $A$ to let it stay in L2 cache.
In order to prevent TLB problems, $A$ is stored by rows. If we start out with a matrix in (Fortran) column-major storage, this means we have to make a copy. Since copying is of a lower order of complexity, this cost is amortizable.
This algorithm be implemented in specialized hardware, as in the Google TPU [GoogleTPUpage] , giving great energy efficiency.
crumb trail: > parallellinear > Matrix-matrix product > Cannon's algorithm for the distributed memory matrix-matrix product
{r}{4in}
WRAPFIGURE 6.7: Cannon's algorithm for matrix-matrix multiplication: (a) initial rotation of matrix rows and columns, (b)~resulting position in which processor~$(i,j)$ can start accumulating~$\sum_kA_{ik}B_{kj}$, (c)~subsequent rotation of $A,B$ for the next term.
In section 6.4.1 we considered the high performance implementation of the single-processor \indexterm{matrix-matrix product}. We will now briefly consider the distributed memory version of this operation. (It is maybe interesting to note that this is not a generalization based on the matrix-vector product algorithm of section 6.2 .)
One algorithm for this operation is known as Cannon's algorithm It assumes a square processor grid where processor $(i,j)$ gradually accumulates the $(i,j)$ block $C_{i,j}=\sum_kA_{i,k}B_{k,j}$; see figure 6.7 . (This is intrinsically a block matrix operation in the sense of section 5.3.7 . Because of the distributed nature of the implementation, here we actually store blocks, rather than considering them only conceptually.)
If you start top-left, you see that processor $(0,0)$ has both $A_{00}$ and $B_{00}$, so it can immediately start doing a local multiplication. Processor $(0,1)$ has $A_{01}$ and $B_{01}$, which are not needed together, but if we rotate the second column of $B$ up by one position, processor $(0,1)$ will have $A_{01},B_{11}$ and those two do need to be multiplied. Similarly, we rotate the third column of $B$ up by two places so that $(0,2)$ contains $A_{02},B_{22}$.
How does this story go in the second row? Processor $(1,0)$ has $A_{10},B_{10}$ which are not needed together. If we rotate the second row of $A$ one position to the left, it contains $A_{11},B_{10}$ which are needed for a partial product. And now processor $(1,1)$ has $A_{11},B_{11}$.
If we continue this story, we start with a matrix $A$ of which the rows have been rotated left, and $B$ of which the columns have been rotated up. In this setup, processor $(i,j)$ contains $A_{i,i+j}$ and $B_{i+j,j}$, where the addition is done modulo the matrix size. This means that each processor can start by doing a local product.
Now we observe that $C_{ij}=\sum_kA_{ik}B_{kj}$ implies that the next partial product term comes from increasing $k$ by $1$. The corresponding elements of $A,B$ can be moved into the processor by rotating the rows and columns by another location.
\Level 1 {The outer-product method for the distributed matrix-matrix product}
Cannon's algorithm suffered from the requirement of needing a square processors grid. The outer-product method %
is more general. It is based the following arrangement of the matrix-matrix product calculation:
for ( k ) for ( i ) for ( j ) c[i,j] += a[i,k] * b[k,j]That is, for each $k$ we take a column of $A$ and a row of $B$, and computing the rank-1 matrix (or `outer product') $A_{*,k}\cdot B_{k,*}$. We then sum over $k$.
Looking at the structure of this algorithm, we notice that in step $k$, each column $j$ receives $A_{*,k}$, and each row $i$ receives $B_{k,*}$. In other words, elements of $A_{*,k}$ are broadcast through their row, and elements of $B_{k,*}$ are broadcast through their row.
Using the MPI library, these simultaneous broadcasts are realized by having a subcommunicator for each row and each column.
crumb trail: > parallellinear > Sparse matrix-vector product
In linear system solving through iterative methods (see section 5.5 ) the matrix-vector product is computationally an important kernel, since it is executed in each of potentially hundreds of iterations. In this section we look at performance aspects of the matrix-vector product on a single processor first; the multi-processor case will get our attention in section 6.5.3 .
crumb trail: > parallellinear > Sparse matrix-vector product > The single-processor sparse matrix-vector product
We are not much worried about the dense matrix-vector product in the context of iterative methods, since one typically does not iterate on dense matrices. In the case that we are dealing with block matrices, refer to section 1.7.13 for an analysis of the dense product. The sparse product is a lot trickier, since most of that analysis does not apply.
Data reuse in the sparse matrix-vector product
There are some similarities between the dense matrix-vector product, executed by rows, and the CRS sparse product (section 5.4.1.4 ). In both cases all matrix elements are used sequentially, so any cache line loaded is utilized fully. However, the CRS product is worse at least the following ways:
The indirect addressing requires loading the elements of an integer vector. This implies that the sparse product has more memory traffic for the same number of operations.
The elements of the source vector are not loaded sequentially, in fact they can be loaded in effectively a random order. This means that a cacheline contains a source element will likely not be fully utilized. Also, the prefetch logic of the memory subsystem (section 1.3.5 ) cannot assist here.
It may be possible to improve this performance if the structure of the matrix is regular in some sense. One such case is where we are dealing with a block matrix consisting completely of small dense blocks. This leads at least to a reduction in the amount of indexing information: if the matrix consists of $2\times2$ blocks we get a $4\times$ reduction in the amount of integer data transferred.
Exercise
Give two more reasons why this strategy is likely to improve
performance. Hint: cachelines, and reuse.
End of exercise
Such a matrix tessellation may give a factor of 2 in performance improvement. Assuming such an improvement, we may adopt this strategy even if the matrix is not a perfect block matrix: if every $2\times2$ block will contain one zero element we may still get a factor of $1.5$ performance improvement [vuduc:thesis,ButtEijkLang:spmvp] .
Vectorization in the sparse product
In other circumstances bandwidth and reuse are not the dominant concerns:
On old vector computers, such as old Cray machines, memory was fast enough for the processor, but vectorization was paramount. This is a problem for sparse matrices, since the number of zeros in a matrix row, and therefore the vector length, is typically low.
On GPUs memory bandwidth is fairly high, but it is necessary to find large numbers of identical operations. Matrix can be treated independently, but since the rows are likely of unequal length this is not an appropriate source of parallelism.
A matrix with such a structure is good for vector architectures [DAzevedo2005:vector-mvp] . In this case the product is computed by diagonals.
Exercise
Write pseudo-code for this case. How did the sorting of the rows
improve the situation?
End of exercise
This sorted storage scheme also solves the problem we noted on GPUs [Bolz:GPUsparse] . In this case we the traditional CRS product algorithm, and we have an amount of parallelism equal to the number of rows in a block.
Of course there is the complication that we have permuted the matrix: the input and output vectors will need to be permuted accordingly. If the product operation is part of an iterative method, doing this permutation back and forth in each iteration will probably negate any performance gain. Instead we could permute the whole linear system and iterate on the permuted system.
Exercise
Can you think of reasons why this would work? Reasons why it wouldn't?
End of exercise
crumb trail: > parallellinear > Sparse matrix-vector product > The parallel sparse matrix-vector product
In section~ 5.4 you saw a first discussion of sparse matrices, limited to use on a single processor. We will now go into parallelism aspects. The dense matrix-vector product, as you saw above, required each processor to communicate with every other, and to have a local buffer of essentially the size of the global vector. In the sparse case, considerably less buffer space, as well as less communication, is needed. Let us analyze this case. We will assume that the matrix is distributed by block rows, where processor $p$ owns the matrix rows with indices in some set~$I_p$. The line $y_i=y_i+a_{ij}x_j$, which in the dense case is executed for all~$j$, and therefore induces an allgather, now has to take into account that $a_{ij}$ can be zero, so for some pairs $i\in I_p, j\not\in I_p$ no communication will be needed. Declaring for each $i\in I_p$ a~sparsity pattern set \begin{equation} S_{p;i}=\{j\colon j\not\in I_p, a_{ij}\not=0\} \end{equation}
our multiplication instruction becomes\begin{equation} y_i\mathop{+=}a_{ij}x_j\qquad\hbox{if $j\in S_{p;i}$}. \end{equation}
If we want to avoid, as above, a flood of small messages, we combine all communication into a single message per processor. Defining\begin{equation} S_p = \cup_{i\in I_p} S_{p;i}, \end{equation}
the algorithm now becomes:
Collect all necessary off-processor elements $x_j$ with $j\in S_p$ into one buffer;
Perform the matrix-vector product, reading all elements of $x$ from local storage.
This whole analysis of course also applies to dense matrices. This becomes different if we consider where sparse matrices come from. Let us start with a simple case.
Recall figure 4.2.3 , which illustrated a discretized boundary value problem on the simplest domain, a square, and let us now parallelize it. We do this by partitioning the domain; each processor gets the matrix rows corresponding to its subdomain. Figure 6.5.2
\caption{A difference stencil applied to a two-dimensional square domain, distributed over processors. A cross-processor connection is indicated..}
shows how this gives rise to connections between processors: the elements $a_{ij}$ with $i\in I_p,j\not\in I_p$ are now the `legs' of the stencil that reach beyond a processor boundary. The set of all such $j$, formally defined as
\begin{equation} G = \{ j\not\in I_p \colon \exists_{i\in I_p}\colon a_{ij}\not=0 \} \end{equation}
is known as the ghost region of a processor; see figure 6.8 .
FIGURE 6.8: The ghost region of a processor, induced by a stencil.
Exercise
Show that a one-dimensional partitioning of the domain leads to a
partitioning of the matrix into block rows, but a two-dimensional
partitioning of the domain does not. You can do this in the
abstract, or you can illustrate it: take a $4\times4$ domain (giving
a matrix of size 16), and partition it over 4 processors. The
one-dimensional domain partitioning corresponds to giving each
processor one line out of the domain, while the two-dimensional
partitioning gives each processor a $2\times2$ subdomain. Draw the
matrices for these two cases.
End of exercise
FIGURE 6.9: Band matrix with halfbandwidth $\sqrt N$.
Exercise
Figure 6.9 depicts a sparse matrix of size $N$ with a halfbandwidth $n=\sqrt N$. That is,
\begin{equation} |i-j|>n \Rightarrow a_{ij}=0. \end{equation}
We make a one-dimensional distribution of this matrix over $p$ processors, where $p=n=\sqrt N$.Show that the matrix-vector product using this scheme is weakly scalable by computing the efficiency as function of the number of processors
\begin{equation} E_p = \frac{T_1}{pT_p}. \end{equation}
Why is this scheme not really weak scaling, as it is commonly defined?
End of exercise
One crucial observation about the parallel sparse matrix-vector product is that, for each processor, the number of other processors it is involved with is strictly limited. This has implications for the efficiency of the operation.
In the case of the dense matrix-vector product (section 6.2.2 ), partitioning the matrix over the processors by (block) rows did not lead to a scalable algorithm. Part of the reason was the increase in the number of neighbors that each processors needs to communicate with. Figure 6.5.2 shows that, for the matrix of a 5-five point stencil, this number is limited to four.
Exercise
Take a square domain and a partitioning of the variables of the
processors as in figure
6.5.2
.
What is the maximum number of neighbors a
processor needs to communication with for the box stencil in
figure
4.2
? In three space dimensions, what is the
number of neighbors if a 7-point central difference stencil is
used?
End of exercise
The observation that each processor communicates with just a few neighbors stays intact if we go beyond square domains to more complicated physical objects. If a processor receives a more or less contiguous subdomain, the number of its neighbors will be limited. This implies that even in complicated problems each processor will only communicate with a small number of other processors. Compare this to the dense case where each processor had to receive data from every far more friendly to the interconnection network. (The fact that it also more common for large systems may influence the choice of network to install if you are about to buy a new parallel computer.)
crumb trail: > parallellinear > Sparse matrix-vector product > Parallel efficiency of the sparse matrix-vector product
For square domains we can make the above argument formal. Let the unit domain $[0,1]^2$ be partitioned over $P$ processors in a $\sqrt P\times \sqrt P$ grid. From figure 6.5.2 we see that every processor communicates with at most four neighbors. Let the amount of work per processor be $w$ and the communication time with each neighbor $c$. Then the time to perform the total work on a single processor is $T_1=Pw$, and the parallel time is $T_P=w+4c$, giving a speed up of
\begin{equation} S_P=Pw/(w+4c)=P/(1+4c/w)\approx P(1-4c/w). \end{equation}
Exercise
Express $c$ and $w$ as functions of $N$ and $P$, and show that the
speedup is asymptotically optimal, under weak scaling of the problem.
End of exercise
Exercise
In this exercise you will analyze the parallel sparse matrix-vector product for a hypothetical, but realistic, parallel machine. Let the machine parameters be characterized by (see section 1.3.2 ):
Network Latency: $\alpha=1\mu s=10^{-6}s$.
Network Bandwidth: $1Gb/s$ corresponds to $\beta=10^{-9}$.
Computation rate: A per-core flops rate of $1G$flops means $\gamma=10^9$. This number may seem low, but note that the matrix-vector product has less reuse than the matrix-matrix product, which can achieve close to peak performance, and that the sparse matrix-vector product is even more bandwidth-bound.
Case 1. Rather than dividing the matrix, we divide the domain, and we do this first by horizontal slabs of size $n\times (n/p)$. Argue that the communication complexity is $2(\alpha+n\beta)$ and computation complexity is $10\cdot n\cdot (n/p)$. Show that the resulting computation outweighs the communication by a factor 250.
Case 2. We divide the domain into patches of size $(n/\sqrt p)\times (n/\sqrt p)$. The memory and computation time are the same as before. Derive the communication time and show that it is better by a factor of 50.
Argue that the first case does not weakly scale: under the
assumption that $N/p$ is constant the efficiency will go down.
(Show that speedup still goes up asymptotically as $\sqrt p$.)
Argue that the second case does scale weakly.
End of exercise
The argument that a processor will only connect with a few neighbors is based on the nature of the scientific computations. It is true for FDM and FEM methods. In the case of the BEM , any subdomain needs to communicate with everything in a radius $r$ around it. As the number of processors goes up, the number of neighbors per processor will also go up.
Exercise
Give a formal analysis of the speedup and efficiency of the
BEM
algorithm. Assume again a unit amount of work $w$ per processor and
a time of communication $c$ per neighbor. Since the notion of
neighbor is now based on physical distance, not on graph
properties, the number of neighbors will go up. Give
$T_1,T_p,S_p,E_p$ for this case.
End of exercise
There are also cases where a sparse matrix needs to be handled similarly to a dense matrix. For instance, Google 's PageRank algorithm (see section 9.5 ) has at its heart the repeated operation $x\leftarrow Ax$ where $A$ is a sparse matrix with $A_{ij}\not=0$ if web page $j$ links to page $i$; see section 9.5 . This makes $A$ a very sparse matrix, with no obvious structure, so every processor will most likely communicate with almost every other.
crumb trail: > parallellinear > Sparse matrix-vector product > Memory behavior of the sparse matrix-vector product
In section 1.7.13 you saw an analysis of the sparse matrix-vector product in the dense case, on a single processor. Some of the analysis carries over immediately to the sparse case, such as the fact that each matrix element is used only once and that the performance is bound by the bandwidth between processor and memory.
Regarding reuse of the input and the output vector, if the matrix is stored by rows, such as in CRS format (section 5.4.1.3 ), access to the output vector will be limited to one write per matrix row. On the other hand, the loop unrolling trick for getting reuse of the input vector can not be applied here. Code that combines two iterations is as follows:
for (i=0; iThe problem here is that if $a_{ij}$ is nonzero, it is not guaranteed that $a_{i+1,j}$ is nonzero. The irregularity of the sparsity pattern makes optimizing the matrix-vector product hard. Modest improvements are possible by identifying parts of the matrix that are small dense blocks [ButtEijkLang:spmvp,DemEtAl:ieeeproc2004,oski] .
On a GPU the sparse matrix-vector product is also limited by memory bandwidth. Programming is now harder because the GPU has to work in data parallel mode, with many active threads.
An interesting optimization becomes possible if we consider the context in which the sparse matrix-vector product typically appears. The most common use of this operation is in iterative solution methods for linear systems (section 5.5 ), where it is applied with the same matrix in possibly hundreds of iterations. Thus we could consider leaving the matrix stored on the GPU and only copying the input and output vectors for each product operation.
crumb trail: > parallellinear > Sparse matrix-vector product > The transpose product
In section 5.4.1.3 you saw that the code for both the regular and the transpose matrix-vector product are limited to loop orderings where rows of the matrix are traversed. (In section 1.6.2 you saw a discussion of computational effects of changes in loop order; in this case we are limited to row traversal by the storage format.)
In this section we will briefly look at the parallel transpose product. Equivalently to partitioning the matrix by rows and performing the transpose product, we look at a matrix stored and partitioned by columns and perform the regular product.
The algorithm for the product by columns can be given as:
$y\leftarrow 0$
for $j$:
for $i$:
$y_i\leftarrow y_i+a_{ij}x_j$
Both in shared and distributed memory we distribute outer iterations over processors. The problem is then that each outer iteration updates the whole output vector. This is a problem: with shared memory it leads to multiple writes to locations in the output and in distributed memory it requires communication that is as yet unclear.
One way to solve this would be to allocate a private output vector $y^{(p)}$ for each process:
$y^{(p)}\leftarrow 0$
for $j\in$ the rows of processor $p$
for all $i$:
$y^{(p)}_i\leftarrow y^{(p)}_i+a_{ji}x_j$
after which we sum $y\leftarrow\sum_py^{(p)}$.
crumb trail: > parallellinear > Sparse matrix-vector product > Setup of the sparse matrix-vector product
Above we defined sets of processes $S_p$ that contain data to be sent to process $p$. It is easy enough for process $p$ to construct this set from its part of the sparse matrix.
Exercise Assume that the matrix is divided over the processors by block rows; see figure 6.1 for an illustration. Also assume that each processor knows which rows each other processor stores. (How would you implement that knowledge?)
Sketch the algorithm by which a processor can find out who it will
be receiving from; this algorithm should not involve any
communication itself.
End of exercise
However, there is an asymmetry between sending and receiving. While it is fairly easy for a processor to find out what other processors it will be receiving from, discovering who to send to is harder.
Exercise Argue that this is easy in the case of a
structurally symmetric matrix
:
$a_{ij}\not=0\Leftrightarrow a_{ji}\not=0$.
End of exercise
In the general case, a processor can in principle be asked to send to any other, so the simple algorithm is as follows:
Each processor makes an inventory of what non-local indices it needs. Under the above assumption that it knows what range of indices each other processor owns, it then decides which indices to get from what neighbors.
Each processor sends a list of indices to each of its neighbors; this list will be empty for most of the neighbors, but we can not omit sending it.
Each processor then receives these lists from all others, and draws up lists of which indices to send.
If a processor has only a few neighbors, the above algorithm is wasteful. Ideally, you would want space and running time proportional to the number of neighbors. We could bring the receive time in the setup if we knew how many messages were to be expected. That number can be found:
Each processor makes an array need of length $P$, where need[i]
is 1 if the processor needs any data from processor $i$, and zero otherwise.
A reduce-scatter collective on this array, with a sum operator, then leaves on each processor a number indicating how many processors need data from it.
The processor can execute that many receive calls.
The time and space needed for the setup can be reduced to $O(\log P)$ with some sophisticated trickery [Falgout:scalable-hypre,Hoefler:2010:SCP] .
crumb trail: > parallellinear > Computational aspects of iterative methods
All iterative methods feature the following operations:
A matrix-vector product; this was discussed for the sequential case in section 5.4 and for the parallel case in section 6.5 . In the parallel case, construction of FEM matrices has a complication that we will discuss in section 6.6.2 .
The construction of the preconditioner matrix $K\approx A$, and the solution of systems $Kx=y$. This was discussed in the sequential case in section 5.5.6 . We will go into parallelism aspects in section 6.7 .
Some vector operations (including inner products, in general). These will be discussed next.
crumb trail: > parallellinear > Computational aspects of iterative methods > Vector operations
There are two types of vector operations in a typical iterative method: vector additions and inner products.
Exercise Consider the CG method of section 5.5.11 , figure 5.7 , applied to the matrix from a 2D BVP 4.2.3 consider the unpreconditioned case $M=I$. Show that there is a roughly equal number of floating point operations are performed in the matrix-vector product and in the vector operations. Express everything in the matrix size $N$ and ignore lower order terms. How would this balance be if the matrix had 20 nonzeros per row?
Next, investigate this balance between vector and matrix operations
for the
FOM
scheme in section
5.5.9
. Since the number
of vector operations depends on the iteration, consider the first 50
iterations and count how many floating point operations are done in
the vector updates and inner product versus the matrix-vector
product. How many nonzeros does the matrix need to have for these
quantities to be equal?
End of exercise
Exercise
Flop counting is not the whole truth. What can you say about the
efficiency of the vector and matrix operations in an iterative
method, executed on a single processor?
End of exercise
crumb trail: > parallellinear > Computational aspects of iterative methods > Vector operations > Vector additions
The vector additions are typically of the form $x\leftarrow x+\alpha y$ or $x\leftarrow \alpha x+y$. If we assume that all vectors are distributed the same way, this operation is fully parallel.
crumb trail: > parallellinear > Computational aspects of iterative methods > Vector operations > Inner products
Inner products are vector operations, but they are computationally more interesting than updates, since they involve communication.
When we compute an inner product, most likely every processor needs to receive the computed value. We use the following algorithm:
\For {processor $p$} { compute $a_p\leftarrow x_p^ty_p$ where $x_p,y_p$ are the part of $x,y$ stored on processor~$p$ } do a global all-reduction to compute $a=\sum_p a_p$ \; broadcast the result
DISPLAYALGORITHM 6.10: Compute $a\leftarrow x^ty$ where $x,y$ are distributed vectors.
The {\tt Allreduce}) combines data over all processors, so they have a communication time that increases with the number of processors. This makes the inner product potentially an expensive operation (also they form a barrier-like synchronization), and people have suggested a number of ways to reducing their impact on the performance of iterative methods.
Exercise
Iterative methods are typically used for sparse matrices. In that
context, you can argue that the communication involved in an inner product
can have a larger influence on overall performance than the
communication in the matrix-vector product. What is the
complexity of the matrix-vector product and the inner product as a
function of the number of processors?
End of exercise
Here are some of the approaches that have been taken.
The CG method has two inner products per iteration that are inter-dependent. It is possible to rewrite the method so that it computes the same iterates (in exact arithmetic, at least) but so that the two inner products per iteration can be combined. See [ChGe:sstep,DAzEijRo:ppscicomp,Me:multicg,YandBrent:bicgstab] .
It may be possible to overlap the inner product calculation with other, parallel, calculations [dehevo92:acta] .
In the GMRES method, use of the classical GS method takes far fewer independent inner product than the modified
GS method, but it is less stable. People have investigated strategies for deciding when it is allowed to use the classic GS method [Langou:thesis] .
Since computer arithmetic is not associative, inner products are a prime source of results that differ when the same calculation is executed of two different processor configurations. In section 3.6.5 we sketched a solution.
crumb trail: > parallellinear > Computational aspects of iterative methods > Finite element matrix construction
The FEM leads to an interesting issue in parallel computing. For this we need to sketch the basic outline of how this method works. The FEM derives its name from the fact that the physical objects modeled are divided into small two or three dimensional shapes, the elements, such as triangles and squares in 2D, or pyramids and bricks in 3D. On each of these, the function we are modeling is then assumed to polynomial, often of a low degree, such as linear or bilinear.
\caption{A finite element domain, parallelization of the matrix construction, and parallelization of matrix element storage.}
The crucial fact is that a matrix element $a_{ij}$ is then the sum of computations, specifically certain integrals, over all elements that contain both variables $i$ and $j$:
\begin{equation} a_{ij}=\sum_{e\colon i,j\in e} a^{(e)}_{ij}. \end{equation}
The computations in each element share many common parts, so it is natural to assign each element $e$ uniquely to a processor $P_e$, which then computes all contributions $a^{(e)}_{ij}$. In figure 6.6.2 element 2 is assigned to processor 0 and element 4 to processor 1.Now consider variables $i$ and $j$ and the matrix element $a_{ij}$. It is constructed as the sum of computations over domain elements 2 and 4, which have been assigned to different processors. Therefore, no matter what processor row $i$ is assigned to, at least one processor will have to communicate its contribution to matrix element $a_{ij}$.
Clearly it is not possibly to make assignments $P_e$ of elements and $P_i$ of variables such that $P_e$ computes in full the coefficients $a_{ij}$ for all $i\in e$. In other words, if we compute the contributions locally, there needs to be some amount of communication to assemble certain matrix elements. For this reason, modern linear algebra libraries such as PETSc \PCSEref{sec:petsc-matset} allow any processor to set any matrix element.
crumb trail: > parallellinear > Computational aspects of iterative methods > A simple model for iterative method performance
Above, we have already remarked that iterative methods have very little opportunity for data reuse, and they are therefore characterized as bandwidth-bound algorithms. This allows us to make a simple prediction as to the floating point performance
of iterative methods. Since the number of iterations of an iterative method is hard to predict, by performance here we mean the performance of a single iteration. Unlike with direct methods for linear systems (see for instance section 6.8.1 ), the number of flops to solution is very hard to establish in advance.
matrix vector product}: the time spent in this operation is considerably more than in the vector operations. Additionally, most preconditioners have a computational structure that is quite similar to the matrix-vector product.
product} \index{Compressed Row Storage (CRS)!performance of the matrix-vector product}.
First we observe that the arrays of matrix elements and column indices have no reuse, so the performance of loading them is completely determined by the available bandwidth. Caches and prefetch streams only hide the latency, but do not improve the bandwidth. For each multiplication of matrix element times input vector element we load one floating point number and one integer. Depending one whether the indices are 32-bit or 64-bit, this means 12 or 16 bytes loaded for each multiplication.
The demands for storing the result vector are less important: an output vector element is written only once for each matrix row.
The input vector can also be ignored in the bandwidth calculation. At first sight you might think that indirect indexing into the input vector is more or less random and therefore expensive. However, let's take the operator view of a matrix-vector product and consider the space domain of the PDE from which the matrix stems; see section 4.2.3 . We now see that the seemingly random indexing is in fact into vector elements that are grouped closely together. This means that these vector elements will likely reside in L3 cache, and therefore accessible with a higher bandwidth (say, by a factor of at least 5) than data from main memory.
In sum, we conclude that a very simple model for the sparse matrix vector product, and thereby for the whole iterative solver, consists of measuring the effective bandwidth and computing the performance as one addition and one multiplication operation per 12 or 16 bytes loaded.
crumb trail: > parallellinear > Parallel preconditioners
Above (section 5.5.6 and in particular 5.5.6.1 ) we saw a couple of different choices of $K$. In this section we will begin the discussion of parallelization strategies. The discussion is continued in detail in the next sections.
crumb trail: > parallellinear > Parallel preconditioners > Jacobi preconditioning
The Jacobi method (section 5.5.3 ) uses the diagonal of $A$ as preconditioner. Applying this is as parallel as is possible: the statement $y\leftarrow K\inv x$ scales every element of the input vector independently. Unfortunately the improvement in the number of iterations with a Jacobi preconditioner is rather limited. Therefore we need to consider more sophisticated methods such ILU . Unlike with the Jacobi preconditioner, parallelism is then not trivial.
crumb trail: > parallellinear > Parallel preconditioners > Gauss-Seidel and SOR
Regarding GS and SOR we can remark that they are not often used as preconditioners since they are intrinsically unsymmetric. For SOR moreover there is the problem of determining a suitable value for $\omega$. The symmetrized SSOR (is similar enough to ILU ) that we will not discuss it separately.
These methods do find a use as multigrid smoother s, and here an interesting observation is to be made. Consider the solution of a GS system of equations by wavefronts (section 6.9.1 ), and imagine several iterations to be active simultaneously. It is not hard to see that this is equivalent to a multi-color scheme, be it with a different starting vector. See [AdJo:colorblind] for the full argument.
crumb trail: > parallellinear > Parallel preconditioners > The trouble with ILU in parallel
Above we saw that, in a flop counting sense, applying an ILU preconditioner (section 5.5.6.1 ) is about as expensive as doing a matrix-vector product. This is no longer true if we run our iterative methods on a parallel computer.
At first glance the operations are similar. A matrix-vector product $y=Ax$ looks like
for i=1..n y[i] = sum over j=1..n a[i,j]*x[j]In parallel this would look like
for i=myfirstrow..mylastrow y[i] = sum over j=1..n a[i,j]*x[j]Suppose that a processor has local copies of all the elements of $A$ and $x$ that it will need, then this operation is fully parallel: each processor can immediately start working, and if the work load is roughly equal, they will all finish at the same time. The total time for the matrix-vector product is then divided by the number of processors, making the speedup more or less perfect.
Consider now the forward solve $Lx=y$, for instance in the context of an ILU preconditioner:
for i=1..n x[i] = (y[i] - sum over j=1..i-1 ell[i,j]*x[j]) / a[i,i]We can simply write the parallel code:
for i=myfirstrow..mylastrow x[i] = (y[i] - sum over j=1..i-1 ell[i,j]*x[j]) / a[i,i]but now there is a problem. We can no longer say `suppose a processor has local copies of everything in the right hand side', since the vector $x$ appears both in the left and right hand side. While the matrix-vector product is in principle fully parallel over the matrix rows, this triangular solve code is recursive, hence sequential.
In a parallel computing context this means that, for the second processor to start, it needs to wait for certain components of $x$ that the first processor computes. Apparently, the second processor can not start until the first one is finished, the third processor has to wait for the second, and so on. The disappointing conclusion is that in parallel only one processor will be active at any time, and the total time is the same as for the sequential algorithm. This is actually not a big problem in the dense matrix case, since parallelism can be found in the operations for handling a single row (see section 6.11 ), but in the sparse case it means we can not use incomplete factorizations without some redesign.
In the next few subsections we will see different strategies for finding preconditioners that perform efficiently in parallel.
crumb trail: > parallellinear > Parallel preconditioners > Block Jacobi methods
Various approaches have been suggested to remedy this sequentiality the triangular solve. For instance, we could simply let the processors ignore the components of $x$ that should come from other processors:
for i=myfirstrow..mylastrow x[i] = (y[i] - sum over j=myfirstrow..i-1 ell[i,j]*x[j]) / a[i,i]This is not mathematically equivalent to the sequential algorithm (technically, it is called a block Jacobi method with ILU as the local solve ), but since we're only looking for an approximation $K\approx A$, this is simply a slightly cruder approximation.
Exercise
Take the Gauss-Seidel code you wrote above, and simulate a parallel
run. What is the effect of increasing the (simulated) number of processors?
End of exercise
The idea behind block methods can be appreciated pictorially; see figure 6.7.4 .
\caption{Sparsity pattern corresponding to a block Jacobi preconditioner.}
In effect, we make an ILU of the matrix that we get by ignoring all connections between processors. Since in a BVP all points influence each other (see section 4.2.1 ), using a less connected preconditioner will increase the number of iterations if executed on a sequential computer. However, block methods are parallel and, as we observed above, a sequential preconditioner is very inefficient in a parallel context, so we put up with this increase in iterations.
crumb trail: > parallellinear > Parallel preconditioners > Parallel ILU
The Block Jacobi preconditioner operates by decoupling domain parts. While this may give a method that is highly parallel, it may give a higher number of iterations than a true ILU preconditioner. Fortunately it is possible to have a parallel ILU method. Since we need the upcoming material on re-ordering of variables, we postpone this discussion until section 6.8.2.3 .
crumb trail: > parallellinear > Ordering strategies and parallelism
In the foregoing we have remarked on the fact that solving a linear system of equations is inherently a recursive activity. For dense systems, the number of operations is large enough compared to the recursion length that finding parallelism is fairly straightforward. Sparse systems, on the other hand, take more sophistication. In this section we will look at a number of strategies for reordering the equations (or, equivalently, permuting the matrix) that will increase the available parallelism.
These strategies can all be considered as variants of Gaussian elimination. By making incomplete variants of them (see section 5.5.6.1 ), all these strategies also apply to constructing preconditioners for iterative solution methods.
crumb trail: > parallellinear > Ordering strategies and parallelism > Nested dissection
In the previous discussions we typically looked at square domains and a lexicographic ordering of the variables. In this section you will see the nested dissection ordering of variables, which was initially designed as a way to reduce fill-in; see~ [Geo73] . More importantly to us, it is also advantageous in a parallel computing context, as we will now see. Nested dissection is a recursive process for determining a nontrivial ordering of the unknowns in a domain. In the first step, the computational domain is split in two parts, with a dividing strip between them; see figure~ 6.11 .FIGURE 6.11: Domain dissection into two unconnected subdomains and a separator.
\newcommand\Add{A^{\mathrm{DD}}} To be precise, the separator is wide enough that there are no connections between the left and right subdomain . The resulting matrix $\Add$ has a $3\times3$ structure, corresponding to the three parts of the domain. Since the subdomains $\Omega_1$ and $\Omega_2$ are not connected, the submatrices $\Add_{12}$ and $\Add_{21}$ are zero.
\begin{equation} \Add= \begin{pmatrix} A_{11}&\emptyset&A_{13}\\ \emptyset&A_{22}&A_{23}\\ A_{31}&A_{32}&A_{33} \end{pmatrix} = \left( \begin{array}{ccccc|ccccc|c} \star&\star & & & &&&&&&0\\ \star&\star &\star & & &&&&&&\vdots\\ &\ddots&\ddots&\ddots& &&&\emptyset&&&\vdots\\ & &\star &\star &\star &&&&&&0\\ & & &\star &\star &&&&&&\star\\ \midrule &&&&&\star&\star & & & &0\\ &&&&&\star&\star &\star & & &\vdots\\ &&\emptyset&&& &\ddots&\ddots&\ddots& &\vdots\\ &&&&& & &\star &\star &\star &0\\ &&&&& & & &\star &\star &\star\\ \midrule 0&\cdots&\cdots&0&\star&0&\cdots&\cdots&0&\star&\star \end{array} \right) \begin{array}{c} \left. \begin{array}{c} \phantom{0}\\ \phantom{\vdots}\\ \phantom{\vdots}\\ \phantom{0}\\ \phantom{\star}\\ \end{array} \right\} \\ \left. \begin{array}{c} \phantom{0}\\ \phantom{\vdots}\\ \phantom{\vdots}\\ \phantom{0}\\ \phantom{\star}\\ \end{array} \right\} \\ \left. \begin{array}{c} \phantom{0}\\ \end{array} \right\} \end{array} \begin{array}{c} \phantom{0}\\ \phantom{\vdots}\\ (n^2-n)/2\\ \phantom{0}\\ \phantom{\star}\\ \phantom{0}\\ \phantom{\vdots}\\ (n^2-n)/2\\ \phantom{0}\\ \phantom{\star}\\ n \end{array} \end{equation}
This process of dividing the domain by a separator is also called domain decomposition or substructuring , although this name is also associated with the mathematical analysis of the resulting matrices [BGSm:96] . In this example of a rectangular domain it is of course trivial to find a separator. However, for the type of equations we get from BVPs it is usually feasible to find a separator efficiently for any domain [LiTa:separator] ; see also section 19.6.2 .Let us now consider the $LU$ factorization of this matrix. If we factor it in terms of the $3\times 3$ block structure, we get
\begin{equation} \Add=LU= \begin{pmatrix} I\\ \emptyset&I\\ A_{31}A_{11}\inv&A_{32}A_{22}\inv&I \end{pmatrix} \begin{pmatrix} A_{11}&\emptyset&A_{13}\\ &A_{22}&A_{23}\\ &&S_{33} \end{pmatrix} \end{equation}
where\begin{equation} S_{33}=A_{33}-A_{31}A_{11}\inv A_{13}-A_{32}A_{22}\inv A_{23}. \end{equation}
The important fact here is that
the contributions $A_{31}A_{11}\inv A_{13}$ and $A_{32}A_{22}\inv A_{23}$ can be computed simultaneously, so the factorization is largely parallel; and
both in the forward and backward solve, components 1 and 2 of the solution can be computed simultaneously, so the solution process is also largely parallel.
Exercise
In section
5.4.3.1
you saw the connection between LU
factorization and graph theory: eliminating a node leads to a graph
with that node removed, but with certain new connections added.
Show that, after eliminating the first two sets $\Omega_1,\Omega_2$ of variables,
the graph of the remaining matrix on
the separator will be fully connected.
End of exercise
The upshot is that after eliminating all the variables in blocks 1 and 2 we are left with a matrix $S_{33}$ that is fully dense of size $n\times n$.
The introduction of a separator gave us a factorization that was two-way parallel. Now we iterate this process: we put a separator inside blocks 1 and 2 (see figure 6.12 ), which gives the following matrix structure:
QUOTE 6.12: A four-way domain decomposition.
\begin{equation} \Add= \left( \begin{array}{cccc|cc|c} A_{11}& & & &A_{15}& &A_{17}\\ &A_{22}& & &A_{25}& &A_{27}\\ & &A_{33}& & &A_{36}&A_{37}\\ & & &A_{44}& &A_{46}&A_{47}\\ \midrule A_{51}&A_{52}& & &A_{55}& &A_{57}\\ & &A_{63}&A_{64}& &A_{66}&A_{67}\\ \midrule A_{71}&A_{72}&A_{73}&A_{74}&A_{75}&A_{76}&A_{77} \end{array} \right) \end{equation}
(Note the similarities with the `arrow' matrix in section 5.4.3.4 , and recall the argument that this led to lower fill-in.) The LU factorization of this is:\begin{equation} \left( \begin{array}{cccc|cc|c} I& & & & & &\\ &I & & & & &\\ & &I & & & &\\ & & &I & & &\\ \midrule A_{51}A_{11}\inv&A_{52}A_{22}\inv& & &I & &\\ & &A_{63}A_{33}\inv&A_{64}A_{44}\inv& &I &\\ \midrule A_{71}A_{11}\inv&A_{72}A_{22}\inv&A_{73}A_{33}\inv&A_{74}A_{44}\inv& A_{75}S_5\inv&A_{76}S_6\inv&I \end{array} \right) \cdot \end{equation}
\begin{equation} \kern 3\unitindent \left( \begin{array}{cccc|cc|c} A_{11}& & & &A_{15}& &A_{17}\\ &A_{22}& & &A_{25}& &A_{27}\\ & &A_{33}& & &A_{36}&A_{37}\\ & & &A_{44}& &A_{46}&A_{47}\\ \hline & & & &S_{5}& &A_{57}\\ & & & & &S_{6} &A_{67}\\ \hline & & & & & &S_{7} \end{array} \right) \end{equation}
where\begin{equation} \begin{array}{l} S_5=A_{55}-A_{51}A_{11}\inv A_{15}-A_{52}A_{22}\inv A_{25},\quad S_6=A_{66}-A_{63}A_{33}\inv A_{36}-A_{64}A_{44}\inv A_{46}\\ S_7=A_{77}-\sum_{i=1,2,3,4}A_{7i}A_{ii}\inv A_{i7} - \sum_{i=5,6} A_{7i}S_i\inv A_{17}. \end{array} \end{equation}
Constructing the factorization now goes as follows:
Blocks $A_{ii}$ are factored in parallel for $i=1,2,3,4$; similarly the contributions $A_{5i}A_{ii}\inv A_{i5}$ for $i=1,2$, $A_{6i}A_{ii}\inv A_{i6}$ for $i=3,4$, and $A_{7i}A_{ii}\inv A_{i7}$ for $i=1,2,3,4$ can be constructed in parallel.
The Schur complement matrices $S_5,S_6$ are formed and subsequently factored in parallel, and the contributions $A_{7i}S_i\inv A_{17}$ for $i=5,6$ are constructed in parallel.
The Schur complement $S_7$ is formed and factored.
Exercise
Show that solving a system with $\Add$ has a similar parallelism to
constructing the factorization as described above.
End of exercise
For future reference, we will call the sets 1 and 2 each others' siblings, and similarly for 3 and 4. The set 5 is the parent of 1 and 2, 6 is the parent of 3 and 4; 5 and 6 are siblings and 7 is the parent of 5 and 6.
crumb trail: > parallellinear > Ordering strategies and parallelism > Nested dissection > Domain decomposition
In figure 6.12 we divided the domain four ways by a recursive process. This leads up to our discussion of nested dissection. It is also possible to immediately split a domain in any number of strips, or in a grid of subdomains. As long as the separators are wide enough, this will give a matrix structure with many independent subdomains. As in the above discussion, an LU factorization will be characterized by
parallel processing of the subdomains, both in the factorization and $L,U$ solves, and
a system to be solved on the separator structure.
FIGURE 6.13: One-way domain decomposition.
Exercise The matrix from a two-dimensional BVP has a block tridiagonal structure. Divide the domain in four strips, that is, using three separators (see figure 6.13 ). Note that the separators are uncoupled in the original matrix.
Now sketch the sparsity structure of the resulting system on the
separators are elimination of the subdomains. Show that the system
is block tridiagonal.
End of exercise
In all the domain splitting schemes we have discussed so far we have used domains that were rectangular, or `brick' shaped, in more than two dimensions. All of these arguments are applicable to more general domains in two or three dimensions, but things like finding a separator become much harder [LiRoTa:dissection] , and that holds even more for the parallel case. See section 19.6.2 for some introduction to this topic.
crumb trail: > parallellinear > Ordering strategies and parallelism > Nested dissection > Complexity
The nested dissection method repeats the above process until the subdomains are very small. For a theoretical analysis, we keep dividing until we have subdomains of size $1\times1$, but in practice one could stop at sizes such as $32$, and use an efficient dense solver to factor and invert the blocks.
To derive the complexity of the algorithm, we take another look at figure 6.12 , and see that complexity argument, the total space a full recursive nested dissection factorization needs is the sum of
one dense matrix on a separator of size $n$, plus
two dense matrices on separators of size $n/2$,
taking together $3/2\,n^2$ space and $5/12\,n^3$ time;
the two terms above then get repeated on four subdomains of size $(n/2)\times(n/2)$.
\begin{equation} \begin{array}{r@{{}={}}l} \mathrm{space}&3/2n^2+4\cdot 3/2(n/2)^2+\cdots\\ & N(3/2+3/2+\cdots)\quad\hbox{$\log n$ terms}\\ & O(N\log N) \end{array} \end{equation}
\begin{equation} \begin{array}{r@{{}={}}l} \mathrm{time}&5/12n^3/3+4\cdot 5/12(n/2)^3/3+\cdots\\ & 5/12 N^{3/2}(1+1/4+1/16+\cdots)\\ & O(N^{3/2}) \end{array} \end{equation}
Apparently, we now have a factorization that is parallel to a large extent, and that is done in $O(N\log N)$ space, rather than $O(N^{3/2})$ (see section 5.4.3.3 ). The factorization time has also gone down from $O(N^2)$ to $O(N^{3/2})$.Unfortunately, this space savings only happens in two dimensions: in three dimensions we need
one separator of size $n\times n$, taking $(n\times n)^2=N^{4/3}$ space and $1/3\cdot (n\times n)^3=1/3\cdot N^2$ time,
two separators of size $n\times n/2$, taking $N^{3/2}/2$ space and $1/3\cdot N^2/4$ time,
four separators of size $n/2\times n/2$, taking $N^{3/2}/4$ space and $1/3\cdot N^2/16$ time,
adding up to $7/4 N^{3/2}$ space and $21/16 N^2/3$ time;
on the next level there are 8 subdomains that contribute these terms with $n\rightarrow n/2$ and therefore $N\rightarrow N/8$.
\begin{equation} \frac{7}{4}N^{3/2}(1+(1/8)^{4/3}+\cdots)=O(N^{3/2}) \end{equation}
and the total time\begin{equation} \frac{21}{16}N^2(1+1/16+\cdots)/3=O(N^2). \end{equation}
We no longer have the tremendous savings of the 2D case.
crumb trail: > parallellinear > Ordering strategies and parallelism > Nested dissection > Irregular domains
So far, we have analyzed nested dissection for Cartesian domains. A much more complicated analysis shows that the order improvement holds for general problems in 2D, and that 3D in general has a higher complexity [LiRoTa:dissection] .
For irregular domains questions such as how to find the separator become more involved. For this, the concept of a `2D domain' needs to be generalized to that of a planar graph . Sufficiently `good' separators can be found in such graphs [LiTa:separator] , but with sequential algorithms. Parallel algorithms rely on spectral graph theory see
and 19.6 .
crumb trail: > parallellinear > Ordering strategies and parallelism > Nested dissection > Parallelism
The nested dissection method clearly introduces a lot of parallelism, and we can characterize it as task parallelism (section 2.5.3 ): associated with each separator is a task of factoring its matrix, and later one of solving a linear system on its variables. However, the tasks are not independent: in figure 6.12 the factorization on domain 7 has to wait for 5 and 6, and they have to wait for 1,2,3,4. Thus, we have tasks with dependencies in the form of a tree: each separator matrix can be factored only when its children have been factored.
Mapping these tasks to processors is not trivial. First of all, if we are dealing with shared memory we can use a simple task queue:
$\mbox{Queue}\leftarrow\{\}$\; \For{all bottom level subdomains $d$}{add $d$ to the Queue} \While{Queue is not empty} {\If{a processor is idle}{assign a queued task to it} \If{a task is finished AND its sibling is finished}{add its parent to the queue} }
The main problem here is that at some point we will have more processors than tasks, thus causing load unbalance. This problem is made more severe by the fact that the last tasks are also the most substantial, since the separators double in size from level to level. (Recall that the work of factoring a dense matrix goes up with the third power of the size!) Thus, for the larger separators we have to switch from task parallelism to medium-grained parallelism, where processors collaborate on factoring a block.
With distributed memory, we can now solve the parallelism problem with a simple task queue, since it would involve moving large amounts of data. (But recall that work is a higher power of the matrix size, which this time works in our favor, making communication relatively cheap.) The solution is then to use some form of domain decomposition. In figure 6.12 we could have four processors, associated with block 1,2,3,4. Processors 1 and 2 would then negotiate which one factors block 5 (and similarly processors 3 and 4 and block 6), or they could both do it redundantly.
crumb trail: > parallellinear > Ordering strategies and parallelism > Nested dissection > Preconditioning
As with all factorizations, it is possible to turn the nested dissection method into a preconditioner by making the factorization incomplete. (For the basic idea of incomplete factorizations, see section 5.5.6.1 ). However, here the factorization is formulated completely in terms of block matrices , and the division by the pivot element becomes an inversion or system solution with the pivot block matrix. We will not go into this further; for details see the literature [AxPo:dd2,Eij:general,Me:dd] .
crumb trail: > parallellinear > Ordering strategies and parallelism > Variable reordering and coloring: independent sets
Parallelism can be achieved in sparse matrices by using graph coloring (section~ 19.3 ). Since a `color' is defined as points that are only connected to other colors, they are by definition independent of each other, and therefore can be processed in parallel. This leads us to the following strategy:crumb trail: > parallellinear > Ordering strategies and parallelism > Variable reordering and coloring: independent sets > Red-black coloring
We start with a simple example, where we consider a tridiagonal matrix~$A$. The equation $Ax=b$ looks like \begin{equation} \begin{pmatrix} a_{11}&a_{12}&&&&\emptyset\\ a_{21}&a_{22}&a_{23}\\ &a_{32}&a_{33}&a_{34}\\ \emptyset&&\ddots&\ddots&\ddots \end{pmatrix} \begin{pmatrix} x_1\\ x_2\\ x_3\\ \vdots \end{pmatrix} = \begin{pmatrix} y_1\\ y_2\\ y_3\\ \vdots \end{pmatrix} \end{equation} We observe that $x_i$ directly depends on $x_{i-1}$ and $x_{i+1}$, but not $x_{i-2}$ or $x_{i+1}$. Thus, let us see what happens if we permute the indices to group every other component together.Pictorially, we take the points $1,\ldots,n$ and color them red and black (figure 6.14 ), then we permute them to first
QUOTE 6.14: Red-black ordering of a the points on a line.
take all red points, and subsequently all black ones. The correspondingly permuted matrix looks as follows:
\begin{equation} \begin{pmatrix} a_{11}&&&&a_{12}\\ &a_{33}&&&a_{32}&a_{34}\\ &&a_{55}&&&\ddots&\ddots\\ &&&\ddots\\ a_{21}&a_{23}&&&a_{22}\\ &a_{43}&a_{45}&&&a_{44}\\ &&\ddots&\ddots&&&\ddots \end{pmatrix} \begin{pmatrix} x_1\\ x_3\\ x_5\\ \vdots\\ x_2\\ x_4\\ \vdots \end{pmatrix} = \begin{pmatrix} y_1\\ y_3\\ y_5\\ \vdots\\ y_2\\ y_4\\ \vdots \end{pmatrix} \end{equation}
With this permuted $A$, the Gauss-Seidel matrix $D_A+L_A$ looks like\begin{equation} \begin{pmatrix} a_{11}&&&&\emptyset\\ &a_{33}\\ &&a_{55}\\ &&&\ddots\\ a_{21}&a_{23}&&&a_{22}\\ &a_{43}&a_{45}&&&a_{44}\\ &&\ddots&\ddots&&&\ddots \end{pmatrix} \end{equation}
What does this buy us? Well, let's spell out the solution of a system $Lx=y$.
\For{$i=1,3,5,\ldots$}{solve $x_i\leftarrow y_i/a_{ii}$} \For{$i=2,4,6,\ldots$}{compute $t=a_{ii-1}x_{i-1}+a_{ii+1}x_{i+1}$ \; solve $x_i\leftarrow (y_i-t)/a_{ii}$ }
Apparently the algorithm has three stages that are each parallel over half the domain points. This is illustrated in figure 6.15 .
FIGURE 6.15: Red-black solution on a 1d domain.
Theoretically we could accommodate a number of processors that is half the number of the domain points, but in practice each processor will have a subdomain. Now you can see in figure 6.16 how this causes a very modest amount of communication: each processor sends at most the data of two red points to its neighbors.
FIGURE 6.16: Parallel red-black solution on a 1d domain.
Exercise
Argue that the adjacency graph here is a
bipartite graph
.
We see that such graphs (and in general, colored graphs) are associated with parallelism.
Can you also point out performance benefits for non-parallel processors?
End of exercise
Red-black ordering can be applied to two-dimensional problems too. Let us apply a red-black ordering to the points $(i,j)$ where $1\leq i,j\leq n$. Here we first apply a successive numbering to the odd points on the first line $(1,1),(3,1),(5,1),\ldots$, then the even points of the second line $(2,2),(4,2),(6,2),\ldots$, the odd points on the third line, et cetera. Having thus numbered half the points in the domain, we continue with the even points in the first line, the odd points in the second, et cetera.
\caption{Red-black ordering of the variables of a two-dimensional domain.}
As you can see in figure 6.16 , now the red points are only connected to black points, and the other way around. In graph theoretical terms, you have found a coloring
(see appendix app:graph for the definition of this concept) of the matrix graph with two colors.
Exercise
Apply the red-black ordering to the 2D
BVP
4.2.3
Sketch the resulting matrix structure.
End of exercise
The red-black ordering is a simple example of graph coloring (sometimes called multi-coloring ). In simple cases, such as the unit square domain we considered in section 4.2.3 or its extension to 3D, the color number of the adjacency graph is readily determined; in less regular cases it is harder.
Exercise
You saw that a red-black ordering of unknowns
coupled with the regular five-point star stencil give two subsets of
variables that are not connected among themselves, that is,
they form a two-coloring of the matrix graph.
Can you find a coloring if nodes are connected by the second stencil in
figure
4.2
?
End of exercise
crumb trail: > parallellinear > Ordering strategies and parallelism > Variable reordering and coloring: independent sets > General coloring
There is a simple bound for the number of colors needed for the graph of a sparse matrix: the number of colors is at most $d+1$ where $d$ is the degree of the graph. To see that we can color a graph with degree $d$ using $d+1$ colors, consider a node with degree $d$. No matter how its neighbors are colored, there is always an unused color among the $d+1$ available ones.
Exercise
Consider a sparse matrix, where the graph can be colored with $d$
colors. Permute the matrix by first enumerating the unknowns of the
first color,
then the second color, et cetera. What can you say about the
sparsity pattern of the resulting permuted matrix?
End of exercise
If you are looking for a direct solution of the linear system you can repeat the process of coloring and permuting on the matrix that remains after you have eliminated one color. In the case of a tridiagonal matrix you saw that this remaining matrix was again tridiagonal, so it is clear how to continue the process. This is called recursive doubling . If the matrix is not tridiagonal but block tridiagonal , this operation can be performed on blocks.
crumb trail: > parallellinear > Ordering strategies and parallelism > Variable reordering and coloring: independent sets > Multi-color parallel ILU
In section 6.8.2 you saw the combination of graph coloring and permutation. Let $P$ be the permutation that groups like-colored variables together, then $\tilde A=P^tAP$ is a matrix with the following structure:
$\tilde A$ has a block structure with the number of blocks equal to the number of colors in the adjacency graph of $A$; and
each diagonal block is a diagonal matrix.
for $c$ in the set of colors:
for $i$ in the variables of color $c$:
$y_i\leftarrow x_i-\sum_{j
Exercise
Show that the flop count of solving a system $LUx=y$ remains the
same (in the highest order term) when you from an
ILU
factorization in the natural ordering to one in the color-permuted
ordering.
End of exercise
FIGURE 6.17: A partitioned domain with colored nodes.
Where does all this coloring get us? Solving is still sequential\ldots Well, it is true that the outer loop over the colors is sequential, but all the points of one color are independent of each other, so they can be solved at the same time.
FIGURE 6.18: Solving a parallel multicolor ILU in four steps.
So if we use an ordinary domain partitioning and combine that with a multi-coloring (see figure 6.17 ), the processors are all active during all the color stages; see figure 6.18 . Ok, if you take a close look at that figure you'll see that one processor is not active in the last color. With large numbers of nodes per processor this is unlikely to happen, but there may be some load imbalance.
One remaining problem is how to generate the multi-coloring in parallel. Finding the optimal color number is NP-hard. The thing that saves us is that we don't necessarily need the optimal number because we are making in incomplete factorization anyway. (There is even an argument that using a slightly larger number of colors decreases the number of iterations.)
An elegant algorithm for finding a multi-coloring in parallel was found by [jopl94,Luby:parallel] :
Assign a random value to each variable.
Find the variables that have a higher random value than all their neighbors; that is color 1.
Then find the variables that have a higher random value than all their neighbors that are not of color 1. That is color 2.
Iterate until all points are colored.
crumb trail: > parallellinear > Ordering strategies and parallelism > Irregular iteration spaces
Applying a computational stencil , in the context of explicit time-stepping or as a sparse matrix-vector product, is parallel. However, in practice it may not be trivial to split the iteration space. If the iteration space is a Cartesian brick, it is easy, even with nested parallelism.
However, in practice, iterations spaces can be much more complicated. In such cases the following can be done:
The iteration loop is traversed to count the total number of inner iterations; this is then divide in as many parts as there processes.
The loop is traversed to find the starting and ending index values for each process.
The loop code is then rewritten so that it can run over such a subrange.
crumb trail: > parallellinear > Ordering strategies and parallelism > Ordering for cache efficiency
The performance of stencil operations is often quite low. The operations have no obvious cache reuse; often they resemble stream operations where long streams of data are retrieved from memory and used only once. If only a single stencil evaluation was done, that would be the end of the story. However, usually we do many such updates, and we can apply techniques similar to loop tiling described in section 1.7.9 .
We can do better than with plain tiling if we take the shape of the stencil into account.
\caption{First and second step in cache-optimized iteration space traversal.}
Figure 6.8.4 shows (left) how we first compute a time-space trapezoid that is cache-contained. Then (right) we compute another cache-contained trapezoid that builds on the first [Frigo:2007:oblivious-stencil] .
\Level 0 {Parallelism in solving linear systems from PDEs }
The numerical solution of PDEs is an important activity, and the precision required often makes it a prime candidate for parallel treatment. If we wonder just how parallel we can be, and in particular what sort of a speedup is attainable, we need to distinguish between various aspects of the question.
First of all we can ask if there is any intrinsic parallelism in the problem. On a global level this will typically not be the case (if parts of the problem were completely uncoupled, then they would be separate problems, right?) but on a smaller level there may be parallelism.
For instance, looking at time-dependent problems, and referring to section 4.2.1 , we can say that every next time step is of course dependent on the previous one, but not every individual point on the next time step is dependent on every point in the previous step: there is a region of influence . Thus it may be possible to partition the problem domain and obtain parallelism.
crumb trail: > parallellinear > Ordering strategies and parallelism > Operator splitting
In some contexts, it is necessary to perform implicit calculations through all directions of a two or three-dimensional array. For example, in section 4.3 you saw how the implicit solution of the heat equation gave rise to repeated systems
\begin{equation} (\alpha I+\frac {d^2}{dx^2}+\frac{d^2}{dy^2})u^{(t+1)}=u^{(t)} \label{eq:heat-recap} \end{equation}
Without proof, we state that the time-dependent problem can also be solved by\begin{equation} (\beta I+\frac {d^2}{dx^2})(\beta I+\frac{d^2}{dy^2})u^{(t+1)}=u^{(t)} \label{eq:adi-recap} \end{equation}
for suitable $\beta$. This scheme will not compute the same values on each individual time step, but it will converge to the same steady state. The scheme can also be used as a preconditioner in the BVP case.This approach has considerable advantages, mostly in terms of operation counts: the original system has to be solved either making a factorization of the matrix, which incurs fill-in , or by solving it iteratively.
Exercise
Analyze the relative merits of these approaches, giving rough
operation counts. Consider both the case where $\alpha$ has
dependence on $t$ and where it does not. Also discuss the expected
speed of various operations.
End of exercise
A further advantage appears when we consider the parallel solution 6.8.5 variables $u_{ij}$, but the operator $I+d^2u/dx^2$ only connects $u_{ij},u_{ij-1},u_{ij+1}$. That is, each line corresponding to an $i$ value can be processed independently. Thus, both operators can be solved fully parallel using a one-dimensional partition on the domain. 6.8.5 hand, has limited parallelism.
Unfortunately, there is a serious complication: the operator in $x$ direction needs a partitioning of the domain in on direction, and the operator in $y$ in the other. The solution usually taken is to transpose the $u_{ij}$ value matrix in between the two solves, so that the same processor decomposition can handle both. This transposition can take a substantial amount of the processing time of each time step.
Exercise
Discuss the merits of and problems with a two-dimensional
decomposition of the domain, using a grid of $P=p\times p$
processors. Can you suggest a way to ameliorate the problems?
End of exercise
One way to speed up these calculations, is to replace the implicit solve, by an explicit operation; see section 6.9.3 .
crumb trail: > parallellinear > Parallelism and implicit operations
In the discussion of IBVPs (section~ 4.1.2.2 ) you saw that implicit operations can have great advantages from the point of numerical stability. However, you also saw that they make the difference between methods based on a simple operation such as the matrix-vector product, and ones based on the more complicated linear system solution. There are further problems with implicit methods when you start computing in parallel. The following exercise serves as a recap of some of the issues already identified in our general discussion of linear algebra; chapter~ Numerical linear algebra . Exercise Let $A$ be the matrix \begin{equation} A= \begin{pmatrix} a_{11}&&\emptyset\\ a_{21}&a_{22}\\ &\ddots&\ddots\\ \emptyset&&a_{n,n-1}&a_{nn} \end{pmatrix} . \label{eq:ex:bidiagonal} \end{equation} Show that the matrix vector product $y\leftarrow Ax$ and the system solution $x\leftarrow A\inv y$, obtained by solving the triangular system $Ax=y$, not by inverting $A$, have the same operation count.Now consider parallelizing the product $y\leftarrow Ax$. Suppose we have $n$ processors, and each processor $i$ stores $x_i$ and the $i$-th row of $A$. Show that the product $Ax$ can be computed without idle time on any processor but the first.
Can the same be done for the solution of the triangular system
$Ax=y$? Show that the straightforward implementation has every
processor idle for an $(n-1)/n$ fraction of the computation.
End of exercise
We will now see a number of ways of dealing with this inherently sequential component.
crumb trail: > parallellinear > Parallelism and implicit operations > Wavefronts
Above, you saw that solving a lower triangular system of size $N$ can have sequential time complexity of $N$ steps. In practice, things are often not quite that bad. Implicit algorithms such as solving a triangular system are inherently sequential, but the number of steps can be less than is apparent at first.
Exercise
4.2.3
from a two-dimensional
BVP
on the unit square,
discretized with central differences.
This matrix resulted from the
lexicographic ordering
of the variables.
Derive the matrix structure if we order the unknowns by diagonals.
What can you say about the sizes of
the blocks and the structure of the blocks themselves?
What is the implication for parallelism?
End of exercise
We don't have to form the matrix to arrive at a parallelism analysis. As already remarked, often a stencil approach is more fruitful. Let us take another look at figure 4.2.3 that describes the finite difference stencil of a two-dimensional BVP . The corresponding picture for the stencil of the lower triangular factor is
\caption{The difference stencil of the $L$ factor of the matrix of a two-dimensional BVP .}
in figure 6.9.1 . This describes the sequentiality of the lower triangular solve process $x\leftarrow L\inv y$:
\begin{equation} x_k = y_k - \ell_{k,k-1}x_{k-1} - \ell_{k,k-n}x_{k-n} \end{equation}
In other words, the value at point $k$ can be found if its neighbors to the left (that is, variable $k-1$) and below (variable $k-n$) are known.Turning this around, we see that, if we know $x_1$, we can not only find $x_2$, but also $x_{n+1}$. In the next step we can determine $x_3$, $x_{n+2}$, and $x_{2n+1}$. Continuing this way, we can solve $x$ by wavefronts : the values of $x$ on each wavefront are independent, so they can be solved in parallel in the same sequential step.
Exercise
Finish this argument. What is the maximum number of processors we
can employ, and what is the number of sequential steps? What is the
resulting efficiency?
End of exercise
Of course you don't have to use actual parallel processing to exploit this parallelism. Instead you could use a vector processor , vector instructions , or a GPU [Liu:cudasw2009] .
In section 5.4.3.5 you saw the Cuthill-McKee ordering for reducing the fill-in of a matrix. We can modify this algorithm as follows to give wavefronts:
Take an arbitrary node, and call that `level zero'.
For level $n+1$, find points connected to level $n$, that are not themselves connected.
For the so-called `reverse Cuthill-McKee ordering', reverse the numbering of the levels.
Exercise
This algorithm is not entirely correct. What is the problem; how can
you correct it? Show that the resulting permuted matrix is no
longer tridiagonal, but will likely still have a band structure.
End of exercise
crumb trail: > parallellinear > Parallelism and implicit operations > Recursive doubling
Recursions $y_{i+1} = a_i y_i + b_i $, such as appear in solving a bilinear set of equations (see exercise 4.2.2 ), seem intrinsically sequential. However, you already saw in exercise 1.3 how, at the cost of some preliminary operations, the computation can be parallelized. The basic idea is that you transform the sequence:
\begin{equation} x_{i+1} = a_i x_i+y_i \Rightarrow x_{i+2} = a'_i x_i + y'_i \Rightarrow x_{i+4} = a''_ix_i + y''_i \Rightarrow \cdots \end{equation}
We will now formalize this strategy, generally known as recursive doubling bidiagonal matrix 6.9 giving the problem to solve $x$ in:
\begin{equation} \begin{pmatrix} 1&&\emptyset\\ b_{21}&1\\ &\ddots&\ddots\\ \emptyset&&b_{n,n-1}&1 \end{pmatrix} x = y \end{equation}
which we write as $A=I+B$.
Exercise
Show that the scaling to normalized form can be done by multiplying
with a diagonal matrix.
How does solving the system $(I+B)x=y$ help in solving $Ax=y$? What
are the operation counts of solving the system in the two different ways?
End of exercise
Now we do something that looks like Gaussian elimination, except that we do not start with the first row, but the second. (What would happen if you did Gaussian elimination or LU decomposition on the matrix $I+B$?) We use the second row to eliminate $b_{32}$:
\begin{equation} \begin{pmatrix} 1&&&\emptyset\\ &1\\ &-b_{32}&1\\ &&&\ddots\\ \emptyset&&&&1 \end{pmatrix} \times \begin{pmatrix} 1&&&\emptyset\\ b_{21}&1\\ &b_{32}&1\\ &&\ddots&\ddots\\ \emptyset&&&b_{n,n-1}&1 \end{pmatrix} = \begin{pmatrix} 1&&&\emptyset\\ b_{21}&1\\ -b_{32}b_{21}&0&1\\ \emptyset&&b_{n,n-1}&1 \end{pmatrix} \end{equation}
which we write as $L^{(2)}A=A^{(2)}$. We also compute $L^{(2)}y=y^{(2)}$ so that $A^{(2)}x=y^{(2)}$ has the same solution as $Ax=y$. Solving the transformed system gains us a little: after we compute $x_1$, $x_2$ and $x_3$ can be computed in parallel.Now we repeat this elimination process by using the fourth row to eliminate $b_{54}$, the sixth row to eliminate $b_{76}$, et cetera. The final result is, summarizing all $L^{(i)}$ matrices: {\small
\begin{equation} \begin{pmatrix} 1&&&&&&&\emptyset\\ 0&1\\ &-b_{32}&1\\ &&0&1\\ &&&-b_{54}&1\\ &&&&0&1\\ &&&&&-b_{76}&1\\ &&&&&&\ddots&\ddots\\ \end{pmatrix} \times (I+B) = \begin{pmatrix} 1&&&&&&&\emptyset\\ b_{21}&1\\ -b_{32}b_{21}&0&1\\ &&b_{43}&1\\ &&-b_{54}b_{43}&0&1\\ &&&&b_{65}&1\\ &&&&-b_{76}b_{65}&0&1\\ &&&&&\ddots&\ddots&\ddots \end{pmatrix} \end{equation}
} which we write as $L(I+B)=C$, and solving $(I+B)x=y$ now becomes $Cx=L\inv y$.This final result needs close investigation.
First of all, computing $y'=L\inv y$ is simple. (Work out the details. How much parallelism is available?)
Solving $Cx=y'$ is still sequential, but it no longer takes $n$ steps: from $x_1$ we can get $x_3$, from that we get $x_5$, et cetera. In other words, there is only a sequential relationship between the odd numbered components of $x$.
The even numbered components of $x$ do not depend on each other, but only on the odd components: $x_2$ follows from $x_1$, $x_4$ from $x_3$, et cetera. Once the odd components have been computed, admittedly sequentially, this step is fully parallel.
\begin{equation} \begin{pmatrix} 1&&\emptyset\\ c_{31}&1\\ &\ddots&\ddots\\ \emptyset&&c_{n,n-1}&1 \end{pmatrix} \begin{pmatrix} x_1\\ x_3\\ \vdots\\ x_n \end{pmatrix} = \begin{pmatrix} y'_1\\ y'_3\\ \vdots\\ y'_n \end{pmatrix} \end{equation}
where $c_{i+1i}=-b_{2n+1,2n}b_{2n,2n-1}$. In other words, we have reduced a size $n$ sequential problem to a sequential problem of the same kind and a parallel problem, both of size $n/2$. Now we can repeat this procedure recursively, reducing the original problem to a sequence of parallel operations, each half the size of the former.The process of computing all partial sums through recursive doubling is also referred to as a parallel prefix operation . Here we use a prefix sum, but in the abstract it can be applied to any associative operator; see section app:prefix .
crumb trail: > parallellinear > Parallelism and implicit operations > Approximating implicit by explicit operations, series expansion
There are various reasons why it is sometimes allowed to replace an implicit operation, which, as you saw above, can be problematic in practice, by a different one that is practically more advantageous.
Using an explicit method for the heat equation (section 4.3 ) instead of an implicit one is equally legitimate, as long as we observe step size restrictions on the explicit method.
Tinkering with the preconditioner (section 5.5.8 ) in an iterative method is allowed, since it will only affect the speed of convergence, not the solution the method converges to. You already saw one example of this general idea in the block Jacobi method; section 6.7.4 . In the rest of this section you will see how recurrences in the preconditioner, which are implicit operations, can be replaced by explicit operations, giving various computational advantages.
Solving a linear system is a good example of an implicit operation, and since this comes down to solving two triangular systems, let us look at ways of finding a computational alternative to solving a lower triangular system. If $U$ is upper triangular and nonsingular, we let $D$ be the diagonal of $U$, and we write $U=D(I-B)$ where $B$ is an upper triangular matrix with a zero diagonal, also called a strictly upper triangular matrix ; we say that $I-B$ is a unit upper triangular matrix .
Exercise
Let $A=LU$ be an LU factorization where $L$ has ones on the
diagonal, and the pivots on the diagonal of $U$,
as derived in section
5.3
.
Show how solving a system $Ax=b$ can be done, involving
only the solution of unit upper and lower triangular systems. Show
that no divisions are needed during the system solution.
End of exercise
Our operation of interest is now solving the system $(I-B)x=y$. We observe that
\begin{equation} (I-B)\inv = I+B+B^2+\cdots \label{eq:neuman-series} \end{equation}
and $B^n=0$ where $n$ is the matrix size (check this!), so we can solve $(I-B) x = y$ exactly by\begin{equation} x = \sum_{k=0}^{n-1}B^k y. \end{equation}
Of course, we want to avoid computing the powers $B^k$ explicitly, so we observe that\begin{equation} \begin{array}{rl} x_1=\sum_{k=0}^1B^ky &= (I+B)y,\\ x_2=\sum_{k=0}^2B^ky &= (I+B(I+B))y = y+Bx_1,\\ x_3=\sum_{k=0}^3B^ky &= (I+B(I+B((I+B))))y = y+Bx_2\\ \end{array} \label{eq:horner} \end{equation}
et cetera. The resulting algorithm for evaluating $\sum_{k=0}^{n-1}B^k y$ is called Horner's rule (see section 21.3 ), and you see that it avoids computing matrix powers $B^k$.
Exercise
Suppose that $I-B$ is bidiagonal. Show that the above calculation
takes $n(n+1)$ operations. (What would it have been if you'd computed
the powers of $B$ explicitly?)
What is the operation count for computing
$(I-B)x=y$ by triangular solution?
End of exercise
We have now turned an implicit operation into an explicit one, but unfortunately one with a high operation count. In practical circumstances, as argued above, however, we can often justify truncating the sum of matrix powers.
Exercise Let $A$ be the tridiagonal matrix
\begin{equation} A= \begin{pmatrix} 2&-1&&&\emptyset\\ -1&2&-1\\ &\ddots&\ddots&\ddots\\ &&&&-1\\ \emptyset&&&-1&2 \end{pmatrix} \end{equation}
of the one-dimensional BVP from section 4.2.2 .
Recall the definition of diagonal dominance in section 5.3.4 . Is this matrix diagonally dominant?
Show that the pivots in an LU factorization of this matrix (without pivoting) satisfy a recurrence. Hint: show that after $n$ elimination steps ($n\geq0$) the remaining matrix looks like
\begin{equation} A^{(n)}= \begin{pmatrix} d_n&-1&&&\emptyset\\ -1&2&-1\\ &\ddots&\ddots&\ddots\\ &&&&-1\\ \emptyset&&&-1&2 \end{pmatrix} \end{equation}
and show the relation between $d_{n+1}$ and $d_n$.Show that the sequence $n\mapsto d_n$ is descending, and derive the limit value.
Write out the $L$ and $U$ factors in terms of the $d_n$ pivots.
Are the $L$ and $U$ factors diagonally dominant?
The above exercise implies (note that we did not actually prove it!) that for matrices from BVPs we find that $B^k\downarrow0$, in element size and in norm. This means that we can approximate the solution of $(I-B)x=y$ by, for instance, $x=(I+B)y$ or $x=(I+B+B^2)y$. Doing this still has a higher operation count than the direct triangular solution, but it is computationally advantageous in at least two ways:
The explicit algorithm has a better pipeline behavior.
The implicit algorithm has problems in parallel, as you have seen; the explicit algorithm is more easily parallelized.
Exercise
Describe the parallelism aspects of Horner's rule;
6.9.3
End of exercise
crumb trail: > parallellinear > Grid updates
One of the conclusions of chapter Numerical treatment of differential equations was that explicit methods for time-dependent problems are computationally easier than implicit ones. For instance, they typically involve a matrix-vector product rather than a system solution, and parallelizing explicit operations is fairly simple: each result value of the matrix-vector product can be computed independently. That does not mean that there are no other computational aspects worth remarking on.
Since we are dealing with sparse matrices, stemming from some computational stencil , we take the operator point of view. In figures 6.5.2
and 6.8 you saw how applying a stencil in each point of the domain induces certain relations between processors: in order to evaluate the matrix-vector product $y\leftarrow Ax$ on a processor, that processor needs to obtain the $x$-values of its ghost region . Under reasonable assumptions on the partitioning of the domain over the processors, the number of messages involved will be fairly small.
Exercise
Reason that, in a
FEM
or
FDM
context,
the number of messages is $O(1)$ as $h\downarrow\nobreak 0$.
End of exercise
In section 1.6.1 you saw that the matrix-vector product has little data reuse, though there is some locality to the computation; in section 5.4.1.4 it was pointed out that the locality of the sparse matrix-vector product is even worse because of indexing schemes that the sparsity necessitates. This means that the sparse product is largely a bandwidth-bound algorithm .
Looking at just a single product there is not much we can do about that. However, often we do a number of such products in a row, for instance as the steps in a time-dependent process. In that case there may be rearrangements of the operations that lessen the bandwidth demands. Consider as a simple example
\begin{equation} \forall_i\colon x^{(n+1)}_i = f\bigl( x^{(n)}_i, x^{(n)}_{i-1}, x^{(n)}_{i+1} \bigr) \label{eq:3p-average} \end{equation}
and let's assume that the set $\{x^{(n)}_i\}_i$ is too large to fit in cache. This is a model for, for instance, the explicit scheme for the heat equation in one space dimension; section 4.3.1.1 . Schematically:
\begin{equation} \begin{array}{ccccc} x^{(n)}_0&x^{(n)}_1&x^{(n)}_2\\ \downarrow\,\swarrow&\searrow\,\downarrow\,\swarrow&\searrow\,\downarrow\,\swarrow\\ x^{(n+1)}_0&x^{(n+1)}_1&x^{(n+1)}_2\\ \downarrow\,\swarrow&\searrow\,\downarrow\,\swarrow&\searrow\,\downarrow\,\swarrow\\ x^{(n+2)}_0&x^{(n+2)}_1&x^{(n+2)}_2\\ \end{array} \end{equation}
In the ordinary computation, where we first compute all $x^{(n+1)}_i$, then all $x^{(n+2)}_i$, the intermediate values at level $n+1$ will be flushed from the cache after they were generated, and then brought back into cache as input for the level $n+2$ quantities.However, if we compute not one, but two iterations, the intermediate values may stay in cache. Consider $x^{(n+2)}_0$: it requires $x^{(n+1)}_0,x^{(n+1)}_1$, which in turn require $x^{(n)}_0,\ldots,x^{(n)}_2$.
Now suppose that we are not interested in the intermediate results, but only the final iteration. This is the case if you are iterating to a stationary state, doing multigrid smoothing, et cetera. Figure 6.19 shows a simple example.
FIGURE 6.19: Computation of blocks of grid points over multiple iterations.
The first processor computes 4 points on level $n+2$. For this it needs 5 points from level $n+1$, and these need to be computed too, from 6 points on level $n$. We see that a processor apparently needs to collect a ghost region of width two, as opposed to just one for the regular single step update. One of the points computed by the first processor is $x^{(n+2)}_3$, which needs $x^{(n+1)}_4$. This point is also needed for the computation of $x^{(n+2)}_4$, which belongs to the second processor.The easiest solution is to let this sort of point on the intermediate level redundantly computed , in the computation of both blocks where it is needed, on two different processors.
Exercise
Can you think of cases where a point would be redundantly computed by
more than two processors?
End of exercise
We can give several interpretations to this scheme of computing multiple update steps by blocks.
First of all, as we motivated above, doing this on a single processor increases locality: if all points in a colored block (see the figure) fit in cache, we get reuse of the intermediate points.
Secondly, if we consider this as a scheme for distributed memory computation, it reduces message traffic. Normally, for every update step the processors need to exchange their boundary data. If we accept some redundant duplication of work, we can now eliminate the data exchange for the intermediate levels. The decrease in communication will typically outweigh the increase in work.
Exercise
Discuss the case of using this strategy for multicore computation.
What are the savings? What are the potential pitfalls?
End of exercise
crumb trail: > parallellinear > Grid updates > Analysis
Let's analyze the algorithm we have just sketched. As in 6.10 points and a function of three points. The parameters describing the problem are these:
$N$ is the number of points to be updated, and $M$ denotes the number of update steps. Thus, we perform $MN$ function evaluations.
$\alpha,\beta,\gamma$ are the usual parameters describing latency, transmission time of a single point, and time for an operation (here taken to be an $f$ evaluation).
$b$ is the number of steps we block together.
Adding all these terms together, we find a cost of
\begin{equation} \frac Mb\alpha+M\beta+\left(\frac {MN}p+Mb\right)\gamma. \end{equation}
We observe that the overhead of $\alpha M/b+\gamma Mb$ is independent of $p$,
Exercise
Compute the optimal value of $b$, and remark that it only depends on
the architectural parameters $\alpha,\beta,\gamma$ but not on the
problem parameters.
End of exercise
crumb trail: > parallellinear > Grid updates > Communication and work minimizing strategy
We can make this algorithm more efficient by overlapping computation with communication . As illustrated in figure 6.20 , each processor start by communicating its halo, and overlapping this communication with the part of the communication that can be done locally. The values that depend on the halo will then be computed last.
FIGURE 6.20: Computation of blocks of grid points over multiple iterations.
Exercise
What is a great practical problem with organizing your code (with
the emphasis on `code'!) this way?
End of exercise
If the number of points per processor is large enough, the amount of communication is low relative to the computation, and you could take $b$ fairly large. However, these grid updates are mostly used in iterative methods such as the CG method (section 5.5.11 ), and in that case considerations of roundoff prevent you from taking $b$ too large [ChGe:sstep] .
Exercise
Go through the complexity analysis for the non-overlapping algorithm
in case the points are organized in a 2D grid. Assume that each
point update involves four neighbors, two in each coordinate
direction.
End of exercise
A further refinement of the above algorithm is possible. Figure 6.21 illustrates that it is possible to use a halo region that uses different points from different time steps.
FIGURE 6.21: Computation of blocks of grid points over multiple iterations.
This algorithm (see [Demmel2008IEEE:avoiding] ) cuts down on the amount of redundant computation. However, now the halo values that are communicated first need to be computed, so this requires splitting the local communication into two phases.
crumb trail: > parallellinear > Block algorithms on multicore architectures
In section~ 5.3.7 you saw that certain linear algebra algorithms can be formulated in terms of submatrices. This point of view can be beneficial for the efficient execution of linear algebra operations on shared memory architectures such as current multicore processors. \newcommand\chol{\mathop{\mathrm{Chol}}} As an example, let us consider the Cholesky factorization , which computes $A=LL^t$ for a symmetric positive definite matrix~$A$; see also section~ 5.3.2 . Recursively, we can describe the algorithm as follows: \begin{equation} \chol \begin{pmatrix} A_{11}&A_{21}^t\\ A_{21}&A_{22} \end{pmatrix} = LL^t\qquad\hbox{where}\quad L= \begin{pmatrix} L_{11}&0\\ \tilde A_{21} &\chol(A_{22}-\tilde A_{21}\tilde A_{21}^t) \end{pmatrix} \end{equation} and where $\tilde A_{21}=A_{21}L_{11}\invt,$ $A_{11}=L_{11}L_{11}^t$.(This algorithm applies can be considered both as a scalar and block matrix algorithm. From a point of performance, by using BLAS 3 routines, we will implicitly consider only the blocked approach.)
In practice, the block implementation is applied to a partitioning
\begin{equation} \left( \begin{array}{c|c|cc} &\multicolumn{2}{c}{\mathrm{finished}}\\ \hline \multirow{2}{*}{}&A_{kk}&A_{k,>k}&\\ \hline &A_{>k,k}&A_{>k,>k} \end{array} \right) \end{equation}
where $k$ is the index of the current block row, and the factorization is finished for all indices $
for $k=1,\mathrm{nblocks}$:
Chol
: factor $L_kL_k^t\leftarrow A_{kk}$
Trsm
: solve $\tilde A_{>k,k} \leftarrow A_{>k,k}L_k\invt$
Gemm
: form the product $\tilde A_{>k,k}\tilde A_{>k,k}^t$
Syrk
: symmmetric rank-$k$ update
$A_{>k,>k}\leftarrow A_{>k,>k}-\tilde A_{>k,k}\tilde A_{>k,k}^t$
The key to parallel performance is to partition the indices $>k$ and write the algorithm in terms of these blocks:
\begin{equation} \left( \begin{array}{c|c|cc} &\multicolumn{2}{c}{\mathrm{finished}}\\ \hline \multirow{2}{*}{}&A_{kk}&A_{k,k+1}&A_{k,k+2}\cdots\\ \hline &A_{k+1,k}&A_{k+1,k+1}&A_{k+1,k+2}\cdots\\ &A_{k+2,k}&A_{k+2,k+2}\\ &\vdots&\vdots \end{array} \right) \end{equation}
The algorithm now gets an extra level of inner loops:
for $k=1,\mathrm{nblocks}$:
Chol
: factor $L_kL_k^t \leftarrow A_{kk}$
for $\ell>k$:
Trsm
: solve $\tilde A_{\ell,k} \leftarrow A_{\ell,k}L_k\invt$
for $\ell_1,\ell_2>k$:
Gemm
: form the product $\tilde A_{\ell_1,k}\tilde A_{\ell_2,k}^t$
for $\ell_1,\ell_2>k$, $\ell_1\leq\ell_2$:
Syrk
: symmmetric rank-$k$ update
$A_{\ell_1,\ell_2}\leftarrow A_{\ell_1,\ell_2}
-\tilde A_{\ell_1,k}\tilde A_{\ell_2,k}^t$
\caption{Graph of task dependencies in a $4\times4$ Cholesky factorization.}
Now it is clear that the algorithm has a good deal of parallelism: the iterations in every $\ell$-loop can be processed independently. However, these loops get shorter in every iteration of the outer $k$-loop, so it is not immediate how many processors we can accommodate. On the other hand, it is not necessary to preserve the order of operations of the algorithm above. For instance, after
\begin{equation} L_1L_1^t=A_{11},\quad A_{21}\leftarrow A_{21}L_1\invt,\quad A_{22}\leftarrow A_{22}-A_{21}A_{21}^t \end{equation}
the factorization $L_2L_2^t=A_{22}$ can start, even if the rest of the $k=1$ iteration is still unfinished. Thus, there is probably a lot more parallelism than we would get from just parallelizing the inner loops.The best way to approach parallelism in this case is to shift away from a control flow view of the algorithm, where the sequence of operations is prescribed, to a data flow view. In the latter only data dependencies are indicated, and any ordering of operations that obeys these dependencies is allowed. (Technically, we abandon the program order of the tasks and replace it with a partial ordering
(Note: {Let's write $a\leq b$ if $a$ is executed before $b$, then the relation $\cdot\leq\cdot$ is a partial order if $a\leq b\wedge b\leq a\Rightarrow a=b$ and $a\leq b\wedge b\leq c\Rightarrow a\leq c$. The difference with a total ordering, such as program ordering, is that it is not true that $a\leq b\vee b\leq a$: there can be pairs that are not ordered, meaning that their time ordering is not prescribed.}
)
.) The best way of representing the data flow of an algorithm is by constructing a DAG (see section app:graph for a brief tutorial on graphs) of tasks. We add an edge $(i,j)$ to the graph if task $j$ uses the output of task $i$.
Exercise In section 2.6.1.6 you learned the concept of
sequential consistency
: a threaded parallel code program
should give the same results when executed in parallel as when it's
executed sequentially. We have just stated that
DAG
-based
algorithms are free to execute tasks in any order that obeys the
partial order of the graph nodes. Discuss whether
sequential consistency is a problem in this context.
End of exercise
In our example, we construct a DAG by making a vertex task for every inner iteration. Figure 6.11 shows the DAG of all tasks of matrix of $4\times4$ blocks. This graph is constructed by simulating the Cholesky algorithm above,
Exercise
What is the diameter of this graph? Identify the tasks that lie on
the path that determines the diameter. What is the meaning of these
tasks in the context of the algorithm?
End of exercise
This path is called the critical path . Its length determines the execution time of the computation in parallel, even if an infinite number of processors is available. (These topics were already discussed in section 2.2.4 .)
Exercise
Assume there are $T$ tasks that all take a unit time to execute, and
assume we have $p$ processors. What is the
theoretical minimum time to execute the algorithm? Now amend this
formula to take into account the critical path; call its length $C$.
End of exercise
In the execution of the tasks a DAG , several observations can be made.
If more than one update is made to a block, it is probably advantageous to have these updates be computed by the same process. This simplifies maintaining cache coherence .
If data is used and later modified, the use must be finished before the modification can start. This can even be true if the two actions are on different processors, since the memory subsystem typically maintains cache coherence, so the modifications can affect the process that is reading the data. This case can be remedied by having a copy of the data in main memory, giving a reading process data that is reserved (see section 1.4.1 ).