Batch systems

52.1 : Cluster structure
52.2 : Queues
52.2.1 : Queue limits
52.3 : Job running
52.3.1 : The job submission cycle
52.4 : The script file
52.4.1 : sbatch options
52.4.2 : Environment
52.5 : Parallelism handling
52.5.1 : MPI jobs
52.5.2 : Threaded jobs
52.5.3 : Parameter sweeps / ensembles / massively parallel
52.6 : Job running
52.7 : Scheduling strategies
52.8 : File systems
52.9 : Examples
52.9.1 : Job dependencies
52.9.2 : Multiple runs in one script
52.10 : Review questions

52 Batch systems

Supercomputer clusters can have a large number of nodes, but not enough to let all their users run simultaneously at the scale they want. Therefore, users are asked to submit jobs, which may start executing immediately, or may have to wait until resources are available.

The decision of when to run a job, and what resources to give it, is not made by a human operator but by software called a batch system. (The Stampede cluster at TACC ran close to 10 million jobs over its lifetime, which corresponds to starting a job every 20 seconds.)

This tutorial will cover the basics of such systems, and in particular SLURM.

52.1 Cluster structure

A supercomputer cluster usually has two types of nodes:

  • login nodes, and
  • compute nodes.

When you make an ssh connection to a cluster, you are connecting to a login node. The number of login nodes is small, typically less than half a dozen.

Exercise

Connect to your favourite cluster. How many people are on that login node? If you disconnect and reconnect, do you find yourself on the same login node?
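
As a hint for this exercise, a sketch using standard Unix tools rather than anything slurm-specific: hostname reports which login node you landed on, and who lists the logged-in users.

hostname                                   # which login node am I on?
who | awk '{print $1}' | sort -u | wc -l   # how many distinct users are logged in?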

Compute nodes are where your jobs are run. Different clusters have different structures here:

  • Compute nodes can be shared between users, or they can be assigned exclusively.

    • Sharing makes sense if user jobs have less parallelism than the core count of a node.
    • … on the other hand, it means that users sharing a node can interfere with each other's jobs, with one job using up memory or bandwidth that the other job needs.
    • With exclusive nodes, a job has access to all the memory and all the bandwidth of that node.
  • Clusters can be homogeneous, having the same processor type on each compute node, or they can have more than one processor type. For instance, the TACC Stampede2 cluster has Intel Knights Landing and Intel Skylake nodes.
  • Often, clusters have a number of `large memory' nodes, with on the order of a terabyte of memory or more. Because of the cost of such hardware, there is usually only a small number of these nodes.
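
To get an impression of this structure on a particular cluster, sinfo can report per-node properties. A minimal sketch; the format specifiers below give node name, core count, and memory, but the exact output depends on the cluster.

sinfo -N -o "%N %c %m"   # per node: name, core count, memory in megabytes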

52.2 Queues

Jobs often cannot start immediately, because not enough resources are available, or because other jobs may have higher priority (see section 52.7). It is thus typical for a job to be put on a queue, scheduled, and started by a batch system such as SLURM.

Batch systems do not put all jobs in one big pool: jobs are submitted to one of a number of queues, which are scheduled separately.

Queues can differ in the following ways:

  • If a cluster has different processor types, those are typically in different queues. Also, there may be separate queues for the nodes that have a GPU attached. Having multiple queues means you have to decide what processor type you want your job to run on, even if your executable is binary compatible with all of them.
  • There can be `development' queues, which have restrictive limits on runtime and node count, but where jobs typically start faster.
  • Some clusters have `premium' queues, which have a higher charge rate, but offer higher priority.
  • `Large memory nodes' are typically also in a queue of their own.
  • There can be further queues for jobs with large resource demands, such as large core counts, or longer-than-normal runtimes.

For slurm, the sinfo command can tell you a lot about the queues.

# what queues are there?
sinfo -o "%P"
# what queues are there, and what is their status?
sinfo -o "%20P %.5a"

Exercise

Enter these commands. How many queues are there? Are they all operational at the moment?

52.2.1 Queue limits

Queues have limits on

  • the runtime of a job;
  • the node count of a job; or
  • how many jobs a user can have in that queue.
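
These limits can usually be read off with sinfo as well; a sketch using format specifiers for the partition's time limit, maximum job size in nodes, and total node count:

sinfo -o "%20P %10l %10s %6D"   # partition, time limit, max job size (nodes), node count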

52.3 Job running

There are two main ways of starting a job on a cluster that is managed by slurm. You can start a program run synchronously with srun, but this may hang until resources are available. In this section, therefore, we focus on asynchronously executing your program by submitting a job with sbatch.

52.3.1 The job submission cycle

In order to run a batch job, you need to write a job script, or batch script. This script describes what program you will run, where its inputs and outputs are located, how many processes it can use, and how long it will run.

In its simplest form, you submit your script without further parameters:

sbatch yourscript

All options regarding the job run are contained in the script file, as we will now discuss.

As a result of your job submission you get a job id. After submission you can query your job with squeue:

squeue -j 123456

or query all your jobs:

squeue -u yourname

The squeue command reports various aspects of your job, such as its status (typically pending or running); and, if it is running, the queue (or `partition') in which it runs, its elapsed time, and the nodes on which it runs.

squeue -j 5807991
  JOBID   PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
5807991 development packingt eijkhout  R   0:04      2 c456-[012,034]

If you discover errors in your script after submitting it, including when it has started running, you can cancel your job with scancel:

scancel 1234567

52.4 The script file

A job script looks like an executable shell script:

  • It has an `interpreter' line such as

    #!/bin/bash
    

    at the top, and

  • it contains ordinary unix commands, including
  • the (parallel) startup of your program:

    # sequential program:
    ./yourprogram youroptions
    # parallel program, general:
    mpiexec -n 123 parallelprogram options
    # parallel program, TACC:
    ibrun parallelprogram options
    
  • … and then it has many options specifying the parallel run.

52.4.1 sbatch options

In addition to the regular unix commands and the interpreter line, your script has a number of SLURM directives, each starting with #SBATCH. (This makes them comments to the shell interpreter, so a batch script is actually a legal shell script.)

Directives have the form

#SBATCH -option value

Common options are:

  • -J : the jobname. This will be displayed when you call squeue .
  • -o : name of the output file. This will contain all the stdout output of the script.
  • -e : name of the error file. This will contain all the stderr output of the script, as well as slurm error messages.

    It can be a good idea to make the output and error file names unique per job. For this purpose, the macro %j is available, which at execution time expands to the job number. You will then get an output file with a name such as myjob.o2384737 .

  • -p : the partition or queue. See above.
  • -t hh:mm:ss : the maximum running time. If your job exceeds this, it will get cancelled.

    1. You can not specify a duration here that is longer than the queue limit.
    2. The shorter your job, the more likely it is to get scheduled sooner rather than later.
  • -w c452-[101-104,111-112,115] : specific nodes on which to place the job.
  • -A : the name of the account to which your job should be billed.
  • --mail-user=you@where Slurm can notify you when a job starts or ends. You may for instance want to connect to a job when it starts (to run top ), or inspect the results when it's done, but not sit and stare at your terminal all day. The action of which you want to be notified is specified with (among others) --mail-type=begin/end/fail/all
  • --dependency=after:1234567 indicates that this job is to start after job 1234567 has finished. Use afterok to start only if that job finished successfully. (See https://cvw.cac.cornell.edu/slurm/submission_depend for more options.)
  • --nodelist allows you to specify specific nodes. This can be good for getting reproducible timings, but it will probably increase your wait time in the queue.
  • --array=0-30 is a specification for `array jobs': a task that needs to be executed for a range of parameter values.

    TACC note

    Array jobs are not supported at TACC; use a launcher instead; see section 52.5.3 .

  • --mem=10000 specifies the desired amount of memory per node. Default units are megabytes, but can be explicitly indicated with  K/M/G/T .

    TACC note

    This option can not be used to request arbitrary memory: jobs always have access to all available physical memory, and use of shared memory is not allowed.

See https://slurm.schedmd.com/sbatch.html for a full list.
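
Putting several of these options together, a minimal sketch of a complete batch script; the queue name, account, and program name are placeholders that you need to adapt to your cluster.

#!/bin/bash
#SBATCH -J myjob            # job name, shown by squeue
#SBATCH -o myjob.o%j        # stdout file; %j expands to the job number
#SBATCH -e myjob.e%j        # stderr file
#SBATCH -p development      # queue (partition); cluster-specific placeholder
#SBATCH -t 00:10:00         # maximum runtime hh:mm:ss
#SBATCH -N 2                # number of nodes
#SBATCH -n 8                # total number of MPI tasks
#SBATCH -A yourproject      # account to charge; placeholder

mpiexec -n 8 ./yourprogram youroptions   # on TACC: ibrun ./yourprogram youroptions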

Exercise

Write a script that executes the date command twice, with a sleep in between. Submit the script and investigate the output.

52.4.2 Environment

Your job script acts like any other shell script when it is executed. In particular, it inherits the calling environment with all its environment variables. Additionally, slurm defines a number of environment variables, such as the job ID, the hostlist, and the node and process count.
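
For instance, a sketch of a script fragment that prints a few of these slurm-defined variables:

echo "Job ${SLURM_JOB_ID} uses ${SLURM_JOB_NUM_NODES} nodes: ${SLURM_JOB_NODELIST}"
echo "Total number of tasks: ${SLURM_NTASKS}"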

52.5 Parallelism handling

We discuss parallelism options separately.

52.5.1 MPI jobs

On most clusters the compute nodes contain one or more multi-core processors. Thus, you want to specify both the node count and the core count; for this there are the options -N and -n respectively.

#SBATCH -N 4              # Total number of nodes
#SBATCH -n 4              # Total number of mpi tasks

It would be possible to specify only the node count or the core count, but that takes away flexibility:

  • If a node has 40 cores, but your program stops scaling at 10 MPI ranks, you would use:

    #SBATCH -N 1
    #SBATCH -n 10
    
  • If your processes use a large amount of memory, you may want to leave some cores unused. On a 40-core node you would either use

    #SBATCH -N 2
    #SBATCH -n 40
    

    or

    #SBATCH -N 1
    #SBATCH -n 20
    

Rather than specifying a total task count with -n, you can also specify the number of tasks per node with --ntasks-per-node .
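
For instance, on 40-core nodes (the node size here is an assumption) the following two specifications request the same process layout:

# total task count:
#SBATCH -N 2
#SBATCH -n 80

# equivalently, task count per node:
#SBATCH -N 2
#SBATCH --ntasks-per-node=40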

Exercise

Go through the above examples and replace the -n option by an equivalent --ntasks-per-node value.

Python note

Python programs using mpi4py should be treated like other MPI programs, except that instead of an executable name you specify the python executable and the script name:

ibrun python3 mympi4py.py

52.5.2 Threaded jobs

The above discussion was mostly of relevance to MPI programs. Some other cases:

  • For pure-OpenMP programs you need only one node, so the -N value is 1. Maybe surprisingly, the -n value is also 1, since only one process needs to be created: OpenMP uses thread-level parallelism, which is specified through the OMP_NUM_THREADS environment variable.
  • A similar story holds for the Matlab parallel computing toolbox (note: not the distributed computing toolbox), and the Python multiprocessing module.
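
To make the pure-OpenMP case concrete, a sketch of the relevant script lines, assuming a 40-core node that is allocated to you exclusively:

#SBATCH -N 1                 # one node
#SBATCH -n 1                 # a single process
export OMP_NUM_THREADS=40    # threads within that process; assumes a 40-core node
./your_openmp_program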

Exercise

What happens if you specify an  -n value greater than 1 for a pure-OpenMP program?

For hybrid MPI-OpenMP programs, you use a combination of slurm options and environment variables, such that, for instance, the product of --ntasks-per-node and OMP_NUM_THREADS does not exceed the core count of the node.
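
For instance, on 40-core nodes (again an assumption) four MPI tasks per node with ten threads each use each node fully:

#SBATCH -N 2                      # two nodes
#SBATCH --ntasks-per-node=4       # four MPI ranks per node
export OMP_NUM_THREADS=10         # 4 ranks x 10 threads = 40 cores per node
ibrun ./your_hybrid_program       # on TACC; elsewhere: mpiexec -n 8 ./your_hybrid_program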

52.5.3 Parameter sweeps / ensembles / massively parallel

So far we have focused on jobs where a single parallel executable is scheduled. However, there are use cases where you want to run a sequential (or very modestly parallel) executable for a large number of inputs. This is variously called a parameter sweep or an ensemble.

Slurm can support this itself with array jobs, though there are more sophisticated launcher tools for such purposes.
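
A sketch of such an array job: each array task gets its own value of the SLURM_ARRAY_TASK_ID variable, which can be used to select an input; the program and file names are placeholders.

#!/bin/bash
#SBATCH --array=0-30
#SBATCH -N 1
#SBATCH -n 1
./yourprogram input_${SLURM_ARRAY_TASK_ID}.txt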

TACC note

TACC clusters do not support array jobs. Instead, use the launcher or pylauncher modules.

52.6 Job running

When your job is running, its status is reported as R by squeue . That command also reports which nodes are allocated to it.

squeue -j 5807991
       JOBID   PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
     5807991 development packingt eijkhout  R       0:04      2 c456-[012,034]

You can then ssh into the compute nodes of your job; normally, compute nodes are off-limits. This is useful if you want to run top to see how your processes are doing.

52.7 Scheduling strategies

A batch system looks at resource availability and the user's priority to determine when a job can be run.

Of course, if a user is requesting a large number of nodes, it may never happen that that many become available simultaneously, so the batch system will force the availability. It does so by determining a time when that job is set to run, and then letting nodes go idle so that they are available at that time.

An interesting side effect of this is that, right before the really large job starts, a `fairly' large job can be run, if it only has a short running time. This is known as backfill, and it may cause jobs to be run earlier than their priority would warrant.

52.8 File systems

File systems come in different types:

  • They can be backed-up or not;
  • they can be shared or not; and
  • they can be permanent or purged.

On many clusters each node has a local disc, either spinning or a RAM disc. This is usually limited in size, and should only be used for temporary files during the job run.

Most of the file system lives on discs that are part of RAID arrays. These arrays have enough redundancy to make them fault-tolerant, and in aggregate they form a shared file system: one unified file system that is accessible from any node, and where files can take on (almost) any size, in particular much larger than any individual disc in the system.

TACC note

The HOME file system is limited in size, but is both permanent and backed up. Here you put scripts and sources.

The WORK file system is permanent but not backed up. Here you can store output of your simulations. However, currently the work file system can not immediately sustain the output of a large parallel job.

The SCRATCH file system is purged, but it has the most bandwidth for accepting program output. This is where you would write your data. After post-processing, you can then store results on the work file system, or write them to tape.
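
As a sketch, assuming TACC-style environment variables HOME, WORK, and SCRATCH that point to these file systems (the directory and file names are placeholders), a job could run in scratch and keep only the post-processed results on work:

cd $SCRATCH/myrun                # run where the output bandwidth is highest
ibrun ./yourprogram              # large output lands on scratch
cp summary.dat $WORK/results/    # keep only the post-processed result permanently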

Exercise

If you install software with cmake, you typically have

  1. a script with all your cmake options;
  2. the sources;
  3. the installed header and binary files; and
  4. temporary object files and such.

How would you organize these entities over your available file systems?

52.9 Examples

Very sketchy section.

52.9.1 Job dependencies

#!/bin/sh

# Launch the first job, capturing its job id from the sbatch output
JOB=`sbatch job.sh | egrep -o -e "\b[0-9]+$"`

# Launch a job that should run only if the first one finishes successfully
sbatch --dependency=afterok:${JOB} after_success.sh

# Launch a job that should run only if the first job is unsuccessful
sbatch --dependency=afternotok:${JOB} after_fail.sh

52.9.2 Multiple runs in one script

# start the main parallel run in the background
ibrun ./yourprogram &
sleep 10
# inspect each node that is allocated to this job
for h in $( scontrol show hostnames $SLURM_JOB_NODELIST ) ; do
  ssh $h "top -b -n 1 | head -20"
done
# wait for the background parallel run to finish
wait

52.10 Review questions

For all true/false questions, if you answer False, what is the right answer and why?

Exercise

T/F? When you submit a job, it starts running immediately once sufficient resources are available.

Exercise

T/F? If you submit the following script:

#!/bin/bash
#SBATCH -N 10
#SBATCH -n 10
echo "hello world"

you get 10 lines of `hello world' in your output.

Exercise

T/F? If you submit the following script:

#!/bin/bash
#SBATCH -N 10
#SBATCH -n 10
hostname

you get the hostname of the login node from which your job was submitted.

Exercise

Which of these are shared with other users when your job is running:

  • Memory;
  • CPU;
  • Disc space?

Exercise

What is the command for querying the status of your job?

  • sinfo
  • squeue
  • sacct

Exercise

On 4 nodes with 40 cores each, what is the largest program you can run, measured in

  • MPI ranks;
  • OpenMP threads?