Using Cluster Resources

Asking for help

If you need any help regarding the cluster, how to submit jobs, etc., please send an email to:

And one of us will chime in to help.

Alternatively, you can try our #slurm Slack channel.

Please do not hesitate to ask, especially if you have concerns that your jobs may cause issues for the cluster. We're happy to help.

What is a cluster?

The term compute cluster is very broad and covers many different kinds of collections of computers.

In the context of our group, a compute cluster is typically a set of Intel-based computers running some version of the Linux operating system. For more information about the exact hardware we use, see this PDF.

There are a wide variety of ways to make use of such clusters. The simplest model is to make each machine accessible via the ssh command, which is found on all Unix-based computers and can be installed on most Windows computers.

That model quickly breaks down when large numbers of users try to share large numbers of computers to run large numbers of compute-intensive jobs.


Cluster overview

Each numbered line in the cluster diagram above corresponds to a class of machines:

  1. (r10...) Main compute nodes - can be used for compute-intensive jobs
    • Parallelized jobs are submitted via SLURM
  2. (fantasia...) Gateway nodes - where your main home directory lives
    • Run only a few simple, non-resource-intensive jobs here
    • Overloading a gateway will slow everyone down
    • Do not store large amounts of project data in your home directory
      • I/O-intensive jobs that read/write to your home directory will degrade the gateway for EVERYONE
      • Instead, run I/O-intensive jobs on project nodes, reading/writing from/to project node storage
  3. (1000g...) Project nodes - for individual sequencing projects
    • Large storage, dedicated compute nodes
    • All large project data should be stored on Project nodes
  4. (m10...) Project compute nodes - for I/O intensive jobs
    • Alignment, variant calling, or other jobs needing direct access to BAMs (which should be located on the project node)

Appropriate use of nodes/hosts

Remember that everyone shares use of the gateways and main compute nodes. We need to work together and follow best practices so we can all make use of them.

Monitoring CPU/memory usage

It is your responsibility to know the memory and CPU requirements of your jobs before running them on a gateway or the cluster.

You should be familiar with htop before running jobs on a gateway. htop is used to monitor CPU and memory usage of processes/jobs running on a machine. Please watch a tutorial on htop, such as:

Do not abuse the gateways!

  • If you need to run a large number of jobs, you should be using the SLURM cluster instead.
  • Only a very small number of short lived jobs should be run on a gateway, and only if the gateway is not currently running a large number of jobs.
  • Too many jobs on a gateway will overload it, slowing down everyone else's work.
  • Using too much memory will overload the gateway and may require a machine reboot, which will cause loss of work for other users. We have specific project nodes with high amounts of system memory, if you need them.
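
As a quick check before starting anything on a gateway, you can see the current load, core count, and free memory from the command line (a minimal sketch; htop shows the same information interactively):

 uptime     # load averages; if these approach the core count, the machine is already busy
 nproc      # number of CPU cores on this machine
 free -h    # total/used/available memory in human-readable units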

Monitoring disk/network usage

Use dstat to see aggregate disk/network bandwidth usage on a gateway machine.

Remember that bandwidth is limited. Too many disk accesses in parallel to a gateway can overload the gateway, making it VERY slow.

If you submit a large number of jobs that read data from a gateway, please check the gateway regularly with dstat and also try listing a few directories to get a sense of whether your jobs are affecting the performance of the machine.
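
For example, to watch disk and network throughput on a gateway while your jobs run (a rough sketch; the exact columns vary by dstat version):

 dstat -d -n 5    # disk read/write and network recv/send, sampled every 5 seconds

If those numbers stay pinned near the machine's limits, or simple commands such as ls become sluggish, dial back the number of running jobs.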

When possible, run I/O intensive jobs on project clusters reading from/writing to project storage rather than gateways.

Project data belongs on project machines, not in your main gateway home directories. Talk to Sean about moving your data to an appropriate place.

Running jobs on gateways

If your jobs will run for a long period of time, they should be submitted to the SLURM cluster.

In some cases, however, you may wish to run a very small number of them on a gateway. Reasons for doing so might be:

  • Test how long a typical job takes to run
  • Test memory usage of a typical job
  • You only need to run a tiny number of jobs for your whole project

Before doing so:

  • Ensure the machine is not already heavily loaded: run htop and look at the overall CPU usage.
  • The number of CPU cores can be found in htop, or by running lscpu (see the sketch after this list). You should NEVER run more jobs on a gateway than there are available CPU cores.
  • Submit only a small number of jobs, such that roughly 50-75% of CPU cores on the gateway remain available for others to use. If you need more than that, you should be running jobs on the cluster.
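
A minimal sketch of sizing a handful of gateway jobs against the available cores (myjob.sh is a hypothetical job script):

 lscpu | grep '^CPU(s):'    # total number of CPU cores on the gateway
 htop                       # check how many of those cores are already busy
 # if plenty of cores are idle, start only a small number of jobs in the background:
 for i in 1 2 3 4; do
   nohup ./myjob.sh $i > job$i.log 2>&1 &
 done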

To make sure your jobs continue running when you log out of the gateway, use one of the following methods:

  • nohup
 nohup myjob.sh > myjob.log 2>&1 &   # redirect output; otherwise it is appended to nohup.out
  • tmux
    • Allows you to disconnect from your terminal, reconnect later, and pull up the same terminal.
    • Can save multiple sessions, and window "panes" within each session.
    • Similar to screen but more readily maintained and still under active development
    • Can do split windows and manage multiple sessions easily
    • Scriptable
    • Tmux Tutorial

An example of tmux running multiple windows within one session:
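
For reference, a rough command-line sketch of the same workflow (the session name is a placeholder):

 tmux new -s myproject       # start a new named session
 # inside the session: Ctrl-b c opens a new window, Ctrl-b n / Ctrl-b p switch between windows
 # Ctrl-b d detaches from the session, leaving everything running
 tmux ls                     # list existing sessions
 tmux attach -t myproject    # reattach later, even from a new login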

Running jobs on the cluster (SLURM)

We keep our cluster SLURM documentation, along with examples of how to use SLURM, in a github repository:

https://github.com/statgen/SLURM-examples

The following slide decks provided by the SLURM developers for training purposes may also be helpful:

SLURM User Introduction

SLURM User Advanced

SLURM Administrator Introduction

SLURM Administrator Advanced

Please follow this guide when starting off with SLURM.
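
As a minimal illustration only (the SLURM-examples repository above is the authoritative reference; the partition, resource values, and file names below are placeholders), a batch script might look like:

 #!/bin/bash
 #SBATCH --job-name=myjob
 #SBATCH --partition=nomosix       # placeholder; choose the partition appropriate for your work
 #SBATCH --cpus-per-task=1
 #SBATCH --mem=4G
 #SBATCH --time=02:00:00
 #SBATCH --output=myjob_%j.log     # %j expands to the SLURM job ID

 ./myjob.sh                        # hypothetical analysis script

Submit it with sbatch myjob.sbatch and monitor it with squeue -u $USER.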

WARNINGS

  • Do not run MPI jobs with SLURM; use mpirun instead. This will change when openmpi is updated to version 1.5 or later.
  • Compute-bound jobs scale well and are generally safe to run many copies of.
  • I/O-bound jobs do not scale well - think, plan, and experiment carefully before running more than 10-20 I/O-intensive jobs at once.
  • Use dstat on the gateway you are accessing data from to see how much disk bandwidth is being used. If the machine begins to feel unresponsive, dial back your jobs accordingly.

Running jobs on project miniclusters

Many projects have their own mini-clusters. Each has its own gateway node (got2d, t2d-genes, topmed, sardinia, etc.) along with a number of compute nodes.

If you are working on a particular project, you may wish to submit jobs specifically to that project's cluster nodes. Use scontrol show partitions to see which partitions are available. When submitting jobs, simply add --partition=NAME1,NAME2,NAME3 to your sbatch or srun command.
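
For example (partition names here are placeholders; use whatever scontrol reports for your project):

 scontrol show partitions                         # list the partitions you have access to
 sbatch --partition=topmed,nomosix myjob.sbatch   # the job may run in any of the listed partitions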

Sometimes, mini-clusters may be idle and available for other users to use, even if the work is not related to that project.

Send an email to csgclusters@umich.edu to request usage of project-specific compute nodes. An appropriate case for this might be if you are under a deadline crunch.

Preemption in SLURM explained

To maximize hardware utilization, most project mini-cluster nodes are members of two SLURM partitions: their respective mini-cluster partition and the "nomosix" partition.

This is not always immediately apparent because SLURM will not show mini-cluster partitions in sinfo unless you explicitly have access to them.

Under normal circumstances, mini-cluster nodes will pick up work from the "nomosix" partition if they are idle. They will run jobs from "nomosix" until they receive jobs via their higher-priority mini-cluster partition. When this happens, work from the lower-priority "nomosix" partition is pre-empted (terminated) and the node works on jobs submitted via the mini-cluster partition until none remain queued there. Once the mini-cluster work is exhausted, the node begins picking up work from the "nomosix" partition again.
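
To check which partitions a particular node belongs to (and therefore whether jobs on it can be pre-empted), you can ask SLURM directly; the node name below is just an example taken from the exclude list further down:

 scontrol show node r6301 | grep Partitions    # shows the Partitions= field for that node
 sinfo -N -o "%N %P"                           # node-to-partition mapping for the partitions visible to you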

If you are concerned about your job being pre-empted (especially for long-running jobs, where pre-emption can be particularly painful), either submit to the "main" partition, which excludes all mini-cluster nodes:

sbatch --partition=main ...

Or use the SLURM --exclude parameter when submitting jobs to "nomosix" to restrict the scheduler to nodes that are not also members of a mini-cluster partition:

sbatch --exclude="list_of_nodes" ...

You can copy and paste the following --exclude to omit all nodes that are in any partition besides nomosix or main:

--exclude="1000g-mc[01-04],amd-mc[01-04],assembly-mc[01-04],bipolar-mc[01-08],dl[3601,3603-3619],esp-mc[01-04],finnseq-mc[01-04],got2d-mc[01-04],hunt-mc[01-13],psoriasis-mc[01-04],r[6301-6332,6334-6335],sardinia-mc[01-04],t2dgenes-mc[01-04]"

Nodes that are not members of any mini-cluster partition, and on which running jobs should therefore never be pre-empted, include:

c[01-52]
r[01-05]
r[10-30]
sun[01-10]