S. Alireza Ghasemi's homepage

The gavazang cluster runs the Qlustar operating system, which is based on Ubuntu 14.04. The cluster contains fifteen compute nodes with the following specifications:

psm01-psm05: Intel(R) Xeon(R) CPU E5-2650 v2 (2.60GHz) processors and 32GB memory.
psm06-psm12: Intel(R) Xeon(R) CPU E5-2630 v3 (2.40GHz) processors and 64GB memory.
psm13-psm15: Intel(R) Xeon(R) CPU E5-2650 v4 (2.20GHz) processors and 128GB memory.

The gavazang cluster manages resources using the Simple Linux Utility for Resource Management (SLURM).
An implementation of Environment Modules is installed on the cluster. The Environment Modules package lets you modify your environment dynamically via modulefiles, which typically instruct the module command to set or alter shell environment variables such as PATH and MANPATH, and to define aliases, across a variety of shells.

Environment Modules:
Here we provide a brief explanation of basic commands of the environment modules package.

module avail:
It provides the list of modules which are available.
module list:
It provides the list of modules which are loaded.
module load module_name:
It loads the module module_name.
module unload module_name:
In principle, it unloads the module module_name; however, it currently does not work properly on gavazang.
module switch module_name1 module_name2:
In principle, it switches between two modules; however, it currently does not work properly on gavazang.

Currently, the main output of the command module avail is:

intel/13.1.0               module-info                openmpi/1.8.3/intel/13.1.0 QuantumESPRESSO/5.1.2

You can, for instance, load the module intel/13.1.0. You can then use the Intel Fortran and C/C++ compilers, i.e. ifort and icc.
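As a minimal sketch of the step above, the following snippet loads intel/13.1.0 and reports whether the compilers are on the PATH. The helper name check_tool is our own and not part of Environment Modules; the call to module is guarded so the snippet does nothing harmful on a machine where the module command is not defined.

```shell
# Hypothetical helper: report whether a command can be found on PATH.
check_tool() {
    if command -v "$1" >/dev/null 2>&1; then
        echo "$1: found"
    else
        echo "$1: not found"
    fi
}

# 'module' is defined by Environment Modules; skip the load if it is missing.
if command -v module >/dev/null 2>&1; then
    module load intel/13.1.0
    module list
fi

# After a successful load, both compilers should be reported as found.
check_tool ifort
check_tool icc
```

Since module unload currently does not work properly on gavazang, starting a new shell is probably the simplest way to return to a clean environment.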

Basic commands in SLURM:
SLURM commands are thoroughly explained on the SLURM homepage. Here we present a minimal list of the commands you need on gavazang.

squeue:
It provides information about jobs located in the SLURM scheduling queue.
sbatch:
sbatch is used to submit a job script for later execution. The script will typically contain one or more srun commands to launch parallel tasks.
scancel:
scancel is used to cancel a pending or running job or job step.
srun:
srun is used to run a command on allocated compute nodes.
scontrol show job job_id:
It shows information about the job with job ID job_id.
sreport:
It generates reports from the SLURM accounting data. To obtain information on the CPU usage of all users, run:
sreport cluster AccountUtilizationByUser Start=2017-03-21T00:00 End=2018-03-20T00:00 -t hour
sinfo:
It provides information about SLURM nodes and partitions.
An alias of sinfo with a short and useful list of outputs is available in the bashrc.

How to compile and run a Fortran or C program on gavazang:
mpi01.f90 and mpi01.c are simple Fortran and C MPI programs. You also need a SLURM batch script; a simple example can be downloaded from run. Then load a compiler module as explained above and compile the Fortran example by executing
mpif90 mpi01.f90
or the C example
mpicc mpi01.c
The default name of the executable created by the compiler is a.out. Finally, you can submit the job to the queue with sbatch run. It is recommended to redirect the output of sbatch to a file:
sbatch run>SUBMIT
This way you can easily find out later which job ID is associated with the directory.
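A minimal sketch of how the saved file can be used later is given below. It assumes that sbatch wrote its usual message, "Submitted batch job <id>", into SUBMIT; the function name job_id_from_file is our own choice.

```shell
# Hypothetical helper: print the job ID stored in a file written by
# "sbatch run>SUBMIT". Assumes the message "Submitted batch job <id>".
job_id_from_file() {
    awk '/^Submitted batch job/ { print $4; exit }' "$1"
}

# Show the job details only if the file exists and SLURM is installed.
if [ -f SUBMIT ] && command -v scontrol >/dev/null 2>&1; then
    scontrol show job "$(job_id_from_file SUBMIT)"
fi
```

The same ID can be passed to scancel to cancel the job that was submitted from the directory.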

The content of the SLURM script (run in the example above) is shown below:

#!/bin/sh
#SBATCH --job-name="test"
#SBATCH -n 4
#SBATCH --output=bachoutput
#SBATCH --nodes=1-1
#SBATCH -p all
#SBATCH --time=00:20:00
mpirun -n 4 a.out >o1

The second line specifies the job name. The third line, -n 4, sets the number of processors to allocate. The fourth line, --output=bachoutput, specifies the name of the SLURM output file, to which SLURM error and warning messages are written. --nodes=1-1 specifies that all requested processors must be allocated on a single node. This is particularly important for jobs with extensive communication, such as those of users running DFT packages. -p all specifies the partition (queue). The full list of partitions is given in the section on partitions. --time=d-h:m:s specifies the requested time, after which the job will be killed by SLURM.
Please note that there is a bottleneck of 1 Gbps for data transfer from the switch to the head node, and it is shared among all nodes. Therefore, jobs with too much I/O are not allowed to run on the cluster. By too much I/O we also mean jobs that do not use much disk space but repeatedly write, read and overwrite small files. However, a job with a considerable amount of I/O may be allowed to run on the cluster provided it uses the local /scratch directory on each node. This means the outputs are not available on the head node, since data on /scratch are not synchronized with the head node. An example of a script for running jobs in the /scratch directory is run_scratch. The script first copies the files in the run directory to a directory under /scratch and changes to that directory. The program is then executed and, after it terminates, the output files are copied from the /scratch directory back to the original directory.
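The three steps described above (copy in, run, copy back) can be sketched as follows. This is not the content of run_scratch, which is not reproduced here; the function name run_in_scratch and the directory layout /scratch/$USER/<job id> are assumptions made for illustration. SLURM_JOB_ID and SLURM_SUBMIT_DIR are environment variables that SLURM sets inside a batch job.

```shell
#!/bin/sh
#SBATCH --job-name="test"
#SBATCH -n 4
#SBATCH --output=bachoutput
#SBATCH --nodes=1-1
#SBATCH -p all
#SBATCH --time=00:20:00

# Hypothetical helper: run a command in a node-local working directory.
#   $1     directory holding the input files (the run directory)
#   $2     working directory to create on local scratch
#   $3...  command to execute inside the working directory
run_in_scratch() {
    origdir=$1
    workdir=$2
    shift 2
    mkdir -p "$workdir" || return 1
    # Step 1: copy the files of the run directory to scratch.
    cp -R "$origdir"/. "$workdir"/ || return 1
    # Step 2: change to the scratch directory and run the program.
    ( cd "$workdir" && "$@" )
    status=$?
    # Step 3: copy everything back, then clean up the scratch directory.
    cp -R "$workdir"/. "$origdir"/ && rm -rf "$workdir"
    return $status
}

# Only act inside a SLURM job; the scratch path below is an assumed layout.
if [ -n "${SLURM_JOB_ID:-}" ]; then
    run_in_scratch "$SLURM_SUBMIT_DIR" "/scratch/$USER/$SLURM_JOB_ID" \
        sh -c 'mpirun -n 4 ./a.out >o1'
fi
```

Note that the copy-back step overwrites files in the run directory with the copies from scratch. If the SLURM output file (bachoutput here) lives in the run directory, it may be safer to copy back only the files the program produces.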

List of SLURM partitions in gavazang:

all:
The partition all runs over all twelve nodes with MAXTIME of 2 days.
snodes:
The partition snodes runs through nodes psm01 to psm05 with MAXTIME of 7 days.
qlong:
The partition qlong runs over only node psm05 with MAXTIME of 30 days.
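As a small illustration of the MAXTIME limits listed above, the sketch below picks a partition for a requested run time given in whole days. The function name pick_partition and the policy of choosing the partition with the shortest MAXTIME that still fits are our own assumptions, not a rule of the cluster.

```shell
# Hypothetical helper: choose a partition from the requested number of days,
# using the MAXTIME values listed above (all: 2, snodes: 7, qlong: 30).
pick_partition() {
    days=$1
    if [ "$days" -le 2 ]; then
        echo all
    elif [ "$days" -le 7 ]; then
        echo snodes
    elif [ "$days" -le 30 ]; then
        echo qlong
    else
        echo "no partition allows $days days" >&2
        return 1
    fi
}

pick_partition 5    # prints: snodes
```

Options given on the sbatch command line take precedence over the #SBATCH lines in the script, so a five-day job could be submitted with sbatch -p "$(pick_partition 5)" --time=5-00:00:00 run.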

Changing password:
In order to change your password, you must be on the head node. You can then change your password by running the command yppasswd.

Recommended settings:

bashrc:
It is recommended to download bashrc and append its contents to the end of ~/.bashrc.
vimrc:
It is recommended to download vimrc and rename it to ~/.vimrc.
