Example Run Scripts - High Performance Computing Wales. HPC User Guide. Version 2.2

8.3.1 Example 1

A simple single-line command can be submitted to run on the compute nodes using the following syntax:

$ bsub -n 1 -o output.%J command [options]

where -n specifies the number of CPU cores required and -o specifies the output file. When the output file is given as -o output.%J, LSF replaces %J with the job ID number.
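The %J substitution is performed by LSF itself. The short sketch below only mimics it with a bash string replacement, to show what the resulting file name looks like; expand_output_name is a hypothetical helper and the job ID is an example value.

```shell
#!/bin/bash
# Mimic LSF's %J expansion for illustration only; LSF does this itself.
# expand_output_name is a hypothetical helper, not an LSF command.
expand_output_name() {
    local pattern=$1 jobid=$2
    # Replace every literal %J in the pattern with the job ID
    printf '%s\n' "${pattern//\%J/$jobid}"
}

# Example job ID; a real one is assigned by LSF at submission
expand_output_name "output.%J" 202999
```

Because each job receives a different ID, successive submissions of the same command write to different output files.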

8.3.2 Example 2

The following more complicated example runs an executable compiled with the Intel compiler and Intel MPI:

#!/bin/bash --login
#BSUB -o example.o.%J
#BSUB -x
#BSUB -n 24                 # Number of cores to use
#BSUB -W 1:00               # 1 hour time limit
#BSUB -q q_cf_htc_work      # Submit to a specific queue

# load the intel compiler and MPI modules
module purge
module load compiler/intel
module load mpi/intel

MYPATH=$HOME/mycode_directory
executable=$MYPATH/bin/my_executable_name

# Run in specified directory WDPATH
WDPATH=$MYPATH/RESULTS
cd ${WDPATH}

# $LSB_DJOB_NUMPROC is supplied by LSF and is number of processes
mpirun -np $LSB_DJOB_NUMPROC $executable

Note that the modules needed by the Intel-compiled code are loaded first, and that the executable is assumed to reside in a different directory from the one in which the job runs.
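Since $LSB_DJOB_NUMPROC is only set inside an LSF job, a script like the one above cannot be tried directly on a login node. One possible approach, sketched below, is to fall back to a default process count when the variable is missing. The helper name nproc_for_job and the default of 1 are assumptions made for this illustration.

```shell
#!/bin/bash
# Sketch: choose the MPI process count, with a fallback for runs outside LSF.
# nproc_for_job is a hypothetical helper name; the fallback value of 1 is an
# assumption made for this illustration.
nproc_for_job() {
    # Inside an LSF job LSB_DJOB_NUMPROC is set; otherwise default to 1
    printf '%s\n' "${LSB_DJOB_NUMPROC:-1}"
}

NPROC=$(nproc_for_job)
# Print the command instead of running it, so the sketch has no side effects
echo "would run: mpirun -np $NPROC \$executable"
```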

8.3.3 Example 3

The following builds on Example 2, but now runs the job in the user’s Lustre global scratch directory (/scratch/$USER), deleting the job’s working sub-directory on completion of the job:

#!/bin/bash --login
#BSUB -o example.o.%J
#BSUB -x
#BSUB -n 24                 # Number of cores to use
#BSUB -W 1:00               # 1 hour time limit

module purge
module load compiler/intel
module load mpi/intel

MYPATH=$HOME/mycode_directory
executable=$MYPATH/bin/my_executable_name

# Run in Lustre global scratch directory
WDPATH=/scratch/$USER/$LSB_JOBID
rm -rf $WDPATH
mkdir -p $WDPATH
cd ${WDPATH} || exit $?
trap "rm -rf ${WDPATH}" EXIT   # delete scratch directory on exit

# $LSB_DJOB_NUMPROC is supplied by LSF and is number of processes
mpirun -np $LSB_DJOB_NUMPROC $executable

Note the use of $LSB_JOBID to create a unique Lustre sub-directory in which to run the job.
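The create, enter and clean-up pattern used for the scratch directory can be tried outside LSF with a sketch such as the following. SCRATCH_ROOT and run_in_scratch are hypothetical names introduced here, and falling back to the shell's process ID when $LSB_JOBID is unset is an assumption for illustration only.

```shell
#!/bin/bash
# Sketch of the per-job scratch directory pattern from Example 3.
# SCRATCH_ROOT and run_in_scratch are hypothetical names; SCRATCH_ROOT lets
# the sketch be tried somewhere other than /scratch/$USER.
run_in_scratch() (
    # The body runs in a subshell, so variables stay private to this call
    # and the EXIT trap fires as soon as the command has finished.
    wd="${SCRATCH_ROOT:-/scratch/$USER}/${LSB_JOBID:-$$}"
    rm -rf "$wd"
    mkdir -p "$wd" || exit 1
    cd "$wd" || exit 1
    trap 'rm -rf "$wd"' EXIT   # delete scratch directory on exit
    "$@"
)

# Try it in a temporary location with an example job ID
SCRATCH_ROOT=$(mktemp -d)
LSB_JOBID=12345 run_in_scratch pwd
[ -d "$SCRATCH_ROOT/12345" ] || echo "scratch directory removed on exit"
```

Setting the trap only after the cd has succeeded means nothing is deleted if the directory could not be created.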

8.3.4 Example 4

The following example mirrors Example 3, running the job in the user’s Lustre global scratch directory (/scratch/$USER), but now illustrates the use of pdsh to run a command on all the compute nodes. Here pdsh runs the memhog utility to clear the memory of the nodes, giving optimal and consistent performance:

#!/bin/bash --login
#BSUB -o example.o.%J
#BSUB -x
#BSUB -n 24                 # Number of cores to use
#BSUB -W 1:00               # 1 hour time limit
#BSUB -q q_cf_htc_work      # Submit to a specific queue

module purge
module load compiler/intel
module load mpi/intel

MYPATH=$HOME/mycode_directory
executable=$MYPATH/bin/my_executable_name

# Run in Lustre global scratch directory
WDPATH=/scratch/$USER/$LSB_JOBID
rm -rf $WDPATH
mkdir -p $WDPATH
cd ${WDPATH} || exit $?
trap "rm -rf ${WDPATH}" EXIT   # delete scratch directory on exit

# generate list of hosts for use by pdsh
PDSH_HOSTS=`echo $LSB_MCPU_HOSTS | awk '{for(i=1;i<NF;i=i+2) printf $i",";}'`

# pdsh can be used to run a command on all compute nodes of a job
# for example when benchmarking one can run memhog to
# clear memory to provide optimal and consistent performance
pdsh -w $PDSH_HOSTS memhog 35g > /dev/null 2>&1

# $LSB_DJOB_NUMPROC is supplied by LSF and is number of processes
mpirun -np $LSB_DJOB_NUMPROC $executable
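The awk command in the script picks the host names out of $LSB_MCPU_HOSTS, which LSF sets to a list of the form "host1 ncores1 host2 ncores2 ...", and leaves a trailing comma on the result. A possible variant that joins the names without the trailing comma is sketched below; pdsh_host_list is a hypothetical helper name and the node names are example values.

```shell
#!/bin/bash
# Sketch: build a comma-separated host list for pdsh from a string in the
# LSB_MCPU_HOSTS format "host1 ncores1 host2 ncores2 ...".
# pdsh_host_list is a hypothetical helper name.
pdsh_host_list() {
    # Print every odd-numbered field, joined by commas, no trailing comma
    echo "$1" | awk '{
        for (i = 1; i < NF; i += 2)
            printf "%s%s", (i > 1 ? "," : ""), $i
        print ""
    }'
}

# Example value; inside a job, pass "$LSB_MCPU_HOSTS" instead
pdsh_host_list "node001 12 node002 12"
```

Inside a job script the result could be assigned with PDSH_HOSTS=$(pdsh_host_list "$LSB_MCPU_HOSTS").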

8.3.5 Example 5. Execution of the DLPOLY classic Code

The following example illustrates execution of the DLPOLY classic molecular simulation code under control of the associated module dlpoly-classic/1.8. The job runs in the user’s Lustre global scratch directory (/scratch/$USER). The DLPOLY input data sets, which reside in the user’s directory DLPOLY-classic/data/Bench5, are first copied into the scratch directory /scratch/$USER/DLPOLY-classic.$LSB_JOBID. Note the use of the LSF parameter $LSB_JOBID to create a unique name for that directory.

#!/bin/bash --login
#BSUB -n 128
#BSUB -x
#BSUB -o Bench5.HTC.o.%J
#BSUB -e Bench5.HTC.e.%J
#BSUB -J Bench5
#BSUB -R "span[ptile=12]"
#BSUB -W 1:00
#BSUB -q q_cf_htc_work

module load compiler/intel-11.1.072
module load mpi/intel-4.0
module load dlpoly-classic/1.8

export OMP_NUM_THREADS=1
code=$DLPOLY_EXECUTABLE
MYPATH=${HOME}/DLPOLY-classic/data
MYDATA=${HOME}/DLPOLY-classic/data/Bench5
WDPATH=/scratch/$USER/DLPOLY-classic.$LSB_JOBID
NCPUS=$LSB_DJOB_NUMPROC
env

rm -rf ${WDPATH}
mkdir ${WDPATH}
cd ${WDPATH}

# copy input files to working directory
cp -r -p ${MYDATA}/CONFIG ${WDPATH}
cp -r -p ${MYDATA}/CONTROL ${WDPATH}
cp -r -p ${MYDATA}/FIELD ${WDPATH}
cp -r -p ${MYDATA}/TABLE ${WDPATH}
rm -f REVCON REVIVE STATIS

# NNODES is not set anywhere in this script, so it prints as empty
echo "CPUS=$NCPUS, NODES=$NNODES"
time mpirun -r ssh -np $NCPUS ${code}

# copy output file back to home filestore
cp ${WDPATH}/OUTPUT ${MYPATH}/Bench5.out.impi
cd ${WDPATH}
rm -f REVCON REVIVE STATIS OUTPUT

The simulation output is copied back to the user’s DLPOLY-classic/data directory as the file Bench5.out.impi on completion of the job. The user is referred to section 9 of this manual for a comparison with the use of SynfiniWay when running DLPOLY classic.
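The script echoes $NNODES but never assigns it. One way the node count could be derived is from $LSB_MCPU_HOSTS, which LSF sets to "host1 ncores1 host2 ncores2 ...", as in the sketch below. The helper name count_nodes and the node names are assumptions for illustration.

```shell
#!/bin/bash
# Sketch: derive the node count from a string in the LSB_MCPU_HOSTS format
# "host1 ncores1 host2 ncores2 ...". count_nodes is a hypothetical helper.
count_nodes() {
    # Each host contributes two fields: its name and its core count
    echo "$1" | awk '{ print int(NF / 2) }'
}

# Example value; inside a job, pass "$LSB_MCPU_HOSTS" instead
NNODES=$(count_nodes "node001 12 node002 12 node003 8")
echo "NODES=$NNODES"
```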

8.3.6 Compiling and running OpenMP threaded applications

Code with embedded OpenMP directives may be compiled and run on a single compute node with up to a maximum of NCORES threads via the smp parallel environment, where NCORES is the number of CPU cores per node.

Compiling OpenMP code

OpenMP code may be compiled with the Intel compiler and with gcc/gfortran version >= 4.2. Here is a simple example from the tutorial at http://openmp.org/wp/:

C*************************************************************************
C FILE: omp_hello.f
C DESCRIPTION:
C   OpenMP Example - Hello World - Fortran Version
C   In this simple example, the master thread forks a parallel region.
C   All threads in the team obtain their unique thread number & print it.
C   The master thread only prints the total number of threads. Two OpenMP
C   library routines are used to obtain the number of threads and each
C   thread's number.
C AUTHOR: Blaise Barney
C LAST REVISED:
C*************************************************************************

      PROGRAM HELLO

      INTEGER NTHREADS, TID, OMP_GET_NUM_THREADS,
     +        OMP_GET_THREAD_NUM

C     Fork a team of threads giving them their own copies of variables
!$OMP PARALLEL PRIVATE(NTHREADS, TID)

C     Obtain thread number
      TID = OMP_GET_THREAD_NUM()
      PRINT *, 'Hello World from thread = ', TID

C     Only master thread does this
      IF (TID .EQ. 0) THEN
        NTHREADS = OMP_GET_NUM_THREADS()
        PRINT *, 'Number of threads = ', NTHREADS
      END IF

C     All threads join master thread and disband
!$OMP END PARALLEL

      END

Here are the basic compile options for the GNU and Intel compilers for the Fortran code. The options for C code are the same; substitute the corresponding C compiler in each case.

GNU: gfortran -fopenmp -o hello omp_hello.f

Intel: ifort -openmp -o hello omp_hello.f
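The compiler-to-flag pairing above could be captured in a small helper for build scripts, as in the sketch below. openmp_flag is a hypothetical name, and gcc and icc are taken here to be the corresponding C compilers mentioned in the text. The sketch prints the compile command rather than running it.

```shell
#!/bin/bash
# Sketch: pick the OpenMP flag quoted in the text for a given compiler.
# openmp_flag is a hypothetical helper; gcc and icc are assumed to be the
# "corresponding C compilers" for the GNU and Intel cases.
openmp_flag() {
    case "$1" in
        gfortran|gcc) echo "-fopenmp" ;;
        ifort|icc)    echo "-openmp" ;;
        *)            echo "unknown compiler: $1" >&2; return 1 ;;
    esac
}

# Print the compile command rather than running it
FC=gfortran
echo "$FC $(openmp_flag "$FC") -o hello omp_hello.f"
```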

Running OpenMP code

To run an OpenMP code, a job script needs to be created and submitted to the smp parallel environment. Using the hello example above, here is a script called run.sh:

#!/bin/bash --login
#BSUB -n 4
#BSUB -x
#BSUB -o OpenMP.HTC.rx600.o.%J
#BSUB -J HELLO
#BSUB -R "span[ptile=4]"
#BSUB -W 1:00
#BSUB -q q_cf_htc_large

# latest intel compilers
module load compiler/intel-11.1.072

export OMP_NUM_THREADS=4
code=${HOME}/linux_openmp/source_code_2/hello
MYPATH=$HOME/linux_openmp/source_code_2
TESTS="OpenMP"
cd ${MYPATH}

echo running OpenMP TEST OMP_NUM_THREADS=$LSB_DJOB_NUMPROC
export OMP_NUM_THREADS=$LSB_DJOB_NUMPROC
${code}

To submit the script:

$ bsub < run.sh
Job <202999> is submitted to queue <q_cf_htc_large>.

The output file OpenMP.HTC.rx600.o.202999 after the job has completed:

running OpenMP TEST OMP_NUM_THREADS=4
Hello World from thread = 0
Number of threads = 4
Hello World from thread = 2
Hello World from thread = 1
Hello World from thread = 3

There are two points to note. Firstly, the run.sh script contains the line #BSUB -q q_cf_htc_large, which ensures that the script is submitted to the SMP parallel environment. Secondly, the environment variable OMP_NUM_THREADS should be set to the required number of threads; it is recommended that this value does not exceed the number of cores per node. If OMP_NUM_THREADS is not set, the default value depends in principle on which compiler was used; in practice both the GNU and Intel compilers set it to the number of cores.
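The recommendation that OMP_NUM_THREADS should not exceed the number of cores per node could be enforced inside a job script along the following lines. This is a sketch only: cap_threads is a hypothetical helper, and the fallback values used when $LSB_DJOB_NUMPROC is unset or getconf is unavailable are assumptions.

```shell
#!/bin/bash
# Sketch: keep OMP_NUM_THREADS at or below the number of cores on the node.
# cap_threads is a hypothetical helper; the fallback of 4 requested threads
# outside LSF is an assumption for this illustration.
cap_threads() {
    local requested=$1 cores=$2
    if [ "$requested" -gt "$cores" ]; then
        echo "$cores"
    else
        echo "$requested"
    fi
}

# Number of online cores on this node; fall back to 1 if getconf fails
CORES=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
OMP_NUM_THREADS=$(cap_threads "${LSB_DJOB_NUMPROC:-4}" "$CORES")
export OMP_NUM_THREADS
echo "OMP_NUM_THREADS=$OMP_NUM_THREADS"
```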
