Configuration of High Performance Computing for Medical Imaging and Processing
SunGridEngine 6.2u5
A manual guide for installing, configuring and using the cluster.
Mohammad Naquiddin Abd Razak Summer 2011
Introduction
Grid Engine Installation 101
Installation Prerequisites
Installing Ubuntu 11.04
Installing Java Version 1.6.0_22
Checking your IP address
Edit /etc/hosts file
Installing Master Host
Installing Exec/Submit Hosts
Reinstall submit/exec host
Sun Grid Engine Configuration
Configure Hosts
Adding new host group
Adding Admin/Submit/Exec Host
Configure Users
Configure Queues
Adding New Queues
Attributes, Explanations and Configurations
Complex Resource Configuration
Adding a New Complex Resource Attribution:
Resource Quota Configuration
Network File Storage (NFS) Setup
Mounting /share
Mounting Alexandria
Configure Status Monitor
Using the cluster
Submitting job
Options for qstat
Job Control
Checking job status
Added computers/nodes:
Acknowledgements
Introduction
The MASI Lab has taken another step toward meeting the demand for High Performance Computing (HPC) applications in medical imaging and processing. Twelve “rock-dudes” CPU machines were bought, each consisting of 12 processor cores, a 273GHz system with 47.25GB of RAM, and a GPU. These powerful machines have been connected through a network file system (NFS) to several other computers available in the MASI Lab. Thanks go to Sun Grid Engine, free software that lets us apply resource management to distribute jobs across the grid.
This document compiles step-by-step procedures covering the installation process, GridEngine configuration and NFS setup. To get GridEngine working, you need to tell the master host (mastadon) about the new nodes that you want to add to the system, and you need to set up a network file system (NFS). This sounds simple, but it is easier said than done.
This document also includes a simple tutorial on submitting jobs. It covers almost every basic thing that you need to get your grid working. For more details, please go to the Sun Grid Engine website (http://wikis.sun.com/display/GridEngine/Home). Don’t get frustrated if things are not working yet. There is a solution for every problem.
Grid Engine Installation 101
Installation Prerequisites
Below are the tools you need before installing the master host and execution host:
Ubuntu 11.04
Java version “1.6.0_22”
Know your IP address
Edit /etc/hosts file on both master host and execution host
Installing Ubuntu 11.04
Please make sure that your computer is running Ubuntu 11.04; skip this section if it already is. Type the following line to check your Ubuntu version:
masi@isis:~$ cat /etc/issue
Ubuntu 11.04 \n \l
If you are running an older version of Ubuntu, you must upgrade before you can link to the cluster. To get the latest version of Ubuntu, go to:
System >> Administration >> Update Manager and install updates.
Go to the website below for the complete instructions.
http://www.ubuntu.com/download/ubuntu/upgrade
Alternatively, you could type the following command on the terminal.
$ sudo nano /etc/update-manager/release-upgrades
(in the editor, change “Prompt=lts” to “Prompt=normal”, then save)
$ sudo do-release-upgrade
IMPORTANT!:
You may need to repeat this upgrade process several times in order to reach Ubuntu 11.04. To check your version after each pass, run cat /etc/issue as shown above.
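The manual edit of the release-upgrades file can also be scripted. The sketch below is only an illustration: the function name set_upgrade_prompt_normal is made up here, and it assumes the file contains a line reading exactly Prompt=lts. It rewrites that one line and leaves everything else untouched.

```shell
#!/bin/sh
# Hypothetical helper (not part of Ubuntu or SGE): switch the upgrade prompt
# from "lts" to "normal" in the file given as the first argument.
set_upgrade_prompt_normal() {
    conf_file="$1"
    tmp_file="${conf_file}.tmp"
    # Rewrite only the Prompt line; every other line is copied unchanged.
    sed 's/^Prompt=lts[[:space:]]*$/Prompt=normal/' "$conf_file" > "$tmp_file" &&
        mv "$tmp_file" "$conf_file"
}
```

On a real machine the function would have to run as root against /etc/update-manager/release-upgrades, followed by sudo do-release-upgrade.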
Installing Java Version 1.6.0_22
Java is required for inter-host communication. Open terminal and type the following commands:
$ sudo apt-get update
$ sudo apt-get install openjdk-6-jre
Checking your IP address
Run ifconfig, then copy and paste your IP address into a text editor (e.g., gedit) for later use.
masi@qotsa:~$ ifconfig
eth0 Link encap:Ethernet HWaddr 60:eb:69:81:f6:ca
inet addr:10.20.201.59 Bcast:10.20.201.255 Mask:255.255.255.0
inet6 addr: fe80::62eb:69ff:fe81:f6ca/64 Scope:Link
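If you would rather not copy the address by hand, it can be pulled out of the ifconfig output with sed. This is a minimal sketch; extract_ipv4 is a made-up name, and it assumes the older "inet addr:" output format shown above.

```shell
#!/bin/sh
# Hypothetical helper: read ifconfig-style output on stdin and print every
# IPv4 address that follows an "inet addr:" label.
extract_ipv4() {
    sed -n 's/.*inet addr:\([0-9.]*\).*/\1/p'
}

# Example use on a host: ifconfig eth0 | extract_ipv4
```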
Edit /etc/hosts file
This step is very important to ensure that your hosts are able to communicate with each other. The file is like your contacts list: without it, you will never be able to call your buddies.
On master host, do:
$sudo nano /etc/hosts
127.0.0.1 mastadon.ds.vanderbilt.edu mastadon
::1 mastadon.ds.vanderbilt.edu mastadon localhost6
<exec/submit host ip address> <machine name>
On Execution / Submit hosts, do:
$sudo nano /etc/hosts
#127.0.0.1 localhost (please remove or comment this out)
127.0.1.1 masi-4.ds.vanderbilt.edu masi-4 (and replace with this line)
10.20.201.62 mastadon.ds.vanderbilt.edu mastadon
10.20.201.47 masi-4.ds.vanderbilt.edu masi-4
Note that you are adding the IP addresses of your master host and your own machine to this file.
It is VERY IMPORTANT to complete this step on both the master host and the exec/submit host. Failure to do so may result in host communication errors.
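A quick way to confirm the edit is to check that a host name is mapped to a real (non-loopback) address. The sketch below is an illustration only; host_is_routable is a hypothetical name, and the check simply scans the file rather than asking the resolver.

```shell
#!/bin/sh
# Hypothetical helper: succeed if HOSTS_FILE ($1) maps NAME ($2), as a host
# name or alias, to an address that is not a loopback address.
host_is_routable() {
    hosts_file="$1"
    name="$2"
    awk -v name="$name" '
        /^[ \t]*#/ { next }                  # skip commented-out lines
        $1 !~ /^127\./ && $1 != "::1" {      # ignore loopback entries
            for (i = 2; i <= NF; i++) if ($i == name) found = 1
        }
        END { exit !found }
    ' "$hosts_file"
}

# Example use: host_is_routable /etc/hosts mastadon && echo "master entry found"
```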
Installing Master Host
The master host is the brain and the center of all cluster activities. By default, the master host is also an administration host and a submit host.
I found that installing the master host and execution host from Ubuntu packages is much easier than the interactive installation procedure provided by the Sun Grid Engine (SGE) documentation:
1. Make sure that you have completed the prerequisites (i.e., Ubuntu 11.04, Java, and editing the /etc/hosts file on the master and exec/submit hosts).
2. Install GridEngine on Master Host:
$ sudo apt-get install gridengine-client gridengine-common gridengine-exec gridengine-master gridengine-qmon
General type of mail configuration: No Configuration
Configure SGE automatically: Yes
SGE cell name: masicluster
SGE master hostname: mastadon.ds.vanderbilt.edu
3. Please make sure that you have edited the /etc/hosts file on both the master and the exec/submit hosts. You can now invoke QMON and start adding masi-4 as a submit host and execution host.
Please refer to Sun Grid Engine Configuration to add new hosts.
Installing Exec/Submit Hosts
Submit hosts allow you to submit jobs with the qsub command and monitor job status with the qstat command. They also let you use Qmon, the graphical user interface (GUI).
Execution hosts are systems that have permission to execute jobs. Therefore, queue instances are attached to the execution hosts.
1. Make sure that you have completed the prerequisites (i.e., Ubuntu 11.04, Java, and editing the /etc/hosts file on the master and exec/submit hosts).
2. Install the GridEngine package and enter yes to all.
masi@atdi:~$ sudo apt-get install gridengine-common gridengine-exec gridengine-client
Press enter (OK)
Yes
General type of mail configuration: No Configuration
Configure SGE automatically: Yes
SGE cell name: masicluster
SGE master hostname: mastadon.ds.vanderbilt.edu
3. The installation is complete. Next, log back in to mastadon (master host), invoke QMON and add masi-4 as an execution and submit host.
4. Once your computer has been added as a submit/exec host, you may check its status by typing qstat -f on mastadon (master/admin/submit host only). Make sure that the master host is able to communicate with the other hosts:
Positive:
Negative:
5. Reinstall the submit/exec hosts if you find that they are not communicating with the master host.
Reinstall submit/exec host
Do not get frustrated, because this may happen a lot. The problem may be due to a localhost.domain type entry in the /etc/hosts file. I do not have any proof, but my suggestion is to repeat the installation process until it works.
1. Make sure that you have completed the prerequisites (i.e., Ubuntu 11.04, Java, and editing the /etc/hosts file on the master and exec/submit hosts).
2. Remove GridEngine:
$ sudo apt-get --purge remove gridengine-common gridengine-exec gridengine-client
3. This is the most important step before reinstalling GridEngine: the left-over execd from the previous install will cause a communication failure.
4. Kill the execd process:
masi@atdi:~$ sudo kill 48128 (or whichever number appears in the process ID column)
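To find the number to pass to kill, you can filter a process listing for the execution daemon. The sketch below assumes the daemon's command name is sge_execd and that the listing comes from ps -eo pid,comm; execd_pids is a made-up helper name.

```shell
#!/bin/sh
# Hypothetical helper: read "ps -eo pid,comm" style output on stdin and print
# the PID of every process whose command name is sge_execd.
execd_pids() {
    awk '$2 == "sge_execd" { print $1 }'
}

# Example use on the exec host: ps -eo pid,comm | execd_pids
```

Each PID printed can then be passed to sudo kill as in step 4.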
Sun Grid Engine Configuration
The configuration can be done from the terminal or with QMON. Note that the full configuration can only be done from an admin host. Qmon is currently installed on Mastadon and Manchu. To configure Grid Engine from the terminal, see the qconf manual page (http://linux.die.net/man/1/qconf); to use Qmon, start it as shown below:
$ sudo qmon
Configure Hosts
Adding new host group
Host Configuration >> Click on Host Group Tab >> Click Add >> Type @group_name >> Insert the hosts to be added >> Click OK >> Done to finish
Adding Admin/Submit/Exec Host
Host Configuration >> Click on Administration Host/Submit Host/Execution Host >> Insert new host >> Click Done to finish
Configure Users
User Configuration >> Click on Userset >> Highlight arusers >> Click Modify >> Insert new User >> Click Done to finish
User Configuration >> Click on User >> Highlight arusers >> Insert new User >> Click Add and Done to finish
Configure Queues
An alternative way of checking communication between your hosts, as well as monitoring queue performance, is to click on the “Queue Control” icon. There are 2 queues available on the MASI Lab Cluster: cpujob and gpujob. Cpujob runs jobs on every available slot, where the total number of slots is equal to the total number of processors. Also, there is 1 GPU on each MASI Lab Cluster main node.
Queue Name   Queue Type   Total slots per node   Max GPU per Job   Details
Cpujob       CPU          12                     0                 Controls non-GPU type jobs
Gpujob       GPU          1                      1                 Controls GPU-type jobs only
Adding New Queues
Queue Control >> Click on tab “Cluster Queue” >> Click Add >> Enter a queue name >> Enter New Host/Host group
Below are the attributes which have been set up for cpujob and gpujob:
Queue Name: cpujob
Seq_no: 2
Priority: -20
Load thresholds: np_load_avg=1.75, [@cpu4=np_load_avg=1], [@cpu8=np_load_avg=1]
Slots: 1, [@cpu4=4], [@exechost=12], [@cpu8=8]
Complexes: None
Attributes, Explanations and Configurations
seq_no
Sequences the queues in order, starting from 0.
Queue Control >> Highlight a queue >> Modify >> Change the sequence number
Priority
-20 is the highest possible queue priority and 20 is the lowest, because the value is applied as the nice value of the jobs running in the queue.
Queue Control >> Highlight a queue >> Modify >> Change the priority value
Load Thresholds (np_load_avg)
When a load threshold is exceeded (overload), the queue stops receiving further jobs.
Queue Control >> Highlight a queue >> Modify >> Highlight a host on Attributes for Host or add a New Host on the bottom left corner >> Change the np_load_avg value
Slots
The slots value was set to the number of processors available. For gpujob, only one slot is available on each node, as we want only one job running on each GPU (there is one GPU per node).
Queue Control >> Highlight a queue >> Modify >> Change the slots value
Complexes (gpu=1)
gpu is a new complex attribute added to the cluster. This attribute was fixed to 1. However, if more GPUs are added to each node, this value can simply be changed to match.
Queue Control >> Highlight Queue >> Modify >> Complexes >> Add Consumable/Fixed Attributes (i.e., gpu=1).
“It is IMPORTANT for any job that uses a GPU to tell the cluster that it is using one GPU per job.”
Please add this line to your script for a GPU job: #$ -l gpu=1
Configuring queue on the terminal:
Complex Resource Configuration
Adding a New Complex Resource Attribution:
Qmon > Complex Configuration > “Fill in the Attributes” > Add > Commit
If a job needs a GPU, the following lines should be added to the job script to request the gpujob queue and to set the number of GPUs to 1. You can set the Consumable attribute to YES if more GPUs are added to each node.
#$ -q gpujob
#$ -l gpu=1 (we are using the new attribute here on this line)
For more details about complex resource attribute configuration, see the link below:
http://wikis.sun.com/display/GridEngine/Configuring+Complex+Resource+Attributes#ConfiguringComplexResourceAttributes-ExamplesofSettingUpConsumableResources
Resource Quota Configuration
A resource quota configuration is used to prevent users from consuming all available resources. For the MASI Lab Cluster, we limit the @exechost host group so that the total number of slots in use is equal to the total number of processors. With this limit, only 144 jobs run at a time (as there are 144 processors), each occupying one available slot, and the rest are queued.
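The 144-slot figure follows from 12 main nodes with 12 slots each. The sketch below works that out and writes a resource quota set to a file. It is an assumption-laden illustration, not the cluster's actual configuration: the rule name exechost_slots and the file location are made up, and the rule follows the general resource quota set syntax.

```shell
#!/bin/sh
# Assumed layout from this guide: 12 main nodes with 12 slots each.
nodes=12
slots_per_node=12
total_slots=$((nodes * slots_per_node))

# Write a resource quota set (the name "exechost_slots" is hypothetical).
rqs_file="${TMPDIR:-/tmp}/exechost_slots.rqs"
cat > "$rqs_file" <<EOF
{
   name         exechost_slots
   description  "Cap the slots used on the @exechost host group"
   enabled      TRUE
   limit        hosts @exechost to slots=$total_slots
}
EOF

echo "Total slots: $total_slots"
# An administrator could then load the file with: qconf -Arqs "$rqs_file"
```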
More information about static and dynamic resource quotas can be viewed on the link provided as below:
Network File Storage (NFS) Setup
Network-attached storage (NAS) is a storage device that attaches directly to a network, allowing clients to access the storage as if it were directly attached to their system. In the MASI Lab Cluster, storage is accessible directly across the network through the /share directory.
Mounting /share
Install packages for NFS:
abdrazm@masi-7:~$ sudo apt-get install nfs-kernel-server nfs-common
abdrazm@masi-7:~$ sudo apt-get install smbfs
Create a new folder on which to mount the /share directory of the NAS device:
abdrazm@masi-7:~$ sudo mkdir /share
Edit name service switch file:
abdrazm@masi-7:~$ sudo nano /etc/nsswitch.conf
hosts: files mdns4_minimal [NOTFOUND=return] wins dns mdns4
We need to create a credentials file so that the share can be mounted on startup, and save the file:
abdrazm@masi-7:~$ sudo nano /root/.smbcredentials
username=XXXXXXXX
password=XXXXXXXX
Use the chmod command to set the file permissions for /root/.smbcredentials so that only root can read and edit it:
abdrazm@masi-7:~$ sudo chmod 700 /root/.smbcredentials
Now we want to back up our fstab:
Edit fstab:
abdrazm@masi-7:~$ sudo nano /etc/fstab
//10.20.201.65/share /share cifs credentials=/root/.smbcredentials,iocharset=utf8,file_mode=0777,dir_mode=0777 0 0
Copy fstab for share -> masi-7 (/etc/fstab):
root@masi-7:~# cp -r /home/likewise-open/VANDERBILT/abdrazm/etc/fstab /share
Make a backup of the /home/likewise-open directory:
root@masi-7:~# mkdir /home-local
root@masi-7:~# mv /home/likewise-open /home-local/
Soft link:
root@masi-7:~# ln -s /share/home/likewise-open /home/likewise-open
Do an “ls /home/” and you should see a likewise-open directory:
root@masi-7:~# ls /home/
Reboot, then log back in with your vunet id:
root@masi-7:~# reboot
Mounting Alexandria
abdrazm@masi-7:~$ sudo -i
root@masi-7:~# apt-get install nfs-kernel-server nfs-common
root@masi-7:~# cd /
Create the /home-nfs/likewise-open directory:
root@masi-7:~# mkdir /home-nfs
root@masi-7:~# mkdir /home-nfs/likewise-open
Edit /etc/fstab.
root@masi-7:~# nano /etc/fstab
alexandria:/ /home-nfs/likewise-open nfs4 _netdev,auto 0 0
Mount /home-nfs/likewise-open
root@masi-7:~# mount /home-nfs/likewise-open
Create /home.orig as a backup and move /home/likewise-open to /home.orig:
root@masi-7:~# mkdir /home.orig
root@masi-7:~# mv /home/likewise-open /home.orig
Soft link:
root@masi-7:~# ln -s /home-nfs/likewise-open /home/likewise-open
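Both mounting procedures end with the same pattern: move the local directory aside as a backup, then leave a soft link in its place that points at the network copy. A minimal sketch of that pattern follows; relocate_with_symlink is a made-up name, and on the real machines it would have to run as root.

```shell
#!/bin/sh
# Hypothetical helper: move DIR ($1) into BACKUP_DIR ($2), then create a soft
# link at DIR that points to TARGET ($3).
relocate_with_symlink() {
    dir="$1"
    backup_dir="$2"
    target="$3"
    mkdir -p "$backup_dir" &&
        mv "$dir" "$backup_dir/" &&
        ln -s "$target" "$dir"
}

# Example use (as root):
#   relocate_with_symlink /home/likewise-open /home.orig /home-nfs/likewise-open
```

Chaining the three commands with && means the link is only created if the backup succeeded, so a failed move never leaves the original directory hidden behind a link.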
Configure Status Monitor
This enables us to monitor machine status of each computer from Inside Masi (https://masi.vuse.vanderbilt.edu/inside/index.php/Machine_Status)
sudo -i
ssh-keygen -t rsa
cat .ssh/id_rsa.pub | ssh [email protected] 'cat >> .ssh/authorized_keys'
The password for status is XXXXXXXX
Install sshfs and autofs:
sudo apt-get install sshfs
sudo apt-get install autofs
sudo crontab -e
*/2 * * * * /mnt/common/status/run-masi.sh 2> /dev/null > /dev/null
Create /mnt/common and add it to /etc/auto.master:
mkdir /mnt/common
/mnt/common /etc/auto.sshfs
Create /etc/auto.sshfs:
sudo nano /etc/auto.sshfs
* -fstype=fuse,rw,nodev,nonempty,noatime,allow_other :sshfs\#[email protected]\:/mnt/common/statu
Restart the autofs server:
/etc/init.d/autofs restart
Using the cluster
Submitting job
To submit a serial job, we first need to make an SGE script file:
1. Create a file on Linux:
2. Job example 1:
3. Job example 2:
The following directives are important to request GPU usage for a gpujob:
#$ -q gpujob
The above directive requests that the script be run in the gpujob queue.
#$ -l gpu=1
This tells SGE that we are requesting the complex resource gpu with a value of 1 (recall that there is only one graphics card per node).
4. To submit an array of jobs, use the command below:
5. Check your job status using $ qstat or $ qstat -f
6. You can read your job script by using the ‘cat’ command
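As an illustration of the steps above, the sketch below writes a small array-job script to a file. It is not one of the original job examples: the file name array_example.sh and the job name are made up, and it assumes the cpujob queue described earlier. SGE sets the SGE_TASK_ID variable to the index of each task in an array job.

```shell
#!/bin/sh
# Write a hypothetical array-job script (array_example.sh is a made-up name).
job_script="${TMPDIR:-/tmp}/array_example.sh"
cat > "$job_script" <<'EOF'
#!/bin/sh
#$ -q cpujob
#$ -cwd
#$ -N array_example
# SGE sets SGE_TASK_ID to the index of the current task in the array.
echo "Processing task ${SGE_TASK_ID}"
EOF

# Submit ten tasks from a submit host with:
#   qsub -t 1-10 "$job_script"
```

Each task runs the same script, so SGE_TASK_ID is the usual way to select a different input file or parameter per task.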
Options for qstat
qstat                        Print the general status of each submitted job
qstat -f                     Print full job information. Very nice and clear
qstat -u user                Print all jobs of a given user
qstat -j job_id              Print full job information for the given job id
qstat -explain c -j job_id   Print a detailed explanation for the given job id, plus an explanation of errors/pending states
Job Control
SGE job state codes and their meanings:
Category    State                                             Codes
Pending     Pending                                           qw
            Pending, user/system hold                         hqw
            Pending, user/system hold, re-queue               hRqw
Running     Running                                           r
            Transferring                                      t
            Running, re-submit                                Rr
            Transferring, re-submit                           Rt
Suspended   Job suspended                                     s, ts
            Queue suspended                                   S, tS
            Queue suspended by alarm                          T, tT
            All suspended with re-submit                      Rs, Rts, RS, RtS, RT, RtT
Error       All pending states with error                     Eqw, Ehqw, EhRqw
Delete      All running and suspended states with deletion    dr, dt, dRt, dRr, ds, dS, dT, dRs,
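When scripting around qstat, it can help to translate a state code back to its category. The sketch below encodes only the codes listed in the table above; describe_state is a made-up helper, not an SGE command.

```shell
#!/bin/sh
# Hypothetical helper: map an SGE job state code to the category used in the
# table above. Error and delete prefixes are checked first because they
# combine with the other codes.
describe_state() {
    case "$1" in
        E*)         echo "error" ;;
        d*)         echo "delete" ;;
        *qw)        echo "pending" ;;
        r|Rr|t|Rt)  echo "running" ;;
        *s|*S|*T)   echo "suspended" ;;
        *)          echo "unknown" ;;
    esac
}

# Example use: describe_state hqw
```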
Checking job status
$ qstat : this will give you the general status of each submitted job.
No information will be shown if there are no jobs in any queue (all jobs finished or none submitted).
$ qstat -explain c -j jobID : Job-ID 1069 is pending, as you can see in the diagram. Any job with a status other than ‘r’ is not happy. We can check the reason by using the qstat -explain command. Please refer to Table 4 for job status explanations.
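For a quick overview of how many jobs are running or pending, the plain qstat listing can be summarised with awk. This sketch assumes the default qstat layout (two header lines, then one job per line with the state in the fifth column); count_job_states is a made-up name.

```shell
#!/bin/sh
# Hypothetical helper: read plain qstat output on stdin and print one line
# per state code with the number of jobs in that state, sorted by code.
count_job_states() {
    awk 'NR > 2 && NF >= 5 { count[$5]++ }
         END { for (s in count) print s, count[s] }' | sort
}

# Example use on a submit host: qstat | count_job_states
```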
List of Nodes and Computers in MASI Lab
Main Nodes:
MASI main nodes (@gpujob, @exechost)   Host                       IP address
mastadon.ds.vanderbilt.edu             Admin, Submit, Execution   10.20.201.63
Isis.ds.vanderbilt.edu                 Execution                  10.20.201.43
Kyuss.ds.vanderbilt.edu                Execution                  10.20.201.50
Godspeed.ds.vanderbilt.edu             Execution                  10.20.201.56
Melvins.ds.vanderbilt.edu              Execution                  10.20.201.57
Outkast.ds.vanderbilt.edu              Execution                  10.20.201.58
Qotsa.ds.vanderbilt.edu                Execution                  10.20.201.59
Atdi.ds.vanderbilt.edu                 Execution                  10.20.201.60
Felakuti.ds.vanderbilt.edu             Execution                  10.20.201.61
Slayer.ds.vanderbilt.edu               Execution                  10.20.201.62
Mogwai.ds.vanderbilt.edu               Execution                  10.20.201.51
Pelican.ds.vanderbilt.edu              Execution                  10.20.201.54
Added computers/nodes:
MASI extra nodes            Host Group   Host                IP address     No. of CPU
Manchu.ds.vanderbilt.edu    @cpu8        Submit, Execution   10.20.201.40   8
Masi-1.ds.vanderbilt.edu    @cpu4        Execution           10.20.201.41   4
Masi-2.ds.vanderbilt.edu    @cpu4        Execution           10.20.201.49   4
Masi-2.ds.vanderbilt.edu    @cpu4        Execution           10.20.201.64   4
Masi-4.ds.vanderbilt.edu    @cpu4        Execution           10.20.201.47   4
Masi-7.ds.vanderbilt.edu    @cpu4        Execution           10.20.201.42   4
Masi-10.ds.vanderbilt.edu   @cpu4        Execution           10.20.201.46   4
Acknowledgements
Special thanks to:
Professor Bennett Landman ([email protected]). He is a great person, very supportive, warm and intellectual.
Albert Hampton ([email protected]), a hardworking technician, has almost every solution to fix computer problems.
Elliot Hall, a great friend and a coffee lover. May God make things easier for your wedding.
Andrew Asman, Xue Yang, Zhoubing Xu, Michael Esparza, Eesha Singh and Lauzon Carolyn, who made the summer wonderful with smiles, laughter and joy.