Beowulf Training
Using Parallel Computing to Run Multiple Jobs
Jeff Linderoth
Outline
• Introduction to Scheduling Software
¦ The Wonderful World of PBS
¦ The Equally Wonderful World of Condor
• Lab Time.
Resource Scheduling
• So people don't fight over the resources!
• Schedulers...
¦ Locate appropriate resources,
¦ Manage resources, so multiple processes don't conflict over
the same processor
¦ Ensure a fairness policy,
¦ Are integrated with accounting software.
Mmmmmmmmmmmmmm. Pie
• Our first computational task will be to estimate π by numerical
integration.
• Everyone knows...
$$\int_0^1 \frac{1}{1+x^2}\,dx = \arctan(x)\Big|_{x=0}^{1} = \arctan(1) = \frac{\pi}{4}.$$
The Rectangle Rule
[Figure: plot of 4/(1+x*x) for x from 0 to 1; vertical axis from 0 to 4]
A Program to Estimate π
• I've written a π-calculator for you.
cd
mkdir compute-pi
cd compute-pi
cp /tmp/Training/Session2/pi1.c .
gcc pi1.c -lm -o pi1
./pi1 1000
• This is not a parallel program. Just a simple (one process)
program.
? Nevertheless, we must submit it through a scheduling system
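The rectangle rule behind a calculator like pi1 can be sketched in a few lines of shell. This is not the source of pi1.c, which is not reproduced here; it is a minimal sketch that assumes the midpoint form of the rule and uses awk for the floating-point arithmetic. The function name estimate_pi is hypothetical.

```shell
#!/bin/bash
# Midpoint rectangle rule for the integral of 4/(1+x*x) on [0,1], which equals pi.
# Assumption: pi1.c may use a different variant of the rule (e.g. left endpoints).
estimate_pi() {
    local n=$1   # number of rectangles
    awk -v n="$n" 'BEGIN {
        h = 1.0 / n                   # width of each rectangle
        sum = 0.0
        for (i = 0; i < n; i++) {
            x = (i + 0.5) * h         # midpoint of rectangle i
            sum += 4.0 / (1.0 + x * x)
        }
        printf "%.12f\n", h * sum
    }'
}

estimate_pi 1000
```

Increasing the argument shrinks the rectangles and brings the estimate closer to π, at the cost of more loop iterations.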
Running with PBS
• A simple four step process...
• Create a PBS submission script
• Submit the script to the PBS system using the command qsub
• PBS runs the script on the first available resources
• PBS collects output for user's inspection
The PBS Submission Script: Overview
(1) You make a request for resources,
(2) PBS will allocate a node pool to fulfill your request.
(3) Now you have to tell the node pool what to do!
• Both steps (1) and (3) are accomplished through the PBS
submission script
⇒ The script contains
¦ PBS request statements
¦ Shell commands that will run your job on the allocated
resources.
¦ The shell commands are executed on the first node in your
node pool.
Our First PBS Submission Script
#PBS -q small
#PBS -l nodes=1:public
#PBS -l cput=00:05:00
#PBS -V
echo "The PBS job ID is: ${PBS_JOBID}"
echo "The PBS Node File is"
cat $PBS_NODEFILE
Format of the PBS Submission Script
• Lines that begin with #PBS are PBS directives
• Everything else is a shell command
¦ Shell commands are just things that you would type at the
regular login-prompt.
¦ But you can also do fancy looping and conditions.
http://www.gnu.org/manual/bash/html_chapter/bashref_toc.html
⇒ After the PBS commands, you put any commands you would
like.
The command to run your program is usually a good one to include. :-)
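Putting the pieces together, a submission script that also runs the program might look like the sketch below. The queue and resource lines are the ones from the first submission script; the cd to PBS_O_WORKDIR (the directory qsub was invoked from), the loop values, the function name run_pi_jobs, and the fallbacks for running outside PBS are assumptions added for illustration.

```shell
#!/bin/bash
#PBS -q small
#PBS -l nodes=1:public
#PBS -l cput=00:05:00
#PBS -V

# PBS sets PBS_O_WORKDIR to the directory qsub was run from.
# The ":-." fallback (an addition for illustration) lets the script
# also be tried by hand outside PBS.
cd "${PBS_O_WORKDIR:-.}" || exit 1

echo "The PBS job ID is: ${PBS_JOBID:-unset (not running under PBS)}"

# A loop with a condition: run pi1 for several interval counts,
# but only if the executable has actually been built here.
run_pi_jobs() {
    local n
    for n in 100 1000 10000; do
        if [ -x ./pi1 ]; then
            ./pi1 "$n"
        else
            echo "pi1 not found, skipping n=$n"
        fi
    done
}

run_pi_jobs
```

Everything after the #PBS lines is ordinary bash, so the same script can be checked at the login prompt before it is handed to qsub.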
Breaking It Down. PBS Directives
• -q Specifies the queue in which to place the job. We have two queues, small and large
¦ small: Max CPU time 20 minutes/process.
¦ large: Lower priority than jobs in small queue
• -l Defines the resources that are required by the job and establishes a limit on the amount of resource that can be consumed.
• -V Declares that all environment variables in the qsub
command's environment are to be exported to the batch job.
¦ If you would like the PBS job to inherit the same
environment as the one you are currently running in (same PATH variable, etc), you should include this directive.
The -l Story
• For resources, you will typically only need to declare
¦ the number of nodes,
¦ which class of nodes you request
⇒ #PBS -l nodes=4:public
¦ the maximum cpu time
⇒ #PBS -l cput=00:15:00
• For the truly brave and curious the command is
PBS: The Big Three
• qsub
¦ Submit a PBS job
• qstat
¦ Check the status of a PBS job
• qdel
¦ Delete a PBS job
Let's do it!
[jtl3@fire1 compute-pi-1]$ qsub run.pbs
5972.fire1
[jtl3@fire1 compute-pi-1]$ qstat -a
fire1:
                                                  Req'd  Req'd   Elap
Job ID      Username Queue Jobname SessID NDS TSK Memory Time  S Time
----------- -------- ----- ------- ------ --- --- ------ ----- - -----
5972.fire1  jtl3     small run.pbs  27018   1  --     -- 00:20 E    --
• Note that the job ID is printed for you when you submit the job
• qstat -a : Shows the status of all jobs
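The submit-then-check cycle can be scripted. The sketch below captures the job ID that qsub prints and polls qstat until the job is no longer listed. It assumes that qstat exits with a nonzero status once the job ID is unknown to the server; some PBS installations keep finished jobs visible for a while, so treat the polling condition as an assumption. The function name wait_for_job and the 10-second default interval are hypothetical.

```shell
#!/bin/bash
# Poll qstat until the given job is no longer known to the PBS server.
# Assumption: "qstat <jobid>" returns nonzero once the job has left the queue.
wait_for_job() {
    local jobid=$1
    local interval=${2:-10}   # seconds between checks
    while qstat "$jobid" > /dev/null 2>&1; do
        sleep "$interval"
    done
}

# Only attempt a real submission where PBS is installed and run.pbs exists.
if command -v qsub > /dev/null 2>&1 && [ -f run.pbs ]; then
    jobid=$(qsub run.pbs)          # qsub prints the job ID, e.g. 5972.fire1
    echo "Submitted $jobid"
    wait_for_job "$jobid"
    # Default output file name: <scriptname>.o<job number>
    cat "run.pbs.o${jobid%%.*}"
fi
```

The expansion ${jobid%%.*} drops everything from the first dot onward, turning 5972.fire1 into the job number 5972 used in the output file name.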
Looking at the Output
• By default standard output goes to <scriptname>.o<job number>
• By default standard error goes to <scriptname>.e<job number>
[jtl3@fire1 compute-pi-1]$ cat run.pbs.o5972
The PBS job ID is: 5972.fire1
The PBS Node File is
fire34
pi is about 3.1614997369512658487167300
Error is 1.9907083361472733e-02
• Note how the PBS environment variables are interpreted in the
Other Cool PBS Stuff You May Want To Do
• #PBS -N <Name> : Name your job
• #PBS -o <File.out> : Redirect standard output to File.out
• #PBS -e <File.err> : Redirect standard error to File.err
• #PBS -m -M : Mail options
• Job dependencies
• For a list of all PBS command file options...
¦ man qsub
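The options above can be combined in one script header. In the sketch below the job name, file names, and mail address are placeholders, not values from this cluster; -m ae asks for mail when the job aborts or ends. The dependency lines in the trailing comment use qsub's -W depend=afterok form; check man qsub for what your installation supports.

```shell
#!/bin/bash
#PBS -N pi-estimate
#PBS -o pi-estimate.out
#PBS -e pi-estimate.err
#PBS -m ae
#PBS -M user@example.com
#PBS -q small
#PBS -l nodes=1:public

# PBS exports the job name as PBS_JOBNAME; the fallback is for runs outside PBS.
job_name=${PBS_JOBNAME:-pi-estimate}
echo "Job $job_name running on $(uname -n)"

# Job dependency, typed at the login prompt rather than in this script:
#   first=$(qsub first.pbs)
#   qsub -W depend=afterok:"$first" second.pbs
```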
Condor
• For purposes of this discussion, think of Condor as a different
scheduler.
¦ Condor is a bit more fancy.
Used often for nondedicated resources. (Will run only when no one else would use the machine).
Checkpointing/Migration
Remote I/O
? Likely, the accounting charge will be less for jobs submitted to the
Condor scheduler.
http://www.cs.wisc.edu/condor
Checkpointing/Migration
[Figure: checkpointing and migration timeline. Labels: Professor's Machine, Grad Student's Machine, Checkpoint Server. Events: Grad Student Leaves, Professor Arrives, Grad Student Arrives. Times: 5am, 8am, 8:10am, 12pm, with two 5 min intervals.]
Condor Universes
• Condor jobs are submitted to a specific Condor Universe
• Standard: Has cool features like checkpointing and migration
of jobs
¦ Requires special linking of your program
• Vanilla: No cool Condor features (regular)
• MPI/PVM
Compiling for Condor
• Standard Universe
¦ Put the command condor_compile in front of your normal
link line.
¦ [jtl3@fire1 condor]$ condor_compile gcc pi1.c -o pi1-standard -lm
• Vanilla Universe
¦ Do nothing
• Now Condor submission is like PBS submission
¦ Different command (job description) file
A Sample Condor Submission File
universe = standard
executable = pi1-standard
arguments = 1000000000
output = pi1.out
error = pi1.err
notification = Complete
notify_user = [email protected]
getenv = True
rank = kflops
queue
The Big Four
• condor_submit <job.condor>
¦ Submit a job to the Condor scheduler
• condor_q
¦ Check the status of the queue of Condor jobs
• condor_status
¦ Check the status of the Condor pool
• condor_rm <jobid>
¦ Remove a job from the Condor queue
Let's Do It!
[jtl3@fire1 condor]$ condor_submit run.condor
Submitting job(s).
1 job(s) submitted to cluster 16.
[jtl3@fire1 condor]$ condor_q
-- Submitter: fire1.cluster : <192.168.0.1:32777> : fire1.cluster
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
16.0 jtl3 8/4 11:22 0+00:00:16 R 0 3.4 pi1-standard 1000000000
[jtl3@fire1 condor]$ cat pi1.out
pi is about 3.1415926555921398488635532
Error is 2.0023467328655897e-09
• I could do condor_rm 16.0
• Any Condor questions?
Quit Wasting My Time!
• OK, Linderoth, I thought today was supposed to be about
parallel computing!
¦ That will be the focus of the next section(s)
¦ For now, let's do some simple parallel computing.
• Suppose I'd like to run the same executable pi1, but with many
different input files or parameters.
Running Many Jobs
• We need a way to easily submit many different jobs
• We will use the shell's scripting capabilities
¦ PBS
? Use a template command file and the sed utility
¦ Condor
PBS: Run Multiple Jobs. Step #1
• Create a template submission file.
#!/bin/bash
#PBS -q small
#PBS -l nodes=1:public
#PBS -l walltime=00:05:00
#PBS -V
echo "The PBS job ID is: ${PBS_JOBID}"
echo "The PBS Node File is"
cat $PBS_NODEFILE
PBS: Run Multiple Jobs. Step #2
• Create a shell script to do the multiple submission
#!/bin/bash
for n in 100 1000 10000 100000 1000000
do
    sed s/XXX_N_XXX/$n/g run.pbs.template > run.pbs.tmp
    qsub run.pbs.tmp
    rm run.pbs.tmp
done
• The sed command replaces all occurrences of the pattern XXX_N_XXX in the template with the current value of n
PBS: Run Multiple Jobs.
[jtl3@fire1 pbs]$ sh run-many.sh
5989.fire1
5990.fire1
5991.fire1
5992.fire1
5993.fire1
• sh the script you created
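For the substitution to do anything, the template has to contain the placeholder. The sketch below writes a small template with an assumed program line, ./pi1 XXX_N_XXX (the program line of the original template is not shown in these slides), then generates one script per value of n. It calls qsub only where PBS is available, and keeps the generated files in a temporary directory so they can be inspected.

```shell
#!/bin/bash
workdir=$(mktemp -d)

# Template with the placeholder XXX_N_XXX; the ./pi1 line is an assumption.
cat > "$workdir/run.pbs.template" <<'EOF'
#!/bin/bash
#PBS -q small
#PBS -l nodes=1:public
#PBS -l walltime=00:05:00
#PBS -V
cd "$PBS_O_WORKDIR"
./pi1 XXX_N_XXX
EOF

for n in 100 1000 10000; do
    # Replace every occurrence of the placeholder with the current n.
    sed "s/XXX_N_XXX/$n/g" "$workdir/run.pbs.template" > "$workdir/run.pbs.$n"
    if command -v qsub > /dev/null 2>&1; then
        qsub "$workdir/run.pbs.$n"
    fi
done

ls "$workdir"
```

Writing one file per n, instead of reusing a single run.pbs.tmp, is a design choice for this sketch: it leaves a record of exactly what was submitted.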
Condor: Run Multiple Jobs Example
• condor_submit allows the user to override statements in the submission file.
¦ Use the -a flag
Condor Run Multiple Jobs. Step #1
• Create the Condor submission file
• Note no arguments or output lines!
executable = pi1-standard
universe = standard
notification = Complete
notify_user = [email protected]
getenv = True
rank = kflops
queue
The Condor Multiple Job Submission Script
• Create the Condor multiple job submission script
• Note the use of the -a option!
#!/bin/bash
for n in 100 1000 10000 100000 1000000
do
    condor_submit -a "arguments = $n" -a "output = pi.$n.out" \
        run.condor.many
done
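A cautious variant of the loop prints each command before running it, which makes the effect of the -a overrides visible. In this sketch the function name submit_many and the DRY_RUN variable are hypothetical; condor_submit -a and the file name run.condor.many are as used in the script for this step.

```shell
#!/bin/bash
# Submit one Condor job per value of n, overriding arguments and output with -a.
# With DRY_RUN=1 (a hypothetical switch for this sketch) commands are only printed.
submit_many() {
    local n
    for n in "$@"; do
        local cmd=(condor_submit
                   -a "arguments = $n"
                   -a "output = pi.$n.out"
                   run.condor.many)
        if [ "${DRY_RUN:-0}" = 1 ]; then
            printf '%q ' "${cmd[@]}"; echo
        else
            "${cmd[@]}"
        fi
    done
}

DRY_RUN=1 submit_many 100 1000 10000
```

Calling submit_many without DRY_RUN set runs the same commands for real, so the printed form and the submitted form cannot drift apart.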
Multiple Condor Submission Example
[jtl3@fire1 condor]$ sh run-many.sh
Submitting job(s).
1 job(s) submitted to cluster 32.
Submitting job(s).
1 job(s) submitted to cluster 33.
Submitting job(s).
1 job(s) submitted to cluster 34.
Submitting job(s).
1 job(s) submitted to cluster 35.
Submitting job(s).
1 job(s) submitted to cluster 36.
[jtl3@fire1 condor]$ condor_q
-- Submitter: fire1.cluster : <192.168.0.1:32777> : fire1.cluster
ID OWNER SUBMITTED RUN_TIME ST PRI SIZE CMD
33.0 jtl3 8/4 12:16 0+00:00:01 R 0 3.4 pi1-standard 1000
34.0 jtl3 8/4 12:16 0+00:00:00 R 0 3.4 pi1-standard 10000
35.0 jtl3 8/4 12:16 0+00:00:00 I 0 3.4 pi1-standard 10000
36.0 jtl3 8/4 12:16 0+00:00:00 I 0 3.4 pi1-standard 10000
The End!
• Schedulers are required for use in a parallel computing
environment
• PBS and Condor are cool
• You can do parallel computing even without MPI
¦ The Beowulf cluster can be a CPU cycle server for your