
Introducing High Performance Computing at Marquette


Academic year: 2021


(1)

Introducing High Performance Computing at Marquette

Xizhou Feng, Ph.D.

Research Engineer, IT Services
Research Assistant Professor, MSCS

Marquette University [email protected]

(2)

The Need for High Performance Computing

Computing is the “third pillar” for scientific discovery

Research computing provides the infrastructure that
- Enables science at scale
- Advances research programs
- Responds to new opportunities

[Figure: Experiment, Theory, and Computing supporting Discovery & Innovation, built on Computing Infrastructure, across the Physical Sciences, Economics, Social Sciences, Engineering, and Humanities]

(3)

Research Computing: Supporting HPC on Campus

Research computing is the application of computing resources and tools in conducting research, scholarship, and creative activity.

Its scope includes, but is not limited to:
- Computing, storage, and networking resources
- Large-scale data/database management
- Software for modeling, simulation, and analysis
- Ubiquitous, fully supported cyberinfrastructure
- Support for incorporating advanced computing technology into research programs

(4)

Research Computing Support @ Marquette

[Figure: organizational diagram connecting HPCGC, ITS, RCS, Campus Champions, the System, and users (Researchers, Computational Scientists, HPC Users). Edge labels: Advise, Policy, Direction; Plan, Monitor, Manage, Report; Service, Support; Request, Suggestion; Collaborate]

(5)

Available HPC Resources to Marquette Users

Local resources
- Pere Cluster
- PARIO Cluster
- HPCL Cluster
- MUGrid (Condor pool)

Regional resources
- SeWHIP

National resources
- XSEDE
- Open Science Grid, …
- NCSA, ORNL, DOE resources

Commercial resources

(6)

The Pere Cluster

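[Figure: Pere cluster architecture. Head nodes hn1 and hn2 and the MSA storage connect to eight enclosures of compute nodes, E1 (cn1-cn16) through E8 (cn113-cn128), over a Gigabit Ethernet interconnection and an InfiniBand interconnection (DDR, 4 x 5 Gbps). Also shown: Marquette Data Center, Active Directory]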

(7)

Pere Hardware Configuration

2 HP ProLiant DL380 G6 servers as head nodes
- Two Intel Xeon [email protected] quad-core CPUs
- Two 72GB hard drives (RAID 1)
- One Mellanox MT26418 IB DDR NIC
- Two NetXen NX3031 Ethernet controllers

128 compute nodes: HP ProLiant BL280c G6 blades
- Two Intel Xeon [email protected] quad-core CPUs
- Two local hard drives: 120GB + 500GB
- One Mellanox MT25418 IB DDR NIC
- One Intel 82576 Gigabit Ethernet controller

2 HP MSA2012sa storage racks
- Each rack has 3 enclosures
- Each enclosure has eleven 750GB 7200 RPM SATA disks configured with RAID10 (~20TB available storage)

(8)

Pere Software Configuration

OS: Red Hat Enterprise Linux 5
- Version 5.6

Authentication: AD + winbind
- Integrated with Marquette authentication infrastructure

Workload scheduler:
- TORQUE/PBS: cn1-64
- Condor: cn65-128

Programming models
- Task parallel
- OpenMP
- MPI
- MPI+OpenMP

(9)

Sample Applications Running on Pere

Biomedical
- Simvascular (blood flow)
- Neuron (computational neuroscience)
- Medical image processing
- Neural simulation

Chemistry
- Gaussian
- Amber
- cyana
- Autodock
- Molpro

Mechanical
- Converge (CFD)

Electrical
- MATLAB

MSCS
- MATLAB
- Bioinformatics apps
- Parallel computing course

Business
- Stata

(10)

Access the Pere Cluster

Get an account on Pere
- Fill out the account request form
- Email it to [email protected]

Log in to the cluster
- ssh <your-mu-id>@pere.marquette.edu
- ssh -X <your-mu-id>@pere.marquette.edu (enables X11 forwarding for graphical programs)

Account management
- User authentication is based on Active Directory
- Same user ID and password as emarq/checkmarq

(11)

Transfer Files between Pere and Desktop

Method 1: sftp (text or GUI)
    sftp <muid>@pere.mu.edu
    put simple.c
    bye

Method 2: scp
    scp simple.c [email protected]:example/

Method 3: rsync
    rsync --rsh=ssh -av example \
        [email protected]:

Method 4: svn or cvs

(12)

Transfer Files between Pere and Desktop

Method 5: Mount your home directory on Pere as a network drive
- Users need to request that this feature be enabled

(13)

Developing & Running Parallel Code

(14)

Workload Management/Job Scheduler

Software that provides
- Job submission and automatic execution
- Job monitoring and control
- Resource management
- Priority management
- Checkpointing

Usually implemented as a master/slave architecture

Pere currently uses both TORQUE/PBS and Condor

(15)

Using PBS/TORQUE

Commonly used commands

    qsub myjob.qsub    submit a job script
    qstat              view job status
    qdel job-id        delete a job
    pbsnodes           show node status
    pbstop             show queue status

(16)

Sample Job Script on Pere

    #!/bin/sh
    # Assign a name to the job
    #PBS -N hpl
    # Request resources: 64 nodes, each with 8 processors, for 1 hour
    #PBS -l nodes=64:ppn=8,walltime=01:00:00
    # Submit to the batch queue
    #PBS -q batch
    # Merge stdout and stderr output
    #PBS -j oe
    # Redirect output to a file
    #PBS -o hpl-$PBS_JOBID.log

    module load mpich2/intel/1.4.1
    # Change the working directory to the directory the job was submitted from
    cd $PBS_O_WORKDIR
    # Print allocated nodes (not required)
    cat $PBS_NODEFILE
    # Run the MPI program on 64 x 8 = 512 processes
    mpirun -np 512 --hostfile $PBS_NODEFILE xhpl
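The process count passed to mpirun has to match the resource request (64 nodes x 8 processors = 512). A minimal sketch of deriving it from the node file instead of hard-coding it is shown here. It relies on TORQUE writing one line per allocated processor to $PBS_NODEFILE. The stand-in node file with two hypothetical nodes is an assumption added only so the sketch runs outside a job.

```shell
#!/bin/sh
# Outside a PBS job there is no $PBS_NODEFILE, so build a small stand-in
# (2 hypothetical nodes x 8 processors) to keep the sketch runnable anywhere.
if [ -z "$PBS_NODEFILE" ]; then
    PBS_NODEFILE=$(mktemp)
    for node in cn1 cn2; do
        i=0
        while [ "$i" -lt 8 ]; do
            echo "$node"
            i=$((i + 1))
        done
    done > "$PBS_NODEFILE"
fi

# One line per allocated processor, so the line count is the MPI process count
NP=$(wc -l < "$PBS_NODEFILE" | tr -d ' ')
# Number of distinct nodes in the allocation
NNODES=$(sort -u "$PBS_NODEFILE" | wc -l | tr -d ' ')

echo "nodes=$NNODES processes=$NP"
# Inside a real job script the launch line would then read:
#   mpirun -np "$NP" --hostfile "$PBS_NODEFILE" xhpl
```

With this pattern, changing nodes= or ppn= in the #PBS -l line does not require editing the mpirun line.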

(17)

Using Condor

Resources:

(18)

Using Condor

1. Write a submit script, simple.job

    Universe = vanilla
    Executable = simple
    Arguments = 4 10
    Log = simple.log
    Output = simple.out
    Error = simple.error
    Queue

2. Submit the script to the Condor pool

    condor_submit simple.job

3. Watch the job run

    condor_q
    condor_q -sub <your-username>

(19)

Doing a Parameter Sweep

A collection of jobs can be put in the same submit script to do a parameter sweep.

    Universe = vanilla
    Executable = simple
    Arguments = 4 10
    Log = simple.log
    # Tell Condor to use a different output file for each job
    Output = simple.$(Process).out
    Error = simple.$(Process).error
    # Each Queue statement submits one job; the jobs can run independently
    Queue
    Arguments = 4 11
    Queue
    Arguments = 4 12
    Queue
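For a sweep over more than a few values, the repeated Arguments/Queue pairs can be generated by a loop rather than typed by hand. The sketch below is one way to do that. The file name sweep.job and the range 10 to 15 are hypothetical choices for illustration; the executable and argument layout follow the submit script on this slide.

```shell
#!/bin/sh
# Write sweep.job (a hypothetical file name) with one Queue statement per argument value
submit_file=sweep.job
{
    echo "Universe   = vanilla"
    echo "Executable = simple"
    echo "Log        = simple.log"
    # Single quotes keep $(Process) literal so Condor, not the shell, expands it
    echo 'Output     = simple.$(Process).out'
    echo 'Error      = simple.$(Process).error'
    for n in 10 11 12 13 14 15; do
        echo "Arguments  = 4 $n"
        echo "Queue"
    done
} > "$submit_file"

echo "wrote $(grep -c '^Queue$' "$submit_file") jobs to $submit_file"
# On the cluster, submit with: condor_submit sweep.job
```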

(20)

Condor DAGMan

DAGMan lets you submit complex sequences of jobs, as long as they can be expressed as a directed acyclic graph (DAG).

Commands:

    condor_submit_dag simple.dag
    ./watch_condor_q
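The slide does not show the contents of a DAG file, so the following is only an illustrative sketch of what one might look like. All file and job names here are hypothetical. The JOB and PARENT ... CHILD keywords are standard DAGMan syntax: each JOB line names a node and its submit script, and each PARENT line states which nodes must finish before others start.

```shell
#!/bin/sh
# Write a small DAG description (hypothetical file and job names):
# "prepare" runs first, two "simulate" jobs follow in parallel, "summarize" runs last.
cat > sweep.dag <<'EOF'
JOB prepare    prepare.job
JOB simulate1  simulate1.job
JOB simulate2  simulate2.job
JOB summarize  summarize.job
PARENT prepare CHILD simulate1 simulate2
PARENT simulate1 simulate2 CHILD summarize
EOF

echo "DAG nodes: $(grep -c '^JOB ' sweep.dag)"
# On the cluster, submit with: condor_submit_dag sweep.dag
```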

(21)

Using XSEDE Resources

If you need more computing power, consider XSEDE.

What is XSEDE?
- Extreme Science and Engineering Discovery Environment
- A single virtual system that scientists can use to interactively share computing resources, data, and expertise

XSEDE resources are free to academic users
- Allocation requests are needed, but we can help
- Campus Champions: Lars Olson and me

(22)

HPC Systems Available on XSEDE

(23)

Best Practices for Using Shared HPC Systems

Set up a comfortable local environment on your desktop
- SSH client (SSH Secure Client, PuTTY)
- Linux VM: VMware + CentOS + shared folder
- Use public keys for authentication

Be familiar with the Unix environment
- Editing files with vi or emacs
- Working with files and directories
- Working with the shell environment and scripting tools
- Working with basic Unix programming tools
- Security concerns: backups, passwords, and file access permissions

(24)

Best Practices for Using Shared HPC Systems

Understand the basics of HPC
- Typical HPC system architecture
  - SMP/Cluster/Grid/Heterogeneous systems
- Parallel computing models/paradigms
  - Job parallel/Data parallel/OpenMP/MPI/PGAS/MapReduce

Common tools available in HPC environments
- Environment modules
- Job schedulers: PBS, SGE, LSF, Condor, etc.
- Parallel compilers: gcc, Intel, PGI, etc.

Consult the system documentation
- Queue systems
- Data storage
- System policy

(25)

Best Practices for Using Shared HPC Systems

Automate your workflow
- Develop scripts to wrap/simplify the commands for preparing/transferring/cleaning data
- Use scripts/tools to glue related tasks together

Use the appropriate queues
- cvtec: for Simvascular, limited to 5 jobs
- batch: for other PBS jobs, no limit
- condor: for Condor jobs

Request the right number of nodes for each job
- Parallel speedup typically follows a bell curve as nodes are added
- Profile with short runs to determine the optimal number of nodes before launching many long runs
- Try to use all the cores on a single node to prevent interference from other jobs
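The profiling advice above can be made concrete with a small script that turns wall-clock times from short runs into speedup S(n) = T(1)/T(n) and efficiency E(n) = S(n)/n. The timings below are invented for illustration and are not measurements from Pere; they are shaped so that the run time stops improving after 16 nodes.

```shell
#!/bin/sh
# Hypothetical timings from short profiling runs: "<nodes> <wall-clock seconds>"
timings="1 1000
2 520
4 280
8 170
16 150
32 190"

# Speedup S(n) = T(1)/T(n); efficiency E(n) = S(n)/n
report=$(printf '%s\n' "$timings" | awk '
    NR == 1 { base = $2 }
    { printf "%d %.2f %.2f\n", $1, base / $2, base / ($2 * $1) }')

# Node count with the shortest wall-clock time
best=$(printf '%s\n' "$timings" | sort -k2,2n | head -n 1 | cut -d " " -f 1)

echo "nodes speedup efficiency"
echo "$report"
echo "fastest run used $best nodes"
```

The fastest run is not always the best choice: in this made-up data the 8-node run is only slightly slower than the 16-node run but uses half the nodes, which matters when many long runs share the cluster.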

(26)

Pay attention to data management
- Consider using a database to manage input data, simulation configuration, and results
- Store your data in a well-organized directory structure
- Routinely back up data from the cluster to your desktop
- Regularly check the available storage space on the cluster and remove unused temporary data

Optimize jobs for better performance
- Use an optimized version of your code
- Reduce unnecessary data movement
- Choose a proper interval for checkpointing
- Use different file systems for different purposes
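Checking storage and clearing out old temporary data can be done with standard Unix tools. The sketch below is one possible routine, not a Pere-specific procedure. It creates its own temporary directory with one fresh and one artificially old file (both hypothetical) so that it can be run safely anywhere; on the cluster the directory would be your own scratch or results directory, and the 30-day threshold is an arbitrary choice.

```shell
#!/bin/sh
# Hypothetical scratch directory; a temporary one is created so the sketch runs anywhere
scratch=$(mktemp -d)
touch "$scratch/recent.dat"
touch -t 200001010000 "$scratch/stale.tmp"   # give this file a very old timestamp

# Free space on the file system holding the directory, and total size of the directory
df -h "$scratch"
du -sh "$scratch"

# List files not modified in the last 30 days (candidates for removal)
stale=$(find "$scratch" -type f -mtime +30)
echo "stale files:"
echo "$stale"

# After reviewing the list, delete the same set of files
find "$scratch" -type f -mtime +30 -exec rm -f {} +
```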

(27)

Best Practices for Using Shared HPC Systems

Get help from the community
- Research Computing Support at Marquette
  - Solves technical issues
  - Helps with scripts/solutions
  - Advises on job/application optimization
  - Provides special training sessions
- Attend training/tutorial sessions
- Local user community
- XSEDE resources

(28)

System and User Support

[Figure: layered support stack. Layers: Computing Resources (clusters, networks, storage, power, cooling, etc.); Operating System; Runtimes and Middleware (MPI, OpenMP, UPC, PBS, Condor); Applications, Data Store, Visualization; User Interface and Collaboration. Side labels: Guaranteed, On-demand, Priority-based]

(29)

Motivating Examples

(30)

Example 1: High Performance Bayesian Phylogenetics

The problem: accurately and efficiently construct large evolutionary trees using genomic data

The challenges
- Extremely computationally intensive
- Large memory footprint
- Large number of datasets

[Figure: example trees, one with leaves Lemur, Gorilla, Chimpanzee, Human and one with leaves Italy 1998, Romania 1996, Kenya 1998, New York 1999, Israel 1998]

(31)

The solution

1. Develop highly scalable parallel algorithms (PBPI)
   - 1400X speedup on 256 processors (reducing time from ~40 hours to 1.7 minutes)
   - Support very large data sets with distributed memory
   - Scaling up to 4000 processors, enabling large-scale science

2. Customize scripts to automate data generation, analysis, and summary

3. Use HPC and TeraGrid to speed up analysis by running hundreds of analyses in parallel
   - Research that previously took years can be completed in weeks
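As a quick check of the figures quoted for PBPI, the speedup follows directly from the two run times, and dividing by the processor count gives the parallel efficiency:

```latex
S = \frac{40\ \text{h} \times 60\ \text{min/h}}{1.7\ \text{min}}
  = \frac{2400}{1.7} \approx 1412 \approx 1400\times,
\qquad
E = \frac{S}{p} = \frac{1400}{256} \approx 5.5
```

An efficiency above 1 means the gain is larger than the number of processors added. One plausible reading, which the slide does not confirm, is that the 40-hour baseline is a different (slower) sequential program or algorithm rather than PBPI on one processor.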

(32)

Example 2: Individual-based Computational Epidemiology

The problem: preparing for pandemic influenza with policy informatics
- The 1918 pandemic killed >25 million people worldwide (548,452 in the US)
- “It is only a matter of time before a human flu pandemic grips the world.”
- “A novel flu strain that can easily transmit between humans could trigger a disease pandemic that overburdens existing public health infrastructure.”

(33)

The solution: HPC-supported individual-based computational epidemiology
- Investigate how infectious diseases spread through large populations
- Provide tools for experts to test different public health interventions

[Figure: Simulation Engines combining Disease Models, Population Mobility, and a Social Contact Network; example with individuals a, b, c and locations L1, L2, L3 over the 8:00 to 12:00 time window]

(34)

The Results: High Fidelity, High Resolution, and High Flexibility Models

(35)
(36)

Example 3: Cyber-Infrastructure for Complex System Research

The problem: translating HPC software into a user-centric problem-solving environment, making HPC analytical capability available to domain experts who do not need to be HPC experts.

The solution:
- Abstract the scientific workflow into a web-based problem-solving platform
- Hide the complexity of data preparation, job submission, resource scheduling, and simulation/analysis execution in HPC and data grids
- Let researchers and experts focus on what problem to be
(37)

The DIDATIC/ISIS System

[Figure: workflow steps: Formulate Problem, Select Models/Data, Design Experiments, Execute Experiments, Analyze Results, Draw Conclusions, Recommend Policy. System components: Graphical User Interface, Database, GUI Server, Job Coordinator, Service Broker, Simulation Engine, Analytical Engine, SimfraSpace]

(38)

Lessons and Summary

Computing, particularly HPC, plays a central role in today’s research.

Parallel computing is becoming mainstream.

There are many challenges in applying HPC in a new research program.

User-centric, ubiquitous HPC and cyberinfrastructure is a candidate solution.

Marquette ITS Research Computing Service is committed to helping you build the environment and explore new research opportunities.
