Cluster Implementation and Management; Scheduling
CPS343
Parallel and High Performance Computing
Outline
1 Cluster components
Nodes Interconnect
Storage and file systems Software
2 Node provisioning, resource management, and job scheduling
Provisioning nodes
Acknowledgements
Some material used in creating these slides comes from
Cluster components
A typical cluster consists of the following components: master/login nodes (1 or more)
compute nodes (many) interconnect (1 or more) storage system
Outline
1 Cluster components
Nodes
Interconnect
Storage and file systems Software
2 Node provisioning, resource management, and job scheduling Provisioning nodes
master/login nodes
Master (or service) nodes run the resource manager and job scheduler login nodes handle interactive user logins, software development, submission of jobs, and pre- and post-processing of data
Compute nodes
Compute node configuration depends on applications cluster is designed to support. Important factor to consider are
number of processors, number of cores per processor amount of RAM, FSB speed
Outline
1 Cluster components Nodes
Interconnect
Storage and file systems Software
2 Node provisioning, resource management, and job scheduling Provisioning nodes
Interconnect
The network that connects the compute nodes to each other and the master/login nodes is called an interconnect fabric or just
interconnect.
As in the case of compute nodes, the type of interconnect chosen depends on the applications the cluster is designed to run. Key parameters are latency and bandwidth
A scalable, low-latency, high-bandwidth interconnect is desirable for the tightly coupled tasks typical in HPC.
Interconnect options
Two main options: Ethernet or InfiniBand.
Ethernet
Gigabit Ethernet (GigE), available since the early 2000’s, is now the Ethernet standard for general use
10-Gigabit Ethernet (10-GigE) became available in late 2000s
The names refer to the supplied bandwidth; 1 Gigabit/s is 125 MB/s while 10 Gigabit/s is 1.25 GB/s.
Typical GigE latency is 20 µsec.
Low-latency 10-GigE latency can be around 4 to 5 µsec. In many HPC applications low latency is more important than bandwidth — many short messages sent between tightly-coupled processes.
Unlike fast Ethernet and GigE, 10-GigE is full-duplex and is a switched network fabric (no hubs).
InfiniBand
InfiniBand (IB) is a switched network fabric Very low latency, 1 to 3 µsec
Bandwidth comparable to 10-GigE; InfiniBand QDR 12x bandwidth is 12MB/s
New InfiniBand EDR technology is pushing 36MB/s
Other/hybrid
Medium to large clusters often have multiple network interconnects IB or 10-GigE for compute node interconnect fabric; low-latency and high bandwidth
This interconnect may also connect to storage subsystem . . . . . . or a separate IB or 10-GigE network may be used for access to storage and the master/login node(s)
Outline
1 Cluster components Nodes
Interconnect
Storage and file systems
Software
2 Node provisioning, resource management, and job scheduling Provisioning nodes
Storage and file systems
In small clusters disks in the master/login node provide primary shared storage.
Compute nodes may have disks for scratch space
In larger clusters a separate storage area network (SAN) is used to provide storage to the cluster
Usually a distributed file system (DFS) is used to make the make the storage network appear transparently as a disk or disks to the cluster nodes
Currently Lustre is a popular DFS option; others include NFS, GPFS, and FhGFS.
NFS
NFS stands for Network File System
Developed by Sun Microsystems in the early 1980s Open source implementations exist for most systems
GPFS
This is IBM’s General Parallel File System
Used on some computers in the Top500 list and in many commercial clusters
First appeared in late 1990s
Lustre
Open source; name derived from “Linux Cluster”
Used by Titan and 5 other of top 10 computers in the Top500 list The Lustre system has three main components:
1 A MDS (metadata server) and associated MDTs (metadata targets;
one per Lustre file system)
2 One or more OSSes (object storage servers) that interact with OSTs
(object storage targets – disks, SAN, etc.)
3 clients: cluster nodes, workstations, archival storage systems, etc
Lustre
Outline
1 Cluster components Nodes
Interconnect
Storage and file systems
Software
2 Node provisioning, resource management, and job scheduling Provisioning nodes
HPC software stack
HPC software
Operating system
Most clusters today run some version of Linux; RedHat and CentOS (both RPM based) are most popular
Some venders (e.g. Cray) have customized versions of Linux
Cluster management and control
provision compute nodes schedule jobs
HPC development tools
Compilers
Outline
1 Cluster components Nodes
Interconnect
Storage and file systems Software
2 Node provisioning, resource management, and job scheduling
Provisioning nodes
The need for diskless provisioning
Original Beowulf clusters consisted of individual, stand-alone computers connected by a network
Each node has a disk with the OS and other software Our workstation cluster follows this model
It is untenable, however, for medium or large clusters to be configured like this, as each node would have to be installed individually
software upgrades would be a huge headache
PXE
Outline
1 Cluster components Nodes
Interconnect
Storage and file systems Software
2 Node provisioning, resource management, and job scheduling Provisioning nodes
Resource management
A cluster resource management system provides much of the same functionality for the cluster that the OS provides for an individual system
The most important resource in a cluster are the compute nodes Nodes may not all be equivalent: some may have more memory, a scratch disk, one or more accelerators (GPU, Xeon Phi), and/or share a faster interconnect with certain other nodes.
Job scheduler
The job scheduler uses information supplied by the resource manager to determine the best match between job requirements and available resources
It then provides this information to the resource manager, which starts jobs as the necessary resources become available
Multiple scheduling algorithms exist, including
FCFS – first come, first served FIFO – first in, first out RR – round robin SJF – shortest job first LJF – longest job first
Fair share
Job schedulers often make adjustments to rigid scheduling decisions based on use history
For example, during daytime hours a SJF policy may be enforced, giving preference to jobs with quick turn-around time
Suppose Susan keeps submitting jobs that take 10 minutes to run but Bob needs to run a 15 minute job.
using strict SJF, Susan’s jobs will always run before Bob’s