Cluster Implementation and Management; Scheduling

CPS343

Parallel and High Performance Computing

Outline

1 Cluster components

Nodes

Interconnect

Storage and file systems

Software

2 Node provisioning, resource management, and job scheduling

Provisioning nodes

Acknowledgements

Some material used in creating these slides comes from

Cluster components

A typical cluster consists of the following components:

master/login nodes (1 or more)

compute nodes (many)

interconnect (1 or more)

storage system

Master/login nodes

Master (or service) nodes run the resource manager and job scheduler

Login nodes handle interactive user logins, software development, submission of jobs, and pre- and post-processing of data

Compute nodes

Compute node configuration depends on the applications the cluster is designed to support. Important factors to consider are

number of processors, number of cores per processor

amount of RAM, FSB speed

Interconnect

The network that connects the compute nodes to each other and to the master/login nodes is called an interconnect fabric or just interconnect.

As in the case of compute nodes, the type of interconnect chosen depends on the applications the cluster is designed to run.

Key parameters are latency and bandwidth.

A scalable, low-latency, high-bandwidth interconnect is desirable for the tightly coupled tasks typical in HPC.

Interconnect options

Two main options: Ethernet or InfiniBand.

Ethernet

Gigabit Ethernet (GigE), available since the early 2000s, is now the Ethernet standard for general use

10-Gigabit Ethernet (10-GigE) became available in the late 2000s

The names refer to the supplied bandwidth; 1 Gigabit/s is 125 MB/s while 10 Gigabit/s is 1.25 GB/s.

Typical GigE latency is 20 µsec.

Low-latency 10-GigE latency can be around 4 to 5 µsec.

In many HPC applications low latency is more important than bandwidth: many short messages are sent between tightly coupled processes.

Unlike Fast Ethernet and GigE, 10-GigE supports only full-duplex operation over switched links (no hubs).
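Why latency matters more than bandwidth for short messages can be seen from a simple first-order cost model, time = latency + size / bandwidth. The following is a minimal sketch under that assumption; it ignores protocol overhead, switch hops, and contention. The function name and the choice of 5 µsec (from the 4 to 5 µsec range) are illustrative, not part of any standard.

```python
# First-order model of the time to deliver one message over a link:
#   time = latency + message_size / bandwidth
# Assumption: no protocol overhead, no contention, a single hop.

def transfer_time(size_bytes, latency_s, bandwidth_bytes_per_s):
    """Estimated seconds to deliver one message of size_bytes."""
    return latency_s + size_bytes / bandwidth_bytes_per_s

# Figures quoted on the slide; bandwidth is the line rate in bits/s divided by 8.
GIGE = {"latency_s": 20e-6, "bandwidth_bytes_per_s": 1e9 / 8}      # 125 MB/s
TEN_GIGE = {"latency_s": 5e-6, "bandwidth_bytes_per_s": 10e9 / 8}  # 1.25 GB/s

if __name__ == "__main__":
    for size in (64, 1024, 1024**2):  # message sizes in bytes
        t1 = transfer_time(size, **GIGE)
        t10 = transfer_time(size, **TEN_GIGE)
        print(f"{size:>8} B  GigE {t1 * 1e6:10.2f} usec  10-GigE {t10 * 1e6:10.2f} usec")
```

In this model a 64-byte message over GigE spends about 20 µsec on latency and only about 0.5 µsec on the wire, so a faster line rate alone would barely help; for a 1 MiB message the bandwidth term dominates instead.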

InfiniBand

InfiniBand (IB) is a switched network fabric

Very low latency, 1 to 3 µsec

Bandwidth comparable to or exceeding 10-GigE; InfiniBand QDR 12x bandwidth is 12 GB/s

New InfiniBand EDR technology is pushing 36 GB/s
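InfiniBand link bandwidth follows from the number of lanes, the per-lane signalling rate, and the line-encoding efficiency. The sketch below works through that arithmetic using the nominal published per-lane rates and encodings for QDR and EDR; the table and function names are hypothetical, and real achievable throughput is lower than these nominal figures.

```python
# Nominal InfiniBand data bandwidth:
#   lanes x per-lane signalling rate x encoding efficiency
# Assumption: nominal figures only; measured throughput will be lower.

LANE_GBIT_S = {"QDR": 10.0, "EDR": 25.78125}   # signalling rate per lane
ENCODING = {"QDR": 8 / 10, "EDR": 64 / 66}     # 8b/10b and 64b/66b line codes

def link_bandwidth_gbyte_s(generation, lanes):
    """Nominal data bandwidth of a link in GB/s (10**9 bytes per second)."""
    data_gbit_s = LANE_GBIT_S[generation] * lanes * ENCODING[generation]
    return data_gbit_s / 8  # bits to bytes

if __name__ == "__main__":
    for gen in ("QDR", "EDR"):
        for lanes in (4, 12):
            print(f"{gen} {lanes:>2}x: {link_bandwidth_gbyte_s(gen, lanes):6.2f} GB/s")
```

With these nominal figures QDR 12x comes to 12 GB/s and EDR 12x to about 37.5 GB/s, roughly ten to thirty times the 1.25 GB/s of 10-GigE.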

Other/hybrid

Medium to large clusters often have multiple network interconnects

IB or 10-GigE for compute node interconnect fabric; low latency and high bandwidth

This interconnect may also connect to the storage subsystem...

...or a separate IB or 10-GigE network may be used for access to storage and the master/login node(s)

Storage and file systems

In small clusters disks in the master/login node provide primary shared storage.

Compute nodes may have disks for scratch space

In larger clusters a separate storage area network (SAN) is used to provide storage to the cluster

Usually a distributed file system (DFS) is used to make the storage network appear transparently as a disk or disks to the cluster nodes

Currently Lustre is a popular DFS option; others include NFS, GPFS, and FhGFS.

NFS

NFS stands for Network File System

Developed by Sun Microsystems in the early 1980s

Open source implementations exist for most systems

GPFS

This is IBM’s General Parallel File System

Used on some computers in the Top500 list and in many commercial clusters

First appeared in late 1990s

Lustre

Open source; name derived from “Linux Cluster”

Used by Titan and 5 others of the top 10 computers in the Top500 list

The Lustre system has three main components:

1 An MDS (metadata server) and associated MDTs (metadata targets; one per Lustre file system)

2 One or more OSSes (object storage servers) that interact with OSTs (object storage targets: disks, SAN, etc.)

3 Clients: cluster nodes, workstations, archival storage systems, etc.

HPC software stack

HPC software

Operating system

Most clusters today run some version of Linux; Red Hat and CentOS (both RPM based) are the most popular

Some vendors (e.g. Cray) have customized versions of Linux

Cluster management and control

provision compute nodes

schedule jobs

HPC development tools

Compilers

The need for diskless provisioning

Original Beowulf clusters consisted of individual, stand-alone computers connected by a network

Each node has a disk with the OS and other software

Our workstation cluster follows this model

It is untenable, however, for medium or large clusters to be configured like this, as

each node would have to be installed individually

software upgrades would be a huge headache

PXE

Resource management

A cluster resource management system provides much of the same functionality for the cluster that the OS provides for an individual system

The most important resources in a cluster are the compute nodes

Nodes may not all be equivalent: some may have more memory, a scratch disk, one or more accelerators (GPU, Xeon Phi), and/or share a faster interconnect with certain other nodes.
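Because nodes differ, a resource manager has to track per-node attributes and hand out only nodes that satisfy a job's request. The following is a minimal sketch of that bookkeeping, not the data model of any particular resource manager; the `Node` and `Request` classes, the feature names, and the "least capable first" preference are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    mem_gb: int
    features: frozenset = field(default_factory=frozenset)  # e.g. {"gpu", "scratch"}
    busy: bool = False

@dataclass
class Request:
    nodes: int                    # how many nodes the job needs
    mem_gb: int = 0               # minimum memory per node
    features: frozenset = field(default_factory=frozenset)  # required on every node

def eligible_nodes(nodes, req):
    """Idle nodes that meet every per-node requirement of the request."""
    return [n for n in nodes
            if not n.busy and n.mem_gb >= req.mem_gb and req.features <= n.features]

def allocate(nodes, req):
    """Mark req.nodes eligible nodes busy and return them, or None if too few exist."""
    candidates = eligible_nodes(nodes, req)
    if len(candidates) < req.nodes:
        return None
    # Prefer the least capable nodes so richer ones stay free for jobs that need them.
    candidates.sort(key=lambda n: (len(n.features), n.mem_gb))
    chosen = candidates[:req.nodes]
    for n in chosen:
        n.busy = True
    return chosen

if __name__ == "__main__":
    cluster = [Node("n1", 64), Node("n2", 256, frozenset({"gpu"}))]
    print(allocate(cluster, Request(nodes=1, features=frozenset({"gpu"}))))
```

A request that cannot be satisfied right now returns `None`, which corresponds to the job waiting in the queue until suitable nodes become free.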

Job scheduler

The job scheduler uses information supplied by the resource manager to determine the best match between job requirements and available resources

It then provides this information to the resource manager, which starts jobs as the necessary resources become available

Multiple scheduling algorithms exist, including

FCFS – first come, first served

FIFO – first in, first out

RR – round robin

SJF – shortest job first

LJF – longest job first
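The policies listed above differ mainly in how they order the queue. The sketch below is a toy model for a single resource, with jobs given as hypothetical `(name, submit_order, runtime_minutes)` tuples; production schedulers handle many nodes, priorities, and backfill, none of which is modelled here.

```python
from collections import deque

# Each job is a tuple: (name, submit_order, estimated_runtime_minutes).

def fcfs(jobs):
    """First come, first served (same ordering as FIFO)."""
    return [j[0] for j in sorted(jobs, key=lambda j: j[1])]

def sjf(jobs):
    """Shortest job first; ties broken by submit order."""
    return [j[0] for j in sorted(jobs, key=lambda j: (j[2], j[1]))]

def ljf(jobs):
    """Longest job first; ties broken by submit order."""
    return [j[0] for j in sorted(jobs, key=lambda j: (-j[2], j[1]))]

def round_robin(jobs, quantum):
    """Completion order when each job runs at most `quantum` minutes per turn."""
    queue = deque((name, runtime) for name, _, runtime in sorted(jobs, key=lambda j: j[1]))
    finished = []
    while queue:
        name, remaining = queue.popleft()
        if remaining <= quantum:
            finished.append(name)                       # job completes in this turn
        else:
            queue.append((name, remaining - quantum))   # back of the queue
    return finished

if __name__ == "__main__":
    jobs = [("a", 0, 30), ("b", 1, 10), ("c", 2, 20)]
    print("FCFS:", fcfs(jobs))
    print("SJF: ", sjf(jobs))
    print("LJF: ", ljf(jobs))
    print("RR:  ", round_robin(jobs, quantum=10))
```

Round robin assumes jobs can be paused and resumed, which batch jobs on a cluster often cannot; it is included only to contrast it with the sort-based policies.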

Fair share

Job schedulers often make adjustments to rigid scheduling decisions based on use history

For example, during daytime hours an SJF policy may be enforced, giving preference to jobs with quick turn-around time

Suppose Susan keeps submitting jobs that take 10 minutes to run but Bob needs to run a 15 minute job.

Using strict SJF, Susan’s jobs will always run before Bob’s
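The Susan and Bob scenario can be simulated to show how a use-history adjustment changes the outcome. This is a toy sketch, not the fair-share formula of any real scheduler: the linear penalty `runtime + usage_weight * minutes_already_used` and the parameter name `usage_weight` are assumptions chosen for illustration.

```python
def run_schedule(rounds, usage_weight):
    """Simulate one resource for `rounds` job slots; return the users in run order.

    Susan always has a 10-minute job pending; Bob has a single 15-minute job.
    usage_weight = 0 is strict SJF; a positive value penalises each user by the
    minutes they have already consumed (an assumed, simplified fair-share rule).
    """
    usage = {"susan": 0, "bob": 0}
    pending = [("susan", 10), ("bob", 15)]
    order = []
    for _ in range(rounds):
        if not pending:
            break
        # Pick the job with the smallest adjusted key.
        job = min(pending, key=lambda j: j[1] + usage_weight * usage[j[0]])
        pending.remove(job)
        user, minutes = job
        usage[user] += minutes
        order.append(user)
        if user == "susan":
            pending.append(("susan", 10))  # Susan immediately resubmits
    return order

if __name__ == "__main__":
    print("strict SJF:", run_schedule(5, usage_weight=0))
    print("fair share:", run_schedule(5, usage_weight=1))
```

With `usage_weight=0` Bob's job never runs within the simulated slots, while with `usage_weight=1` Susan's first job raises her adjusted key to 20, so Bob's 15-minute job runs second.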