
5.2 Research Computing Infrastructure

5.2.1 Systems

In an effort to become a world-leading research institution, the University recognised the importance of High Performance Computing (HPC) for research and deployed several HPC systems. HPC systems are the most effective means of processing large computational problems. The University’s two primary HPC systems are known as Eridani and Sol. All systems on the campus are linked together to form the Queensgate Grid (QGG) (Holmes & Kureshi, 2010).

Eridani, shown in Figures 5.1 and 5.2, is a Beowulf-type HPC cluster. It is capable of delivering a peak of 500 Giga FLOPS (GFLOPS) of computational power with a sustained performance rating of 380 GFLOPS. It comprises thirty-six commodity workstations linked together over a gigabit backplane. Each node has

5. University of Huddersfield RCS 74

FIGURE 5.2: Networking and Power for Eridani: Hot Aisle

FIGURE 5.3: Sol Cluster in the University Datacentre

a 4-core Intel® processor. All system control, monitoring, Message Passing Interface (MPI), and user data traffic is delivered over the single gigabit network channel. The nodes are housed in a custom-built shelf. The system uses Community Enterprise Operating System (CentOS) version 5 as the operating system, with Open Source Cluster Application Resources (OSCAR) 5.1b2 providing the linking middleware. Terascale Open-Source Resource and QUEue Manager (TORQUE) 2.5.7 with Maui 2 manages the system resources and jobs. Users’ home storage is mounted from a mirrored Gluster File System storage server.
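As an illustration of how a job might be described to TORQUE on a cluster like Eridani, the following Python sketch renders a minimal PBS job script. It is a sketch under stated assumptions, not the site's actual submission procedure: the function name, the job name, the `mpirun` launcher and the `./solver` command are hypothetical, and only the four cores per node come from the text.

```python
def render_pbs_script(job_name, nodes, cores_per_node, walltime, command):
    """Build the text of a minimal TORQUE/PBS job script.

    The MPI launcher ("mpirun") is an assumption; a real site may use a
    different launcher or wrapper.
    """
    if nodes < 1 or cores_per_node < 1:
        raise ValueError("nodes and cores_per_node must be at least 1")
    total_ranks = nodes * cores_per_node
    lines = [
        "#!/bin/bash",
        f"#PBS -N {job_name}",                          # job name
        f"#PBS -l nodes={nodes}:ppn={cores_per_node}",  # nodes and cores per node
        f"#PBS -l walltime={walltime}",                 # wall-clock limit
        "cd $PBS_O_WORKDIR",                            # directory the job was submitted from
        f"mpirun -np {total_ranks} {command}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Two 4-core nodes, as on Eridani; the command is a placeholder.
    print(render_pbs_script("demo", 2, 4, "01:00:00", "./solver"))
```

The resulting text would typically be saved to a file and handed to `qsub`. Because Eridani carries MPI and user data on the same gigabit channel, requesting whole nodes (here `ppn=4`) is one way to keep a job's ranks together.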


compute nodes that are based on SUN X4200 server hardware with two dual-core AMD® Opteron™ processors each. The system is housed in two server racks (as shown in Figure 5.3). There is a separate gigabit network layer for each of monitoring and control, user data, and interprocessor MPI. Each rack, which has 32 nodes, has three switches of its own to deliver the different network layers. Each switch is linked across each rack using a ring configuration, giving a throughput of 80 Gbps. Sol runs CentOS 6 as the operating system, with Warewulf Cluster Manager as the middleware. Jobs and resources are managed by TORQUE 4 and Maui 3.3.0. Despite the hardware being older, the additional network layers and the latest software have ensured that Sol is the most powerful system available on campus. Using High Performance Linpack (HPL), Sol achieved a sustained performance rating of 1.03 Tera FLOPS (TFLOPS) with a peak of 1.9 TFLOPS.
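The ratio of sustained to peak performance gives a rough efficiency figure for each cluster. The Python sketch below works this out from the figures quoted in the text; the function names and the dictionary layout are illustrative only.

```python
def sustained_efficiency(sustained_gflops, peak_gflops):
    """Return sustained performance as a fraction of theoretical peak."""
    if peak_gflops <= 0:
        raise ValueError("peak performance must be positive")
    return sustained_gflops / peak_gflops


# Figures quoted in the text, expressed in GFLOPS (1 TFLOPS = 1000 GFLOPS).
CLUSTERS = {
    "Eridani": {"sustained": 380.0, "peak": 500.0},
    "Sol": {"sustained": 1030.0, "peak": 1900.0},
}


def efficiency_report(clusters=CLUSTERS):
    """Map each cluster name to its efficiency as a percentage (1 d.p.)."""
    return {
        name: round(100 * sustained_efficiency(c["sustained"], c["peak"]), 1)
        for name, c in clusters.items()
    }


if __name__ == "__main__":
    for name, pct in efficiency_report().items():
        print(f"{name}: {pct}% of peak")
```

On these figures Eridani sustains 76% of its peak and Sol about 54%, while Sol's absolute sustained performance is still roughly 2.7 times that of Eridani.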

In addition to the two HPC clusters, there are two specialised HPC systems that make up the Huddersfield RCI. The first is an Nvidia® and Intel® based Graphics Processing Unit (GPU) cluster. This cluster comprises two compute nodes and a head node linked via Gigabit Ethernet. The nodes connect to the GPUs, housed in a special chassis, using a Peripheral Component Interconnect Express (PCI-E) interface. This system runs Microsoft® Windows HPC Server 2008™ as its operating system and middleware.

The second system is a data mining cluster made up of 14 nodes. This cluster, QGG-Hadoop, uses CentOS 6 as its operating system and Apache’s Hadoop software as the job manager. This system is utilised to tackle big data problems, such as error detection in medical records.

In order to deliver the latest in HPC technology to its users, the University of Huddersfield has bought into a shared HPC service. This system is housed at the Science and Technology Facilities Council (STFC) Daresbury. The system


has been listed on the Top500 list since 2012. The University of Huddersfield pays for priority access to a set amount of CPU hours each month. Once the CPU quota is exceeded, jobs initiated by the Huddersfield group on the system become the lowest priority.
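The quota rule described above can be sketched in a few lines of Python. This is a minimal sketch, assuming a simple two-level priority scheme: the class name, the priority values and the 1000-hour quota in the usage example are hypothetical and are not taken from the shared service itself.

```python
from dataclasses import dataclass

# Hypothetical priority levels; the real service's values are not given in the text.
PRIORITY_ACCESS = 100
LOWEST_PRIORITY = 0


@dataclass
class GroupUsage:
    """Track a group's CPU hours against its monthly priority quota."""
    monthly_quota_hours: float
    used_hours: float = 0.0

    def record(self, cpu_hours):
        """Add the CPU hours consumed by a finished job."""
        if cpu_hours < 0:
            raise ValueError("cpu_hours must be non-negative")
        self.used_hours += cpu_hours

    def priority_for_new_job(self):
        """Priority access until the quota is exceeded, lowest priority after."""
        if self.used_hours > self.monthly_quota_hours:
            return LOWEST_PRIORITY
        return PRIORITY_ACCESS


if __name__ == "__main__":
    group = GroupUsage(monthly_quota_hours=1000.0)  # quota value is illustrative
    group.record(1200.0)
    print(group.priority_for_new_job())
```

Jobs are never refused under this model; they are only demoted, which matches the description of over-quota jobs becoming the lowest priority rather than being blocked.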

The University of Huddersfield campus network is also linked by several High Throughput Computing (HTC) middleware systems. This allows the HPC-RC to farm jobs out to idle CPUs on campus. The primary HTC system is known as QGG-Condor. QGG-Condor is a heterogeneous pool of resources linked using HTCondor. All the major schools of the University share their systems by deploying the HTCondor client on their lab machines. This has led to the creation of a computing network with over 5000 processing slots. The engineering department runs virtualised Linux environments that HTCondor can utilise when idle. Even at the absolute peak hours of term, it has been observed that at least 500 slots are available for use (Gubb, Holmes, Kureshi, Liang, & James, 2012).
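Farming independent jobs out to idle slots is usually expressed in HTCondor through a submit description file. The Python sketch below renders a minimal one; the function name, the `sim.sh` executable and the output file names are hypothetical, and any pool-specific requirements used on QGG-Condor are not known from the text and are omitted.

```python
def render_submit_description(executable, job_count, arguments=""):
    """Build the text of a minimal HTCondor submit description file.

    Each of the job_count queued jobs receives its own $(Process) number,
    which is used here to keep the output files separate.
    """
    if job_count < 1:
        raise ValueError("job_count must be at least 1")
    lines = [
        "universe = vanilla",
        f"executable = {executable}",
    ]
    if arguments:
        lines.append(f"arguments = {arguments}")
    lines += [
        "output = job.$(Process).out",
        "error = job.$(Process).err",
        "log = job.log",
        f"queue {job_count}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Ten independent runs of a placeholder script, one per idle slot.
    print(render_submit_description("sim.sh", 10, "$(Process)"))
```

The rendered text would normally be written to a file and passed to `condor_submit`. Passing `$(Process)` as the argument is one common way to give each queued job a distinct parameter index.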

The School of Arts does not run HTCondor within its labs. To meet its computational needs, a second high throughput middleware, Autodesk® Backburner™, is utilised to harness idle CPUs. Backburner is a proprietary middleware that can only be used by Autodesk software. Within the University of Huddersfield it is primarily used for rendering graphics from 3D Studio Max.

All systems on the Huddersfield Queensgate campus are linked by multiple grid interfaces. This allows users to stay with the Job Description Language (JDL) they are most comfortable using. Primarily, the campus grid is a single sign-on network with a globally shared file system. All users must connect and authenticate to the central access server named Bellatrix. This system is connected by a very high speed link to the University backbone and to a storage server. Users can seamlessly SSH to any computing end-point and submit natively. This creates a


“trusted grid” environment. The links are established over a private high speed fibre optic network that is dedicated to the HPC environment.

The more traditional form of the grid is delivered using the Globus middleware. From any of the compute end-points or the central access node, a user can invoke processes on remote end-points using the Globus toolkit. Each system on the QGG trusts a Huddersfield-only Virtual Organisation (VO). Using the Globus JDL, a user can submit jobs to the HTCondor pool or the HPC clusters. Users, however, need to know the location of the required applications, that is, which system the application is installed on and where on that system the binaries exist.

Those users familiar with HTCondor can use the Condor-G features to submit jobs from the HTCondor node to the other HPC clusters on the campus network. While HTCondor is configured to “know” about all the HPC resources in the network, users still need to know the location of their applications. Building on top of the Globus and HTCondor installations is the gLite middleware. Through gLite, users get the advantage of a single sign-on, a single JDL, and access to all the applications and execution end-points without having to know about them explicitly (J. Brennan et al., 2013). This system has its benefits and drawbacks. Debugging errors becomes more cumbersome, as there are now multiple points of failure. However, the gLite system can lead to a degree of load balancing, as gLite will divide jobs across different systems if the application is available on them. A second advantage is that users can use the gLite system to scale seamlessly beyond the campus grid to the UK national grid and the European grid.
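The idea of dispatching a job to any end-point that has the required application, without the user naming that end-point, can be sketched as a simple matchmaking step. The Python below is a toy model and does not reproduce gLite's actual workload management: the function name, the application names and the queue lengths are hypothetical, and only the end-point names come from the text.

```python
def choose_endpoint(application, catalogue, queued_jobs):
    """Pick the least-loaded end-point on which the application is installed.

    catalogue maps an end-point name to the set of applications installed
    there; queued_jobs maps an end-point name to its current queue length.
    Ties are broken alphabetically so the result is deterministic.
    """
    candidates = [name for name, apps in catalogue.items() if application in apps]
    if not candidates:
        raise LookupError(f"no end-point provides {application!r}")
    return min(candidates, key=lambda name: (queued_jobs.get(name, 0), name))


if __name__ == "__main__":
    # End-point names are from the text; applications and loads are made up.
    catalogue = {
        "Eridani": {"app_a"},
        "Sol": {"app_a", "app_b"},
        "QGG-Condor": {"app_b"},
    }
    queued_jobs = {"Eridani": 5, "Sol": 2, "QGG-Condor": 0}
    print(choose_endpoint("app_a", catalogue, queued_jobs))
```

In this model the user supplies only the application name. The catalogue lookup stands in for the knowledge of application locations that Globus and Condor-G users would otherwise need to hold themselves.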

The University of Huddersfield was an affiliate member of the now defunct UK National Grid Infrastructure for eScience. Local compute resources at Huddersfield were linked to the national gLite deployment for use by researchers at


other institutions. Several institutions still share resources for training purposes.

In document An Intelligent Robust Mouldable Scheduler for HPC & Elastic Environments (Page 75-80)