using Constraint Programming. We adopt a rolling horizon approach, in which our scheduler is woken up at certain events. At each activation, we build a full schedule and resource assignment for all the waiting jobs, but we dispatch only those jobs that are scheduled for immediate execution. By taking forthcoming jobs into account, we avoid making dispatching decisions with undesirable consequences; by starting only the jobs scheduled for immediate execution, the system can cope with uncertain execution times.
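The activation step described above can be sketched as follows. This is only an illustration under stated assumptions: the machine is reduced to a single pool of cores, and a greedy earliest-start rule stands in for the Constraint Programming model, which is not reproduced here. The names `Job`, `peak_usage`, `build_schedule` and `on_activation` are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    name: str
    cores: int     # cores requested (assumed not to exceed the machine size)
    duration: int  # estimated run time, in arbitrary time units


def peak_usage(placed, start, end):
    """Peak number of busy cores in [start, end) for (start, end, cores) intervals."""
    # Usage can only rise at the window start or where a placed interval begins.
    points = [start] + [s for s, _, _ in placed if start < s < end]
    return max(sum(c for s, e, c in placed if s <= p < e) for p in points)


def build_schedule(waiting, running, total_cores, now):
    """Greedy stand-in for the CP model: give every waiting job a start time."""
    placed = list(running)
    schedule = {}
    for job in waiting:
        # Candidate starts: now, or any instant at which a placed job ends.
        candidates = sorted({now} | {e for _, e, _ in placed if e > now})
        for t in candidates:
            if peak_usage(placed, t, t + job.duration) + job.cores <= total_cores:
                schedule[job.name] = t
                placed.append((t, t + job.duration, job.cores))
                break
        else:
            raise ValueError(f"{job.name} requests more cores than the machine has")
    return schedule


def on_activation(waiting, running, total_cores, now):
    """Plan all waiting jobs, but dispatch only those scheduled to start now."""
    schedule = build_schedule(waiting, running, total_cores, now)
    dispatch = [job.name for job in waiting if schedule[job.name] == now]
    return schedule, dispatch


running = [(0, 10, 8)]  # one job already holds 8 cores until t = 10
waiting = [Job("A", 8, 5), Job("B", 16, 5), Job("C", 4, 3)]
schedule, dispatch = on_activation(waiting, running, total_cores=16, now=0)
```

In this toy instance only job A is dispatched at the current activation; B and C receive tentative start times that the next activation is free to revise once actual execution times are known.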
3.2 Eurora System
The Eurora supercomputer prototype ranked first in the Green500 list in July 2013, achieving 3.2 GFlops/W on the Linpack benchmark with a peak power consumption of 30.7 kW. Eurora was supported by the PRACE 2IP project [PRA] and serves as a testbed for next-generation Tier-0 systems. Its outstanding energy efficiency is achieved by adopting a direct liquid cooling solution and a heterogeneous architecture with general-purpose HW components (Intel Xeon E5, Intel Xeon Phi and NVIDIA Kepler K20). The Eurora cooling solution is highly efficient and enables hot-water cooling, which is suitable for hot-water recycling and free-cooling solutions [KRA12c, 9.911]. These characteristics make Eurora a well-suited vehicle for testing and characterizing next-generation “greener” supercomputers.
3.2.1 System Description
As described in [BCC+14], the architecture of Eurora consists of 8 stacked chassis (half-rack), each hosting 8 node cards and 16 expansion cards (Fig. 3.1). The node card is the basic element of the system and comprises 2 Intel Xeon E5 series (Sandy Bridge) processors and 2 expansion cards configured to host an accelerator module. Half of the nodes use E5-2658 processors, with 8 cores at a 2.0 GHz clock speed, while the other half use E5-2687W processors, with 8 cores at a 3.1 GHz clock speed; 58 nodes have 16 GB of RAM and the remaining 6 (with processors at the 3.1 GHz clock rate) have 32 GB of RAM. The accelerator modules can be either NVIDIA Tesla (Kepler) or Intel MIC KNC (Xeon Phi).
Figure 3.1: EURORA Architecture
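As a quick sanity check, the system-wide totals implied by the figures above can be derived with a few lines of arithmetic. The variable names are illustrative; the input numbers are the ones stated in the description, and the totals are merely computed from them.

```python
# Figures taken from the system description.
CHASSIS = 8
NODE_CARDS_PER_CHASSIS = 8
EXPANSION_CARDS_PER_CHASSIS = 16
CPUS_PER_NODE = 2
CORES_PER_CPU = 8

nodes = CHASSIS * NODE_CARDS_PER_CHASSIS                 # node cards in the machine
expansion_cards = CHASSIS * EXPANSION_CARDS_PER_CHASSIS  # accelerator slots
accelerators_per_node = expansion_cards // nodes
cores = nodes * CPUS_PER_NODE * CORES_PER_CPU

# 58 nodes with 16 GB of RAM and 6 nodes with 32 GB of RAM.
total_ram_gb = 58 * 16 + 6 * 32

print(nodes, accelerators_per_node, cores, total_ram_gb)
```

The node count obtained this way matches the 64 execution hosts managed by the dispatcher.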
Each node of Eurora currently runs an SMP CentOS Linux distribution, version 6.3. Eurora is interfaced with the outside world through two dedicated computing nodes physically positioned outside the Eurora rack: the login node, which links Eurora to the users, executes the batch job dispatcher (PBS) and connects to the same shared file system; and the master node, which is connected to all the root cards and is visible only to system administrators. Moreover, Eurora adopts a hot liquid cooling technology, i.e. the water inside the system can reach up to 50 °C. This strongly reduces the cooling energy required to operate the system, since no power is used for actively cooling down the water, and the waste heat can be recovered as an energy source for other applications.
Eurora features an integrated, low-overhead monitoring system composed of a set of software daemons and parsing scripts. The SW daemons run periodically (every 5 seconds) on each node to collect traces of the activity of the processing elements (CPUs, GPUs, Xeon Phi) by means of HW performance counters. For each core, they gather values from the Performance Monitoring Unit as well as from the core temperature sensors and the time-stamp counter. In addition, for each CPU they gather the monitoring counters (power unit, core energy, DRAM energy, package energy) exposed by the Intel Running Average Power Limit (RAPL) interface. The parsing scripts process the raw performance-counter logs offline to generate performance metrics (CPI, load, temperature, power, etc.) and relate them to the jobs running on the node.
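To illustrate how raw counter samples can be turned into metrics such as power and CPI, the sketch below shows one plausible post-processing step. It is not the actual Eurora parsing script: the function names are hypothetical, and it assumes that the RAPL energy counters are 32-bit values that wrap around and that the energy unit is 1/2^ESU joules, where ESU is the exponent read from the power-unit register (a value of 16 is assumed in the usage example).

```python
def energy_unit_joules(esu_bits):
    """Energy represented by one counter increment, assuming a unit of 1 / 2**ESU joules."""
    return 1.0 / (1 << esu_bits)


def average_power_watts(prev_raw, curr_raw, esu_bits, interval_s=5.0, counter_bits=32):
    """Average power between two successive samples of a RAPL energy counter."""
    # The modulo handles a counter that wrapped around between the two samples.
    delta = (curr_raw - prev_raw) % (1 << counter_bits)
    return delta * energy_unit_joules(esu_bits) / interval_s


def cpi(cycles, instructions):
    """Cycles per instruction over one sampling interval."""
    return cycles / instructions


# Two package-energy samples taken 5 seconds apart (made-up raw values).
power = average_power_watts(1000, 1000 + 16384000, esu_bits=16)
# Same energy difference, but the counter wrapped around in between.
power_wrapped = average_power_watts(2**32 - 100, 16384000 - 100, esu_bits=16)
```

The default interval of 5 seconds mirrors the sampling period of the daemons; a sample pair spanning more than one counter wrap-around cannot be recovered by this approach.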
3.2.2 Current Dispatcher
The tool currently used to manage the workload on the Eurora system is PBS (Portable Batch System) [Wor15], a proprietary job scheduler by Altair PBS Works whose primary duty is to allocate computational tasks, i.e. batch jobs, among the available computing resources. The main components of PBS are a server (which manages the jobs) and several daemons running on the execution hosts (i.e. the 64 nodes of Eurora), which track resource usage and answer polling requests about the host state issued by the server component.
Jobs are submitted by the users into one of multiple queues, each characterized by different access requirements and by a different approximate waiting time. Users submit their jobs by specifying 1) the number of required nodes; 2) the number of required cores per node; 3) the number of required GPUs and MICs per node (never both at the same time); 4) the amount of required memory per node; 5) the maximum execution time. All processes that exceed their maximum execution time are killed. The main queues available on the Eurora system are called debug, parallel, and longpar, and are described in Table 3.1. For each of these queues we report the maximum amount of resources that a job may request in order to belong to that queue, i.e. the maximum number of nodes and the maximum number of cores and GPUs (second column), the maximum execution time, and the approximate time the job might wait before starting its execution.
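The five submission parameters and the associated checks can be modelled as in the sketch below. It is a minimal illustration, not the PBS interface: all class and function names are hypothetical, the per-node capacities are derived from the system description, and the queue limits used in the example are placeholders rather than the values of Table 3.1.

```python
from dataclasses import dataclass

CORES_PER_NODE = 16        # 2 CPUs x 8 cores, from the system description
ACCELERATORS_PER_NODE = 2  # 2 expansion cards per node card
MAX_MEM_GB_PER_NODE = 32   # largest memory configuration among the nodes


@dataclass(frozen=True)
class JobRequest:
    nodes: int
    cores_per_node: int
    gpus_per_node: int
    mics_per_node: int
    mem_gb_per_node: int
    max_minutes: int  # maximum execution time


@dataclass(frozen=True)
class QueueLimits:
    max_nodes: int
    max_minutes: int


def validation_errors(req, limits):
    """Return the list of reasons why a request cannot enter a queue."""
    errors = []
    if req.gpus_per_node > 0 and req.mics_per_node > 0:
        errors.append("GPUs and MICs cannot be requested together")
    if not 1 <= req.cores_per_node <= CORES_PER_NODE:
        errors.append("cores per node out of range")
    if max(req.gpus_per_node, req.mics_per_node) > ACCELERATORS_PER_NODE:
        errors.append("too many accelerators per node")
    if req.mem_gb_per_node > MAX_MEM_GB_PER_NODE:
        errors.append("too much memory per node")
    if req.nodes > limits.max_nodes:
        errors.append("too many nodes for this queue")
    if req.max_minutes > limits.max_minutes:
        errors.append("execution time too long for this queue")
    return errors


def must_kill(elapsed_minutes, req):
    """Jobs that exceed their maximum execution time are killed."""
    return elapsed_minutes > req.max_minutes


# Placeholder limits for illustration only (NOT the values of Table 3.1).
example_limits = QueueLimits(max_nodes=2, max_minutes=30)
ok_request = JobRequest(1, 8, 2, 0, 16, 20)
bad_request = JobRequest(4, 8, 1, 1, 16, 20)
ok_errors = validation_errors(ok_request, example_limits)
bad_errors = validation_errors(bad_request, example_limits)
```

Here `bad_request` is rejected twice over: it asks for GPUs and MICs at the same time, and it exceeds the placeholder node limit of the example queue.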
Cyclically, PBS selects a job for execution and polls the state of one or more nodes, trying to find enough available resources to actually start the job. If the attempt is unsuccessful, the job is sent back to its queue and PBS proceeds to consider the next candidate. The choices are guided by priority values and hard-coded constraints defined by the Eurora administrators with the aim of achieving good machine utilization and short waiting times. For example, the administrators decided to reserve some nodes for the debug queue and to force jobs in the longpar queue to start at night.
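The selection cycle described above can be approximated by the following sketch. It is a deliberate simplification and not the PBS implementation: jobs are plain dictionaries, only free cores are tracked, administrator constraints such as reserved nodes or night-time starts are omitted, and the names `try_start` and `scheduling_cycle` are hypothetical.

```python
def try_start(job, free_cores):
    """Poll the nodes; reserve cores and return the chosen node ids, or None."""
    fitting = [n for n, free in free_cores.items() if free >= job["cores_per_node"]]
    if len(fitting) < job["nodes"]:
        return None  # not enough resources right now
    chosen = fitting[:job["nodes"]]
    for n in chosen:
        free_cores[n] -= job["cores_per_node"]
    return chosen


def scheduling_cycle(queue, free_cores):
    """Consider candidates by decreasing priority; unsuccessful jobs keep waiting."""
    started, still_waiting = {}, []
    for job in sorted(queue, key=lambda j: -j["priority"]):
        nodes = try_start(job, free_cores)
        if nodes is None:
            still_waiting.append(job)  # sent back to its queue
        else:
            started[job["name"]] = nodes
    return started, still_waiting


free_cores = {0: 16, 1: 8, 2: 16}  # free cores per node id (made-up state)
queue = [
    {"name": "small", "priority": 1, "nodes": 1, "cores_per_node": 16},
    {"name": "big", "priority": 3, "nodes": 3, "cores_per_node": 16},
    {"name": "mid", "priority": 2, "nodes": 2, "cores_per_node": 8},
]
started, still_waiting = scheduling_cycle(queue, free_cores)
```

In this example the highest-priority job cannot start and lower-priority jobs take the free resources immediately. Because each decision looks only at the current state of the nodes, nothing in the loop accounts for the effect on jobs that remain in the queue.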