SEVENTH FRAMEWORK PROGRAMME
Research Infrastructures
FP7-ICT-2011-7
DEEP
Dynamical Exascale Entry Platform
Grant Agreement Number: 287530
D7.1
Data centre infrastructure requirements
Approved
Version: 2.0
Author(s): Nils Meyer, Stefan Solbrig, Tilo Wettig (UniReg), Axel Auweter, Herbert Huber (BADW-LRZ)
Date: 12.06.2012
Project and Deliverable Information Sheet
DEEP Project
Project Ref. №: 287530
Project Title: Dynamical Exascale Entry Platform
Project Web Site: http://www.deep-project.eu
Deliverable ID: D7.1
Deliverable Nature: Report
Deliverable Level: PU*
Contractual Date of Delivery: 31 / May / 2012
Actual Date of Delivery: 29 / May / 2012
EC Project Officer: Leonardo Flores
* - The dissemination levels are indicated as follows: PU – Public, PP – Restricted to other participants (including the Commission Services), RE – Restricted to a group specified by the consortium (including the Commission Services). CO – Confidential, only for members of the consortium (including the Commission Services).
Document Control Sheet
Document
Title: Data centre infrastructure requirements
ID: D7.1
Version: 2.0
Status: Approved
Available at: http://www.deep-project.eu
Software Tool: Microsoft Word
File(s): DEEP_D7.1_Data_centre_infrastructure_requirements_2.0_ECapproved
Authorship
Written by: Nils Meyer (UniReg), Stefan Solbrig (UniReg), Tilo Wettig (UniReg), Axel Auweter (BADW-LRZ), Herbert Huber (BADW-LRZ)
Contributors:
Reviewed by: Suraj Prabhakaran (GRS), Wolfgang Gürich (JUELICH)
Document Status Sheet
| Version | Date | Status | Comments |
|---|---|---|---|
| 0.1 | 03/May/2012 | Draft | Initial version |
| 0.2 | 07/May/2012 | Draft | Version for internal review |
| 0.3 | 09/May/2012 | Draft | Updated version for internal review |
| 0.4 | 15/May/2012 | Draft | Internal review by Suraj Prabhakaran |
| 0.5 | 19/May/2012 | Draft | W.Gürich/PMT |
| 0.9 | 22/May/2012 | Pre-Final | Final review before submission to EC |
| 1.0 | 29/May/2012 | Final | EC submission |
| 2.0 | 12/June/2012 | Approved | Approved by EC |
Document Keywords
Keywords: DEEP, HPC, Exascale, Data centre, Infrastructure, Energy Efficiency
Copyright notices
2011-2012 DEEP Consortium Partners. All rights reserved. This document is a project document of the DEEP project. All contents are reserved by default and may not be disclosed to third parties without the written consent of the DEEP partners, except as mandated by the European Commission contract 287530 for reviewing and dissemination purposes.
All trademarks and other rights on third party products mentioned in this document are acknowledged as owned by their respective holders.
Table of Contents
Project and Deliverable Information Sheet
Document Control Sheet
Document Status Sheet
Document Keywords
Table of Contents
List of Figures
List of Tables
Executive Summary
1 Introduction
1.1 Motivation and Scope of the Document
1.2 Metrics for Energy Efficiency
2 Principles of Energy-Efficient Data Centre Design
2.1 Data Centre Location
2.2 Power
2.2.1 Electrical Efficiency
2.2.2 Safety, Cost and Compatibility
2.3 Cooling
2.3.1 Air Cooling vs. Water Cooling
2.3.2 Free Cooling
2.3.3 Water Cooling Special Considerations
2.3.4 Immersion Cooling
2.4 Energy Reuse
2.5 Monitoring & Control
3 Existing Experimentation Platforms
3.1 iDataCool
3.1.1 Processing Hardware
3.1.2 Water Cooling Solution
3.1.3 Future Perspectives
3.2 CooLMUC
3.2.1 Processing Hardware
3.2.2 Cluster Arrangement and Cooling
3.3 Current Results on the Experimentation Platforms
4 Recommendations for HPC Centre Infrastructures
References and Applicable Documents
List of Figures
Figure 1: Various levels of the data centre infrastructure stack
Figure 2: Schematic overview of dry- and wet-bulb temperatures
Figure 3: iDataCool experimentation cluster at the University of Regensburg
Figure 4: iDataCool compute node before and after modification
Figure 5: CooLMUC experimentation cluster at Leibniz Supercomputing Centre
Figure 6: CooLMUC compute node with copper pipes for direct liquid cooling
Figure 7: CPU temperatures in relation to water inlet temperature
Figure 8: Temperature difference Tdelta depending on inlet temperature
Figure 9: Comparison of node power consumption and CPU temperature of air-cooled nodes and direct liquid cooled nodes at different inlet temperatures
List of Tables
Table 1: Overview of the efficiencies of different power distribution schemes.
Executive Summary
In the Exascale challenge, which the DEEP project is tackling, energy efficiency is a key aspect. The ranking of high-performance computing systems by their power-performance ratio was introduced only a few years ago with the establishment of the Green500 list of the most energy-efficient supercomputers. Still, such rankings mostly ignore the fact that not only the machine itself is a major consumer of power, but also the surrounding infrastructure for power distribution and cooling.
This document covers important principles for energy-efficient data centre design. Key elements range from the selection of a suitable location for the data centre to the implementation of a suitable liquid cooling system with the potential for reusing waste heat for heating and cooling. We show that liquid cooling is by far the most efficient way of cooling high-performance computers since it enables free cooling year-round and facilitates energy reuse. Yet, actually implementing a liquid cooled system imposes additional challenges such as the proper treatment of water. All findings are backed by experiments conducted and insight gained on in-kind experimentation hardware at the University of Regensburg and the Leibniz Supercomputing Centre. As guidance for HPC centres working on improving their site’s energy efficiency, a set of five recommendations for energy-efficient design of HPC centre infrastructures concludes this report.
The recommendations and the principles worked out here will be the blueprint for the design of the DEEP Booster, the installation environment and the operation of the DEEP System.
1 Introduction
1.1 Motivation and Scope of the Document
With the aim to pave the way for European Exascale Computing, the DEEP Project addresses several limitations of today’s technology that need to be overcome before Exascale high-performance computing becomes reasonable. Among those limitations, the energy efficiency of today’s supercomputers is one of the biggest challenges. If we were to build a supercomputer with Exascale performance using today’s technology, the machine would consume electrical power on the order of 1 GW, which is roughly the amount of power generated by a nuclear power plant. Among experts, it is commonly accepted that reasonable power consumption for Exascale supercomputers is on the order of 20 MW. Thus, a 50x improvement in energy efficiency is necessary.
Since 1993, the Top500 list of supercomputers has ranked the fastest supercomputers in the world according to the number of floating-point operations per second (FLOP/s). Since the fall of 2007, this list has been accompanied by the Green500 list of the most energy-efficient supercomputers, which ranks the systems according to the number of FLOP/s per Watt. While this has been a big step in recognizing the necessity of energy-efficient supercomputers, it only addresses the efficiency of the compute hardware, leaving aside the energy required to power the infrastructure surrounding the hardware, e.g., for cooling.
According to a study conducted by Gartner [1], “Data center space, power, and/or cooling” was listed as the number one challenge of data centres in 2009 and 2011. Another study conducted by IDC in 2009 [2] shows that the costs for power and cooling of servers worldwide have increased by more than a factor of five since 1996.
These numbers clearly show that alongside efficient compute hardware and energy-aware system operation, supercomputing sites and infrastructures must be optimized for energy efficiency, too. This document reviews and summarizes the state of the art in high-performance computing infrastructures with respect to power, cooling, and energy reuse at the data centre level. All of the proposed techniques can be applied to data centres and high-performance computing sites independently of the operation of DEEP hardware. Clearly, it is not feasible to implement all the proposed techniques from this document within the DEEP time frame at the DEEP prototype sites. For a detailed description of the DEEP prototype’s hardware concept, DEEP project members may refer to D3.1 “System hardware concept document”. Others will find more information on the DEEP website [3]. For a detailed description of the infrastructure requirements of the DEEP prototype, please see D6.1 “Definition of environmental requirements”.
1.2 Metrics for Energy Efficiency
The Green Grid consortium [4] makes an effort at standardizing the assessment of data centre energy efficiency by defining appropriate metrics. To measure the energy efficiency of an entire data centre, the energy consumed by the entire site is put in relation to the energy consumed by the IT equipment only. This fraction forms the Power Usage Effectiveness (PUE):
$$\mathrm{PUE} = \frac{\text{Total site energy}}{\text{IT equipment energy}}$$
The best theoretical PUE that can be achieved is 1.0. Since the IT equipment energy is a part of the total site energy, by definition, the PUE cannot be smaller than 1.0.
The PUE metric has become very popular in recent years. People started to use it in order to advertise the energy efficiency of single (sub-)systems in data centres. The resulting metric is called Partial PUE (pPUE) in order to indicate that some power-consuming entities might have been left out of the calculation.
A common mistake in the calculation of the PUE is to subtract any energy being reused in the data centre, e.g., by implementation of one of the technologies for energy reuse described later in this document. There is another metric to quantify the benefits from reusing energy. The Energy Reuse Effectiveness (ERE) is defined as:
$$\mathrm{ERE} = \frac{\text{Total site energy} - \text{Reused energy}}{\text{IT equipment energy}}$$
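Alternatively, one can specify the Energy Reuse Factor (ERF), which specifies the amount of energy being reused as a number ranging from 0 to 1:
$$\mathrm{ERF} = \frac{\text{Reused energy}}{\text{Total site energy}} \quad\Leftrightarrow\quad \mathrm{ERE} = (1 - \mathrm{ERF}) \cdot \mathrm{PUE}$$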
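The relation between the three metrics can be checked numerically. The following is a minimal sketch, assuming the usual Green Grid definitions (PUE as total site energy over IT energy, ERE as total site energy minus reused energy over IT energy, ERF as reused energy over total site energy). The function names and the annual energy figures are hypothetical and serve only as an illustration.

```python
import math


def pue(total_site_energy, it_energy):
    """Power Usage Effectiveness: total site energy divided by IT equipment energy."""
    return total_site_energy / it_energy


def ere(total_site_energy, reused_energy, it_energy):
    """Energy Reuse Effectiveness: reused energy is subtracted here, not in the PUE."""
    return (total_site_energy - reused_energy) / it_energy


def erf(reused_energy, total_site_energy):
    """Energy Reuse Factor: fraction of the total site energy that is reused (0 to 1)."""
    return reused_energy / total_site_energy


# Hypothetical annual figures in MWh (assumptions, not measurements).
total, it, reused = 15000.0, 10000.0, 3000.0

print(f"PUE = {pue(total, it):.2f}")
print(f"ERE = {ere(total, reused, it):.2f}")
print(f"ERF = {erf(reused, total):.2f}")

# The two ways of expressing energy reuse agree: ERE = (1 - ERF) * PUE.
assert math.isclose(ere(total, reused, it), (1 - erf(reused, total)) * pue(total, it))
```

With these example numbers the PUE stays at 1.5 regardless of how much energy is reused, while the ERE drops to 1.2, which is why reuse should be reported through ERE or ERF rather than by lowering the PUE.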
To acknowledge the growing use of renewable sources of energy and to bring the use of water into an all-encompassing view of a data centre’s environmental footprint, the Green Grid has recently extended its efficiency metrics to also include CUE and WUE to put carbon emissions and water usage in relation to the IT energy.
When talking about the infrastructure of data centres, we typically refer to the conversion and distribution of power as well as the cooling system. Although the actual architectural design of high-performance computing centres varies widely, a structure common to all centres can be identified.
Figure 1: Various levels of the data centre infrastructure stack
We call this hierarchy of levels the data centre infrastructure stack. Every level of the stack can address important aspects of the infrastructure’s energy efficiency, and measures can be taken to improve the efficiency at each level. However, we will show later that it is beneficial not only to optimize locally within each level, but also to monitor and optimize globally across the data centre infrastructure stack. Since different vendors typically supply components at individual levels, this global optimization can become quite challenging.
2 Principles of Energy-Efficient Data Centre Design
2.1 Data Centre Location
Various current examples show that a well-chosen geographic location can reduce the total cost of ownership (TCO) of a data centre.
The climate at the data centre location directly affects the cooling options. For example, free cooling (explained in section 2.3.2) works more efficiently when the outside temperature is low and, if evaporative cooling is required, when the humidity is also low. Apart from the climate, access to natural cooling resources like rivers or the sea can be important. Naturally available cool water can be used as the cold reservoir for cooling. One example is a data centre in southern Finland, at the Baltic Sea, run by Google, the major internet company [5]. For very large data centres, the negative environmental effects of river warming should be considered: higher temperatures lead to lower oxygen concentrations, thus affecting the fish population.
However, cooling resources are just one of many resources that data centres rely on. Others include:
Energy infrastructure, e.g., sufficiently well developed power lines and proximity to power generation. This might become more important in the future if power generation becomes decentralized or power has to be transported over larger distances.
Social infrastructure to recruit well-trained personnel.
Network infrastructure, especially if network latency is also an issue. To achieve reasonable bandwidth and latency, the data centre should not be located in a remote place.
2.2 Power
In a typical data centre, power is delivered using alternating current (AC) that goes through multiple conversions between the main building power supply and the 12 V direct current (DC) internal distribution voltage of most IT equipment. Each conversion introduces losses, wasting energy and producing heat that must be removed by the data centre cooling system. To improve the electrical efficiency of data centres, the number of electrical conversions must be minimized. This section compares AC and DC power distribution architectures with respect to electrical efficiency, safety, cost, and compatibility.
2.2.1 Electrical Efficiency
Table 1: Overview of the efficiencies of different power distribution schemes.
| Power Distribution | UPS Efficiency | Distribution Efficiency | IT Power Supply Efficiency | Overall Efficiency |
|---|---|---|---|---|
| 480 to 208 V AC | 96.20% | 96.52% | 90.00% | 83.56% |
| 400/230 V AC | 96.20% | 99.50% | 90.25% | 86.39% |
| 48 V DC | 92.86% | 99.50% | 91.54% | 80.74% |
| 380 V DC | 96.00% | 99.50% | 91.75% | 87.64% |
When comparing the efficiencies of AC and DC, one can compare three standard types of power distribution with an optimal hypothetical approach:
Common AC distribution in North America (480/277 V AC to 208/120 V AC)
Common AC distribution outside North America (400/230 V AC)
Typical telecom DC distribution (48 V DC)
Hypothetical approach (380 V DC)
Table 1 lists efficiencies for various non-redundant power distribution scenarios under 50% load as published by Neil Rasmussen in the APC White Paper “AC vs. DC Power Distribution for Data Centers” [6]. The data clearly show that 380 V DC and 400/230 V AC are the most efficient power distribution systems at the data centre infrastructure level. Despite the slight advantage of the 380 V DC distribution scheme, additional aspects such as safety and cost (described in the next section) can justify following the 400/230 V AC approach.
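Since the conversion stages are connected in series, the overall efficiency of a distribution chain is the product of the stage efficiencies. The following sketch, assuming this simple series model, multiplies the stage values listed in Table 1 for the two AC schemes and the 380 V DC scheme. The helper names and the 1 MW IT load are hypothetical.

```python
from math import prod

# Stage efficiencies (UPS, distribution, IT power supply) as listed in Table 1.
SCHEMES = {
    "480 to 208 V AC": (0.9620, 0.9652, 0.9000),
    "400/230 V AC": (0.9620, 0.9950, 0.9025),
    "380 V DC": (0.9600, 0.9950, 0.9175),
}


def overall_efficiency(stages):
    """Efficiency of conversion stages in series: the product of the stage values."""
    return prod(stages)


def input_power_kw(delivered_kw, efficiency):
    """Power drawn at the input of the chain to deliver `delivered_kw` at its output."""
    return delivered_kw / efficiency


delivered_kw = 1000.0  # hypothetical load behind the IT power supplies
for name, stages in SCHEMES.items():
    eta = overall_efficiency(stages)
    loss = input_power_kw(delivered_kw, eta) - delivered_kw
    print(f"{name}: overall {eta:.2%}, conversion loss {loss:.0f} kW")
```

For these three rows the products agree with the tabulated overall efficiencies to within rounding. The difference between 400/230 V AC and 380 V DC amounts to a little over one percentage point, which is the "slight advantage" referred to above.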
The use of Uninterruptible Power Supplies (UPS) is an important measure for system stability. However, typical HPC systems have less stringent availability requirements than the services provided in other data centres: through the use of checkpoints, calculations can be restarted after a system failure or power outage. Thus, HPC centre infrastructures can either abstain from UPS systems entirely or favour systems that provide shorter protection times at higher electrical efficiencies (e.g., fly-wheel systems) over systems with longer protection times that typically suffer from worse electrical efficiencies (e.g., battery backed systems).
2.2.2 Safety, Cost and Compatibility
AC power distribution is the de-facto standard for power distribution worldwide, with regulations at the international as well as national level. Only a few regulations exist for commercial DC power distribution which could result in increased certification costs for meeting local safety and electromagnetic compatibility standards. These costs will have to be covered by the data centre operator.
With respect to the electric components in use, in principle, the cost of a 380 V DC power distribution system should be lower. However, due to the low production volume of the corresponding technical equipment, there is currently no noteworthy cost advantage over 400/230 V AC power distribution.
IT equipment such as servers, storage systems, and network switches is in general designed for AC power input. As of today, DC versions of the aforementioned IT equipment are either not available or expensive custom products.
In summary, the advantages of DC versus AC are small. The use of a 380 V DC power distribution might only be economically advantageous in large supercomputing centres or very large data centres operating a very large number of identical servers. In such cases, the UPS systems should be implemented at the AC level nevertheless. DC distribution and conversion can be left to the system integrator.
2.3 Cooling
2.3.1 Air Cooling vs. Water Cooling
While currently most of the equipment in large computing centres is air-cooled, there is a trend towards water-cooled systems. The purpose of this section is to consider the pros and cons of the following three types of systems:
1. Air-cooled systems are systems in which the heat is removed from the components by air blown or drawn into the rack. The hot air thus generated is then removed from the machine room by an air-conditioning system.
2. Indirectly liquid-cooled systems are much like air-cooled systems, except that they themselves contain an air-water heat exchanger (e.g., a rear-door heat exchanger) where the heat is transferred to water supplied by the building’s chilled-water circuit.
3. Directly liquid-cooled systems are systems in which the coolant is brought very close to the components and airflow is not necessary.
Let us consider some important issues arising from these solutions.
Heat sinks: Air-cooled and indirectly liquid-cooled systems need capable heat sinks between the processor or other hot devices (accelerators, network cards) and the cool air. These are readily available but can be rather large given the heat production of today’s devices. In directly liquid-cooled systems the heat sinks on the chips can be smaller.
Facility infrastructure: Air-cooled systems have the simplest infrastructure requirements. Only the airflow through the racks and the building needs to be designed. However, large quantities of air have to be circulated, and the hot air has to be cooled by the air conditioning system of the data centre. Liquid-cooled systems need a significant infrastructure that directs the coolant to the various heat sources. Designing a plumbing system that fulfils all flow and pressure requirements can be a nontrivial task.
Packaging density and floor space: Directly liquid-cooled systems allow for the highest packaging density. Since no airflow within the rack or within the building is needed, more compute units can be fit in a rack, and more racks can be fit in the same building. Indirectly liquid-cooled systems can also fit more racks in the same building than air-cooled systems, but the need for sufficient airflow within a rack may limit the number of compute units in the same rack.
Noise: Air-cooled systems tend to create far more noise than other systems because of large fans and turbulent airflow. This is partially true also for indirectly liquid-cooled systems, but in that case the racks are closed so that the noise coming from the fans in the rack is reduced. For directly liquid-cooled systems the only significant noise is created when the heat is transferred to outside air. This typically happens outside the building. If fans are needed at all, then large, slow-running ones can be used.
Maintenance costs: Liquid-cooled systems come with additional maintenance costs, especially for water treatment (see section 2.3.3 on water treatment). Extra care has to be taken to reduce the risk of leaks, or to provide some emergency measures in case of water leaks. Extra sensors to monitor pressure, water flow, or dew point temperature are frequently necessary. Exchanging faulty equipment in directly liquid-cooled systems can also be more time-consuming and thus more expensive than in air-cooled or indirectly liquid-cooled systems.
Cooling costs: In the case of air-cooled or indirectly liquid-cooled systems the hot air generated by the computing equipment needs to be removed, which can either be done directly by the air conditioning system (air-cooled systems), or by air-water heat exchangers such as rear-door heat exchangers, in-row coolers, etc. (indirectly liquid-cooled systems). In all these cases chilled water needs to be provided unless free cooling (as explained in the next section) is possible. Chilled water is typically generated by compressor-based refrigeration systems, which are connected to cooling towers. The physical principles underlying their operation are discussed in [7] and [8]. The important point from the energy-efficiency point of view is that the production of chilled water is costly. It requires serious investments in equipment and the maintenance thereof. More importantly, the recurrent costs for the production of chilled water are directly proportional to the energy consumption of the computing equipment. The proportionality factor varies widely between computing centres due to different setups and different geographic locations, but as an indication we quote the numbers for the current Japanese flagship system, the K-computer, with 20 MW of power for the computer room and a 10 MW water-cooling facility. Apart from the costs for generating chilled water, the fans needed for such systems also make a significant contribution to the total energy bill. Directly liquid-cooled systems need pumps to maintain a sufficient coolant flow, but their power consumption is much lower than that of traditional fans.
Option for free cooling: Directly liquid-cooled systems in general offer an easier option for free cooling, as described in the next section.
Given the steady increase in electrical power consumption and the increasing packaging density of modern computer equipment, liquid-cooled systems (at least indirect ones) will be necessary. In particular, the power per volume of the equipment poses direct problems for air cooling:
To transport the heat away from the equipment, the airflow has to increase.
The increased airflow requires more powerful fans, which in turn require more electrical power and use up more rack space.
To transport the heat from the computing devices to the air, complicated and large heat sinks are necessary.
Since the thermal capacity of air is rather low, chilled air has to be used in order to keep the airflow at an acceptable level. Depending on local resources, extra electrical power is necessary for chilling.
Water cooling can overcome these issues, but comes at an increased complexity for system design and water treatment. As a special case, immersion cooling can be seen as an indirectly liquid-cooled system, where the primary coolant is some electrically non-conductive liquid instead of air.
2.3.2 Free Cooling
The term “free cooling” is used collectively for a number of methods that use ambient air for cooling. The obvious advantage of these methods is the lower energy cost since the need for chilled water is greatly reduced or even eliminated. For example, in winter cold ambient air could simply be drawn into the computing centre (although in general it needs to be filtered and warmed up beforehand). In the following, we will concentrate on a more restrictive meaning of free cooling, where we assume that we have a liquid cooling system in which the water temperature is sufficiently high so that heat can be transferred from water to ambient air using a dry cooler. The latter is a relatively simple and cheap piece of equipment that essentially consists of an air-water heat exchanger and fans. Since the power consumption of the computing equipment increases slightly with increasing cooling water temperature, it is sensible to adjust the latter to the minimum value that still guarantees that all heat can be removed. This minimum value depends on the outside temperature. On hot summer days it may be necessary to increase the cooling water temperature to about 40°C (in Northern and Central Europe), which means that the water-cooling system of the computer must be designed to support such temperatures. Note that the cost of free cooling is still nonzero since the water pumps and the fans consume energy, but this cost is generally much lower than the cost of generating chilled water. Note also that in most cases the water circuits of the computer and of the dry cooler should be separated by a water-water heat exchanger for a number of operational and safety reasons (e.g., glycol in the dry cooler circuit, controlled water quality in the computer circuit, minimal water volume in the computer circuit to limit the consequences of leakages, etc.).
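The idea of running the cooling water at the lowest temperature that still allows all heat to be rejected to the outside air can be sketched as a simple setpoint rule. This is an illustration only, not the control scheme of any system described in this document: the function name, the 5 K approach between outside air and cooled water, and the 20°C lower limit are hypothetical assumptions; only the upper limit of about 40°C is taken from the text.

```python
def cooling_water_setpoint(outside_temp_c, approach_k=5.0, min_c=20.0, max_c=40.0):
    """Return (setpoint_c, free_cooling_ok) for the cooling water of the computer.

    approach_k: assumed temperature difference the dry cooler needs between
                outside air and cooled water (hypothetical value).
    min_c:      assumed lower limit for the water temperature (hypothetical value).
    max_c:      highest water temperature the computer is designed for.
    """
    # A dry cooler cannot cool the water below outside temperature plus approach.
    lowest_reachable = outside_temp_c + approach_k
    setpoint = max(min_c, lowest_reachable)
    if setpoint > max_c:
        # Free cooling alone is not sufficient, e.g. evaporative cooling is needed.
        return max_c, False
    return setpoint, True


for t_out in (5.0, 30.0, 38.0):
    print(t_out, cooling_water_setpoint(t_out))
```

In this sketch the setpoint follows the outside temperature between the two limits, and the second return value indicates when a dry cooler alone can no longer keep the water within the design range.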
There is a refinement to free cooling, i.e., evaporative cooling, that is worth discussing. To understand the principle we first need to distinguish the dry-bulb and wet-bulb temperatures of air, see, e.g. [9], [10], and [11]. The dry-bulb temperature is the temperature measured by a thermometer that is insulated from the moisture of the air. This is what we usually mean by temperature. The wet-bulb temperature is the temperature measured by a moistened thermometer bulb exposed to airflow. It is lower than the dry-bulb temperature since the water moisture evaporates adiabatically into the air, and the latent heat associated with the evaporation process is taken from the thermometer (“evaporative cooling”). The lower the humidity, the greater the difference between wet-bulb and dry-bulb temperature since the air can absorb more moisture. There is also the dew-point temperature, which is the temperature at which air is saturated with water vapour so that the vapour condenses.
One can use evaporative cooling to enable free cooling even on hot days, when the temperature of the cooling water may be lower than the dry-bulb temperature. This is done by spraying water onto the heat exchanger of the dry cooler. This water evaporates and the latent heat for the evaporation process is taken from the cooling water inside the heat exchanger, whose temperature is thus lowered. We show a schematic picture here, where the blue dots indicate the evaporating water and T1 and T2 are the temperatures of the cooling water before and after passing through the heat exchanger.
Figure 2: Schematic overview of dry- and wet-bulb temperatures
The point is that for heat transfer to take place from water to air, T1 only needs to be higher than Twet and not Tdry. The former can be significantly lower than the latter if the humidity is not too high. For example, for Tdry = 40°C and a humidity of 40% we have Twet = 28°C, see [12]. Of course, T2 cannot be lower than Twet. One disadvantage of evaporative cooling is that it consumes water. However, in typical installations evaporative cooling is only used on hot days so that the water consumption is generally low.
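For a quick estimate of the wet-bulb temperature, an empirical approximation can be used. The sketch below uses the fit published by Stull (2011), which takes the dry-bulb temperature in °C and the relative humidity in percent and is intended for conditions near standard sea-level pressure. This is not the method behind reference [12], and the function names are hypothetical.

```python
import math


def wet_bulb_stull(t_dry_c, rh_percent):
    """Approximate wet-bulb temperature in deg C (empirical fit by Stull, 2011)."""
    return (
        t_dry_c * math.atan(0.151977 * math.sqrt(rh_percent + 8.313659))
        + math.atan(t_dry_c + rh_percent)
        - math.atan(rh_percent - 1.676331)
        + 0.00391838 * rh_percent ** 1.5 * math.atan(0.023101 * rh_percent)
        - 4.686035
    )


def evaporative_margin(t_dry_c, rh_percent):
    """How far below the dry-bulb temperature evaporative cooling can reach, in kelvin."""
    return t_dry_c - wet_bulb_stull(t_dry_c, rh_percent)


t_wet = wet_bulb_stull(40.0, 40.0)
print(f"Twet = {t_wet:.1f} deg C, margin = {evaporative_margin(40.0, 40.0):.1f} K")
```

For Tdry = 40°C and 40% humidity the approximation yields a value between 28°C and 29°C, in line with the example above, so cooling water at a temperature below 40°C can still release heat to the outside air.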
2.3.3 Water Cooling Special Considerations
Material Mix
Care has to be taken when choosing the materials of pipes, heat sinks, or heat exchangers to be used in a liquid cooling circuit. A mix of stainless steel and copper or brass is commonly used. Since the electro-negativity of copper is rather high, there is little danger of corrosion for copper. On the other hand, aluminium and some of its alloys can corrode in the presence of other metals. Therefore, a mix of stainless steel and copper is mostly considered unproblematic, whereas a mix of aluminium and steel or even aluminium and copper is likely to cause problems on the aluminium parts of the plumbing. A system where only stainless steel and plastic are in contact with water would be ideal. However, this is hard to achieve because copper and brass parts are typically easier and cheaper to manufacture than steel.
It is advisable to check the plumbing regularly for corrosion and to monitor the cooling water for ions, especially if copper is part of the material mix.
Water Treatment
The treatment of the coolant water is still a topic of debate. Various viewpoints exist.
One approach is to use purified or deionized water. The downside of this approach is that deionized water tends to draw ions from the surroundings, e.g., the plumbing. Additionally, deionized water is only weakly buffered. This means that a small change in the ion concentration can lead to drastic changes in the pH value. A drop in pH, i.e., a more acidic coolant, will further aid corrosion. Therefore, pure deionized water is rarely used. Instead, the deionized water is supplemented with anti-corrosives, buffer substances, and biocides that are meant to prevent the growth of algae, fungi, and bacteria.
Instead of using deionized water and then adding buffers, the other approach is to use tap water, or a mix of tap water and deionized water. The calcium hydrogen carbonate contained in tap water is a natural buffer substance, thus corrosion is expected to be less of an issue. However, for warm or hot cooling water, there is another issue: at higher temperatures, the calcium hydrogen carbonate may turn into solid calcium carbonate that sticks to the inner walls of the plumbing.
The biological treatment of the cooling water can be challenging. Note especially that even deionized water is usually not bacteria-free. Preventing an initial pollution of the cooling water is highly impractical. Not only would one have to treat all parts (including the coolant) by autoclave, but one would also have to prevent a subsequent pollution if a compute node has to be exchanged.
Instead, one should focus on methods to reduce the unavoidable bacterial growth. The perceived absence of nutrients in the cooling water will not prevent bacteria from growing. Many kinds are autotrophic, i.e., they can assimilate carbon dioxide and nitrogen, and may build biological structures using the few ions that will inevitably enter the water through the plumbing.
A possible method is to use biocides, but the benefit of this method is not guaranteed. Even if a high initial dose of biocides kills most of the bacteria in the cooling system, the surviving ones can develop resistance, and further application of a biocide will hardly affect the remaining bacterial population. Therefore, the biocide should not remain in the cooling water. The biocide used might also impose legal restrictions on how the cooling water can be disposed of. Degradable substances can circumvent this issue.
An alternative approach may be to clean the system regularly using 70% ethanol. Note that the system has to be rinsed thoroughly afterwards, because any remaining ethanol will be a carbon source and thus foster bacterial growth. However, ethanol will not affect bacterial spores, so depending on the bacterial species present in the cooling water, using ethanol may not be effective. Furthermore, using ethanol is only practical for relatively small systems: It is flammable and even a medium-sized computing cluster including all heat exchangers can contain on the order of 100 litres of liquid. Thus, safety considerations might preclude the idea of cleaning with ethanol.
From our discussion about water quality and treatment we conclude that currently no best practice exists to deal with the potential problems coming with a water cooling installation. Unfortunately, our own experience with issues regarding the coolant and its treatment is still limited due to the still short lifetime of our computer installations, but we are confident that a viable solution will be found in the future. If a directly liquid-cooled system is installed, we recommend monitoring the containment of the cooling circuit attached to the nodes on a regular basis. The system integration should ensure the ability to take samples and exchange the coolant without interruption of operation.
Dealing with Air-Cooled Components
While it was shown that liquid cooling is superior to air cooling in many ways, actually implementing an entirely liquid-cooled HPC system poses some additional challenges. The problem is that, as of today, HPC systems rely on parts for which no directly liquid-cooled solutions exist. Examples include entire hardware units such as interconnect network switches as well as individual parts like power supply units. The amount of heat emitted by those components is relatively small compared to the active components for which directly liquid-cooled solutions exist. Yet, at the typical scale of HPC installations, the aggregated amount of heat emitted by those devices can be significant, and care has to be taken to ensure energy-efficient cooling of those components along with the directly liquid-cooled parts. Since air cooling is still the de facto standard in data centre cooling, many approaches exist to improve the cooling efficiency of air-cooled systems. For example, hot or cold aisle enclosures help to separate hot from cold air areas in the computer room. As a result, hot and cold air no longer mix, and the cooling efficiency improves. In experiments [13], the separation of hot and cold aisles improved the PUE from 1.8 to 1.48. A further improvement for the remaining air-cooled components comes from the proper choice of the inlet air temperature. The latest ASHRAE specification suggests an inlet air temperature of 27°C. In the same experimental setup, this improved the PUE further to 1.4. Care has to be taken though, since higher inlet temperatures may also cause increased fan speeds and higher leakage currents in the devices, which increases the total energy consumption of the site.
While the separation of hot and cold aisles and the choice of a proper inlet temperature already result in big improvements, further improvements can be made by shortening the distance along which air is used to transport heat. Rack-based solutions exist that use ambient air to cool the systems and re-cool the air, e.g., in a water-based rear-door heat exchanger before the air re-enters the room. Other rack solutions avoid the use of computer room air and use water-cooled heat exchangers that continuously cool a closed loop of air in a single rack. We assume that these technologies can bring the PUE down to about 1.2 but are not aware of any scientific study backing this number.
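To put these PUE figures into perspective, the following minimal sketch converts them into non-IT overhead power. It assumes the usual definition of PUE as total facility power divided by IT power and a constant IT load; the IT load of 1000 kW is a hypothetical value chosen for illustration, and the PUE of 1.2 is the unverified assumption stated above, not a measured result.

```python
# PUE = total facility power / IT power, so the non-IT overhead is (PUE - 1) * IT power.
IT_LOAD_KW = 1000.0  # hypothetical IT load, not a figure from the experiments

# PUE values quoted in the text; 1.2 is an assumption without a backing study.
PUE_STEPS = [
    ("no aisle separation", 1.8),
    ("hot/cold aisle separation", 1.48),
    ("inlet air temperature of 27 degC", 1.4),
    ("rack-level heat exchangers (assumed)", 1.2),
]


def overhead_kw(pue, it_load_kw):
    """Power spent on everything except the IT equipment."""
    return (pue - 1.0) * it_load_kw


def overhead_table(steps, it_load_kw):
    """Return rows of (label, PUE, overhead in kW, saving vs. first step in kW)."""
    baseline = overhead_kw(steps[0][1], it_load_kw)
    rows = []
    for label, pue in steps:
        ovh = overhead_kw(pue, it_load_kw)
        rows.append((label, pue, ovh, baseline - ovh))
    return rows


for label, pue, ovh, saved in overhead_table(PUE_STEPS, IT_LOAD_KW):
    print(f"{label:38s} PUE {pue:4.2f}  overhead {ovh:6.1f} kW  saved {saved:6.1f} kW")
```

Note that this simple view ignores the effect mentioned above: higher inlet temperatures can raise the IT power itself through faster fans and higher leakage currents, so a lower PUE does not automatically mean lower total consumption.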
Components that do not make use of liquid cooling at all (e.g., network switches) are easy to integrate into the solutions described above. However, special care has to be taken when components make use of both direct liquid cooling and air cooling. Experiments have shown that the ambient temperature has a direct impact on the amount of heat that can be removed through the directly liquid-cooled circuits. While the set point for the inlet air temperature has to be chosen low enough to ensure proper cooling of all air-cooled components, too low an inlet temperature can draw large amounts of heat away from the parts carrying the cooling water for the directly liquid-cooled components, hampering efficient energy reuse. To avoid this effect, system integrators should separate liquid- and air-cooled components as much as possible (e.g., no mixing of interconnect network switches and compute nodes, separation of power supply units from compute nodes) and provide proper insulation.
2.3.4 Immersion Cooling
In water-cooled systems the coolant never comes into contact with the fluid-sensitive electronics. Instead, the heat is transferred from the components to the coolant by heat bridges. Such bridges should have good thermal conductance properties. For example, solids made of copper or aluminium are often used to conduct as much heat away from the chips as possible. Two examples of such systems are discussed in later sections.
A different approach to direct liquid cooling is to expose the chips (and also PCBs) directly to the coolant. The heat is transferred from the chips via the liquid to an outer containment, from which it has to be removed by other cooling mechanisms, e.g., air cooling or water cooling. This kind of cooling is called “immersion cooling”.
Water as a coolant has desirable thermo-physical properties, but its chemical and especially electrical properties render it unsuitable for immersion cooling. Typically, fluorocarbon fluids are used because of their excellent chemical and dielectric properties¹. The effectiveness of immersion cooling strongly depends on the characteristics of the cooling mechanism. Heat fluxes of up to 100 W/cm² are reported. Roughly speaking, three basic types of mechanisms can be identified. Each successive type provides greater cooling capability but also requires greater engineering effort:
Natural convection: The heat transfer process is driven by the fluid motion induced by the differences in the local density due to temperature gradients. No external forces, e.g., no pump system to drive the fluid motion, are required to transfer the heat from the chips to the containment. For low power-density systems the outside air convection may be enough to cool the containment.
Forced convection: The fluid is circulated over the chips by some external forces, e.g., a pump system. Typically, the heat is removed from the fluid circuit by heat exchangers. This mechanism allows for heat removal even in high power-density systems in confined space, such as supercomputers.
Boiling: The liquid-to-vapour phase transition is used to remove the heat from the chips. Above the boiling point of the fluid, vapour bubbles form at the heated surfaces, transferring the heat to the containment. Although this mechanism can be used for cooling of very high power-density chips, it is also the most complex process to control and requires sophisticated engineering skills.
One of the most prominent examples for immersion-cooled computer architectures is the Cray-2 supercomputer designed in 1985 (see also [14]). The largest installation officially supported by Cray contains more than 200,000 silicon chips densely packed in a footprint of only a few square feet with a total power consumption of up to 195 kW [15]. The dense packaging was achieved by 3-dimensional stacking of PCB modules populated with silicon chips. The cooling problem for the very high power density resulting from this stacked design was tackled by an immersion cooling system using fluorocarbon fluid in combination with forced convection and chilled-water heat exchangers. The design was sophisticated enough to provide the cooling power for logic devices, memories, and also power supplies within the same fluid circuit. After the success of the Cray-2 design the Cray company continued to use immersion cooling also for other supercomputer architectures within their product line.
However, large-scale systems based on immersion cooling are rather exotic, and only a few companies specializing in immersion cooling for the server sector offer off-the-shelf solutions. For example, Green Revolution Cooling [16] offers a complete cooling solution aimed at data centre servers. The package includes an immersion-cooled rack that serves as a consolidated containment for all server boards. Heat is removed from the rack by forced convection of the liquid in combination with a heat exchanger. The companies Iceotope [17] and Hardcore Computer [18] offer different solutions, in which each server board is encapsulated in its own containment. The solution offered by Iceotope uses natural convection to transfer the heat from the board to the containment, where it has to be removed by an external water circuit. Hardcore Computer follows a different strategy, using forced convection to remove the heat from the entire system within a single fluorocarbon cooling circuit connected to a heat exchanger.
We conclude that immersion cooling has some advantages over other kinds of cooling technology, especially the applicability in high power-density systems, the high packaging densities that can be achieved, and also the flexibility coming from the diversity of immersion-cooling mechanisms. However, if direct water cooling is sufficient to cool the system, the engineering effort for an immersion-cooling solution is likely to be more cost intensive and time consuming without additional benefits. Another drawback of immersion cooling is that due to the direct contact with the coolant, all components need to be certified for being compatible with the liquid in use. This can be a major problem for system integrators when negotiating warranty agreements with the component suppliers. Vendors need to work on certification of their IT components for these novel environments to avoid warranty issues.
2.4 Energy Reuse
It would be advantageous if some of the energy spent on the computing equipment could be reused, i.e., ERF > 0, possibly resulting in ERE < 1. Here we again have to differentiate between air-cooled and liquid-cooled systems.
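The relation between these metrics can be illustrated with a short sketch. It assumes the usual Green Grid definitions (PUE as total facility energy over IT energy, ERF as reused energy over total facility energy, and ERE as total minus reused energy over IT energy), which imply ERE = (1 - ERF) x PUE. The energy values are hypothetical and serve only to show how ERE can drop below 1.

```python
def pue(total_kwh, it_kwh):
    """Power Usage Effectiveness: total facility energy per unit of IT energy."""
    return total_kwh / it_kwh


def erf(reused_kwh, total_kwh):
    """Energy Reuse Factor: fraction of the total facility energy that is reused."""
    return reused_kwh / total_kwh


def ere(total_kwh, reused_kwh, it_kwh):
    """Energy Reuse Effectiveness: non-reused facility energy per unit of IT energy."""
    return (total_kwh - reused_kwh) / it_kwh


# Hypothetical annual figures, not measurements from any DEEP site.
IT_KWH = 1000.0
TOTAL_KWH = 1200.0
REUSED_KWH = 400.0

print(f"PUE = {pue(TOTAL_KWH, IT_KWH):.2f}")
print(f"ERF = {erf(REUSED_KWH, TOTAL_KWH):.2f}")
print(f"ERE = {ere(TOTAL_KWH, REUSED_KWH, IT_KWH):.2f}")
```

With these numbers a third of the facility energy is reused, so ERE falls below 1 although the PUE, by definition, cannot.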
The warm air generated by air-cooled systems is typically not warm enough to be reused on a larger scale. One possible exception is in winter, where an air-air heat exchanger could be used to warm up the air in a forced-air heating system for offices or laboratories. Whether the corresponding savings can offset the necessary infrastructure expenses is far from obvious and needs to be evaluated on a case-by-case basis.
There are more possibilities for energy reuse from liquid-cooled systems. One possibility is heating in winter. Underfloor heating systems, which typically do not require very high water temperatures, as well as forced-air heating systems could be driven by the coolant of the computer if the return temperature is in the range of 30 to 40°C. Heating systems based on radiators require much higher temperatures, so the cooling system should support return temperatures of at least 65°C (that this is possible has been demonstrated in the iDataCool project, see below). If this can be achieved, there is yet another possibility for energy reuse: the generation of chilled water using adsorption chillers. There are now adsorption chillers on the market (e.g., by InvenSor, SorTech or others) that operate efficiently at hot-water inlet temperatures as low as about 65°C. This use case is particularly interesting in summer, when heating is generally not needed and demand for chilled water peaks.
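The temperature thresholds above can be summarized in a small decision sketch. The function name, the hard cut-off values, and the boolean season flag are simplifying assumptions made for illustration; real systems have efficiency curves rather than sharp limits.

```python
# Thresholds taken from the text; treating them as hard limits is a simplification.
LOW_TEMP_HEATING_MIN_C = 30.0  # lower end of the 30 to 40 degC range
HIGH_TEMP_MIN_C = 65.0         # radiators and adsorption chillers


def reuse_options(return_temp_c, heating_season):
    """List the reuse options that a given coolant return temperature could drive."""
    options = []
    if heating_season and return_temp_c >= LOW_TEMP_HEATING_MIN_C:
        options.append("underfloor or forced-air heating")
    if heating_season and return_temp_c >= HIGH_TEMP_MIN_C:
        options.append("radiator heating")
    if return_temp_c >= HIGH_TEMP_MIN_C:
        options.append("chilled water via adsorption chiller")
    return options


for temp_c, winter in [(35.0, True), (35.0, False), (65.0, True), (65.0, False)]:
    season = "winter" if winter else "summer"
    print(f"{temp_c:4.1f} degC, {season}: {reuse_options(temp_c, winter)}")
```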
Two issues in this respect should be considered carefully. First, the costs for the additional infrastructure need to be balanced against the savings that can be obtained from energy reuse. Second, at high water temperatures very good thermal insulation of the computing equipment and the coolant pipes is necessary to prevent the heat from escaping into the air of the computing centre (from which it would have to be removed by an air-conditioning system at additional expense).
Ideally, the waste heat from the computers could be used to generate electricity, but there does not seem to be any technology capable of doing so at reasonable efficiency given the maximum possible coolant temperatures (e.g., steam turbines need boiling water).
2.5 Monitoring & Control
Ordinary HPC centres implement monitoring infrastructures at various levels of the HPC infrastructure stack. Yet, these monitoring activities are typically not centrally managed and operate independently of each other. The reason is that the goal of each monitoring infrastructure is to detect system failures and to react appropriately within its own subsystem. When trying to optimize an HPC system for energy efficiency, this separation of monitoring systems turns out to be insufficient. For example, in typical data centres, the HPC system operator might have a good overview of the air-conditioning system, but the potential impact on the system power consumption when changing the temperature set point of the HVAC system cannot easily be seen. Many such interactions between systems at different layers of the HPC infrastructure stack exist, and they can only be made visible by integrating the existing monitoring solutions into a single holistic view.
Such a holistic view of all parameters influencing power efficiency should include:
Power consumption data
Environmental data (temperature, humidity)
System infrastructure data (fan speeds, flow rates, etc.)
Application runtime performance data
All data need to be sampled at given intervals and stored into a central database for keeping a record history that allows for a thorough analysis after system failures. Also, a history of all the necessary data allows one to quickly pinpoint the cause of operational inefficiencies, even if no proper change management policies are established.
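A central record of this kind could be sketched as follows, here using an SQLite database from the Python standard library as a stand-in for the central database. The table layout, the metric names, and the sample values are hypothetical and not part of any existing DEEP monitoring solution.

```python
import sqlite3
import time


def open_store(path=":memory:"):
    """Open (or create) the central sample store with one row per measurement."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS samples ("
        " ts REAL NOT NULL,"      # sample time in seconds
        " source TEXT NOT NULL,"  # node, rack, or infrastructure component
        " metric TEXT NOT NULL,"  # e.g. power, temperature, fan speed
        " value REAL NOT NULL)"
    )
    conn.execute("CREATE INDEX IF NOT EXISTS idx_samples ON samples (metric, ts)")
    return conn


def record(conn, source, metric, value, ts=None):
    """Store one sample; the current time is used if no timestamp is given."""
    when = time.time() if ts is None else ts
    conn.execute("INSERT INTO samples VALUES (?, ?, ?, ?)",
                 (when, source, metric, float(value)))
    conn.commit()


def history(conn, metric, since_ts, until_ts):
    """Return (ts, source, value) rows of one metric within a time window."""
    cur = conn.execute(
        "SELECT ts, source, value FROM samples"
        " WHERE metric = ? AND ts BETWEEN ? AND ? ORDER BY ts",
        (metric, since_ts, until_ts),
    )
    return cur.fetchall()


# Hypothetical samples from different layers of the infrastructure stack.
conn = open_store()
record(conn, "node001", "water_inlet_temp_c", 40.0, ts=100.0)
record(conn, "node001", "node_power_w", 232.0, ts=100.0)
record(conn, "hvac", "air_setpoint_c", 27.0, ts=160.0)
record(conn, "node001", "node_power_w", 238.0, ts=220.0)
print(history(conn, "node_power_w", 0.0, 300.0))
```

Keeping all layers in one table with a common timestamp makes it straightforward to correlate, for instance, a set-point change in the HVAC system with a later change in node power consumption.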
Insight gained through extensive monitoring has to be accompanied by a wide range of control capabilities. Possible control knobs are:
Inlet temperatures (water or air)
Fan speeds and flow rates
IT workload scheduling
IT system parameters (e.g., processor frequency, sleep modes)
Being able to control these parameters of the infrastructure and the machine can help to adjust the total power consumption of the site, a feature that will make future HPC centres key players in upcoming smart grid infrastructures. This topic is beyond the scope of the DEEP project, but other research projects such as All4Green² have started conducting research in this field.
3 Existing Experimentation Platforms
3.1 iDataCool
Figure 3: iDataCool experimentation cluster at the University of Regensburg
The iDataCool installation at the University of Regensburg is a modified version of the commercially available IBM System x iDataPlex large-scale compute cluster. iDataCool consists of three iDataPlex racks which are used by the particle physics group for simulations of relativistic quantum field theory. In collaboration with the IBM Research and Development Lab Böblingen, Germany, the iDataPlex system has been modified to allow for direct water cooling and now serves as a testing platform for prospective hardware developments in high-performance computing with emphasis on cooling efficiency and energy reuse. The first partial rack was installed in February 2011, and an upgrade to the full 3-rack installation was performed in September 2011.
3.1.1 Processing Hardware
Each iDataCool rack contains 72 nodes. Each node is a distributed shared memory dual-server board equipped with either Intel Xeon E5630 or Intel Xeon E5645 Westmere server processors. These processors provide four or six physical cores with additional support for HyperThreading. Per node 24 GB of shared DDR3 memory are available. The fast main interconnect network is realized through InfiniBand QDR, arranged as a hybrid ring/tree network. Gigabit Ethernet is used for disk I/O, system booting via NFS, and job scheduling. Furthermore, in the recent setup of the dual-server boards, the baseboard management controller (BMC), which is a part of the Intelligent Platform Management Interface, shares one Ethernet port with the processors.
3.1.2 Water Cooling Solution
Figure 4: iDataCool compute node before and after modification
The original iDataPlex system is entirely air-cooled, with all waste heat transferred to the data centre through perforated front doors. A quad-fan block attached to the backside of the node chassis generates the airflow required for cooling the nodes. The chassis holds up to two dual-server boards and one power supply. In collaboration with IBM a direct water-cooling concept has been developed which allows us to remove the heat from each node's temperature-critical components. The direct cooling includes the server processors, memories, InfiniBand daughter card, southbridge, voltage regulators, and several other chips. Other components, e.g., power supplies and switches, remain air-cooled. The original fans and heat spreaders are replaced by a copper pipe providing the water flow necessary to remove the heat from the node. The copper pipe is connected to the critical components on the server board by heat bridges made of copper (for processors and memories) or aluminium (for all other components). To maximize the amount of heat transferred from the components to the water circuit, the heat flow to the surrounding environment, e.g., the airflow of the power supply located next to the server boards, is minimized using Armaflex thermal insulation.
The cooling solution for the iDataCool cluster is optimized for a) a very small temperature difference between water and processor cores and b) low cost. All newly developed parts of the cooling system were manufactured by the university's machine shop. The conversion of the iDataPlex racks from air to water cooling was also done at the University of Regensburg. Only a minor modification of the original server board chassis was necessary to allow for the connection of the copper pipes to the Tichelmann water distribution system attached to the backside of each rack. The Tichelmann system is a special form of pipe installation in which all elements connected to the system in parallel are exposed to the same pressure loss and thereby the water flow rates balance themselves automatically (see [19] for an illustration). The parallel connection of the nodes' water pipes to the water distribution system by inexpensive standard water connectors is cost-effective at reasonable maintenance effort. To avoid the installation of additional sensors required for monitoring of the water circuit the temperature sensors coming with the original server boards are reused. The temperatures of the water inlet and outlet of each individual node are monitored by the BMC and accessible via IPMI without the need for an update of the existing firmware.
3.1.3 Future Perspectives
In the final stage of extension all three iDataCool racks will be used to operate an adsorption chiller and reuse the waste heat to support the cooling of other computer systems, e.g., a GPU server rack. Preliminary tests have shown that a water inlet temperature on the order of 60°C is feasible to safely operate the compute nodes under full load. However, it is not yet clear how permanent operation of the cluster at high temperatures affects the lifetime of the system.
3.2 CooLMUC
Figure 5: CooLMUC experimentation cluster at Leibniz Supercomputing Centre
The CooLMUC experimentation cluster at LRZ was built by MEGWARE Computer in collaboration with Kälte Klima Umwelt. It was acquired using funding from the PRACE 1IP FP7 project to assess the benefits of direct warm-water cooling and waste-heat reuse through adsorption refrigeration. It was first put into service in July 2011 and moved to its current location in the new building of LRZ to be hooked up to the new warm-water cooling loop in November 2011.
3.2.1 Processing Hardware
The CooLMUC cluster at LRZ consists of 178 nodes. A single node contains two AMD Opteron 6128HE CPUs (Magny-Cours) with 8 cores each and 12 MB L3 cache. In their standard setting, the CPUs run at a clock frequency of 2 GHz. Each node is equipped with 16 GB RAM arranged in eight 2 GB DDR3 modules.
The main interconnect network is realized through InfiniBand QDR using a fat tree topology. In addition, each node has two Gbit Ethernet ports for IPMI and a service network, which is used to boot the diskless nodes and to provide the root filesystem over NFS. A network bridge component connects the InfiniBand network to the upstream Ethernet services at LRZ.
A central appliance server provides cluster management functions such as temperature/power monitoring and remote power control, and also acts as the NFS server providing the OS images to the nodes.
3.2.2 Cluster Arrangement and Cooling
Figure 6: CooLMUC compute node with copper pipes for direct liquid cooling
The cluster is arranged in 5 racks. The compute hardware is contained in three racks while the cooling components are contained in the other two. Next to the five racks, a SorTech ACS-08 is used to turn the hot water emitted from the cluster into cold water. The cold water is then used to cool the rear-door heat exchanger of a sixth, otherwise unrelated rack.
To make the cooling system independent of LRZ's CRAC infrastructure, the racks' doors were made solid and two independent cooling loops are used to cool the compute equipment. One loop provides water at 40°C directly to the nodes, where the flow runs through copper pipes connecting special heat sinks on top of CPUs, chipset, and InfiniBand HCAs. This technology completely eliminates the necessity to use air as a heat transfer medium for the corresponding components. Yet, some components remain that rely on air cooling. Examples of such components are the power supply units contained in each compute node as well as InfiniBand switches or the power distribution units in the racks. To provide cooling to those components as well, a second cooling loop exists. It is based on standard compressor-based cooling technology with special 19" in-rack evaporators that generate airflow from the rear part of the racks to the front while at the same time re-cooling the air to the set temperature of 30°C. In order to use the heat collected from both cooling circuits to drive the adsorption chiller, the condenser of the second cooling circuit is cooled with water originating from the first cooling loop's outlet. This way, water inlet temperatures to the servers of 40°C allow supply temperatures of 60°C to the adsorption chiller.
3.3 Current Results on the Experimentation Platforms
In this section the results of the measurements taken on the local computer installations are discussed. Table 2 lists the available quantities for the two systems.
Sensor                                          iDataCool        CooLMUC
Water inlet temperature (Tin)                   Yes              Yes
Water outlet temperature (Tout)                 Yes              Yes
CPU temperature (Tcpu)                          Yes (Core)       Yes (Package)
Power consumption per node (Pnode)              Yes (from PSU)   Yes (from PDU)
Power consumption of the entire system (Psys)   Yes (from DC)    Yes (sum of nodes + cooling + network)
Table 2: Available Measurement Quantities on iDataCool and CooLMUC
The test strategy for the systems is to vary the water inlet temperature while monitoring all other quantities described above. While in theory such a test should be run under full load of all system components, such a setup is difficult to realize in practice. Typically, the load on the server processors is reduced during main memory operations or communication via network devices. Since the CPUs are the dominating contributors to heat generation, a system test should stress mainly the processors’ pipeline systems as well as the cache systems.
The pre-compiled 64-bit mprime v2.26 benchmark was chosen to run the temperature tests. This benchmark is freely available from the author’s website (see [20]). The program is flexible in terms of floating-point computation and main memory access pattern, performing Fast Fourier Transforms in a repeated fashion to search for Mersenne prime numbers. Due to its heavy load on the system it has also become one of the favoured system stability tests in the community of private computer users (see [21] for more details). The “torture test” is used with custom settings to stress the system. The custom setup mainly stresses the server processors, while main memory access is mild and no communication via network devices is performed.
The quantities stated above were measured for water inlet temperatures Tin ranging from 34.5°C to 63.5°C on a single iDataCool rack with 56 processors under load from the mprime benchmark. On CooLMUC the entire system was tested for Tin ranging from 27°C to 50°C using the same benchmark.
Figure 7 shows the temperature Tcpu as a function of the water inlet temperature Tin. On the AMD processors in the CooLMUC system only the CPU package temperature can be obtained. In contrast, the monitoring subsystem of the Intel CPUs in the iDataCool system can provide individual core temperatures. Shown here is the average individual node CPU temperature³. On both machines, a roughly linear increase of the core temperatures with the inlet temperature Tin was observed. The spread of the CPU temperatures is caused by multiple factors. In both systems, a single pipe system provides the heat transfer from two (iDataCool) or four (CooLMUC) CPU sockets. Consequently, CPUs that are cooled later in the chain are provided with slightly warmer water, causing a higher CPU temperature. Another effect on the spread is the quality of the heat transfer from the CPUs to the water pipes: a few nodes show exceptional behaviour, i.e., the core temperatures are very high even at low water temperatures. At high water temperatures automatic CPU throttling prevents severe damage to these chips⁴. Clearly, those exceptional nodes have to be investigated and the water pipe system has to be re-assembled at the board level. Finally, the two different types of server processors used in the iDataCool system cause a wider spread of temperatures compared to CooLMUC.
³ The average node core temperature is the mean of the temperature values delivered by the temperature sensors embedded into each of the processor’s cores.
Figure 7: CPU temperatures in relation to water inlet temperature
Assuming proper assembly of the cooling loops, the water inlet temperature of the iDataCool system could safely be increased even further, up to about 70°C, without down-clocking of individual processors degrading the system performance. An important observation is that for the iDataCool system the position of the server boards within the rack has no visible influence on the temperature, confirming that the Tichelmann system is an excellent choice for water distribution at the rack level. On CooLMUC the upper temperature limit is determined by the in-rack air-cooling equipment. Since this air-cooling loop further increases the temperature of the water returning from the nodes, lower water inlet temperatures are still sufficient to drive the adsorption chiller.
For energy reuse, i.e., in this particular case the operation of an adsorption chiller, the water outlet temperature should be as high as possible. The water outlet temperature scales approximately linearly with the water inlet temperature, with a temperature difference between outlet and inlet, Tdelta = Tout - Tin, of at most 6.4°C. However, as shown in Figure 8, if the water inlet temperature is increased, the temperature difference drops below 4.5°C. It is well known that at high water temperatures the amount of heat transferred to the water circuitry drops, since the heat lost from the cooling circuitry by convection depends linearly on the temperature difference between water and surrounding air. The amount of heat that is not transferred to the coolant has to be removed by the data centre’s air-cooling mechanism and thus is not available for energy reuse.
[Figure 7 plot: CPU temperature [°C] versus water inlet temperature [°C]; series: iDataCool (Core), CooLMUC (Package)]
Figure 8: Temperature difference Tdelta depending on inlet temperature
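To illustrate what a change in Tdelta means for the heat available for reuse, the following sketch applies the standard relation Q = m_dot x c_p x Tdelta for the heat carried away by a water flow. The flow rate per node is a hypothetical value chosen for illustration, since no flow rates are stated here, and the water properties are rounded approximations.

```python
WATER_CP_J_PER_KG_K = 4186.0  # approximate specific heat capacity of liquid water
WATER_DENSITY_KG_PER_L = 1.0  # approximation; slightly lower for warm water


def heat_to_water_w(flow_l_per_min, t_delta_k):
    """Heat carried away by the water in watts: Q = m_dot * c_p * Tdelta."""
    mass_flow_kg_per_s = flow_l_per_min * WATER_DENSITY_KG_PER_L / 60.0
    return mass_flow_kg_per_s * WATER_CP_J_PER_KG_K * t_delta_k


FLOW_L_PER_MIN = 0.5  # hypothetical per-node flow rate, not a measured value

q_large_delta = heat_to_water_w(FLOW_L_PER_MIN, 6.4)  # largest Tdelta reported
q_small_delta = heat_to_water_w(FLOW_L_PER_MIN, 4.5)  # Tdelta at high inlet temperature

print(f"Tdelta = 6.4 K: {q_large_delta:.1f} W into the water")
print(f"Tdelta = 4.5 K: {q_small_delta:.1f} W into the water")
```

At a constant flow rate the heat delivered to the water scales linearly with Tdelta, so the drop from 6.4°C to 4.5°C corresponds to roughly 30 per cent less heat being captured by the coolant.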
In a last experiment, air-cooled compute nodes containing hardware similar to the water-cooled nodes of iDataCool and CooLMUC were tested under the same load. Figure 9 compares the air-cooled node, which was cooled at 23°C, with the water-cooled node with respect to Tcpu and Pnode at different water inlet temperatures. In this experiment, the advantage of direct liquid cooling becomes clearly visible: not only is the power consumption of the air-cooled nodes higher due to the need for additional chassis fans, but the CPU temperature of the liquid-cooled nodes also stays below that of the air-cooled node, even at high inlet temperatures.
Finally, Figure 9 also shows an effect that has to be considered for energy reuse: the dependence of the nodes’ power consumption on the water temperature. Leakage currents in semiconductor devices grow with the operating temperature, approximately exponentially, and with them the power required to operate the logic circuitry. For the iDataCool system this effect is clearly visible, but of minor concern. Even at a water inlet temperature of 63.5°C the power consumption increases by only about 5 per cent. On CooLMUC, a similar increase of power consumption at higher operating temperatures can be observed. Whether this overhead is justified by the increasing ability to reuse waste heat at higher temperatures will be subject to future analysis.
Yet, with the preliminary system tests it can safely be assumed that the iDataCool installation is well suited for optimization towards energy reuse once the final stage of extension has been reached.
[Figure 8 plot: temperature difference [°C] versus inlet temperature [°C]; series: iDataCool, CooLMUC]
Figure 9: Comparison of node power consumption and CPU temperature of air-cooled nodes and direct liquid cooled nodes at different inlet temperatures
[Figure 9 chart: node power consumption [W] and CPU temperature [°C] for the categories Air, 23°C; Water, 30°C; Water, 40°C; Water, 50°C; Water, 60°C. Series: CooLMUC Node Power, iDataCool Node Power, CooLMUC CPU Package Temp, iDataCool CPU Core Temp. Data labels: 290W, 248W, 254W, 260W, 256W, 229W, 232W, 238W; 51°C, 37°C, 48°C, 56°C, 82°C, 62°C, 72°C, 82°C]
4 Recommendations for HPC Centre Infrastructures
Based on the information contained in this document and the additional results from the experimentation platforms, a set of recommendations for energy-efficient operation of the data centre infrastructure surrounding the DEEP System and the DEEP system integration can be given. Since these recommendations are of a generic nature, they can also be of help for anyone designing novel energy-efficient HPC centre infrastructures or enhancing existing data centres.
1. Careful choice of the data centre location
If possible, the location of the data centre operating a DEEP System should be chosen in order to maximize the use of available natural resources such as water for electricity generation or cooling. A dry cold climate provides the best potential for free cooling year-round.
2. Use of direct liquid cooling
The DEEP System should use direct liquid cooling, and the operating sites should provide the necessary site infrastructure. The experiments conducted clearly show the benefits of direct liquid cooling. The superior thermal properties of liquids over air allow for free cooling year-round even in regions with a hot climate. Furthermore, directly liquid-cooled systems can be driven at inlet temperatures high enough to allow for energy reuse. Finally, directly liquid-cooled IT equipment does not require fans so that substantial amounts of power can be saved. Further adoption of this technology will help to establish standards for direct liquid cooling and lead to lower hardware costs.
3. Use of free cooling - at the lowest temperature possible
Coolant provided to the DEEP System should be recooled using free cooling. Free cooling avoids the use of power-hungry compressors and is therefore a key aspect of energy-efficient data centres. Supplying the coolant to the DEEP System at the lowest temperature that still permits free cooling will help to save on semiconductor power consumption, since leakage currents are reduced.
4. Reuse of waste heat depending on the local climate
Supercomputing centres should investigate their options for reusing energy from the waste heat emitted by the DEEP System. In a mild local climate, free cooling is possible with moderate coolant temperatures. In this case, the waste heat may drive low-temperature heating systems such as underfloor heating. In regions with a hot climate, free cooling requires higher coolant temperatures. A further temperature raise in the cooling circuits - at the cost of a slight increase in power consumption due to higher leakage currents - could then enable chilled-water generation by adsorption chillers. One could also switch between these two approaches depending on the season.
5. Thermal insulation of the components
The DEEP system integration should ensure proper insulation of warm-water-cooled components from ambient air. As seen in the experiments, large temperature differences between the liquid coolant and ambient air significantly decrease the cooling efficiency. This can be overcome by better thermal insulation of the IT components from ambient air.
Within the DEEP project, the recommendations above are used by WP6 “System Integration and Installation” and WP3 “System Hardware”. Ongoing interactions between the work packages will ensure that the recommendations will be honoured to the greatest extent possible.
References and Applicable Documents
[1] M. Chuba, “Data Center Executives Must Address Many Issues in 2012,” Gartner, ID: G00229650, 2012.
[2] IDC, “Worldwide Server Research,” 2009.
[3] DEEP Consortium, “DEEP Project Website,” 2012. [Online]. Available: http://www.deep-project.eu.
[4] “The Green Grid,” [Online]. Available: http://www.thegreengrid.org/.
[5] Google, “Hamina Data Center,” [Online]. Available: http://www.google.com/about/datacenters/locations/hamina/.
[6] N. Rasmussen, “AC vs. DC Power Distribution for Data Centers,” Schneider Electric, 2011.
[7] Wikipedia, “Vapor-compression refrigeration,” [Online]. Available: http://en.wikipedia.org/wiki/Vapor-compression_refrigeration.
[8] Wikipedia, “Cooling Tower,” [Online]. Available: http://en.wikipedia.org/wiki/Cooling_tower.
[9] Wikipedia, “Wet-bulb temperature,” [Online]. Available: http://en.wikipedia.org/wiki/Wet-bulb_temperature.
[10] The Engineering ToolBox, “Dry Bulb, Wet Bulb and Dew Point Temperature,” [Online]. Available: http://www.engineeringtoolbox.com/dry-wet-bulb-dew-point-air-d_682.html.
[11] Idalex, “How It Works: The Maisotsenko Cycle - Basic,” [Online]. Available: http://www.idalex.com/technology/how_it_works.htm.
[12] DigiTemp, “Wet Bulb Humidity Calculations,” [Online]. Available: http://www.digitemp.com/wetbulb.shtml.
[13] DataCenter 2020, “First results for energy-optimization at existing data centers,” 2011.
[14] Wikipedia, “Cray 2,” [Online]. Available: http://en.wikipedia.org/wiki/Cray-2.
[15] Cray Research Inc., “The Cray-2 Series of Computer Systems,” 1988.
[16] Green Revolution Cooling, “Company Website,” [Online]. Available: http://www.grcooling.com.
[17] Iceotope, “Company Website,” [Online]. Available: http://www.iceotope.com.
[18] Hardcore Computer, “Liquid Submerged Server (LSS 200) Technology Video,” [Online]. Available: http://www.hardcorecomputer.com/liquid-blade-video/index.html.
[19] Grundfos, “Tichelmann System,” [Online]. Available: http://cbs.grundfos.com/CBS_Master/lexica/HEA_Tichelmann_system.html.
[20] Great Internet Mersenne Prime Search, “Prime95 Benchmark,” [Online]. Available: http://www.mersenne.org/.
[21] Wikipedia, “Prime95,” [Online]. Available: http://en.wikipedia.org/wiki/Prime95.
[22] CoolingZone, “Direct liquid immersion cooling for high power density microelectronics,” 2005. [Online]. Available: http://www.coolingzone.com/library.php?read=408.