Why we Need Benchmarks - Analytical Query Processing Using Heterogeneous SIMD Instruction Sets

4.3 Work-Energy-Profiles for Vectorized Query Processing

4.4 Balancing Performance and Energy for Vectorized Query Processing

4.5 Related Work 4.6 Summary

Operators Template Vector Library

L/S Class Arithmetic Class Create Class Manipulate Class Comparison Class Boolean Logic Class Extract Class Development Implementation Compile time Optimization Performance-Energy-Models Runtime Benchmarks Query Consists of

Figure 4.1: In the last chapter we introduced the Template Vector Library (TVL). In this chapter we will use the TVL as a basis to optimize queries.

In the last chapter, we introduced the Template Vector Library (TVL), which enables an easy selection of the SIMD instruction set to achieve optimal performance. As we have shown in Section 2.3.1, there is a significant performance optimization potential when the instruction set is chosen per operator, and not per query. However, as we have also shown, this choice is not trivial and different queries benefit from a different choice. Moreover, in Section 2.3.2 we explain that the use of SIMD puts tighter limits to other optimization knobs, e.g. frequency scaling. This also affects the energy-efficiency, which has become another important optimization goal. For instance, the achievable CPU frequency, which is a main factor for the energy consumption of CPUs, is influenced by the chosen instruction set and the number of active CPU cores. Hence, performance and energy-efficiency are related optimization goals influencing each other. To optimize for both of these two goals, a mapping between performance and energy-efficiency is necessary. In this chapter, we propose a model, called Work-Energy-Profiles (WEPs), which shows this mapping. We create such WEPs for the primitives of the classes in our TVL and some frequently used primitive combinations. Then, we use these WEPs to create models for more complex operators, which were implemented with the TVL. These models can then be used to select the optimal instruction set and cpu frequency on an operator level. Figure 4.1 shows a rough outline of this approach. The upper part of the figure, i.e. the TVL and the implementation using it, was covered in the last chapter. The lower part, i.e. the optimization using WEPs, is the subject of this chapter1. In particular, we the following topics are discussed:

1. We show why benchmarks are necessary to create reliable models in Section 4.1, i.e. we explain why a completely analytic model is not feasible.

2. In Section 4.2, we explain our Work-Energy-Profiles. This includes the general idea, the characteristics of our test systems, and example profiles for these test systems. 3. Since benchmarking is time-consuming, we present a method for creating WEPs for

all applications, which rely on the TVL, by combining the WEPs of the primitives (Section 4.3).

4. We show how to apply WEPs in order to optimize a workload or a query in Sec- tion 4.4.

5. Finally, in Section 4.5 we focus on previous approaches of optimization for energy- efficiency in a discussion of related work.

Parts of the material in this chapter have been developed jointly with Alexander Krause, Patrick Damme, Johannes Pietrzyk, Thomas Kissinger, Willi-Wolfram Mentzel, Dirk Habich, and Wolfgang Lehner. The chapter is based on [UKHL16, UKM+

16, UDP+

17]. The copyright of [UKHL16] and [UDP+

17] is held by Springer International Publishing AG; the original publications are available athttps://doi.org/10.1007/978-3- 319-54334-5_10 andhttps://doi.org/10.1007/978-3-319-67162-8_5. The copyright of [UKM+

16] is held by the authors; the original publication is available athttps://doi.org/10.1145/2882903.2899390.

V_DD R C t ( ଵ 2Ǒf) V_C V_DD V_th Subthreshold region Ohmic region Saturation region V_ov

Figure 4.2: Circuit for the input capacitance of a MOSFET including the lead resistance (left) and the characteristic curve (right) of the capacitance. Additionally, the threshold and overdrive voltages of the MOSFET are illustrated. In the subthreshold region, there is no current at the drain, the transistor is switched off. In the ohmic region, there is a current passing the capacitor, but gate switching is not reliable. Transistors in CPUs work in the saturation region. Note that real characteristic curves are steeper, and the subthreshold region is narrower and the threshold and overdrive voltages are usually not shown in these diagrams. This figure aims to explain the sheer concept.

4.1 WHY WE NEED BENCHMARKS

The exact performance and energy consumption of a CPU are not trivially predictable, and they are not computable in an acceptable amount of time. To understand the reasons for this claim, a short excursion to the characteristics of transistors is use- ful. There are different kinds of transistors. In modern CPUs, mostly MOSFETs (metal–oxide–semiconductor field-effect transistors) are used. The transistors consist at least of a source, a drain, a gate between the source and the drain, a body, and an isolator separating the gate from the body. The name MOSFET was kept, even when the gate material changed from metal to silicon. Simply put, a transistor works like a resistor, which is controlled by voltage. By changing the voltage between the gate and the source, the resistance between the drain and the source is changed. This causes a change in the current. A high and a low current level equal a 1 or a 0. In the equivalent circuit diagram2_of a transistor for the frequencies relevant in CPUs, this is modeled by a resistance and three capacitances. For the sake of simplicity, we show only the resistance and a single capac- itance in Figure 4.2. The supply voltage is denoted as VDD. The diagram in Figure 4.2

shows the characteristic curve of the capacitance, i.e. the voltage as a function of the time. Only if the voltage between the gate and the source surpasses a threshold VT H, there is a

channel between the source and the drain, i.e. there is a current. If the voltage increases, the current also increases. The region below this threshold is called the subthreshold region, in which the transistor is switched off. The region above this threshold is called the linear region or ohmic region. In this region, the drain current grows approximately proportional to the drain-source-voltage. This can be used to amplify incoming signals. If the voltage passes a second limit, the overdrive voltage VOV, the current is not changed significantly

anymore. Instead, the supply voltage can be used directly to change the current. There is a clear distinction between the current levels when the switch is open versus when the switch is closed. This region is called the saturation region or overdrive region. CPUs work

2_{An equivalent circuit diagrams do not show existing circuits, but the behavior of an electric component.}

For this reason, we are using the terms resistance and capacitance over resistor and capacitor.

t (_2Ǒfଵ ) VC VDD1 Vov ଵ Ǒ t1 t2 VDD2 ଵ Ǒ Ǖ Ǖ

=> Kondensator wird schneller aufgeladen

ଵ Ǒ ଵǑ t ( ଵ 2Ǒf) V_C V_DD V_ov Ǖ1=R2C2 Ǖ2=R1C1

Figure 4.3: There are two ways of changing the physically possible gate switching time: (1) The supply voltage is increased and (2) the resistance and/or capacitance is decreased. The latter are constants, which cannot be changed after the transistor has been built, because they depend on the hardware design.

in this saturation region, because transistors working in the linear region do not reliably map the supply voltage to distinct 0 or 1 values3_.

This means, the minimal length of a cycle (x-axis), and therefore the maximum switching frequency, is determined by the intersection of the characteristic curve of the capacitance with the overdrive voltage. To increase the maximum frequency, there are two different ways: (1) Increase the supply voltage, and (2) Change the shape of the curve. Both of these possibilities are pictured in Figure 4.3. The left side shows option (1). As shown, the curve becomes steeper when the supply voltage is increased from VDD1to VDD2. This

results in a smaller cycle length at t1instead of t2. Option (2) is shown on the right side. The shape of the curve depends on the resistance R and the capacitance C. If R or C changes, the curve changes, too.

However, both options have limits. Changing the capacitance or the resistance can only be done during hardware development, because these quantities follow from the mate- rials and the physical design of the transistors. The supply voltage can be changed dy- namically, but the switching power consumption depends quadratically on the supply voltage. Additionally, the energy conversion efficiency of a transistor is not 100%. There is leakage energy, which is emitted as heat. A high number of densely packed transistors working in the saturation region produce enough heat that the threshold and overdrive voltage are affected, which requires a higher increase of the supply voltage to increase the frequency. But again, this produces even more heat. Thus, increasing the supply voltage only makes sense to a certain degree, where this degree depends on the number and density of active transistors. This is a main reason why the maximum CPU frequency for vectorized processing is below the frequency for scalar processing, which uses fewer transistors per instruction, and why it decreases even further when using multiple cores. Even when the supply voltage stays within these boundaries, energy consumption is not trivially predictable. For instance, an increased voltage would lead to a higher energy consumption per cycle, if the cycle length does not change. But a shorter cycle length could amortize for the higher voltage. Jejurikar et al. showed this complex relationship

3_{This statement is backed by personal experience with the Tomahawk chip, where we were able to select}

the supply voltage manually.

exemplarily for the Transmeta Crusoe Processor [JPG04]. They show that with an increasing supply voltage, the static energy consumption and the leakage energy per cycle decrease, while the switching energy per cycle increases. If all of these parts are added to show the overall energy consumption per cycle, there is an optimum for energy consumption, which is far below the maximum frequency.

While this is an interesting finding, it is still not sufficient to estimate the energy consumption for a given application on a modern CPU. First, the Crusoe CPU was con- structed to be especially energy-efficient and simple. The compatibility with x86 CPUs is reached by emulating the instructions, not by implementing them in hardware. However, in most modern CPUs there are several instruction sets available, e.g. for vectorization, and the properties of the transistors are not something a user can find in a documenta- tion. To estimate their energy consumption it would be necessary to know how many and which transistors are used for each instruction, and the physical chip design. As we have shown, even a single transistor is already a quite complex element. A gate can be realized with different amounts of transistors, e.g. there are implementations of the XOR gate, which use between 4 and 16 transistors. An instruction is built from an arbitrarily high number of gates. With the introduction of more instruction sets on a CPU, the va- riety of gate combinations increases further. Nowadays, the way these gates are placed and connected on a chip is not even known by the developer. Chip design is done for the logical layout, while the physical layout is automatically computed. On top of this, all instructions, which use data from main memory or any shared cache, introduce additional potential stalling cycles, where the occurrence of stalling cycles depends not only on the CPU, but on the memory and its usage by any other system components.

In conclusion, there are several hardware layers built on each other: transistors are connected to create gates, which are connected to create instructions, which are scheduled depending on the software and the availability of shared resources. Each of these layers has its own properties, which are usually not known by the software developer or which can only be determined during runtime. If all of these variables are known, it is possible to retrieve reliable estimations from a simulation done with the automatically generated physical design of the CPU and all other system components. However, such simulations have a longer runtime than the application on the actual hardware, and the developer would have to provide the design. Hence, this method is not practicable. For this reason, we use a benchmark-based approach, which does not require any knowledge of the chip design but provides real-world and hardware-specific insights.

In document Analytical Query Processing Using Heterogeneous SIMD Instruction Sets (Page 65-69)