Calibrating the Relationship between Hardware Customization and Energy Efficiency

Apala Guha, Yao Zhang, Raihan ur Rasool, and Andrew A. Chien

University of Chicago, Department of Computer Science, [email protected]

Problem and Motivation

With Dennard scaling [20] at an end, chip-level performance scaling depends heavily on both parallelism (multicore and other dimensions) and customization [25]. These techniques can deliver increased energy efficiency, the key to continued performance growth. However, one major challenge on the path forward is to balance general-purpose coverage, customization for energy efficiency, and programmability effectively [21].

To this end, we have proposed 10x10 [21, 22], perhaps the most aggressive approach to heterogeneous architecture for general-purpose computing. 10x10 leverages hardware customization (known to be up to 10,000x more energy-efficient than general-purpose processors) by splitting each core into multiple customized micro-engines, such that they individually deliver high energy efficiency when executing their target software and collectively cover the general-purpose application space. Only one micro-engine per core (the one most customized for the currently executing code) will be active at a time, and it will execute the code many times more efficiently than a general-purpose core.

A central challenge for 10x10 (or any broad-based heterogeneous architecture approach) is determining the exact engines that will compose a core, given the large number of choices for micro-engines and their combinations. Micro-engines can be designed on the basis of the operations and data types they support, their issue widths and vector widths, specialized operations, and many other factors. Micro-engines can also be designed to support combinations of these factors. The goal is to select a combination of micro-engines that will deliver the most benefit. It is unrealistic to actually build different micro-engine combinations and evaluate them. It is more generally useful to develop an abstract methodology for evaluating micro-engine candidates by modeling the relationship between the degree of hardware customization and benefit. Such a model is useful not only for 10x10, but for a wide range of architectural exploration. For example, others have done parametric studies coupled with design synthesis to explore customization opportunities [24]. Such efforts are broadly useful and show significant opportunities in customization. However, such approaches cannot explore more aggressive forms of customized design, such as the circuit optimization and physical optimization often pursued in the highest-performance accelerators. By cataloging extant accelerators, we hope to capture the impact of these lower-level optimizations.

Our narrower interest is to apply this model to proposed 10x10 micro-engines, employing it to estimate their potential benefit rapidly and thereby enable fast high-level design space exploration. In later stages, we will build detailed implementations of the top few micro-engines to evaluate them thoroughly.

Specific contributions of our study include:

• Collection of a spectrum of accelerator designs and classification by degree of customization;
• Fitting of the empirical data to create models of customization benefit versus specialization;
• Assessment of previously proposed abstract models [19] and derivation of calibration constants.

Approach

Our approach was to collect data on a wide variety of chips spanning a wide range of customization. For each chip, we collected data about its energy efficiency and categorized the chip into one of three categories – “not customized”, “moderately customized”, and “highly customized”. Each chip represented one data point, with its categorization being its x-value and its energy efficiency being the y-value. Statistical tools were used to extract models of energy efficiency as dependent on the degree of customization.

The first challenge was that the x-values were not numerical, making it difficult to use statistical tools on them. Therefore, we assumed each customization category to be an order of magnitude more customized than the next lower category and assigned values of 1, 10, and 100 to the “not customized”, “moderately customized”, and “highly customized” categories, respectively.

The second challenge was that the energy efficiency of a chip is not reported in standard units. Gops/W, Gpixels/W, and similar units are commonly used to report chip energy efficiency. To enable comparison, we converted all energy efficiency values to Ginstructions/W. If the relative performance (against a baseline general-purpose processor) is available, we derive the instructions/sec simply by scaling the performance of the baseline processor. If a relative performance is not available, we count the operations/sec using application-specific knowledge and scale the operations/sec by a factor of 3 to get the instructions/sec number, assuming an average of 3 RISC instructions to perform an operation (load, compute, and store).

The third challenge was that the chips spanned a wide range of semiconductor technologies. We scaled their energy efficiency to 45nm technology using scaling factors reported in [23].
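The conversion rules above can be illustrated with a minimal Python sketch. The function names and the tech_scale parameter are hypothetical, chosen here for illustration only; the actual technology scaling factors come from [23] and are not reproduced in this text, so the sketch takes the factor as an input rather than assuming a value.

```python
# Minimal sketch of the unit conversions described in the text.
# All function and parameter names are illustrative assumptions, not the authors' code.

INSTRUCTIONS_PER_OP = 3  # load, compute, store (the assumption stated in the text)


def relative_to_gins_per_watt(baseline_gins_per_sec, relative_performance, power_watts):
    """Case 1: performance is reported relative to a baseline processor.

    Scale the baseline's instruction rate by the relative performance,
    then divide by the chip's power to obtain Ginstructions/W.
    """
    return baseline_gins_per_sec * relative_performance / power_watts


def ops_to_gins_per_watt(gops_per_watt, instructions_per_op=INSTRUCTIONS_PER_OP):
    """Case 2: efficiency is reported in operations (e.g. Gops/W).

    Each operation is counted as `instructions_per_op` RISC instructions.
    """
    return gops_per_watt * instructions_per_op


def scale_to_45nm(gins_per_watt, tech_scale=1.0):
    """Apply a technology scaling factor to express the value at 45nm.

    `tech_scale` is a placeholder; real factors would be taken from [23].
    """
    return gins_per_watt * tech_scale


if __name__ == "__main__":
    # A chip reported at 620 Gops/W, with no technology scaling applied.
    print(scale_to_45nm(ops_to_gins_per_watt(620.0)))
```

As a sanity check on the factor-of-3 rule, the title of [16] reports 620 GOPS/W, and 3 x 620 = 1860 coincides with the Table 1 entry for that chip. This is our reading of the numbers, and it would imply that no technology scaling was applied to that entry.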

Accelerator Classification Results

Table 1 below reports the classification of over 20 general-purpose CPUs, customized parallel engines such as GPUs and tiled multicores, and aggressive accelerators such as a 3D lighting accelerator. We report the scaled energy efficiency and the normalized energy efficiency, the latter relative to a baseline of the Intel Nehalem processor and normalized for process technology.

Table 1: Classification of general-purpose CPUs, customized parallel engines, and accelerators

Accelerator | Customization factor | Scaled energy efficiency wrt 45 nm (Gins/W) | Normalized energy efficiency (wrt Nehalem)
--- | --- | --- | ---
Atom N2800 [1] | 1 | 0.1758 | 0.225384615
Dyser [10] | 1 | 0.289300412 | 0.370897963
Sandybridge [5] | 1 | 0.746727273 | 0.957342657
Nehalem [4] | 1 | 0.78 | 1
Cortex A8 [] | 1 | 0.8 | 1.025641026
IBM Cell [2] | 10 | 2.271604938 | 2.912314023
Kepler [17] | 10 | 7.117548162 | 9.125061747
Epiphany [3] | 10 | 33.21522476 | 42.58362148
Tilera [6] | 10 | 70.83333333 | 90.81196581
GRAPE-8 [13] | 100 | 21 | 26.92307692
Anton [8] | 100 | 31.27291667 | 40.09348291
QsCores [14] | 100 | 44.7 | 57.30769231
CryptoManiac [15] | 100 | 266.955 | 342.25
3D lighting accelerator [9] | 100 | 417.2338636 | 534.9152098
Image recognition [16] | 100 | 1860 | 2384.615385
Object recognition [12] | 100 | 14014 | 17966.66667
Augmented reality [11] | 100 | 19279.296 | 24717.04615
H.264 encoder - ASIC [18] | 100 | 59616 | 76430.76923

Based on this raw data, we have plotted and fit the points on the basis of the 10th percentile, median, and 90th percentile values, using both linear and quadratic models. The plots and derived constants are shown in Figures 1, 2, and 3.
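The fitting step can be sketched as follows in Python, using the normalized values from Table 1. The text does not state which percentile definition or fitting procedure was used, so the linear-interpolation percentile and the ordinary least-squares fit below are assumptions, and the helper names are hypothetical. The constants this sketch produces need not match those shown in the figures.

```python
# Sketch: pick a representative point per customization category, then fit
# E = a * x + c (linear) and E = a * x**2 + c (quadratic) by least squares.
# The percentile definition and fitting method are assumptions, not the authors' code.

# Normalized energy efficiency (wrt Nehalem) from Table 1, keyed by customization factor.
TABLE1 = {
    1: [0.225384615, 0.370897963, 0.957342657, 1.0, 1.025641026],
    10: [2.912314023, 9.125061747, 42.58362148, 90.81196581],
    100: [26.92307692, 40.09348291, 57.30769231, 342.25, 534.9152098,
          2384.615385, 17966.66667, 24717.04615, 76430.76923],
}


def percentile(values, q):
    """Return the q-th percentile (0..100), interpolating linearly between sorted points."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


def representative_points(q):
    """One (customization factor, efficiency) point per category at percentile q."""
    return [(x, percentile(vals, q)) for x, vals in sorted(TABLE1.items())]


def fit(points, power):
    """Least-squares fit of E = a * x**power + c; returns (a, c)."""
    us = [x ** power for x, _ in points]
    es = [e for _, e in points]
    n = len(points)
    mean_u = sum(us) / n
    mean_e = sum(es) / n
    sxx = sum((u - mean_u) ** 2 for u in us)
    sxy = sum((u - mean_u) * (e - mean_e) for u, e in zip(us, es))
    a = sxy / sxx
    return a, mean_e - a * mean_u


if __name__ == "__main__":
    for label, q in (("10th percentile", 10), ("median", 50), ("90th percentile", 90)):
        pts = representative_points(q)
        a_lin, c_lin = fit(pts, 1)
        a_quad, c_quad = fit(pts, 2)
        print(f"{label}: linear a={a_lin:.4g}, c={c_lin:.4g}; "
              f"quadratic a={a_quad:.4g}, c={c_quad:.4g}")
```

With only three categories, each fit is determined by three points, so the choice of percentile definition can noticeably change the constants, especially for the four-chip middle category.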

Figure 1: Model of customization factor and normalized energy efficiency based on 10th percentile point in each customization category (pessimistic model for customization benefit).

Figure 2: Model of customization factor and normalized energy efficiency based on median point in each customization category.

Figure 3: Model of customization factor and normalized energy efficiency based on 90th percentile point in each customization category (optimistic model for customization benefit).

The data suggests that the relationship is somewhere between linear and quadratic. A purely linear relationship is pessimistic. We believe that the actual relationship is closer to quadratic, particularly for the more aggressively customized accelerators. Table 2 lists the analytical models reported in [19] for a comparison with the empirically derived models in this work.

Table 2: Models reported in [19]

Model | Form | a
--- | --- | ---
Square root | E = a * I^0.5 + c | 0.04898
Linear | E = a * I + c | 0.00194
Quadratic | E = a * I^2 + c | 3.033e-6
Cubic | E = a * I^3 + c | 4.746e-9

where I = number of implemented opcodes

The quadratic model is closest to our 90th percentile model. The linear model is closest to our 10th percentile model. Therefore, this again suggests that the best model lies somewhere between a linear and a quadratic relationship.

Conclusions

We have studied the relationship between hardware customization and energy efficiency by cataloguing and calibrating a range of extant accelerators. This approach allows us to factor in aggressive architecture and implementation customization techniques widely explored to deliver accelerator energy efficiencies as high as 1000x greater than general-purpose processors. These studies are calibrated in comparison to the Nehalem processor in energy efficiency and performance. These models have been calculated empirically covering

models. Therefore, we believe that these models are valid and can be applied to rapidly evaluate the energy efficiency potential of customized hardware.

Acknowledgements

This work was supported in part by the National Science Foundation under awards NSF OCI-1057921 and the Defense Advanced Research Projects Agency under PERFECT program award HR0011-13-2-0014. The contents of this report do not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

References

[1] "Intel Atom™ Processor N2800", http://ark.intel.com/products/58917/Intel-Atom-Processor-N2800-(1M-Cache-1_86-GHz)#infosectionadvancedtechnologies, Dec 2012.

[2] "Cell/B.E. processor-based systems and software offerings IBM BladeCenter® QS22 and SDK 3.0", http://www.spscicomp.org/ScicomP14/talks/Grice-QS22.pdf, Dec 2012.

[3] Epiphany: Adapteva Inc. EpiphanyTM Architecture Reference (G3). 2012.

[4] Nehalem: http://www.advancedclustering.com/company-blog/high-performance-linpack-on-xeon-5500-v-opteron-2400.html, July 2013.

[5] Sandybridge: http://forums.anandtech.com/showthread.php?t=2135705, July 2013.

[6] Tilera: Tilera. Tile64 processor product brief. 2009.

[7] Ardavan Pedram, Robert A. van de Geijn, and Andreas Gerstlauer. “Co-Design Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures”.

[8] D. E. Shaw, M. M. Deneroff et al. “Anton, a special-purpose machine for molecular dynamics simulation”. In Proceedings of the 34th International Symposium on Computer Architecture, June 2007.

[9] Farhana Sheikh, Sanu Mathew, Mark Anders, Himanshu Kaul, Steven Hsu, Amit Agarwal, Ram Krishnamurthy and Shekhar Borkar. “A 2.05GVertices/s 151mW Lighting Accelerator for 3D Graphics Vertex and Pixel Shading in 32nm CMOS”. ISSCC 2012

[10] Govindaraju, V., C. Ho and K. Sankaralingam (2011). “Dynamically Specialized Data paths for Energy Efficient Computing”, 17th IEEE International Symposium on High Performance Computer Architecture.

[11] Jae-Sung Yoon, Jeong-Hyun Kim, Hyo-Eun Kim, Won-Young Lee, Seok-Hoon Kim, Kyusik Chung, Jun-Seok Park, Lee-Sup Kim. “A Graphics and Vision Unified Processor with 0.89µW/fps Pose Estimation Engine for Augmented Reality”. ISSCC 2012

[12] Jinwook Oh, Gyeonghoon Kim, Junyoung Park, Injoon Hong, Seungjin Lee, Hoi-Jun Yoo. “A 320mW 342GOPS Real-Time Moving Object Recognition Processor for HD 720p Video Streams”. ISSCC 2012

[13] Junichiro Makino and Hiroshi Daisaka. 2012. “GRAPE-8: an accelerator for gravitational N-body simulation with 20.5Gflops/W performance”. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 104, 10 pages.

[14] Venkatesh, G., J. Sampson, N. Goulding, S. Garcia, S. Swanson and M.B. Taylor. “QSCORES: Trading Dark Silicon for Scalable Energy Efficiency with Quasi-Specific Cores”. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture, 2011.

[15] Wu, L., C. Weaver and T. Austin. “CryptoManiac: A Fast Flexible Architecture for Secure Communication”. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA-2001), June 2001.

[16] Yasuki Tanabe, Masato Sumiyoshi, Manabu Nishiyama, Itaru Yamazaki, Shinsuke Fujii, Katsuyuki Kimura, Takuma Aoyama, Moriyasu Banno, Hiroo Hayashi and Takashi Miyamori. “A 464GOPS 620GOPS/W Heterogeneous Multi-Core SoC for Image-Recognition Applications”. ISSCC 2012

[17] Whitepaper: NVIDIA GeForce GTX 680, http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf

[18] R. Hameed et al. “Understanding Sources of Inefficiency in General-Purpose Chips”. In Proceedings of the 37th Int’l Symposium of Computer Architecture, IEEE CS Press, 2010, pp. 37-47.

[19] Apala Guha, Yao Zhang, Raihan ur Rasool and Andrew A. Chien. "Systematic Evaluation of Workload Clustering for Extremely Energy-Efficient Architectures". ACM SIGARCH Computer Architecture News. Vol. 41, Issue 2, Pages 22-29, May 2013.

[20] R. Dennard. "Design of ion-implanted MOSFETs with very small physical dimensions". IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, 1974.

[21] S. Borkar and Andrew Chien, "The future of microprocessors," Communications of ACM, vol. 54, May 2011

[22] Andrew A Chien, Allan Snavely and Mark Gahagan, “10x10: A general-purpose architectural approach to heterogeneity and energy efficiency”, Dec 2011

[23] Borkar, Shekhar: “Design perspectives on 22nm CMOS and beyond”, DAC 2009: 93-94

[24] Vinod Kathail, Shail Aditya, Robert Schreiber, B. Ramakrishna Rau, Darren C. Cronquist, Mukund Sivaraman: “PICO: Automatically Designing Custom Computers”. IEEE Computer 35(9): 39-47 (2002)

[25] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th annual international symposium on Computer architecture (ISCA '11). ACM, New York, NY, USA, 365-376.