Calibrating the Relationship between Hardware Customization and Energy Efficiency

Apala Guha, Yao Zhang, Raihan ur Rasool, and Andrew A. Chien

University of Chicago, Department of Computer Science, [email protected]

 

Problem and Motivation

With Dennard scaling [20] at an end, chip-level performance scaling depends heavily on both parallelism (multicore and other dimensions) and customization [25]. These techniques can deliver increased energy efficiency, the key to continued performance growth. However, one major challenge on the path forward is to effectively balance general-purpose coverage, customization for energy efficiency, and programmability [21].

To this end, we have proposed 10x10 [21, 22], perhaps the most aggressive approach to heterogeneous architecture for general-purpose computing. 10x10 leverages hardware customization (known to be up to 10,000x more energy-efficient than general-purpose processors) by splitting each core into multiple customized micro-engines such that they individually deliver high energy efficiency when executing their target software and collectively cover the general-purpose application space. Only one micro-engine per core, the one most customized for the currently executing code, will be active at a time, and it will execute the code many times more efficiently than a general-purpose core.

A central challenge for 10x10 (or any broad-based heterogeneous architecture approach) is determining the exact engines that will compose a core, given the large number of possible micro-engines and combinations. Micro-engines can be designed on the basis of the operations and data types they support, their issue widths and vector widths, specialized operations, and many other factors. Micro-engines can also be designed to support combinations of these factors. The goal is to select a combination of micro-engines that will deliver the most benefit. It is unrealistic to actually build different micro-engine combinations and evaluate them. It is more generally useful to develop an abstract methodology for evaluating micro-engine candidates by modeling the relationship between the degree of hardware customization and benefit. Such a model is useful not only for 10x10, but for a wide range of architectural exploration. For example, others have done parametric studies coupled with design synthesis to explore customization opportunities [24]. Such efforts are broadly useful and show significant opportunities in customization. However, such approaches cannot explore more aggressive forms of customized design, such as the circuit optimization and physical optimization often pursued in the highest-performance accelerators. By cataloging extant accelerators we hope to capture the impact of these lower-level optimizations.

Our narrower interest is to apply this model to proposed 10x10 micro-engines, employing it to rapidly estimate their potential benefit and thereby enable rapid high-level design space exploration. In later stages we will build detailed implementations of the top few micro-engines to create thorough evaluations.

Specific contributions of our study include:

• Collection of a spectrum of accelerator designs and classification by degree of customization;

• Fitting of the empirical data to create models of customization benefit versus specialization;

• Assessment of previously proposed abstract models [19] and derivation of calibration constants.

Approach  

Our approach was to collect data on a wide variety of chips spanning a wide range of customization. For each chip, we collected data about its energy efficiency and categorized the chip into one of three categories – “not customized”, “moderately customized”, and “highly customized”. Each chip represented one data point, with its categorization being its x-value and its energy efficiency being the y-value. Statistical tools were used to extract models of energy efficiency as dependent on the degree of customization.

The first challenge was that the x-values were not numerical, making it difficult to use statistical tools on them. Therefore, we assumed each customization category to be an order of magnitude more customized than the next lower category and assigned values of 1, 10, and 100 to the “not customized”, “moderately customized”, and “highly customized” categories respectively.

The second challenge was that the energy efficiency of a chip is not reported in standard units. Gops/W, Gpixels/W, etc. are common units in which chip energy efficiency is reported. To enable comparison, we converted all energy efficiency values to Ginstructions/W. If the relative performance (against a baseline general-purpose processor) is available, we derive the instructions/sec simply by scaling the performance of the baseline processor. If a relative performance is not available, we count the operations/sec using application-specific knowledge and scale the operations/sec by a factor of 3 to get the instructions/sec number, assuming an average of 3 RISC instructions to perform an operation (load, compute, and store).

The third challenge was that the chips spanned a wide range of semiconductor technologies. We scaled their energy efficiency to 45nm technology using scaling factors reported in [23].
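The conversion steps above can be sketched as follows. This is only a minimal illustration under stated assumptions: the function and constant names are hypothetical, the technology scaling is assumed to be a simple multiplicative factor, and that factor is passed in by the caller because the actual values from [23] are not reproduced in this text.

```python
# Sketch of the unit-normalization steps described above.
# All names are illustrative; they are not taken from the study's tooling.

RISC_INSTRUCTIONS_PER_OP = 3  # load, compute, store (the stated assumption)

# Numeric stand-ins for the three customization categories.
CUSTOMIZATION_FACTOR = {
    "not customized": 1,
    "moderately customized": 10,
    "highly customized": 100,
}


def gins_per_watt_from_ops(gops_per_sec: float, watts: float) -> float:
    """Convert a reported operation rate (Gops/s) to Ginstructions/W."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return gops_per_sec * RISC_INSTRUCTIONS_PER_OP / watts


def gins_per_watt_from_speedup(baseline_gins_per_sec: float,
                               relative_performance: float,
                               watts: float) -> float:
    """Scale the baseline processor's instruction rate by a reported speedup."""
    if watts <= 0:
        raise ValueError("power must be positive")
    return baseline_gins_per_sec * relative_performance / watts


def scale_to_45nm(gins_per_watt: float, tech_scaling_factor: float) -> float:
    """Apply a process-technology scaling factor.

    Assumption: the scaling is multiplicative. The factor itself is a
    placeholder argument; the real values come from [23].
    """
    return gins_per_watt * tech_scaling_factor


# Made-up example chip: 10 Gops/s at 2 W.
print(gins_per_watt_from_ops(gops_per_sec=10.0, watts=2.0))  # 15.0
```

Keeping the two conversion paths as separate functions mirrors the two cases in the text (relative performance available or not), so each data point records which assumption it relied on.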

Accelerator Classification Results

Table 1 reports the classification of over 20 chips: general-purpose CPUs, customized parallel engines such as GPUs and tiled multicores, and aggressive accelerators such as a 3D lighting accelerator. For each chip we report the scaled energy efficiency and the normalized energy efficiency, compared to a baseline of the Intel Nehalem processor and normalized for process technology.

Table 1: Classification of general-purpose CPUs, customized parallel engines, and accelerators

Accelerator                   Customization factor  Scaled energy efficiency wrt 45 nm (Gins/W)  Normalized energy efficiency (wrt Nehalem)
Atom N2800 [1]                1                     0.1758                                       0.225384615
Dyser [10]                    1                     0.289300412                                  0.370897963
Sandybridge [5]               1                     0.746727273                                  0.957342657
Nehalem [4]                   1                     0.78                                         1
Cortex A8 []                  1                     0.8                                          1.025641026
IBM Cell [2]                  10                    2.271604938                                  2.912314023
Kepler [17]                   10                    7.117548162                                  9.125061747
Epiphany [3]                  10                    33.21522476                                  42.58362148
Tilera [6]                    10                    70.83333333                                  90.81196581
Grape-8 [13]                  100                   21                                           26.92307692
Anton [8]                     100                   31.27291667                                  40.09348291
QsCores [14]                  100                   44.7                                         57.30769231
CryptoManiac [15]             100                   266.955                                      342.25
3D lighting accelerator [9]   100                   417.2338636                                  534.9152098
Image recognition [16]        100                   1860                                         2384.615385
Object recognition [12]       100                   14014                                        17966.66667
Augmented reality [11]        100                   19279.296                                    24717.04615
H.264 encoder - ASIC [18]     100                   59616                                        76430.76923

Based on this raw data, we have plotted and fit the points on the basis of the 10th percentile, median, and 90th percentile values, using both linear and quadratic models. The plots and derived constants are shown in Figures 1, 2 and 3.
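The percentile-and-fit procedure can be sketched roughly as follows. The data are the normalized values from Table 1; everything else is an assumption made for illustration: the percentile interpolation rule, the use of ordinary least squares, and the model forms E = a·x + c and E = a·x² + c (borrowed from the forms listed for [19]). The constants it prints therefore need not match those shown in the figures.

```python
import statistics

# Normalized energy efficiency (wrt Nehalem) from Table 1, grouped by
# customization factor.
NORMALIZED = {
    1: [0.225384615, 0.370897963, 0.957342657, 1.0, 1.025641026],
    10: [2.912314023, 9.125061747, 42.58362148, 90.81196581],
    100: [26.92307692, 40.09348291, 57.30769231, 342.25, 534.9152098,
          2384.615385, 17966.66667, 24717.04615, 76430.76923],
}


def category_summary(values):
    """Return (10th percentile, median, 90th percentile) of one category.

    The interpolation rule (method="inclusive") is an assumption; the
    text does not say how the percentiles were computed.
    """
    deciles = statistics.quantiles(values, n=10, method="inclusive")
    return deciles[0], statistics.median(values), deciles[8]


def fit_power_model(xs, ys, power):
    """Least-squares fit of E = a * x**power + c; returns (a, c)."""
    us = [x ** power for x in xs]
    mean_u = sum(us) / len(us)
    mean_y = sum(ys) / len(ys)
    sxx = sum((u - mean_u) ** 2 for u in us)
    sxy = sum((u - mean_u) * (y - mean_y) for u, y in zip(us, ys))
    a = sxy / sxx
    return a, mean_y - a * mean_u


factors = sorted(NORMALIZED)
summaries = [category_summary(NORMALIZED[f]) for f in factors]

for label, idx in (("10th percentile", 0), ("median", 1), ("90th percentile", 2)):
    ys = [s[idx] for s in summaries]
    for name, power in (("linear", 1), ("quadratic", 2)):
        a, c = fit_power_model(factors, ys, power)
        print(f"{label:>15} {name:>9}: a={a:.4g}, c={c:.4g}")
```

With only three x-values (1, 10, 100), each fit is determined by three summary points, so the choice of percentile rule can noticeably change the derived constants.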

 

Figure 1: Model of customization factor and normalized energy efficiency based on 10th percentile point in each customization category (pessimistic model for customization benefit).

Figure 2: Model of customization factor and normalized energy efficiency based on median point in each customization category.

 

Figure 3: Model of customization factor and normalized energy efficiency based on 90th percentile point in each customization category (optimistic model for customization benefit).

The data suggests that the relationship is somewhere between linear and quadratic. A purely linear relationship is pessimistic. We believe that the actual relationship is closer to quadratic, particularly for the more aggressively customized accelerators. Table 2 lists the analytical models reported in [19] for a comparison with the empirically derived models in this work.

Table 2: Models reported in [19]

Square root: $E = a \cdot I^{0.5} + c$; $a = 0.04898$

Linear: $E = a \cdot I + c$; $a = 0.00194$

Quadratic: $E = a \cdot I^{2} + c$; $a = 3.033 \times 10^{-6}$

Cubic: $E = a \cdot I^{3} + c$; $a = 4.746 \times 10^{-9}$

where $I$ = number of implemented opcodes
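To get a feel for how differently these four forms grow, the short sketch below evaluates each one at two opcode counts. The coefficients a are those listed for [19]; the offset c is not given in this text, so it is left as a parameter, and the default of 0 is purely an illustrative assumption. The function and dictionary names are hypothetical.

```python
# Coefficient a and exponent for each model form listed for [19].
# The offset c is not listed in this text, so it is a caller-supplied
# parameter; the default of 0.0 is an assumption for illustration only.
MODELS_FROM_19 = {
    "square root": (0.04898, 0.5),
    "linear": (0.00194, 1),
    "quadratic": (3.033e-6, 2),
    "cubic": (4.746e-9, 3),
}


def model_value(name, opcodes, c=0.0):
    """Evaluate E = a * I**exponent + c for the named model."""
    a, exponent = MODELS_FROM_19[name]
    return a * opcodes ** exponent + c


# Compare growth when the opcode count rises tenfold (c assumed 0).
for name in MODELS_FROM_19:
    growth = model_value(name, 100) / model_value(name, 10)
    print(f"{name:>11}: E(100)/E(10) = {growth:.3g}")
```

With c = 0 the ratio depends only on the exponent, which makes the gap between the linear and quadratic forms easy to see.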

The quadratic model is closest to our 90th percentile model, and the linear model is closest to our 10th percentile model. This again suggests that the best model lies somewhere between a linear and a quadratic relationship.

Conclusions  

We have studied the relationship between hardware customization and energy efficiency by cataloguing and calibrating a range of extant accelerators. This approach allows us to factor in aggressive architecture and implementation customization techniques widely explored to deliver accelerator energy efficiencies as high as 1000x greater than general-purpose processors. These studies are calibrated in comparison to the Nehalem processor in energy efficiency and performance. These models have been calculated empirically covering

models. Therefore, we believe that these models are valid and can be applied to rapidly evaluate the energy efficiency potential of customized hardware.

Acknowledgements  

This work was supported in part by the National Science Foundation under awards NSF OCI-1057921 and the Defense Advanced Research Projects Agency under PERFECT program award HR0011-13-2-0014. The contents of this report do not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

References  

[1] "Intel Atom™ Processor N2800", http://ark.intel.com/products/58917/Intel-Atom-Processor-N2800-(1M-Cache-1_86-GHz)#infosectionadvancedtechnologies, Dec 2012.

[2] "Cell/B.E. processor-based systems and software offerings IBM BladeCenter® QS22 and SDK 3.0", http://www.spscicomp.org/ScicomP14/talks/Grice-QS22.pdf, Dec 2012.

[3] Epiphany: Adapteva Inc. EpiphanyTM Architecture Reference (G3). 2012.

[4] Nehalem: http://www.advancedclustering.com/company-blog/high-performance-linpack-on-xeon-5500-v-opteron-2400.html, July 2013.

[5] Sandybridge: http://forums.anandtech.com/showthread.php?t=2135705, July 2013.

[6] Tilera: Tilera. Tile64 processor product brief. 2009.

[7] Ardavan Pedram, Robert A. van de Geijn Member, and Andreas Gerstlauer. “Co-Design Tradeoffs for High-Performance, Low-Power Linear Algebra Architectures”.

[8] D. E. Shaw, M. M. Deneroff et al. “Anton, a special-purpose machine for molecular dynamics simulation”, In Proceedings of the 34th International Symposium on Computer Architecture, June 2007.

[9] Farhana Sheikh, Sanu Mathew, Mark Anders, Himanshu Kaul, Steven Hsu, Amit Agarwal, Ram Krishnamurthy and Shekhar Borkar. “A 2.05GVertices/s 151mW Lighting Accelerator for 3D Graphics Vertex and Pixel Shading in 32nm CMOS”. ISSCC 2012

[10] Govindaraju, V., C. Ho and K. Sankaralingam (2011). “Dynamically Specialized Data paths for Energy Efficient Computing”, 17th IEEE International Symposium on High Performance Computer Architecture.

[11] Jae-Sung Yoon, Jeong-Hyun Kim, Hyo-Eun Kim, Won-Young Lee, Seok-Hoon Kim, Kyusik Chung, Jun-Seok Park, Lee-Sup Kim. “A Graphics and Vision Unified Processor with 0.89µW/fps Pose Estimation Engine for Augmented Reality”. ISSCC 2012

[12] Jinwook Oh, Gyeonghoon Kim, Junyoung Park, Injoon Hong, Seungjin Lee, Hoi-Jun Yoo. “A 320mW 342GOPS Real-Time Moving Object Recognition Processor for HD 720p Video Streams”. ISSCC 2012

[13] Junichiro Makino and Hiroshi Daisaka. 2012. “GRAPE-8: an accelerator for gravitational N-body simulation with 20.5Gflops/W performance”. In Proceedings of the International Conference on High Performance Computing, Networking, Storage and Analysis (SC '12). IEEE Computer Society Press, Los Alamitos, CA, USA, Article 104, 10 pages.

[14] Venkatesh, G., J. Sampson, N. Goulding, S. Garcia, S. Swanson and M.B. Taylor. “QSCORES: Trading Dark Silicon for Scalable Energy Efficiency with Quasi- Specific Cores”. In Proceedings of the 44th Annual IEEE/ACM International Symposium on Microarchitecture 2011.

[15] Wu, L., C. Weaver and T. Austin, “CryptoManiac: A Fast Flexible Architecture for Secure Communication”. In Proceedings of the 28th Annual International Symposium on Computer Architecture (ISCA-2001), June 2001

[16] Yasuki Tanabe, Masato Sumiyoshi, Manabu Nishiyama, Itaru Yamazaki, Shinsuke Fujii, Katsuyuki Kimura, Takuma Aoyama, Moriyasu Banno, Hiroo Hayashi and Takashi Miyamori. “A 464GOPS 620GOPS/W Heterogeneous Multi-Core SoC for Image-Recognition Applications”. ISSCC 2012

[17] Whitepaper: NVIDIA GeForce GTX 680, http://www.geforce.com/Active/en_US/en_US/pdf/GeForce-GTX-680-Whitepaper-FINAL.pdf

[18] R. Hameed et al., “Understanding Sources of Inefficiency in General-Purpose Chips”. In Proceedings of the 37th Int’l Symposium of Computer Architecture, IEEE CS Press, 2010, pp. 37-47

[19] Apala Guha, Yao Zhang, Raihan ur Rasool and Andrew A. Chien, "Systematic Evaluation of Workload Clustering for Extremely Energy-Efficient Architectures". ACM SIGARCH Computer Architecture News. Vol. 41, Issue 2, Pages 22-29, May 2013.

[20] R. Dennard, "Design of ion-implanted MOSFETs with very small physical dimensions", IEEE Journal of Solid State Circuits, vol. SC-9, no. 5, pp. 256-268, 1974

[21] S. Borkar and Andrew Chien, "The future of microprocessors," Communications of ACM, vol. 54, May 2011

[22] Andrew A Chien, Allan Snavely and Mark Gahagan, “10x10: A general-purpose architectural approach to heterogeneity and energy efficiency”, Dec 2011

[23] Borkar, Shekhar: “Design perspectives on 22nm CMOS and beyond”, DAC 2009: 93-94

[24] Vinod Kathail, Shail Aditya, Robert Schreiber, B. Ramakrishna Rau, Darren C. Cronquist, Mukund Sivaraman: “PICO: Automatically Designing Custom Computers”. IEEE Computer 35(9): 39-47 (2002)

[25] Hadi Esmaeilzadeh, Emily Blem, Renee St. Amant, Karthikeyan Sankaralingam, and Doug Burger. 2011. Dark silicon and the end of multicore scaling. In Proceedings of the 38th annual international symposium on Computer architecture (ISCA '11). ACM, New York, NY, USA, 365-376

   
