Trends and future of supercomputers - High Performance Computing as a Combination of Machines a

Processor manufacturers constantly improve their products by tuning CPU components and implementing new hardware concepts. The aim is to keep providing increasingly powerful computers for everyday needs and large-scale supercomputers for cutting-edge research and engineering. Progress and need feed each other: we iteratively push the limits and then try to go beyond them. Harvesting computing cycles for science will certainly change the landscape of experimental research and shorten the path to scientific discovery and technical insight.

As explained so far, increasing the (aggregated) processor speed raises a number of technical challenges that must be addressed carefully for the benefit to be clear to the community. Indeed, the gap between peak performance and sustained performance is a genuine concern. It is comparable to the difference between gross salary and net salary from the employee's viewpoint. Users expect supercomputers to be powerful for their own applications, not in absolute terms. Thus, getting close to the maximum performance will be a crucial requirement. From the hardware point of view, this calls for a number of improvements: memory latency should be reduced at all levels of the hierarchy; programmers should be given the opportunity to manage memory features as desired; data exchanges between different memory levels should be improved by adding buses; the penalty for accessing distant parts of a NUMA memory should be revisited; the set of vector instructions should be soundly extended; network capability (topology, bandwidth, and latency) should be improved in order to sufficiently lower the communication overhead. At the algorithmic level, scheduling should take Amdahl's law [21] into account.
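To make the reference to Amdahl's law concrete, the following minimal sketch evaluates the classical bound on speedup, 1 / ((1 - f) + f / p), where f is the parallelizable fraction of the work and p the number of processors. The function name and the 95% figure are illustrative assumptions, not values taken from the text.

```python
def amdahl_speedup(parallel_fraction: float, processors: int) -> float:
    """Upper bound on speedup when only `parallel_fraction` of the work scales."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (serial_fraction + parallel_fraction / processors)


if __name__ == "__main__":
    # Hypothetical workload: 95% of the run time is parallelizable.
    for p in (16, 1024, 1_000_000):
        bound = amdahl_speedup(0.95, p)
        print(f"{p:>9} processors -> speedup bounded by {bound:.2f}")
```

Under this assumption the speedup can never exceed 20, however many processors are added, which is one way to read the gap between peak and sustained performance.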

The question of heat dissipation and power consumption will be among the top concerns. It is possible that, at some point, performance will be sacrificed because of the energy constraint. A typical supercomputer node will be made of a traditional multicore processor with several moderate cores, coupled with high-speed accelerator units (mainly GPUs). The idea behind relying on accelerators is that they will be fast enough to significantly reduce the overall execution time, thereby reducing the corresponding heat dissipation. It is important to understand that this reasoning is local, and the case of high-throughput computation remains problematic. Indeed, we cannot expect to always compute in short bursts. Certain kinds of applications, such as simulation, tracking, and data assimilation, to name a few, require continuous heavy calculations. The question will be how to keep the benefit of acceleration over a long period of computing time without the penalty of unacceptable power consumption or hardware failure. Thus, research on the energy efficiency of computing systems will be of particular interest, both from the hardware side and from the programming standpoint. Alongside these efforts, research on efficient and affordable cooling systems will also be crucial.

Another trend for future innovation, apart from increasing processor horsepower, is the ability to leverage remote computing power through an increasingly diverse collection of devices. Cloud computing offers a great alternative for mass storage, software, and computing devices. Federating available computing resources, assuming a sufficiently fast network, is certainly a valuable way to offer a more powerful computing system to the community. The main advantage is that the maintenance cost is shared and users pay only for what they have actually consumed. In addition, and more related to the Software as a Service (SaaS) feature, users instantly benefit from updates, new releases, and new software. There is also an opportunity to share data and key parameters. This approach of federating available resources can also be seen as a way to save power, as it prevents waste. Cloud computing is coming into vogue and will probably be adopted for major large-scale scientific experiments, assuming non-sensitive data. The challenge for computer scientists is how to efficiently schedule a given set of tasks on the available set of resources in order to serve requests at the user's convenience, while taking care of energy.

From the programming point of view, there are a number of serious challenges that need to be addressed or that remain under deeper investigation. The heterogeneity of current and upcoming supercomputers requires the use of hybrid codes, which adds another level of programming complexity. One might think of using (semi-)automatic code generators, and thus concentrate on a higher level of abstraction. Programmers will, at some point, rely on the output of those code generation frameworks, which is not always easy to accept and raises a number of practical issues related to debugging, maintenance, adaptability, tuning, and refactoring. Figure 1.17 displays an example of a complex code design framework [11].

Figure 1.17: Sample hybrid programming chain

As the number of cores increases, with various packaging models, scalability will be an important issue for programmers. Some of the considerations that suit single-threaded code have to be revised for the multi-threaded version. Data locality is one of them, since so-called false sharing is also caused by inappropriate locality. Mixing the distributed memory model and the shared memory model should become standard.

Bibliography

[1] Get Connected With Ohm’s Law, http://www.tryengineering.org/lessons/ohmslaw.pdf

[2] Bianca Schroeder and Garth A. Gibson, A Large-Scale Study of Failures in High-Performance Computing Systems, Proceedings of the International Conference on Dependable Systems and Networks (DSN2006), Philadelphia, PA, USA, pp. 249-258, June 25-28, 2006. http://repository.cmu.edu/pdl/46/

[3] Brock, M., Goscinski, A., Execution of Compute Intensive Applications on Hybrid Clouds (Case Study with mpiBLAST), Sixth International Conference on Complex, Intelligent, and Software Intensive Systems (CISIS) 2012, Palermo, Italy, July 4-6, 2012. http://www.epcc.ed.ac.uk/wp-content/uploads/2011/11/AleksandarDraganov.pdf

[4] K.W. Cameron and R. Ge, Predicting and Evaluating Distributed Communication Performance, Proc. 2004 ACM/IEEE Conf. Supercomputing, IEEE CS Press, 2004.

[5] K.W. Cameron et al., High-Performance, Power-Aware Distributed Computing for Scientific Applications, Computer, vol. 38(11), pp. 40-47, Nov. 2005.

[6] Cell SDK 3.0. www.ibm.com/developerworks/power/cell

[7] Charles Severance and Kevin Dowd, High Performance Computing, online book, Rice University, Houston, Texas, 2012. http://open.umich.edu/sites/default/files/col11136-1.5.pdf

[8] Chris Gregg and Kim Hazelwood, Where is the Data? Why You Cannot Debate CPU vs. GPU Performance Without the Answer, International Symposium on Performance Analysis of Systems and Software (ISPASS), Austin, TX, April 2011. http://www.cs.virginia.edu/kim/docs/ispass11.pdf

[9] Chris McClanahan, History and Evolution of GPU Architecture, http://mcclanahoochie.com/blog/wp-content/uploads/2011/03/gpu-hist-paper.pdf, 2010.

[10] David Brock, Understanding Moore’s Law: Four Decades of Innovation, Chemical Heritage Foundation, 2006.

[11] M. Djurfeldt, C. Johansson, Ö. Ekeberg, M. Rehn, M. Lundqvist, and A. Lansner, Brain-scale simulation of the neocortex on the IBM Blue Gene/L supercomputer, IBM J. Res. Dev., vol. 52(1-2), pp. 31-41, January 2008. http://www.diva-portal.org/smash/get/diva2:220701/FULLTEXT01

[12] M. Djurfeldt, C. Johansson, Ö. Ekeberg, M. Rehn, M. Lundqvist, and A. Lansner, Massively Parallel Simulation of Brain-Scale Neuronal Network Models, Technical Report TRITA-NA-P0513, KTH, School of Computer Science and Communication, Stockholm, 2005. http://www.diva-portal.org/smash/get/diva2:220701/FULLTEXT01


[13] J. Dongarra and P. Luszczek, Introduction to the HPC Challenge Benchmark Suite, tech. report, Univ. Tennessee, 2004. www.cs.utk.edu/ luszczek/pubs/hpcc-challenge-intro.pdf

[14] X. Fan, C.S. Ellis, and A.R. Lebeck, The Synergy between Power-Aware Memory Systems and Processor Voltage Scaling, Proc. 3rd Int’l Workshop Power-Aware Computing Systems, LNCS 3164, Springer-Verlag, 2003, pp. 164-179.

[15] Fan, Zhe and Qiu, Feng and Kaufman, Arie and Yoakum-Stover, Suzanne, GPU Cluster for High Performance Computing, ACM/IEEE conference on Supercomputing (SC ’04), Pittsburgh, November 6-12, 2004.

[16] Frank Wilczek, What QCD Tells Us About Nature and Why We Should Listen, Nuc. Phys. A 663, 320, 2000.

[17] G. Giunta, R. Montella, G. Agrillo, G. Coviello, A GPGPU Transparent Virtualization Component for High Performance Computing Clouds, 16th International Euro-Par Conference, Ischia, Italy, August 31 - September 3, 2010.

[18] S. Graham, M. Snir, and C. Patterson, eds., Getting Up to Speed: The Future of Supercomputing, National Academies Press, February 2005.

[19] Gupta, Abhishek; Milojicic, Dejan, Evaluation of HPC Applications on Cloud, HP Laboratories Technical Report (HPL-2011-132), 2011. http://www.hpl.hp.com/techreports/2011/HPL-2011-132.pdf

[20] Helias Moritz, Kunkel Susanne, Masumoto Gen, Igarashi Jun, Eppler Jochen Martin, Ishii Shin, Fukai Tomoki, Morrison Abigail, Diesmann Markus, Supercomputers ready for use as discovery machines for neuroscience, Frontiers in Neuroinformatics, vol. 6(26), 2012. http://www.frontiersin.org/Neuroinformatics/10.3389/fninf.2012.00026/abstract

[21] M. Hill and M. Marty, Amdahl’s law in the multicore era, Computer, vol. 41, no. 7, pp. 33-38, 2008.

[22] Hsu, Chung-hsing and Feng, Wu-chun, A Power-Aware Run-Time System for High-Performance Computing, ACM/IEEE conference on Supercomputing (SC ’05), Seattle, November 12-18, 2005.

[23] Jakub Kurzak and Jack Dongarra, QR factorization for the Cell Broadband Engine, Scientific Programming, vol. 17(1-2), pp. 31-42, 2009.

[24] S. Kumar, Optimization of All-to-All Communication on the Blue Gene/L Supercomputer, 37th International Conference on Parallel Processing (ICPP’08), Portland, Oregon, USA, September 8-12, 2008.

[25] Michael A. Bender, David P. Bunde, Erik D. Demaine, Sándor P. Fekete, Vitus J. Leung, Henk Meijer, Cynthia A. Phillips, Communication-Aware Processor Allocation for Supercomputers, 9th International Workshop, WADS 2005, Waterloo, Canada, August 15-17, 2005.

[26] Nagel, Wolfgang E.; Kröner, Dietmar B.; Resch, Michael M. (Eds.), High Performance Computing in Science and Engineering ’12, Springer Book Archives, 2013.

[27] H. Peter Hofstee, Power Efficient Processor Design and the Cell Processor, http://www.hpcaconf.org/hpca11/slides/Cell Public Hofstee.pdf

[28] Peter Kogge et al., ExaScale Computing Study: Technology Challenges in Achieving Exascale http://users.ece.gatech.edu/mrichard/ExascaleComputingStudyReports/ exascale final report 100208.pdf

[29] Rajesh Bordawekar, Uday Bondhugula, Ravi Rao, Can CPUs Match GPUs on Performance with Productivity?: Experiences with Optimizing a FLOP-intensive Application on CPUs and GPU, Research Report RC25033, IBM T.J. Watson Research Center, 2010.

[30] Rajkumar Buyya, High Performance Cluster Computing: Architectures and Systems (volume 1, volume 2), Prentice Hall, 1999.

[31] S. Ryoo, C. Rodrigues, S. Stone, S. Baghsorkhi, S.-Z. Ueng, and W.-m. Hwu, Program Optimization Study on a 128-Core GPU, Workshop on General Purpose Processing on Graphics Processing, 2008.

[32] Shuai Che, Michael Boyer, Jiayuan Meng, David Tarjan, Jeremy W. Sheaffer, Sang-Ha Lee and Kevin Skadron, Rodinia: A benchmark suite for heterogeneous computing, IEEE International Symposium on Workload Characterization (IISWC), Oct. 2009.

[33] Stephen Jones, Inside the Kepler Architecture, Supercomputing (SC12), Salt Lake City, USA, November 10-16, 2012.

[34] C. Tadonki, G. Grosdidier, and O. Pene, An efficient CELL library for Lattice Quantum Chromodynamics, International Workshop on Highly Efficient Accelerators and Reconfigurable Technologies (HEART) in conjunction with the 24th ACM International Conference on Supercomputing (ICS), pp. 67-71, Epochal Tsukuba, Tsukuba, Japan, June 1-4, 2010.

[35] C. Tadonki, L. Lacassagne, E. Dadi, M. Daoudi, Accelerator-based implementation of the Harris algorithm, 5th International Conference on Image Processing (ICISP 2012), Agadir, Morocco, June 28-30, 2012.

[36] C. Tadonki, Ring pipelined algorithm for the algebraic path problem on the CELL Broadband Engine, Workshop on Applications for Multi and Many Core Architectures (WAMMCA 2010) in conjunction with the International Symposium on Computer Architecture and High Performance Computing, Petropolis, Rio de Janeiro, Brazil, October 27-30, 2010.

[37] Thomas Lippert, Lattice Quantum Chromodynamics and the Impact of Supercomputing on Theoretical High Energy Physics, DEISA Symposium, Paris, May 9-10, 2005. http://www.deisa.eu/news press/symposium/Paris2005/presentations/ lippert DEISA 2005.pdf

[38] Tomas Oppelstrup, Introduction to GPU programming, PDC summer school 2008. http://www.training.prace-ri.eu/uploads/tx pracetmo/gpu-talk-pdc.pdf

[39] Victor W. Lee, Changkyu Kim, Jatin Chhugani, Michael Deisher, Daehyun Kim, Anthony D. Nguyen, Nadathur Satish, Mikhail Smelyanskiy, Srinivas Chennupaty, Per Hammarlund, Ronak Singhal and Pradeep Dubey, Debunking the 100X GPU vs. CPU Myth: An Evaluation of Throughput Computing on CPU and GPU, ISCA’10, June 19-23, Saint-Malo, France, 2010. http://pcl.intel-research.net/publications/isca319-lee.pdf

[40] S. Williams, J. Shalf, L. Oliker, S. Kamil, P. Husbands, and K. Yelick, Scientific Computing Kernels on the Cell Processor, International Journal of Parallel Programming, 2007.

[41] Xiang Rao, Huaimin Wang, Dianxi Shi, Zhenbang Chen, Hua Cai, Qi Zhou, Tingtao Sun, Identifying faults in large-scale distributed systems by filtering noisy error logs, IEEE/IFIP 41st International Conference on Dependable Systems and Networks Workshops, pp. 140-145, Hong Kong, China, June 27-30, 2011.


[42] Zhou Zhou, Wei Tang, Zhiling Lan, Ziming Zheng, Desai, N., Evaluating Performance Impacts of Delayed Failure Repairing on Large-Scale Systems, Proceedings of the 2011 IEEE International Conference on Cluster Computing (CLUSTER’11), pp. 532-536, 2011.

[43] Enhanced Intel SpeedStep Technology for the Intel Pentium Processor, ftp://download.intel.com/design/network/papers/30117401.pdf, white paper, March 2004.

[44] Performance Analysis of High Performance Computing Applications on the Amazon Web Services Cloud, Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science, pp. 159-168, 2010.

[45] http://fr.wikipedia.org/wiki/Cell

[46] http://www.mathworks.fr/company/newsletters/articles/ gpu-programming-in-matlab.html

[47] http://www.mathworks.fr/products/demos/shipping/distcomp/ paralleldemo gpu backslash.html

[48] http://www.nvidia.com/docs/IO/123576/nv-applications-catalog-lowres.pdf

[49] https://developer.nvidia.com/cublas

Chapter 2

Large-scale optimization


2.1 Foundations and background

Operations research is the science of decision making. The goal is to derive suitable mathematical models for practical problems and to study effective methods for solving them as efficiently as possible. For this purpose, mathematical programming has emerged as a strong formalism for major problems. Nowadays, due to the increasing size of the market and the pervasiveness of network services, industrial productivity and customer services must scale up to meet a huge demand and higher quality requirements. In addition, the interaction between business operators has reached a noticeable level of complexity. Consequently, for well-established companies, making optimal decisions is critical to survival, and the key to achieving this is to exploit recent advances in operations research. The objective is to give a quick and accurate answer to practical instances of critical decision problems. Operations research also plays a central role in cutting-edge scientific investigations and technical achievements. A nice example is the application of the traveling salesman problem (TSP) to logistics, genome sequencing, X-ray crystallography, and microchip manufacturing [5]. Many other examples can be found in real-world applications [108]. A nice introduction to combinatorial optimization and complexity can be found in [105, 39].

The noteworthy increase in supercomputer capability has boosted the enthusiasm for solving large-scale combinatorial problems. However, we still need powerful methods to tackle those problems, and then efficient implementations on modern computing systems. We really need to go far beyond brute force or ad hoc (unless ingenious) approaches, as increasingly large instances are under genuine consideration. Figure 2.1 displays an overview of a typical workflow when it comes to solving optimization problems.

Figure 2.1: Typical operations research workflow

Most common combinatorial problems can be written in the following form

\[
\begin{array}{ll}
\text{minimize} & F(x) \\
\text{subject to} & P(x) \\
 & x \in S,
\end{array}
\tag{2.1}
\]

where $F$ is a polynomial, $P(x)$ a predicate, and $S$ the working set, generally $\{0,1\}^n$ or $\mathbb{Z}^n$. The predicate is generally referred to as the feasibility constraint, while $F$ is known as the objective function. In the case of a pure feasibility problem, $F$ can be assumed to be constant. An important class of optimization problems involves a linear objective function and linear constraints, which gives the following generic formulation

\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax \le b \\
 & x \in \mathbb{Z}^p \times \mathbb{R}^{n-p},
\end{array}
\tag{2.2}
\]

where $A \in \mathbb{R}^{m \times n}$, $c \in \mathbb{R}^n$, $b \in \mathbb{R}^m$, and $p \in \{0, 1, \dots, n\}$. If $p = 0$ (resp. $p = n$), then we have a so-called linear program (resp. integer linear program); otherwise we have a mixed integer program. The corresponding acronyms are LP, ILP, and MIP respectively. In most cases, integer variables are binary (0-1) variables. Such variables are generally used to indicate a choice. Besides linear objective functions, quadratic ones are also common, with a quadratic term proportional to $x^T Q x$. We now state some illustrative examples.

Example 1 The Knapsack Problem (KP) [79]. The Knapsack Problem is the problem of choosing a subset of items such that the total profit is maximized without the total weight exceeding a given capacity limit. For each item type $i$, we are allowed to pick either at most one copy (binary knapsack) [35], or at most $m_i$ copies (bounded knapsack), or any quantity (unbounded knapsack). The bounded case may be formulated as follows (2.3):

\[
\begin{array}{lll}
\text{maximize} & \sum_{i=1}^{n} p_i x_i & \\
\text{subject to} & \sum_{i=1}^{n} w_i x_i \le c & \\
 & x_i \le m_i, & i = 1, 2, \dots, n \\
 & x \in \mathbb{N}^n &
\end{array}
\tag{2.3}
\]
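As an illustration of the bounded case, the sketch below solves a small instance by dynamic programming over the capacity. It assumes positive integer weights and an integer capacity, and it simply treats each allowed copy of an item as a separate 0-1 item; the function name and the sample data are hypothetical, and this is one possible method among many rather than a recommended solver.

```python
def bounded_knapsack(profits, weights, bounds, capacity):
    """Maximum total profit when item i may be taken at most bounds[i] times.

    Assumes positive integer weights and a non-negative integer capacity.
    """
    # best[r] = best profit achievable with total weight at most r
    best = [0] * (capacity + 1)
    for p, w, m in zip(profits, weights, bounds):
        # Each of the m allowed copies is handled as one 0-1 item.
        for _ in range(m):
            # Descending order guarantees the current copy is used at most once.
            for r in range(capacity, w - 1, -1):
                best[r] = max(best[r], best[r - w] + p)
    return best[capacity]


if __name__ == "__main__":
    # Hypothetical instance: three item types with bounds 1, 2 and 3.
    print(bounded_knapsack([10, 7, 4], [4, 3, 2], [1, 2, 3], 9))
```

The running time grows with the capacity and with the sum of the bounds, so this only illustrates the model on small instances.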

Example 2 The Traveling Salesman Problem (TSP) [5]. Given a weighted graph, the Traveling Salesman Problem is to find a minimum cost cycle that visits each node exactly once (tour). Without loss of generality, we can assume a positive cost for every arc and a zero cost for every disconnected pair of vertices. We formulate the problem as selecting a set of arcs (i.e. $x_{ij} \in \{0, 1\}$) so as to obtain a tour with minimum cost (2.4). Understanding how the constraints, as formulated, imply a tour is left as an exercise for the reader.

\[
\begin{array}{lll}
\text{minimize} & \sum_{i=1}^{n} \sum_{j=1}^{n} c_{ij} x_{ij} & \\
\text{subject to} & \sum_{j=1}^{n} x_{ij} = 1, & i = 1, \dots, n,\ i \neq j \\
 & \sum_{i=1}^{n} x_{ij} = 1, & j = 1, \dots, n,\ i \neq j \\
 & x \in \{0, 1\}^{n \times n} &
\end{array}
\tag{2.4}
\]

The TSP has an a priori complexity of $n!$. Solving an instance with $n = 25$ by exhaustive enumeration on the current world's fastest supercomputer (TITAN, a Cray XK7) might require 25 years of calculations.
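The order of magnitude of this estimate can be checked with a few lines of arithmetic. The sketch below assumes that one candidate tour is evaluated per elementary operation and that the machine sustains about 2 × 10^16 such operations per second; both are simplifying assumptions made here for illustration, not measured figures.

```python
import math


def brute_force_years(n: int, tours_per_second: float) -> float:
    """Years needed to enumerate n! candidate tours at the given rate."""
    seconds = math.factorial(n) / tours_per_second
    return seconds / (365.25 * 24 * 3600)


if __name__ == "__main__":
    # Assumed sustained rate: 2e16 tour evaluations per second (hypothetical).
    print(f"n = 25: about {brute_force_years(25, 2e16):.1f} years")
    # One more city multiplies the work by 26.
    print(f"n = 26: about {brute_force_years(26, 2e16):.0f} years")
```

Under these assumptions the enumeration for n = 25 takes between 24 and 25 years, and every additional city multiplies the cost by the new value of n.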


Example 3 The Airline Crew Pairing Problem (ACPP) [125]. The objective of the ACPP is to find a minimum cost assignment of flight crews to a given flight schedule. The problem can be formulated as a set partitioning problem (2.5).

\[
\begin{array}{ll}
\text{minimize} & c^T x \\
\text{subject to} & Ax = \mathbf{1} \\
 & x \in \{0, 1\}^n
\end{array}
\tag{2.5}
\]

In equation (2.5), each row of $A$ represents a flight leg, while each column represents a feasible pairing. Thus, $a_{ij}$ tells whether or not flight leg $i$ belongs to pairing $j$.
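The row and column interpretation can be made concrete on a toy instance. The sketch below enumerates all binary vectors and keeps the cheapest one that covers every flight leg exactly once; the matrix, the costs, and the function name are hypothetical, and exhaustive enumeration is used only because the instance is tiny.

```python
from itertools import product


def solve_set_partitioning(A, c):
    """Brute-force min c^T x subject to A x = 1, x binary (tiny instances only)."""
    m, n = len(A), len(c)
    best_cost, best_x = None, None
    for x in product((0, 1), repeat=n):
        # Every flight leg (row) must be covered by exactly one selected pairing.
        covered_once = all(
            sum(A[i][j] * x[j] for j in range(n)) == 1 for i in range(m)
        )
        if covered_once:
            cost = sum(c[j] * x[j] for j in range(n))
            if best_cost is None or cost < best_cost:
                best_cost, best_x = cost, x
    return best_cost, best_x


if __name__ == "__main__":
    # Hypothetical data: 4 flight legs (rows) and 5 candidate pairings (columns).
    A = [
        [1, 0, 1, 0, 1],
        [1, 0, 0, 1, 1],
        [0, 1, 1, 0, 1],
        [0, 1, 0, 1, 1],
    ]
    c = [5, 6, 4, 4, 12]
    print(solve_set_partitioning(A, c))
```

On this instance three selections are feasible (pairings 0 and 1, pairings 2 and 3, or pairing 4 alone), and the second one is the cheapest.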

In practice, the feasibility constraint is often the heart of the problem. This is the case for the TSP, where feasibility itself is a difficult problem (the Hamiltonian cycle problem). However, there are also notorious cases where the dimension of the search space S (i.e. n) is too large to be handled explicitly when evaluating the objective function. This is the case for the ACPP, where the number of valid pairings is too large for all of them to be included in the objective function at once. We clearly see that we can focus either on the constraints or on the components of the objective function. In both cases, the basic idea is to set aside the part that makes the problem difficult, and then reintroduce it progressively following a given strategy. Combined with the well-known branch-and-bound paradigm [26], these two approaches have led to well-studied variants named branch-and-cut [124] and branch-and-price [12] respectively.

Figure 2.2: Branch-and-bound overview

The key ingredient of this connection between discrete and continuous optimization is linear programming (LP). Indeed, applying a linear relaxation to the exact formulation of a combinatorial problem, which means assuming continuous variables in place of integer variables, generally leads to an LP formulation from which lower bounds can be obtained (upper bounds come from feasible guesses, mostly obtained through heuristics). LP is also used to handle the set of constraints in branch-and-cut, or to guide the choice of new components (columns) of the objective function in the branch-and-price scheme. Figure 2.3 depicts the linear relaxation of an IP configuration, while Figure 2.4 provides a sample snapshot of an LP-driven branch-and-bound.
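A minimal sketch of an LP-driven branch-and-bound is given below for the binary knapsack problem. Since that problem is a maximization, the roles are mirrored: the linear relaxation provides an upper bound and the incumbent feasible solution a lower bound. For the knapsack, the relaxation can be solved greedily by profit density, which keeps the sketch free of any LP library; positive weights are assumed, and all names and data are illustrative.

```python
def knapsack_branch_and_bound(profits, weights, capacity):
    """0-1 knapsack by depth-first branch-and-bound with an LP-relaxation bound.

    Assumes strictly positive weights.
    """
    # Sorting by profit density lets the LP relaxation be solved greedily.
    items = sorted(zip(profits, weights), key=lambda pw: pw[0] / pw[1], reverse=True)
    n = len(items)
    best = 0  # incumbent: best feasible profit found so far

    def lp_bound(k, room, profit):
        """Optimal value of the LP relaxation restricted to items k..n-1."""
        for p, w in items[k:]:
            if w <= room:
                room -= w
                profit += p
            else:
                return profit + p * room / w  # fractional part of one item
        return profit

    def explore(k, room, profit):
        nonlocal best
        best = max(best, profit)  # the partial selection is itself feasible
        if k == n or lp_bound(k, room, profit) <= best:
            return  # prune: the relaxation cannot beat the incumbent
        p, w = items[k]
        if w <= room:
            explore(k + 1, room - w, profit + p)  # branch x_k = 1
        explore(k + 1, room, profit)              # branch x_k = 0

    explore(0, capacity, 0)
    return best


if __name__ == "__main__":
    # Hypothetical instance.
    print(knapsack_branch_and_bound([60, 100, 120], [10, 20, 30], 50))
```

The pruning test is where the relaxation pays off: any subtree whose relaxed optimum does not exceed the incumbent is discarded without being enumerated.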

Figure 2.3: Integer programming & LP

Figure 2.4: Branch-and-bound & LP

Linear programming has been intensively studied and has reached a very mature state, including from the computing standpoint. Nowadays, very large-scale LPs can be routinely solved using specialized software packages like CPLEX [133] or MOSEK [135].

Branch-and-bound and its variants can be applied to a mixed integer programming formulation by means of basic techniques like Benders decomposition [17] or Lagrangian relaxation [87]. Figure 2.5 depicts the basic idea behind these two approaches.

Figure 2.5: Benders decomposition & Lagrangian relaxation

The latter is likely to yield non-differentiable optimization (NDO) problems. Several approaches for NDO are described in the literature [65], including an oracle-based approach [7], which we will later describe in detail as it illustrates our major contribution on that topic. Figure 2.6 gives an overview of an oracle-based mechanism.
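To see why Lagrangian relaxation leads to non-differentiable problems, consider relaxing the capacity constraint of a binary knapsack with a multiplier λ ≥ 0. The dual function L(λ) = max over binary x of Σ (p_i − λ w_i) x_i + λ c is piecewise linear and convex, and every value L(λ) is an upper bound on the integer optimum. The sketch below minimizes it with a plain projected subgradient iteration; this is a textbook scheme chosen for illustration, not the oracle-based approach of [7], and the step rule 1/k, the names, and the data are assumptions.

```python
def lagrangian_dual_subgradient(profits, weights, capacity, iterations=200):
    """Best upper bound found while minimizing the Lagrangian dual of a 0-1 knapsack."""
    lam = 0.0
    best_bound = float("inf")
    for k in range(1, iterations + 1):
        # Oracle: the inner problem is separable, take item i iff its
        # reduced profit p_i - lam * w_i is positive.
        x = [1 if p - lam * w > 0 else 0 for p, w in zip(profits, weights)]
        value = lam * capacity + sum(
            (p - lam * w) * xi for p, w, xi in zip(profits, weights, x)
        )
        best_bound = min(best_bound, value)
        # A subgradient of the dual function at lam.
        subgradient = capacity - sum(w * xi for w, xi in zip(weights, x))
        # Projected step with diminishing step size 1/k, keeping lam >= 0.
        lam = max(0.0, lam - subgradient / k)
    return best_bound


if __name__ == "__main__":
    # Hypothetical instance whose integer optimum is 220.
    print(lagrangian_dual_subgradient([60, 100, 120], [10, 20, 30], 50))
```

Each iteration only needs a function value and one subgradient, which is the kind of information an oracle returns in the oracle-based setting mentioned above.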
