Jee Whan Choi

Research Statement

“Any intelligent fool can make things bigger and more complex. It takes a touch of genius - and a lot of courage to move in the opposite direction.” - Albert Einstein

Due to the complexity of hardware-software interactions, predicting how changes to an algorithm or code will impact time and energy is difficult. As a result, there is often a large gap between observed and attainable performance or energy efficiency on modern systems. My research focuses on (a) gaining a better understanding of this relationship through performance and energy models, and (b) finding means of automatically eliminating this “gap” through systematic analysis of architecture bottlenecks. I plan to apply this work to both traditional scientific applications and Big Data mining – my current area of research – for hardware-software co-design.

My research engages a diverse range of computer science disciplines – including computer architecture, compiler optimization, and algorithm analysis – and leverages their wealth of ideas and insights. I have collaborated with scientists from multiple domains to better understand and exploit domain-specific properties, and have written code that consistently outperforms the state of the art. My work has been recognized by the computer science community: my papers have been cited over 670 times, and I have written numerous books and book chapters.

Model-driven Performance and Energy Tuning of Scientific Kernels

The heart of my research is in designing and tuning scientific algorithms. My tuning strategy can be summarized in two steps: (a) identify the general area of limitation (i.e., whether the algorithm is limited by memory, computation, or concurrency) by studying the algorithm, which indicates which algorithms and optimizations are most suitable for a given system; and (b) using a performance model, systematically analyze the code to eliminate architecture-specific bottlenecks. I have implemented several critical scientific kernels that outperform the prior state of the art on numerous HPC architectures.

SpMV on GPU

Scientific kernels on accelerators

The robustness of our tuning approach has been validated on two additional problems – quantum chromodynamics (QCD) [6] and the fast multipole method (FMM) [7,8].

○ QCD is an important method in physics for calculating strong interactions, and our study of its algorithm revealed that it closely resembles a stencil calculation. We chose cache blocking – specifically, 3.5D and 4.5D blocking – to optimize this kernel on the cache-rich Xeon Phi and CPU platforms. We used a performance model based on data movement to approximate the maximum attainable performance and iteratively identified and eliminated architecture-specific bottlenecks (e.g., bank conflicts and TLB misses) to arrive at 90% efficiency, and achieved strong scaling to 128 nodes.

○ FMM performs work-optimal O(n) operations for n-body simulations, and is used in many scientific and engineering problems. Our CPU-GPU hybrid implementation achieved a 2× speedup over the prior state of the art, and our model was successfully used to find the optimal schedule for the different phases of FMM between the two systems. Our model revealed that (a) assigning the flop-intensive phase to the GPU does not necessarily give the best performance, and (b) FMM may become memory-bound before Exascale if current scaling trends hold, despite the general consensus that FMM will scale beyond Exascale.

Our success with SpMV, QCD, and FMM on different hardware architectures demonstrates the utility and robustness of our modeling and tuning approach, particularly as energy and power limitations push systems toward non-conventional and rapidly evolving architectures.

A roofline model of energy

Due to energy and power constraints, energy efficiency is expected to be as important as performance for future software and hardware systems. My work was among the first to address this problem from the perspective of algorithm design.

The energy roofline model [9] explains – in simple, analytic terms accessible to algorithm designers and performance tuners – how the time, energy, and power needed to execute an algorithm relate. The model is grounded in the first principles of algorithm analysis: it considers an algorithm in terms of operations, concurrency, and memory traffic, and a machine in terms of the time and energy cost per operation or per word of communication. From this model, we suggest under what conditions we ought to expect an algorithmic time-energy trade-off, and show how algorithm properties may help inform power management.
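
The relationship between time, energy, and power can be illustrated with a short sketch. It assumes a simplified form of such a model in which computation and memory traffic overlap perfectly in time, per-operation and per-word energy costs add, and a constant power term is charged for the whole run. The names Machine, model_time, model_energy, model_power, time_balance, and energy_balance, and the constant-power term itself, are assumptions made for illustration and are not taken from [9].

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Machine:
    """Hypothetical machine description: cost per operation and per word moved."""
    time_per_op: float      # seconds per arithmetic operation
    time_per_word: float    # seconds per word of memory traffic
    energy_per_op: float    # joules per arithmetic operation
    energy_per_word: float  # joules per word of memory traffic
    constant_power: float   # watts drawn regardless of activity (assumed term)


def model_time(ops: float, words: float, m: Machine) -> float:
    """Execution time, assuming computation and memory traffic overlap fully."""
    return max(ops * m.time_per_op, words * m.time_per_word)


def model_energy(ops: float, words: float, m: Machine) -> float:
    """Energy: per-operation and per-word costs add, plus constant power over time."""
    dynamic = ops * m.energy_per_op + words * m.energy_per_word
    return dynamic + m.constant_power * model_time(ops, words, m)


def model_power(ops: float, words: float, m: Machine) -> float:
    """Average power is modeled energy divided by modeled time."""
    return model_energy(ops, words, m) / model_time(ops, words, m)


def time_balance(m: Machine) -> float:
    """Operations per word at which compute time equals memory time."""
    return m.time_per_word / m.time_per_op


def energy_balance(m: Machine) -> float:
    """Operations per word at which compute energy equals memory energy."""
    return m.energy_per_word / m.energy_per_op
```

In this simplified form, comparing an algorithm's ratio of operations to words moved against the two balance points shows whether time and energy are each dominated by computation or by memory traffic. When the two balance points differ, an algorithm whose ratio falls between them is limited by different resources in time and in energy.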

Informed by our findings, we conducted an extensive empirical study on the latest HPC component systems [10,11]. One interesting insight from our work is that low-power embedded processors can perform as well as high-power accelerators if they are operating under the same power budget, suggesting that future HPC systems may be built from low-power ARM processors, depending on application needs. In fact, Fujitsu recently announced that their Exascale supercomputer will use ARM processors.

Tensor decomposition for Big Data analytics

Many social and scientific domains give rise to data with multi-way relationships that can naturally be represented by tensors, or multi-dimensional arrays. Decomposing – or factoring – tensors can reveal latent information that is otherwise difficult to see within large data sets. Tensor decomposition is rapidly gaining popularity in applications ranging from phenotyping electronic health records (EHR) and detecting network intrusion to compressing scientific data and neural network layers in machine learning. As Big Data grows, more efficient methods for analyzing data on HPC systems will be required, and to that end, I have applied my expertise in modeling and tuning to the following areas:

Optimizing data movement

We have recently conducted an in-depth analysis and tuning of the sparse matricized tensor times Khatri-Rao product (MTTKRP), a key kernel in the canonical polyadic decomposition (CPD) method. We identified a number of key bottlenecks and, using various blocking mechanisms, achieved up to a 3.2× speedup over the state-of-the-art SPLATT tensor library. A critical component of our work is a qualitative performance model that we use to guide our heuristic for choosing the best blocking sizes.
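
For readers unfamiliar with the kernel, the following is a minimal sketch of a sparse MTTKRP for a 3-way tensor stored as a list of coordinates and values. It shows only the plain, unblocked computation, not the tuned implementation described above or the SPLATT data layout. The function name mttkrp_mode0 and the coordinate-list representation are assumptions made for illustration.

```python
from typing import List, Tuple

Nonzero = Tuple[int, int, int, float]  # (i, j, k, value) in coordinate format


def mttkrp_mode0(nonzeros: List[Nonzero], B: List[List[float]],
                 C: List[List[float]], num_rows: int) -> List[List[float]]:
    """Mode-0 MTTKRP for a sparse 3-way tensor.

    Computes M[i][r] = sum over nonzeros (i, j, k) of value * B[j][r] * C[k][r],
    without ever forming the Khatri-Rao product explicitly.
    """
    rank = len(B[0])
    M = [[0.0] * rank for _ in range(num_rows)]
    for i, j, k, value in nonzeros:
        row_b, row_c, row_m = B[j], C[k], M[i]
        for r in range(rank):
            # Each nonzero reads one row of B and one row of C and updates one row of M.
            row_m[r] += value * row_b[r] * row_c[r]
    return M
```

Each nonzero touches one row of each factor matrix and one row of the output, so the order in which nonzeros are visited determines how often those rows are reused from cache. That access pattern is what blocking aims to improve.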

Distributed dense Tucker method

Two key bottlenecks for the distributed dense Tucker decomposition method are calculating the tensor-times-matrix (TTM) products and re-distributing the data between different TTM phases. The sequence of mode traversal determines the total computation load, and the data distribution determines the total communication volume. Prior work proposed heuristics to determine the best sequence and decomposition. Rather than relying on heuristics, we used computation and communication models to study the two metrics in a formal and systematic manner, and designed strategies based on dynamic programming to find the optimal sequence. We achieved up to a 7× speedup over the prior state of the art [13].
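
To illustrate why the mode sequence matters and how dynamic programming applies, the sketch below finds the cheapest order for a single chain of TTMs under a deliberately simplified cost: multiply-add counts only, with every mode multiplied exactly once. It does not model communication or data distribution and is not the algorithm of [13]. The function name best_ttm_order and this cost function are assumptions made for illustration.

```python
from functools import lru_cache
from math import prod
from typing import List, Sequence, Tuple


def best_ttm_order(dims: Sequence[int], ranks: Sequence[int]) -> Tuple[int, List[int]]:
    """Return (minimum multiply-add count, mode order) for one TTM chain.

    The state is a bitmask of modes already multiplied; the tensor shape in a
    state is fully determined by that mask, which is what makes DP applicable.
    """
    n = len(dims)
    full = (1 << n) - 1

    def size(mask: int) -> int:
        # Number of elements once the modes in `mask` have shrunk to their ranks.
        return prod(ranks[m] if (mask >> m) & 1 else dims[m] for m in range(n))

    @lru_cache(maxsize=None)
    def solve(mask: int) -> Tuple[int, Tuple[int, ...]]:
        if mask == full:
            return 0, ()
        best = None
        for m in range(n):
            if (mask >> m) & 1:
                continue
            # TTM along mode m costs ranks[m] multiply-adds per current element.
            step = ranks[m] * size(mask)
            rest_cost, rest_order = solve(mask | (1 << m))
            cand = (step + rest_cost, (m,) + rest_order)
            if best is None or cand < best:
                best = cand
        return best

    cost, order = solve(0)
    return cost, list(order)
```

For a 10 × 20 tensor reduced to ranks 2 and 5, multiplying mode 0 first costs 600 multiply-adds under this count, while multiplying mode 1 first costs 1100. The search visits each subset of modes once rather than every permutation.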

Dense Tucker on a GPU cluster

Given the similarity of TTM to matrix multiplication, we have also explored ways to efficiently utilize a multi-GPU cluster for dense Tucker decomposition. In this work, we focused on improving the scalability of the singular value decomposition (SVD) kernel by using a randomized method, and reduced communication by half by re-formulating the tensor matricization problem into a matrix transpose. We achieved up to a 14.4× speedup using four GPUs over two CPUs, and scaled up to 64 IBM POWER8 nodes. As the next-generation supercomputers Summit and Aurora will be powered by accelerators, efficiently utilizing accelerators on distributed systems will become increasingly important to solving problems in a timely manner.

Model-driven memoization

Similar to dense TTM, the sequence of mode traversal determines the computational load of sparse MTTKRP, and memoizing intermediate computations can impact performance significantly for extremely high-order tensors. In this work, we derive a cost model to determine the optimal number of intermediate MTTKRP results to compute and store in order to minimize time, and we allow users to make space-time trade-offs automatically through a model-driven auto-tuning framework. Our CP algorithm achieves up to an 8× speedup over the SPLATT library for an 85-mode tensor.

Future directions

Collectively, my work is building towards a much larger framework that consists of (a) composable and accurate kernel-level energy and performance models that can be combined to generalize to more complex scientific applications, and (b) a set of tools that automates the systematic tuning method for identifying performance bottlenecks. The modeling component will enable programmers and domain scientists to better understand how their algorithm design decisions impact performance and energy so that they can select the design that best meets their target, whether it be performance, energy, or both. The automated tools will eliminate the gap that exists between how their algorithm should perform and what is actually observed, which is often caused by hardware bottlenecks that require detailed knowledge of the microarchitecture to identify.

I propose the following short- and long-term goals, which will both gradually build towards my vision of a “smart” tuning framework and explore alternative solutions to the overall problem.

Statistical modeling

Many real scientific applications have asynchronous, overlapping phases that are not tractable with simple composable models. A monolithic model based on statistical analysis can be a good alternative: if correctly formulated, it provides not only an easy-to-use black-box approach but also insight into which algorithm properties have the most profound impact on performance. Tensor decomposition would be highly suited for this purpose.

Automated performance tuning

In our previous work, all of our code was hand-tuned. However, because the tuning process is systematic, we can automate it. The process would involve “perturbing” the source code (e.g., removing instructions, limiting data accesses to the L1 cache, or removing dependencies) and then observing the impact on performance. When done in a structured manner using carefully designed perturbations, this lets us automatically detect, and possibly eliminate, these bottlenecks.
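
The detection step can be sketched as follows, assuming the runtimes of the baseline code and of each perturbed variant have already been measured. A perturbation that removes a large fraction of the runtime points at the resource it relieved. The function name rank_bottlenecks, the noise threshold, and the perturbation labels in the usage note are hypothetical and serve only to illustrate the idea. No such tool exists yet.

```python
from typing import Dict, List, Tuple


def rank_bottlenecks(baseline: float, perturbed: Dict[str, float],
                     threshold: float = 0.05) -> List[Tuple[str, float]]:
    """Rank perturbations by the fraction of baseline runtime they remove.

    Perturbations whose gain falls below `threshold` are treated as
    measurement noise and dropped from the result.
    """
    if baseline <= 0:
        raise ValueError("baseline runtime must be positive")
    gains = [(name, (baseline - t) / baseline) for name, t in perturbed.items()]
    significant = [(name, g) for name, g in gains if g >= threshold]
    # Largest gain first: the top entry is the most likely bottleneck.
    return sorted(significant, key=lambda item: item[1], reverse=True)
```

For example, with a baseline of 10.0 seconds and measured variants {"loads_hit_l1": 6.0, "no_dependencies": 8.0, "fewer_instructions": 9.8}, the ranking puts "loads_hit_l1" first and drops "fewer_instructions" as noise, suggesting that memory access is the first bottleneck to examine.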

Performance tuning for distributed systems

Many scientific applications and most Big Data analytics problems involve large amounts of data that do not fit in a single node. By leveraging our experience in distributed applications in scientific computing [6,7] and Big Data analytics [13], we can provide network communication models for different topologies as another piece in our composable modeling framework, and provide a better understanding of how network communication impacts performance.

Tensor framework for Big Data analytics

As a precursor to a more general framework for modeling and auto-tuning, we can conduct a feasibility study using tensor decomposition. By combining our prior work in CPD and Tucker on CPUs and GPUs, we can create a framework where different components can be swapped out to create new algorithms or applications (e.g., Poisson regression or tensor completion), and we can leverage our experience in tuning tensor algorithms to create a set of perturbation tools for auto-tuning.

Algorithm-architecture co-design

Publications

[1] Jee W. Choi, Amik Singh, and Richard W. Vuduc. Model-driven autotuning of sparse matrix-vector multiply on GPUs. In Proceedings of the 15th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP 11), pages 115–126, New York, NY, USA, 2010. ACM.

[2] Richard Vuduc, Aparna Chandramowlishwaran, Jee Choi, Murat Guney, and Aashay Shringarpure. On the limits of GPU acceleration. In Proceedings of the 2nd USENIX conference on Hot topics in parallelism, HotPar’10, pages 13–13, Berkeley, CA, USA, 2010. USENIX Association.

[3] Richard Vuduc and Jee Choi. A brief history and introduction to GPGPU. In Xuan Shi, Volodymyr Kindratenko, and Chaowei Yang, editors, Modern Accelerator Technologies for Geographic Information Science, pages 9–23. Springer US, 2013.

[4] Sam Williams, Nathan Bell, Jee Choi, Michael Garland, Leonid Oliker, and Richard Vuduc. Sparse matrix vector multiplication on multicore and accelerator systems. In Jakub Kurzak, David A. Bader, and Jack Dongarra, editors, Scientific Computing with Multicore Processors and Accelerators. CRC Press, 2010.

[5] Hyesoon Kim, Richard Vuduc, Sara Baghsorkhi, Jee Choi, and Wen-mei Hwu. Performance analysis and tuning for general purpose graphics processing units (GPGPU). Synthesis Lectures on Computer Architecture. Morgan & Claypool Publishers, San Rafael, CA, USA, November 2012.

[6] M. Smelyanskiy, K. Vaidyanathan, Jee Choi, B. Joo, J. Chhugani, M.A. Clark, and P. Dubey. High-performance lattice QCD for multi-core based parallel systems using a cache-friendly hybrid threaded-MPI approach. In High Performance Computing, Networking, Storage and Analysis (SC), 2011 International Conference for, pages 1–10, Nov. 2011.

[7] Aparna Chandramowlishwaran, Jee Whan Choi, Kamesh Madduri, and Richard Vuduc. Towards a communication optimal fast multipole method and its implications for exascale. In Proc. ACM Symp. Parallel Algorithms and Architectures (SPAA), Pittsburgh, PA, USA, June 2012. Brief announcement.

[8] Jee Choi, Aparna Chandramowlishwaran, Kamesh Madduri, and Richard Vuduc. A CPU:GPU hybrid implementation and model-driven scheduling of the fast multipole method. In Proceedings of Workshop on General Purpose Processing Using GPUs, GPGPU-7, pages 64:64–64:71, New York, NY, USA, 2014. ACM.

[9] Jee Choi, Richard Vuduc, Robert Fowler, and Dan Bedard. A roofline model of energy. In Proceedings of the 27th IEEE International Parallel and Distributed Processing Symposium (IPDPS 13), 2013.

[10] Jee Choi, Marat Dukhan, Xing Liu, and Richard Vuduc. Algorithmic time, energy, and power on candidate HPC compute building blocks. In Proceedings of the 28th IEEE International Parallel and Distributed Processing Symposium (IPDPS 14), 2014.

[11] J. W. Choi and R. W. Vuduc. Analyzing the Energy Efficiency of the Fast Multipole Method Using a DVFS-Aware Energy Model. In 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pages 79–88, May 2016.

[12] Jee Whan Choi and Richard W. Vuduc. How much (execution) time and energy does my algorithm cost? XRDS, 19(3):49–51, March 2013.

[13] Venkatesan Chakaravarthy, Jee Choi, Xing Liu, Douglas Joseph, Prakash Murali, Yogish Sabharwal, and Dheeraj Sreedhar. On Optimizing Distributed Tucker Decomposition for Dense Tensors. In Proceedings of the 31st IEEE International Parallel and Distributed Processing Symposium (IPDPS’17), 2017.