Evaluation and comparability - GPU Array Access Auto-Tuning

A major problem in auto-tuning research is evaluation and comparability. Hardly any auto-tuning project publishes its code. We would have liked to compare MATOG against other auto-tuners [Sung et al. 2012; Kofler et al. 2015; Peng et al. 2016] in our evaluation, but this was not possible because these projects are not publicly available. We encourage researchers to publish their work as open source, as we have done for MATOG from the beginning. Another option is to provide a web platform, as done by, e.g., the 3D Web Reconstruction project⁵ or the DawnCC project⁶, which allows others to use the tools without the code being given away. Whether such a web-based approach is feasible for auto-tuning remains to be evaluated.

⁵ gccvmwebreconstruction.igd.fraunhofer.de
⁶ cuda.dcc.ufmg.br/dawn

Another issue is that there is no standard method or application for evaluating auto-tuners, and given the variety of possible optimizations and methods, one is likely to be difficult to establish in the community. As a result, some authors compare their auto-tuner to hand-optimized code and achieve “only” a speedup of a few percent, while others compare against “naïve” solutions (which are most likely the worst-performing ones they were able to find) and achieve speedups of several orders of magnitude. This is comparable to the “GPUs are 100x faster than CPUs” myth, where an optimized GPU implementation is compared against an unoptimized CPU version [Lee et al. 2010]. Neither variant can be blamed, but this variety does not help in comparing the different approaches. Furthermore, the hardware and software stack used (drivers, CUDA version, operating system, ...) is often omitted, which makes the results even less comparable. There are efforts in the community to change this using benchmark suites such as Rodinia [Che et al. 2009], Parboil [Stratton et al. 2012] or Polybench [Grauer-Gray et al. 2012] and, building upon these, the project of Fursin et al. [2016]. However, none of these approaches yet provides a satisfying solution, as we will lay out in the following sections.
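The two reporting problems named above (baseline-dependent speedups and an omitted hardware/software stack) can be illustrated with a minimal Python sketch. It is not part of MATOG; the function names, the report layout and all runtimes are hypothetical and chosen only for illustration. The GPU name and driver version are read through `nvidia-smi` if it is installed; the CUDA toolkit version is not captured here and would have to be recorded separately (e.g., from `nvcc --version`).

```python
import json
import platform
import shutil
import subprocess


def speedup(baseline_s, tuned_s):
    """Speedup of the tuned version relative to one specific baseline."""
    return baseline_s / tuned_s


def gpu_stack():
    """Return 'name, driver_version' per GPU via nvidia-smi, or [] if unavailable."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,driver_version",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True, timeout=10,
        ).stdout
    except (OSError, subprocess.SubprocessError):
        return []
    return [line.strip() for line in out.splitlines() if line.strip()]


def report(tuned_s, baselines):
    """Bundle the speedup against every named baseline with the test environment."""
    return {
        "tuned_runtime_s": tuned_s,
        "speedup_vs": {name: speedup(t, tuned_s) for name, t in baselines.items()},
        "environment": {
            "os": platform.platform(),
            "python": platform.python_version(),
            "gpus": gpu_stack(),
        },
    }


if __name__ == "__main__":
    # Hypothetical runtimes in seconds, not measurements.
    result = report(tuned_s=0.95,
                    baselines={"hand_optimized": 1.0, "naive": 120.0})
    print(json.dumps(result, indent=2))
```

With these made-up numbers, the same tuned runtime yields a speedup of roughly 5% over the hand-optimized baseline and more than 100x over the naïve one, which is why a speedup is only interpretable together with the named baseline and the recorded environment.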

10.2.1 Benchmark Suites

The applications in the mentioned benchmark suites are usually too simplistic, e.g., SRAD or Hotspot from our evaluation, whose runtime is in the millisecond range. To demonstrate the capabilities of auto-tuners, we believe it is necessary to test them on realistic applications with runtimes of seconds, minutes or even hours. In these benchmarks, most of the time is usually spent on I/O or application setup, and only a very small fraction of the already short execution time is spent on the actual computation. Further, these benchmarks do not come with predefined tests, so it is impossible to reproduce the same results, and no author can be blamed for choosing datasets and parameters that work well with their tool. Another issue is the maintainability of these benchmarks. They often come without a common build chain; instead, every application has its own build procedure and usually requires some adaptation to work on other machines. Tools such as CMake⁷ could be a solution for this, as they would even allow the benchmarks to be used on multiple platforms such as Windows, Linux and macOS without changes (provided the source code does not use platform-dependent functionality). Further, some of the benchmarks even contain programming errors or non-functioning code (Section 7.1). We only looked at some of the benchmarks in the Rodinia suite, but are confident that similar errors could be found in the other benchmarks as well, which raises questions about their quality and usability.
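Why a small compute fraction limits what an auto-tuner can demonstrate can be made concrete with Amdahl's law. The following Python sketch is only an illustration; the function name and the fractions and kernel speedups used are hypothetical values, not measurements from the benchmarks discussed here.

```python
def overall_speedup(compute_fraction, kernel_speedup):
    """Amdahl's law: whole-application speedup when only the compute part
    (compute_fraction of the original runtime) is accelerated by kernel_speedup."""
    if not 0.0 <= compute_fraction <= 1.0:
        raise ValueError("compute_fraction must be in [0, 1]")
    if kernel_speedup <= 0.0:
        raise ValueError("kernel_speedup must be positive")
    untouched = 1.0 - compute_fraction            # I/O, setup, ...
    accelerated = compute_fraction / kernel_speedup
    return 1.0 / (untouched + accelerated)


if __name__ == "__main__":
    # Hypothetical cases: I/O-dominated benchmark vs. compute-dominated application.
    for fraction in (0.05, 0.90):
        s = overall_speedup(fraction, kernel_speedup=10.0)
        print(f"compute fraction {fraction:.2f}: overall speedup {s:.2f}x")
```

Under these assumptions, a 10x faster kernel improves an application that spends 5% of its time computing by only about 1.05x, while an application that spends 90% of its time computing gains about 5.26x.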

Better organized examples for benchmarks can be found in other communities, e.g., the Common Visual Data Foundation [CVDF 2016], which poses explicit challenges with precisely defined task descriptions that specify what has to be completed and how the results are scored. The problem for auto-tuning in such challenges is most likely the scoring, as different hardware could yield different scores (even identical GPU models from different manufacturers can vary in performance because of customizations such as varying clock frequencies or modified cooling systems). Another example is the Middlebury benchmark [Baker et al. 2011], which explicitly contains a training and an evaluation dataset and even allows results to be submitted to the authors’ homepage.

10.2.2 Collective Knowledge

The project of Fursin et al. [2016] (cTuning Foundation) goes in the right direction, as it provides a set of predefined benchmarks, datasets and a repository to store the benchmark results. However, one major problem is the comparability of different hardware and software setups. Everyone uses different hardware, drivers and operating systems, and even small changes (e.g., a newer GPU driver version) can change performance, making it difficult to compare results. The project also stores information about the test setup, but this does not solve the comparability problem. Metrics such as FLOPS are not a good alternative either, as higher FLOPS do not necessarily guarantee shorter execution times. We do not have a solution for this. One idea could be a community-driven, centralized evaluation cluster, where researchers can upload their code and evaluate it on a standardized software/hardware stack with a predefined set of benchmarks, parameters and datasets. This would enable comparability across different approaches, but setting up and maintaining such an infrastructure would be very costly.
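The remark that higher FLOPS do not guarantee shorter execution times can be illustrated with a small Python sketch. The two variants, their operation counts and their runtimes are hypothetical and only serve as an example: one variant performs more floating-point work at a higher rate, the other avoids most of the work.

```python
def gflops(flop_count, runtime_s):
    """Achieved throughput in GFLOPS for a given operation count and runtime."""
    return flop_count / runtime_s / 1e9


# Hypothetical variants solving the same problem (not measured data):
# name -> (floating-point operations executed, runtime in seconds)
variants = {
    "brute_force": (2.0e12, 4.0),   # more work, executed at a high rate
    "pruned":      (0.5e12, 2.0),   # less work, executed at a lower rate
}

best_by_flops = max(variants, key=lambda n: gflops(*variants[n]))
best_by_runtime = min(variants, key=lambda n: variants[n][1])

if __name__ == "__main__":
    for name, (ops, t) in variants.items():
        print(f"{name}: {gflops(ops, t):.0f} GFLOPS, {t:.1f} s")
    print("highest FLOPS:", best_by_flops)
    print("shortest runtime:", best_by_runtime)
```

In this constructed case the variant with the higher FLOPS rate takes twice as long, so ranking by FLOPS and ranking by execution time give opposite answers.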
