Analysis Time - Execution Performance - GPU Array Access Auto-Tuning

7.2 Execution Performance

7.2.4 Analysis Time

Finally, we take a look at the time required for the analysis on the GTX 1080.Fig- ure 7.18shows the time required for the profiling. It is easy to see that for the exhaustive search the time is significantly higher than for our predictive profiling. Given the high numbers of configurations for the REYES and KD-Tree benchmarks, it can be estimated that the overall exhaustive profiling for these two benchmarks would require a lot of time. Assuming that compiling a kernel requires approx- imately 3s, on 16cores the compilation allone would require ~15days for REYES and ~44days for the KD-Tree.

Figure 7.19shows the ratio between profiling compared a normal application run. As can be seen, this depends on the number of configurations, while our method requires usually below 10x, for REYES and KD-Tree less than 100x for the profiling while the exhaustive can reach up to 10,000x (for SRAD).

InFigure 7.20we show the measured times for the profiling and analysis of the benchmarks. As stated before, REYES and KD-Tree require too much time for the exhaustive profiling and have therefore been excluded. Except for the Bitonic Sort, our predictive profiling is always significantly faster than the exhaustive search. The reason for this is that the Bitonic Sort only has very few possible configurations, so that the number of configurations hardly differ.

5.65 1 5 4 9 0 .92 23 .24 13 2 .20 13 6.76 5.59 5.66 2 .43 7.55 2 4 .80 ₅₈.20 14 8.69 0.1 1.0 10.0 100.0 1000.0 10000.0 100000.0

Bitonic SRAD Hotspot DPID COMIC REYES KD-Tree

Tim

(s)

Exhaustive Predictive

Figure 7.18:Execution time for the proﬁling. It is easy to see that the predictive

profiling is significantly faster than the exhaustive profiling and can profile all applications within a few minutes.

1 10 100 1,000 10,000

Bitonic SRAD Hotspot DPID COMIC REYES KD-Tree

P ro fili ng / Re fe rence Ex ecutio n Tim e Exhaustive Predictive

Figure 7.19: Ratio between proﬁling and execution time. As can be seen, with

increasing complexity of the benchmark, the profiling time significantly increases, especially for the exhaustive search. It can be seen that our predictive method requires even for the complex KD-Tree benchmark less than 100x compared to a normal application run. The exhaustive search can be multiple orders of magni- tudes slower. Be reminded, the profiling time not only contains the pure execution time, but also the time required to compile the GPU kernels, convert data into other layouts and restoring the input data before starting a kernel in another configuration!

0.01 0.10 1.00 10.00

Bitonic SRAD Hotspot DPID COMIC REYES KD-Tree

Tim

(s)

Exhaustive - Exhaustive Predictive - Exhaustive Predictive - Predictive Figure 7.20:Time required for the analysis of the proﬁling data. The analysis is

quite fast for all methods (compared to the proﬁling), however our proposed method (green) is always the fastest. Further, it can easily be seen that with increasing complexity the execution time rises. The exhaustive analysis requires more time using exhaustive data (blue) than with predictive data (orange) caused by much more data that needs to be loaded and processed.

Empirical Performance Models

Before we discuss the results of our evaluation, we take a look onto work we have conducted to further extend the decision making of MATOG. This chapter contains recent research results that have not made it into the active development of MATOG yet and therefore are not part of the evaluation in the previous chapter. This work was conducted in conjunction with Sandra C. Amend for her Master Thesis [Amend 2017].

So far we have explained our concepts for auto-tuning array layouts, analyzing applications, how to use meta data to predict optimal layouts, using categorical decision models and how all of this is implemented in MATOG. What these model do not provide is an estimate for how long the kernel will execute on a specific hardware. This is an important feature. One use case would be to estimate prior a kernel execution whether converting an array into another layout between two kernel calls would yield enough speed up to compensate for the time necessary for the conversion.Figure 8.1shows in which situations this would yield an improvement and in which it would not. Another application would be data partitioning in multi-heterogeneous-device applications. The problem here is that the heterogeneous devices can require a different amount of time to process the same amount of data. With an accurate prediction model it would be possible to divide the data into exactly the required pieces, so that all devices complete their task at the same time, reducing the synchronization overhead between the devices to an minimum. Our assumption is, that when an application is executed using the same data on the same GPU, the kernel runtimes will be deterministic. Therefore we expect, that it is possible to extract characteristics of the data that enable us to estimate how long a kernel will run on a specific GPU. In this chapter we evaluate whether our automatically gathered meta data supplies the required information.

Already a lot of research has been conducted to find performance models. There are two main methods, analytical and empirical performance models. Analytical methods model the correlation of the computation in relation to the used hardware. To establish these analytical models, a deep knowledge of the algorithm and hardware is necessary [Wolf et al. 2014]. These are usually handcrafted and therefore are not suited for our application, as we have the ambition to have a fully automatic auto-tuner. Empirical performance models use measured performance data, similar to the data that MATOG gathers during its profiling, and then use a method to fit some kind of performance curve into the measured data. Please

Δ(A,B) < convert(A,B) Δ(A,B) > convert(A,B) 0 1 2 3 4 5 0 10 20 30 40 50 60 Exec uti on Ti m e Meta Data

Layout A Layout B Δ(A,B) convert(A,B)

Figure 8.1: Artiﬁcial example for the execution time of two diﬀerent layouts in

dependency of a given meta data. ∆(A,B)is the speed up that could be achieved

if the faster layout would be used andconvert(A,B) is the time necessary for

converting the data. In the left example, the conversion takes more time, so that this would not yield in an signiﬁcant improvement. However, in the right example the conversion would result in a mentionable speed up.

refer toSection 4.5.1for an overview of state-of-the-art techniques in this area. Which technique is suitable to establish these empirical models depends on the application and on which data is available. Approaches that utilize Neural Net-

works (NNs)[Ipek et al. 2005;Lee et al. 2007;Wu et al. 2015] achieve high quality

estimations but require a signiﬁcant number of data samples to train the networks. MATOG does automatically gather data, however, after reducing redundant, constant and linearly dependent information these datasets are usually rather small and usually contain less than 100 entries, which are too few to be used in NNs.

Gaussian Processes (GP)[Rasmussen and Williams 2006] are another method that

can be used with the advantage that these only require very few data samples. Another advantage of GP is that they not only provide a predicted value, but also an error estimate, how certain the model is of the value.

In the following sections we will explain how we use GP to automatically generate performance models, evaluate their accuracy and how these could be used in MATOG. These experiments have been conducted using MATLAB and the Gaussian

Processes for Machine Learning (GPML) Toolbox1to quickly generate results and

evaluate diﬀerent model implementations. 1_{www.gaussianprocess.org/gpml/code}

8.1 Model Training and Prediction Accuracy

In order to train our models, we use the MATOG profiler to gather all necessary data. This data is then extracted from the MATOG database, converted in a format that can be processed by MATLAB and then directly fed into the GPML Toolbox. For the data we use the same filtering as described inSection 6.4, where we re- move all constant and linear dependent entries. We evaluated multiple GP model implementations, using different covariance functions: linear (GP lin.), squared

exponential (GP SE), squared exponential with automatic relevance determina- tion (GP SE + ARD)and combined linear plus squared exponential (GP lin. + SE)

[Rasmussen and Williams 2006]. Further, we compare all results against a normal

linear regression (lin. reg.)to evaluate if the usage of a complex model as GP is

beneﬁcial. For the evaluation we chose the “bound and split” kernel from the REYES (Section 7.1.6) and the ﬁrst main kernel of the KD-Tree benchmark (Sec- tion 7.1.7), both executed on a GTX 680. We chose these kernels, as these are the main kernels of the two most complex benchmarks we have available. Further, these two benchmarks produce the highest amount of meta data.

In document GPU Array Access Auto-Tuning (Page 108-113)