Original benchmarks from various benchmark suites are not optimized. In this case, they manually tuned as well to compare the proposed architecture and configuration framework to an ideal case. For fair comparison, the same algorithm within the benchmark is run on both accelerators. Training
is therefore done to optimize all parameters off-line using OpenTuner [13]. HeteroMap’s output
is compared with an ideal output that manually optimizes by running all possible configurations. Percentage accuracies are found by comparing the integer outputs (constituting choice selections) of the learners. Accuracy is measured by finding the percentage difference acquired performance using the proposed predictors, to the ideal case that optimizes all M variables. Target baselines are also taken from multicore-only and GPU-only runs. Different learners, namely regression and adaptive libraries, are also compared with HeteroMap in terms of accuracy and overheads. Completion times are compared for all benchmarks. Synthetically generated graphs for training the automated learners are depicted in Table6.2.2.
Graphs that are larger than the accelerator’s main memory size are broken into chunks and
processed one by one spatio-temporally using the Stinger framework [21]. To maintain fairness
between accelerators, memory transfer times are not included in the completion time. Thus, only the time spent on the accelerator is measured, where the overhead of HeteroMap is added to the completion time. Energy numbers are also compared to allow the framework to utilize it as a
metric. Power measurements are acquired using micsmc [36] and powerstat [37] utilities. Core
38
Evaluation
This section first evaluates HeteroMap by selecting an optimal learning model, and then compares to multicore-only and GPU-only baselines. Primary comparisons and analysis are done using the Xeon Phi and GTX-750Ti GPU setup, however, comparisons are also done on a stronger 40-core Intel Xeon E5-2650 v3 with 1 TB memory and a NVidia GTX-970 GPU. In addition to the primary (GTX-750TI, Xeon Phi) configuration, the following accelerator combinations are analyzed: (GTX- 970, Xeon Phi), (GTX-750TI, CPU-40-Core), and (GTX-970, CPU-40-Core). Further analysis is done in terms of model overhead and energy objectives. The various automated performance predictors are also compared with a baseline which optimizes all choices with no learner overheads (marked as ideal).
7.1
Selecting a Learning Model
It is important to understand whether different learners in HeteroMap are optimal for the given choice space. We therefore take different parallel learning algorithms for comparison. Multiple
40 Table 7.1.1: Learning Model Strategies. Speedup shown over the GTX-750 GPU.
Learner SpeedUp Accuracy Overhead
(%) (%) (ms) Decision Tree 28 86.2 0.10 Linear Regression 6 50.1 0.05 Multi Regression 27 85.4 4.11 Adaptive Library[40] 8 56.5 0.17 Deep.16[28,41] 11 59.3 1.52 Deep.32 22 68.4 2.52 Deep.64 26 82.2 3.01 Deep.128 31 90.5 3.48 Deep.256 30 92.9 6.39
Non-Linear Regression (fitted equations from Section5.5) and Decision Trees (IF-ELSE systems
using thresholds from Section4) are thus compared. A simple linear regression is also trained and
compared. XAPP [33] uses regression with more than 7 variables, similar to the one evaluated
in this paper. Rinnegan [40] uses a performance model adaptive library scheme, which profiles
program performance and then uses a simple model equation to predict performance. The equation’s output is directly proportional to only the data movement and accelerator utilization parameters given by a programmer/profiler. Deep learners are compared using various model sizes (explained
in Section5.4). However, as they do not discuss details regarding their learning models, they cannot
be directly compared against, and hence these two schemes are compared separately. Automated decision trees consist of 64 variables, with each level containing a binary threshold generated via training data. All are trained with the same amount of training data/time used for the proposed learners. All learners are parallelized on the CPU as well. Geomean completion times are taken for all benchmark-input combinations, and the speedup of each performance predictor is shown over the GPU. Evaluations and accuracy are computed when optimizing for performance objectives only, and are computed by comparing to the ideal case
Table7.1.1shows that the adaptive library and linear regression paradigms do not perform well for our setup. This happens because of non-linear variations associated with graph benchmark-input combinations and multi-accelerator architecture choices. Regression does perform well enough, and results in a higher overhead, as complex equations are required to maintain accuracy. The decision
tree model from Section4provides low overhead, but does not provide a comparable speedup to
the best deep learning model. The deep learning model exhibits a larger overhead, with higher classification accuracy. Larger deep learners follow quadratic trends in overheads and classification accuracy. This raises the acquired speedup to a certain extent, after which returns diminish due to the increasing overhead. At the end, the deep learning model performs the best in terms of performance, and hence it is selected for analysis. Overall, a speedup of 31% is acquired using the
deep learning model as shown in Table7.1.1, with a classification accuracy of 90.5% and overhead
of 3.48ms. Hence, all further evaluations are done using the deep learning model with 128 neurons.