In the previous section, we have proposed two greedy algorithms developed for ef- fective kernel mapping. These algorithms focused on a system which consists of one CPU and one GPU only. However, in real systems there are potentially higher num- ber of devices. Moreover, manufacturers come up with new designs and architectures, hence devices differ from one generation to another. In addition, each manufacturer may follow completely different design strategies. Therefore, a system may consist of wide range of devices in terms of architecture, and eventually each architecture will be used for different variety of types. Besides, the computing devices are not necessarily meant to be only CPU and GPU, they can be other accelerators.
Devices Kernels A B C 1 2 3 6 2 3 5 2 3 3 1 1 4 4 1 1 5 4 2 1
Table 3.2: Sample data for multi device mapping. Each column shows the execution time of kernels on the corresponding device.
The improved algorithm proposed in Chapter 3.3.2 can be tweaked to cover sys- tems with multiple computing devices. Since CPU-GPU systems are the most common and easy to integrate pair of devices, the previous sections covered only such systems. However, the target devices in the algorithms can be replaced with different ones. The algorithm itself considers cost of each device as the sum of running time and data movement to that particular device.
The biggest challenge in multi-device mapping is that it is not easy to test on simple benchmarks such as NAS benchmark set used in this proposed work. The main reason is each benchmark implements one type of operation. Therefore, assigning each kernel to different devices is not possible. Because, the tasks share the data, therefore when data is moved to a device, all of the kernels have a tendancy to run on that particular device, instead of moving the whole data to another device. In order to implement the otherwise, a benchmark should consist of unrelated tasks with private data. Therefore, each task may be assigned to a different device, and each device can be utilized.
In this section, the multi-device algorithm will run with a sample benchmark in order to show the benefits of the algorithm. Table 3.2 gives sample data to be used in the benchmark. Note that, for simplicity, the data transfer time from one device to another one will be taken as 2 and the data will be assumed to initially reside on device A. Additionally, Tables 3.3 and 3.4 show the results and mappings for the base algo- rithm and the improved algorithm applied to the sample data. Note that, single device mappings are calculated as the sum of initial data transfer and running times for each kernel. The overall costs of single device mappings are 16, 14 and 13 correspondingly.
Cost
Kernels A B C Decision Cumulative Cost
1 2 3+2=5 6+2=8 A 2
2 3 5+2=7 2+2=4 A 5
3 3 1+2=3 1+2=3 A 8
4 4 1+2=3 1+2=3 B 11
5 5+2=7 2 1+2=3 B 13
Table 3.3: Trace data for running the base algorithm on the sample data given in Ta- ble 3.2. Note that, the data transfer cost for each transfer is two. If the transfer cost is added to the execution latency, the sum operation is shown in the cost column of each device.
only local minimum, it only sums the running time and data transfer time. Therefore, device A dominates the other devices since it has no transfer cost initially, and for the 4thkernel device B becomes more profitable and mapping chooses B as the best device
for 4th and 5th devices. Eventually, the overall running cost becomes 13 for the base algorithm, which is better than any single device mapping (16,14 and 13).
Table 3.4 shows the results for the improved algorithm, where each device is treated as the potential best device. Our approach simulates the future kernels as if that par- ticular device is selected and runs the base algorithm accordingly. This is followed by a comparison among the device costs and the best one is selected. For device A, the algorithm sets data on device A and runs the base algorithm and the cost is found as 13, which is the sum of running costs of kernels 1,2 and 3 for device A plus the data movement to B, and running costs of kernels 4 and 5 for device B. On the other hand, when the improved algorithm is applied to the same data, although device C was not present in the mapping of the base algorithm, due to the improvements brought C dominates the mapping. Moreover, the overall cost becomes 12 and it is better than the base algorithm and any of the single device mappings. This example shows that the critical point problem still exists in multi-device mapping, since improved algorithm generates a better mapping than the base algorithm. To sum up, the proposed greedy algorithms can also be applied to the multi-device systems.
Cost
Kernels A B C Decision Cumulative Cost
1 13 12 13 B 5
2 12 9 7 C 9
3 9 6 3 C 10
4 9 3 1 C 11
5 6 4 1 C 12
Table 3.4: Trace data for running the base algorithm on the sample data given in Ta- ble 3.2. Note that, the data transfer cost for each transfer is two. Cost column shows the decision cost for running the the kernel k on each device.