5.5 Performance Evaluation
5.5.4 Preemption Latency
The preemption latency, as described in Figure 35, is defined as the time between the arrival of a preemption request and the beginning of context (and communication data) extraction. The total preemption time is defined as the sum of this latency and the context switch time. The preemption latency in our system has two components. The first component is caused by the checkpoint architecture: before context extraction from a hardware task can start, the execution flow has to reach a checkpoint state. This latency depends on the implementation of the scan-chain structures in the task. The second component is due to ongoing communication flows, which prevent a hardware task from being context switched. In CS, this component appears as a delay of the context switch until no communication data remains stored in the channels. In contrast, CSComm removes this delay since it can handle the situation where the communication channels are still busy when a preemption request is received.
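The two definitions above can be written out as a minimal Python sketch. The function names and the timestamps are hypothetical and serve only as illustration. The timestamps are chosen so that the result reproduces the averaged CS figures for ADPCM in Table 10, with the context switch time taken as the total minus the latency.

```python
def preemption_latency(request_cycle: int, extraction_start_cycle: int) -> int:
    """Cycles between the preemption request and the start of context extraction."""
    if extraction_start_cycle < request_cycle:
        raise ValueError("extraction cannot start before the request arrives")
    return extraction_start_cycle - request_cycle


def total_preemption_time(latency: int, context_switch_cycles: int) -> int:
    """Total preemption time: latency plus context switch time."""
    return latency + context_switch_cycles


# Hypothetical cycle counts, not measured values.
request_cycle = 1_000
extraction_start_cycle = 1_147
context_switch_cycles = 2_325

latency = preemption_latency(request_cycle, extraction_start_cycle)
total = total_preemption_time(latency, context_switch_cycles)
print(latency, total)  # 147 2472
```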
Table 10 compares the average preemption latency of CS and CSComm on the ZC706. It is worth mentioning that our implementation of CS manipulated the input data flow to cut the delay due to communication (cf. Figure 17). The input data flow is stopped when a preemption request is given so that no new data can arrive at the communication buffers, as proposed in [Vu+16] and presented in Section 5.3.3.2. On the output side, output data were transferred to the FIFO of the software task as soon as they were produced, until the context extraction began. This was possible due to the large FIFOs of the software task in the experiments. This condition favors the preemption latency of CS since, normally, output data can only be transferred outside the FPGA if the receiver is ready (enough space in the FIFO of the successor task). Despite this, the latencies of the benchmark applications shown in the table are still significant, particularly for communication-intensive applications such as BLOWFISH and SHA. Even for applications with less intensive communication, the latencies were 147 cycles or more. This shows the penalty caused by communication.
With our solution in CSComm, the preemption latency was significantly reduced to 56 cycles or less for all benchmark applications. This value reflects the initial latency caused by the checkpoint architecture, since preemption requests are no longer stalled when the preempted task has ongoing communication flows. This result shows the evident benefit of our solution: while maintaining the consistency of communication data in a preempted task, it also significantly reduces the latency of a hardware context switch operation.
Alongside the latency and the total preemption time, Table 10 also presents the ratio between them. This ratio differs substantially between CS and CSComm, and it correlates with the responsiveness of the system to preemption requests.
Table 10: Comparison of average preemption latency (FPGA cycles) between CS and CSComm in ZC706
App        CS                                      CSComm
           Latency   Total Preemption   Ratio      Latency   Total Preemption   Ratio
adpcm          147               2472    5.93 %          6               2387   0.23 %
aes            148               3143    4.71 %         17               3066   0.56 %
blowfish     20017              24847   80.56 %         56               5177   1.08 %
gsm            178               1256   14.16 %         14               1168   1.16 %
idct           180               1039   17.3 %          10               1050   0.93 %
motion         876               9889    8.86 %          7               9278   0.07 %
sha          10153              10971   92.54 %          7               1467   0.51 %
How fast hardware tasks in reconfigurable architectures react to preemption requests from the scheduler is often important in a system with strict timing constraints. When the latency ratio of a hardware context switch is high, the system has low responsiveness to preemption requests: CS spends more time waiting for the communication channels to drain than performing the context extraction and restoration. In Table 10, the latency of CS took up to 92.54 % of the total preemption time, whereas CSComm reduced the latency to 1.16 % or less for the benchmark applications. This result means that our communication management provides high system responsiveness in a hardware context switch.
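As an illustration, the latency ratio can be recomputed from the averaged latency and total preemption time in Table 10. The following is a minimal sketch, not part of the measurement setup, and the names `latency_ratio_percent` and `worst_case` are hypothetical. Because it works from the rounded averages in the table, the recomputed ratios may differ by a few hundredths of a percent from the Ratio column.

```python
# (latency, total preemption time) in FPGA cycles, copied from Table 10.
CS = {
    "adpcm": (147, 2472), "aes": (148, 3143), "blowfish": (20017, 24847),
    "gsm": (178, 1256), "idct": (180, 1039), "motion": (876, 9889),
    "sha": (10153, 10971),
}
CSCOMM = {
    "adpcm": (6, 2387), "aes": (17, 3066), "blowfish": (56, 5177),
    "gsm": (14, 1168), "idct": (10, 1050), "motion": (7, 9278),
    "sha": (7, 1467),
}


def latency_ratio_percent(latency: int, total: int) -> float:
    """Share of the total preemption time spent waiting, in percent."""
    return 100.0 * latency / total


def worst_case(table: dict) -> tuple:
    """Return the application with the highest latency ratio and that ratio."""
    app = max(table, key=lambda name: latency_ratio_percent(*table[name]))
    return app, latency_ratio_percent(*table[app])


cs_app, cs_ratio = worst_case(CS)
cscomm_app, cscomm_ratio = worst_case(CSCOMM)
print(f"CS worst case:     {cs_app} ({cs_ratio:.2f} %)")
print(f"CSComm worst case: {cscomm_app} ({cscomm_ratio:.2f} %)")
```

A high ratio means that most of the preemption time is spent waiting rather than extracting and restoring context, which is the situation the paragraph above describes for CS.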
With the reduction of preemption latency in CSComm, we also obtained a shorter total preemption time for most of our benchmark applications. The differences between CS and CSComm are most pronounced for BLOWFISH and SHA. The overhead caused by extracting and restoring the I/O communication data became negligible thanks to the reduction in preemption latency. Nonetheless, the overhead in data footprint is still higher for IDCT due to its small context size and its less intensive data exchange. Similar results from the evaluation on A5SOC are presented in Table 11.
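The comparison of total preemption times can be checked directly against Table 10. The sketch below uses the averaged totals from the table; the names `TOTAL_CYCLES` and `cycles_saved` are hypothetical and chosen only for illustration.

```python
# Average total preemption time in FPGA cycles, copied from Table 10: (CS, CSComm).
TOTAL_CYCLES = {
    "adpcm": (2472, 2387), "aes": (3143, 3066), "blowfish": (24847, 5177),
    "gsm": (1256, 1168), "idct": (1039, 1050), "motion": (9889, 9278),
    "sha": (10971, 1467),
}


def cycles_saved(cs_total: int, cscomm_total: int) -> int:
    """Positive when CSComm is faster than CS, negative when it is slower."""
    return cs_total - cscomm_total


savings = {app: cycles_saved(cs, cc) for app, (cs, cc) in TOTAL_CYCLES.items()}
slower_with_cscomm = [app for app, saved in savings.items() if saved < 0]
largest_gains = sorted(savings, key=savings.get, reverse=True)[:2]

print(savings)
print("slower with CSComm:", slower_with_cscomm)  # ['idct']
print("largest gains:", largest_gains)            # ['blowfish', 'sha']
```

With these figures, IDCT is the only application whose total preemption time grows under CSComm (by 11 cycles), while BLOWFISH and SHA show the largest reductions.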
Finally, we would also like to highlight the predictability of preemption offered by our solution. Figure 37 shows the preemption time as box-and-whisker plots. The box covers the middle 50 % of the data, and the whiskers represent the maximum and minimum values in our measurements. A wide range of values in a plot indicates a high variation in the preemption time. Not only does our communication solution reduce the average preemption latency, it also improves the predictability of the total preemption time for most of the benchmark applications. Again, these