6.1.2 Zero Overhead Linux
6.1.2.1 ZOL Mode
In normal operation, the TILE-Gx runs a non-real-time micro Linux kernel, and every core is subject to standard Linux interrupts. One of the more interesting and useful modes of operation of the TILE-Gx is the Zero Overhead Linux (ZOL) mode. In ZOL mode, the user can specify a subset of tiles (i.e. cores), each of which runs a single user-space task without incurring any Linux system overheads or interrupts[4]. The ZOL mode allows for near real-time performance and removes the major variations in execution time. This matters for the WPU, which must perform fast and stable pixel processing, and ZOL is therefore the mode chosen to test the TILE-Gx.
6.1.2.2 Comparison of non-ZOL and ZOL performance
Figure 6.3 presents histograms of the execution time of the TILE-Gx performing an image calibration followed by CoG calculations (see section 5.2.3.1) using a 500×500 WFS detector. These calculations are repeated 10^6 times. The calculation is performed both in ZOL mode and in normal operation (i.e. non-ZOL). In normal operation the pthreads implementation is used, while in ZOL mode threading is performed using the TILE-Gx specific libraries.
The ZOL mode not only has a lower mean execution time (298 µs vs approximately 900 µs), it also reduces the variation in execution time, referred to as jitter and calculated here as the standard deviation.
Figure 6.3: Comparison between non-ZOL and ZOL modes of operation and its effect on the wavefront processing time and stability. (a) Non-ZOL and (b) ZOL
Table 6.3 summarises the main results obtained in figure 6.3 and further illustrates how performance can be increased by using the ZOL mode. We observe a strong reduction in mean execution time (a factor of 3) and a reduction in standard deviation by a factor of roughly 74 (from 98.00 µs to 1.33 µs).
Table 6.3: Summary of the main quantities obtained in figure 6.3 comparing calculation times using ZOL and non-ZOL modes. Values are given in µs.
Mode       mean     σ       min      max       range
Non-ZOL    893.66   98.00   645.34   1430.21   784.86
ZOL        298.72   1.33    292.72   304.252   12.151
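As a quick check of the improvement factors, the ratios can be worked out directly from the values in Table 6.3. The symbols \bar{t} (mean) and \sigma (standard deviation) are notation introduced here for convenience.

```latex
\frac{\bar{t}_{\text{non-ZOL}}}{\bar{t}_{\text{ZOL}}}
  = \frac{893.66}{298.72} \approx 2.99,
\qquad
\frac{\sigma_{\text{non-ZOL}}}{\sigma_{\text{ZOL}}}
  = \frac{98.00}{1.33} \approx 73.7
```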
For this test we used a detector of 500×500 pixels. Not all WFS detectors for future AO systems will be this size, so it is important to understand how the ZOL and non-ZOL modes scale with detector size. Figure 6.4 shows the execution time of the image calibration and CoG calculations performed by the WPU for both ZOL (blue) and non-ZOL (red) modes. Each time measurement is repeated 10^6 times per detector size. For each data point, the plot also shows the standard deviation (black) and the total measured range (green).
Figure 6.4: Comparison between non-ZOL and ZOL modes of operation for the scaling of the detector size (the full detector size being N×N). Red is non-ZOL results and blue is ZOL results. The vertical bars represent the standard deviation (black lines) and the total measured range (green lines).
Table 6.4 further illustrates how performance can be increased by using the ZOL mode across multiple problem sizes. We observe a strong reduction in mean execution time and a reduction in standard deviation, which is more pronounced for large system sizes. The variation in execution time for the ZOL mode is very stable, staying at the level of 1.3 µs regardless of the detector size. The ZOL mode therefore offers both better mean performance and better stability (determinism).
Table 6.4: Summary of the main quantities obtained in figure 6.4 comparing calculation times using ZOL and non-ZOL modes. Values are given in µs.
size      ZOL                           non-ZOL
(N×N)     mean      std     range       mean      std      range
100       41.163    1.312   22.685      519.61    68.338   576.048
200       74.152    1.335   13.9        645.551   70.769   645.809
300       131.252   1.323   11.743      703.694   70.465   597.897
400       204.158   1.268   12.008      785.256   84.533   620.083
500       298.007   1.329   12.151      893.657   98.006   784.862
The significant reduction of the mean and jitter in ZOL mode can come from multiple sources. Using the ZOL mode, we reduce the operating system (OS) overheads and stop the OS from interrupting calculations. We believe, however, that the major improvements in performance come from using the hardware specific libraries and not only from using the ZOL mode. We make this assumption because moving from the standard libraries to the ZOL mode and proprietary libraries shifts the entire distribution. If this were only an effect of removing the OS activity and interrupts, we would likely see a reduction in the outliers but not a shift of the entire distribution.
It is unlikely that the impact of the OS would be seen at every iteration; in other words, the impact of the OS is intermittent rather than constant. A more likely scenario is that the OS affects only a subset of iterations and therefore creates outliers. As we see a shift in the entire distribution and not only in the outliers, we believe that the improved performance is due to a better optimised algorithm provided by a different implementation of the desired calculation.
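The distinction drawn above between a shift of the whole distribution and a change in the outliers can be checked numerically by comparing a central percentile (the median) with an extreme one. The following is a minimal sketch of that idea, not the analysis used for the figures; the function name `percentile` and the nearest-rank definition are choices made here for illustration.

```c
#include <stdlib.h>
#include <string.h>

/* Ordering function for qsort on doubles. */
static int cmp_double(const void *a, const void *b)
{
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

/* Nearest-rank percentile of n timing samples, with p in 1..100.
 * Returns -1.0 on invalid input or allocation failure.
 * The input array is left untouched (a sorted copy is used). */
double percentile(const double *t, size_t n, unsigned p)
{
    if (t == NULL || n == 0 || p == 0 || p > 100)
        return -1.0;

    double *copy = malloc(n * sizeof *copy);
    if (copy == NULL)
        return -1.0;
    memcpy(copy, t, n * sizeof *copy);
    qsort(copy, n, sizeof *copy, cmp_double);

    /* rank = ceil(p/100 * n), computed in integer arithmetic */
    size_t rank = ((size_t)p * n + 99) / 100;
    double value = copy[rank - 1];
    free(copy);
    return value;
}
```

If only OS interruptions were removed, one would expect the high percentiles (e.g. `percentile(t, n, 99)`) to drop while the median (`percentile(t, n, 50)`) stays put; a change in the median indicates that the bulk of the distribution has moved.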
The hardware specific libraries are optimised specifically for the TILE-Gx and provide the main performance improvements. Since it is not possible to use standard C/C++ libraries in the ZOL mode, proprietary libraries must be used. This could potentially increase development time but ensures that we achieve the very best possible performance. For the rest of the chapter we will use both the vendor-supplied hardware specific libraries and the ZOL mode, unless otherwise mentioned.
6.1.2.3 ZOL mode limitations
When using ZOL mode, the TILE-Gx does not allow certain functions to be used. These are typically system-level functions such as printing to screen (printf and cout) or debugging methods, which can make code difficult to debug. For this reason the TILE-Gx comes with the MDE to help development and reduce the time to market of products.
Shared memory
The TILE-Gx libraries include a proprietary implementation of a multi-threading library. It has been developed to make the best use of the resources available on the TILE-Gx and is recommended over the standard C/C++ libraries. This library uses interfaces very similar to those of pthreads and allows for simple portability of code, as shown in listing 6.1. The listing compares the TILE-Gx specific call with its equivalent in pthreads, the standard multi-threading library available in C. It shows the similarity between the two libraries' interfaces and how code can be developed to be portable without any major changes to the code structure.
Listing 6.1: Code example in C illustrating the portability of software specifically written for the TILE-Gx to another platform using standard multi-threading libraries.
    #ifdef TILEGX
    tmc_spin_mutex_spin_lock(&mutex);  /* TILE-Gx library; mutex is a tmc_spin_mutex_t */
    #else
    pthread_mutex_lock(&mutex);        /* standard pthreads; mutex is a pthread_mutex_t */
    #endif
Another important aspect of the ZOL mode is that each thread is a single user space task with no access to the global memory. For example, using malloc to allocate memory will work within a single thread, but this memory cannot be accessed by other threads. To allow threads to access and share memory, the memory has to be allocated using the TILE-Gx libraries and has to be flagged as shared memory; this is shown in listing 6.2.
Listing 6.2: Example of C code to allocate memory that can be accessed and shared by all threads.
    tmc_alloc_t alloc;
    tmc_alloc_set_shared(&alloc);
    int* data = tmc_alloc_map(&alloc, length * sizeof(int));
    // do something
    // free memory
    tmc_alloc_unmap(data, length * sizeof(int));
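The TILE-Gx calls above cannot be exercised off-target. As a loose analogy on a standard Linux system, the sketch below shows the same behaviour using separate processes: a buffer obtained with malloc is private to each task, whereas a mapping explicitly flagged as shared is visible to both. The use of fork and mmap here is an assumption made purely for illustration and is not part of the TILE-Gx implementation; the function name is hypothetical.

```c
#ifndef _DEFAULT_SOURCE
#define _DEFAULT_SOURCE   /* exposes MAP_ANONYMOUS on glibc */
#endif
#include <stdlib.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* A child task writes 42 into a shared mapping and into a private
 * malloc'd buffer; the values the parent observes afterwards are
 * returned through the two output pointers.
 * Returns 0 on success, -1 on failure. */
int shared_vs_private_demo(int *shared_seen, int *private_seen)
{
    int *shared = mmap(NULL, sizeof *shared, PROT_READ | PROT_WRITE,
                       MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    if (shared == MAP_FAILED)
        return -1;

    int *private_buf = malloc(sizeof *private_buf);
    if (private_buf == NULL) {
        munmap(shared, sizeof *shared);
        return -1;
    }
    *shared = 0;
    *private_buf = 0;

    int rc = -1;
    pid_t pid = fork();
    if (pid == 0) {              /* child: write to both buffers */
        *shared = 42;
        *private_buf = 42;
        _exit(0);
    }
    if (pid > 0) {               /* parent: wait, then read back */
        int status;
        if (waitpid(pid, &status, 0) == pid) {
            *shared_seen = *shared;
            *private_seen = *private_buf;
            rc = 0;
        }
    }

    free(private_buf);
    munmap(shared, sizeof *shared);
    return rc;
}
```

After a successful call the parent sees the child's write only in the shared mapping; the private buffer still holds its initial value.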
Debuggers
The profiler, for example, gives information on bottlenecks or inefficiencies in the synchronization between threads and helps the developer to optimise complex multi-threaded code. Both the debugger and the profiler typically interact with the OS and for this reason cannot be used in ZOL mode. This adds complexity during code development, since other standard debugging techniques (such as printing variables to the screen) are also unavailable. The standard debugger was used during code development but was not included during the actual TILE-Gx performance tests.
Timings
It is crucial in our application to be able to measure the execution time of sections of code accurately. In standard C/C++, the profiler is typically used to measure times. However, it adds unwanted overheads that degrade the overall timing accuracy. Another possibility is to use the Linux clock and take the difference between the time before and after the section of code of interest. This method has been shown to be accurate but unfortunately does not work in ZOL mode because it needs access to system-level functions. The MDE also offers multiple tools to measure performance in ZOL mode. Several techniques are offered, though they all rely on counting clock cycles between two points in the code.
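The Linux-clock approach described above (usable in non-ZOL mode only) can be sketched as follows. The text does not state which clock was used; CLOCK_MONOTONIC is an assumption made here, and `time_section_us` and `example_workload` are hypothetical names.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* for clock_gettime */
#endif
#include <time.h>

/* Hypothetical stand-in for the section of code being timed:
 * sums the first n integers, where arg points to n. */
void example_workload(void *arg)
{
    volatile unsigned long sink = 0;
    unsigned long n = *(unsigned long *)arg;
    for (unsigned long i = 0; i < n; ++i)
        sink += i;
}

/* Elapsed time of fn(arg) in microseconds, measured as the
 * difference between two clock readings taken before and after
 * the call. Returns -1.0 if the clock cannot be read. */
double time_section_us(void (*fn)(void *), void *arg)
{
    struct timespec t0, t1;
    if (clock_gettime(CLOCK_MONOTONIC, &t0) != 0)
        return -1.0;
    fn(arg);
    if (clock_gettime(CLOCK_MONOTONIC, &t1) != 0)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) * 1e6
         + (double)(t1.tv_nsec - t0.tv_nsec) / 1e3;
}
```

Calling `time_section_us(example_workload, &n)` repeatedly and collecting the results gives the kind of sample set from which a mean, standard deviation and range can be computed.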
Because of the above limitations, we have decided to measure performance using the simplest instance of counting clock cycles. This method is illustrated in code listing 6.3. This method requires the specified clock frequency to be accurately known to give an absolute time measurement. We believe that this limitation is in reality negligible, especially when comparing relative performance. We have compared this method to using the Linux clock in non-ZOL mode and have found them to give identical results.
Listing 6.3: Code example to calculate execution time of a section of code based on the CPU clock frequency.
    start_count = get_cycle_count();
    // Do stuff
    some_function();
    end_count = get_cycle_count();
    clock_frequency = tmc_perf_get_cpu_speed();
    // cast to double so that the division is not truncated to an integer
    time = (double)(end_count - start_count) / (double)clock_frequency;
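The conversion from cycle counts to time deserves some care, because dividing two integer quantities in C truncates towards zero. The helper below is a minimal sketch of the conversion, assuming 64-bit unsigned cycle counts and a clock frequency expressed in Hz; `cycles_to_us` is a hypothetical name and the frequency in the usage note is an arbitrary example value.

```c
#include <stdint.h>

/* Convert two cycle-counter readings into an elapsed time in
 * microseconds. hz is the clock frequency in Hz.
 * The unsigned subtraction also gives the correct difference if
 * the counter wrapped around once between the two readings.
 * Returns -1.0 if hz is zero. */
double cycles_to_us(uint64_t start, uint64_t end, uint64_t hz)
{
    if (hz == 0)
        return -1.0;
    uint64_t cycles = end - start;
    return (double)cycles * 1e6 / (double)hz;
}
```

For example, 1200 cycles at an assumed 1.2 GHz clock correspond to 1 µs, whereas the purely integer expression `1200 / 1200000000` evaluates to 0.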