for the KNL case where there are more threads on comparatively slower cores, this serial addition is a processing bottleneck and a large source of extra latency. A solution that I have developed is a branching algorithm which allows groups of threads to add their partial DM vectors and so each group can work in parallel. A group will be defined by a spin-lock and a thread-barrier. Within the group a thread will get the lock whilst it copies its partial DM vector into a temporary output array and once each thread has copied its partial vector it waits at the barrier for the other threads in the group to finish. At the completion of a group’s work, one thread from each group will take ownership of the temporary output array and move on to the next group. This can be seen in Figure 3.8 where each group adds its partial DM vector into the red box before that moves down to the next group where the process is repeated until the final DM command vector results.
This can help reduce the computational latency of an ELT-scale SCAO system by up to 200 µs or 18 %. The benefit of the algorithm is reduced in the pipe-lined case, where it can help to reduce the latency by up to 20 µs or 5 %. This difference is likely due to the individual threads finishing their execution at slightly different times when they receive pipe-lined pixels. For an optimal unequal subaperture allocation as described in Section 3.6.2.1, however, this branching algorithm would reduce the waiting time for each thread.
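The grouped reduction described above can be sketched with POSIX threads as follows. This is a minimal illustration under stated assumptions, not the RTC implementation: the names (`group_t`, `worker`, `branching_sum`, `partial`), the fixed 4->2->2 layout and the short vector length are all hypothetical, and it assumes a Linux system where `pthread_spinlock_t` and `pthread_barrier_t` are available.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdint.h>
#include <string.h>

#define N_THREADS  16   /* illustrative: the 4->2->2 example of Figure 3.8 */
#define N_STAGES   3
#define MAX_GROUPS 4    /* groups in the first (widest) stage */
#define VEC_LEN    8    /* hypothetical; a real DM vector is far longer */

static const int group_size[N_STAGES] = {4, 2, 2};

typedef struct {
    pthread_spinlock_t lock;     /* serialises additions within the group */
    pthread_barrier_t  barrier;  /* released once every member has added */
    float sum[VEC_LEN];          /* the group's temporary output array */
} group_t;

static group_t groups[N_STAGES][MAX_GROUPS];
float partial[N_THREADS][VEC_LEN];   /* one partial DM vector per thread */

static void *worker(void *arg)
{
    int id = (int)(intptr_t)arg;     /* participant index in the current stage */
    const float *src = partial[id];

    for (int s = 0; s < N_STAGES; s++) {
        group_t *g = &groups[s][id / group_size[s]];

        pthread_spin_lock(&g->lock);
        for (int i = 0; i < VEC_LEN; i++)
            g->sum[i] += src[i];
        pthread_spin_unlock(&g->lock);

        /* Exactly one waiter per barrier receives the SERIAL_THREAD value;
           that thread carries the group's sum into the next stage. */
        if (pthread_barrier_wait(&g->barrier) != PTHREAD_BARRIER_SERIAL_THREAD)
            return NULL;

        src = g->sum;
        id /= group_size[s];
    }
    return NULL;
}

/* Sums partial[0..N_THREADS-1] into out using the staged reduction. */
void branching_sum(float out[VEC_LEN])
{
    int n = N_THREADS;               /* participants entering each stage */
    for (int s = 0; s < N_STAGES; s++) {
        n /= group_size[s];          /* number of groups in stage s */
        for (int g = 0; g < n; g++) {
            pthread_spin_init(&groups[s][g].lock, PTHREAD_PROCESS_PRIVATE);
            pthread_barrier_init(&groups[s][g].barrier, NULL,
                                 (unsigned)group_size[s]);
            memset(groups[s][g].sum, 0, sizeof groups[s][g].sum);
        }
    }
    assert(n == 1);                  /* the last stage must be a single group */

    pthread_t tid[N_THREADS];
    for (int t = 0; t < N_THREADS; t++)
        pthread_create(&tid[t], NULL, worker, (void *)(intptr_t)t);
    for (int t = 0; t < N_THREADS; t++)
        pthread_join(tid[t], NULL);

    memcpy(out, groups[N_STAGES - 1][0].sum, sizeof(float) * VEC_LEN);

    n = N_THREADS;                   /* release the synchronisation objects */
    for (int s = 0; s < N_STAGES; s++) {
        n /= group_size[s];
        for (int g = 0; g < n; g++) {
            pthread_spin_destroy(&groups[s][g].lock);
            pthread_barrier_destroy(&groups[s][g].barrier);
        }
    }
}
```

A caller would fill `partial` and then call `branching_sum(out)`. The sketch creates its threads inside the call purely to stay self-contained; a real-time pipeline would presumably keep long-lived, pinned worker threads and reuse the locks and barriers from frame to frame.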
3.7 Host Optimisation and Tuning
3.7.1 Tuning the OS, Kernel and BIOS for Low Latency RTC
The operating system (OS) installed on the Xeon Phi used in this thesis is CentOS Linux 7.3 (The-CentOS-Project, 2001). To obtain the best low latency and low jitter performance various changes have been made to the default settings of the BIOS, the operating system and the kernel. The main changes to the BIOS settings
Figure 3.8: A schematic of the branching vector addition algorithm for a 4->2->2 configuration with 16 threads (diagram labels: Group 1 to Group 4; Stage 1, Stage 2, Stage 3; Final DM Command Vector). In the first stage, groups of four threads add up their partial DM vectors; stages 2 and 3 reduce the resulting temporary DM commands to the final DM command vector. This example allows up to 4 vector additions to happen in parallel and needs a total of 3 sequential stages, instead of adding up all 16 threads’ partial DM vectors sequentially. For larger thread counts, the effect is even more pronounced.
involve turning off Intel Hyper-Threading, which allows more logical threads to execute concurrently on hardware cores. Removing Hyper-Threading allows each software thread to be pinned to a single hardware core and removes scheduling inefficiencies caused when cores switch between different hyper-threads. During initial testing a Linux kernel with a real-time patch was considered. The real-time patch attempts to improve the kernel’s real-time response and allows the scheduler to pre-empt tasks so that processes with higher priorities can proceed. However, I discovered that with the tuning described above a real-time kernel was not required, and in some cases it degraded performance or even caused the system to crash. Other BIOS settings are specific to the Xeon Phi. Some relate to how the CPU handles memory addressing, with information available online (Intel, 2015). Others determine how the fast Multi-channel DRAM (MCDRAM) is allocated: accessible like standard RAM, reserved for the OS as a large last level cache (LLC), or a mixture of the two; these modes are termed ‘flat’, ‘cache’, and ‘hybrid’ respectively.
the OS doesn’t schedule any program to run on these cores without specific instruction, and also to other options relating to CPU interrupts and different power and performance modes. The main kernel options used are:
• isolcpus=[corelist] - specify isolated CPU cores
• nohz_full=[corelist] - stop certain CPU core ticks whenever possible
• idle=poll - improve the performance of waking up idle cores
• irqaffinity=[corelist] - specify cores that handle interrupt requests (IRQs)
• nohalt - turns off some power saving functions
The isolcpus option isolates all but the first 2 CPU cores from the OS scheduler such that processes must be explicitly allocated to them. This prevents the OS from potentially interrupting the simulator processes. The nohz_full option specifies the CPUs whose tick will be stopped whenever possible, which can reduce the number of scheduling-clock interrupts and reduce jitter. The irqaffinity option specifies the CPUs that handle interrupt requests (IRQs). This can reduce jitter by allowing the necessary interrupts to be processed on the correct CPU cores. The nohalt option tells the kernel not to use certain power saving functions, which reduces interrupt wake-up latency and can improve performance for real-time systems. Finally, the idle=poll option forces a polling idle loop that can slightly improve the performance of waking an idle CPU at the expense of power consumption. A comprehensive description of the kernel command line parameters can be found at The Linux Kernel (2019).
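Because cores isolated with isolcpus only run work that is explicitly placed on them, each RTC thread has to set its own CPU affinity. A minimal sketch of how this might be done on Linux is given below; the helper name `pin_self_to_core` is hypothetical, and the code relies on the GNU extension `pthread_setaffinity_np` rather than on anything specific to the software described here.

```c
#define _GNU_SOURCE          /* needed for CPU_SET and pthread_setaffinity_np */
#include <pthread.h>
#include <sched.h>

/* Pin the calling thread to a single core.
   Returns 0 on success or an errno-style error number on failure. */
int pin_self_to_core(int core)
{
    cpu_set_t set;

    CPU_ZERO(&set);          /* start from an empty CPU mask */
    CPU_SET(core, &set);     /* allow only the requested core */
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
}
```

Each worker thread would call this once at start-up with the index of the isolated core assigned to it. The same effect can be obtained for a whole process from the shell with the taskset utility.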
During our testing, we identified that the best performance is achieved with the CPU set to Quadrant memory addressing mode and the MCDRAM set to ‘flat’ mode. In ‘flat’ mode, the MCDRAM is visible to the CPU on a separate NUMA (Section 1.2.1.2) node from the standard RAM, and so it must be addressed either by explicitly allocating the memory in the program (using a NUMA library), or
by executing the program on the specific NUMA node to make use of the fast MCDRAM. In this report the MCDRAM was allocated by running software with the numactl command with the --membind=nodes option, ensuring that the entire RTC is allocated on this NUMA node. On the Xeon Phi, the MCDRAM is 16 GB in size, which is sufficient to fit a whole ELT-scale RTC.
3.7.2 Compiler Tuning
There are multiple compilers available for compiling software written in the C programming language to target x86 hardware. During initial testing, two compilers were considered to achieve the best performance of the AO RTC: the Intel C compiler, icc, and the GNU’s Not Unix (GNU) C compiler, gcc. The main benefit of gcc is that it is the default Linux compiler and is therefore widely available; it is also completely free to use and modify under the GNU General Public License (GPL). It is constantly updated to incorporate new features such as the Intel AVX-512 instruction set. Intel’s icc is not open source and not free to use, being available only as part of a paid license subscription to the Intel Parallel Studio XE or Intel System Studio packages. Intel do, however, offer a free version of these packages to students and classroom educators, which was used to compile the software used for this thesis.
For optimal compilation with either icc or gcc, certain compiler flags were necessary to achieve the best performance on the Intel Xeon Phi. The -O3 compiler flag (GNU; Intel, 2017b) was used with both compilers as it enables the most aggressive automatic compiler optimisations, including vectorisation, inlining of function calls and optimising loop structures. The gcc specific flags used were
• -mavx512f -mavx512er -mavx512cd -mavx512pf - enable AVX-512
• -march=knl - optimise for the Xeon Phi KNL
• -finline-functions - attempt to inline functions
• options to statically link the Intel Math Kernel Library (MKL)
– The options to enable MKL for gcc have been omitted for brevity
To enable AVX-512, gcc needed to be of version 7 or above and so version 7.3.1 was installed manually, as the default CentOS gcc version is only 4.8.5. The Intel MKL is used to accelerate common basic linear algebra subroutines (BLAS) such as MVMs using pre-compiled optimised libraries.
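One way to confirm that the AVX-512 flags have actually taken effect is to test the predefined macros that gcc sets when each subset is enabled. The sketch below is illustrative only: the function name `avx512_build_flags` and the flag constants are hypothetical, while `__AVX512F__`, `__AVX512CD__`, `__AVX512ER__` and `__AVX512PF__` are the macros gcc defines for the corresponding -mavx512* options.

```c
/* Bit flags for the AVX-512 subsets named in the gcc flag list. */
enum {
    HAS_AVX512F  = 1,   /* foundation instructions */
    HAS_AVX512CD = 2,   /* conflict detection */
    HAS_AVX512ER = 4,   /* exponential and reciprocal */
    HAS_AVX512PF = 8    /* prefetch */
};

/* Reports which AVX-512 subsets the compiler enabled for this file. */
unsigned avx512_build_flags(void)
{
    unsigned flags = 0;
#ifdef __AVX512F__
    flags |= HAS_AVX512F;
#endif
#ifdef __AVX512CD__
    flags |= HAS_AVX512CD;
#endif
#ifdef __AVX512ER__
    flags |= HAS_AVX512ER;
#endif
#ifdef __AVX512PF__
    flags |= HAS_AVX512PF;
#endif
    return flags;
}
```

A build made with the flags listed above would be expected to return all four bits set, whereas a default build on a generic x86 target returns 0. The result only describes what the compiler was permitted to emit; it does not show that the running CPU supports those instructions.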
The icc specific compiler flags used were
• -static-intel - link Intel libraries statically
• -xMIC-AVX512 - optimise for the many integrated core (MIC) architecture
• -fma - ensure fused multiply-add (FMA) operations are used
• -align - attempt to align memory allocations to natural boundaries
• -mkl=sequential - dynamically link the Intel math kernel library (MKL)
The shared Intel libraries were statically linked to avoid having to install the compiler software package on every target system; this was needed as the free student license had a limit on the number of machines it could be installed on simultaneously.
It was found that icc provided the best performance for the AO RTC. This is as expected, given the number of optimisations available for the Intel platform. However, the performance difference was only of order 10-15 %, and so, depending on the dimensions of the AO system, it may be more beneficial to use the free and open GNU gcc. Intel’s icc was used for the compilation of all software used in this thesis unless otherwise stated.