Intel Xeon Phi - GPRM: a high performance programming framework for manycore processors

comprehensive discussion is provided in [93].

Although our localisation technique can be implemented effectively in GPRM, we leave this as future work; for the purposes of this work, we use the default hashing policy of the TILEPro64. In general, hash-for-home is also the preferred option for most parallel applications running on the Tilera chip, as it aims to reduce the potential for bottlenecks.

In [22], we also investigated the effect of other options provided by the Tilera hypervisor. The hypervisor configuration file (.hvc) used for this study is as follows [136]:

    options stripe_memory=default
    options default_shared=0,0
    device srom/0 srom
    device pcie/0 pcie
    dedicated 7,7
    client vmlinux
    args $XARGS

Listing 3.1: Hypervisor Configuration File for the TILEPro64

Memory pages can be allocated either through a specific memory controller or in striping mode, where each page is striped across all memory controllers in 8KB chunks. With memory striping, Linux will boot up believing it has a single memory controller that is four times larger than any of the actual physical memory controllers. The effect of memory striping is considerable when caching is turned off across the system. However, when caching is enabled, it is mostly transparent to the user.
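To make the striping behaviour concrete, the following minimal sketch models how addresses could be spread over the memory controllers. The 8KB chunk size is taken from the text; the count of four controllers follows from the "four times larger" remark above. The round-robin order and the function names `controller_for` and `bytes_per_controller` are assumptions made for illustration only and do not describe the actual hardware address mapping.

```python
# Illustrative model of memory striping. The round-robin assignment of
# chunks to controllers is an assumption, not the documented hardware mapping.
CHUNK_BYTES = 8 * 1024   # striping granularity stated in the text (8KB)
NUM_CONTROLLERS = 4      # inferred from the "four times larger" remark


def controller_for(address):
    """Return the index of the controller assumed to serve this address."""
    return (address // CHUNK_BYTES) % NUM_CONTROLLERS


def bytes_per_controller(start, length):
    """Count how many bytes of [start, start + length) each controller serves."""
    counts = [0] * NUM_CONTROLLERS
    addr = start
    end = start + length
    while addr < end:
        # Advance to the end of the current chunk or the end of the region.
        chunk_end = (addr // CHUNK_BYTES + 1) * CHUNK_BYTES
        step = min(chunk_end, end) - addr
        counts[controller_for(addr)] += step
        addr += step
    return counts


if __name__ == "__main__":
    # A hypothetical 64KB region starting at address 0.
    print(bytes_per_controller(0, 64 * 1024))
```

Under these assumptions, any region that is a multiple of 32KB and aligned to a chunk boundary is divided evenly among the controllers, which is why striping tends to balance memory traffic when caching does not hide it.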

The default memory striping policy is used, and the last core in the mesh (7,7) is dedicated to the PCI communication between the device and host. Therefore, 63 cores can be used to run the user code.

The most important conclusion of our work was that the native GNU/Linux thread scheduling is not as efficient as expected. With a static mapping of threads to cores, costly thread migrations no longer occur repeatedly during execution. This result led us to adopt a static thread-to-core mapping inside our framework and to focus instead on the higher-level scheduling of tasks on threads.

3.5 Intel Xeon Phi

The Intel Xeon Phi coprocessor 5110P used in this study is an SMP (Symmetric Multiprocessor) on-a-chip which is connected to a host Xeon processor via a PCI Express bus interface. The Intel Many Integrated Core (MIC) architecture used by the Intel Xeon Phi coprocessors gives developers the advantage of using standard, existing programming tools and methods. Our Xeon Phi comprises 60 cores (240 logical cores) connected by a bidirectional ring interconnect (see Figure 3.2).

Figure 3.2: Intel Xeon Phi Architecture (picture borrowed from [2])

If an application is scalable on the Xeon processors, can make use of vector units, and is able to utilise more memory bandwidth than available with the Xeon processors, then it could be a potential target for the Xeon Phi [2].

3.5.1 Xeon Phi Architecture

The Xeon Phi coprocessor provides four hardware threads sharing the same physical core and its cache subsystem in order to hide the latency inherent in in-order execution. As a result, the use of at least two threads per core is almost always beneficial [2]. The Xeon Phi has eight memory controllers supporting 2 GDDR5 memory channels each. The clock speed of the cores is 1.053GHz. Each core has an associated 512KB L2 cache. Data and instruction L1 caches of 32KB are also integrated on each core. Another important feature of the Xeon Phi is that each core includes a SIMD 512-bit wide VPU (Vector Processing Unit). The VPU can be used to process 16 single-precision or 8 double-precision elements per clock cycle.
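As a rough worked example, the figures above can be combined into aggregate numbers for the whole chip. The core count, clock speed, cache size, and vector widths are taken from the text. The factor of two for fused multiply-add is an assumption that is not stated in this section, so the sketch reports throughput both with and without it; the names used are illustrative.

```python
# Back-of-the-envelope totals for the Xeon Phi 5110P described in the text.
CORES = 60
HW_THREADS_PER_CORE = 4
CLOCK_GHZ = 1.053
L2_KB_PER_CORE = 512
MEMORY_CONTROLLERS = 8
CHANNELS_PER_CONTROLLER = 2
SP_LANES = 16   # single-precision elements per cycle per VPU
DP_LANES = 8    # double-precision elements per cycle per VPU


def peak_gflops(lanes, flops_per_lane_per_cycle=1):
    """Peak rate in GFLOP/s; pass 2 to model fused multiply-add (an assumption)."""
    return CORES * CLOCK_GHZ * lanes * flops_per_lane_per_cycle


logical_cores = CORES * HW_THREADS_PER_CORE                     # 240
total_l2_mb = CORES * L2_KB_PER_CORE / 1024                     # 30.0
memory_channels = MEMORY_CONTROLLERS * CHANNELS_PER_CONTROLLER  # 16

if __name__ == "__main__":
    print("logical cores:", logical_cores)
    print("total L2 (MB):", total_l2_mb)
    print("GDDR5 channels:", memory_channels)
    print("DP GFLOP/s, one op per lane:", round(peak_gflops(DP_LANES), 2))
    print("DP GFLOP/s, assuming FMA:", round(peak_gflops(DP_LANES, 2), 2))
```

These are theoretical upper bounds only; they say nothing about what a given application achieves, which depends on vectorisation and memory behaviour.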

3.5.2 Xeon Phi Performance Considerations

Providing four hardware threads (logical cores) that share the same physical core is known as multithreading in the Xeon Phi. We use the term multithreading [2] here to distinguish it from hyper-threading on the Xeon processors. Throughout this dissertation, however, multithreading refers to software thread parallelism.

The use of multithreading as a part of the Xeon Phi architecture is crucial to hide the latencies of its in-order microarchitecture. Hyper-threading on the Xeon processors, on the other hand, is designed to feed a dynamic execution engine and, depending on the application, can be ignored entirely without a negative impact on performance. The hardware multithreading on the Xeon Phi, by contrast, should not be ignored.

Generally, the floating-point and memory capabilities that the hardware threads offer cannot be achieved with a single thread per physical core. On the other hand, it is also important to note that saturation can occur with as few as two hardware threads and, as we will see in the following chapters, different applications implemented with different parallelisation approaches experience varying levels of saturation.

It is beneficial to parametrise the number of cores as well as the number of hardware threads per core for applications targeting future manycore architectures.
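One way to express this parametrisation is to generate the thread placement from two parameters instead of hard-coding it. The sketch below is only an illustration: the function names `placement` and `logical_cpu` are hypothetical, and the contiguous logical CPU numbering is an assumption, since the numbering exposed by the operating system on a real device may differ.

```python
def placement(num_cores, threads_per_core, hw_threads_per_core=4):
    """Return (software_thread, core, hw_slot) tuples for a static mapping."""
    if not 1 <= threads_per_core <= hw_threads_per_core:
        raise ValueError("threads_per_core must be between 1 and hw_threads_per_core")
    plan = []
    thread_id = 0
    for core in range(num_cores):
        for slot in range(threads_per_core):
            plan.append((thread_id, core, slot))
            thread_id += 1
    return plan


def logical_cpu(core, slot, hw_threads_per_core=4):
    """Map (core, slot) to a logical CPU id, assuming contiguous numbering."""
    return core * hw_threads_per_core + slot


if __name__ == "__main__":
    # Two of the four hardware threads on each of 60 cores: 120 software threads.
    plan = placement(num_cores=60, threads_per_core=2)
    print(len(plan), plan[:4])
```

With both values exposed as parameters, the same application can be run with one to four threads per core to find the point at which a given workload saturates.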

3.6 Summary

This chapter covered the background on parallel architectures. We started the discussion with Flynn’s taxonomy and continued with a section on memory organisation. We then briefly reviewed a number of multicore machines. We also discussed the architecture of two manycore systems, the TILEPro64 and the Intel Xeon Phi, in more detail.

With the increasing number of cores, new programming challenges arise. Understanding the core architectural concepts of a given parallel platform is key to writing correct and efficient parallel programs, regardless of the programming model used. Improving data locality in manycore architectures is an important factor in achieving high performance. Although every new platform offers its users a great deal of fine-grained, architecture-specific control, existing codes generally have to be changed significantly to benefit from these features. Considering such details (e.g. the effect of distributed caches) in the design of runtime systems could help programmers notably.

It is also important to bear in mind that task and thread scheduling decisions can impose a significant overhead as the number of cores grows. The trade-off is thus to maximise the performance and data locality, while keeping the runtime overhead low. This will be the basis of our discussions in the next chapters.

Chapter 4 Task-based Parallel Models for Shared Memory Programming

In a general-purpose system, applications residing in the system compete for shared resources. Thread and task scheduling in such a multithreaded multiprogramming environment is a significant challenge.

Following the introduction to parallel programming models and the concept of task parallelism in the background chapter (Ch. 2), in this chapter we investigate the performance characteristics of three popular task-based parallel programming models on a modern manycore system, the Intel Xeon Phi. The three main task-based models supported by icpc (Intel’s C/C++ Compiler) are Intel OpenMP, Intel Cilk Plus, and Intel TBB.

We have used three benchmarks with different features which exercise different aspects of the system performance. Moreover, a multiprogramming scenario is used to compare the behaviours of these models when all three applications reside in the system.

Furthermore, at the end of this chapter we continue the discussion about multiprogramming using examples from our research work on OpenMP applications running on the TILEPro64.

In summary, this chapter reviews our work on other approaches and presents the lessons learnt that helped us design, tune, and improve the GPRM runtime system from its beginning to the present.
