–ECE 842 Survey Paper–
Selected Papers on Shared Memory Multiprocessors
and Distributed Shared Memory Multiprocessors
from the Past 10 Years
William M Jones
Contents
1 Programming Language and Compiler Support
1.1 An Efficient Shared Memory Layer for Distributed Memory Machines
1.2 Fine-grain Access Control for Distributed Shared Memory
1.2.1 Software look-up
1.2.2 Using ECC as cache-block Valid Bits
1.2.3 The Hybrid Version
1.2.4 Blizzard Performance
1.3 Combining Compile-Time and Run-Time Support for Efficient Software Distributed Shared Memory
1.4 Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations
1.5 Shared Memory Multiprocessors and Sequential Programming Languages: A Case Study
1.6 On the Influence of Programming Models on Shared Memory Computer Performance
1.7 Improving Fine-Grained Irregular Shared-Memory Benchmarks by Data Reordering
1.8 LPVM - experimental Lightweight process PVM
2 Communication Latency Reduction
2.1 Rationale and Design of the Hydra Multiprocessor
2.1.1 Overview
2.1.2 The MCM Technology
2.1.3 Performance versus Cost
2.1.4 The Hydra Design
2.2 The Case for a Single-Chip Multiprocessor
2.3 Techniques for Reducing Consistency-Related Communication in Distributed Shared Memory Systems
2.4 Data Forwarding in Scalable Shared-Memory Multiprocessors
2.5 Adaptive Protocols for Software Distributed Shared Memory
2.6 ASCOMA: An Adaptive Hybrid Shared Memory Architecture
Abstract
Over the course of the last decade, considerable research has been done in the areas of Shared Memory Multiprocessors (SMPs) and Distributed Shared Memory (DSM) machines. Supplying powerful computing solutions to a wide variety of scientists and engineers requires ongoing research efforts. Such efforts include improving the existing technologies by analyzing the bottlenecks that currently limit both hardware and software. Often it is necessary to investigate the requirements of the end users' applications in an attempt to identify features that may be used to the designer's advantage. Additionally, it is not sufficient simply to improve the raw performance that one may obtain from such machines. To make SMPs and DSMs truly viable solutions for advanced computing, there has to be front-end support that allows typical users to exploit these machines to their fullest extent.
In this paper, I address several personally selected research activities in this area from the past ten years. Predominantly, these activities may be partitioned into two main areas: Communication Latency Reduction (Section 2) and Programming Language and Compiler Support (Section 1). This paper is intended to be a broad overview of the numerous papers I encountered during the research process. I will, however, take a more in-depth look at a few of the papers that interested me the most. It is my intention that this will provide additional learning, which, of course, is what this is all about.
1 Programming Language and Compiler Support
Much of the research in the area of programming language support for DSMs resides in providing an intelligent layer of abstraction that gives the programmer a shared memory look-and-feel while using physically distributed memory systems. This often includes dealing with such questions as: “How do I allow programmers access to the type of synchronization primitives they are accustomed to in this programming model?” and “How do I address global name-spacing and data access patterns to optimize for the efficient use of the underlying communication subsystem?” This section presents research that provides insight into these questions as well as others.
1.1 An Efficient Shared Memory Layer for Distributed Memory Machines
Daniel Scales et al. at Stanford University are working on a system referred to as SAM [12]. SAM is essentially a shared memory (SM) interface for a network of workstations (NOW). It allows the programmer to utilize the power of clusters with the ease of the SM programming paradigm. SAM’s creators assert that a global name space and caching of remotely accessed data are of the utmost importance in scientific applications with irregular communication patterns and parallelism. SAM is based on explicit use of language constructs that allow the programmer to indicate where both communication and synchronization are to take place. In this way, the system offers the programmer a hybrid message passing (MP) capability as well as the SM model. They were able to implement SAM on several systems, including the CM-5, the Intel iPSC/860, and the IBM SP/1, as well as many others.
1.2 Fine-grain Access Control for Distributed Shared Memory
system is able to dynamically restrict reads and writes to cache-block-sized memory regions. This capability forms the basis for efficient cache-coherent (cc) SM systems. They describe several techniques that provide traditional SM programming constructs for DSM machines. In this way, Blizzard offers the kind of portability for SM codes that has traditionally been associated with message passing (MP). There are three interesting techniques for making this type of access control possible:
1. Adding a software look-up before each SM reference by modifying the program’s executable.
2. Using the memory’s ECC as cache-block valid bits.
3. A hybrid version.
1.2.1 Software look-up
Message passing systems often lack the fine-grain access control that hardware SM machines rely on. Access control is the ability to selectively restrict reads and writes to memory regions. Every time a memory reference occurs, the system is normally required to perform a look-up to determine whether the data being referred to is in local memory and is in an appropriate state. If it turns out that these criteria are not satisfied, then the system is required to perform an action to make the data available on the local node. Furthermore, the granularity is defined to be the smallest amount of data that can be independently controlled. Here, fine-grain refers to the size typically associated with hardware cache blocks.
Blizzard implements this by having a library file that is linked with the application during the compilation process. This keeps one from having to rewrite the compiler to include the functionality of the library. The library uses a variant of Fast-Cache to rewrite the existing executable, inserting a state table look-up before every SM reference. On average, this adds 15 instructions for every load and store instruction that cannot be determined by inspection to be a stack reference. This normally adds 18 clock cycles on the CM-5, but only in the absence of cache and TLB misses.
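As a rough illustration of the look-up described above, the following Python sketch models a per-block state table that is consulted before every shared-memory load and store. The block size, the state names, the `SharedMemory` class, and the `fake_fetch` handler are all assumptions made for illustration; Blizzard itself inserts the check into the executable as machine instructions, which this toy model does not attempt to reproduce.

```python
from enum import Enum

BLOCK_SIZE = 32  # bytes per cache block; assumed value for illustration


class State(Enum):
    INVALID = 0
    READ_ONLY = 1
    READ_WRITE = 2


class SharedMemory:
    """Toy model of fine-grain access control via a software state table."""

    def __init__(self, size, fetch_block):
        self.data = bytearray(size)
        # One state entry per cache-block-sized region.
        self.state = [State.INVALID] * (size // BLOCK_SIZE)
        self.fetch_block = fetch_block  # hypothetical miss handler
        self.misses = 0

    def _check(self, addr, need_write):
        block = addr // BLOCK_SIZE
        st = self.state[block]
        ok = st is State.READ_WRITE or (st is State.READ_ONLY and not need_write)
        if not ok:
            # Access fault: ask the protocol for the block in the needed state.
            self.misses += 1
            self.fetch_block(self, block, need_write)
            self.state[block] = State.READ_WRITE if need_write else State.READ_ONLY

    def load(self, addr):
        self._check(addr, need_write=False)
        return self.data[addr]

    def store(self, addr, value):
        self._check(addr, need_write=True)
        self.data[addr] = value


def fake_fetch(mem, block, need_write):
    """Stand-in for the remote fetch; a real system would exchange messages."""
    pass


mem = SharedMemory(256, fake_fetch)
mem.store(40, 7)      # miss: block 1 becomes READ_WRITE
value = mem.load(41)  # hit: same block, the handler is not called
```

The point of the sketch is that the check runs on every reference, even on hits, which is where the per-reference instruction overhead quoted above comes from.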
1.2.2 Using ECC as cache-block Valid Bits
On the CM-5 (the test-bed for Blizzard), the ECC can be forced to operate in several different modes. Specifically, the diagnostic mode allows the ECC in the SPARC cache to consider all cache misses to be uncorrectable errors. In this way, a precise exception can be generated and the kernel can vector the error to a user-level handler. This handler is allowed to control the state machines that track the Invalid, ReadOnly, and ReadWrite states.
In the paper, they present several proofs that using the ECC bits for this purpose does not result in a loss of reliability.
1.2.3 The Hybrid Version
1.2.4 Blizzard Performance
The benchmarks on Blizzard were compared to the results from running the same applications on the KSR-1. On average, the software look-up version of Blizzard ran about four times slower than the KSR-1. The ECC version was considerably faster, as one might expect, but still not as fast as the KSR-1. The hybrid version provides performance between that of its two components.
1.3 Combining Compile-Time and Run-Time Support for Efficient Software Distributed Shared Memory
Sandhya Dwarkadas et al. at the University of Rochester and Rice University have developed an integrated compile-time and run-time (RT) system that allows the programmer to use a traditional SM programming model on a DSM platform [11]. Here again we see an effort to make intelligent decisions about data-access patterns and caching strategies. The interesting niche filled here seems to be the attempt to have the compiler improve the SM implementation by directing the RT system to exploit the MP capabilities of the underlying hardware. The compiler does this by analyzing the SM accesses and transforming the code to include calls to the RT system that pass along the information the compiler has determined. In instances where the compiler is not able to make good decisions a priori, the system relies on the RT system’s management of the SM. This avoids the complexity and the limitations of compilers that directly target MP. The rest of the time, the system resides somewhere in the middle of these two extremes, and only uses the compile-time features when doing so is likely to succeed.
1.4 Reducing False Sharing on Shared Memory Multiprocessors through Compile Time Data Transformations
Susan J. Eggers et al. at the University of Washington developed several compile-time algorithms that identify data structures that tend to lead to false sharing, and then apply a transformation to reduce or in many cases eliminate this phenomenon [7]. In practice, this technique has been applied to very coarse-grained, explicitly parallel program code with quite successful results. One of my personal questions here is whether this type of optimization could be performed for program code that is not explicitly parallel (or not as explicit) and/or for finer-grained program code.
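To make the idea concrete, the short Python sketch below checks which per-processor counters fall in the same cache block under a packed layout and under a padded layout. The 64-byte block size and 8-byte element size are assumed values, and padding is only one simple transformation of this kind; the sketch is not claimed to be the algorithm from [7].

```python
BLOCK_SIZE = 64   # assumed cache block size in bytes
ELEM_SIZE = 8     # assumed size of one per-processor counter in bytes


def blocks_used(num_procs, stride):
    """Return the cache block index holding each processor's counter."""
    return [(p * stride) // BLOCK_SIZE for p in range(num_procs)]


def falsely_shared_pairs(block_ids):
    """Count pairs of processors whose private data share a cache block."""
    pairs = 0
    for i in range(len(block_ids)):
        for j in range(i + 1, len(block_ids)):
            if block_ids[i] == block_ids[j]:
                pairs += 1
    return pairs


packed = blocks_used(4, stride=ELEM_SIZE)    # counters adjacent in memory
padded = blocks_used(4, stride=BLOCK_SIZE)   # one counter per cache block
```

With the packed layout all four counters land in block 0, so every write by one processor invalidates the block for the other three; with the padded layout no two counters share a block, at the cost of extra memory.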
1.5 Shared Memory Multiprocessors and Sequential Programming Languages: A Case Study
1.6 On the Influence of Programming Models on Shared Memory Computer Performance
Ton Ngo et al. at the University of Washington discuss the idea that sharing is acceptable, but that programs should not share too much [10]. They indicate that programs written in the non-shared mode generally offer better performance. Such programs also tend to be more portable and scalable, since they do not depend on hardware support for sharing. They studied several problems, including the well-known LU decomposition as well as a Molecular Dynamics simulation, on a variety of SMPs. The programming model for SM machines needs to include provisions for the degree of sharing that best suits the particular problem being addressed.
1.7 Improving Fine-Grained Irregular Shared-Memory Benchmarks by Data Reordering
Cox et al. at Rice University have demonstrated that data reordering can improve fine-grained irregular SM applications (on both HW and SW SM systems) [13]. The techniques they present attempt to co-locate in memory the objects that are near one another in the physical system that the program models. By doing so, the system gains increased spatial locality and reduces false sharing. They test their theories on the Barnes-Hut and Water simulations as well as a host of other applications. They have implemented their work on a 16-node SGI Origin 2000 as well as two software SM systems. Their initial results are positive.
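One common way to co-locate spatially nearby objects is to sort them along a space-filling curve. The Python sketch below uses a Morton (Z-order) key on a two-dimensional grid; this particular curve, the `cell` size, and the function names are assumptions chosen for illustration and are not necessarily the orderings used in [13]. Coordinates are assumed to be non-negative.

```python
def morton_key(x, y, bits=16):
    """Interleave the bits of integer grid coordinates x and y."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)
        key |= ((y >> i) & 1) << (2 * i + 1)
    return key


def reorder(bodies, cell=1.0):
    """Sort bodies so that spatial neighbours end up adjacent in memory."""
    return sorted(
        bodies,
        key=lambda b: morton_key(int(b[0] / cell), int(b[1] / cell)),
    )


bodies = [(3.2, 3.7), (0.1, 0.4), (3.9, 3.1), (0.8, 0.2)]
ordered = reorder(bodies)
```

After reordering, the two bodies near the origin sit next to each other in the array, as do the two bodies near (3, 3), so a processor working on one region touches fewer cache blocks or pages.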
1.8 LPVM - experimental Lightweight process PVM
Al Geist et al. at Oak Ridge National Laboratory discuss some of the trends towards clusters of SMPs with a global shared memory (GSM) [6]. Typically these machines are arranged in a hierarchical or tiered multilevel system. Lightweight-process PVM (LPVM) is an experimental PVM version that supports the use of threads as the basic unit of parallelism. The essential idea was to provide a test-bed for a multithreaded MP system. The initial system was implemented on a SUN SMP. The aspiration here was to determine the future of PVM and what future versions of PVM should look like to support this type of clustering. Some of the issues that they investigated included:
1. Determining to what extent the latency and bandwidth of PVM could be improved by using threads in an SMP environment
2. How to take advantage of the GSM of a cluster of SMPs as these types of systems grow increasingly popular in HPC
2 Communication Latency Reduction
been done in reducing the interprocessor latency of traditional SMPs. This section addresses research that provides insight into these areas.
2.1 Rationale and Design of the Hydra Multiprocessor
In this subsection I will describe some of the more interesting points of the original Hydra project. I will go into more detail here, as this paper, in my opinion, was considerably more interesting than the others that I found while preparing this survey. The following is based on the original multichip module (MCM) design circa 1994. It should be noted that the design has changed since then in that the individual CPUs communicate via two specialized buses [5]. Also, the current Hydra project does not use the multichip module packaging technology; i.e., all four CPUs are on the same die. The ability to extract parallelism from intrinsically non-parallel sequential programs is a major focus area for the Hydra group. Interestingly enough, this is achieved by multithreading the sequential program so that speculative threads run in parallel on the CPUs. One of the interesting implications is that loop parallelization is now easily automated.
2.1.1 Overview
The Hydra is a collection of four HP processors that use a shared secondary cache. This cache is implemented using a multichip module (MCM) [8]. The Hydra is intended to support automatically parallelized programs in which a high degree of fine-grained sharing is expected. Microprocessor design trends at the time (1994) indicated that microprocessor performance was growing at a rate of 60% a year, which differs greatly from the rate of improvement in memory speeds. This has led to two main trends: 1. very large on-chip caches and 2. very complex microarchitectures (MAs) designed to exploit ILP. The problem is that as the processors become faster, they require more bandwidth and lower latency from the memory hierarchy. Some designs, such as the HP PA-RISC, have employed large off-chip commodity SRAMs with an advanced electrical interface as an alternative to dedicating large areas of silicon to on-chip caches. Designs such as these often provide performance comparable to that of the designs with much larger on-chip caches.
On the other hand, we have complex MAs to exploit ILP. This has taken two directions as well: 1. superscalar architectures (SCAs) and 2. VLIW architectures. Due to the complexity of the hardware associated with wide superscalars (for mapping instructions to issue windows, and for pipeline and circuit optimization at higher clock frequencies), these designs typically result in lower clock frequencies. VLIW CPUs do not require hardware for dynamic scheduling; however, good compiler support is required to generate code that exploits the ILP in programs. Additionally, one of the major shortcomings of VLIWs is that if one of the instructions in the word stalls, all of the instructions in that word are required to stall, regardless of their individual conditions. Also, the source code must be recompiled for this architecture, which is not required for SCAs.
2.1.2 The MCM Technology
capacitances (inductance and resistance as well!). All of these factors combine and result in better (faster) signal propagation. The Hydra relies on this MCM technology because of the low latency and high bandwidth associated with the MCM. The MCM resides between the two extremes of trying to put all of the logic on a single chip versus a multichip design separated on a PCB.
2.1.3 Performance versus Cost
The Hydra attempts to provide performance comparable to that of wide SCAs while costing less. The Hydra relies on the MCM and on parallelizing compiler technology to achieve these somewhat grandiose goals.
2.1.4 The Hydra Design
Inside the MCM, the Hydra has the four uPs, the second-level cache, and the cache controller. The cache controller supports synchronization primitives such as LL-SC. The idea here is that more coarse-grained constructs such as semaphores and barriers can be built from LL-SC. The memory consistency model used by the Hydra is partial store ordering (PSO). The Hydra also uses an invalidation-based protocol. The second-level cache is where sharing takes place between the processors, which facilitates the processing of finer-grained parallelism.
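As a minimal sketch of how a lock can be built from LL-SC, the Python model below simulates a memory word with per-CPU reservations and replays one interleaving by hand. The `Cell` class, the reservation set, and the helper names are hypothetical; the code models only the semantics of the primitive (a store-conditional fails if another CPU has written since the load-linked) and says nothing about how the Hydra cache controller implements it.

```python
class Cell:
    """Toy memory word that tracks which CPUs hold a load-linked reservation."""

    def __init__(self, value=0):
        self.value = value
        self.reserved_by = set()

    def load_linked(self, cpu):
        self.reserved_by.add(cpu)
        return self.value

    def store_conditional(self, cpu, new_value):
        if cpu not in self.reserved_by:
            return False          # reservation lost: another CPU wrote first
        self.value = new_value
        self.reserved_by.clear()  # a successful write breaks all reservations
        return True


def try_acquire(lock, cpu):
    """One attempt at a test-and-set style lock built from LL/SC."""
    if lock.load_linked(cpu) != 0:
        return False              # lock already held
    return lock.store_conditional(cpu, 1)


def release(lock):
    lock.value = 0
    lock.reserved_by.clear()


lock = Cell(0)
# Interleaving: both CPUs read the lock as free, then CPU 0 wins the SC.
lock.load_linked(0)
lock.load_linked(1)
cpu0_ok = lock.store_conditional(0, 1)
cpu1_ok = lock.store_conditional(1, 1)
```

In this interleaving only CPU 0 acquires the lock; CPU 1's store-conditional fails and it must retry. Semaphores and barriers can be layered on the same retry pattern.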
Since the Hydra does not rely on extracting ILP from a single thread of control, each uP can be less complicated and possibly faster. However, the main problem with this scenario is that the Hydra requires sophisticated compiler technology to extract multiple threads of control from a sequential program. As I mentioned earlier, this is currently a main focus area for the Hydra group, and it appears that they have met with some success. Another interesting point here is that the Hydra uses the Stanford MAGIC controller to connect the MCM to the I/O subsystem, the network, and the DRAMs.
2.2 The Case for a Single-Chip Multiprocessor
This paper offers a somewhat different view of single-chip silicon usage. Wilson et al. at Stanford University indicate that with the advances in IC technology, particularly gate density and wire cost, it is possible to implement multiple processors on a single chip using the same area as a wide superscalar [9]. The interesting advantage of the single-chip multiprocessor is that it offers a localized implementation of a high-clock-rate processor for sequential programs, while also providing low-latency interprocessor communication for parallel applications.
The use of hardware to track dependencies between instructions and to determine when to execute an instruction (possibly out of program order) is referred to as dynamic scheduling. Besides the fact that there tends to be a limit to the amount of ILP in a given thread of control, the instruction issue queue requires an amount of hardware for broadcast mechanisms (including wires and communication tags) that grows quadratically with issue width. This results in longer clock cycle times and ultimately limits the future performance of wide superscalar machines.
containing multiple processors per chip. Multiprocessors have traditionally been used in three ways:
1. To execute multiple processes in parallel to increase throughput in a multiprogramming environment
2. To execute multiple threads in parallel that come from a single application (manually parallelized program code – shared memory scientific apps)
3. To accelerate the execution of sequential applications without manual intervention (requiring automatic parallelizing compiler technology)
The single-chip multiprocessor simulated in the paper suggests that the more efficient use of silicon area lies in placing multiple CPUs on a chip. This direction is likely to greatly improve performance in each of the ways in which multiprocessors have typically been used.
2.3 Techniques for Reducing Consistency-Related Communication in Distributed Shared Memory Systems
John Carter et al. at Rice University have been exploring efficient ways to reduce the communication in DSMs associated with maintaining consistency. They have implemented four basic techniques on the Munin DSM system:
1. Software release consistency – completely software approach that aims at reducing the number of messages required to maintain consistency in a software DSM system. This approach only requires memory to be consistent at specific points, namely synchronization points.
2. Multiple consistency protocols – a collection of protocols to keep memory consistent in accordance with the observation that no single scheme is perfect for all situations.
3. Write-shared protocols – this addresses the issue of false sharing in DSMs by allowing multiple concurrent writes to a shared page. Updates are merged at the synchronization points defined by release consistency.
4. Update with timeout mechanism – remote copies of shared data are updated rather than invalidated; however, copies that are not referred to within a certain time interval are deleted, which removes the need for further updates to them and thus reduces the total amount of communication.
Each of these techniques was implemented on the Munin DSM. Results indicate that efforts to reduce consistency-related communication offer considerable rewards, as one might expect after considering the same problem on bus-based machines.
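The merging step of the write-shared protocol (item 3 above) can be sketched with a twin-and-diff scheme: each node keeps a pristine copy (twin) of the page taken before its first write, and at a synchronization point sends only the words that differ from the twin. The Python below is a minimal illustration under that assumption; the page size, word values, and function names are made up, and the two writers are assumed to touch different words, which is the false-sharing case the protocol targets.

```python
def make_diff(twin, page):
    """Record (offset, new_value) for every word changed since the twin was made."""
    return [(i, page[i]) for i in range(len(page)) if page[i] != twin[i]]


def apply_diff(page, diff):
    """Merge a diff into a copy of the page."""
    for offset, value in diff:
        page[offset] = value


home = [0] * 8                                 # master copy of the shared page
copy_a, copy_b = list(home), list(home)        # local copies on nodes A and B
twin_a, twin_b = list(copy_a), list(copy_b)    # twins made before the first write

copy_a[1] = 11                                 # node A writes word 1
copy_b[6] = 66                                 # node B writes word 6 of the same page

# At the release (synchronization point) each node sends only its changes.
diff_a = make_diff(twin_a, copy_a)
diff_b = make_diff(twin_b, copy_b)
apply_diff(home, diff_a)
apply_diff(home, diff_b)
```

Both writes survive the merge, and each node sends a single (offset, value) pair instead of the whole page, which is where the reduction in communication comes from.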
2.4 Data Forwarding in Scalable Shared-Memory Multiprocessors
If the compiler can identify additional possible consumers of the datum that a given processor requires, then in addition to caching a copy for itself, the processor can also send a copy to those consumers. They address two main issues: 1. developing a framework for a compiler algorithm that can identify the instances where forwarding is required, and 2. the results that can be obtained from such a design, using address traces in their simulation of a 32-node SMP. This is interesting in that it could make more scalable SMPs possible.
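The effect of forwarding on read misses can be illustrated with a small Python model. Everything in it is an assumption made for illustration: the `Node` class, a single shared address, one producer and three consumers, and a write that invalidates stale copies. It is not the simulator or the compiler algorithm from the paper.

```python
class Node:
    """A processor with a private cache and a read-miss counter."""

    def __init__(self):
        self.cache = {}
        self.read_misses = 0

    def read(self, addr, memory):
        if addr not in self.cache:
            self.read_misses += 1          # would be a remote fetch
            self.cache[addr] = memory[addr]
        return self.cache[addr]


def write(producer, addr, value, memory, nodes, forward_to=()):
    """Write a value, invalidate stale copies, optionally forward to consumers."""
    memory[addr] = value
    for node in nodes:
        node.cache.pop(addr, None)
    producer.cache[addr] = value
    for consumer in forward_to:
        consumer.cache[addr] = value       # pushed copy: the later read hits


def total_misses(forward):
    """Run three produce/consume rounds and return the total read misses."""
    memory = {}
    nodes = [Node() for _ in range(4)]
    producer, consumers = nodes[0], nodes[1:]
    for step in range(3):
        write(producer, 0x10, step, memory, nodes,
              forward_to=consumers if forward else ())
        for c in consumers:
            assert c.read(0x10, memory) == step
    return sum(n.read_misses for n in nodes)
```

Calling `total_misses(False)` counts one miss per consumer per round, while `total_misses(True)` counts none, because every consumer already holds the forwarded copy when it reads.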
2.5 Adaptive Protocols for Software Distributed Shared Memory
Alan Cox et al. at Rice University are attempting to develop protocols that adapt to run-time memory access patterns. What sets them apart from other research in this area is that they are attempting to achieve this without any user annotations or additional compiler or hardware support. They intend for the adaptation to be completely automatic. They have investigated adaptation in the form of switching between single- and multiple-writer protocols and dynamic aggregation of pages into a larger transfer unit. Interestingly enough, they have also been looking at switching between invalidation and update protocol versions. The programming model is explicitly parallel, with primitives for process creation and destruction and for synchronization, including mutex locks and barriers. The model makes provisions for release consistency (RC) as well. This concept is very similar to the efforts I describe in Section 1.
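A minimal sketch of one such adaptation, switching a page between single-writer and multiple-writer handling based on what was observed in the last interval, might look as follows in Python. The `PageStats` class and the rule "more than one writer in an interval means multiple-writer" are assumptions for illustration, not the decision procedure from the paper.

```python
class PageStats:
    """Per-page record of which nodes wrote during the current interval."""

    def __init__(self):
        self.writers = set()
        self.protocol = "single-writer"

    def record_write(self, node_id):
        self.writers.add(node_id)

    def end_interval(self):
        """At a synchronization point, pick the protocol for the next interval."""
        concurrent = len(self.writers) > 1
        self.protocol = "multiple-writer" if concurrent else "single-writer"
        self.writers.clear()
        return self.protocol


page = PageStats()
page.record_write(0)
page.record_write(1)
first = page.end_interval()    # two writers were seen in this interval
page.record_write(0)
second = page.end_interval()   # only one writer was seen in this interval
```

The decision uses only information the run-time system can observe by itself, which matches the stated goal of adapting without user annotations or compiler support.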
2.6 ASCOMA: An Adaptive Hybrid Shared Memory Architecture
John Carter et al. at the University of Utah have been looking at an adaptive approach to SM architectures by considering S-COMA versus ccNUMA [4]. Carter recognizes that recent architectures have taken a hybrid approach that combines aspects of both schemes. He offers a few improvements over the hybrid version. The first is a page allocation algorithm that prefers S-COMA pages at low memory pressure. Then, once the local free page pool is depleted, additional pages are mapped in ccNUMA mode. They remain in the ccNUMA state until there are sufficient remote misses to warrant an upgrade to S-COMA status. The second improvement is a page allocation algorithm that employs a dynamic backoff of the rate of page remappings from ccNUMA to S-COMA when memory pressure is high. This helps reduce the number of induced cold misses caused by needless thrashing of the page cache.
The research aimed to identify the conditions that cause these traditional schemes to perform badly. With this additional knowledge, the adaptive S-COMA protocol can take advantage of the best of both worlds.
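The page policy described above can be sketched as a small state machine. In the Python below, the `PagePolicy` class, the initial threshold, and the choice to double the threshold after each upgrade are assumptions made for illustration; the paper's actual thresholds and backoff rule are not given here, and the eviction that a real upgrade would trigger under memory pressure is not modelled.

```python
class PagePolicy:
    """Toy model of choosing S-COMA or ccNUMA mode for each page."""

    def __init__(self, free_pages, threshold=4):
        self.free_pages = free_pages   # size of the local free page pool
        self.threshold = threshold     # remote misses needed before an upgrade
        self.mode = {}                 # page -> "S-COMA" or "ccNUMA"
        self.remote_misses = {}

    def first_touch(self, page):
        """Prefer S-COMA while free pages remain, otherwise map as ccNUMA."""
        if self.free_pages > 0:
            self.free_pages -= 1
            self.mode[page] = "S-COMA"
        else:
            self.mode[page] = "ccNUMA"
            self.remote_misses[page] = 0
        return self.mode[page]

    def remote_miss(self, page):
        """Count a remote miss; upgrade to S-COMA once the threshold is reached."""
        if self.mode[page] != "ccNUMA":
            return self.mode[page]
        self.remote_misses[page] += 1
        if self.remote_misses[page] >= self.threshold:
            self.mode[page] = "S-COMA"
            # Back off: later remappings need more misses (assumed doubling).
            self.threshold *= 2
        return self.mode[page]


policy = PagePolicy(free_pages=1, threshold=2)
mode_a = policy.first_touch("a")   # a free page is available
mode_b = policy.first_touch("b")   # the pool is now empty
```

Raising the threshold after each remapping is one simple way to slow the remapping rate at high memory pressure, so that pages are not repeatedly swapped in and out of the page cache.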
3 Concluding Remarks
hardware and protocols that will bridge the gap between processor performance and network speed? If there were such a technology, what would that mean for the future of HPC, specifically DSM or MP systems?
References
[1] Lawrence A. Crowl. Shared memory multiprocessors and sequential programming languages: A case study. In Bruce D. Shriver, editor, Proceedings of the Twenty-First Annual Hawaii International Conference on Systems Sciences, volume II, pages 103–108. University of Hawaii, IEEE Computer Society Press, January 1988.
[2] David A. Koufaty, Xiangfeng Chen, David K. Poulsen, and Josep Torrellas. Data forwarding in scalable shared-memory multiprocessors. pages 103–108, October 1996.
[3] Ioannis Schoinas et al. Fine-grain access control for distributed shared memory. In Proceedings of the Sixth International Symposium on Architectural Support for Parallel Languages and Operating Systems. University of Wisconsin-Madison, October 1994.
[4] John Carter et al. ASCOMA: An adaptive hybrid shared memory architecture. International Conference on Parallel Processing, August 1998.
[5] Mike Chen et al. The Stanford Hydra CMP. Hot Chips ’99, August 1999. Talk slides.
[6] Al Geist and Honbo Zhou. LPVM - experimental lightweight process PVM. Oak Ridge National Laboratory. Mathematical Sciences Section.
[7] T.E. Jeremiassen and S.J. Eggers. Reducing false sharing on shared memory multiprocessors through compile time data transformations. Technical Report 94-09-05, University of Washington, 1994.
[8] Kunle Olukotun, Jules Bergmann, Kun-Yung Chang, and Basem A. Nayfeh. Rationale and design of the Hydra multiprocessor. Computer Systems Lab Technical Report CSL-TR-94-645, Stanford University, 1994. This is the original MCM based design.
[9] Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kun-Yung Chang. The case for a single-chip multiprocessor. In Proceedings of the Seventh International Symposium on Architectural Support for Parallel Languages and Operating Systems, pages 103–108. Stanford University, October 1996.
[10] Ton A. Ngo and Lawrence Snyder. On the influence of programming models on shared memory computer performance. In Proceedings of the Scalable High Performance Computing Conference, pages 284–291, 1992.
[11] S. Dwarkadas, H. Lu, A.L. Cox, R. Rajamony, and W. Zwaenepoel. Combining compile-time and run-time support for efficient software distributed shared memory. In Proceedings of the IEEE, Special Issue on Distributed Shared Memory, volume 87 No. 3, pages 476–486, 1999.
[12] Daniel J. Scales and Monica S. Lam. An efficient shared memory layer for distributed