Parallel and High-Performance Systems

1. Discuss at least three distinguishing factors that can be used to differentiate among parallel computer systems. Why do systems vary so widely with respect to these factors?

Some of the main factors that differ from one parallel system to another are the number and type of processors used and the way in which they are connected to each other. Some parallel systems use shared main memory for communication between processors, while others use a message-passing paradigm. With either of these approaches, a wide variety of networks can be used to facilitate the exchange of information. The main reason why parallel systems exhibit such a wide range of characteristics is probably because there is such a wide variety of applications. The characteristics of the intended applications drive the characteristics of the machines that are built to run them.

2. Michael Flynn defined the terms SISD, SIMD, MISD, and MIMD to represent certain classes of computer architectures that have been built or at least considered. Tell what each of these abbreviations stands for; describe the general characteristics of each of these architectures; and explain how they are similar to and different from one another. If possible, give an example of a specific computer system fitting each of Flynn’s

classifications.

SISD stands for Single Instruction stream, Single Data stream – this is a single-processor system such as a typical desktop or notebook PC or workstation. Generally, such systems are built around a processor with a conventional von Neumann (Princeton) or Harvard architecture. SIMD stands for Single Instruction stream, Multiple Data stream. SIMD systems are commonly known as array

processors because they execute the same operation on a large collection of operands at the same time. The control unit of a SIMD is much like that of a SISD machine, but it controls a number of processing elements simultaneously. Examples of SIMD computers include the ILLIAC IV, the Connection Machine, etc. MISD (Multiple Instruction stream, Single Data stream) machines would have carried out multiple algorithms on the same data sets. Such machines, while conceptually possible, have yet to be developed. MIMD is an acronym for Multiple Instruction stream, Multiple Data stream. This classification of machines encompasses the vast majority of parallel systems, which consist of multiple CPUs (each with a Princeton or Harvard architecture like the one in a typical SISD system) connected together by some type of communications network. Examples of MIMD computers include the Silicon Graphics Origin series, the Cray T3E, and any “Beowulf” class cluster system.

3. What is the main difference between a vector computer and the scalar architectures that we studied in Chapters 3 and 4? Do vector machines tend to have a high, or low, degree of generality as defined in Section 1.4? What types of applications take best advantage of the properties of vector machines?

The main difference between a vector computer and conventional scalar architectures is the fact that instructions executed by vector processors operate not on individual values or pairs of values, but on vectors (one-dimensional arrays of values). In a scalar machine the ADD instruction adds two numbers to produce their sum; in a vector processor the ADD instruction adds each element of one set of numbers to the corresponding element of a second set, producing a corresponding set of results. This is usually accomplished with a deeply pipelined execution unit(s)

through which vector elements are fed in succession.

Because of their unique construction, vector machines have a very low degree of generality. They are extremely well suited to certain applications, particularly scientific and engineering applications like weather forecasting, CFD simulations, etc. that do a great deal of “number crunching” on vectors or arrays of data. However, they offer little to no advantage when running office applications or any type of scalar code. While vector processors are more powerful now than they have ever been, they are not as popular in the overall supercomputer market as they once were because they are useful for a relatively narrow range of specialized

applications. Cluster computers based on RISC and superscalar microprocessors are now more popular since they tend to be less expensive (per MIP or MFLOP) and can run a wider range of applications efficiently.

4. How are array processors similar to vector processors and how are they different? Explain the difference between fine-grained and coarse-grained array processors. Which type of array parallelism is more widely used in today’s computer systems? Why?

Array processors are similar to vector processors in that a single machine instruction causes a particular computation to be carried out on a large set of operands. They are different in their construction: array processors use a number of relatively simple processing elements (spatial parallelism) while vector processors generally employ a small number of deeply pipelined processing units (temporal parallelism). Fine-grained array processors consist of a very large number of extremely simple processing elements, while coarse-grained array processors have a few (but usually somewhat more capable) processing elements. Coarse-grained

array parallelism is more widely used in today’s computers, particularly in the multimedia accelerators that have been added to many popular microprocessor families. This is probably because a coarse-grained SIMD is useful in a wider range of applications than a fine-grained array processor would be.

5. Explain the difference between multiprocessor and multicomputer systems. Which of these architectures is more prevalent among massively parallel MIMD systems? Why? Which architecture is easier to understand (for programmers familiar with the

uniprocessor model)? Why?

Multiprocessors are systems in which the CPUs communicate by sharing main memory locations, while multicomputers are systems in which each CPU has its own, local memory and communication is accomplished by passing messages over a network. Most massively parallel MIMD systems are multicomputers because sharing main memory among large numbers of processors is difficult and expensive. Multiprocessors are easier for most programmers to work with because the shared memory model allows communication to be done using the same approaches that are used in systems with a single CPU. Multicomputers, on the other hand, must make use of message passing – a more "artificial" and counterintuitive paradigm for communication.

6. Explain the similarities and differences between UMA, NUMA, and COMA multiprocessors.

All three of these architectural classifications refer to machines with shared main memory. Any location in memory can be read or written by any processor in the system; this is how processes running on various processors communicate with

each other. In a system with UMA (Uniform Memory Access), any memory location can be read or written by any CPU in the same amount of time (unless the memory module in question is already busy). This is a desirable property, but the hardware required to accomplish it does not scale economically to large systems.

Multiprocessors with many CPUs tend to use the NUMA (Non-Uniform Memory Access) or the more recently developed COMA (Cache-Only Memory Architecture) approaches. NUMA systems use a modular interconnection scheme in which memory modules are directly connected to some CPUs but only indirectly connected to others. This is more cost-effective for larger multiprocessors, but access time is variable (remote modules take longer to read or write than local modules) and thus code/data placement must be “tuned” to the specific

characteristics of the memory system (a non-trivial exercise) for best performance. In a COMA system, the entire main memory space is treated as a cache; all addresses represent tags rather than physical locations. Items in memory can be migrated and/or replicated dynamically so they are nearer to where they are most needed. This experimental approach requires even more hardware support than the other two, and is therefore more expensive to implement, but it has the potential to make larger multiprocessors behave more like SMPs and thus perform well without the software having to be tuned to a particular hardware configuration.

7. What does “cache coherence” mean? In what type of computer system would cache coherence be an issue? Is a write-through strategy sufficient to maintain cache coherence in such a system? If so, explain why. If not, explain why not and name and describe an approach that could be used to ensure coherence.

Cache coherence means that every CPU in the system sees the same view of memory (“looking through” its cache(s) to the main memory). In a coherent system, any CPU should get the same value as any other when it reads a shared location, and this value should reflect updates made by this or any other processor. A write- through strategy is not sufficient to ensure this, because updating the contents of main memory is not enough to make sure the other caches’ contents are consistent with the updated memory. Even if main memory contains the updated value, the other caches might have previously loaded that refill line and thus might still contain an old value.

To ensure a coherent view of memory across the whole machine, copies of a line that has been modified (written) need to be updated in other caches (so that they immediately receive the new data) or invalidated by them (so that they will miss if they try to access the old data and update it at that point by reading the new value from main memory). The write-update or write-invalidate operations can be accomplished by implementing a snoopy protocol (typical of smaller, SMP systems) in which caches monitor a common interconnection network, such as a bus, to detect writes to cached locations. In larger multiprocessor systems such as NUMA

machines where bus snooping is not practical, a directory protocol (in which caches notify a centralized controller(s) of relevant transactions and, in turn, receive notifications of other caches’ write operations) is often used.

8. What are the relative advantages and disadvantages of write-update and write-invalidate snoopy protocols?

bandwidth since there is no need for other caches to load modified data. It works well when data are lightly shared, but not as well when data are heavily shared since the hit ratio is usually lower. A write-update snoopy protocol keeps the hit ratio higher for heavily shared data and works well when reads and writes alternate, but it is more complex to implement and may use more bus bandwidth (which can be a limiting factor if several processors are sharing a common bus).

9. What are directory-based protocols and why are they often used in CC-NUMA systems?

Directory-based protocols are cache coherence schemes that do not rely on the “snooping” of a common bus or other single interconnection between CPUs in a multiprocessor system. (While snooping is feasible in most SMP systems, it is not so easily accomplished in larger NUMA architectures with distributed shared memory using a number of local and system-wide interconnections.) Communications with the system directory, which is a hardware database that maintains all the

information necessary to ensure coherence of the memory system, are done in point- to-point fashion and thus scale better to a larger system. In very large systems, the directory itself may be distributed (split into subsets residing in different locations) to further enhance scalability.

10. Explain why synchronization primitives based on mutual exclusion are important in multiprocessors. What is a read-modify-write cycle and why is it significant?

A read-modify write (RMW) cycle is one in which a memory location is read, its contents are modified by a processor, and the new value is written back into the same memory location in indivisible, “atomic” fashion. This type of operation is important in accessing mutual exclusion primitives such as semaphores. The RMW

cycle protects the semaphore test/update operation from being interrupted by any other process; if such an interruption did occur, this could lead to a lack of mutual exclusion on a shared resource, which in turn could cause incorrect operation of the program.

11. Describe the construction of a “Beowulf cluster” system. Architecturally speaking, how would you classify such a system? Explain.

A Beowulf-type cluster is a parallel computer system made up of a number of inexpensive, commodity computers (often generic Intel-compatible PCs) networked together, usually with off-the-shelf components such as 100 megabit/s or 1 gigabit/s Ethernet. The operating system is often an open-source package such as Linux. The idea is to aggregate a considerable amount of computational power as inexpensively as possible. Because they are comprised of multiple, complete computer systems and communicate via message passing over a network, Beowulf clusters are considered multicomputers (LM-MIMD systems).

12. Describe the similarities and differences between circuit-switched networks and packet- switched communications networks. Which of these network types is considered “static” and which is “dynamic”? Which type is more likely to be centrally controlled and which is more likely to use distributed control? Which is more likely to use asynchronous timing and which is more likely to be synchronous?

Circuit-switched and packet-switched networks are both used to facilitate communications in parallel (SIMD and MIMD) computer systems. Both allow connections to be made for the transfer of data between nodes. They are different in that circuit-switched networks actually make and break physical connections to

allow various pairs of nodes to communicate, while packet-switched networks maintain the same physical connections all the time and use a routing protocol to guide message packets (containing data) to their destinations.

Because physical connections are not changed to facilitate communication, packet-switched networks are said to be static, while circuit-switched networks dynamically reconfigure hardware connections. Packet-switched networks

generally exhibit distributed control and are most likely to be asynchronous in their timing. Circuit-switched networks are more likely to be synchronous (though some are asynchronous) and more commonly use a centralized control strategy.

13. What type of interconnection structure is used most often in small systems? Describe it and discuss its advantages and disadvantages.

Small systems often use a single bus as an interconnection. This is a set of address, data, and control/timing signals connected to all components in the system to allow information to be transferred between them. The principal advantage of a bus-based system is simplicity of hardware (and therefore low cost). Its main disadvantage is limited performance; only one transaction (read or write operation) can take place at a time. In a system with only one or a few bus masters (such as CPUs, IOPs, DMACs, etc.) there will typically be little contention for use of the bus and performance will not be compromised too much; in larger systems, however, there will often be a desire for multiple simultaneous data transfers which, of course, cannot happen. Thus one or more bus masters will have to wait to use the bus and overall performance will suffer to some degree.

degree do its nodes have? What is its communication diameter? Discuss the advantages and disadvantages of this topology.

A star network has all the computational and/or memory nodes connected to a single, central communications node or “hub”. All the nodes except the hub have a connection degree of one (the hub is of degree n if there are n other nodes). Such a network has a communication diameter of two (one hop from the source to the hub, one hop from the hub to the destination). Advantages include the network’s small communication diameter (2) and the fact that the communication distance is the same for all messages sent across the network. Another advantage is that the network structure is simple and it is easy to add additional nodes if system expansion is desired. The main disadvantage of a star network is that all

communications must pass through the hub. Because of this, the hub may become saturated with traffic, thus becoming a bottleneck and limiting system performance.

15. How are torus and Illiac networks similar to a two-dimensional nearest-neighbor mesh? How are they different?

Both torus and Illiac networks are similar to a two-dimensional nearest- neighbor mesh in that their nodes are of degree four (they are connected to neighbors in each of the x and y directions; in other words, to a “north”, “south”, “east”, and “west” neighbor). This is true of the interior nodes in a nearest-

neighbor mesh as well. The differences between these three network topologies lie in how the “edge” and “corner” nodes are connected. In a basic nearest-neighbor mesh, the nodes on the side and top/bottom edges lack one of the possible

one connection in the x direction and one in the y direction and therefore are of degree two.

In a torus network, the edge nodes have a “wrap-around” connection to the node in the same row or column on the opposite edge (corner nodes have wrap- around connections in both dimensions); therefore all nodes in the network have a connection degree of four. The Illiac network has the same configuration, except that the rightmost node in each row has a wrap-around connection to the leftmost node in the next (rather than the same) row, with the rightmost node in the last row being connected to the leftmost node in the first row.

16. Consider a message-passing multicomputer system with 16 computing nodes.

(a) Draw the node connections for the following connection topologies: linear array, ring, two-dimensional rectangular nearest-neighbor mesh (without edge

connections), binary n-cube.

(The drawings will be similar to the appropriate figures in Chapter 6.)

(b) What is the connection degree for the nodes in each of the above interconnection networks?

Linear array: 2 (1 for end nodes). Ring: 2 (all nodes). 2-D mesh: 4 (3 on edges, 2 in corners). Binary n-cube: 4 (all nodes).

Linear array: 15. Ring: 8. 2-D mesh: 6. Binary n-cube: 4.

(d) How do these four networks compare in terms of cost, fault tolerance, and speed of communications? (For each of these criteria, rank them in order from most desirable to least desirable.)

Cost: linear array, ring, 2-D mesh, n-cube. Fault tolerance: n-cube, 2-D mesh, ring, linear array. Speed: n-cube, 2-D mesh, ring, linear array.

17. Describe, compare, and contrast store-and-forward routing with wormhole routing. Which of these approaches is better suited to implementing communications over a static network with a large number of nodes? Why?

Both wormhole and store-and-forward routing are methods for transmitting message packets across a static network. Store-and-forward routing treats each message packet as a unit, transferring the entire packet from one node in the

routing path to the next before beginning to transfer it to a subsequent node. This is a simple but not very efficient way to transfer messages across the network; it can cause significant latency for messages that must traverse several nodes to reach their destinations. Wormhole routing divides the message packets into smaller pieces called flits; as soon as an individual flit is received by an intermediate node along the routing path, it is sent on to the next node without waiting for the entire packet to be assembled. This effectively pipelines, and thus speeds up, the

transmission of messages between remote nodes. A network with a large number of nodes is likely to have a large communication diameter, with many messages

requiring several “hops” to reach their destinations, and thus is likely to benefit

In document Solutions Manual (Page 71-86)