Share Memory Multiprocessor UNIT-6.pptx

• A system with multiple CPUs sharing the same main memory is called a multiprocessor.

• In a multiprocessor system, all processes on the various CPUs share a single logical address space, which is mapped onto a physical memory that can be distributed among the processors.

• Each process can read and write a data item simply by using load and store operations, and processes communicate through shared memory.

• It is the hardware that makes all CPUs access and use the same main memory.

Systems with Multiple CPUs

• Collection of independent CPUs (or computers) that appears to the users/applications as a single system

• Technology trends

• Powerful, yet cheap, microprocessors

• Advances in communications

• Physical limits on computing power of a single CPU

• Examples

• Network of workstations

• Servers with multiple processors

• Network of computers of a company

Advantages

• Data sharing: allows many users to share a common database

• Resource sharing: expensive devices such as a color printer

• Parallelism and speed-up: a multiprocessor system can have more computing power than a mainframe

• Better price/performance ratio than mainframes

• Reliability: fault tolerance can be provided against crashes of individual machines

• Flexibility: spread the workload over available machines

• Modular expandability: Computing power can be added in

Design Issues

• Transparency: How to achieve a single-system image

• How to hide distribution of memory from applications?

• How to maintain consistency of data?

• Performance

• How to exploit parallelism?

• How to reduce communication delays?

• Scalability: As more components (say, processors) are added, performance should not degrade

Classification

• Multiprocessors

• Multiple CPUs with shared memory

• Memory access delays about 10 – 50 nsec

• Multicomputers

• Multiple computers, each with own CPU and memory, connected by a high-speed interconnect

• Tightly coupled with delays in micro-seconds

• Distributed Systems

• Loosely coupled systems connected over Local Area Network (LAN), or even long-haul networks such as Internet

Multiprocessor Systems

• Multiple CPUs with a shared memory

• From an application’s perspective, difference with single-processor system need not be visible

• Virtual memory where pages may reside in memories associated with other CPUs

There are three issues in particular

• Cache coherence

• Synchronization

The Cache Coherence Problem

• In a multiprocessor system, data inconsistency may occur among adjacent levels or within the same level of the memory hierarchy. For example, the cache and the main memory may have inconsistent copies of the same object.

• As multiple processors operate in parallel, multiple caches may independently hold different copies of the same memory block; this creates the cache coherence problem. Cache coherence schemes help to avoid this problem by maintaining a uniform state for each cached

Snoopy Bus Protocols

• Snoopy protocols achieve data consistency between the cache memory and the shared memory through a bus-based memory system.

• In this case, we have three processors P1, P2, and P3 having a consistent copy of data

Cache Events and Actions

The following events and actions occur on the execution of memory-access and invalidation commands −

• Read-miss − When a processor wants to read a block and it is not in the cache, a read-miss occurs. This initiates a bus-read operation. If no dirty copy exists, then the main memory, which has a consistent copy, supplies a copy to the requesting cache memory. If a dirty copy exists in a remote cache memory, that cache will restrain the main memory and send a copy to the requesting cache memory. In both cases, the cache copy will enter the valid state after a read-miss.

• Write-hit − If the copy is in dirty or reserved state, write is done locally and the new state is dirty. If the new state is valid, write-invalidate command is

• Write-miss − If a processor fails to write in the local cache memory, the copy must come either from the main memory or from a remote cache memory with a dirty block. This is done by sending a read-invalidate command, which will invalidate all cache copies. Then the local copy is updated with dirty state.

• Read-hit − Read-hit is always performed in local cache memory without causing a transition of state or using the snoopy bus for invalidation.

• Block replacement − When a copy is dirty, it is to be written back to the main
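The events above can be sketched as a small simulation. This is a simplified write-invalidate model, not the full protocol from the slides: it uses only three states (invalid, valid, dirty) and omits the reserved state, and on a read-miss a remote dirty copy is written back to memory before the requester reads it. The names `Bus`, `Cache` and `State` are hypothetical.

```python
from enum import Enum


class State(Enum):
    INVALID = "invalid"
    VALID = "valid"
    DIRTY = "dirty"


class Bus:
    """Shared bus: holds main memory and the list of snooping caches."""

    def __init__(self):
        self.memory = {}
        self.caches = []


class Cache:
    def __init__(self, bus):
        self.bus = bus
        self.lines = {}  # address -> [state, value]
        bus.caches.append(self)

    def _state(self, addr):
        return self.lines.get(addr, [State.INVALID, None])[0]

    def _flush_remote_dirty(self, addr):
        # A remote dirty copy is written back so memory is up to date.
        for cache in self.bus.caches:
            if cache is not self and cache._state(addr) is State.DIRTY:
                self.bus.memory[addr] = cache.lines[addr][1]
                cache.lines[addr][0] = State.VALID

    def _invalidate_remote(self, addr):
        # Broadcast an invalidation to every other cache holding the block.
        for cache in self.bus.caches:
            if cache is not self and addr in cache.lines:
                cache.lines[addr][0] = State.INVALID

    def read(self, addr):
        if self._state(addr) is State.INVALID:  # read-miss: bus read
            self._flush_remote_dirty(addr)
            self.lines[addr] = [State.VALID, self.bus.memory.get(addr, 0)]
        return self.lines[addr][1]  # read-hit: served locally

    def write(self, addr, value):
        if self._state(addr) is State.INVALID:  # write-miss
            self._flush_remote_dirty(addr)
        if self._state(addr) is not State.DIRTY:
            self._invalidate_remote(addr)  # other copies become invalid
        self.lines[addr] = [State.DIRTY, value]
```

With three caches sharing one bus, a write by one cache invalidates the other copies, and the next read by another cache picks up the written value.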

Hardware Synchronization Mechanisms

• Synchronization is a special form of communication where, instead of data, control information is exchanged between communicating processes residing in the same or different processors.

• Multiprocessor systems use hardware mechanisms to implement low-level synchronization operations. Most multiprocessors have hardware mechanisms to impose atomic operations such as memory read, write or read-modify-write operations to implement some synchronization primitives. Other than atomic

Cache Coherency in Shared Memory Machines

Maintaining cache coherency is a problem in a multiprocessor system when the processors contain local cache memory. Data inconsistency between different caches easily occurs in such a system.

The major concern areas are −

• Sharing of writable data

• Process migration

Sharing of writable data

• When two processors (P1 and P2) have the same data element (X) in their local caches and

Process migration

• In the first stage, the cache of P1 has data element X, whereas P2 does not have anything. A process on P2 first writes to X and then migrates to P1. Now, the process starts reading data element X, but as the processor P1 has outdated data, the process cannot read the current value. So, a process on P1 writes to the data element X and then migrates to P2. After migration, a process on P2 starts reading the data

I/O activity

• As illustrated in the figure, an I/O device is added to the bus in a two-processor multiprocessor architecture. In the beginning, both the caches contain the data element X. When the I/O

Cache Coherence

• Since all the processors share the same address space, it is possible for more than one processor to cache an address at the same time (coherence issue).

Synchronization issues.

• Synchronization mechanisms are typically built with user-level software routines that rely on hardware-supplied synchronization instructions.

• For smaller multiprocessors or low-contention situations, the key hardware capability is an uninterruptible instruction or instruction sequence capable of atomically retrieving and changing a value.

• In larger-scale multiprocessors or high-contention situations, synchronization can

Types of Synchronization.

• Mutual exclusion.

• Synchronize entry into critical sections.

• Normally done with locks.

• Point-to-point synchronization.

• Tell a set of processors (normally set cardinality is one) that they can proceed.

• Normally done with flags.

• Global synchronization.

• Bring every processor to sync.
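The three types listed above can be illustrated with Python's standard `threading` module. This is only a sketch with threads standing in for processors: a `Lock` provides mutual exclusion, an `Event` plays the role of a flag, and a `Barrier` provides global synchronization. The function name `run_demo` and its parameters are hypothetical.

```python
import threading


def run_demo(num_threads=4, increments=1000):
    counter = 0
    lock = threading.Lock()                    # mutual exclusion
    ready = threading.Event()                  # point-to-point flag
    barrier = threading.Barrier(num_threads)   # global synchronization
    seen_after_barrier = []

    def worker():
        nonlocal counter
        ready.wait()                  # proceed only once the flag is set
        for _ in range(increments):
            with lock:                # critical section
                counter += 1
        barrier.wait()                # nobody continues until all arrive
        with lock:
            seen_after_barrier.append(counter)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    ready.set()                       # tell the waiting threads to proceed
    for t in threads:
        t.join()
    return counter, seen_after_barrier
```

Because every increment happens before any thread leaves the barrier, each thread observes the same final count after it.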

Memory Consistency Model

• A memory consistency model is a set of rules that specify when a value written by one thread can be read by another thread.

• The memory consistency model affects

• System implementation: hardware, OS, languages, compilers

• Programming correctness

Types

• Strict consistency - A shared-memory system is said to support the strict consistency model if the value returned by a read operation on a memory address is always the same as the value written by the most recent write operation to that address.

• Sequential - A shared-memory system is said to support the sequential consistency model if all processes see the same order of all memory access operations on the shared memory.

• Causal - Unlike the sequential consistency model, in the causal consistency model all processes see in the same (correct) order only those memory reference operations that are potentially causally related.

• FIFO - Writes done by a single process are seen by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes.

FIFO consistency is called PRAM consistency in the case of distributed shared memory systems.
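The FIFO rule can be made concrete with a small checker. This is a sketch under simplifying assumptions: each observer's view is given as a list of `(process, write)` pairs, and an observer may have seen only a prefix of each process's writes. The name `is_fifo_consistent` is hypothetical.

```python
def is_fifo_consistent(issued, observed):
    """Check one observer's view against per-process issue order.

    issued:   {process: [writes in the order the process issued them]}
    observed: [(process, write), ...] in the order the observer saw them
    """
    position = {proc: 0 for proc in issued}
    for proc, write in observed:
        expected = issued[proc]
        index = position[proc]
        # The next write seen from this process must be its next issued write.
        if index >= len(expected) or expected[index] != write:
            return False
        position[proc] = index + 1
    return True
```

Two observers may interleave the writes of P1 and P2 differently and both still pass this check, which is exactly what separates FIFO consistency from sequential consistency.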

Multiprocessors - Flynn’s Taxonomy

• Single Instruction stream, Single Data stream (SISD)

• Conventional uniprocessor

• Although ILP is exploited

• Single Program Counter -> Single Instruction stream

• The data is not “streaming”

• Single Instruction stream, Multiple Data stream (SIMD)

• Popular for some applications like image processing

Flynn’s Taxonomy

• Multiple Instruction stream, Single Data stream (MISD)

• Until recently no processor that really fits this category

• “Streaming” processors; each processor executes a kernel on a stream of data

• Maybe VLIW?

• Multiple Instruction stream, Multiple Data stream (MIMD)

• The most general

• Covers:

• Shared-memory multiprocessors

Shared-memory Multiprocessors

• Shared-Memory = Single shared-address space (extension of uniprocessor; communication via Load/Store)

• Uniform Memory Access: UMA

• With a shared-bus, it’s the basis for SMP’s (Symmetric MultiProcessing)

Message-passing Systems

• Processors communicate by messages

• Primitives are of the form “send”, “receive”

• The user (programmer) has to insert the messages

• Message passing libraries

• Communication can be:

The Pros and Cons

• Shared-memory pros

• Ease of programming (SPMD: Single Program Multiple Data paradigm)

• Good for communication of small items

• Less overhead of O.S.

• Hardware-based cache coherence

• Message-passing pros

• Simpler hardware (more scalable)

• Easier for long messages

Multiprocessor Architecture

• UMA (Uniform Memory Access)

• Time to access each memory word is the same

• Bus-based UMA

• CPUs connected to memory modules through switches

• NUMA (Non-uniform memory access)

• Memory distributed (partitioned among processors)

• Migrating a process to a different processor can be costly when each core has a private cache. Why? The data the process had loaded into the old core's cache is not present in the new core's cache and must be fetched again.

• For this reason, some operating systems, such as Linux, offer a system call to specify that a process is tied to a processor, independently of the processors' load.
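On Linux, the affinity call mentioned above is `sched_setaffinity`, which Python exposes as `os.sched_setaffinity` on platforms that support it. The sketch below pins the calling process to a single CPU; the helper name `pin_to_cpu` is hypothetical, and the function returns `None` where the call is unavailable.

```python
import os


def pin_to_cpu(cpu):
    """Tie the calling process to one CPU.

    Returns the resulting affinity set, or None if this platform
    does not provide the affinity calls.
    """
    if not hasattr(os, "sched_setaffinity"):
        return None
    os.sched_setaffinity(0, {cpu})   # pid 0 means the calling process
    return os.sched_getaffinity(0)
```

A caller would normally pick `cpu` from `os.sched_getaffinity(0)` so that the request names a CPU the process is already allowed to run on.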

• Based on how the CPUs see the architecture of the main memory, there are three classes of multiprocessors:

1. Uniform Memory Access (UMA) Multiprocessors.

2. Non-Uniform Memory Access (NUMA) Multiprocessors.

3. Cache Only Memory Access (COMA) Multiprocessors.

1. Uniform Memory Access (UMA) Multiprocessors

• In this type of architecture, all processors are connected to, or share, a unique centralized primary memory.

• Since all processors share the same memory organization, each CPU has the same memory access time, hence the name Uniform Memory Access (UMA) or Symmetric Shared-Memory Multiprocessors (SMP).

Shared Bus

1. Uniform Memory Access (UMA) Multiprocessors (continued)

Crossbar Switch Uniform Memory Access (UMA)

• A switch is located at each crosspoint between a vertical and a horizontal line, allowing the CPU and memory to communicate with each other when required.

2. Non-Uniform Memory Access (NUMA) Multiprocessors

• In this type of architecture, we have a shared logical address space. However, the physical memory is distributed among the CPUs, so that the access time to data depends on where the data is located, in local or in remote memory, hence the name Non-Uniform Memory Access (NUMA) or Distributed Shared Memory (DSM).

• It is used to build more scalable systems, since memory is distributed among the processors.

2. Non-Uniform Memory Access (NUMA) Multiprocessors (continued)

• There are two types of NUMA systems:

1. Non-Caching NUMA (NC-NUMA) Multiprocessors.

2. Cache-Coherent NUMA (CC-NUMA) Multiprocessors.

• In an NC-NUMA system, processors have no local cache.

• Each memory access is managed with a modified MMU, which checks whether the request is for a local or for a remote block.

2. Non-Uniform Memory Access (NUMA) Multiprocessors (continued)

2. Cache-Coherent NUMA (CC-NUMA) Multiprocessors.

• In CC-NUMA, caching can alleviate the problem due to remote data access, but brings back the cache coherency issue.

• The common approach to enforce cache coherency in a CC-NUMA system with many CPUs is the directory-based protocol, where each node in the system has a directory for its RAM blocks: a database stating in which cache a block is located, and what is the state
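A directory of this kind can be sketched as a per-node table keyed by block. The version below is an illustration only: the state names ("uncached", "shared", "exclusive") and the class name `Directory` are assumptions, and the write-back an exclusive owner would perform on a remote read is not modeled.

```python
class Directory:
    """Per-node directory: which caches hold each local block, and in what state."""

    def __init__(self):
        self.entries = {}  # block -> {"state": str, "sharers": set of CPU ids}

    def _entry(self, block):
        return self.entries.setdefault(
            block, {"state": "uncached", "sharers": set()}
        )

    def read(self, block, cpu):
        entry = self._entry(block)
        # An exclusive owner would be asked to write back first (not modeled).
        entry["sharers"].add(cpu)
        entry["state"] = "shared"
        return entry

    def write(self, block, cpu):
        entry = self._entry(block)
        invalidated = entry["sharers"] - {cpu}  # only these caches get a message
        entry["sharers"] = {cpu}
        entry["state"] = "exclusive"
        return invalidated
```

The point of the directory is visible in `write`: invalidations go only to the caches recorded as sharers, rather than being broadcast to every CPU as on a snoopy bus.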

UMA VS. NUMA

• As in UMA systems, in NUMA systems too all CPUs share the same address space, but each processor has a local memory attached to it that is visible to all other processors.

• So, unlike UMA systems, in NUMA systems access to local memory blocks is quicker than access to remote memory blocks.
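The local versus remote difference can be turned into a rough average access time: weight each latency by the fraction of accesses it serves. This is a back-of-the-envelope sketch; the function name and its parameters are hypothetical, and any latency values passed in are the caller's own estimates, not figures from the slides.

```python
def average_access_ns(local_fraction, local_ns, remote_ns):
    """Average memory access time when a fraction of accesses hit local memory."""
    if not 0.0 <= local_fraction <= 1.0:
        raise ValueError("local_fraction must be between 0 and 1")
    return local_fraction * local_ns + (1.0 - local_fraction) * remote_ns
```

The closer `local_fraction` is to 1, the closer a NUMA machine behaves to a machine with only local memory, which is why data placement matters on such systems.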

3. Cache Only Memory Access (COMA) Multiprocessors

• In this type of architecture, data have no specific “permanent” location (no specific memory address) where they stay and whence they can be read (copied into local caches) and/or modified (first in the cache and then updated at their “permanent” location).

• Here data can migrate and/or can be replicated in the various memory banks of the central main memory.

• When a processor accesses a data item, its logical address is translated into the physical address, and the content of the memory location containing the data is copied into the cache of the processor, where it can be read and/or modified

3. Cache Only Memory Access (COMA) Multiprocessors (continued)

• In UMA systems, however, centralized memory causes a bottleneck and limits the interconnection between CPU and memory, and its scalability.

• Therefore, to overcome these problems, in COMA systems the relationship between memory and CPU is managed in a different manner.

• In COMA, there is no longer a “home address”, and the entire physical address space is considered a huge, single cache.

• Data can migrate (moving, not being copied) within the whole system, from one memory bank to another, according to the request of a specific CPU that requires that data

Multiprocessor OS

• How should OS software be organized?

• OS should handle allocation of processes to processors. Challenge due to shared data structures such as process tables and ready queues

• OS should handle disk I/O for the system as a whole

• Two standard architectures

• Master-slave

Master-Slave Organization

• Master CPU runs kernel, all others run user processes

• Only one copy of all OS data structures

Symmetric Multiprocessing (SMP)

• Only one kernel space, but OS can run on any CPU

• Whenever a user process makes a system call, the same CPU runs OS to process it

• Key issue: Multiple system calls can run in parallel on different CPUs

• Need locks on all OS data structures to ensure mutual exclusion for critical updates

• Design issue: OS routines should have independence so that level of granularity for locking gives good performance

Synchronization

• Recall: Mutual exclusion solutions to protect critical regions involving updates to shared data structures

• Classical single-processor solutions

• Disable interrupts

• Powerful instructions such as Test&Set (TSL)

• Software solution such as Peterson’s algorithm

• In multiprocessor setting, competing processes can all be OS routines (e.g., to update process table)

• Disabling interrupts is not relevant as there are multiple CPUs
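A Test&Set based spin lock can be sketched as follows. Python has no TSL instruction, so the class `TslCell` below is a hypothetical stand-in that uses an internal `threading.Lock` only to emulate the atomicity the hardware would provide; the spin lock itself is built purely on `test_and_set`. The sleep-based exponential backoff and the `max_delay` parameter are illustrative choices.

```python
import threading
import time


class TslCell:
    """Stand-in for a memory word with an atomic test-and-set instruction."""

    def __init__(self):
        self._value = False
        self._atomic = threading.Lock()  # emulates hardware atomicity only

    def test_and_set(self):
        # Atomically return the old value and set the word to True.
        with self._atomic:
            old, self._value = self._value, True
            return old

    def clear(self):
        with self._atomic:
            self._value = False


class SpinLock:
    def __init__(self, max_delay=0.001):
        self._cell = TslCell()
        self._max_delay = max_delay

    def acquire(self):
        delay = 1e-6
        while self._cell.test_and_set():   # old value True: lock is held
            time.sleep(delay)              # back off instead of retrying at once
            delay = min(delay * 2, self._max_delay)

    def release(self):
        self._cell.clear()
```

Doubling the delay after each failed attempt reduces how often waiting CPUs hit the shared word, at the cost of reacting more slowly when the lock is released.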

Busy-Waiting vs Process switch

• In single-processors, if a process is waiting to acquire lock, OS schedules another ready process

• This may not be optimal for multiprocessor systems

• If OS itself is waiting to acquire ready list, then switching impossible

• Switching may be possible, but involves acquiring locks, and thus, is expensive

• OS must decide whether to switch (choice between spinning and switching)

• spinning wastes CPU cycles

• switching uses up CPU cycles also

• possible to make separate decision each time locked mutex

Multiprocessors: Summary

• Set of processors connected over a bus with shared memory modules

• Architecture of bus and switches important for efficient memory access

• Caching essential; to manage multiple caches, cache coherence protocol necessary (e.g. Snoopy)

• Symmetric Multiprocessing (SMP) allows OS to run on different CPUs concurrently

• Synchronization issues: OS components work on shared data structures

• TSL based solution to ensure mutual exclusion

• Spin locks (i.e. busy waiting) with exponential backoff to reduce

Scheduling

• Recall: Standard scheme for single-processor scheduling

• Make a scheduling decision when a process blocks/exits or when a clock interrupt happens indicating end of time quantum

• Scheduling policy needed to pick among ready processes, e.g. multi-level priority (queues for each priority level)

• In multiprocessor system, scheduler must pick among ready processes and also a CPU

• Natural scheme: when a process executing on CPU k

Issues for Multiprocessor Scheduling

• If a process is holding a lock, it is unwise to switch it even if time quantum expires

• Locality issues

• If a process p is assigned to CPU k, then CPU k may hold memory blocks relevant to p in its cache, so p should be assigned to CPU k whenever possible

• If a set of threads/processes communicate with one another then it is advantageous to schedule them together

• Solutions

• Space sharing by allocating CPUs in partitions
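The locality point, that p should return to CPU k whenever possible, can be sketched as a tiny CPU-selection rule. This is one possible affinity policy for illustration, not the scheme from the slides; the function name `pick_cpu` and the `last_cpu` map are hypothetical.

```python
def pick_cpu(process, idle_cpus, last_cpu):
    """Choose a CPU for a ready process, preferring the one it last ran on.

    idle_cpus: set of CPU ids that are currently free
    last_cpu:  {process: CPU id it most recently ran on}
    """
    preferred = last_cpu.get(process)
    if preferred in idle_cpus:
        return preferred              # cache may still hold the process's blocks
    return min(idle_cpus) if idle_cpus else None
```

When the preferred CPU is busy, this rule falls back to any idle CPU, trading cache warmth for not leaving the process waiting.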

Multicomputers

• Definition: Tightly-coupled CPUs that do not share memory

• Communication by high-speed interconnect via messages

• Also known as

• cluster computers

Switching Schemes

• Messages are transferred in chunks called packets

• Store and forward packet switching

• Each switch collects bits on input line, assembles the packet, and forwards it towards destination

• Each switch has a buffer to store packets

• Delays can be long

• Hot-potato routing: No buffering

• Necessary for optical communication links

• Circuit switching

• First establish a path from source to destination

• Pump bits on the reserved path at a high rate

• Wormhole routing

Interprocess Communication

• How can processes talk to each other on multi-computers?

• User-level considerations: ease of use etc.

• OS level consideration: efficient implementation

• Message passing

• Remote procedure calls (RPC)

Message-based Communication

• Minimum services provided

• send and receive commands

• These are blocking (synchronous) calls

(a) Blocking send call

User-level Communication Primitives

• Library Routines

• Send (destination address, buffer containing message)

• Receive (optional source address, buffer to store message)

• Design issues

• Blocking vs non-blocking calls

Blocking vs Non-blocking

• Blocking send: Sender process waits until the message is sent

• Disadvantage: Process has to wait

• Non-blocking send: Call returns control to sender immediately

• Buffer must be protected

• Possible ways of handling non-blocking send

• Copy into kernel buffer

• Interrupt sender upon completion of transmission

• Mark the buffer as read-only (at least a page long), copy on write
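The first option, copying into a kernel buffer, can be sketched with a queue standing in for kernel space. This is a single-machine illustration: the class name `Channel` is hypothetical, and `queue.Queue` only models the buffering, not a real network.

```python
import queue


class Channel:
    """Toy message channel with a non-blocking send and a blocking receive."""

    def __init__(self):
        self._kernel_buffers = queue.Queue()

    def send_nonblocking(self, user_buffer):
        # Copy the message so the sender may reuse its buffer right away.
        self._kernel_buffers.put(bytes(user_buffer))

    def receive(self):
        # Blocks until a message is available.
        return self._kernel_buffers.get()
```

Because the message is copied before `send_nonblocking` returns, the sender can overwrite its own buffer immediately without corrupting the message in flight; the price is the extra copy.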

Buffers and Copying

• Network interface card has its own buffers

• Copy from RAM to sender’s card

• Store-and-forward switches may involve copying

• Copy from receiver’s card to RAM

• Copying slows down end-to-end communication

• Copying not an issue in disk I/O due to slow speed

• Additional problem: should message be copied from sender process buffer to kernel space?

• User pages can be swapped out

• Typical solutions

• Programmed I/O for small packets

The Problem with Messages

• Messages are flexible, but

• They are not a natural programming model

• Programmers have to worry about message formats

• messages must be packed and unpacked

• messages have to be decoded by server to figure out what is requested

• messages are often asynchronous

Remote Procedure Call

• Procedure call is a more natural way to communicate

• every language supports it

• semantics are well defined and understood

• natural for programmers to use

• Basic idea of RPC (Remote Procedure Call)

• define a server as a module that exports a set of procedures that can be called by client programs.

Remote Procedure Call

• Use procedure call as a model for distributed communication

• RPCs can offer a good programming abstraction to hide low-level communication details

• Goal - make RPC look as much like a local procedure call as possible

• Many issues:

• how do we make this invisible to the programmer?

• what are the semantics of parameter passing?

• how is binding done (locating the server)?

• how do we support heterogeneity (OS, arch., language)?

• how to deal with failures?
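The stub idea behind RPC can be sketched in a few lines. In this illustration everything runs in one process: the "network" is just a function call, parameters are marshalled as JSON, and the names `export`, `server_dispatch` and `ClientStub` are hypothetical. Binding, heterogeneity and real failure handling are not addressed.

```python
import json

PROCEDURES = {}  # procedures the server module exports, by name


def export(fn):
    """Register a function as remotely callable."""
    PROCEDURES[fn.__name__] = fn
    return fn


@export
def add(a, b):
    return a + b


def server_dispatch(request_bytes):
    """Server stub: unpack the request, call the procedure, pack the reply."""
    request = json.loads(request_bytes)
    try:
        result = PROCEDURES[request["proc"]](*request["args"])
        reply = {"ok": True, "result": result}
    except Exception as exc:
        reply = {"ok": False, "error": str(exc)}
    return json.dumps(reply).encode()


class ClientStub:
    """Client stub: makes a remote procedure look like a local method."""

    def __init__(self, transport):
        self._transport = transport  # callable: request bytes -> reply bytes

    def __getattr__(self, name):
        def call(*args):
            request = json.dumps({"proc": name, "args": args}).encode()
            reply = json.loads(self._transport(request))
            if not reply["ok"]:
                raise RuntimeError(reply["error"])
            return reply["result"]
        return call
```

A client would write `ClientStub(server_dispatch).add(2, 3)` and never see the packing and unpacking; replacing `server_dispatch` with a function that sends bytes over a socket would leave the calling code unchanged.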

Shared memory vs. message passing

• Message passing

• better performance

• know when and what msgs sent: control, knowledge

• Shared memory

• familiar

• hides details of communication

• no need to name receivers or senders, just write to specific memory address and read later

• caching for “free”

• porting from centralized system (the original “write once run anywhere”)

• no need to rewrite when adding processes, scales because adds memory for each node

• Initial implementation correct (agreement is reached at

Distributed Shared Memory (DSM)

Replication

(a) Pages distributed on 4 machines (b) CPU 0 reads page 10

Distributed Shared Memory (DSM)

• data in shared address space accessed as in traditional VM.

• mapping manager -- maps the shared address space to the physical address space.

• Advantage of DSM

• no explicit communication primitives, send and receive, needed in program. It is believed to be easier to design and write parallel algorithms using DSM

• complex data structure can be passed by reference.

• moving the page containing the data takes advantage of locality and reduces communication overhead.

DSM Implementation Issues

• Recall: In virtual memory, OS hides the fact that pages may reside in main memory or on disk

• Recall: In multiprocessors, there is a single shared memory (possibly virtual) accessed by multiple CPUs. There may be multiple caches, but cache coherency protocols hide this from applications

• how to make shared data concurrently accessible

• DSM: Each machine has its own physical memory, but virtual memory is shared, so pages can reside in any memory or on disk

• how to keep track of the location of shared data

• On page fault, OS can fetch the page from remote memory
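The page-fault path can be sketched as follows. This is a toy model under stated assumptions: a shared `owner_of` map plays the role of the location tracker, pages migrate to the faulting node rather than being replicated, and the class name `DsmNode` is hypothetical.

```python
class DsmNode:
    """One machine in a toy DSM system with page migration."""

    def __init__(self, node_id, owner_of):
        self.node_id = node_id
        self.owner_of = owner_of   # shared map: page -> node currently holding it
        self.local_pages = {}      # pages resident in this node's memory
        self.faults = 0

    def read(self, page, nodes):
        if page not in self.local_pages:           # page fault
            self.faults += 1
            owner = nodes[self.owner_of[page]]
            # Fetch the page from remote memory and take over ownership.
            self.local_pages[page] = owner.local_pages.pop(page)
            self.owner_of[page] = self.node_id
        return self.local_pages[page]
```

After the first access the page is local, so repeated reads by the same node cause no further faults; two nodes that alternate accesses to the same page would move it back and forth on every access.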

Distributed Shared Memory

• Note layers where it can be implemented

• hardware

• operating system

Cache/Memory Coherence and Consistency

• Coherence: every cache/CPU must have a coherent view of memory

• If P writes X to A, then reads A, if no other proc writes A, then P reads X

• If P1 writes X to A, and no other processor writes to A, then P2 will eventually read X from A.

• If P1 writes X to A, and P2 writes Y to A, then every processor will either read X then Y, or Y then X, but all will see the writes in the same order.

• Consistency: memory consistency model tells us when

False sharing in DSM

• False Sharing

Load Balancing

• In a multicomputer setting, system must determine assignment of processes to machines

• Formulation as an optimization problem:

• Each process has estimated CPU and memory requirements

• For every pair of processes, there is an estimated traffic

• Goal: Given k machines, cluster the processes into k clusters such that

• Traffic between clusters is minimized

• Aggregate memory/CPU requirements of processes within each
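The objective can be sketched as a function that scores one candidate assignment. It assumes, as an interpretation of the constraint above, that the aggregate CPU requirement on each machine must not exceed a per-machine capacity; the function name `evaluate` and all parameter names are hypothetical, and only CPU (not memory) is checked to keep the sketch short.

```python
def evaluate(assignment, traffic, cpu_need, capacity):
    """Score an assignment of processes to machines.

    assignment: {process: machine}
    traffic:    {(p, q): estimated traffic between processes p and q}
    cpu_need:   {process: estimated CPU requirement}
    capacity:   CPU capacity of each machine (assumed equal for all)

    Returns (traffic crossing machine boundaries, whether capacity holds).
    """
    cut = sum(
        volume
        for (p, q), volume in traffic.items()
        if assignment[p] != assignment[q]
    )
    load = {}
    for process, machine in assignment.items():
        load[machine] = load.get(machine, 0) + cpu_need[process]
    feasible = all(total <= capacity for total in load.values())
    return cut, feasible
```

A load balancer would search over assignments for the feasible one with the smallest cut; this function only evaluates a given candidate.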