• No results found

Share Memory Multiprocessor UNIT-6.pptx

N/A
N/A
Protected

Academic year: 2020

Share "Share Memory Multiprocessor UNIT-6.pptx"

Copied!
69
0
0

Loading.... (view fulltext now)

Full text

(1)
(2)

A system with multiple CPUs "Sharing" the same main memory is called

multiprocessor.

In a multiprocessor system all processes on the various CPUs share a

unique logical address space, which is mapped on a physical memory that can be distributed among the processors.

Each process can read and writ a data item simply using load and store

operation, and process communication is through share memory.

It the hardware that makes all CPUs access and use the same main

memory

(3)

3

Systems with Multiple CPUs

Collection of independent CPUs (or computers) that

appears to the users/applications as a single system

Technology trends

Powerful, yet cheap, microprocessors

Advances in communications

Physical limits on computing power of a single CPU

Examples

Network of workstations

Servers with multiple processors

Network of computers of a company

(4)

4

Advantages

Data sharing: allows many users to share a common data

base

Resource sharing: expensive devices such as a color printerParallelism and speed-up: multiprocessor system can have

more computing power than a mainframe

Better price/performance ratio than mainframes

Reliability: Fault-tolerance can be provided against crashes

of individual machines

Flexibility: spread the workload over available machines • Modular expandability: Computing power can be added in

(5)

5

Design Issues

Transparency: How to achieve a single-system image

How to hide distribution of memory from applications?

How to maintain consistency of data?

Performance

How to exploit parallelism?

How to reduce communication delays?

Scalability: As more components (say, processors) are

added, performance should not degrade

(6)

6

Classification

• Multiprocessors

• Multiple CPUs with shared memory

• Memory access delays about 10 – 50 nsec

• Multicomputers

• Multiple computers, each with own CPU and memory, connected by a high-speed interconnect

• Tightly coupled with delays in micro-seconds

• Distributed Systems

• Loosely coupled systems connected over Local Area Network (LAN), or even long-haul networks such as Internet

(7)

7

(8)

8

Multiprocessor Systems

Multiple CPUs with a shared memory

From an application’s perspective, difference with

single-processor system need not be visible

Virtual memory where pages may reside in memories

associated with other CPUs

(9)

There are three issues in

particular

Cache coherenceSynchronization

(10)
(11)

The Cache Coherence Problem

In a multiprocessor system, data inconsistency may occur among

adjacent levels or within the same level of the memory hierarchy. For example, the cache and the main memory may have inconsistent

copies of the same object.

As multiple processors operate in parallel, and independently multiple

caches may possess different copies of the same memory block, this creates cache coherence problem. Cache coherence schemes help to avoid this problem by maintaining a uniform state for each cached

(12)
(13)

Snoopy Bus Protocols

Snoopy protocols achieve data consistency between the cache memory and the

shared memory through a bus-based memory system.

(14)

In this case, we have three processors P1, P2, and P3 having a consistent copy of data

(15)

Cache Events and Actions

Following events and actions occur on the execution of memory-access and invalidation commands −

Read-miss − When a processor wants to read a block and it is not in the cache, a read-miss occurs. This initiates a bus-read operation. If no dirty copy exists, then the main memory that has a consistent copy, supplies a copy to the requesting cache memory. If a dirty copy exists in a remote cache memory, that cache will restrain the main memory and send a copy to the requesting cache memory. In both the cases, the cache copy will enter the valid state after a read miss.

Write-hit − If the copy is in dirty or reserved state, write is done locally and the new state is dirty. If the new state is valid, write-invalidate command is

(16)

Write-miss − If a processor fails to write in the local cache memory, the copy

must come either from the main memory or from a remote cache memory with a dirty block. This is done by sending a read-invalidate command, which will invalidate all cache copies. Then the local copy is updated with dirty state.

Read-hit − Read-hit is always performed in local cache memory without

causing a transition of state or using the snoopy bus for invalidation.

Block replacement − When a copy is dirty, it is to be written back to the main

(17)

Hardware Synchronization

Mechanisms

Synchronization is a special form of communication where instead

of data control, information is exchanged between communicating processes residing in the same or different processors.

Multiprocessor systems use hardware mechanisms to implement

low-level synchronization operations. Most multiprocessors have hardware mechanisms to impose atomic operations such as

memory read, write or read-modify-write operations to

implement some synchronization primitives. Other than atomic

(18)

Cache Coherency in Shared Memory Machines

Maintaining cache coherency is a problem in multiprocessor system when the processors contain local cache memory. Data inconsistency between different caches easily occurs in this system.

The major concern areas are −

Sharing of writable dataProcess migration

(19)

Sharing of writable data

When two processors (P1 and P2) have same data element (X) in their local caches and

(20)

Process migration

In the first stage, cache of P1 has data element X, whereas P2 does not have

anything. A process on P2 first writes on X and then migrates to P1. Now, the

process starts reading data element X, but as the processor P1 has outdated data the process cannot read it. So, a process on P1 writes to the data element X and then migrates to P2. After migration, a process on P2 starts reading the data

(21)

I/O activity

• As illustrated in the figure, an I/O device is added to the bus in a two-processor multiprocessor architecture. In the beginning, both the caches contain the data element X. When the I/O

(22)

Cache Coherence

Since all the processors share the same address space, it is possible

for more than one processor to cache an address at the same time. (coherence issue )

(23)

Synchronization issues.

Synchronization mechanisms are typically built with user-level software routines

that rely on hardware supplied synchronization instructions.

For smaller multiprocessors or low-contention situations, instruction sequence

capable of atomically retrieving.

In larger-scale multiprocessors or high-contention situations, synchronization can

(24)

Types of Synchronization.

Mutual exclusion.

Synchronize entry into critical sections.Normally done with locks.

Point-to-point synchronization.

Tell a set of processors (normally set cardinality is one) that they can proceed.Normally done with flags.

Global synchronization.

Bring every processor to sync.

(25)

Memory Consistency Model

A memory consistency model is a set of rules which specify when a

written value by one thread can be read by another thread.

The memory consistency model affects

System implementation: hardware, OS, languages, compilers Programming correctness

(26)

Types

Strict consistency- A shared-memory system is said to support the strict consistency model if the value returned by a read operation on a memory address is always the same as the value written by the most recent write operation to that address

Sequential - A shared-memory system is said to support the sequential consistency model if all processes see the same order of all memory access operations on the shared memory

Casual - Unlike the sequential consistency model, in the causal consistency model, all

processes see only those memory reference operations in the same (correct) order that are potentially causally related.

FIFO-Writes done by a single process are seen by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes.

FIFO consistency is called PRAM consistency in the case of distributed shared memory systems

(27)
(28)

Multiprocessors - Flynn’s Taxonomy

Single Instruction stream, Single Data stream (SISD)

Conventional uniprocessorAlthough ILP is exploited

Single Program Counter -> Single Instruction stream • The data is not “streaming”

Single Instruction stream, Multiple Data stream (SIMD)

Popular for some applications like image processing

(29)

Flynn’s Taxonomy

• Multiple Instruction stream, Single Data stream (MISD)

• Until recently no processor that really fits this category

• “Streaming” processors; each processor executes a kernel on a stream of data

• Maybe VLIW?

• Multiple Instruction stream, Multiple Data stream

(MIMD)

• The most general

• Covers:

• Shared-memory multiprocessors

(30)

Shared-memory Multiprocessors

Shared-Memory = Single shared-address space (extension of

uniprocessor; communication via Load/Store)

Uniform Memory Access: UMA

With a shared-bus, it’s the basis for SMP’s (Symmetric

MultiProcessing)

(31)

Message-passing Systems

Processors communicate by messages

Primitives are of the form “send”, “receive”

The user (programmer) has to insert the messagesMessage passing libraries

Communication can be:

(32)

The Pros and Cons

Shared-memory pros

Ease of programming (SPMD: Single Program Multiple Data paradigm)Good for communication of small items

Less overhead of O.S.

Hardware-based cache coherence

Message-passing pros

Simpler hardware (more scalable)easier for long messages

(33)

33

Multiprocessor Architecture

• UMA (Uniform Memory Access)

• Time to access each memory word is the same

• Bus-based UMA

• CPUs connected to memory modules through switches

• NUMA (Non-uniform memory access)

• Memory distributed (partitioned among processors)

(34)

Migrating a process to a different processor can be costly when each

core has a private cache. Why?

Because some Operating System, such as Linux, offer a system call to

specify that a process is tied to the processor, independently of the processors load.

However, based on how CPU sees the architecture of the main

memory, there are three classes of multiprocessors:

1. Uniform Memory Acess(UMA) Multiprocessors.

2. Non-Uniform Memory Acess(NUMA) Multiprocessors. 3. Cache Only Memory Acess(COMA) Multiprocessors.

(35)

1. Uniform Memory Acess(UMA) Multiprocessors

In this type of architecture, all processors are connected or shrare a unicque

centralized primary memory.

Since all processors share the same memory organization. Therefore, each

CPU has the same memory access time -> Uniform Memory Acess (UMA) or Symmetric Shared-Memory Multiprocessors(SMP)

Shared Bus

(36)

1. Uniform Memory Acess(UMA) Multiprocessors-Continue Crossbar Switch Uniform Memory Acess(UMA)

A switch is located at each crosspoint between a vertical and a horizontal line,

allowing the CPU and Memory to communicate to each other, when required.

(37)

2. Non-Uniform Memory Acess(NUMA) Multiprocessors

In this type of architectures or systems , we have a shared logical address

space. However, the physical memory is distribuited among CPUs, so that access time to data depends on data position, in local or in a remote memory -> Non-Uniform Memory Access(NUMA) denomination or

Distribuited Shared Memory(DSM).

It is used to build higher scalability & memory is distribuited among processors

(38)

2. Non-Uniform Memory Acess(NUMA) Multiprocessors-Continue

There are two types of NUMA systems:

1. Non-Caching NUMA(NC-NUMA) Multiprocessors. 2. Cache-Coherent NUMA(CC-NUMA) Multiprocessors.

In NC-NUMA system, processors have no local cache.

Eahc memory access is managed with a modified MMU, which

controls if the request is for a local or for a remote block.

(39)

2. Non-Uniform Memory Acess(NUMA)-Continue

2. Cache-Coherent NUMA(CC-NUMA) Multiprocessors.

In CC-NUMA, caching can allevite the problem due to remote

data access, but brings back the cache coherency issue.

The common approach in CC-NUMA system with many CPUs to

enforce cache coherency is the directory-based protocol, where each node in the system with a directory for its RAM blocks: a database stating in which cach is located a block, and what is the state

(40)

UMA VS. NUMA

As in UMA systems,, in NUMA system too all CPUs share the same address space, but

each processor has a local memory attached to it, and visible to all others processors.

 So, differentlu from UMA systems, in NUMA systems access to local memory blocks is quicker than access to remote memory blocks

(41)

3. Cache Only Memory Acess(COMA) Multiprocessors

In this type of architectures or systems , data have no specific “permanent” location(no specific memory address) where they stay and whence they can be read(copied into local caches) and/or modified(first in the cache and the updated at their “permanent” location.

Here DATA can migrate and/or can be replicated in the various memory

banks of the central main memory.

When processor accesses a data item, its logical address is translated into

the physical address, and the content of the memory location containing the data is copied into the cache of the processor, where it can be read and/or modified

(42)

3. Cache Only Memory Acess(COMA) Multiprocessors-Continue

However, In UMA systems, centralized memory causes a bottleneck, and

limits the interconnection between CPU and memory, and its scalability.

Therefore, to overcome these problems, in COMA systems the relationship

between memory and CPU is managed in different manner.

In COMA, there is no longer “home address”, and the entire physical

address space is considered a huge, single cache.

DATA can migrate(moving, not being copied) within the whole system, from

a memory bank to another, according to the request of a specific CPU, that requires that data

(43)

43

Multiprocessor OS

How should OS software be organized?

OS should handle allocation of processes to processors.

Challenge due to shared data structures such as process tables and ready queues

OS should handle disk I/O for the system as a whole Two standard architectures

Master-slave

(44)

44

Master-Slave Organization

Master CPU runs kernel, all others run user processesOnly one copy of all OS data structures

(45)

45

Symmetric Multiprocessing (SMP)

Only one kernel space, but OS can run on any CPU

• Whenever a user process makes a system call, the same CPU runs OS to process it

• Key issue: Multiple system calls can run in parallel on different CPUs

• Need locks on all OS data structures to ensure mutual exclusion for critical updates

• Design issue: OS routines should have independence so that level of granularity for locking gives good performance

(46)

46

Synchronization

Recall: Mutual exclusion solutions to protect critical

regions involving updates to shared data structures

Classical single-processor solutions

Disable interrupts

Powerful instructions such as Test&Set (TSL)Software solution such as Peterson’s algorithm

In multiprocessor setting, competing processes can all

be OS routines (e.g., to update process table)

Disabling interrupts is not relevant as there are

multiple CPUs

(47)

47

Busy-Waiting vs Process switch

In single-processors, if a process is waiting to acquire

lock, OS schedules another ready process

This may not be optimal for multiprocessor systems

If OS itself is waiting to acquire ready list, then switching

impossible

Switching may be possible, but involves acquiring locks, and

thus, is expensive

OS must decide whether to switch (choice between

spinning and switching)

spinning wastes CPU cycles

switching uses up CPU cycles also

possible to make separate decision each time locked mutex

(48)

48

Multiprocessors: Summary

Set of processors connected over a bus with shared memory modules

Architecture of bus and switches important for efficient memory access

Caching essential; to manage multiple caches, cache coherence protocol necessary (e.g. Snoopy)

Symmetric Multiprocessing (SMP) allows OS to run on different CPUs concurrently

Synchronization issues: OS components work on shared data structures

TSL based solution to ensure mutual exclusion

Spin locks (i.e. busy waiting) with exponential backoff to reduce

(49)

49

Scheduling

Recall: Standard scheme for single-processor scheduling

Make a scheduling decision when a process blocks/exits or

when a clock interrupt happens indicating end of time quantum

Scheduling policy needed to pick among ready processes, e.g.

multi-level priority (queues for each priority level)

In multiprocessor system, scheduler must pick among

ready processes and also a CPU

Natural scheme: when a process executing on CPU k

(50)

50

Issues for Multiprocessor Scheduling

If a process is holding a lock, it is unwise to switch it even

if time quantum expires

Locality issues

If a process p is assigned to CPU k, then CPU k may hold

memory blocks relevant to p in its cache, so p should be assigned to CPU k whenever possible

If a set of threads/processes communicate with one another

then it is advantageous to schedule them together

Solutions

Space sharing by allocating CPUs in partitions

(51)

51

(52)

52

Multicomputers

Definition:

Tightly-coupled CPUs that do not share memory

Communication by high-speed interconnect via

messages

Also known as

cluster computers

(53)

53

Switching Schemes

Messages are transferred in chunks called packetsStore and forward packet switching

Each switch collects bits on input line, assembles the packet, and

forwards it towards destination

• Each switch has a buffer to store packets

Delays can be long

Hot-potato routing: No buffering

Necessary for optical communication links

Circuit switching

First establish a path from source to destinationPump bits on the reserved path at a high rate

• Wormhole routing

(54)

54

Interprocess

Communication

How can processes talk to each other on

multi-computers?

User-level considerations: ease of use etc

OS level consideration: efficient implementation

Message passing

Remote procedure calls (RPC)

(55)

55

Message-based Communication

• Minimum services provided

• send and receive commands

• These are blocking (synchronous) calls

(a) Blocking send call

(56)

56

User-level Communication

Primitives

Library Routines

Send (destination address, buffer containing message)

Receive (optional source address, buffer to store message)

Design issues

Blocking vs non-blocking calls

(57)

57

Blocking vs Non-blocking

Blocking send: Sender process waits until the message is

sent

Disadvantage: Process has to wait

Non-blocking send: Call returns control to sender

immediately

Buffer must be protected

Possible ways of handling non-blocking send

Copy into kernel buffer

Interrupt sender upon completion of transmission`

Mark the buffer as read-only (at least a page long), copy on write

(58)

58

Buffers and Copying

Network interface card has its own buffers

Copy from RAM to sender’s card

Store-and-forward switches may involve copyingCopy from receiver’s card to RAM

Copying slows down end-to-end communication

Copying not an issue in disk I/O due to slow speed

Additional problem: should message be copied from

sender process buffer to kernel space?

User pages can be swapped out

Typical solutions

Programmed I/O for small packets

(59)

59

The Problem with Messages

Messages are flexible, but

They are not a natural programming model

Programmers have to worry about message formats

messages must be packed and unpacked

messages have to be decoded by server to figure out what is requestedmessages are often asynchronous

(60)

60

Remote Procedure Call

Procedure call is a more natural way to

communicate

every language supports it

semantics are well defined and understoodnatural for programmers to use

Basic idea of RPC (Remote Procedure Call)

define a server as a module that exports a set of

procedures that can be called by client programs. call

return

(61)

61

Remote Procedure Call

• Use procedure call as a model for distributed communication

• RPCs can offer a good programming abstraction to hide low-level communication details

• Goal - make RPC look as much like local PC as possible

• Many issues:

• how do we make this invisible to the programmer?

• what are the semantics of parameter passing?

• how is binding done (locating the server)?

• how do we support heterogeneity (OS, arch., language)?

• how to deal with failures?

(62)

62

Shared memory vs. message

passing

Message passing

better performance

know when and what msgs sent: control, knowledge

Shared memory

familiar

hides details of communication

no need to name receivers or senders, just write to

specific memory address and read later

caching for “free”

porting from centralized system (the original “write

once run anywhere”)

no need to rewrite when adding processs, scales

because adds memory for each node

Initial implementation correct (agreement is reached at

(63)

63

Distributed Shared Memory (DSM)

Replication

(a) Pages distributed on 4 machines (b) CPU 0 reads page 10

(64)

64

Distributed Shared Memory (DSM)

• data in shared address space accessed as in traditional VM.

• mapping manager -- maps the shared address space to the physical address space.

• Advantage of DSM

• no explicit comm. primitives, send and receive, needed in program. It is believed to be easier to design and write parallel alg's using DSM

complex data structure can be passed by reference.

• moving page containing the data take advantage of locality and reduce comm. overhead.

(65)

65

DSM Implementation Issues

• Recall: In virtual memory, OS hides the fact that pages may reside in main memory or on disk

• Recall: In multiprocessors, there is a single shared memory (possibly virtual) accessed by multiple CPUs. There may be multiple caches, but cache coherency protocols hide this from applications

• how to make shared data concurrently accessible

• DSM: Each machine has its own physical memory, but virtual memory is shared, so pages can reside in any memory or on disk

• how to keep track of the location of shared data

• On page fault, OS can fetch the page from remote memory

(66)

66

Distributed Shared Memory

•Note layers where it can be implemented

• hardware

• operating system

(67)

67

Cache/Memory Coherence and

Consistency

• Coherence: every cache/CPU must have a coherent

view of memory

If P writes X to A, then reads A, if no other proc writes A,

then P reads X

If P1 writes X to A, and no other processor writes to A, then

P2 will eventually read X from A.

If P1 writes X to A, and P2 writes Y to A, then every

processor will either read X then Y, or Y then X, but all will see the writes in the same order.

Consistency: memory consistency model tells us when

(68)

68

False sharing in DSM

False Sharing

(69)

69

Load Balancing

In a multicomputer setting, system must determine

assignment of processes to machines

Formulation as an optimization problem:

Each process has estimated CPU and memory requirements

For every pair of processes, there is an estimated traffic

Goal: Given k machines, cluster the processes into k

clusters such that

Traffic between clusters is minimized

Aggregate memory/CPU requirements of processes within each

References

Related documents

ENGLISH PLACEMENT TEST ENGLISH PLACEMENT TEST.. Mark the correct answer on the

The primary objective of this research effort was to develop and analytically demonstrate enhanced first generation active “fail-safe” hybrid flow-control techniques to

If we listen to a Bach fugue, indepen- dently of any other activity, we are listening to the functioning of pure musical codes, generating musical discourse; music on this

A write-back cache will save time for code with good temporal locality on writes. A write-through cache will always match data with the memory hierarchy level

• Direct Memory Access (DMA) operation can update main memory without going through cache.. ARM and

– Request for main memory access (read or write) – First, check cache for copy. •

In the Channel-SLAM algorithm [7], [8] summarized in Section II, every signal component, or propagation path, corresponds to one transmitter. However, the KEST algorithm may lose

If the RAID Controller’s write-back cache option is enabled, data is first written to the cache memory and the write is acknowledged, and then the RAID controller writes the