A system with multiple CPUs "Sharing" the same main memory is called
multiprocessor.
In a multiprocessor system all processes on the various CPUs share a
unique logical address space, which is mapped on a physical memory that can be distributed among the processors.
Each process can read and writ a data item simply using load and store
operation, and process communication is through share memory.
It the hardware that makes all CPUs access and use the same main
memory
3
Systems with Multiple CPUs
• Collection of independent CPUs (or computers) that
appears to the users/applications as a single system
• Technology trends
• Powerful, yet cheap, microprocessors
• Advances in communications
• Physical limits on computing power of a single CPU
• Examples
• Network of workstations
• Servers with multiple processors
• Network of computers of a company
4
Advantages
• Data sharing: allows many users to share a common data
base
• Resource sharing: expensive devices such as a color printer • Parallelism and speed-up: multiprocessor system can have
more computing power than a mainframe
• Better price/performance ratio than mainframes
• Reliability: Fault-tolerance can be provided against crashes
of individual machines
• Flexibility: spread the workload over available machines • Modular expandability: Computing power can be added in
5
Design Issues
• Transparency: How to achieve a single-system image
• How to hide distribution of memory from applications?
• How to maintain consistency of data?
• Performance
• How to exploit parallelism?
• How to reduce communication delays?
• Scalability: As more components (say, processors) are
added, performance should not degrade
6
Classification
• Multiprocessors
• Multiple CPUs with shared memory
• Memory access delays about 10 – 50 nsec
• Multicomputers
• Multiple computers, each with own CPU and memory, connected by a high-speed interconnect
• Tightly coupled with delays in micro-seconds
• Distributed Systems
• Loosely coupled systems connected over Local Area Network (LAN), or even long-haul networks such as Internet
7
8
Multiprocessor Systems
• Multiple CPUs with a shared memory
• From an application’s perspective, difference with
single-processor system need not be visible
• Virtual memory where pages may reside in memories
associated with other CPUs
There are three issues in
particular
• Cache coherence • Synchronization
The Cache Coherence Problem
• In a multiprocessor system, data inconsistency may occur among
adjacent levels or within the same level of the memory hierarchy. For example, the cache and the main memory may have inconsistent
copies of the same object.
• As multiple processors operate in parallel, and independently multiple
caches may possess different copies of the same memory block, this creates cache coherence problem. Cache coherence schemes help to avoid this problem by maintaining a uniform state for each cached
Snoopy Bus Protocols
• Snoopy protocols achieve data consistency between the cache memory and the
shared memory through a bus-based memory system.
• In this case, we have three processors P1, P2, and P3 having a consistent copy of data
Cache Events and Actions
Following events and actions occur on the execution of memory-access and invalidation commands −
• Read-miss − When a processor wants to read a block and it is not in the cache, a read-miss occurs. This initiates a bus-read operation. If no dirty copy exists, then the main memory that has a consistent copy, supplies a copy to the requesting cache memory. If a dirty copy exists in a remote cache memory, that cache will restrain the main memory and send a copy to the requesting cache memory. In both the cases, the cache copy will enter the valid state after a read miss.
• Write-hit − If the copy is in dirty or reserved state, write is done locally and the new state is dirty. If the new state is valid, write-invalidate command is
• Write-miss − If a processor fails to write in the local cache memory, the copy
must come either from the main memory or from a remote cache memory with a dirty block. This is done by sending a read-invalidate command, which will invalidate all cache copies. Then the local copy is updated with dirty state.
• Read-hit − Read-hit is always performed in local cache memory without
causing a transition of state or using the snoopy bus for invalidation.
• Block replacement − When a copy is dirty, it is to be written back to the main
Hardware Synchronization
Mechanisms
• Synchronization is a special form of communication where instead
of data control, information is exchanged between communicating processes residing in the same or different processors.
• Multiprocessor systems use hardware mechanisms to implement
low-level synchronization operations. Most multiprocessors have hardware mechanisms to impose atomic operations such as
memory read, write or read-modify-write operations to
implement some synchronization primitives. Other than atomic
Cache Coherency in Shared Memory Machines
Maintaining cache coherency is a problem in multiprocessor system when the processors contain local cache memory. Data inconsistency between different caches easily occurs in this system.
The major concern areas are −
• Sharing of writable data • Process migration
Sharing of writable data
• When two processors (P1 and P2) have same data element (X) in their local caches and
Process migration
• In the first stage, cache of P1 has data element X, whereas P2 does not have
anything. A process on P2 first writes on X and then migrates to P1. Now, the
process starts reading data element X, but as the processor P1 has outdated data the process cannot read it. So, a process on P1 writes to the data element X and then migrates to P2. After migration, a process on P2 starts reading the data
I/O activity
• As illustrated in the figure, an I/O device is added to the bus in a two-processor multiprocessor architecture. In the beginning, both the caches contain the data element X. When the I/O
Cache Coherence
• Since all the processors share the same address space, it is possible
for more than one processor to cache an address at the same time. (coherence issue )
Synchronization issues.
• Synchronization mechanisms are typically built with user-level software routines
that rely on hardware supplied synchronization instructions.
• For smaller multiprocessors or low-contention situations, instruction sequence
capable of atomically retrieving.
• In larger-scale multiprocessors or high-contention situations, synchronization can
Types of Synchronization.
• Mutual exclusion.
• Synchronize entry into critical sections. • Normally done with locks.
• Point-to-point synchronization.
• Tell a set of processors (normally set cardinality is one) that they can proceed. • Normally done with flags.
• Global synchronization.
• Bring every processor to sync.
Memory Consistency Model
• A memory consistency model is a set of rules which specify when a
written value by one thread can be read by another thread.
• The memory consistency model affects
• System implementation: hardware, OS, languages, compilers • Programming correctness
Types
• Strict consistency- A shared-memory system is said to support the strict consistency model if the value returned by a read operation on a memory address is always the same as the value written by the most recent write operation to that address
• Sequential - A shared-memory system is said to support the sequential consistency model if all processes see the same order of all memory access operations on the shared memory
• Casual - Unlike the sequential consistency model, in the causal consistency model, all
processes see only those memory reference operations in the same (correct) order that are potentially causally related.
• FIFO-Writes done by a single process are seen by all other processes in the order in which they were issued, but writes from different processes may be seen in a different order by different processes.
FIFO consistency is called PRAM consistency in the case of distributed shared memory systems
Multiprocessors - Flynn’s Taxonomy
• Single Instruction stream, Single Data stream (SISD)
• Conventional uniprocessor • Although ILP is exploited
• Single Program Counter -> Single Instruction stream • The data is not “streaming”
• Single Instruction stream, Multiple Data stream (SIMD)
• Popular for some applications like image processing
Flynn’s Taxonomy
• Multiple Instruction stream, Single Data stream (MISD)
• Until recently no processor that really fits this category
• “Streaming” processors; each processor executes a kernel on a stream of data
• Maybe VLIW?
• Multiple Instruction stream, Multiple Data stream
(MIMD)
• The most general
• Covers:
• Shared-memory multiprocessors
Shared-memory Multiprocessors
• Shared-Memory = Single shared-address space (extension of
uniprocessor; communication via Load/Store)
• Uniform Memory Access: UMA
• With a shared-bus, it’s the basis for SMP’s (Symmetric
MultiProcessing)
Message-passing Systems
• Processors communicate by messages
• Primitives are of the form “send”, “receive”
• The user (programmer) has to insert the messages • Message passing libraries
• Communication can be:
The Pros and Cons
• Shared-memory pros
• Ease of programming (SPMD: Single Program Multiple Data paradigm) • Good for communication of small items
• Less overhead of O.S.
• Hardware-based cache coherence
• Message-passing pros
• Simpler hardware (more scalable) • easier for long messages
33
Multiprocessor Architecture
• UMA (Uniform Memory Access)
• Time to access each memory word is the same
• Bus-based UMA
• CPUs connected to memory modules through switches
• NUMA (Non-uniform memory access)
• Memory distributed (partitioned among processors)
Migrating a process to a different processor can be costly when each
core has a private cache. Why?
Because some Operating System, such as Linux, offer a system call to
specify that a process is tied to the processor, independently of the processors load.
However, based on how CPU sees the architecture of the main
memory, there are three classes of multiprocessors:
1. Uniform Memory Acess(UMA) Multiprocessors.
2. Non-Uniform Memory Acess(NUMA) Multiprocessors. 3. Cache Only Memory Acess(COMA) Multiprocessors.
1. Uniform Memory Acess(UMA) Multiprocessors
In this type of architecture, all processors are connected or shrare a unicque
centralized primary memory.
Since all processors share the same memory organization. Therefore, each
CPU has the same memory access time -> Uniform Memory Acess (UMA) or Symmetric Shared-Memory Multiprocessors(SMP)
Shared Bus
1. Uniform Memory Acess(UMA) Multiprocessors-Continue Crossbar Switch Uniform Memory Acess(UMA)
A switch is located at each crosspoint between a vertical and a horizontal line,
allowing the CPU and Memory to communicate to each other, when required.
2. Non-Uniform Memory Acess(NUMA) Multiprocessors
In this type of architectures or systems , we have a shared logical address
space. However, the physical memory is distribuited among CPUs, so that access time to data depends on data position, in local or in a remote memory -> Non-Uniform Memory Access(NUMA) denomination or
Distribuited Shared Memory(DSM).
It is used to build higher scalability & memory is distribuited among processors
2. Non-Uniform Memory Acess(NUMA) Multiprocessors-Continue
There are two types of NUMA systems:
1. Non-Caching NUMA(NC-NUMA) Multiprocessors. 2. Cache-Coherent NUMA(CC-NUMA) Multiprocessors.
In NC-NUMA system, processors have no local cache.
Eahc memory access is managed with a modified MMU, which
controls if the request is for a local or for a remote block.
2. Non-Uniform Memory Acess(NUMA)-Continue
2. Cache-Coherent NUMA(CC-NUMA) Multiprocessors.
In CC-NUMA, caching can allevite the problem due to remote
data access, but brings back the cache coherency issue.
The common approach in CC-NUMA system with many CPUs to
enforce cache coherency is the directory-based protocol, where each node in the system with a directory for its RAM blocks: a database stating in which cach is located a block, and what is the state
UMA VS. NUMA
As in UMA systems,, in NUMA system too all CPUs share the same address space, but
each processor has a local memory attached to it, and visible to all others processors.
So, differentlu from UMA systems, in NUMA systems access to local memory blocks is quicker than access to remote memory blocks
3. Cache Only Memory Acess(COMA) Multiprocessors
In this type of architectures or systems , data have no specific “permanent” location(no specific memory address) where they stay and whence they can be read(copied into local caches) and/or modified(first in the cache and the updated at their “permanent” location.
Here DATA can migrate and/or can be replicated in the various memory
banks of the central main memory.
When processor accesses a data item, its logical address is translated into
the physical address, and the content of the memory location containing the data is copied into the cache of the processor, where it can be read and/or modified
3. Cache Only Memory Acess(COMA) Multiprocessors-Continue
However, In UMA systems, centralized memory causes a bottleneck, and
limits the interconnection between CPU and memory, and its scalability.
Therefore, to overcome these problems, in COMA systems the relationship
between memory and CPU is managed in different manner.
In COMA, there is no longer “home address”, and the entire physical
address space is considered a huge, single cache.
DATA can migrate(moving, not being copied) within the whole system, from
a memory bank to another, according to the request of a specific CPU, that requires that data
43
Multiprocessor OS
• How should OS software be organized?
• OS should handle allocation of processes to processors.
Challenge due to shared data structures such as process tables and ready queues
• OS should handle disk I/O for the system as a whole • Two standard architectures
• Master-slave
44
Master-Slave Organization
• Master CPU runs kernel, all others run user processes • Only one copy of all OS data structures
45
Symmetric Multiprocessing (SMP)
• Only one kernel space, but OS can run on any CPU• Whenever a user process makes a system call, the same CPU runs OS to process it
• Key issue: Multiple system calls can run in parallel on different CPUs
• Need locks on all OS data structures to ensure mutual exclusion for critical updates
• Design issue: OS routines should have independence so that level of granularity for locking gives good performance
46
Synchronization
• Recall: Mutual exclusion solutions to protect critical
regions involving updates to shared data structures
• Classical single-processor solutions
• Disable interrupts
• Powerful instructions such as Test&Set (TSL) • Software solution such as Peterson’s algorithm
• In multiprocessor setting, competing processes can all
be OS routines (e.g., to update process table)
• Disabling interrupts is not relevant as there are
multiple CPUs
47
Busy-Waiting vs Process switch
• In single-processors, if a process is waiting to acquire
lock, OS schedules another ready process
• This may not be optimal for multiprocessor systems
• If OS itself is waiting to acquire ready list, then switching
impossible
• Switching may be possible, but involves acquiring locks, and
thus, is expensive
• OS must decide whether to switch (choice between
spinning and switching)
• spinning wastes CPU cycles
• switching uses up CPU cycles also
• possible to make separate decision each time locked mutex
48
Multiprocessors: Summary
• Set of processors connected over a bus with shared memory modules
• Architecture of bus and switches important for efficient memory access
• Caching essential; to manage multiple caches, cache coherence protocol necessary (e.g. Snoopy)
• Symmetric Multiprocessing (SMP) allows OS to run on different CPUs concurrently
• Synchronization issues: OS components work on shared data structures
• TSL based solution to ensure mutual exclusion
• Spin locks (i.e. busy waiting) with exponential backoff to reduce
49
Scheduling
• Recall: Standard scheme for single-processor scheduling
• Make a scheduling decision when a process blocks/exits or
when a clock interrupt happens indicating end of time quantum
• Scheduling policy needed to pick among ready processes, e.g.
multi-level priority (queues for each priority level)
• In multiprocessor system, scheduler must pick among
ready processes and also a CPU
• Natural scheme: when a process executing on CPU k
50
Issues for Multiprocessor Scheduling
• If a process is holding a lock, it is unwise to switch it even
if time quantum expires
• Locality issues
• If a process p is assigned to CPU k, then CPU k may hold
memory blocks relevant to p in its cache, so p should be assigned to CPU k whenever possible
• If a set of threads/processes communicate with one another
then it is advantageous to schedule them together
• Solutions
• Space sharing by allocating CPUs in partitions
51
52
Multicomputers
• Definition:
Tightly-coupled CPUs that do not share memory
• Communication by high-speed interconnect via
messages
• Also known as
• cluster computers
53
Switching Schemes
• Messages are transferred in chunks called packets • Store and forward packet switching
• Each switch collects bits on input line, assembles the packet, and
forwards it towards destination
• Each switch has a buffer to store packets
• Delays can be long
• Hot-potato routing: No buffering
• Necessary for optical communication links
• Circuit switching
• First establish a path from source to destination • Pump bits on the reserved path at a high rate
• Wormhole routing
54
Interprocess
Communication
• How can processes talk to each other on
multi-computers?
• User-level considerations: ease of use etc
• OS level consideration: efficient implementation
• Message passing
• Remote procedure calls (RPC)
55
Message-based Communication
• Minimum services provided
• send and receive commands
• These are blocking (synchronous) calls
(a) Blocking send call
56
User-level Communication
Primitives
• Library Routines
• Send (destination address, buffer containing message)
• Receive (optional source address, buffer to store message)
• Design issues
• Blocking vs non-blocking calls
57
Blocking vs Non-blocking
• Blocking send: Sender process waits until the message is
sent
• Disadvantage: Process has to wait
• Non-blocking send: Call returns control to sender
immediately
• Buffer must be protected
• Possible ways of handling non-blocking send
• Copy into kernel buffer
• Interrupt sender upon completion of transmission`
• Mark the buffer as read-only (at least a page long), copy on write
58
Buffers and Copying
• Network interface card has its own buffers
• Copy from RAM to sender’s card
• Store-and-forward switches may involve copying • Copy from receiver’s card to RAM
• Copying slows down end-to-end communication
• Copying not an issue in disk I/O due to slow speed
• Additional problem: should message be copied from
sender process buffer to kernel space?
• User pages can be swapped out
• Typical solutions
• Programmed I/O for small packets
59
The Problem with Messages
• Messages are flexible, but
• They are not a natural programming model
• Programmers have to worry about message formats
• messages must be packed and unpacked
• messages have to be decoded by server to figure out what is requested • messages are often asynchronous
60
Remote Procedure Call
• Procedure call is a more natural way to
communicate
• every language supports it
• semantics are well defined and understood • natural for programmers to use
• Basic idea of RPC (Remote Procedure Call)
• define a server as a module that exports a set of
procedures that can be called by client programs. call
return
61
Remote Procedure Call
• Use procedure call as a model for distributed communication
• RPCs can offer a good programming abstraction to hide low-level communication details
• Goal - make RPC look as much like local PC as possible
• Many issues:
• how do we make this invisible to the programmer?
• what are the semantics of parameter passing?
• how is binding done (locating the server)?
• how do we support heterogeneity (OS, arch., language)?
• how to deal with failures?
62
Shared memory vs. message
passing
• Message passing
• better performance
• know when and what msgs sent: control, knowledge
• Shared memory
• familiar
• hides details of communication
• no need to name receivers or senders, just write to
specific memory address and read later
• caching for “free”
• porting from centralized system (the original “write
once run anywhere”)
• no need to rewrite when adding processs, scales
because adds memory for each node
• Initial implementation correct (agreement is reached at
63
Distributed Shared Memory (DSM)
Replication
(a) Pages distributed on 4 machines (b) CPU 0 reads page 10
64
Distributed Shared Memory (DSM)
• data in shared address space accessed as in traditional VM.
• mapping manager -- maps the shared address space to the physical address space.
• Advantage of DSM
• no explicit comm. primitives, send and receive, needed in program. It is believed to be easier to design and write parallel alg's using DSM
• complex data structure can be passed by reference.
• moving page containing the data take advantage of locality and reduce comm. overhead.
65
DSM Implementation Issues
• Recall: In virtual memory, OS hides the fact that pages may reside in main memory or on disk
• Recall: In multiprocessors, there is a single shared memory (possibly virtual) accessed by multiple CPUs. There may be multiple caches, but cache coherency protocols hide this from applications
• how to make shared data concurrently accessible
• DSM: Each machine has its own physical memory, but virtual memory is shared, so pages can reside in any memory or on disk
• how to keep track of the location of shared data
• On page fault, OS can fetch the page from remote memory
66
Distributed Shared Memory
•Note layers where it can be implemented
• hardware
• operating system
67
Cache/Memory Coherence and
Consistency
• Coherence: every cache/CPU must have a coherent
view of memory
• If P writes X to A, then reads A, if no other proc writes A,
then P reads X
• If P1 writes X to A, and no other processor writes to A, then
P2 will eventually read X from A.
• If P1 writes X to A, and P2 writes Y to A, then every
processor will either read X then Y, or Y then X, but all will see the writes in the same order.
• Consistency: memory consistency model tells us when
68
False sharing in DSM
• False Sharing
69
Load Balancing
• In a multicomputer setting, system must determine
assignment of processes to machines
• Formulation as an optimization problem:
• Each process has estimated CPU and memory requirements
• For every pair of processes, there is an estimated traffic
• Goal: Given k machines, cluster the processes into k
clusters such that
• Traffic between clusters is minimized
• Aggregate memory/CPU requirements of processes within each