
Memory and I/O Systems

CHAPTER OUTLINE 3.1 Introduction

3.5 Virtual Memory Systems

So far, we have only considered levels of the memory hierarchy that employ random-access storage technology. However, in modern high-performance computer systems, the lowest level of the memory hierarchy is actually implemented using magnetic disks as a paging device or backing store for the physical memory, comprising a virtual memory system. The backing store contains blocks of memory that have been displaced from main memory due to capacity reasons, just the same as blocks are displaced from caches and placed either in the next level of the cache hierarchy or in main memory.

Historically, virtual memory predates caches and was first introduced 40 years ago in time-shared mainframe computers to enable sharing of a precious commodity, the main memory, among multiple active programs [Kilburn et al., 1962]. Virtual memory, as the name implies, virtualizes main memory by separating the programmer's view of memory from the actual physical placement of blocks in memory.

It does so by adding a layer of cooperating hardware and software that manages the mappings between a program's virtual address and the physical address that actually stores the data or program text being referenced. This process of address translation is illustrated in Figure 3.10. The layer of cooperating hardware and software that enables address translation is called the virtual memory system and is responsible for maintaining the illusion that all virtually addressable memory is resident in physical memory and can be transparently accessed by the program, while also efficiently sharing the limited physical resources among competing demands from multiple active programs.

Figure 3.10 Virtual to Physical Address Translation. (Diagram: a virtual address passes through address translation to produce a physical address in main memory.)

In contrast, time-sharing systems that predated or failed to provide virtual memory handicapped users and programmers by requiring them to explicitly manage physical memory as a shared resource. Portions of physical memory had to be statically allocated to concurrent programs; these portions had to be manually replaced and evicted to allocate new space; and cumbersome techniques such as data and program overlays were employed to reduce or minimize the amount of space consumed by each program. For example, a program would have to explicitly load and unload overlays that corresponded to explicit phases of program execution, since loading the entire program and data set could either overwhelm all the physical memory or starve other concurrent programs.

Instead, a virtual memory system allows each concurrent program to allocate and occupy as much memory as the system's backing store and its virtual address space allow: up to 4 Gbytes for a machine with 32-bit virtual addresses, assuming adequate backing store is available. Meanwhile, a separate demand paging mechanism manages the placement of memory in either the limited physical memory or in the system's capacious backing store, based on the policies of the virtual memory system. Such a system is responsible for providing the illusion that all virtually addressable memory is resident in physical memory and can be transparently accessed by the program.

The illusion of practically infinite capacity and a requirement for transparent access sound quite similar to the principles for caching described in Section 3.4.3; in fact, the underlying principles of temporal and spatial locality, as well as policies for locating, evicting, and handling updates to blocks, are all conceptually very similar in virtual memory subsystems and cache memories. However, since the relative latencies for accessing the backing store are much higher than the latencies for satisfying a cache miss from the next level of the physical memory hierarchy, the policies and particularly the mechanisms can and do differ substantially. A reference to a block that resides only in the backing store inflicts 10 ms or more of latency to read the block from disk. A pure hardware replacement scheme that stalls the processor while waiting for this amount of time would result in very poor utilization, since 10 ms corresponds to approximately 10 million instruction execution opportunities in a processor that executes one instruction per nanosecond. Hence, virtual memory subsystems are implemented as a hybrid of hardware and software, where references to blocks that reside in physical memory are satisfied quickly and efficiently by the hardware, while references that miss invoke the operating system through a page fault exception, which initiates the disk transfer but is also able to schedule some other, ready task to execute in the window of time between initiating and completing the disk request. Furthermore, the operating system now becomes responsible for implementing a policy for evicting blocks to allocate space for the new block being fetched from disk. We will study these issues in further detail in Section 3.5.1.
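The stall-cost estimate in this paragraph can be checked with a few lines of arithmetic. The sketch below is illustrative only: the constant and function names are assumptions, and the one-instruction-per-nanosecond rate is the figure quoted in the text, not a measurement.

```python
# Instruction slots wasted if the processor simply stalls for one disk read.
# Times are kept in integer nanoseconds to avoid floating-point rounding.
DISK_LATENCY_NS = 10 * 1_000_000   # 10 ms, the disk latency quoted in the text
INSTRUCTION_TIME_NS = 1            # one instruction per nanosecond (text's rate)

def lost_instruction_slots(latency_ns, instruction_time_ns):
    """Return how many instructions could have executed during the stall."""
    return latency_ns // instruction_time_ns

if __name__ == "__main__":
    print(lost_instruction_slots(DISK_LATENCY_NS, INSTRUCTION_TIME_NS))
```

The result of 10 million forgone instructions is what motivates handing the miss to the operating system, which can run another ready task in the meantime.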

However, there is an additional complication that arises from the fact that multiple programs are sharing the same physical memory: they should somehow be protected from accessing each other's memory, either accidentally or due to a malicious program attempting to spy on or subvert another concurrent program. In a typical modern system, each program runs in its own virtual address space, which is disjoint from the address space of any other concurrent program. As long as there is no overlap in address spaces, the operating system need only ensure that no two concurrent address mappings from different programs ever point to the same physical location, and protection is ensured. However, this can limit functionality, since two programs cannot communicate via a shared memory location, and can also reduce performance, since duplicates of the same objects may need to exist in memory to satisfy the needs of multiple programs. For these two reasons, virtual memory systems typically provide mechanisms for protecting the regions of memory that they map into each program's address space; these protection mechanisms allow efficient sharing and communication to occur. We describe them further in Section 3.5.2.

Finally, a virtual memory system must provide an architected means for translating a virtual address to a physical address and a structure for storing these mappings. We outline several schemes for doing so in Section 3.5.3.

3.5.1 Demand Paging

Figure 3.11 shows an example of a single process that consumes virtual address space in three regions: for program text (to load program binaries and shared libraries); for the process stack (for activation records and automatic storage); and for the process heap (for dynamically allocated memory). Not only are these three regions noncontiguous, leaving unused holes in the virtual address space, but each of these regions can be accessed relatively sparsely. Practically speaking, only the regions that are currently being accessed need to reside in physical memory (shown as shaded in the figure), while the unaccessed or rarely accessed regions can be stored on the paging device or backing store, enabling the use of a system with a limited amount of physical memory for programs that consume large fractions of their address space, or, alternatively, freeing up main memory for other applications in a time-shared system.

Figure 3.11 Virtual Memory System. (Diagram: the program text, process stack, and process heap regions of the virtual address space, with their pages placed either in physical memory or on the paging device.)

A virtual memory demand paging system must track regions of memory at some reasonable granularity. Just as caches track memory in blocks, a demand paging system must choose some page size as the minimum granularity for locating and evicting blocks in main memory. Typical page sizes in current-generation systems are 4K or 8K bytes. Some modern systems also support variable-sized pages or multiple page sizes to more efficiently manage larger regions of memory. However, we will restrict our discussion to fixed-size pages.
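With fixed-size pages whose size is a power of two, a virtual address divides into a virtual page number and an offset within the page. The following is a minimal sketch, assuming 4K-byte pages and 32-bit virtual addresses; the function names are illustrative and do not come from any particular system.

```python
PAGE_SIZE = 4096                             # 4K-byte pages (assumed)
OFFSET_BITS = PAGE_SIZE.bit_length() - 1     # 12 offset bits for 4096-byte pages

def split_address(vaddr):
    """Split a virtual address into (virtual page number, page offset)."""
    return vaddr >> OFFSET_BITS, vaddr & (PAGE_SIZE - 1)

def pages_in_address_space(addr_bits=32):
    """Number of page-sized blocks in a virtual address space."""
    return (1 << addr_bits) // PAGE_SIZE

if __name__ == "__main__":
    vpn, offset = split_address(0x12345678)
    print(hex(vpn), hex(offset))
    print(pages_in_address_space(32))
```

Under these assumptions a 32-bit address space holds 2^20 pages, which is the number of pages the virtual memory system may have to track for each process.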

Providing a virtual memory subsystem relieves the programmer from having to manually and explicitly manage the program's use of physical memory. Furthermore, it enables efficient execution of classes of algorithms that use the virtual address space greedily but sparsely, since it avoids allocating physical memory for untouched regions of virtual memory. Virtual memory relies on lazy allocation to achieve this very purpose: instead of eagerly allocating space for a program's needs, it defers allocation until the program actually references the memory.

This requires a means for the program to communicate to the virtual memory subsystem that it needs to reference memory that has previously not been accessed. In a demand-paged system, this communication occurs through a page-fault exception. Initially, when a new program starts up, none of its virtual address space may be allocated in physical memory. However, as soon as the program attempts to fetch an instruction or perform a load or store from a virtual memory location that is not currently in physical memory, a page fault occurs. The hardware registers a page fault whenever it cannot find a valid translation for the current virtual address. This is conceptually very similar to a cache memory experiencing a miss whenever it cannot find a matching tag when it performs a cache lookup.

However, a page fault is not handled implicitly by a hardware mechanism; rather, it transfers control to the operating system, which then allocates a page for the virtual address, creates a mapping between the virtual and physical addresses, installs the contents of the page into physical memory (usually by accessing the backing store on a magnetic disk), and returns control to the faulting program. The program is now able to continue execution, since the hardware can satisfy its virtual address reference from the corresponding physical memory location.

Detecting a Page Fault. To detect a page fault, the hardware must fail to find a valid mapping for the current virtual address. This requires an architected structure that the hardware searches for valid mappings before it raises a page fault exception to the operating system. The operating system's exception handler code is then invoked to handle the exception and create a valid mapping. Section 3.5.3 discusses several schemes for storing such mappings.

Page Allocation. Allocating space for a new virtual memory page is similar to allocating space for a new block in the cache, and depends on the page organization. Current virtual memory systems all use a fully-associative policy for placing virtual pages in physical memory, since it leads to efficient use of main memory, and the overhead of performing an associative search is not significant compared to the overall latency for handling a page fault. However, there must be a policy for evicting an active page whenever memory is completely full. Since a least-recently-used (LRU) policy would be too expensive to implement for the thousands of pages in a reasonably sized physical memory, some current operating systems use an approximation of LRU called the clock algorithm. In this scheme, each page in physical memory maintains a reference bit that is set by the hardware whenever a reference occurs to that page. The operating system intermittently clears all the reference bits. Subsequent references will set the page reference bits, effectively marking those pages that have been referenced recently. When the virtual memory system needs to find a page to evict, it randomly chooses a page from the set of pages with cleared reference bits. This scheme avoids evicting pages that have been referenced since the last time the reference bits were cleared, providing a very coarse approximation of the LRU policy.
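The reference-bit scheme just described can be sketched in a few lines. This is a simplified model, not an operating system's implementation: the class and method names are hypothetical, `touch` stands in for the hardware setting the bit, and the fallback used when every page has been referenced is an assumption, since the text does not specify that case.

```python
import random

class ReferenceBitReplacer:
    """Coarse LRU approximation using one reference bit per physical page."""

    def __init__(self, num_frames, seed=0):
        self.ref_bit = [False] * num_frames
        self._rng = random.Random(seed)   # seeded so runs are repeatable

    def touch(self, frame):
        """Model the hardware setting the reference bit on an access."""
        self.ref_bit[frame] = True

    def clear_all(self):
        """Model the operating system's intermittent clearing of all bits."""
        self.ref_bit = [False] * len(self.ref_bit)

    def choose_victim(self):
        """Pick a random frame among those not referenced since the last clear."""
        candidates = [f for f, bit in enumerate(self.ref_bit) if not bit]
        if not candidates:
            # Assumption: if every page was referenced, any page may be evicted.
            candidates = list(range(len(self.ref_bit)))
        return self._rng.choice(candidates)

if __name__ == "__main__":
    replacer = ReferenceBitReplacer(num_frames=4)
    replacer.clear_all()
    replacer.touch(0)
    replacer.touch(2)
    print(replacer.choose_victim())   # one of the untouched frames, 1 or 3
```

Because only a single bit per page is kept, the scheme distinguishes "referenced since the last clear" from "not referenced" and nothing finer, which is why the approximation to LRU is so coarse.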

Alternatively, the operating system can easily implement a FIFO policy for evicting pages by maintaining an ordered list of pages that have been fetched into main memory from the backing store. While not optimal, this scheme can perform reasonably well and is easy to implement since it avoids the overhead of the clock algorithm.

Once a page has been chosen for eviction, the operating system must place it in the backing store, usually by performing a write of the contents of the page to a magnetic disk. This write can be avoided if the hardware maintains a change bit or dirty bit for the page, and the dirty bit is not set. This is similar in principle to the dirty bits in a writeback cache, where only the blocks that have their dirty bit set need to be written back to the next level of the cache hierarchy when they are evicted.
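FIFO eviction and the dirty-bit optimization can be illustrated together. The sketch below is a toy model under stated assumptions: `FifoPager` and its fields are hypothetical names, pages are identified by plain integers, and disk writes are only counted rather than performed.

```python
from collections import deque

class FifoPager:
    """Toy demand pager: FIFO eviction, with write-back only for dirty pages."""

    def __init__(self, num_frames):
        self.num_frames = num_frames
        self.order = deque()    # resident pages, oldest first
        self.dirty = {}         # resident page -> dirty (change) bit
        self.disk_writes = 0    # evictions that required a write to disk

    def access(self, page, is_write=False):
        """Reference a page; return True if the reference page-faulted."""
        faulted = page not in self.dirty
        if faulted:
            if len(self.order) == self.num_frames:
                victim = self.order.popleft()   # oldest resident page
                if self.dirty.pop(victim):      # clean pages need no write
                    self.disk_writes += 1
            self.order.append(page)
            self.dirty[page] = False
        if is_write:
            self.dirty[page] = True             # set the change bit on a store
        return faulted

if __name__ == "__main__":
    pager = FifoPager(num_frames=2)
    pager.access(1, is_write=True)
    pager.access(2)
    pager.access(3)   # evicts page 1, which is dirty
    pager.access(4)   # evicts page 2, which is clean
    print(pager.disk_writes)
```

In this run two pages are evicted but only the modified one costs a disk write, which is the saving the dirty bit provides.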

Accessing the Backing Store. The backing store needs to be accessed to supply the paged contents of the virtual page that is about to be installed in physical memory. Typically, this involves issuing a read to a magnetic disk, which can have a latency exceeding 10 ms. Multitasking operating systems will put a page-faulting task to sleep for the duration of the disk read and will schedule some other active task to run on the processor instead.

Figure 3.12 Handling a Page Fault. (Diagram: user process 1 runs until a translation is not found and then sleeps; the O/S supervisor searches the page table, evicts a victim, initiates the I/O read, and schedules process 2; user process 2 runs while the backing store fetches the missing page; the supervisor then schedules process 1.)

Figure 3.12 illustrates the steps that occur to satisfy a page fault: first the current process 1 fails to find a valid translation for a memory location it is attempting to access; the operating system supervisor is invoked to search the page table for a valid translation via the page fault handler routine; failing to find a translation, the supervisor evicts a physical page to make room for the faulting page and initiates an I/O read to the backing store to fetch the page; the supervisor scheduler then runs to find a ready task to occupy the CPU while process 1 waits for the page fault to be satisfied; process 2 runs while the backing store completes the read; the supervisor is notified when the read completes, and runs its scheduler to find the waiting process 1; finally, process 1 resumes execution on the CPU.
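The sequence of steps above can be traced with a small model. This sketch makes several simplifying assumptions: the function name, the event labels, and the state strings are all hypothetical; the victim page is supplied by the caller; the disk read is treated as instantaneous; and at least one other process is assumed to be ready.

```python
def service_page_fault(states, faulting, page_table, vpage, victim_vpage):
    """Trace one page fault; returns the ordered list of events.

    states maps a process name to "running", "ready", or "sleeping".
    page_table maps a virtual page number to a physical frame number.
    """
    if vpage in page_table:
        raise ValueError("no fault: translation already present")
    events = ["fault"]                       # translation not found

    # Supervisor evicts a victim page and initiates the I/O read.
    frame = page_table.pop(victim_vpage)
    events.append("evict+io_start")

    # Faulting process sleeps; the scheduler picks another ready process.
    states[faulting] = "sleeping"
    other = next(p for p, s in states.items() if s == "ready")
    states[other] = "running"
    events.append("run:" + other)

    # Read completes: install the mapping and reschedule the waiting process.
    page_table[vpage] = frame
    events.append("io_done")
    states[other] = "ready"
    states[faulting] = "running"
    events.append("run:" + faulting)
    return events

if __name__ == "__main__":
    states = {"process1": "running", "process2": "ready"}
    table = {7: 0}                           # virtual page 7 lives in frame 0
    for event in service_page_fault(states, "process1", table, 9, 7):
        print(event)
```

The point of the model is the ordering: the processor is handed to process 2 between the start and the completion of the disk read, so the long latency is overlapped with useful work.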

3.5.2 Memory Protection

A system that time-shares the physical memory system through the use of virtual memory allows the physical memory to concurrently contain pages from multiple processes. In some scenarios, it is desirable to allow multiple processes to access the same physical page, in order to enable communication between those processes or to avoid keeping duplicate copies of identical program binaries or shared libraries in memory. Furthermore, the operating system kernel, which also has resident physical pages, must be able to protect its internal data structures from user-level programs.

The virtual memory subsystem must provide some means for protecting shared pages from defective or malicious programs that might corrupt their state. Furthermore, even when no sharing is occurring, protecting various address ranges from certain types of accesses can be useful for ensuring correct execution or for debugging new programs, since erroneous references can be flagged by the protection mechanism.


Typical virtual memory systems allow each page to be granted separate read, write, and execute permissions. The hardware is then responsible for checking that instruction fetches occur only to pages that grant execute permission, loads occur only to pages that grant read permission, and writes occur only to pages that grant write permission. These permissions are maintained in parallel with the virtual to physical translations and can only be manipulated by supervisor-state code running in the operating system kernel. Any references that violate the permissions specified for that page will be blocked, and the operating system exception handler will be invoked to deal with the problem, usually resulting in termination of the offending process.
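The per-page permission check can be sketched as a bit test. The bit encoding, the names, and the access-kind strings below are assumptions chosen for illustration; real hardware performs this check during translation and raises an exception instead of returning a value.

```python
# Hypothetical encoding of the three permission bits for one page.
READ, WRITE, EXECUTE = 0b100, 0b010, 0b001

# Permission each kind of reference requires, as described in the text.
REQUIRED = {"fetch": EXECUTE, "load": READ, "store": WRITE}

def is_allowed(page_permissions, access_kind):
    """Return True if the page grants the permission this access needs."""
    return bool(page_permissions & REQUIRED[access_kind])

if __name__ == "__main__":
    text_page = READ | EXECUTE        # e.g. a page of program text
    for kind in ("fetch", "load", "store"):
        print(kind, is_allowed(text_page, kind))
```

A store to the read-and-execute page is the only access rejected here; in a real system that rejection is what invokes the operating system's exception handler.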

Permission bits enable efficient sharing of read-only objects like program binaries and shared libraries. If there are multiple concurrent processes executing the same program binary, only a single copy of the program needs to reside in physical memory, since the kernel can map the same physical copy into the address space of each process. This will result in multiple virtual-physical address mappings where the physical address is the same. This is referred to as virtual address aliasing.

Similarly, any other read-only objects can be shared. Furthermore, programs that need to communicate with each other can request shared space from the operating system and can communicate directly with each other by writing to and reading from the shared physical address. Again, the sharing is achieved via multiple virtual mappings (one per process) to the same physical address, with appropriate read and/or write permissions set for each process sharing the memory.
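Sharing through aliased mappings can be modeled with two per-process page tables that name the same physical frame. Everything in this sketch is an illustrative assumption: the table layout, the page and frame numbers, and the helper names. Physical memory is a plain byte array and a protection violation is reported with Python's built-in PermissionError.

```python
PAGE_SIZE = 4096
physical_memory = bytearray(4 * PAGE_SIZE)    # four physical frames

# Per-process page tables: virtual page number -> (frame, writable).
# Both processes map frame 2, at different virtual pages and permissions.
producer_table = {0x10: (2, True)}
consumer_table = {0x80: (2, False)}

def translate(page_table, vaddr):
    """Return (physical address, writable) for a mapped virtual address."""
    frame, writable = page_table[vaddr // PAGE_SIZE]
    return frame * PAGE_SIZE + vaddr % PAGE_SIZE, writable

def store(page_table, vaddr, value):
    paddr, writable = translate(page_table, vaddr)
    if not writable:
        raise PermissionError("write to a read-only page")
    physical_memory[paddr] = value

def load(page_table, vaddr):
    paddr, _ = translate(page_table, vaddr)
    return physical_memory[paddr]

if __name__ == "__main__":
    store(producer_table, 0x10 * PAGE_SIZE + 5, 42)
    print(load(consumer_table, 0x80 * PAGE_SIZE + 5))
```

The consumer reads the value the producer wrote even though the two use different virtual addresses, and the consumer's read-only mapping prevents it from modifying the shared page.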

3.5.3 Page Table Architectures

The virtual address to physical address mappings have to be stored in a translation memory. The operating system is responsible for updating these mappings whenever they need to change, while the processor must access the translation memory to determine the physical address for each virtual address reference that it performs.

Each translation entry contains the fields shown in Figure 3.13: the virtual address, the corresponding physical address, permission bits for reading (Rp), writing (Wp), and executing (Ep), as well as reference (Ref) and change (Ch) bits, and possibly a caching-inhibited bit (Ca). The reference bit is used by the demand paging system's eviction algorithm to find pages to replace, while the change bit plays the part of a dirty bit, indicating that an eviction candidate needs to be written back to the backing store. The caching-inhibited bit is used to flag pages in memory that should not, for either performance or correctness reasons, be stored in the processor's cache hierarchy. Instead, all references to such addresses must be communicated directly through the processor bus. We will learn in Section 3.7.3 how this caching-inhibited bit is vitally important for communicating with I/O devices with memory-mapped control registers.

Figure 3.13 Typical Page Table Entry. (Fields, in order: virtual address, real address, read permission Rp, write permission Wp, execute permission Ep, reference bit Ref, change bit Ch, caching-inhibited bit Ca.)

The translation memories are usually called page tables and can be organized either as forward page tables or inverted page tables (the latter are often called hashed page tables as well). At its simplest, a forward page table contains a page table entry for every possible page-sized block in the virtual address space of the process using the page table. However, this would result in a very large structure with many unused entries, since most processes do not consume all their virtual address space. Hence, forward page tables are usually structured in multiple levels, as shown in Figure 3.14. In this approach, the virtual address is decomposed into multiple sections. The highest-order bits of the address are added to the page table base register (PTBR), which points to the base of the first level of the page table. This first lookup provides a pointer to the next table; the next set of bits from the virtual address are added to this pointer to find a pointer to the next level. Finally, this pointer is added to the next set of virtual address bits to find the final leaf-level page table entry, which provides the actual physical address and permission bits corresponding to the virtual address. Of course, the multilevel page table can be extended to more than the three levels shown in Figure 3.14.
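A three-level walk of this kind can be sketched as follows. The sketch rests on several assumptions that are not taken from Figure 3.14: 32-bit virtual addresses, 4K-byte pages, and a 4/8/8-bit split of the virtual page number. Nested dictionaries stand in for the tables, so an index lookup replaces the pointer-plus-offset addition, the root dictionary plays the role of the table the PTBR points to, and the leaf stores only a frame number without permission bits.

```python
OFFSET_BITS = 12            # 4K-byte pages (assumed)
LEVEL_BITS = (4, 8, 8)      # assumed split of the 20-bit virtual page number

def split_vpn(vaddr):
    """Return the table index for each level, highest-order bits first."""
    vpn = vaddr >> OFFSET_BITS
    indices = []
    for bits in reversed(LEVEL_BITS):
        indices.append(vpn & ((1 << bits) - 1))
        vpn >>= bits
    return list(reversed(indices))

def map_page(root, vaddr, frame):
    """Install a translation, creating lower-level tables only on demand."""
    i1, i2, i3 = split_vpn(vaddr)
    root.setdefault(i1, {}).setdefault(i2, {})[i3] = frame

def walk(root, vaddr):
    """Return the physical address, or None when no valid mapping exists."""
    entry = root
    for index in split_vpn(vaddr):
        if index not in entry:
            return None                     # would raise a page fault
        entry = entry[index]
    return (entry << OFFSET_BITS) | (vaddr & ((1 << OFFSET_BITS) - 1))

if __name__ == "__main__":
    root = {}
    map_page(root, 0x12345678, frame=0xABC)
    print(hex(walk(root, 0x12345678)))
    print(walk(root, 0x12346000))
```

Only the tables along the path to a mapped page are ever created, which is how the multilevel organization avoids the many unused entries of a single flat table.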

A multilevel forward page table can efficiently store translations for a sparsely
