Virtual Memory. Chapter 4

Chapter 4

Virtual Memory

Linux processes execute in a virtual environment that makes it appear as if each process had the entire address space of the CPU available to itself. This virtual address space extends from address 0 all the way to the maximum address. On a 32-bit platform, such as IA-32, the maximum address is $2^{32}-1$, or 0xffffffff. On a 64-bit platform, such as IA-64, it is $2^{64}-1$, or 0xffffffffffffffff.

While it is obviously convenient for a process to be able to access such a huge address space, there are really three distinct, but equally important, reasons for using virtual memory.

1. Resource virtualization. On a system with virtual memory, a process does not have to concern itself with the details of how much physical memory is available or which physical memory locations are already in use by some other process. In other words, virtual memory takes a limited physical resource (physical memory) and turns it into an infinite, or at least an abundant, resource (virtual memory).

2. Information isolation. Because each process runs in its own address space, it is not possible for one process to read data that belongs to another process. This improves security because it reduces the risk of one process being able to spy on another process and, e.g., steal a password.

3. Fault isolation. Processes with their own virtual address spaces cannot overwrite each other’s memory. This greatly reduces the risk of a failure in one process triggering a failure in another process. That is, when a process crashes, the problem is generally limited to that process alone and does not cause the entire machine to go down.

In this chapter, we explore how the Linux kernel implements its virtual memory system and how it maps to the underlying hardware. This mapping is illustrated specifically for IA-64. The chapter is structured as follows. The first section provides an introduction to the virtual memory system of Linux and establishes the terminology used throughout the remainder of the chapter. The introduction is followed by a description of the software and hardware structures that form the virtual memory system. Specifically, the second section describes the Linux virtual address space and its representation in the kernel. The third section describes the Linux page tables, and the fourth section describes how Linux manages the translation lookaside buffer (TLB), which is a hardware structure used to accelerate virtual memory accesses. Once these fundamental structures are introduced, the chapter describes the operation of the virtual memory system. Section five explores the Linux page fault handler, which can be thought of as the engine driving the virtual memory system. Section six describes how memory coherency is maintained, that is, how Linux ensures that a process sees the correct values in the virtual memory locations it can access. Section seven discusses how Linux switches execution from one address space to another, which is a necessary step during a process context switch. The chapter concludes with section eight, which provides the rationale for some of the design choices that were made for the virtual memory system implemented on IA-64.

Figure 4.1. Virtual and physical address spaces.

4.1 INTRODUCTION TO THE VIRTUAL MEMORY SYSTEM

The left half of Figure 4.1 illustrates the virtual address space as it might exist for a particular process on a 64-bit platform. As the figure shows, the virtual address space is divided into equal-sized pieces called virtual pages. Virtual pages have a fixed size that is an integer power of 2. For example, IA-32 uses a page size of 4 Kbytes. To maximize performance, IA-64 supports multiple page sizes and Linux can be configured to use a size of 4, 8, 16, or 64 Kbytes. In the figure, the 64-bit address space is divided into 16 pages, meaning that each virtual page would have a size of $2^{64}/16 = 2^{60}$ bytes, or 1024 Pbytes (1 Pbyte $= 2^{50}$ bytes). Such large pages are not realistic, but the alternative of drawing a figure with several billion pages of a more realistic size is, of course, not practical either. Thus, for this section, we continue to illustrate virtual memory with this huge page size. The figure also shows that virtual pages are numbered sequentially. We can calculate the virtual page number (VPN) from a virtual address by dividing it by the page size and taking the integer portion of the result. The remainder is called the page offset. For example, dividing virtual address 0x40000000000003f8 by the page size yields 4 and a remainder of 0x3f8. This address therefore maps to virtual page number 4 and page offset 0x3f8.
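The division described above can be checked with a few lines of Python. This is only an illustrative sketch: the function names are made up for this example, and the page size of $2^{60}$ bytes is the unrealistic one used in Figure 4.1.

```python
PAGE_SHIFT = 60                 # log2 of the page size used in Figure 4.1
PAGE_SIZE = 1 << PAGE_SHIFT     # 2**60 bytes

def split_virtual_address(vaddr, page_size=PAGE_SIZE):
    """Return (virtual page number, page offset) using division and remainder."""
    return divmod(vaddr, page_size)

def split_with_shift(vaddr, page_shift=PAGE_SHIFT):
    """Same result using a shift and a mask, valid because the size is a power of 2."""
    return vaddr >> page_shift, vaddr & ((1 << page_shift) - 1)

vpn, offset = split_virtual_address(0x40000000000003f8)
print(vpn, hex(offset))  # 4 0x3f8
```

Because the page size is a power of 2, the shift-and-mask form yields the same pair as the division, which is why hardware can split an address without an actual divide.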

Let us now turn attention to the right half of Figure 4.1, which shows the physical address space. Just like the virtual space, it is divided into equal-sized pieces, but in physical memory, those pieces are called page frames. As with virtual pages, page frames also are numbered. We can calculate the page frame number (PFN) from a physical address by dividing it by the page frame size and taking the integer portion of the result. The remainder is the page frame offset. Normally, page frames have the same size as virtual pages. However, there are cases where it is beneficial to deviate from this rule. Sometimes it is useful to have virtual pages that are larger than a page frame. Such pages are known as superpages. Conversely, it is sometimes useful to divide a page frame into multiple, smaller virtual pages. Such pages are known as subpages. IA-64 is capable of supporting both, but Linux does not use them.

While it is easiest to think of physical memory as occupying a single contiguous region in the physical address space, in reality it is not uncommon to encounter memory holes. Holes usually are caused by one of three entities: firmware, memory-mapped I/O devices, or unpopulated memory. All three cause portions of the physical address space to be unavailable for storing the content of virtual pages. As far as the kernel is concerned, these portions are holes in the physical memory. In the example in Figure 4.1, page frames 2 and 3 represent a hole. Note that even if just a single byte in a page frame is unusable, the entire frame must be marked as a hole.

4.1.1 Virtual-to-physical address translation

Processes are under the illusion of being able to store data to virtual memory and retrieve it later on as if it were stored in real memory. In reality, only physical memory can store data. Thus, each virtual page that is in use must be mapped to some page frame in physical memory. For example, in Figure 4.1, virtual pages 4, 5, 8, 9, and 11 are in use. The arrows indicate which page frame in physical memory they map to. The mapping between virtual pages and page frames is stored in a data structure called the page table. The page table for our example is shown on the left-hand side of Figure 4.2.

The Linux kernel is responsible for creating and maintaining page tables but employs the CPU’s memory management unit (MMU) to translate the virtual memory accesses of a process into corresponding physical memory accesses. Specifically, when a process accesses a memory location at a particular virtual address, the MMU translates this address into the corresponding physical address, which it then uses to access the physical memory. This is illustrated in Figure 4.2 for the case in which the virtual address is 0x40000000000003f8. As the figure shows, the MMU extracts the VPN (4) from the virtual address and then searches the page table to find the matching PFN. In our case, the search stops at the first entry in the page table since it contains the desired VPN. The PFN associated with this entry is 7. The MMU then constructs the physical address by concatenating the PFN with the frame offset from the virtual address, which results in a physical address of 0x70000000000003f8.

Figure 4.2. Virtual-to-physical address translation.
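The translation just described can be sketched in software. The following is a simplified model, not how an MMU is actually built: the page table is a plain dictionary, only the entry for VPN 4 is taken from the text, and the names `translate` and `PageFault` are invented for this illustration.

```python
PAGE_SHIFT = 60                        # 2**60-byte pages, as in Figures 4.1 and 4.2
OFFSET_MASK = (1 << PAGE_SHIFT) - 1

# VPN -> PFN. Only the mapping 4 -> 7 comes from the text; the remaining
# entries of the example page table are omitted here.
page_table = {4: 7}

class PageFault(Exception):
    """Raised when no translation exists for the accessed virtual page."""

def translate(vaddr, table=page_table):
    """Translate a virtual address into a physical address."""
    vpn = vaddr >> PAGE_SHIFT          # extract the virtual page number
    offset = vaddr & OFFSET_MASK       # the offset is carried over unchanged
    if vpn not in table:
        raise PageFault(hex(vaddr))
    return (table[vpn] << PAGE_SHIFT) | offset   # concatenate PFN and offset

print(hex(translate(0x40000000000003f8)))  # 0x70000000000003f8
```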

4.1.2 Demand paging

The next question we need to address is how the page tables get created. Linux could create appropriate page-table entries whenever a range of virtual memory is allocated. However, this would be wasteful because most programs allocate much more virtual memory than they ever use at any given time. For example, the text segment of a program often includes large amounts of error handling code that is seldom executed. To avoid wasting memory on virtual pages that are never accessed, Linux uses a method called demand paging. With this method, the virtual address space starts out empty. This means that, logically, all virtual pages are marked in the page table as not present. When accessing a virtual page that is not present, the CPU generates a page fault. This fault is intercepted by the Linux kernel and causes the page fault handler to be activated. There, the kernel can allocate a new page frame, determine what the content of the accessed page should be (e.g., a new, zeroed page, a page loaded from the data section of a program, or a page loaded from the text segment of a program), load the page, and then update the page table to mark the page as present. Execution then resumes in the process with the instruction that caused the fault. Since the required page is now present, the instruction can now execute without causing a page fault.
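The fault-and-retry cycle can be illustrated with a toy model. This sketch makes several simplifying assumptions that are not from the text: the class and method names are hypothetical, every newly touched page is a zeroed page, and a page frame is modeled as a `bytearray`.

```python
class DemandPagedSpace:
    """Toy address space whose page table starts out empty."""

    def __init__(self, page_size=8192):
        self.page_size = page_size
        self.page_table = {}       # VPN -> page frame; a missing key means "not present"
        self.fault_count = 0

    def _handle_page_fault(self, vpn):
        # Hypothetical handler: always supplies a new, zeroed page. A real
        # handler would first determine where the page content comes from.
        self.fault_count += 1
        self.page_table[vpn] = bytearray(self.page_size)

    def read_byte(self, vaddr):
        vpn, offset = divmod(vaddr, self.page_size)
        if vpn not in self.page_table:       # page not present: page fault
            self._handle_page_fault(vpn)
        # The access is retried and now succeeds.
        return self.page_table[vpn][offset]

space = DemandPagedSpace()
space.read_byte(0x6008)    # first touch of the page: one page fault
space.read_byte(0x6010)    # same page, already present: no fault
print(space.fault_count)   # 1
```

Only pages that are actually touched ever get a page-table entry and a frame, which is the point of the technique.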

4.1.3 Paging and swapping

So far, we assumed that physical memory is plentiful: Whenever we needed a page frame to back a virtual page, we assumed a free page frame was available. When a system has many processes or when some processes grow very large, the physical memory can easily fill up. So what is Linux supposed to do when a page frame is needed but physical memory is already full? The answer is that in this case, Linux picks a page frame that backs a virtual page that has not been accessed recently, writes it out to a special area on the disk called the swap space, and then reuses the page frame to back the new virtual page. The exact place to which the old page is written on the disk depends on what kind of swap space is in use. Linux can support multiple swap space areas, each of which can be either an entire disk partition or a specially formatted file on an existing filesystem (the former is generally more efficient and therefore preferable). Of course, the page table of the process from which Linux “stole” the page frame must be updated accordingly. Linux does this update by marking the page-table entry as not present. To keep track of where the old page has been saved, it also uses the entry to record the disk location of the page. In other words, a page-table entry that is present contains the page frame number of the physical page frame that backs the virtual page, whereas a page-table entry that is not present contains the disk location at which the content of the page can be found.

Because a page marked as not present cannot be accessed without first triggering a page fault, Linux can detect when the page is needed again. When this happens, Linux needs to again find an available page frame (which may cause another page to be paged out), read the page content back from swap space, and then update the page-table entry so that it is marked as present and maps to the newly allocated page frame. At this point, the process that attempted to access the paged-out page can be resumed, and, apart from a small delay, it will execute as if the page had been in memory all along.

The technique of stealing a page from a process and writing it out to disk is called paging. A related technique is swapping. It is a more aggressive form of paging in the sense that it does not steal an individual page but steals all the pages of a process when memory is in short supply. Linux uses paging but not swapping. However, because both paging and swapping write pages to swap space, Linux kernel programmers often use the terms “swapping” and “paging” interchangeably. This is something to keep in mind when perusing the kernel source code.

From a correctness point of view, it does not matter which page is selected for page out, but from a performance perspective, the choice is critical. With a poor choice, Linux may end up paging out the page that is needed in the very next memory access. Given the large difference between disk access latency (on the order of several milliseconds) and memory access latency (on the order of tens of nanoseconds), making the right replacement choices can mean the difference between completing a task in a second or in almost three hours!

The algorithm that determines which page to evict from main memory is called the replacement policy. The provably optimal replacement policy (OPT) is to choose the page that will be accessed farthest in the future. Of course, in general it is impossible to know the future behavior of the processes, so OPT is of theoretical interest only. A replacement policy that often performs almost as well as OPT yet is realizable is the least recently used (LRU) policy. LRU looks into the past instead of the future and selects the page that has not been accessed for the longest period of time. Unfortunately, even though LRU could be implemented, it is still not practical because it would require updating a data structure (such as an LRU list) on every access to main memory. In practice, operating systems use approximations of the LRU policy instead, such as the clock replacement or not frequently used (NFU) policies [11, 69].
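To make the idea of an LRU approximation concrete, here is a minimal sketch of the textbook clock (second-chance) policy. It is a generic illustration and is not meant to represent the algorithm Linux actually uses; the class name and the choice to set the reference bit when a page is first loaded are assumptions of this example.

```python
class ClockReplacer:
    """Clock (second-chance) page replacement over a fixed number of frames."""

    def __init__(self, num_frames):
        self.pages = [None] * num_frames         # page held by each frame
        self.referenced = [False] * num_frames   # one reference bit per frame
        self.hand = 0                            # position of the clock hand

    def access(self, page):
        """Touch a page; return the page that was evicted, or None."""
        if page in self.pages:
            # Hit: only a single bit is set, far cheaper than reordering an LRU list.
            self.referenced[self.pages.index(page)] = True
            return None
        # Miss: sweep the hand, clearing reference bits, until a frame whose
        # bit is already clear is found. That frame holds the victim.
        while self.referenced[self.hand]:
            self.referenced[self.hand] = False
            self.hand = (self.hand + 1) % len(self.pages)
        victim = self.pages[self.hand]
        self.pages[self.hand] = page
        self.referenced[self.hand] = True
        self.hand = (self.hand + 1) % len(self.pages)
        return victim

clock = ClockReplacer(3)
for p in "ABC":
    clock.access(p)
print(clock.access("D"))   # A
clock.access("B")          # B is referenced again
print(clock.access("E"))   # C (B gets a second chance)
```

The per-access cost is setting one bit, which hardware can do on its own; the more expensive sweep happens only when a frame must be reclaimed.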

In Linux, page replacement is complicated by the fact that the kernel can take up a variable amount of (nonpageable) memory. For example, file data is stored in the page cache, which can grow and shrink dynamically. When the kernel needs a new page frame, it often has two choices: It could take away a page from the kernel or it could steal a page from a process. In other words, the kernel needs not just a replacement policy but also a memory balancing policy that determines how much memory is used by the kernel and how much is used to back virtual pages. The combination of page replacement and memory balancing poses a difficult problem for which there is no perfect solution. Consequently, the Linux kernel uses a variety of heuristics that tend to work well in practice.

To implement these heuristics, the Linux kernel expects the platform-specific part of the kernel to maintain two extra bits in each page-table entry: the accessed bit and the dirty bit. The accessed bit is an indicator that tells the kernel whether the page was accessed (read, written, or executed) since the bit was last cleared. Similarly, the dirty bit is an indicator that tells whether the page has been modified since it was last paged in. Linux uses a kernel thread, called the kernel swap daemon (kswapd), to periodically inspect these bits. After inspection, it clears the accessed bit. If kswapd detects that the kernel is starting to run low on memory, it starts to proactively page out memory that has not been used recently. If the dirty bit of a page is set, kswapd needs to write the page to disk before the page frame can be freed. Because this is relatively costly, kswapd preferentially frees pages whose accessed and dirty bits are cleared to 0. By definition such pages were not accessed recently and do not have to be written back to disk before the page frame can be freed, so they can be reclaimed at very little cost.
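The preference for pages that are neither accessed nor dirty can be expressed as a simple ordering. The sketch below is a deliberately crude model: the `Page` record and the `reclaim_order` function are hypothetical, and the exact tie-breaking (unaccessed dirty pages before accessed clean ones) is an assumption of this example rather than something the text specifies.

```python
from dataclasses import dataclass

@dataclass
class Page:
    name: str
    accessed: bool   # set when the page is read, written, or executed
    dirty: bool      # set when the page is modified

def reclaim_order(pages):
    """Order pages from cheapest to most expensive to reclaim.

    Pages that were not accessed come first; among those, clean pages come
    before dirty ones because they need no write-back to disk.
    """
    return sorted(pages, key=lambda p: (p.accessed, p.dirty))

def clear_accessed(pages):
    """Mimic the end of an inspection pass: clear every accessed bit."""
    for p in pages:
        p.accessed = False

pages = [
    Page("a", accessed=True, dirty=False),
    Page("b", accessed=False, dirty=True),
    Page("c", accessed=False, dirty=False),
    Page("d", accessed=True, dirty=True),
]
print([p.name for p in reclaim_order(pages)])  # ['c', 'b', 'a', 'd']
```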

4.1.4 Protection

In a multiuser and multitasking system such as Linux, multiple processes often execute the same program. For example, each user who is logged into the system at a minimum is running a command shell (e.g., the Bourne-Again shell, bash). Similarly, server processes such as the Apache web server often use multiple processes running the same program to better handle heavy loads. If we looked at the virtual space of each of those processes, we would find that they share many identical pages. Moreover, many of those pages are never modified during the lifetime of a process because they contain read-only data or the text segment of the program, which also does not change during the course of execution. Clearly, it would make a lot of sense to exploit this commonality and use only one page frame for each virtual page with identical content.

With N processes running the same program, sharing identical pages can reduce physical memory consumption by up to a factor of N. In reality, the savings are usually not quite so dramatic, because each process tends to require a few private pages. A more realistic example is illustrated in Figure 4.3: Page frames 0, 1, and 5 are used to back virtual pages 1, 2, and 3 in the two processes called bash 1 and bash 2. Note that a total of nine virtual pages are in use, but thanks to page sharing, only six page frames are needed.

Of course, page sharing cannot be done safely unless we can guarantee that none of the shared pages are modified. Otherwise, the changes of one process would be visible in all the other processes and that could lead to unpredictable program behavior.

This is where the page permission bits come into play. The Linux kernel expects the platform-specific part of the kernel to maintain three such bits per page-table entry. They are called the R, W, and X permission bits and respectively control whether read, write, or execute accesses to the page are permitted. If an access that is not permitted is attempted, a page protection violation fault is raised. When this happens, the kernel responds by sending a segmentation violation signal (SIGSEGV) to the process.

Figure 4.3. Two processes sharing the text segment (virtual pages 1 to 3).

The page permission bits enable the safe sharing of page frames. All the Linux kernel has to do is ensure that all virtual pages that refer to a shared page frame have the W permission bit turned off. That way, if a process attempted to modify a shared page, it would receive a segmentation violation signal before it could do any harm.

The most obvious place where page frame sharing can be used effectively is in the text segment of a program: By definition, this segment can be executed and read, but it is never written to. In other words, the text segment pages of all processes running the same program can be shared. The same applies to read-only data pages.

Linux takes page sharing one step further. When a process forks a copy of itself, the kernel disables write access to all virtual pages and sets up the page tables such that the parent and the child process share all page frames. In addition, it marks the pages that were writable before as copy-on-write (COW). If the parent or the child process attempts to write to a copy-on-write page, a protection violation fault occurs. When this happens, instead of sending a segmentation violation signal, the kernel first makes a private copy of the virtual page and then turns the write permission bit for that page back on. At this point, execution can return to the faulting process. Because the page is now writable, the faulting instruction can finish execution without causing a fault again. The copy-on-write scheme is particularly effective when a program does a fork() that is quickly followed by an execve(). In such a case, the scheme is able to avoid almost all page copying, save for a few pages in the stack or data segment [9, 64].
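The copy-on-write sequence can be modeled in a few lines. This is a sketch under stated assumptions, not the kernel's implementation: `Frame`, `Mapping`, `fork`, and `write` are names invented for this example, and the per-frame reference count (used to skip the copy once a frame is no longer shared) is an assumption that goes beyond what the text describes.

```python
class Frame:
    """A physical page frame plus a count of the mappings that share it."""
    def __init__(self, data):
        self.data = bytearray(data)
        self.refs = 1

class Mapping:
    """A page-table entry: a frame, the write permission bit, and a COW marker."""
    def __init__(self, frame, writable, cow=False):
        self.frame, self.writable, self.cow = frame, writable, cow

def fork(parent_table):
    """Share every frame with the child; formerly writable pages become COW."""
    child_table = {}
    for vpn, m in parent_table.items():
        if m.writable:
            m.writable, m.cow = False, True    # write access is disabled
        m.frame.refs += 1
        child_table[vpn] = Mapping(m.frame, False, m.cow)
    return child_table

def write(table, vpn, offset, value):
    """Write one byte, resolving a copy-on-write fault if necessary."""
    m = table[vpn]
    if not m.writable:
        if not m.cow:
            raise PermissionError("SIGSEGV: write to a read-only page")
        if m.frame.refs > 1:                   # still shared: make a private copy
            m.frame.refs -= 1
            m.frame = Frame(m.frame.data)
        m.writable, m.cow = True, False        # turn write permission back on
    m.frame.data[offset] = value

parent = {0: Mapping(Frame(bytes(4)), writable=True)}
child = fork(parent)
write(child, 0, 0, 9)
print(parent[0].frame.data[0], child[0].frame.data[0])  # 0 9
```

Until one side writes, parent and child use the same frame, so a fork() followed immediately by an execve() copies almost nothing.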

Note that the page sharing described here happens automatically and without the explicit knowledge of the process. There are times when two or more processes need to cooperate and want to explicitly share some virtual memory pages. Linux supports this through the mmap() system call and through System V shared memory segments [9]. Because the processes are cooperating, it is their responsibility to map the shared memory segment with suitable permission bits to ensure that the processes can access the memory only in the intended fashion.
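As an illustration of explicit sharing, the following sketch uses Python's standard `mmap` and `os` modules, which wrap the mmap() and fork() system calls on Unix-like systems. It assumes a Unix platform; the function name `share_with_child` is made up for this example.

```python
import mmap
import os

def share_with_child(message=b"hello"):
    """Fork a child that writes into a shared mapping; return what the parent reads."""
    # On Unix, an anonymous mapping (fileno -1) is created MAP_SHARED by
    # default, so it remains shared across fork() rather than copy-on-write.
    shared = mmap.mmap(-1, mmap.PAGESIZE)
    pid = os.fork()
    if pid == 0:                           # child process
        shared[:len(message)] = message
        os._exit(0)                        # exit without running parent cleanup
    os.waitpid(pid, 0)                     # parent waits for the child to finish
    data = bytes(shared[:len(message)])
    shared.close()
    return data

print(share_with_child())  # b'hello'
```

The parent sees the bytes written by the child because both processes map the same page frames with write permission, in contrast to the private, copy-on-write pages that an ordinary fork produces.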

Figure 4.4. Structure of Linux address space.

4.2 ADDRESS SPACE OF A LINUX PROCESS

The virtual address space of any Linux process is divided into two subspaces: kernel space and user space. As illustrated on the left-hand side of Figure 4.4, user space occupies the lower portion of the address space, starting from address 0 and extending up to the platform-specific task size limit (TASK_SIZE in file include/asm/processor.h). The remainder is occupied by kernel space. Most platforms use a task size limit that is large enough so that at least half of the available address space is occupied by the user address space.

User space is private to the process, meaning that it is mapped by the process’s own page table. In contrast, kernel space is shared across all processes. There are two ways to think about kernel space: We can either think of it as being mapped into the top part of each process, or we can think of it as a single space that occupies the top part of the CPU’s virtual address space. Interestingly, depending on the specifics of the CPU on which Linux is running, kernel space can be implemented in one or the other way.

During execution at the user level, only user space is accessible. Attempting to read, write, or execute kernel space would cause a protection violation fault. This prevents a faulty or malicious user process from corrupting the kernel. In contrast, during execution in the kernel, both user and kernel spaces are accessible.

Before continuing our discussion, we need to say a few words about the page size used by the Linux kernel. Because different platforms have different constraints on what page sizes they can support, Linux never assumes a particular page size and instead uses the platform-specific page size constant (PAGE_SIZE in file include/asm/page.h) where necessary. Although Linux can accommodate arbitrary page sizes, throughout the rest of this chapter we assume a page size of 8 Kbytes, unless stated otherwise. This assumption helps to make the discussion more concrete and avoids excessive complexity in the following examples and figures.

4.2.1 User address space

Let us now take a closer look at how Linux implements the user address spaces. Each address space is represented in the kernel by an object called the mm structure (struct mm_struct in file include/linux/sched.h). As we have seen in Chapter 3, Processes, Tasks, and Threads, multiple tasks can share the same address space, so the mm structure is a reference-counted object that exists as long as at least one task is using the address space represented by the mm structure. Each task structure has a pointer to the mm structure that defines the address space of the task. This pointer is known as the mm pointer. As a special case, tasks that are known to access kernel space only (such as kswapd) are said to have an anonymous address space, and the mm pointer of such tasks is set to NULL. When switching execution to such a task, Linux does not switch the address space (because there is none to switch to) and instead leaves the old one in place. A separate pointer in the task structure tracks which address space has been borrowed in this fashion. This pointer is known as the active mm pointer of the task. For a task that is currently running, this pointer is guaranteed not to be NULL. If the task has its own address space, the active mm pointer has the same value as the mm pointer; otherwise, the active mm pointer refers to the mm structure of the borrowed address space.
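The relationship between the mm pointer and the active mm pointer can be sketched as follows. The classes and the `switch_to` function are hypothetical simplifications for illustration; in particular, reference counting of the mm structure is left out.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class MM:
    """Stand-in for the mm structure that represents one address space."""
    name: str

@dataclass
class Task:
    name: str
    mm: Optional[MM]                   # None for tasks that use kernel space only
    active_mm: Optional[MM] = None     # address space actually in use while running

def switch_to(prev, next_task):
    """Choose the address space that next_task runs on after a switch from prev."""
    if next_task.mm is None:
        # Anonymous address space: leave the old one in place and borrow it.
        next_task.active_mm = prev.active_mm
    else:
        next_task.active_mm = next_task.mm
    return next_task.active_mm

bash_mm = MM("bash")
bash = Task("bash", mm=bash_mm, active_mm=bash_mm)
kswapd = Task("kswapd", mm=None)
print(switch_to(bash, kswapd).name)   # bash
```

After the switch, the kernel-only task has a NULL-like mm pointer but a valid active mm pointer, which matches the guarantee stated above for running tasks.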

Perhaps somewhat surprisingly, the mm structure itself is not a terribly interesting object. However, it is a central hub in the sense that it contains the pointers to the two data structures that are at the core of the virtual memory system: the page table and the list of virtual memory areas, which we describe next. Apart from these two pointers, the mm structure contains miscellaneous information, such as the mm context, which we describe in more detail in Section 4.4.3, a count of the number of virtual pages currently in use (the resident set size, or RSS), the start and end address of the text, data, and stack segments as well as housekeeping information that kswapd uses when looking for virtual memory to page out.

Virtual memory areas

In theory, a page table is all the kernel needs to implement virtual memory. However, page tables are not effective in representing huge address spaces, especially when they are sparse. To see this, let us assume that a process uses 1 Gbyte of its address space for a hash table and then enters 128 Kbytes of data in it. If we assume that the page size is 8 Kbytes and that each entry in the page table takes up 8 bytes, then the page table itself would take up $(1\ \text{Gbyte} / 8\ \text{Kbytes}) \cdot 8\ \text{bytes} = 1\ \text{Mbyte}$ of space, which is eight times the size of the actual data stored in the hash table!
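The arithmetic behind this estimate is easy to reproduce. The helper below is a hypothetical name for this example and assumes a flat page table with exactly one entry per page of the mapped range.

```python
KBYTE, MBYTE, GBYTE = 2**10, 2**20, 2**30

def linear_page_table_bytes(mapped_bytes, page_size=8 * KBYTE, entry_size=8):
    """Size of a flat page table with one entry per page of the mapped range."""
    return (mapped_bytes // page_size) * entry_size

table = linear_page_table_bytes(1 * GBYTE)
print(table // MBYTE)             # 1 (Mbyte of page table)
print(table // (128 * KBYTE))     # 8 (times the 128 Kbytes of actual data)
```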

To avoid this kind of inefficiency, Linux does not represent address spaces with page tables. Instead, it uses lists of vm-area structures (struct vm_area_struct in file include/linux/mm.h). The idea is to divide an address space into contiguous ranges of pages that can be handled in the same fashion. Each range can then be represented by a single vm-area structure. If a process accesses a page for which there is no translation in the page table, the vm-area covering that page has all the information needed to install the missing page. For our hash table example, this means that a single vm-area would suffice to map the entire hash table and that page-table memory would be needed only for recently accessed pages.

Figure 4.5. Example: vm-area mapping a file.

To get a better sense of how the kernel uses vm-areas, let us consider the example in Figure 4.5. It shows a process that maps the first 32 Kbytes (four pages) of the file /etc/termcap at virtual address 0x2000. At the top-left of the figure, we find the task structure of the process and the mm pointer that leads to the mm structure representing the address space of the process. From there, the mmap pointer leads to the first element in the vm-area list. For simplicity, we assume that the vm-area for the mapped file is the only one in this process, so this list contains just one entry. The mm structure also has a pointer to the page table, which is initially empty. Apart from these kernel data structures, the process’s virtual memory is shown in the middle of the figure, the filesystem containing /etc/termcap is represented by the disk-shaped form, and the physical memory is shown on the right-hand side of the figure.

Now, suppose the process attempts to read the word at address 0x6008, as shown by the arrow labeled (1). Because the page table is empty, this attempt results in a page fault. In response to this fault, Linux searches the vm-area list of the current process for a vm-area that covers the faulting address. In our case, it finds that the one and only vm-area on the list maps the address range from 0x2000 to 0xa000 and hence covers the faulting address. By calculating the distance from the start of the mapped area, Linux finds that the process attempted to access page 2 ($\lfloor (\texttt{0x6008} - \texttt{0x2000}) / 8192 \rfloor = 2$). Because the vm-area maps a file, Linux initiates the disk read illustrated by the arrow labeled (2). We assumed that the vm-area maps the first 32 Kbytes of the file, so the data for page 2 can be found at file offsets 0x4000 through 0x5fff. When this data arrives, Linux copies it to an available page frame as illustrated by the arrow labeled (3). In the last step, Linux updates the page table with an entry that maps the virtual page at 0x6000 to the physical page frame that now contains the file data. At this point, the process can resume execution. The read access will be restarted and will now complete successfully, returning the desired file data.
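The address arithmetic in this walk-through can be reproduced with a small model. The `VmArea` record and the two functions are hypothetical simplifications: the real vm-area structure holds far more state, and the `file_offset` field is an assumption added so that the mapping of the first 32 Kbytes of the file can be expressed.

```python
from dataclasses import dataclass

PAGE_SIZE = 8192   # the 8-Kbyte page size assumed in this chapter

@dataclass
class VmArea:
    start: int             # first address covered by the area
    end: int               # first address past the area
    file: str
    file_offset: int = 0   # file offset that is mapped at `start`

def find_vm_area(areas, addr):
    """Linear search of the vm-area list for the area that covers addr."""
    for area in areas:
        if area.start <= addr < area.end:
            return area
    return None

def resolve_fault(areas, addr):
    """Return (page index within the area, virtual page address, file offset range)."""
    area = find_vm_area(areas, addr)
    if area is None:
        raise LookupError("no vm-area covers %#x" % addr)
    page = (addr - area.start) // PAGE_SIZE
    first = area.file_offset + page * PAGE_SIZE
    return page, area.start + page * PAGE_SIZE, (first, first + PAGE_SIZE - 1)

areas = [VmArea(0x2000, 0xa000, "/etc/termcap")]
page, vpage, (lo, hi) = resolve_fault(areas, 0x6008)
print(page, hex(vpage), hex(lo), hex(hi))  # 2 0x6000 0x4000 0x5fff
```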

As this example illustrates, the vm-area list provides Linux with the ability to (re-)create the page-table entry for any address that is mapped in the address space of a process. This implies that the page table can be treated almost like a cache: If the translation for a particular page is present, the kernel can go ahead and use it, and if it is missing, it can be created from the matching vm-area. Treating the page table in this fashion provides a tremendous amount of flexibility because translations for clean pages can be removed at will. Translations for dirty pages can be removed only if they are backed by a file (not by swap space). Before removal, they have to be cleaned by writing the page content back to the file. As we see later, the cache-like behavior of page tables provides the foundation for the copy-on-write algorithm that Linux uses.

AVL trees

As we have seen so far, the vm-area list helps Linux avoid many of the inefficiencies of a system that is based entirely on page tables. However, there is still a problem. If a process maps many different files into its address space, it may end up with a vm-area list that is hundreds or perhaps even thousands of entries long. As this list grows longer, the kernel executes more and more slowly as each page fault requires the kernel to traverse this list. To ameliorate this problem, the kernel tracks the number of vm-areas on the list, and if there are too many, it creates a secondary data structure that organizes the vm-areas as an AVL tree [42, 62]. An AVL tree is a normal binary search tree, except that it has the special property that for each node in the tree, the height of the two subtrees differs by at most 1. Using the standard tree-search algorithm, this property ensures that, given a virtual address, the matching vm-area structure can be found in a number of steps that grows only with the logarithm of the number of vm-areas in the address space.¹

Let us consider a concrete example. Figure 4.6 shows the AVL tree for an Emacs process as it existed right after it was started up on a Linux/ia64 machine. For space reasons, the figure represents each node with a rectangle that contains just the starting and ending address of the address range covered by the vm-area. As is customary for a search tree, the vm-area nodes appear in the order of increasing starting address. Given a node with a starting address of x, the vm-areas with a lower starting address can be found in the lower (“left”) subtree and the vm-areas with a higher starting address can be found in the higher (“right”) subtree. The root of the tree is at the left end of the figure, and, as indicated by the arrows, the tree grows toward the right side. While it is somewhat unusual for a tree to grow from left to right, this representation has the advantage that the higher a node in the figure, the higher its starting address.

[Figure 4.6. AVL tree of the vm-areas of an Emacs process. Each node shows the start and end address of one vm-area; the annotations mark the text, data, stack, shared library, and dynamic loader areas.]

Footnote 1. In Linux v2.4.10, Andrea Arcangeli replaced AVL trees with Red-Black trees [62].

First, observe that this tree is not perfectly balanced: It has a height of six, yet there is a missing node at the fifth level as illustrated by the dashed rectangle. Despite this imperfection, the tree does have the AVL property that the height of the subtrees at any node never differs by more than one. Second, note that the tree contains 47 vm-areas. If we were to use a linear search to find the vm-area for a given address, we would have to visit 23.5 vm-area structures on average and, in the worst case, we might have to visit all 47 of them. In contrast, when the AVL tree is searched, at most six vm-areas have to be visited, as given by the height of the tree. Clearly, using an AVL tree is a big win for complex address spaces. However, for simple address spaces, the overhead of creating the AVL tree and keeping it balanced is too much compared to the cost of searching a short linear list. For this reason, Linux does not create the AVL tree until the address space contains at least 32 vm-areas. Let us emphasize that even when the AVL tree is being maintained, the linear list continues to be maintained as well; this provides an efficient means to visit all vm-area structures.
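To make the difference between the two lookups concrete, the following is a minimal user-space sketch, not the kernel's code. The structure struct vm_area, its field names, and the helpers find_linear(), find_tree(), and build() are all hypothetical; build() merely produces a height-balanced tree from a sorted array and does not implement AVL rebalancing.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified vm-area: covers [start, end). */
struct vm_area {
    uint64_t start;          /* first address covered */
    uint64_t end;            /* first address NOT covered */
    struct vm_area *next;    /* linear list, sorted by start address */
    struct vm_area *left;    /* subtree with lower start addresses */
    struct vm_area *right;   /* subtree with higher start addresses */
};

/* Linear search: may have to visit every vm-area on the list. */
static struct vm_area *find_linear(struct vm_area *head, uint64_t addr,
                                   int *visited)
{
    assert(visited != NULL);
    for (struct vm_area *v = head; v != NULL; v = v->next) {
        ++*visited;
        if (addr < v->start)
            return NULL;     /* list is sorted: no later area can match */
        if (addr < v->end)
            return v;
    }
    return NULL;
}

/* Tree search: visits at most as many vm-areas as the tree is high. */
static struct vm_area *find_tree(struct vm_area *root, uint64_t addr,
                                 int *visited)
{
    assert(visited != NULL);
    struct vm_area *v = root;
    while (v != NULL) {
        ++*visited;
        if (addr < v->start)
            v = v->left;
        else if (addr >= v->end)
            v = v->right;
        else
            return v;
    }
    return NULL;
}

/* Build a height-balanced search tree over the sorted range areas[lo..hi). */
static struct vm_area *build(struct vm_area *areas, int lo, int hi)
{
    if (lo >= hi)
        return NULL;
    int mid = lo + (hi - lo) / 2;
    areas[mid].left = build(areas, lo, mid);
    areas[mid].right = build(areas, mid + 1, hi);
    return &areas[mid];
}
```

With 47 vm-areas, as in the Emacs example, looking up an address in the last area visits all 47 list entries but no more than six tree nodes.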

Anatomy of the vm-area structure

So far, we discussed the purpose of the vm-area structure and how the Linux kernel uses it, but not what it looks like. The list below rectifies this situation by describing the major components of the vm-area:

• Address range: Describes the address range covered by the vm-area in the form of a start and end address. It is noteworthy that the end address is the address of the first byte that is not covered by the vm-area.

• VM flags: Consist of a single word that contains various flag bits. The most important among them are the access right flags VM_READ, VM_WRITE, and VM_EXEC, which control whether the process can, respectively, read, write, or execute the virtual memory mapped by the vm-area. Two other important flags are VM_GROWSDOWN and VM_GROWSUP, which control whether the address range covered by the vm-area can be extended toward lower or higher addresses, respectively. As we see later, this provides the means to grow user stacks dynamically.

• Linkage info: Contains various linkage information, including the pointer needed for the mm structure’s vm-area list, pointers to the left and right subtrees of the AVL tree, and a pointer that leads back to the mm structure to which the vm-area belongs.

• VM operations and private data: Contains the VM operations pointer, which is a pointer to a set of callback functions that define how various virtual-memory-related events, such as page faults, are to be handled. The component also contains a private data pointer that can be used by the callback functions as a hook to maintain information that is vm-area-specific.

• Mapped file info: If a vm-area maps a portion of a file, this component stores the file pointer and the file offset needed to locate the file data.

Note that the vm-area structure is not reference-counted. There is no need to do that because each structure belongs to one and only one mm structure, which is already reference-counted. In other words, when the reference count of an mm structure reaches 0, it is clear that the vm-area structures owned by it are also no longer needed.

A second point worth making is that the VM operations pointer gives the vm-area characteristics that are object-like because different types of vm-areas can have different handlers for responding to virtual-memory-related events. Indeed, Linux allows each filesystem, character device, and, more generally, any object that can be mapped into user space by mmap() to provide its own set of VM operations. The operations that can be provided in this fashion are open(), close(), and nopage(). The open() and close() callbacks are invoked whenever a vm-area is created or destroyed, respectively, and are used primarily to keep track of the number of vm-areas that are currently using the underlying object. The nopage() callback is invoked when a page fault occurs for an address for which there is no page-table entry. The Linux kernel provides default implementations for each of these callbacks. These default versions are used if either the VM operations pointer or a particular callback pointer is NULL. For example, if the nopage() callback is NULL, Linux handles the page fault by creating an anonymous page, which is a process-private page whose content is initially cleared to 0.
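The fallback to default callbacks can be sketched in user space as follows. This is an illustration under simplifying assumptions, not the kernel's definitions: the structure layouts, the callback signatures, and the names handle_nopage(), anonymous_page(), buffer_nopage(), and buffer_ops are hypothetical, and calloc() stands in for the kernel's page allocator.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

#define PAGE_SIZE 8192   /* assumed 8-Kbyte pages */

struct vm_area;

/* Callback table; any pointer may be NULL to request the default. */
struct vm_operations {
    void  (*open)(struct vm_area *vma);
    void  (*close)(struct vm_area *vma);
    void *(*nopage)(struct vm_area *vma, unsigned long addr);
};

struct vm_area {
    const struct vm_operations *ops;   /* may be NULL */
    void *private_data;                /* hook for the callbacks */
};

/* Default behavior: an anonymous page, initially cleared to 0. */
static void *anonymous_page(void)
{
    return calloc(1, PAGE_SIZE);
}

/* Dispatch a fault on an address that has no page-table entry. */
static void *handle_nopage(struct vm_area *vma, unsigned long addr)
{
    assert(vma != NULL);
    if (vma->ops != NULL && vma->ops->nopage != NULL)
        return vma->ops->nopage(vma, addr);
    return anonymous_page();
}

/* Example object: backs every fault with the buffer in private_data. */
static void *buffer_nopage(struct vm_area *vma, unsigned long addr)
{
    (void)addr;
    return vma->private_data;
}

static const struct vm_operations buffer_ops = { NULL, NULL, buffer_nopage };
```

A vm-area whose ops pointer is NULL receives a zero-filled page from handle_nopage(), whereas a vm-area that uses buffer_ops receives whatever its own callback returns.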

4.2.2 Page-table-mapped kernel segment

Let us now return to Figure 4.4 and take a closer look at the kernel address space. The right-hand side of this figure is an enlargement of the kernel space and shows that it contains two segments: the identity-mapped segment and the page-table-mapped segment. The latter is mapped by a kernel-private page table and is used primarily to implement the kernel vmalloc arena (file include/linux/vmalloc.h). The kernel uses this arena to allocate large blocks of memory that must be contiguous in virtual space. For example, the memory required to load a kernel module is allocated from this arena. The address range occupied by the vmalloc arena is defined by the platform-specific constants VMALLOC_START and VMALLOC_END. As indicated in the figure, the vmalloc arena does not necessarily occupy the entire page-table-mapped segment. This makes it possible to use part of the segment for platform-specific purposes.

4.2.3 Identity-mapped kernel segment

The identity-mapped segment starts at the address defined by the platform-specific constant PAGE_OFFSET. This segment contains the Linux kernel image, including its text, data, and stack segments. In other words, this is the segment that the kernel is executing in when in kernel mode (except when executing in a module).

The identity-mapped segment is special because there is a direct mapping between a virtual address in this segment and the physical address that it translates to. The exact formula for this mapping is platform specific, but it is often as simple as vaddr − PAGE_OFFSET. This one-to-one (identity) relationship between virtual and physical addresses is what gives the segment its name.

The segment could be implemented with a normal page table. However, because there is a direct relationship between virtual and physical addresses, many platforms can optimize this case and avoid the overhead of a page table. How this is done on IA-64 is described in Section 4.5.3.

Because the actual formula to translate between a physical address and the equivalent virtual address is platform specific, the kernel uses the interface in Figure 4.7 to perform such translations. The interface provides two routines: __pa() expects a single argument, vaddr, and returns the physical address that corresponds to it. The return value is undefined if vaddr does not point inside the kernel’s identity-mapped segment. Routine __va() provides the reverse mapping: it takes a physical address paddr and returns the corresponding virtual address. Usually the Linux kernel expects virtual addresses to have a pointer type (such as void *) and physical addresses to have a type of unsigned long. However, the __pa() and __va() macros are polymorphic and accept arguments of either type.

unsigned long __pa(vaddr); /* translate virtual address to physical address */
void *__va(paddr); /* translate physical address to virtual address */

Figure 4.7. Kernel interface to convert between physical and virtual addresses.

A platform is free to employ an arbitrary mapping between physical and virtual addresses provided that the following relationships are true:

__va(__pa(vaddr)) = vaddr    for all vaddr inside the identity-mapped segment

paddr1 < paddr2 ⇒ __va(paddr1) < __va(paddr2)

That is, mapping any virtual address inside the identity-mapped segment to a physical address and back must return the original virtual address. The second condition is that the mapping must be monotonic, i.e., the relative order of a pair of physical addresses is preserved when they are mapped to virtual addresses.

We might wonder why the constant that marks the beginning of the identity-mapped segment is called PAGE_OFFSET. The reason is that the page frame number pfn for an address addr in this segment can be calculated as:

pfn = (addr − PAGE_OFFSET) / PAGE_SIZE

As we will see next, even though the page frame number is easy to calculate, the Linux kernel does not use it very often.
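The simplest mapping that satisfies both conditions is a constant offset. The sketch below illustrates that case only; it is not the definition used by any particular platform. The functions pa_sketch(), va_sketch(), and pfn_sketch() are hypothetical stand-ins for __pa(), __va(), and the page frame number formula, and the values chosen for PAGE_OFFSET (the start of region 7 on Linux/ia64) and PAGE_SIZE (8 Kbytes) are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed values: start of the identity-mapped segment, 8-Kbyte pages. */
#define PAGE_OFFSET 0xe000000000000000ULL
#define PAGE_SIZE   8192ULL

/* Stand-in for __pa(): only meaningful inside the identity-mapped segment. */
static uint64_t pa_sketch(uint64_t vaddr)
{
    assert(vaddr >= PAGE_OFFSET);
    return vaddr - PAGE_OFFSET;
}

/* Stand-in for __va(): the reverse mapping. */
static uint64_t va_sketch(uint64_t paddr)
{
    return paddr + PAGE_OFFSET;
}

/* Page frame number of an address in the identity-mapped segment. */
static uint64_t pfn_sketch(uint64_t vaddr)
{
    assert(vaddr >= PAGE_OFFSET);
    return (vaddr - PAGE_OFFSET) / PAGE_SIZE;
}
```

Because adding a constant preserves order, this mapping is monotonic, and va_sketch(pa_sketch(v)) returns v for every v at or above PAGE_OFFSET.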

Page frame map

Linux uses a table called the page frame map to keep track of the status of the physical page frames in a machine. For each page frame, this table contains exactly one page frame descriptor (struct page in file include/linux/mm.h). This descriptor contains various housekeeping information, such as a count of the number of address spaces that are using the page frame, various flags that indicate whether the frame can be paged out to disk, whether it has been accessed recently, or whether it is dirty (has been written to), and so on.

While the exact content of the page frame descriptor is of no concern for this chapter, we do need to understand that Linux often uses page frame descriptor pointers in lieu of page frame numbers. The Linux kernel leaves it to platform-specific code how virtual addresses in the identity-mapped segment are translated to page frame descriptor pointers, and vice versa. It uses the interface shown in Figure 4.8 for this purpose.

Because we are not concerned with the internals of the page frame descriptor, Figure 4.8 lists its type (struct page) simply as an opaque structure. The virt_to_page() routine can be used to obtain the page frame descriptor pointer for a given virtual address. It expects one argument, vaddr, which must be an address inside the identity-mapped segment, and returns a pointer to the corresponding page frame descriptor. The page_address() routine provides the reverse mapping: It expects the page argument to be a pointer to a page frame descriptor and returns the virtual address inside the identity-mapped segment that maps the corresponding page frame.

struct page; /* page frame descriptor */
struct page *virt_to_page(vaddr); /* return page frame descriptor for vaddr */
void *page_address(page); /* return virtual address for page */

Figure 4.8. Kernel interface to convert between pages and virtual addresses.

Historically, the page frame map was implemented with a single array of page frame descriptors. This array was called mem_map and was indexed by the page frame number. In other words, the value returned by virt_to_page() could be calculated as:

&mem_map[(addr − PAGE_OFFSET) / PAGE_SIZE]

However, on machines with a physical address space that is either fragmented or has huge holes, using a single array can be problematic. In such cases, it is better to implement the page frame map by using multiple partial maps (e.g., one map for each set of physically contiguous page frames). The interface in Figure 4.8 provides the flexibility necessary for platform-specific code to implement such solutions, and for this reason the Linux kernel no longer uses the above formula directly.

High memory support

The size of the physical address space has no direct relationship to the size of the virtual address space. It could be smaller than, the same size as, or even larger than the virtual space. On a new architecture, the virtual address space is usually designed to be much larger than the largest anticipated physical address space. Not surprisingly, this is the case for which Linux is designed and optimized.

However, the size of the physical address space tends to increase roughly in line with Moore’s Law, which predicts a doubling of chip capacity every 18 months [57]. Because the virtual address space is part of an architecture, its size cannot be changed easily (e.g., changing it would at the very least require recompilation of all applications). Thus, over the course of many years, the size of the physical address space tends to encroach on the size of the virtual address space until, eventually, it becomes as large as or larger than the virtual space.

This is a problem for Linux because once the physical memory has a size similar to that of the virtual space, the identity-mapped segment may no longer be large enough to map the entire physical space. For example, the IA-32 architecture defines an extension that supports a 36-bit physical address space even though the virtual address space has only 32 bits. Clearly, the physical address space cannot fit inside the virtual address space.


unsigned long kmap(page); /* map page frame into virtual space */
kunmap(page); /* unmap page frame from virtual space */

Figure 4.9. Primary routines for the highmem interface.

The Linux kernel alleviates this problem through the highmem interface (file include/linux/highmem.h). High memory is physical memory that cannot be addressed through the identity-mapped segment. The highmem interface provides indirect access to this memory by dynamically mapping high memory pages into a small portion of the kernel address space that is reserved for this purpose. This part of the kernel address space is known as the kmap segment.

Figure 4.9 shows the two primary routines provided by the highmem interface: kmap() maps the page frame specified by argument page into the kmap segment. The argument must be a pointer to the page frame descriptor of the page to be mapped. The routine returns the virtual address at which the page was mapped. If the kmap segment is full at the time this routine is called, it will block until space becomes available. This implies that high memory cannot be used in interrupt handlers or any other code that cannot block execution for an indefinite amount of time. Both high and normal memory pages can be mapped with this routine, though in the latter case kmap() simply returns the appropriate address in the identity-mapped segment.

When the kernel has finished using a high memory page, it unmaps the page by a call to kunmap(). The page argument passed to this routine is a pointer to the page frame descriptor of the page that is to be unmapped. Unmapping a page frees up the virtual address space that the page occupied in the kmap segment. This space then becomes available for use by other mappings. To reduce the amount of blocking resulting from a full kmap segment, Linux attempts to minimize the amount of time that high memory pages are mapped.

Clearly, supporting high memory incurs extra overhead and limitations in the kernel and should be avoided where possible. For this reason, high memory support is an optional component of the Linux kernel. Because IA-64 affords a vastly larger virtual address space than that provided by 32-bit architectures, high memory support is not needed and therefore disabled in Linux/ia64. However, it should be noted that the highmem interface is available even on platforms that do not provide high memory support. On those platforms, kmap() is equivalent to page_address() and kunmap() performs no operation. These dummy implementations greatly simplify writing platform-independent kernel code. Indeed, it is good kernel programming practice to use the kmap() and kunmap() routines whenever possible. Doing so results in more efficient memory use on platforms that need high memory support (such as IA-32) without impacting the platforms that do not need it (such as IA-64).

Summary

Figure 4.10 summarizes the relationship between physical memory and kernel virtual space for a hypothetical machine that has high memory support enabled. In this machine, the identity-mapped segment can map only the first seven page frames of the physical address space; the remaining memory, consisting of page frames 7 through 12, is high memory and can be accessed through the kmap segment only. The figure illustrates the case in which page frames 8 and 10 have been mapped into this segment. Because our hypothetical machine has a kmap segment that consists of only two pages, the two mappings use up all available space. Trying to map an additional high memory page frame by calling kmap() would block the caller until page frame 8 or 10 is unmapped by a call to kunmap().

[Figure 4.10. Summary of identity-mapped segment and high memory support.]

Let us now turn attention to the arrow labeled vaddr. It points to the middle of the second-last page mapped by the identity-mapped segment. We can find the physical address of vaddr with the __pa() routine. As the arrow labeled __pa(vaddr) illustrates, this physical address not surprisingly points to the middle of page frame 5 (the second-to-last page frame in normal memory).

The figure illustrates the page frame map as the diagonally shaded area inside the identity-mapped segment (we assume that our hypothetical machine uses a single contiguous table for this purpose). Note that this table contains page frame descriptors for all page frames in the machine, including the high memory page frames. To get more information on the status of page frame 5, we can use virt_to_page(vaddr) to get the page pointer for the page frame descriptor of that page. This is illustrated in the figure by the arrow labeled page. Conversely, we can use the page pointer to calculate page_address(page) to obtain the starting address of the virtual page that contains vaddr.
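The hypothetical machine of Figure 4.10 can be modeled in a few lines of user-space C. The sketch below is only a model under stated assumptions: the constants (including the made-up KMAP_BASE address), the one-field struct page, and the functions with the _sketch suffix are all hypothetical. Unlike the real kmap(), which blocks when the kmap segment is full, kmap_sketch() returns 0 in that situation so that the behavior can be observed, and it does not track repeated mappings of the same page.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* All constants are assumptions modeled on the hypothetical machine. */
#define PAGE_SIZE     8192ULL
#define PAGE_OFFSET   0xe000000000000000ULL  /* identity-mapped segment */
#define KMAP_BASE     0xa000000000000000ULL  /* made-up kmap segment start */
#define NUM_FRAMES    13                     /* page frames 0 through 12 */
#define NORMAL_FRAMES 7                      /* frames 0 through 6 */
#define KMAP_SLOTS    2                      /* kmap segment holds 2 pages */

struct page { int unused; };                 /* opaque descriptor stand-in */

static struct page mem_map[NUM_FRAMES];      /* single contiguous map */
static struct page *kmap_slot[KMAP_SLOTS];   /* NULL means slot is free */

static uint64_t page_address_sketch(const struct page *page)
{
    return PAGE_OFFSET + (uint64_t)(page - mem_map) * PAGE_SIZE;
}

static struct page *virt_to_page_sketch(uint64_t vaddr)
{
    assert(vaddr >= PAGE_OFFSET &&
           vaddr < PAGE_OFFSET + NORMAL_FRAMES * PAGE_SIZE);
    return &mem_map[(vaddr - PAGE_OFFSET) / PAGE_SIZE];
}

/* Returns the mapped address, or 0 where the real kmap() would block. */
static uint64_t kmap_sketch(struct page *page)
{
    if (page - mem_map < NORMAL_FRAMES)
        return page_address_sketch(page);    /* normal memory */
    for (int i = 0; i < KMAP_SLOTS; i++) {
        if (kmap_slot[i] == NULL) {
            kmap_slot[i] = page;
            return KMAP_BASE + i * PAGE_SIZE;
        }
    }
    return 0;                                /* kmap segment is full */
}

/* Frees the kmap slot occupied by the page, if any. */
static void kunmap_sketch(struct page *page)
{
    for (int i = 0; i < KMAP_SLOTS; i++)
        if (kmap_slot[i] == page)
            kmap_slot[i] = NULL;
}
```

Mapping page frames 8 and 10 fills both slots, a further request for page frame 9 fails until one of them is unmapped, and a request for page frame 5 simply yields its address in the identity-mapped segment.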

4.2.4 Structure of IA-64 address space

The IA-64 architecture provides a full 64-bit virtual address space. As illustrated in Figure 4.11, the address space is divided into eight regions of equal size. Each region covers 2^61 bytes or 2048 Pbytes. Regions are numbered from 0 to 7 according to the top three bits of the address range they cover. The IA-64 architecture has no a priori restrictions on how these regions can be used. However, Linux/ia64 uses regions 0 through 4 as the user address space and regions 5 through 7 as the kernel address space.

[Figure 4.11. Structure of Linux/ia64 address space. The left half shows regions 0 through 7 and their uses; the right half shows the layout of region 5 with the guard page, gate page, per-CPU page, and vmalloc arena.]

There are also no restrictions on how a process can use the five regions that map the user space, but the usage illustrated in the figure is typical: Region 1 is used for shared memory segments and shared libraries, region 2 maps the text segment, region 3 the data segment, and region 4 the memory and register stacks of a process. Region 0 normally remains unused by 64-bit applications but is available for emulating a 32-bit operating system such as IA-32 Linux.

In the kernel space, the figure shows that the identity-mapped segment is implemented in region 7 and that region 5 is used for the page-table-mapped segment. Region 6 is identity-mapped like region 7, but the difference is that accesses through region 6 are not cached. As we discuss in Chapter 7, Device I/O, this provides a simple and efficient means for memory-mapped I/O.

The right half of Figure 4.11 provides additional detail on the anatomy of region 5. As illustrated there, the first page is the guard page. It is guaranteed not to be mapped so that any access is guaranteed to result in a page fault. As we see in Chapter 5, Kernel Entry and Exit, this page is used to accelerate the permission checks required when data is copied across the user/kernel boundary. The second page in this region serves as the gate page. It assists in transitioning from the user to the kernel level, and vice versa. For instance, as we also see in Chapter 5, this page is used when a signal is delivered and could also be used for certain system calls. The third page is called the per-CPU page. It provides one page of CPU-local data, which is useful on MP machines. We discuss this page in more detail in Chapter 8, Symmetric Multiprocessing. The remainder of region 5 is used as the vmalloc arena and spans the address range from VMALLOC_START to VMALLOC_END. The exact values of these platform-specific constants depend on the page size. As customary in this chapter, the figure illustrates the case in which a page size of 8 Kbytes is in effect.

[Figure 4.12. Format of IA-64 virtual address.]

[Figure 4.13. Address-space hole within a region with IMPL_VA_MSB=50.]

Virtual address format

Even though IA-64 defines a 64-bit address space, implementations are not required to fully support each address bit. Specifically, the virtual address format mandated by the architecture is illustrated in Figure 4.12. As shown in the figure, bits 61 through 63 must be implemented because they are used to select the virtual region number (vrn).

The lower portion of the virtual address consists of a CPU-model-specific number of bits. The most significant bit is identified by constant IMPL_VA_MSB. This value must be in the range of 50 to 60. For example, on Itanium this constant has a value of 50, meaning that the lower portion of the virtual address consists of 51 bits.

The unimplemented portion of the virtual address consists of bits IMPL_VA_MSB+1 through 60. Even though they are marked as unimplemented, the architecture requires that the value in these bits match the value in bit IMPL_VA_MSB. In other words, the unimplemented bits must correspond to the sign-extended value of the lower portion of the virtual address. This restriction has been put in place to ensure that software does not abuse unimplemented bits for purposes such as type tag bits. Otherwise, such software might break when running on a machine that implements a larger number of virtual address bits.

On implementations where IMPL_VA_MSB is less than 60, this sign extension has the effect of dividing the virtual address space within a region into two disjoint areas. Figure 4.13 illustrates this for the case in which IMPL_VA_MSB=50: The sign extension creates the unimplemented area in the middle of the region. Any access to that area will cause the CPU to take a fault. For a user-level access, such a fault is normally translated into an illegal instruction signal (SIGILL). At the kernel level, such an access would cause a kernel panic.

Although an address-space hole in the middle of a region may seem problematic, it really poses no particular problem and in fact provides an elegant way to leave room for future growth without impacting existing application-level software. To see this, consider an application that requires a huge data heap. If the heap is placed in the lower portion of the region, it can grow toward higher addresses. On a CPU with IMPL_VA_MSB=50, the heap could grow to at most 1024 Tbytes. However, when the same application is run on a CPU with IMPL_VA_MSB=51, the heap could now grow up to 2048 Tbytes, without changing its starting address. Similarly, data structures that grow toward lower addresses (such as the memory stack) can be placed in the upper portion of the region and can then grow toward the CPU-model-specific lower bound of the implemented address space. Again, the application can run on different implementations and take advantage of the available address space without moving the starting point of the data structure.

[Figure 4.14. Format of IA-64 physical address.]

Of course, an address-space hole in the middle of a region does imply that an application must not, e.g., attempt to sequentially access all possible virtual addresses in a region. Given how large a region is, this operation would not be a good idea at any rate and so is not a problem in practice.
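The sign-extension rule described above translates directly into a bit test. The following sketch is written from the textual description alone and is not taken from the kernel; the function names region_number(), va_is_implemented(), and max_heap_bytes() are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Virtual region number: the top three bits (61 through 63). */
static unsigned region_number(uint64_t vaddr)
{
    return (unsigned)(vaddr >> 61);
}

/*
 * An address is implemented if bits impl_va_msb+1 through 60 all match
 * bit impl_va_msb, i.e., if bits impl_va_msb through 60 are either all 0
 * or all 1.
 */
static int va_is_implemented(uint64_t vaddr, unsigned impl_va_msb)
{
    assert(impl_va_msb >= 50 && impl_va_msb <= 60);
    uint64_t region_bits = (1ULL << 61) - 1;           /* bits 0..60 */
    uint64_t low_bits = (1ULL << impl_va_msb) - 1;     /* bits below msb */
    uint64_t sign_field = region_bits & ~low_bits;     /* bits msb..60 */
    uint64_t field = vaddr & sign_field;
    return field == 0 || field == sign_field;
}

/* Largest heap that can grow upward from the start of a region. */
static uint64_t max_heap_bytes(unsigned impl_va_msb)
{
    assert(impl_va_msb >= 50 && impl_va_msb <= 60);
    return 1ULL << impl_va_msb;
}
```

For IMPL_VA_MSB=50, max_heap_bytes() yields 2^50 bytes, which is the 1024 Tbytes quoted above, and for IMPL_VA_MSB=51 it yields 2048 Tbytes.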

Physical address space

The physical address format used by IA-64 is illustrated in Figure 4.14. Like virtual addresses, physical addresses are 64 bits wide. However, bit 63 is the uc bit and serves a special purpose: If 0, it indicates a cacheable memory access; if 1, it indicates an uncacheable access. The remaining bits in a physical address are split into two portions: implemented and unimplemented bits. As the figure shows, the lower portion must be implemented and covers bits 0 up to a CPU-model-specific bit number called IMPL_PA_MSB. The architecture requires this constant to be in the range of 32 to 62. For example, Itanium implements 44 address bits and therefore IMPL_PA_MSB is 43. The unimplemented portion of a physical address extends from bit IMPL_PA_MSB+1 to 62. Unlike a virtual address, a valid physical address must have all unimplemented bits cleared to 0 (i.e., the unimplemented portion is the zero-extended instead of the sign-extended value of the implemented portion).

The physical address format gives rise to the physical address space illustrated in Figure 4.15. As determined by the uc bit, it is divided into two halves: The lower half is the cached physical address space and the upper half is the uncached space. Note that physical addresses x and 2^63 + x correspond to the same memory location; the only difference is that an access to the latter address will bypass all caches. In other words, the two halves alias each other.

If IMPL_PA_MSB is smaller than 62, the upper portion of each half is unimplemented. Any attempt to access memory in this portion of the physical address space causes the CPU to take an UNIMPLEMENTED DATA ADDRESS FAULT.

[Figure 4.15. Physical address space with IMPL_PA_MSB=43. The lower half is cached and the upper half is uncached; the upper portion of each half is unimplemented.]

Recall from Figure 4.11 that Linux/ia64 employs a single region for the identity-mapped segment. Because a region spans 61 address bits, Linux can handle IMPL_PA_MSB values of up to 60 before the region fills up and high memory support needs to be enabled. To get a back-of-the-envelope estimate of how quickly this could happen, let us assume that at the inception of IA-64 the maximum practical physical memory size was 1 Tbyte (2^40 bytes). Furthermore, let us assume that memory capacity doubles roughly every 18 months. Both assumptions are somewhat on the aggressive side. Even so, more than three decades would have to pass before high memory support would have to be enabled. In other words, it is likely that high memory support will not be necessary during most of the life span, or even the entire life span, of the IA-64 architecture.
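The physical address rules and the estimate above can be checked with a short sketch. The function names pa_is_implemented(), uncached_alias(), and years_to_grow() are hypothetical, and the code is derived only from the description in this section. The estimate itself is simple arithmetic: growing from 2^40 bytes to the 2^61 bytes that fit into one region takes 21 doublings, and 21 × 1.5 years is 31.5 years.

```c
#include <assert.h>
#include <stdint.h>

#define UC_BIT (1ULL << 63)   /* bit 63: 1 selects an uncacheable access */

/* Valid if every unimplemented bit (impl_pa_msb+1 through 62) is 0. */
static int pa_is_implemented(uint64_t paddr, unsigned impl_pa_msb)
{
    assert(impl_pa_msb >= 32 && impl_pa_msb <= 62);
    uint64_t implemented = (1ULL << (impl_pa_msb + 1)) - 1;  /* bits 0..msb */
    return (paddr & ~UC_BIT & ~implemented) == 0;
}

/* Uncached alias of a physical address: same location, caches bypassed. */
static uint64_t uncached_alias(uint64_t paddr)
{
    return paddr | UC_BIT;
}

/*
 * Years until memory grows from 2^from_bits to 2^to_bits bytes, assuming
 * one doubling every 18 months.
 */
static double years_to_grow(unsigned from_bits, unsigned to_bits)
{
    assert(to_bits >= from_bits);
    return 1.5 * (double)(to_bits - from_bits);
}
```

With IMPL_PA_MSB=43, the highest implemented cached address is 0x00000fffffffffff, and its uncached alias is 0x80000fffffffffff.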

4.3 PAGE TABLES

Linux maintains the page table of each process in physical memory and accesses a page table through the identity-mapped kernel segment. Because they are stored in physical memory, page tables themselves cannot be swapped out to disk. This means that a process with a huge virtual address space could run out of memory simply because the page table alone uses up all available memory. Similarly, if thousands of processes are running, the page tables could take up most or even all of the available memory, making it impossible for the processes to run efficiently. However, on modern machines, main memory is usually large enough to make these issues either theoretical or at most second-order. On the positive side, keeping the page tables in physical memory greatly simplifies kernel design because there is no possibility that handling a page fault would cause another (nested) page fault.

Each page table is represented as a multiway tree. Logically, the tree has three levels, as illustrated in Figure 4.16. As customary in this chapter, this figure shows the tree growing from left to right: at the first level (leftmost part of the figure), we find the global directory (pgd); at the second level (to the right of the global directory), we find the middle directories (pmd); and at the third level we find the PTE directories. Typically, each directory (node in the tree) occupies one page frame and contains a fixed number of entries. Entries in the global and middle directories are either not present or they point to a directory in the next level of the tree. The PTE directories form the leaves of the tree, and their entries consist of page-table entries (PTEs).

[Figure 4.16. Linux page-table tree.]

The primary benefit of implementing a page table with a multiway tree instead of a linear array is that the former takes up space that is proportional only to the virtual address space actually in use, instead of being proportional to the maximum size of the virtual address space. To see this, consider that with a 1-Gbyte virtual address space and 8-Kbyte pages, a linear page table would need storage for more than 131,000 PTEs, even if not a single virtual page was actually in use. In contrast, with a multiway tree, an empty address space requires only a global directory whose entries are all marked not present.

Another benefit of multiway trees is that each node (directory) in the tree has a fixed size (usually a page). This makes it unnecessary to reserve large, physically contiguous regions of memory as would be required for a linear page table. Also, because physical memory is managed as a set of page frames anyhow, the fixed node size makes it easy to build a multiway tree incrementally, as is normally done with demand paging.

Going back to Figure 4.16, we see that the size and structure of the directories are controlled by platform-specific constants. The number of entries in the global, middle, and PTE directories is given by constants PTRS_PER_PGD, PTRS_PER_PMD, and PTRS_PER_PTE, respectively. The global directory is special because often only part of it can be used to map user space (the rest is either reserved or used for kernel purposes). Two additional parameters define the portion of the global directory that is available for user space. Specifically, FIRST_USER_PGD_NR is the index of the first entry, and USER_PTRS_PER_PGD is the total number of global-directory entries available to map user space. For the middle and PTE directories, all entries are assumed to map user space.

[Figure 4.17. Virtual-to-physical address translation using the page table.]

So how can we use the page table to translate a virtual address to the corresponding physical address? Figure 4.17 illustrates this. At the top, we see that a virtual address, for the purpose of a page-table lookup, is broken up into multiple fields. The fields used to look up the page table are pgd index, pmd index, and pte index. As their names suggest, these fields are used to index the global, middle, and PTE directories, respectively. A page-table lookup starts with the page-table pointer stored in the mm structure of a process. In the figure, this is illustrated by the arrow labeled page-table pointer. It points to the global directory, i.e., the root of the page-table tree. Given the global directory, the pgd index tells us which entry contains the address of the middle directory. With the address of the middle directory, we can use the pmd index to tell us which entry contains the address of the PTE directory. With the address of the PTE directory, we can use the pte index to tell us which entry contains the PTE of the page that the virtual address maps to. With the PTE, we can calculate the address of the physical page frame that backs the virtual page. To finish up the physical address calculation, we just need to add the value in the offset field of the virtual address to the page frame address.
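The lookup just described can be sketched in C as a walk over a small in-memory model of the tree. Everything in this sketch is an assumption made for illustration: the layout (8-Kbyte pages, 1024 entries per directory), the directory types, the "present" flag in bit 0 of a PTE, and the function name translate. It is not the kernel's own code.

```c
#include <stddef.h>
#include <stdint.h>

/* Assumed layout, for illustration only: 8-Kbyte pages and 1024 entries
 * per directory. Real platforms define their own constants. */
#define PAGE_SHIFT    13
#define INDEX_BITS    10
#define PTRS_PER_DIR  (1u << INDEX_BITS)
#define PMD_SHIFT     (PAGE_SHIFT + INDEX_BITS)   /* 23 */
#define PGD_SHIFT     (PMD_SHIFT + INDEX_BITS)    /* 33 */
#define INDEX_MASK    ((uint64_t)PTRS_PER_DIR - 1)
#define OFFSET_MASK   (((uint64_t)1 << PAGE_SHIFT) - 1)
#define PTE_PRESENT   ((uint64_t)1)  /* hypothetical "present" flag, bit 0 */

/* Hypothetical directory types; a NULL pointer means "not present". */
typedef struct { uint64_t pte[PTRS_PER_DIR]; }   pte_dir;
typedef struct { pte_dir *entry[PTRS_PER_DIR]; } pmd_dir;
typedef struct { pmd_dir *entry[PTRS_PER_DIR]; } pgd_dir;

/* Walk the tree for va. Returns 0 and stores the physical address in *pa,
 * or returns -1 if any level is marked not present. */
static int translate(const pgd_dir *pgd, uint64_t va, uint64_t *pa)
{
    const pmd_dir *pmd;
    const pte_dir *ptes;
    uint64_t pte;

    pmd = pgd->entry[(va >> PGD_SHIFT) & INDEX_MASK];    /* pgd index */
    if (pmd == NULL)
        return -1;
    ptes = pmd->entry[(va >> PMD_SHIFT) & INDEX_MASK];   /* pmd index */
    if (ptes == NULL)
        return -1;
    pte = ptes->pte[(va >> PAGE_SHIFT) & INDEX_MASK];    /* pte index */
    if ((pte & PTE_PRESENT) == 0)
        return -1;

    /* Page frame address (flag bits cleared) plus the offset field. */
    *pa = (pte & ~OFFSET_MASK) + (va & OFFSET_MASK);
    return 0;
}
```

In this model, the pgd argument plays the role of the page-table pointer, and each directory level is consulted exactly once, in the order global, middle, PTE.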


Note that the width of the pgd index, pmd index, and pte index fields is dictated by the number of pointers that can be stored in a global, middle, and PTE directory, respectively. Similarly, the width of the offset field is dictated by the page size. Because these fields by definition consist of an integral number of bits, the page size and the number of entries stored in each directory all must be integer powers of 2. If the sum of the widths of these fields is less than the width of a virtual address, some of the address bits remain unused. The figure illustrates this with the field labeled unused. On 32-bit platforms, there are usually no unused bits. However, on 64-bit platforms, the theoretically available virtual address space is so big that it is not unusual for some bits to remain unused. We discuss this in more detail when discussing the IA-64 implementation of the virtual memory system.
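The relationship between the directory sizes, the page size, and the field widths can be made concrete with a short C sketch. The function names field_width and unused_bits are hypothetical, and the parameter values in the note after the code are assumed for illustration.

```c
#include <stdint.h>

/* Returns log2(n) when n is a power of 2, or -1 otherwise. A field can
 * only be an integral number of bits wide if n is a power of 2. */
static int field_width(uint64_t n)
{
    int bits = 0;

    if (n == 0 || (n & (n - 1)) != 0)
        return -1;
    while (n > 1) {
        n >>= 1;
        bits++;
    }
    return bits;
}

/* Number of virtual address bits left unused by the three index fields
 * and the offset field. Returns -1 if a parameter is not a power of 2
 * or if the fields do not fit into va_bits. */
static int unused_bits(int va_bits, uint64_t ptrs_per_pgd,
                       uint64_t ptrs_per_pmd, uint64_t ptrs_per_pte,
                       uint64_t page_size)
{
    int pgd = field_width(ptrs_per_pgd);
    int pmd = field_width(ptrs_per_pmd);
    int pte = field_width(ptrs_per_pte);
    int off = field_width(page_size);
    int used;

    if (pgd < 0 || pmd < 0 || pte < 0 || off < 0)
        return -1;
    used = pgd + pmd + pte + off;
    return used > va_bits ? -1 : va_bits - used;
}
```

With 1024 entries in every directory and 8-Kbyte pages, the fields occupy 10 + 10 + 10 + 13 = 43 bits, so unused_bits(64, 1024, 1024, 1024, 8192) reports 21 unused bits in a 64-bit address.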

4.3.1

Collapsing page-table levels

At the beginning of this section, we said that the Linux page tables logically contain three levels. The reason is that each platform is free to implement fewer than three levels. This is possible because the interfaces that Linux uses to access and manipulate page tables have been carefully structured to allow collapsing one or even two levels of the tree into the global directory. This is an elegant solution because it allows platform-independent code to be written as if each platform implemented three levels, yet on platforms that implement fewer levels, the code accessing the unimplemented levels is optimized away completely by the C compiler. That is, no extra overhead results from the extraneous logical page-table levels.

The basic idea behind collapsing a page-table level is to treat a single directory entry as if it were the entire directory for the next level. For example, we can collapse the middle directory into the global directory by treating each global-directory entry as if it were a middle directory with just a single entry (i.e., PTRS_PER_PMD = 1). Later in this chapter, we see some concrete examples of how this works when a page table is accessed.
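A minimal sketch of this idea in C follows. The type and function names (demo_pgd_t, demo_pmd_t, demo_pmd_index, demo_pmd_offset) and the shift value are hypothetical and chosen only for illustration; they are not the kernel's actual interfaces.

```c
#include <stdint.h>

/* Hypothetical entry types. A middle-directory "entry" simply wraps a
 * global-directory entry, so both occupy the same storage. */
typedef struct { uint64_t val; }   demo_pgd_t;
typedef struct { demo_pgd_t pgd; } demo_pmd_t;

#define DEMO_PTRS_PER_PMD 1    /* middle level collapsed: one entry */
#define DEMO_PMD_SHIFT    23   /* assumed position of the pmd index */

/* Index into the one-entry middle directory. The mask is 0, so the
 * result is always 0 and a compiler can fold the expression away. */
static inline unsigned demo_pmd_index(uint64_t address)
{
    return (unsigned)((address >> DEMO_PMD_SHIFT) & (DEMO_PTRS_PER_PMD - 1));
}

/* "Look up" the middle directory: the global-directory entry itself is
 * treated as a middle directory with a single entry. */
static inline demo_pmd_t *demo_pmd_offset(demo_pgd_t *pgd_entry,
                                          uint64_t address)
{
    return (demo_pmd_t *)pgd_entry + demo_pmd_index(address);
}
```

Code written against a three-level interface can still call demo_pmd_offset between the global and PTE lookups. Because the index is a compile-time constant 0, the call reduces to returning its argument.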

4.3.2

Virtually-mapped linear page tables

As discussed in the previous section, linear page tables are not very practical when implemented in physical memory. However, under certain circumstances it is possible to map a multiway tree into virtual space and make it appear as if it were a linear page table. The trick that makes this possible is to place a self-mapping entry in the global directory. The self-mapping entry, instead of pointing to a middle directory, points to the page frame that contains the global directory. This is illustrated in Figure 4.18 for the case where the global-directory entry with index 7 is used as the self-mapped entry (labeled SELF). Note that the remainder of the global directory continues to be used in the normal fashion, with entries that are either not mapped or that point to a middle directory.

To make it easier to understand how this self-mapping works, let us consider a specific example: Assume that page-table entries are 8 bytes in size, pages are 8 Kbytes in size, and that the directories in the page tree all contain 1024 entries. These assumptions imply that the page offset is 13 bits wide and that the pgd, pmd, and pte indices are all 10 bits wide. The virtual address is thus 43 bits wide. A final assumption we need to make is that the format of the entries in the pgd and pmd directories is identical to the format used in the PTE directories.

[Figure: a global directory with entries numbered 0 through 7; the page-table pointer points to the pgd, the entry labeled SELF (index 7) points back to the pgd itself, and another entry points to a pmd.]

Figure 4.18. Self-mapping entry in the global directory.

Assuming the self-mapping entry has been installed in the global-directory entry with index SELF, we claim that the equations below give the virtual addresses L3(va), L2(va), and L1(va) at which we can find, respectively, the PTE-directory entry, middle-directory entry, and global-directory entry that correspond to virtual address va:

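\[
\begin{aligned}
L3(va) &= \mathit{SELF} \cdot 2^{33} + 8 \cdot \lfloor va / 2^{13} \rfloor \\
L2(va) &= \mathit{SELF} \cdot 2^{33} + \mathit{SELF} \cdot 2^{23} + 8 \cdot \lfloor va / 2^{23} \rfloor \\
L1(va) &= \mathit{SELF} \cdot 2^{33} + \mathit{SELF} \cdot 2^{23} + \mathit{SELF} \cdot 2^{13} + 8 \cdot \lfloor va / 2^{33} \rfloor
\end{aligned}
\]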

For example, if we assume that SELF = 1023 = 0x3ff, we could access the page-table entry for virtual address va = 0x80af3 at virtual address 0x7fe00000200.
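The example can be checked with a direct C transcription of the three address calculations, under the assumptions stated earlier (8-byte entries, 8-Kbyte pages, 1024-entry directories). The function names selfmap_l3, selfmap_l2, and selfmap_l1 are hypothetical.

```c
#include <stdint.h>

/* Assumed layout: 13-bit page offset, three 10-bit indices, 8-byte
 * entries. All functions expect va to be below 2^43. */

/* Virtual address of the PTE-directory entry for va. */
static uint64_t selfmap_l3(uint64_t self, uint64_t va)
{
    return (self << 33) + 8 * (va >> 13);
}

/* Virtual address of the middle-directory entry for va. */
static uint64_t selfmap_l2(uint64_t self, uint64_t va)
{
    return (self << 33) + (self << 23) + 8 * (va >> 23);
}

/* Virtual address of the global-directory entry for va. */
static uint64_t selfmap_l1(uint64_t self, uint64_t va)
{
    return (self << 33) + (self << 23) + (self << 13) + 8 * (va >> 33);
}
```

With self = 0x3ff and va = 0x80af3, the shift va >> 13 gives 0x40, so selfmap_l3 returns 0x3ff · 2^33 + 8 · 0x40 = 0x7fe00000200, matching the address quoted above.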

The effect of the self-mapping entry can be observed most readily when considering how the above equations affect the virtual address. Figure 4.19 illustrates this effect. The first line shows the virtual address va broken up into the three directory indices va_pgd_idx, va_pmd_idx, va_pte_idx and the page offset va_off. Now, if we consider the effect of equation L3(va), we see that the page offset was replaced by 8·va_pte_idx (the factor of 8 comes from the fact that each page-table entry is assumed to be 8 bytes in size). Similarly, the pte index has been replaced with va_pmd_idx, and the pmd index has been replaced with va_pgd_idx. Finally, the pgd index has been replaced with SELF, the index of the self-mapping entry. When we look at the effects of L2(va) and L1(va), we see a pattern emerge: At each level, the previous address is shifted down by 10 bits (the width of an index) and the top 10 bits are filled in with the value of SELF.
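The shift-and-fill pattern can be expressed as a single step that is applied once per level. The following C sketch assumes the same layout as the example (13-bit offset, 10-bit indices, 8-byte entries). The names va_fields, split_va, and next_level are hypothetical, and clearing the low 3 bits is one way to express the multiplication by the entry size.

```c
#include <stdint.h>

#define PAGE_SHIFT 13
#define INDEX_BITS 10
#define PMD_SHIFT  (PAGE_SHIFT + INDEX_BITS)       /* 23 */
#define PGD_SHIFT  (PAGE_SHIFT + 2 * INDEX_BITS)   /* 33 */
#define INDEX_MASK (((uint64_t)1 << INDEX_BITS) - 1)

struct va_fields {
    uint64_t pgd_idx, pmd_idx, pte_idx, offset;
};

/* Break a 43-bit virtual address into its four fields. */
static struct va_fields split_va(uint64_t addr)
{
    struct va_fields f;

    f.pgd_idx = (addr >> PGD_SHIFT) & INDEX_MASK;
    f.pmd_idx = (addr >> PMD_SHIFT) & INDEX_MASK;
    f.pte_idx = (addr >> PAGE_SHIFT) & INDEX_MASK;
    f.offset  = addr & (((uint64_t)1 << PAGE_SHIFT) - 1);
    return f;
}

/* One step of the pattern: shift the address down by one index width,
 * clear the low 3 bits so the result is aligned to an 8-byte entry, and
 * place SELF in the pgd index field. Expects addr below 2^43. */
static uint64_t next_level(uint64_t self, uint64_t addr)
{
    uint64_t shifted = (addr >> INDEX_BITS) & ~(uint64_t)7;

    return (self << PGD_SHIFT) | shifted;
}
```

Applying next_level once to va gives L3(va), twice gives L2(va), and three times gives L1(va). Passing any of these results to split_va shows SELF moving into one more index field at each step.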

              pgd index    pmd index    pte index    offset
    va     =  va_pgd_idx   va_pmd_idx   va_pte_idx   va_off
    L3(va) =  SELF         va_pgd_idx   va_pmd_idx   va_pte_idx 000
    L2(va) =  SELF         SELF         va_pgd_idx   va_pmd_idx 000
    L1(va) =  SELF         SELF         SELF         va_pgd_idx 000

    (Field boundaries lie at bits 33, 23, 13, and 0.)

Figure 4.19. Effect of self-mapping entry on virtual address.

Now, let us take a look at what the operational effect of L1(va) is when used in a normal three-level page-table lookup. We start at the global directory and access the entry at index SELF. This lookup returns a pointer to a middle directory. However, because this is the self-mapping entry, we really get a pointer to the global directory. In other words, the global directory now serves as a fake middle directory. So we use the global directory again to look up the pmd index, which happens to be SELF again. As before, this lookup returns a pointer back to the global directory but this time the directory serves as a fake PTE directory. To complete the three-level lookup, we again use the global directory to look up the pte index, which again contains SELF. The entry at index SELF is now interpreted as a PTE. Because we are dealing with the self-mapping entry, it once again points to the global directory, which now serves as a fake page frame. The physical address that L1(va) corresponds to is thus equal to the address of the global directory plus the page offset of the virtual address, which is 8·va_pgd_idx. Of course, this is exactly the physical address of the global-directory entry that corresp