W
henever two devices with significantly mismatchedspeeds need to work together, the faster device will often end up waiting for the slower device. Depending on how often the system accesses the slower device, the overall throughput of the system can effectively be reduced to that of the slower device. To alle- viate this situation, system designers often incorporate acacheinto a design to reduce the cost of accessing a slow device.
A cache reduces the cost of accessing a device by providing faster access to data that resides on the slow device. To accomplish this, a cache keeps copies of data that exists on a slow device in an area where it is faster to retrieve. A cache works because it can provide data much more quickly than the same data could be retrieved from its real location on the slow device. Put another way, a cache interposes itself between a fast device and a slow device and transparently provides the faster device with the illusion that the slower device is faster than it is.
This chapter is about the issues involved with designing a disk cache, how to decide what to keep in the cache, how to decide when to get rid of something from the cache, and the data structures involved.
8.1
Background
A cache uses some amount of buffer space to hold copies of frequently used data. The buffer space is faster to access than the underlying slow device. The buffer space used by a cache can never hold all the data of the under- lying device. If a cache could hold all of the data of a slower device, the cache would simply replace the slower device. Of course, the larger the buffer
space, the more effective the cache is. The main task of a cache system is the management of the chunks of data in the buffer.
A disk cache uses system memory to hold copies of data that resides on disk. To use the cache, a program requests a disk block, and if the cache has the block already in the cache, the block is simply read from or written to and the disk not accessed. On a read, if a requested block is not in the cache, the cache reads the block from the disk and keeps a copy of the data in the cache as well as fulfilling the request. On a write to a block not in the cache, the cache makes room for the new data, marks it as dirty, and then returns. Dirty data is flushed at a later, more convenient, time (perhaps batching up many writes into a single write).
Managing a cache is primarily a matter of deciding what to keep in the cache and what to kick out of the cache when the cache is full. This man- agement is crucial to the performance of the cache. If useful data is dropped from the cache too quickly, the cache won’t perform as well as it should. If the cache doesn’t drop old data from the cache when appropriate, the useful size and effectiveness of the cache are greatly reduced.
The effectiveness of a disk cache is a measure of how often data requested is found in the cache. If a disk cache can hold 1024 different disk blocks and a program never requests more than 1024 blocks of data, the cache will be 100% effective because once the cache has read in all the blocks, the disk is no longer accessed. At the other end of the spectrum, if a program randomly requests many tens of thousands of different disk blocks, then it is likely that the effectiveness of the cache will approach zero, and every request will have to access the disk. Fortunately, access patterns tend to be of a more regular nature, and the effectiveness of a disk cache is higher.
Beyond the number of blocks that a program may request, the locality of those references also plays a role in the effectiveness of the cache. A program may request many more blocks than are in the cache, but if the addresses of the disk blocks are sequential, then the cache may still prove useful. In other situations the number of disk blocks accessed may be more than the size of the cache, but some amount of those disk blocks may be accessed many more times than the others, and thus the cache will hold the important blocks, reducing the cost of accessing them. Most programs have a high degree of locality of reference, which helps the effectiveness of a disk cache.
8.2
Organization of a Buffer Cache
A disk cache has two main requirements. First, given a disk block number, the cache should be able to quickly return the data associated with that disk block. Second, when the cache is full and new data is requested, the cache must decide what blocks to drop from the cache. These two requirements necessitate two different methods of access to the underlying data. The first task, to efficiently find a block of data given a disk block address, uses the
8 . 2 O R G A N I Z AT I O N O F A B U F F E R C A C H E
129
…
MRU
Hash table (indexed by block number)
LRU
cache _ent
Figure 8-1 A disk block cache data structure showing the hash table and the LRU list.
obvious hash table solution. The second method of access requires an or- ganization that enables quick decisions to be made about which blocks to flush from the cache. There are a few possible implementations to solve this problem, but the most common is a doubly linked list ordered from the most recently used (MRU) block to the least recently used (LRU). A doubly linked list ordered this way is often referred to as anLRU list(the head of which is the MRU end, and the tail is the LRU end). The hash table and LRU list are intimately interwoven, and access to them requires careful coordination.
The cache management we discuss focuses on this dual structure of hash table and LRU list. Instead of an LRU list to decide which block to drop from the cache, we could have used other algorithms, such as random replacement, the working set model, a clock-based algorithm, or variations of the LRU list (such as least frequently used). In designing BFS, it would have been nice to experiment with these other algorithms to determine which performed the best on typical workloads. Unfortunately, time constraints dictated that the cache get implemented, not experimented with, and so little exploration was done of other possible algorithms.
Underlying the hash table and LRU list are the blocks of data that the cache manages. The BeOS device cache manages the blocks of data with a data structure known as acache ent. The cache entstructure maintains a pointer to the block of data, the block number, and the next/previous links for the LRU list. The hash table uses its own structures to index by block number to retrieve a pointer to the associatedcache entstructure.
In Figure 8-1 we illustrate the interrelationship of the hash table and the doubly linked list. We omit the pointers from thecache entstructures to the data blocks for clarity.
Cache Reads
First, we will consider the case where the cache is empty and higher-level code requests a block from the cache. A hash table lookup determines that the block is not present. The cache code must then read the block from disk
Block 1 Block 2 Block 3 Block 4 Block 5 MRU LRU Block cache header Hash table
Figure 8-2 An old block is moved to the head of the list.
and insert it into the hash table. After inserting the block into the hash table, the cache inserts the block at the MRU end of the LRU list. As more blocks are read from disk, the first block that was read will migrate toward the LRU end of the list as other blocks get inserted in front of it.
If our original block is requested again, a probe of the hash table will find it, and the block will be moved to the MRU end of the LRU list because it is now the most recently used block (see Figure 8-2). This is where a cache provides the most benefit: data that is frequently used will be found and retrieved at the speed of a hash table lookup and amemcpy()instead of the cost of a disk seek and disk read, which are orders of magnitude slower.
The cache cannot grow without bound, so at some point the number of blocks managed by the cache will reach a maximum. When the cache is full and new blocks are requested that are not in the cache, a decision must be made about which block to kick out of the cache. The LRU list makes this decision easy. Simply taking the block at the LRU end of the list, we can discard its contents and reuse the block to read in the newly requested block (see Figure 8-3). Throwing away the least recently used block makes sense inherently: if the block hasn’t been used in a long time, it’s not likely to be needed again. Removing the LRU block involves not only deleting it from the LRU list but also deleting the block number from the hash table. After reclaiming the LRU block, the new block is read into memory and put at the MRU end of the LRU list.
8 . 2 O R G A N I Z AT I O N O F A B U F F E R C A C H E
131
Block 2 Block 1
Block 3 Block 4 Block 5 Block 6 MRU
LRU
Block cache header
Hash table
Figure 8-3 Block 1 drops from the cache and block 6 enters.
Cache Writes
There are two scenarios for a write to a cache. The first case is when the block being written to is already in the cache. In this situation the cache canmemcpy() the newly written data over the data that it already has for a particular disk block. The cache must also move the block to the MRU end of the LRU list (i.e., it becomes the most recently used block of data). If a disk block is written to and the disk block is not in the cache, then the cache must make room for the new disk block. Making room in the cache for a newly written disk block that is not in the cache is the same as described previously for a miss on a cache read. Once there is space for the new disk block, the data is copied into the cache buffer for that block, and thecache entis added to the head of the LRU list. If the cache must perform write-through for data integrity reasons, the cache must also write the block to its corresponding disk location.
The second and more common case is that the block is simply marked dirty and the write finishes. At a later time, when the block is flushed from the cache, it will be written to disk because it has been marked dirty. If the system crashes or fails while there is dirty data in the cache, the disk will not be consistent with what was in memory.
Dirty blocks in the cache require a bit more work when flushing the cache. In the situations described previously, only clean blocks were in the cache, and flushing them simply meant reusing their blocks of data to hold new data. When there are dirty blocks, the cache must first write the dirty data to disk before allowing reuse of the associated data block. Proper handling of dirty blocks is important. If for any reason a dirty block is not flushed to disk before being discarded, the cache will lose changes made to the disk, effectively corrupting the disk.
8.3
Cache Optimizations
Flushing the cache when there are dirty blocks presents an interesting oppor- tunity. If the cache always only flushed a single block at a time, it would perform no better at writing to the disk than if it wrote directly through on each write. However, by waiting until the cache is full, the cache can do two things that greatly aid performance. First, the cache can batch multiple changes together. That is, instead of only flushing one block at a time, it is wiser to flush multiple blocks at the same time. Flushing multiple blocks at once amortizes the cost of doing the flush over several blocks, and more im- portantly it enables a second optimization. When flushing multiple blocks, it becomes possible to reorder the disk writes and to write contiguous disk blocks in a single disk write. For example, if higher-level code writes the following block sequence:
971 245 972 246 973 247
when flushing the cache, the sequence can be reorganized into
245 246 247 971 972 973
which allows the cache to perform two disk writes (each for three consec- utive blocks) and one seek, instead of six disk writes and five seeks. The importance of this cannot be overstated. Reorganizing the I/O pattern into an efficient ordering substantially reduces the number of seeks a disk has to make, thereby increasing the overall bandwidth to the disk. Large con- secutive writes outperform sequential single-block writes by factors of 5–10 times, making this optimization extremely important. At a minimum, the cache should sort the list of blocks to be flushed, and if possible, it should coalesce writes to contiguous disk locations.
In a similar manner, when a cache miss occurs and a read of a disk block must be done, if the cache only reads a single block at a time, it would not perform very well. There is a fixed cost associated with doing a disk read, regardless of the size of the read. This fixed cost is very high relative to the amount of time that it takes to transfer one or two disk blocks. Therefore it is better to amortize the cost of doing the disk read over many blocks. The BeOS
8 . 4 I / O A N D T H E C A C H E
133
cache will read 32K on a cache miss. The cost of reading the extra data is insignificant in comparison to the cost of reading a single disk block. Another benefit of this scheme is that it performs read-ahead for the file system. If the file system is good at allocating files contiguously, then the extra data that is read is likely to be data that will soon be needed. Performing read-ahead of 32K also increases the effective disk bandwidth seen by the file system because it is much faster than performing 32 separate 1K reads.
One drawback to performing read-ahead at the cache level is that it is in- herently imperfect. The cache does not know if the extra data read will be useful or not. It is possible to introduce special parameters to the cache API to control read-ahead, but that complicates the API and it is not clear that it would offer significant benefits. If the file system does its job allocating files contiguously, it will interact well with this simple cache policy. In practice, BFS works very well with implicit read-ahead.
In either case, when reading or writing, if the data refers to contiguous disk block addresses, there is another optimization possible. If the cache system has access to a scatter/gather I/O primitive, it can build a scatter/gather table to direct the I/O right to each block in memory. A scatter/gather table is a table of pointer and length pairs. A scatter/gather I/O primitive takes this table and performs the I/O directly to each chunk of memory described in the table. This is important because the blocks of data that the cache wants to perform I/O to are not likely to be contiguous in memory even though they refer to contiguous disk blocks. Using a scatter/gather primitive, the cache can avoid having to copy the data through a contiguous temporary buffer.
Another feature provided by the BeOS cache is to allow modification of data directly in the cache. The cache API allows a file system to request a disk block and to get back a pointer to the data in that block. The cache reads the disk block into its internal buffer and returns a pointer to that buffer. Once a block is requested in this manner, the block is locked in the cache until it is released. BFS uses this feature primarily for i-nodes, which it manipulates directly instead of copying them to another location (which would require twice as much space). When such a block is modified, there is a cache call to mark the block as dirty so that it will be properly written back to disk when it is released. This small tweak to the API of the cache allows BFS to use memory more efficiently.
8.4
I/O and the Cache
One important consideration in the design of a cache is that it should not remain locked while performing I/O. Not locking the cache while perform- ing I/O allows other threads to enter the cache and read or write data that is already in the cache. This approach is known as hit-under-miss and is important in a multithreaded system such as the BeOS.
There are several issues that arise in implementing hit-under-miss. Un- locking the cache before performing I/O allows other threads to enter the cache and read/write to blocks of data. It also means that other threads will manipulate the cache data structures while the I/O takes place. This has the potential to cause great mayhem. To prevent a chaotic situation, before re- leasing the cache lock, any relevant data structures must be marked as busy so that any other threads that enter the cache will not delete them or oth- erwise invalidate them. Data structures marked busy must not be modified