Implementing and Evaluating SCM Algorithms for Rate-Aware Prefetching


KULKARNI, AMIT V. Implementing and Evaluating SCM Algorithms for Rate-Aware Prefetching. (Under the direction of Dr. Xiaosong Ma).

File system prefetching has been widely studied and used to hide the high latency of disk I/O. However, very few algorithms explicitly take the file access rate or burstiness into account when distributing resources, especially the buffer cache.


Amit V. Kulkarni

A thesis submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Master of Science

Computer Science

Raleigh, North Carolina

2009

APPROVED BY:

Dr. Edward Davis Dr. Vincent Freeh


DEDICATION


BIOGRAPHY


ACKNOWLEDGMENTS


TABLE OF CONTENTS

LIST OF FIGURES
1 Introduction
1.1 Inventory Management and Prefetching
1.2 Summary of Contributions
1.3 Outline of Thesis
2 Background of Prefetching and Related Work
2.1 Storage Hierarchy
2.2 Locality of Reference
2.3 File System Caching and Prefetching
2.4 Related Work
2.4.1 Access pattern detection
2.4.2 Prefetching aggressiveness
2.4.3 Allocation of memory
2.5 The Linux prefetching algorithm
3 An SCM Perspective of Prefetching
3.1 Inventory Theory
3.2 Mapping of Concepts
3.2.1 Block request to Demand
3.2.2 Disk access time to Lead time
3.2.3 Prefetched but unused memory to Inventory
3.2.4 Batched prefetching to Batched replenishment
3.3 SCM Algorithms for Prefetching
3.3.1 Equal Time Supply (ETS)
3.3.2 Equal Safety Factor (ESF)
4 Design and Implementation
4.1 Requirements
4.2 Challenges
4.3 Background
4.3.1 Linux page cache
4.3.2 The read system call
4.4 Implementation
4.4.1 Data structure
4.4.2 Calculating the standard deviation
4.4.3 Measuring the lead time
4.4.5 Locking strategy
4.4.6 Handling integer overflow
4.4.7 Miscellaneous implementations
5 Evaluation
5.1 Performance Metrics
5.1.1 Response time
5.1.2 Hit/miss ratio
5.1.3 Prefetched but not accessed pages
5.1.4 Throughput
5.2 Experimental Setup
5.3 Workload
5.4 Results
5.4.1 Two streams varying rate difference
5.4.2 Two streams with varying standard deviation difference
5.4.3 Real Linux applications
5.4.4 Server workload
5.4.5 Performance of ESF
6 Conclusion and future work


LIST OF FIGURES

Figure 1.1 Grocery store and prefetching
Figure 2.1 The Linux prefetch-window and the ahead-window with window pointers
Figure 3.1 The cyclic nature of inventory
Figure 3.2 The variation of prefetch memory with time
Figure 4.1 The call sequence for fetching a page
Figure 5.1 Two streams with increasing rate difference. Average response time.
Figure 5.2 Two streams with increasing rate difference. Average number of misses per prefetch cycle.
Figure 5.3 Two streams with increasing standard deviation difference. Average response time.
Figure 5.4 Two streams with increasing standard deviation difference. Response time for each stream.
Figure 5.5 Throughput of file transfer applications
Figure 5.6 Pure SPC-2 like workload
Figure 5.7 Total throughput of SPC-2 workload with TPCH


Chapter 1

Introduction

Processor speeds have grown (at least until recently) in step with Moore’s Law, doubling roughly every two years. Hard disk capacity has also followed Moore’s law, doubling every 18 months between 1995 and 2005; however, hard disk access times have improved by only about 8% per year [12]. Presently, the clock cycle of a typical server processor is less than 1 ns [15], whereas hard disk access times are in the range of a few milliseconds [29]. There is therefore a difference of several orders of magnitude between processor speed and hard disk speed. Main memory access times, on the other hand, are in the range of a few nanoseconds [28], and L1 and L2 caching in the processor effectively makes main memory work at speeds close to that of the processor. File caching and prefetching have therefore long been used to hide the gap between processor and disk speeds.

There is a large body of existing work on file system prefetching and caching. However, to the best of our knowledge, very few prefetching algorithms directly take the file access rate and burstiness into account. We are interested in two areas of prefetching research: optimization for concurrent file streams, and prefetching in storage systems with multiple levels. In the first area, the problem is to distribute resources such as main memory among concurrent file streams with different access patterns. The problem of multiple storage levels is more interesting. In general, each level of storage can be a storage system in its own right, with its own buffer cache and its own caching and prefetching algorithms. Thus the prefetching and caching decisions made by one level of storage affect many other levels.

Figure 1.1: Grocery store and prefetching

prefetching are similar to those encountered in another field of study, namely Supply Chain Management (SCM). Unlike prefetching, SCM has a long history of more than 50 years of research. It is therefore attractive to apply existing SCM techniques to prefetching and to study their benefits and drawbacks. In the rest of this chapter we make a preliminary case for using SCM algorithms in prefetching, and then describe the aim and organization of this thesis.

1.1 Inventory Management and Prefetching

Fig. 1.1 illustrates a typical inventory management scenario. A typical grocery store stocks goods of different types, such as apples, potatoes, and bananas. Because of economies of scale, grocery stores typically order goods in large quantities and stock them on the shelf or in the warehouse. As customers purchase the goods, the inventory gradually falls to a level where it has to be replenished by ordering from the suppliers. In this example, replace customers with applications, goods with file blocks, inventory with the file system cache, and suppliers with the hard disk, and we see almost exactly what happens in file system caching and prefetching. The mapping is not entirely straightforward, and we discuss it further in chapter 3.


and seek latency of hard-disks [29].

Several problems in inventory management parallel the problems of caching and prefetching. First, the amount of storage space is limited, but it has to be distributed among goods with widely different consumption patterns. For example, in a grocery store, the amount of apples purchased in a day will generally be greater than the amount of an unpopular item, say spinach. Similarly, in storage systems, the throughput of different applications will generally differ: a video player reads a video file slowly, whereas a file search utility such as “grep” reads an entire file rapidly. The challenge is to distribute the storage space fairly among them. The second problem is that of safety inventory, i.e. when to place a replenishment order. If the replenishment order is placed too soon, then, because of the limited storage, we have to place a smaller order and forgo the advantages of the economy of scale. If we place the order too late, we may have to turn away some customers because some goods have run out. Intuitively, we have to order goods with a higher demand rate earlier than goods with a lower demand rate.

This example of a grocery store is very simple; it assumes that the suppliers always have the goods. In reality, the situation can be quite complicated. Generally the suppliers themselves need raw materials from their own suppliers and maintain their own inventory of finished products and raw materials. The suppliers of these suppliers may in turn have other suppliers, and so on. Such supply chains are studied in Supply Chain Management, a field of Operations Research (OR) and Production Engineering. Similarly, in storage systems, we may have multiple levels of storage, each with its own cache used to store data prefetched from the lower levels and data requested by the upper levels.

1.2 Summary of Contributions


1. This thesis identifies the mapping between concepts of SCM and prefetching, and describes the application of SCM techniques to the problems of data prefetching, which to the best of our knowledge, has not been done previously.

2. This thesis also describes in detail the implementation of two SCM algorithms, ETS and ESF, in the Linux kernel to perform dynamic data prefetching.

3. Finally, it describes the extensive tests carried out to evaluate the performance of SCM algorithms for data prefetching. Results indicate that SCM algorithms give a performance improvement of up to 41.8% for a real-world server workload benchmark and up to 33% for a combination of real Linux applications.

1.3 Outline of Thesis


Chapter 2

Background of Prefetching and Related Work

As mentioned before, there is a large body of work on caching and prefetching. In this chapter we give a background of the relevant concepts and an overview of the related work. In section 2.1 we describe the hierarchical storage found in most computing systems. In section 2.2 we briefly explain the concept of locality of reference, and in section 2.3 we illustrate how the locality principle is used in file system caching and prefetching. In section 2.4, we explain various issues in prefetching and how they are handled by different algorithms. We have implemented our SCM-based algorithms in the Linux kernel and extensively compared them with the standard Linux prefetching algorithm. Appreciating our implementation and results requires an understanding of that algorithm, so in section 2.5 we explain the Linux prefetching algorithm in detail.

2.1 Storage Hierarchy


fastest storage. If a program wants to use data stored in the slower levels, it first brings it to the fastest level and then operates on it.

The highest (and the fastest) level in computing systems is the processor cache. The processor cache can itself consist of multiple levels and may also be split into instruction and data caches. However, most modern processors manage their caches automatically, so for most applications and most parts of the Operating System, caches are transparent. Most programs therefore see main memory as the first level in the storage hierarchy. What lies below main memory varies from system to system. For most Personal Computers (PCs), the next level is generally a disk, which is also the last level. In storage systems it is quite common to consider both the caches and main memory as parts of the CPU and to start numbering the storage levels from the level after main memory. We can therefore consider most PCs as having a single level of “external” storage.

However, it is not uncommon for workstations to mount remote storage using protocols such as NFS or CIFS. In such a case we have a two-level storage. It is also possible that the storage exported by the remote servers is not directly attached to them; the remote server may be attached to a Storage Area Network (SAN). In this case we have three-level storage.

As explained in section 1.2 of the Introduction, we restrict ourselves to a single level of storage. However, the general principle of applying the SCM approach and its algorithms to prefetching extends to multi-level prefetching as well.

2.2 Locality of Reference

The basis for all caching and prefetching is the principle of locality of reference. With reference to main memory, Denning defines it as the “observed tendency of programs to cluster references to small subsets of their pages for extended intervals” [8]. This definition encompasses two types of locality: spatial locality and temporal locality. Spatial locality means that the next memory reference of a program will generally be near its last memory reference. Temporal locality means that there is a high probability of a program accessing an already accessed memory location again in the near future. The locality principle is a fundamental principle in computer science and has been used to address many problems [8].


Thus, if a file block (say, B₁) has been accessed by an application, then there is a high probability that the next block to be accessed will be the next file block (B₂), then the block after that (B₃), and so on.

2.3 File System Caching and Prefetching

Unlike main memory access times, disk access times are not uniform. A simple disk model [27] breaks the disk access time into three main parts: the seek time to align the disk head with the proper cylinder, the rotation time to let the required disk sector pass under the disk head, and the actual read time. The first two depend on the current position of the head and the position of the required data on the disk. The third depends only on the amount of data to be read. In general the first two parts of the disk access time are on the order of a few milliseconds [27, 29], so it is extremely inefficient to read small chunks of data from the disk. Because of the large access time and spatial locality, it makes sense for modern Operating Systems to prefetch large amounts of data [2]. Additionally, because of temporal locality, it also makes sense to cache disk blocks in main memory to avoid fetching them repeatedly.
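The cost argument in the disk model above can be made concrete with a small calculation. The following is only a sketch: the seek time, rotational delay and transfer rate are hypothetical round numbers chosen for illustration, not measurements from this thesis or from [27, 29], and the function names are invented here.

```python
def access_time_ms(bytes_to_read, seek_ms=4.0, rotation_ms=4.0,
                   transfer_mb_per_s=60.0):
    """Seek + rotational delay + transfer time, in milliseconds.

    All default values are illustrative assumptions, not measured figures.
    """
    transfer_ms = bytes_to_read / (transfer_mb_per_s * 1_000_000) * 1000
    return seek_ms + rotation_ms + transfer_ms


def effective_throughput_mb_per_s(bytes_to_read):
    """Useful data delivered per second when every request pays the full
    positioning cost (seek and rotation)."""
    seconds = access_time_ms(bytes_to_read) / 1000
    return (bytes_to_read / 1_000_000) / seconds


small = effective_throughput_mb_per_s(4 * 1024)     # one 4 KB page
large = effective_throughput_mb_per_s(128 * 1024)   # a 128 KB batch
print(f"4 KB reads: {small:.2f} MB/s, 128 KB reads: {large:.2f} MB/s")
```

Under these assumed parameters the fixed positioning cost dominates small requests, which is why fetching a large batch per disk access is so much more efficient.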

From the disk model we can also see that the fetch time for sectors close to one another will be less than the fetch time for sectors randomly chosen on the disk. For a very long time, file systems have therefore been designed to organize and store all the file data and meta-data on adjacent sectors [22, 26]. This organization makes prefetching all the more attractive because we do not incur extra seek and rotation time to fetch the prefetch blocks.

2.4 Related Work


2.4.1 Access pattern detection

During prefetching, the first question to answer is “what to prefetch”. The simplest kind of prefetching just fetches the next few blocks in the file; this is called sequential prefetching. Algorithms such as AMP [10] and the standard Linux prefetching algorithm [2] are examples of sequential prefetching. There are also sophisticated algorithms that identify the access pattern of applications and fetch blocks in a non-sequential manner. For example, QuickMine [32] is a multi-level caching technique that uses data mining techniques to identify file access patterns and predict the next block. Various table-driven prefetching algorithms [13, 19] store the history of file accesses in a table and use it to predict the next accesses. Other, less generic approaches use either application hints [25] or knowledge of application internals [20] to predict the next file block. However, these sophisticated block prediction algorithms are rarely used in actual systems. Most systems use simple sequential prefetching because it is the only approach that can give “high long-term predictive accuracy” [10].

2.4.2 Prefetching aggressiveness

Once we know what blocks to prefetch, the next two questions that a prefetching algorithm needs to answer are “when to prefetch” (trigger distance) and “how much to prefetch” (prefetch degree). The trigger distance is the amount of data remaining in the cache when the next prefetch is triggered. The prefetch degree is the amount of data to prefetch once it has been decided to prefetch. If the trigger distance is zero, then the algorithm is called synchronous, i.e. prefetching is triggered only when all the previously prefetched data has been accessed. In synchronous prefetching, we prefetch only on a miss and the application has to wait for the prefetching to complete before it can proceed; hence the name “synchronous”. If the trigger distance is non-zero, then the algorithm is asynchronous, i.e. prefetching is triggered while some previously prefetched data remains, and hence the application does not have to wait for prefetching to complete. The trigger distance and the prefetch degree can each be either fixed or adaptive. Based on the nature of the trigger distance and prefetch degree, Gill and Bathen [10] classify prefetching algorithms into four types:

• Fixed Synchronous (FS)

• Adaptive Synchronous (AS)

• Fixed Asynchronous (FA)

• Adaptive Asynchronous (AA)

Many algorithms control the prefetching aggressiveness by changing the prefetch degree, the trigger distance, or both [10, 18, 25]. AMP [10], for instance, is an Adaptive Asynchronous algorithm that varies both the prefetch degree and the trigger distance in an attempt to reach optimal values for them. If the trigger distance or the prefetch degree or both are too large, then a file will have too many pages in the cache, and a page may be evicted from the cache without ever being accessed. In this case AMP decreases both the trigger distance and the prefetch degree by one. If the trigger distance is too small, then an application will have to stall for prefetching to complete; in this case AMP increases the trigger distance. AMP keeps increasing the prefetch degree on every hit to the last page of the last prefetch. Eventually, the prefetch degree becomes too large, pages are evicted without being accessed, and the prefetch degree is decreased again. Informed Prefetching [25] is an algorithm similar to ours that explicitly measures the request rate of file streams and uses it to control aggressiveness. However, it does not change the trigger distance, and it relies on prefetching hints, provided either explicitly by applications or by some other pattern detection layer, to adjust the prefetch degree.
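The AMP adaptation rules described above can be summarized in a few lines. This is an illustrative reading of the prose, not AMP's published code: the class and function names are hypothetical, and the step size of one for increases and the lower bounds on both values are assumptions (the text only states that decreases are by one).

```python
from dataclasses import dataclass


@dataclass
class StreamState:
    prefetch_degree: int    # pages fetched per batch
    trigger_distance: int   # unaccessed pages left when the next batch starts


def on_unused_eviction(s: StreamState) -> None:
    # A prefetched page was evicted before being read:
    # both knobs are too aggressive, so back off each by one.
    s.prefetch_degree = max(1, s.prefetch_degree - 1)      # assumed floor of 1
    s.trigger_distance = max(0, s.trigger_distance - 1)    # assumed floor of 0


def on_stall(s: StreamState) -> None:
    # The application had to wait for a prefetch in progress:
    # trigger the next batch earlier (assumed step of one).
    s.trigger_distance += 1


def on_last_page_hit(s: StreamState) -> None:
    # Hit on the last page of the last prefetch:
    # the batch was fully used, so fetch more next time (assumed step of one).
    s.prefetch_degree += 1


state = StreamState(prefetch_degree=8, trigger_distance=2)
on_last_page_hit(state)
on_stall(state)
on_unused_eviction(state)
print(state)
```

The three handlers together form the throttling loop: the degree grows until evictions of unused pages push it back down.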

2.4.3 Allocation of memory

Not all file accesses are sequential; some applications, such as database query processing and transaction processing, generate a mix of a large number of sequential and random I/O references [14]. Data brought into the file cache because of non-sequential I/O is called demand-paged data. There is a whole class of prefetching algorithms that focus on the distribution of memory between sequential and demand-paged data [5, 17, 16, 11, 24]. Teng and Gumaer [16] suggest that a fixed portion of the file cache be allocated to demand-paged data. SARC [11] goes one step further and dynamically varies the ratio of memory allocated to sequential and demand-paged data based on the current I/O patterns. Our algorithm is concerned only with sequential file streams, and can be used in conjunction with any of the above algorithms that divide the buffer cache between demand-paged and sequential data.


variant of the commonly used LRU cache replacement algorithm that uses file size and popularity to make replacement decisions. AMP [10] uses file access rate in its theoretical discussion to distribute prefetch memory (by changing trigger distance and prefetch degree). However, the algorithm as such uses throttling to adjust trigger distance and prefetch degree without explicitly considering file access rates.

2.5 The Linux prefetching algorithm

We used the Linux kernel (version 2.6.18) to implement our algorithms and compared them with the standard Linux prefetching algorithm. In this section we give a brief overview of the Linux prefetching algorithm. All 2.6.x Linux kernels implement some form of this algorithm, with variations in some details [2].

Linux uses a simple asynchronous, dynamic prefetching algorithm, so it may be prefetching new pages while some pages from the previous prefetch are still unaccessed. To keep track of these two types of pages (pages of the last prefetch and of the next prefetch), Linux groups the pages into two “windows”. The pages of the last prefetch constitute the prefetch-window, whereas the pages that are currently being prefetched constitute the ahead-window. In normal working conditions, when prefetching is enabled, the ahead-window immediately follows the prefetch-window. Linux maintains three pointers: start, which points to the start of the prefetch-window; ahead_start, which points to the start of the ahead-window; and prev_page, which points to the last page in the prefetch-window that was read by the application. The prefetch-window, the ahead-window and the pointers are shown in Fig. 2.1.

The size of the last prefetch is given by ahead_start - start. The size of the next prefetch (ahead_size) is calculated dynamically from the size of the last prefetch and the number of hits in the prefetch-window. Linux limits the maximum value of ahead_size to 32 pages (128 KB).

Initially, the ahead-window is disabled, and the prefetch-window size is set to a number between 4 and 32 pages depending on the application request size. If Linux detects a hit in the prefetch-window, it initializes the ahead-window and starts a batch prefetch. Depending on the size of the prefetch-window, the size of the initial prefetch is either twice or four times the size of the prefetch-window. When the application reads the last page of the prefetch-window, Linux starts a new batch of prefetch and makes the ahead-window the current prefetch-window.
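The window bookkeeping described above can be replayed with a small simulation. This is a sketch of the prose, not kernel code: it assumes a 4-page initial prefetch-window, a first ahead-window twice that size, and a doubling of the ahead-window on every later batch up to the 32-page cap. The real kernel logic has more cases (for example the four-times rule and the handling of misses), and the function name is hypothetical.

```python
MAX_AHEAD_PAGES = 32  # cap stated in the text (128 KB with 4 KB pages)


def replay_windows(initial_window=4, batches=4):
    """Return (start, ahead_start, ahead_size) after each batch prefetch."""
    start = 0
    ahead_start = initial_window
    # Assumption: the first ahead-window is twice the prefetch-window.
    ahead_size = min(2 * initial_window, MAX_AHEAD_PAGES)
    history = [(start, ahead_start, ahead_size)]
    for _ in range(batches - 1):
        # The application reached the end of the prefetch-window, so the
        # ahead-window becomes the prefetch-window and a new ahead-window
        # is placed immediately after it.
        start = ahead_start
        ahead_start = start + ahead_size
        # Assumption: the ahead-window doubles each time, up to the cap.
        ahead_size = min(2 * ahead_size, MAX_AHEAD_PAGES)
        history.append((start, ahead_start, ahead_size))
    return history


for start, ahead_start, ahead_size in replay_windows():
    print(f"start={start} ahead_start={ahead_start} ahead_size={ahead_size}")
```

Under these assumptions the ahead-window reaches the cap on the third batch and stays there, while both window pointers keep advancing through the file.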


Figure 2.1: The Linux prefetch-window and the ahead-window with window pointers

size of prefetch-window to 4 pages and fetches 4 pages. The size of ahead-window is zero. After the first hit, Linux initializes the ahead-window and starts the first prefetch with ahead_size = 8 pages. The values of all the pointers at this stage are:

start = 0, ahead_start = 4, ahead_size = 8, prev = read request size

After the application reads the 4th page, Linux starts a new batch of prefetch of size 16 and makes the ahead-window the current prefetch-window. The values of the various pointers after the start of the 2nd prefetch are:

start = 4, ahead_start = 12, ahead_size = 16, prev = 5

Note that the size of the ahead-window was aggressively increased from 8 to 16. In the next prefetch, it will be doubled to the maximum allowed value of 32 pages. After the 3rd and 4th prefetch, the values of the pointers will be:


Chapter 3

An SCM Perspective of Prefetching

Chopra et al. define a supply chain as “all parties involved, directly or indirectly, in fulfilling a customer request” [6]. Supply chain management is, in short, the management of this supply chain. SCM encompasses many aspects of production engineering and management, such as inventory management, production planning and scheduling, and “vertical integration” [30]. In chapter 1 we gave a brief overview of how prefetching is similar to SCM. In this chapter, we provide a more careful mapping of concepts between prefetching and SCM. We then describe two algorithms that are popular in SCM and seem promising for prefetching.

3.1 Inventory Theory

The inventory level of an item is cyclical in nature. A store generally orders items in large batches. Immediately after a batch arrives, the stock of the item is at its maximum. As customers purchase the item, its stock gradually falls. Finally, when the stock becomes sufficiently small, the store orders the item again (called a replenishment order), and when the items arrive, the stock jumps back to its maximum. Fig. 3.1 shows the variation of the inventory level of an item with respect to time.


Figure 3.1: The cyclic nature of inventory

the reordering completes. With this level we only have enough items to satisfy normal customer demand; however, if the demand or lead time is highly variable, then to be on the safe side we set the inventory level at the reordering point to $RL + \delta$, where $\delta$ is called the safety inventory. Safety inventory is the part of the inventory at the reordering point that is kept to absorb a burst in demand or a variation in the lead time.

3.2 Mapping of Concepts

Fig. 3.2 shows the variation of prefetched but not accessed data with time in the file system cache in the absence of eviction. By definition, the trigger distance is the amount of data in the cache remaining unaccessed when we start a new prefetch. This maps directly to the reordering point. The prefetch degree is analogous to the replenishment order quantity. As mentioned before, the inventory at the reordering point consists of two parts: the cycle inventory and the safety inventory. Analogously, we denote those two parts of the trigger distance by $T^C$ and $T^S$ respectively. In the following subsections we give a more detailed mapping of the


Figure 3.2: The variation of prefetch memory with time

3.2.1 Block request to Demand

It is easy to see that file block requests in prefetching map directly to customer demand in SCM. However, there is one subtle difference. In SCM, all items of the same type in the inventory are identical, so the type and quantity of an item are enough to describe the demand. In file systems, each block is different, and a demand specifies not only a particular file and a number of bytes, but also which bytes to fetch. However, if we can correctly predict the next file blocks required by the application, then by prefetching them we ensure that a required block is always in the cache. If we correctly prefetch the blocks, then returning the next blocks is as good as returning any random block. For sequentially accessed files, predicting the next blocks is easy. Therefore, we are not particularly concerned that SCM requires all items of a type to be the same.

3.2.2 Disk access time to Lead time


assume that the disk access time is constant and is equal to the average disk access time. More accurately modeling disk access time and using it in prefetching algorithms can be interesting future work.

3.2.3 Prefetched but unused memory to Inventory

It may seem that the file system cache can be mapped directly to the inventory. However, items are consumed from the inventory, whereas blocks are not removed from the cache when they are accessed; file blocks are removed only on eviction. It is therefore more accurate to map the prefetched but unused portion of the file system cache to the inventory. After a block is accessed, it is no longer part of the prefetched but unaccessed memory.

3.2.4 Batched prefetching to Batched replenishment

As observed in chapter 1, for both SCM and file system prefetching, it is more economical to fetch a large number of items or blocks at once; therefore, there is a direct mapping between batched prefetching and batched replenishment.

3.3 SCM Algorithms for Prefetching

The problem that inventory management in SCM tries to solve is to accurately predict the reorder point and the order quantity. These quantities correspond to the trigger distance and the prefetch degree respectively in prefetching. SCM has a large body of theory dedicated to finding both the order quantity and the reordering point. However, in our work we vary only the trigger distance and keep the prefetch degree fixed. Varying both the trigger distance and the prefetch degree seems an attractive option and could form interesting future work.


3.3.1 Equal Time Supply (ETS)

In the first algorithm, called Equal Time Supply, the total safety inventory is divided such that each item has an equal time’s worth of supply, i.e. the individual safety inventories are set proportional to the items’ demand rates. Though this algorithm seems overly simplistic, a large US consulting firm estimates that 80%–90% of its clients use this method [30].

The analogous idea in prefetching is to set the value of $T_i^S$ of the $i$th stream proportional to the application request rate $R_i$.

$$\exists\, C_1 > 0 \text{ such that } \forall i,\quad \frac{T_i^S}{R_i} = C_1$$

$$T_i^S = C_1 R_i$$

As argued in section 3.1, the cycle inventory at reorder point is equal to the expected demand in the lead time.

$$T_i^C = R_i L_i$$

Here $L_i$ is the time to fetch a batch of prefetch pages of file stream $i$. However, if we fix the prefetch degree of all the file streams to the same value and assume that the lead time for each batch is the same and equal to the average lead time $L$, then we have

$$T_i^C = R_i L$$

Therefore, the total trigger distance for the $i$th file stream ($T_i$) is given by

$$T_i = T_i^C + T_i^S = (C_1 + L) R_i = C R_i$$

If $T_{total}$ is the total amount of trigger distance that we want to use in the system, then we have to choose the value of the constant $C$ such that the sum of all the trigger distances is equal to $T_{total}$. If $T_A$ is the average trigger distance of all the streams in the system and the total number of file streams is $n$, then we have

$$T_{total} = \sum_{i=1}^{n} T_i = n T_A$$

$$\therefore\ C \sum_{i=1}^{n} R_i = n T_A$$

$$\therefore\ C = \frac{n T_A}{\sum_{i=1}^{n} R_i}$$

$$\therefore\ T_i = \frac{R_i}{\sum_{j=1}^{n} R_j}\, n T_A$$
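The final ETS expression divides a fixed budget of $n T_A$ among the streams in proportion to their request rates. A minimal sketch of that calculation follows; the function name is hypothetical, floating point is used for readability (the kernel implementation described later has to use integer arithmetic), and returning zeros when no stream has a measurable rate is an assumption rather than something the derivation specifies.

```python
def ets_trigger_distances(rates, avg_trigger_distance):
    """Split a total trigger-distance budget of n * T_A in proportion
    to each stream's request rate R_i (Equal Time Supply)."""
    n = len(rates)
    total_rate = sum(rates)
    if total_rate == 0:
        # Assumption: with no measurable demand, allocate nothing.
        return [0.0] * n
    budget = n * avg_trigger_distance          # T_total = n * T_A
    return [budget * r / total_rate for r in rates]


# Two streams, one reading three times as fast as the other,
# with an average trigger distance of 8 pages.
print(ets_trigger_distances([10, 30], avg_trigger_distance=8))
```

The faster stream receives three quarters of the 16-page budget, and the allocations always sum to $n T_A$.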

3.3.2 Equal Safety Factor (ESF)

Although the safety inventory is supposed to protect against variation in demand, ETS does not use any measure of variability in calculating the safety inventory level. ESF uses the standard deviation of each item’s demand as a measure of its demand uncertainty and distributes the total safety inventory in proportion to the standard deviation.

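Suppose that the standard deviation of the application request rate of a file stream is $\sigma_i$; then according to ESF

$$\exists\, C_2 > 0 \text{ such that } \forall i,\quad \frac{T_i^S}{\sigma_i} = C_2$$

$$\sum_{i=1}^{n} T_i = \sum_{i=1}^{n} T_i^C + \sum_{i=1}^{n} T_i^S = T_{total}$$

$$\therefore\ C_2 \sum_{i=1}^{n} \sigma_i = T_{total} - \sum_{i=1}^{n} T_i^C$$

$$\therefore\ C_2 = \frac{T_{total} - \sum_{i=1}^{n} T_i^C}{\sum_{i=1}^{n} \sigma_i}$$

$$\therefore\ T_i^S = \frac{\sigma_i}{\sum_{j=1}^{n} \sigma_j} \left( T_{total} - \sum_{j=1}^{n} T_j^C \right)$$

Therefore, finding the trigger distance consists of two stages. First we calculate $T_i^C = R_i L$ for each stream, and then we distribute the remaining total trigger distance in proportion to the standard deviations $\sigma_i$.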
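The two-stage ESF calculation can be sketched in the same way. The function name is hypothetical and floating point is used for readability. Two behaviours are assumptions that the derivation does not cover: the sketch raises an error when the cycle inventories alone exceed the total budget, and it assigns no safety inventory when every standard deviation is zero.

```python
def esf_trigger_distances(rates, sigmas, lead_time, total_trigger_distance):
    """Equal Safety Factor: T_i = R_i * L + share of the remaining budget,
    where the share is proportional to sigma_i."""
    # Stage 1: cycle inventory, the expected demand during the lead time.
    cycle = [r * lead_time for r in rates]
    safety_budget = total_trigger_distance - sum(cycle)
    if safety_budget < 0:
        # Assumption: treat an over-committed budget as an error.
        raise ValueError("total trigger distance is smaller than cycle demand")
    # Stage 2: split what is left in proportion to the standard deviations.
    total_sigma = sum(sigmas)
    if total_sigma == 0:
        # Assumption: perfectly steady streams get no safety inventory.
        safety = [0.0] * len(rates)
    else:
        safety = [safety_budget * s / total_sigma for s in sigmas]
    return [c + s for c, s in zip(cycle, safety)]


# Two streams with rates 2 and 4 pages per time unit, a lead time of
# 2 time units, and a total budget of 20 pages.
print(esf_trigger_distances([2, 4], sigmas=[1, 3], lead_time=2,
                            total_trigger_distance=20))
```

Here the cycle inventories use 12 of the 20 pages, and the remaining 8 pages are split 1:3 between the streams according to their variability.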


Chapter 4

Design and Implementation

We have implemented both the ETS and ESF algorithms in the standard Linux kernel (version 2.6.18). In doing so, we wrote and debugged 1226 lines of new code and modified parts of the Linux kernel that directly affect another 4000 lines of code. ETS and ESF are fairly straightforward algorithms; the significant challenges lie in measuring all the metrics the algorithms require and in implementing the algorithms in a distributed manner, so that each stream calculates its own trigger distance without a central manager.

In the following section we first present the requirements and the challenges in fulfilling them. The rest of the sections are divided into two parts: in the first part we give relevant background information about the Linux page cache and the read system call, and, in the second part we describe how we modify these to implement ESF and ETS.

4.1 Requirements

1. Lightweight – This is the most important requirement. Whatever we do, we must not add significant overhead to the read system call; otherwise, any gains made by the algorithms will be offset by their overhead.


and hence require locks for access.

3. The algorithm should work with all file reading system calls (read, readv, aio_read and mmap). We discuss this in section 4.3.2.

4. We should be able to change the prefetching algorithm at run time without having to restart the machine. This requirement is mainly to aid testing. We discuss this in section 4.4.7.

5. Profiling – To verify the algorithm and to measure its performance, we need to collect and expose some statistics such as hit-ratio or average throughput or latency from the kernel to the user level. We shall discuss this in Chapter 5.

4.2 Challenges

1. The Linux kernel does not support floating point arithmetic in kernel code; therefore, all calculations of the mean and standard deviation of requests must use integer arithmetic. However, the Linux kernel does support 64-bit arithmetic on all architectures, and we rely on it heavily.

2. Due to the lack of floating point support, the Linux kernel also does not have the square root function required to find the standard deviation.

3. File sizes are large numbers. Therefore, we have to be careful while calculating trigger distance to avoid integer overflow.

4.3 Background

4.3.1 Linux page cache


Each page is represented in the Linux kernel by a struct page. The Linux page cache is organized as a collection of 64-ary radix trees of struct page, one for each file. The cached pages are organized as a tree rather than as a simple linked list to make it easier to search for a particular page in the page cache. All the pages in the file are stored as leaves; therefore, a radix tree of height $h$ can store $64^{h} - 1 = 2^{6h} - 1$ pages.

Since Linux also treats directories and other special files as files, the data used by special files is also present in the page cache. However, our algorithms and implementation are designed for, and work only with, regular file data fetched from directly attached storage.

Disk I/O can be done only in terms of blocks, which are generally 512 bytes in size, but the page cache deals only in terms of pages of data. Therefore, when an application asks for, say, the first 100 bytes of a file, Linux first fetches at least the first page from the disk and stores it in the page cache. Then, it copies the required bytes to the application buffer. When an application modifies a page, Linux just modifies the page in the page cache and marks the page as “dirty”; it does not immediately write it to disk. Dirty pages are written to the disk only when they are evicted from the page cache or when an application explicitly asks for the dirty data to be written (via the fsync or close system call).

4.3.2 The read system call

As mentioned before, prefetching is part of the read data path of Linux and we have to modify this part of the Linux kernel to implement our prefetching algorithms. In this section we provide a brief overview of how file reading is implemented in Linux and what roles the page cache and prefetching perform.

When an application makes a system call xxx, a software interrupt is raised by the system call library to transfer control to the Linux kernel. The kernel then calls the corresponding sys_xxx function. All the system calls for files are implemented in various files in the “fs” directory of the Linux source code. For example, when an application invokes the open system call, the corresponding sys_open function is invoked in the kernel. Similarly, when application calls a


applications to map files into memory and read and write to them. The Linux kernel, however, handles all these system calls uniformly. It breaks the functions into two parts: the first part handles the blocking or mapping semantics, and the second part performs the actual file reading (called generic file read). Prefetching is implemented in the second part, and hence any change to the policy affects all the system calls that read files.

The top-level function of generic file reading is generic_file_aio_read, which eventually calls do_generic_mapping_read, the function that implements the actual read routine. Because this function deals with many different types of lower-layer devices that can fail in many ways, and because all the underlying functions are asynchronous, it is a very complex function (about 300 lines). The simplified pseudo-code for this function is given below.

do_generic_mapping_read(file, offset, nbytes, user_buffer) {
    for each page to be read {
        page_index = get_page_index(next_byte_to_read);

        /* Ask for this page and any read ahead pages to be fetched
         * and inserted in page cache. This function will not block. */
        page_cache_readahead(file, page_index);

        /* If prefetching was successful, this page should be in the cache,
         * get a reference to it. Also lock the page if found in page cache */
        page = find_get_page();
        if (page not ready) {
            /* wait for page to be ready. If necessary, also fetch the page again */
        }

        copy_to_user(user_buffer, page->data);

        page_cache_release(page); /* Page can now be evicted if required */
    }
}


All the read ahead logic in Linux is embedded in the function page_cache_readahead. The simplified pseudo-code for the function is given below.

page_cache_readahead(file, page_number) {
    ra = file->ra_state; /* get the read ahead state */
    if (page_number == ra->prev_page)
        return; /* reading more bytes from the previous page */

    if (!is_sequential(file)) {
        disable_readahead_window(file);
        return;
    } else if (first_read(file)) {
        create_prefetch_window(file);
    }

    ra->prev_page = page_number;

    trigger_dist = get_trigger_distance(); /* returns 32 */
    prefetch_degree = get_prefetch_degree();

    if (ra->prev_page >= ra->ahead_start + ra->ahead_size - trigger_dist) {
        /* Reached trigger distance. Start a new
         * batch of prefetched pages. */
        make_ahead_window(ra);

        /* Make the ahead window the prefetch window
         * and create a new ahead window */
        ra->start = ra->prev_page;
        ra->size = ra->ahead_start + ra->ahead_size - ra->start;
        ra->ahead_start = ra->start + ra->size;
        ra->ahead_size = prefetch_degree;
    }
}


in chapter 2. In the above algorithm, both the prefetch degree and the trigger distance are fixed to 32 because any sequential stream that reads more than a few kilobytes of data will eventually settle down with a trigger distance and prefetch degree of 32 pages. As in the previous pseudo-code, we omit many details that are necessary for implementation but cloud the algorithm. Another important point is that when Linux schedules a new batch of prefetch pages, it first allocates space for them in the page cache but marks the pages as not ready.

4.4 Implementation

To implement both ETS and ESF we first have to determine the total trigger distance $T_{total}$ that we are ready to allocate. An alternative is to fix the average trigger distance and let the total trigger distance vary with the number of file streams. We choose the second approach, with an average trigger distance of 24 pages. We also fix the prefetch degree to 48 pages. We use these two numbers because they ensure that, on average, our algorithms use the same amount of prefetched but unused memory as the standard Linux algorithm.

If the average trigger distance is $T$ and the prefetch degree is $P$, then the memory allocated for prefetch pages in the Linux page cache varies from $T$ to $T + P$. Therefore, the average memory of the prefetched but unused pages is $(T + (T + P))/2 = T + P/2$. For the standard Linux algorithm, $T = P = 32$; therefore, the average prefetch memory usage is 48 pages. Similarly, for ETS and ESF, $T = 24$ and $P = 48$; therefore, the average memory usage here is also 48 pages. To prevent the trigger distances from becoming very large and swamping the disk with many prefetch pages, we restrict the maximum trigger distance to the prefetch degree (48 pages).
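The arithmetic above can be checked with a few lines of C. This is only an illustrative sketch: the helper name avg_prefetch_pages is hypothetical and does not appear in the thesis code, and it assumes an even prefetch degree so that no half page is lost to integer division.

```c
#include <assert.h>

/* Hypothetical helper (not part of the thesis code): average number of
 * prefetched-but-unused pages for trigger distance T and prefetch
 * degree P, i.e. the midpoint of the range [T, T + P]. */
static unsigned int avg_prefetch_pages(unsigned int trigger_dist,
                                       unsigned int prefetch_degree)
{
    /* Assumption: P is even, so (2T + P) / 2 is exact. */
    assert(prefetch_degree % 2 == 0);
    return (2 * trigger_dist + prefetch_degree) / 2;
}
```

Both configurations discussed in the text, (T, P) = (32, 32) and (T, P) = (24, 48), give 48 pages with this helper.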

The implementation of the actual algorithms is simple. To make the algorithms work, we also have to modify the function page_cache_readahead to use the functions described in section 4.4.4 to get the trigger distance. The function pref_stats_get_trigger_dist automatically calculates the correct trigger distance based on the algorithm in use.

4.4.1 Data structure


rate, (2) Standard Deviation of the request sizes, (3) Lead time (to fetch pages from disk). Apart from the per-stream request rate and standard deviation, we also need the aggregate request rate and standard deviation. For every open file, Linux maintains a data structure of type struct file to store file information such as the file offset and open flags (O_RDONLY etc.). It also stores references to other kernel objects such as inodes and address mappings. We define our own structure to store all the per-file metrics we need and insert a reference to it in the file structure. Our data structure is shown below.

struct pref_stats {
    u64 open_time;               /* Open time stamp */

    u64 total_read;
    u64 total_diff;
    u64 total_returned;

    u64 win_start_time;          /* Start time of the current window */
    u32 win_nr_read;             /* Number of bytes read in the current window */
    struct pref_stats_queue q;   /* Queue of bytes read in last 16 windows */

    u64 lead_time;               /* Total lead time measured for I/Os */
    u64 nr_ios;                  /* Number of I/Os */

    u32 trigger_dist;            /* Cached trigger distance */

    struct list_head list;       /* Used to chain list of all streams */
    spinlock_t lock;             /* Lock to protect this entire structure */

    /* .... Other profiling fields .... */
};


start of the current measurement window. The field win_nr_read records the number of bytes read in the current window. After the completion of every read request, we add the number of bytes requested to win_nr_read. The field “q” is a queue of the number of bytes read in the last 16 windows, which is used to calculate the rolling mean and standard deviation. The structure pref_stats_queue also stores the mean and deviation (not exactly the standard deviation; more about this in section 4.4.2) of the numbers already in the queue. After the completion of every window, we add the current value of win_nr_read to the head of the queue, remove the oldest value, and update the mean and deviation.

To calculate the aggregate request rate and deviation, we keep a list of streams that are sequential; the field list is used to insert this structure in the global list of all sequential files. The spin lock lock is used to protect all the fields of this data structure.

4.4.2 Calculating the standard deviation

ESF uses the standard deviation ($\sigma$) defined by

$$\sigma = \sqrt{\frac{\sum_{i=1}^{N} (x_i - \mu)^2}{N}}$$

for $N$ numbers $x_1 \ldots x_N$ with a mean of $\mu$.

In other words, the standard deviation is the root mean square (RMS) of the differences of the values $x_i$ from the mean $\mu$. The purpose of this definition is to measure the dispersion around the mean. We use the RMS instead of the arithmetic mean because $x_i - \mu$ can be both negative and positive, so the differences sum to zero and their arithmetic mean is always zero. However, we cannot use this definition to find the standard deviation in the Linux kernel because, as mentioned before, the Linux kernel does not support floating point arithmetic and has no straightforward way to calculate a square root.

We side-step the issue of calculating the square root by using the Mean Absolute Deviation (MAD) instead of the standard deviation. The MAD $\omega$ is defined as the arithmetic mean of the absolute values of the differences:

$$\omega = \frac{\sum_{i=1}^{N} |x_i - \mu|}{N}$$


$$\omega = \sigma \sqrt{\frac{2}{\pi}} \approx 0.8\sigma$$

We claim that using the arithmetic mean of the deviations instead of their RMS does not lead to a significant difference in ESF. First, we notice that in ESF we are only concerned with the ratio $\sigma_j / \sum_{i=1}^{N} \sigma_i$. If $\omega$ and $\sigma$ differ only by a constant factor (as in the normal distribution), then the ratio used in ESF has the same value whether we use the standard deviation or the MAD. In general, the requests may not be normally distributed, and the two quantities may not be so nicely related. Intuitively, however, both are just measures of uncertainty, and both fit our argument of distributing the trigger distance in ESF in proportion to the uncertainty.
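As an illustration of why the MAD suits the kernel environment, the following user-space sketch computes the mean and the MAD of a window of samples using only integer arithmetic. The function names window_mean and window_mad are hypothetical, the truncating divisions are a simplification, and in kernel code on a 32-bit architecture the 64-bit divisions would have to go through do_div (see section 4.4.6).

```c
#include <assert.h>
#include <stdint.h>

/* Integer mean of n samples (truncating division). */
static uint64_t window_mean(const uint32_t *x, unsigned int n)
{
    uint64_t sum = 0;

    assert(n > 0);
    for (unsigned int i = 0; i < n; i++)
        sum += x[i];
    return sum / n;
}

/* Mean absolute deviation around the integer mean.
 * Needs neither floating point nor a square root. */
static uint64_t window_mad(const uint32_t *x, unsigned int n)
{
    uint64_t mean = window_mean(x, n);
    uint64_t sum = 0;

    for (unsigned int i = 0; i < n; i++)
        sum += (x[i] > mean) ? x[i] - mean : mean - x[i];
    return sum / n;
}
```

For the samples 2, 4, 6, 8 the mean is 5 and the absolute differences are 3, 1, 1, 3, so the MAD is 2.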

4.4.3 Measuring the lead time

The lead time (access time) is the amount of time required to fetch a batch of pages from the disk. The lead time obviously depends on the number of pages fetched. Therefore, we define the average lead time as the average time required to fetch an average-sized batch. To measure the lead time, we need to keep track of the time when a batch is requested to be fetched and the time when the pages are actually ready with the data. To do so, we augment the page structure with two extra fields: io_start_time, which stores the time stamp at which the I/O was scheduled, and struct file *filp, which stores the pointer to the file structure of the file stream that issued the request for that page.

In Linux, the high-level page fetching function is mpage_readpages, which is asynchronous (i.e. it only issues the fetch request but does not wait for the blocks to be available). When a page is fetched, the callback function mpage_end_io_read is called for every page. The call sequence of the functions is given in Fig. 4.1.

When we schedule a page to be fetched (in do_page_cache_readahead), we store the current time in the io_start_time field of the page structure. When the page becomes ready (in mpage_end_io_read), we find the lead time by finding the difference between io_start_time


Figure 4.1: The call sequence for fetching a page
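The lead time bookkeeping described in this section can be sketched in user-space C as follows. The structure lead_time_stats and both function names are hypothetical; they only mirror the lead_time and nr_ios fields of struct pref_stats. The sketch assumes that both time stamps come from the same clock and that one sample is recorded per completed I/O, which may differ from how the actual implementation aggregates pages into batches.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical accumulator mirroring the lead_time and nr_ios
 * fields of struct pref_stats. */
struct lead_time_stats {
    uint64_t lead_time; /* total lead time over all completed I/Os */
    uint64_t nr_ios;    /* number of completed I/Os */
};

/* Record one completion: 'start' is when the I/O was scheduled,
 * 'ready' is when the data became available (same clock and unit). */
static void record_lead_time(struct lead_time_stats *s,
                             uint64_t start, uint64_t ready)
{
    assert(ready >= start);
    s->lead_time += ready - start;
    s->nr_ios++;
}

/* Average lead time; 0 if nothing has completed yet. */
static uint64_t avg_lead_time(const struct lead_time_stats *s)
{
    return s->nr_ios ? s->lead_time / s->nr_ios : 0;
}
```

Keeping only a running total and a count means the completion path performs two additions, which matters because that path may run in interrupt context.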

4.4.4 Main API

Although we have made many ad-hoc changes to the kernel for profiling, we only need three functions to use the algorithm.

1. update_pref_stats(struct pref_stats *p, int nbytes) : Called directly from the function generic_file_aio_read to record the number of bytes requested by a variant of the read system call. This function adds the byte count to the number of bytes read in the current window and, if the duration of the current window has elapsed, adds the current window byte count to the queue and starts a new window.

2. pref_stats_get_trigger_dist(struct file *f) : Calculates and returns the trigger distance according to the current prefetching algorithm (see section 4.4.7 for details on selecting a prefetching algorithm). This function is used in the kernel only while creating the first prefetch window; at other times, the cached trigger distance in the structure pref_stats is used.


trigger distance with the new value of the trigger distance. Calculating the trigger distance using ETS or ESF is a very costly process and has to be avoided as much as possible. This function is called every time after creating a new prefetch/ahead window. After a prefetch window is created, the kernel uses the cached trigger distance until we move to the next prefetch window.
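The windowing behaviour of update_pref_stats can be illustrated with the following user-space sketch. Everything here is an assumption made for illustration: the structure win_stats, the function update_win_stats, the plain array used as a circular queue, and the 8 ms window length (inferred from the throughput example in section 4.4.6). Locking and the update of the rolling mean and deviation are left out, and idle windows are not zero-filled.

```c
#include <assert.h>
#include <stdint.h>

#define NR_WINDOWS 16          /* queue length, from the text */
#define WINDOW_NS  8000000ULL  /* assumed window length: 8 ms */

/* Hypothetical, reduced stand-in for the window fields of pref_stats. */
struct win_stats {
    uint64_t win_start_time;   /* start of the current window (ns) */
    uint32_t win_nr_read;      /* bytes read in the current window */
    uint32_t q[NR_WINDOWS];    /* bytes read in the last 16 windows */
    unsigned int head;         /* next slot to overwrite (oldest) */
};

/* Add nbytes to the current window; if the window has elapsed,
 * push its byte count onto the queue and start a new window. */
static void update_win_stats(struct win_stats *s, uint64_t now,
                             uint32_t nbytes)
{
    assert(now >= s->win_start_time);
    s->win_nr_read += nbytes;
    if (now - s->win_start_time >= WINDOW_NS) {
        s->q[s->head] = s->win_nr_read;       /* replaces oldest entry */
        s->head = (s->head + 1) % NR_WINDOWS;
        s->win_nr_read = 0;
        s->win_start_time = now;
    }
}
```

A fixed-size circular array keeps the per-read cost constant, in line with the lightweight requirement of section 4.1.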

4.4.5 Locking strategy

Locking is necessary to maintain the consistency of shared data and avoid race conditions. The data structure defined in section 4.4.1 has a spin lock to protect the per-stream data structure itself. It is required because almost all the fields of the structure pref_stats are prone to updates from concurrent contexts. The fields used to measure the request rate and standard deviation can be modified by read system calls issued by multiple threads on the same file. The fields used to measure the lead time are updated in the callback function mpage_end_io_read, as mentioned in section 4.4.3. That function, however, may be invoked from an interrupt context in Linux, and it is not guaranteed that only one page at a time will try to update the lead time in the per-stream data structure. Note that we make only very small changes to the data structure (updating a field or two) once we lock it; therefore, a spin lock is a very good choice to protect the per-stream data structure.

The only other shared data in our code is a global list of all the streams that are sequential. A stream is added to this list as soon as we discover that it is sequential and removed from the list either when it is detected to be not sequential or when the file stream is closed. This list provides a convenient way to iterate through all the file streams and find the total request rate or deviation. An important property of this shared data is that it is read far more often than it is written. It is modified only when a stream is added, closed, or detected to be not sequential, all of which are rare. Therefore we use a read-write lock (rwlock_t in Linux) to protect this list.


4.4.6 Handling integer overflow

To avoid loss of precision during integer division, we store all quantities such as win_nr_read in bytes rather than pages. Some of the fields in the structure pref_stats are 64 bits wide to accommodate the largest file size possible in Linux. Many of the other fields are just 32 bits wide because we do not expect them to store very large numbers. For example, the field win_nr_read stores the number of bytes read in the current measurement window. Theoretically this field could exceed 4G, but reading 4 GB in 8 ms would require a throughput of 500 GB/sec, which is far beyond the capacity of modern storage devices [29].

Although we have avoided overflow problems in storage of per-stream metrics, we do face integer overflow problems during the calculation of trigger distance in ETS or ESF. For example, the trigger distance in ETS can be simply calculated by

trigger_dist = (DEFAULT_TRIGGER_DIST * num_streams) * rate / total_rate;

If the default trigger distance is 24 pages (24 * 4096 bytes), the number of streams is 20, and the request rate of a stream is 1 page (4096 bytes) per measurement window, then the numerator on the right hand side of the assignment has a value of 7.5G, close to twice what can be represented in a 32-bit unsigned integer. One solution is to avoid using bytes in the calculations and simply use the unit of pages; however, this leads to loss of precision because we only have integers to work with. Our solution is to use 64-bit arithmetic in such situations. The above code can be written as

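/* Widen to 64 bits before multiplying so the product cannot wrap */
u64 numerator = ((u64)DEFAULT_TRIGGER_DIST * num_streams) * rate;
u64 denominator = total_rate;

/* do_div() divides numerator in place and returns the remainder */
do_div(numerator, denominator);
trigger_dist = (u32) numerator;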
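The effect of the overflow can be reproduced in user space with the numbers from the example above. The two function names below are hypothetical, and the sketch uses the ordinary C division operator where kernel code would use do_div. It assumes a platform where unsigned int is 32 bits wide, so that the 32-bit product wraps modulo 2^32.

```c
#include <assert.h>
#include <stdint.h>

/* 32-bit version: the product wraps modulo 2^32 before the division. */
static uint32_t ets_trigger_dist_32(uint32_t default_dist,
                                    uint32_t num_streams,
                                    uint32_t rate, uint32_t total_rate)
{
    assert(total_rate > 0);
    return (default_dist * num_streams) * rate / total_rate;
}

/* 64-bit version: widen before multiplying, then narrow the quotient. */
static uint32_t ets_trigger_dist_64(uint32_t default_dist,
                                    uint32_t num_streams,
                                    uint32_t rate, uint32_t total_rate)
{
    uint64_t numerator;

    assert(total_rate > 0);
    numerator = ((uint64_t)default_dist * num_streams) * rate;
    return (uint32_t)(numerator / total_rate);
}
```

With a default trigger distance of 24 * 4096 bytes, 20 streams, and every stream reading 4096 bytes per window (total rate 20 * 4096), the 64-bit version returns 98304 bytes, which is exactly 24 pages, while the 32-bit version returns a much smaller, incorrect value.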


4.4.7 Miscellaneous implementations


Chapter 5

Evaluation

In this chapter we compare the performance of ETS and ESF with the standard Linux algorithm and a variant of the Linux algorithm, “Linux-24-48”, that uses a trigger distance of 24 pages and a prefetch degree of 48 pages. In section 5.1 we describe the performance metrics that we use to evaluate the algorithms and how we measure some of them. In section 5.2 we present the experimental setup, including the configuration of the test machines and the experimentation methodology. In section 5.3 we give an overview of the workloads we use, and in section 5.4 we present the complete results of the experiments on the various algorithms.

5.1 Performance Metrics

We use four performance measures to evaluate the algorithms: prefetched but unused memory, hit ratio, response time and throughput. Response time and throughput are important quantities that are directly perceived by the applications and are commonly used in research to evaluate prefetching algorithms [10, 18]. The other two metrics, though not directly visible to the applications, help us to understand some of the results that we get.


5.1.1 Response time

Response time is one of the metrics that is directly perceived by the application. For applications such as web-servers response time is an important measure of performance and users of web applications do care a lot about response time (Google search even prints the response time with the result of search queries).

Response time can be measured directly in the test applications by timing each read. However, applications have to measure the response time explicitly, and common Linux applications such as cp do not measure or print it. Therefore, we measure the response time of a request in the kernel. We define the response time as the time it takes to deliver a page to the application after its request has reached the kernel. The simplest way is to time the execution of the read system call. However, this method does not work with mmap or aio_read. Our approach is to put three different timestamps on every requested page. As soon as a page request reaches the kernel, the kernel first allocates space for the page if it does not exist. At this point we put the request time in the field page_request_time of the page structure. If a page is not present in the page cache, it is first inserted into the page cache and then fetched from the disk. When a page is inserted in the page cache, we put the current timestamp in page_cache_enter_time. When the page becomes ready, we put the current timestamp in page_ready_time. All the pages that are fetched in the same batch have the same ready time but may not have the same request time; prefetched pages are requested by the application only in the future. If page_ready_time is less than page_request_time, then the page was already in the cache and, from the kernel’s perspective, the response time for this page is zero. Otherwise, from the kernel’s perspective, the response time is page_ready_time - page_request_time.
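The rule above amounts to a clamped subtraction. The following sketch states it in C; the function name kernel_response_time is hypothetical, and both arguments are assumed to be time stamps taken from the same clock in the same unit.

```c
#include <assert.h>
#include <stdint.h>

/* Response time of one page from the kernel's perspective:
 * zero if the page was ready before it was requested (already cached),
 * otherwise the time between the request and the page becoming ready. */
static uint64_t kernel_response_time(uint64_t page_request_time,
                                     uint64_t page_ready_time)
{
    if (page_ready_time < page_request_time)
        return 0;
    return page_ready_time - page_request_time;
}
```

For example, a page requested at time 100 that becomes ready at time 350 has a response time of 250, while a prefetched page that became ready at time 50 and is requested at time 100 has a response time of 0.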

5.1.2 Hit/miss ratio

The hit ratio is not directly perceived by the application, but it affects measures such as response time and throughput. Our aim in measuring the hit ratio is to better explain some of the other results that we obtain. In caching and prefetching, the hit ratio is one of the most widely used performance measures [10]. In SCM it is called the fill rate and is a popular measure of the effectiveness of an inventory strategy [6].


section 5.1.1. If the page was in the cache then its request time will be more than the ready time. Therefore, it is simple to figure out whether a page reference was a hit or a miss.

5.1.3 Prefetched but not accessed pages

We have claimed in chapter 4 that the SCM-based algorithms use the same amount of memory as the standard Linux algorithm. To verify this, we have to measure the average amount of prefetched but unaccessed memory for every file. Prefetching algorithms aim to keep this quantity as small as possible without impacting performance. If there is a large amount of prefetched but unaccessed memory in the system, there will be a lot of contention among file streams for memory, which leads to cache pollution and wasted prefetches [10, 35]. Cache pollution happens when prefetched data replaces more useful data in memory, thus degrading the overall system performance.

To measure this metric, we record the amount of prefetched and unaccessed memory in a counter in the per-stream data structure that we defined in chapter 4. Whenever a prefetched page becomes ready, we increment the counter; whenever a page is accessed, we decrement the counter. At any given point in time, the counter gives us the amount of prefetched but unaccessed memory for that file.

5.1.4 Throughput

Throughput is a metric that is directly perceived by the application. Like response time and hit ratio, it is also widely used to evaluate caching and prefetching algorithms [10, 11, 35, 18]. Gill et al. [10] mention that maximizing the overall system throughput is one of the ultimate aims of optimal prefetching.


5.2 Experimental Setup

We run all our experiments on a Dell PowerEdge 2950 III server. This server has two 2.33 GHz quad-core Intel Xeon E5410 processors, 16 GB of RAM, and a RAID-5 array of 146 GB hard drives spinning at 15000 rpm. All the statistics that we collect can be sensitive to the system workload, system scheduling, and disk scheduling. Therefore, for all the experiments, we run the tests multiple times and report the average of the best three results. We found that the variance in the results is less than 10% and therefore omit the error bars in all the performance charts.

5.3 Workload

We use three different kinds of workloads to test the performance of the SCM-based algorithms and compare them to Linux.

1. Synthetic benchmarks – SCM-based algorithms will perform the best when we have long sequential streams with different request rates and standard deviations. However, it is difficult to control the rate and standard deviation of file streams in real Linux applications. Therefore, we use a series of synthetic benchmarks that use streams with varying rates and deviations. These benchmarks provide valuable insights about the behavior of various algorithms.

2. Real Linux applications – Although synthetic benchmarks help us understand the behavior of the algorithms, we need to run real applications to evaluate the algorithms in realistic situations. In this workload, we choose two file transfer applications (cp and scp) to compare the performance of the different algorithms. We use cp to transfer large files to local and remote machines (using NFS). We use scp to transfer large files to geographically distinct remote locations in PlanetLab [7].


we call this server workload “SPC2-like” as done in [10]. We also run the SPC2-VOD benchmark along with TPC-H traces collected at Purdue University [3] to simulate a mix of sequential and random workloads. The TPC-H trace was collected by running the TPC-H benchmark [34] on a MySQL [23] database. This trace is highly non-sequential and contains only about 3% sequential references [3].

5.4 Results

5.4.1 Two streams varying rate difference

This experiment is designed to compare the effect of ETS and Linux prefetching on two streams when their request rate difference is varied. In this test, we spawn two different threads to read two different files at different rates. Each thread periodically reads one page (4KB) of data and goes to sleep between two successive reads. By controlling the sleep time we can control the read request rate of the streams. In the experiment, we keep the request rate of the first stream fixed at 1000 pages/second, and, vary the speed of second stream between 3000 and 7000 pages/second.
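The relation between the target request rate and the sleep time can be sketched as follows. The function name sleep_ns_for_rate is hypothetical, and the sketch makes the simplifying assumption that the time spent inside the read call itself is negligible compared to the sleep interval; it is not taken from the benchmark code.

```c
#include <assert.h>
#include <stdint.h>

/* Nanoseconds to sleep between two one-page reads so that a thread
 * issues roughly pages_per_sec requests per second.
 * Assumption: the duration of the read itself is ignored. */
static uint64_t sleep_ns_for_rate(uint32_t pages_per_sec)
{
    assert(pages_per_sec > 0);
    return 1000000000ULL / pages_per_sec;
}
```

Under this assumption, the fixed stream at 1000 pages/second sleeps for 1 ms between reads, and the fastest setting of the second stream, 7000 pages/second, sleeps for about 143 microseconds.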

As mentioned in chapter 4, Linux uses a trigger distance and prefetch degree of 32 pages, whereas ETS and ESF use a prefetch degree of 48 pages and an average trigger distance of 24 pages. A different prefetch degree can by itself lead to different performance because the disk access time is amortized across a larger batch of prefetched pages. Therefore, for a fair comparison, we have to compare ETS with an algorithm that uses a prefetch degree of 48 pages. For this purpose, we use a modified version of the standard Linux algorithm that uses a trigger distance of 24 pages and a prefetch degree of 48 pages. We designate the two Linux algorithms as “Linux-32-32” and “Linux-24-48”. We use a trigger distance of 24 pages in Linux-24-48 to keep its average prefetch memory usage the same as that of Linux-32-32. Recall that the average prefetch memory usage of an algorithm is $T + P/2$ (see section 4.4); therefore, the average prefetch memory usage of all the algorithms that we use is the same (48 pages).


Figure 5.1: Two streams with increasing rate difference. Average response time. (Plot: response time in µs against Rate 1 : Rate 2 of 1:3, 1:5 and 1:7, for Linux-32-32, Linux-24-48 and Linux-ETS.)

Figure 5.2: Two streams with increasing rate difference. Average number of misses per prefetch cycle. (Plot: number of misses per prefetching cycle against Rate 1 : Rate 2 of 1:3, 1:5 and 1:7; legend: Linux-32-32.)


Figure 5.3: Two streams with increasing standard deviation difference. Average response time. (Plot: response time in µs against the SD of stream 2 as a factor of the mean, at 3, 5 and 7, for Linux-ETS and Linux-ESF.)

5.4.2 Two streams with varying standard deviation difference

This experiment is designed to understand the behavior of the ESF algorithm. In this experiment, we keep the rates of the two streams fixed at 1000 pages/second while varying the standard deviation. The standard deviation of the first stream is fixed at the square root of the average request rate, while the standard deviation of the second stream is varied between 3 and 7 times the mean request rate.

Fig. 5.3 shows the result of this experiment. As the request rates of both streams are the same, ETS behaves similarly to Linux, and hence we do not show Linux and only compare ETS and ESF. We can see that ESF consistently outperforms ETS. The average improvement of ESF over ETS in terms of average response time is 25%, while the maximum improvement is 35%. Fig. 5.4 shows the response time of the individual streams. We see that ESF only slightly increases the response time of stream 1 while significantly reducing that of stream 2, giving an overall improvement over ETS.

5.4.3 Real Linux applications


Figure 5.4: Two streams with increasing standard deviation difference. Response time for each stream. (Plot: response time of individual streams in µs against the SD of stream 2 as a factor of the mean, at 3, 5 and 7, for Stream 1 and Stream 2 under Linux-ETS and Linux-ESF.)

grams have different read throughput. We expect that by giving more trigger distance to faster streams, ETS and ESF will improve the throughput of the faster streams while minimally impacting the throughput of the slower streams.

Fig. 5.5 shows the percentage improvement in the throughput of various file transfer applications over Linux-32-32. The scp streams in the figure are sorted by their speed, with the slowest one being the leftmost and the fastest one the rightmost. Additionally, the numbers above the bars denote the throughput of the application with Linux-32-32. As expected, ETS and ESF greatly increase the performance of cp to a local destination. There is a slight degradation of performance for the other applications, but it is less than 5% for ETS and 7% for ESF. The large improvement in the throughput of cp, combined with the very slight degradation of the other applications, gives an overall improvement of 33% for ETS and 22% for ESF. Here we see that ESF performs worse than ETS; we discuss this in section 5.4.5.

5.4.4 Server workload


Figure 5.5: Throughput of file transfer applications. (Plot: improvement in throughput, from -30% to 60%, for cp-local, cp-NFS, scp1 to scp5 and the aggregate, under Linux-24-48, Linux-ETS and Linux-ESF. Bar labels give the Linux-32-32 throughput: 19K, 831, 766, 753, 597, 174, 107 and 22K pg/sec.)

read in every read varies from 16KB to 128KB at a time. The different read sizes correspond to the different qualities of video that are being played. To test the scalability of our implementation we vary the number of concurrent streams from 30 to 60.

Fig. 5.6 shows the results of running the SPC-2 like workload on a video server. We see that ETS far outperforms both Linux and ESF while using less memory than Linux. In this test ETS achieves about 41% better performance than Linux while using about 25% less prefetch memory. As expected, both Linux-32-32 and Linux-24-48 have an average prefetched-but-unused memory consumption close to 48 pages. However, ETS and ESF use far less memory due to the upper limit placed on the trigger distance in the implementation. Generally we expect the throughput of a system to vary with increasing system load; however, we find that the system throughput remains almost the same in all cases. This is because with 30 streams our system (not the disk) is already saturated, and increasing the number of streams has no effect on the throughput.


Figure 5.6: Pure SPC-2 like workload. (Plot: throughput in pages/sec and average unused but prefetched memory in pages, against the number of streams at 30, 40, 50 and 60, for Linux-32-32, Linux-24-48, Linux-ETS and Linux-ESF.)

[Figure 5.7: plot of total throughput (pages/sec) against the number of sequential streams (30, 40, 50, 60) for Linux-32-32 and Linux-ETS.]


Figure 5.8: Throughput of the TPC-H component of the SPC-2 and TPC-H workload test. (Plot: TPC-H throughput in pages/sec against the number of sequential streams at 30, 40, 50 and 60.)

a video server and database server running on the same machine. Fig. 5.7 shows the results of running this experiment. We see that even in this case, ETS outperforms Linux by about 17%. Our algorithms only work on sequential streams and leave it to Linux to handle random accesses. However, due to the aggressive prefetching used by our algorithms, we can expect some minimal impact on the throughput of TPC-H. The results in Fig. 5.8 show that we degrade the performance of the TPC-H component of the workload by less than 5% while still giving better overall throughput.

5.4.5 Performance of ESF


Chapter 6

Conclusion and future work

Our aim at the beginning of this work was to implement SCM algorithms in the Linux kernel and demonstrate that they can be successfully used in the context of prefetching. Accordingly, in this work, we have given a mapping between prefetching and the inventory theory of SCM and applied two algorithms from SCM to prefetching. We have implemented ETS and ESF in the Linux kernel and demonstrated that they can improve on the performance of the standard Linux prefetching algorithm. In particular, ETS consistently performs better than both ESF and the Linux algorithm for both real applications and server workloads.

There are many avenues for future research that we have identified in this thesis.

1. In this work we have only changed the trigger distance; however, there are many algorithms in SCM to determine the order quantity (i.e. the prefetch degree). We can use one of these algorithms along with ETS or ESF to change both the trigger distance and the prefetch degree simultaneously. The challenge in this approach is that changing the prefetch degree has a large effect on performance because of the large disk latency.

2. As mentioned in Section 3.2.2, modeling and measuring the lead time more accurately and using it to calculate the trigger distance and prefetch degree can be an interesting area of further research. More accurate modeling may require more information from the disk I/O system. The challenge here is to measure the lead time accurately without adding significantly to the overhead of the read system call. The implementation of ESF may benefit the most from measuring the lead time and its variation more accurately.


to enable them to be further used in multi-level prefetching. Our research team is already trying to use some of the SCM approaches to minimize what is known as the bullwhip effect [30, 6]. The idea is to reduce “information distortion” in multi-level prefetching, which occurs because servers in the lower levels cannot distinguish between the actual application requests and the prefetch requests of the upper layer [35].
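
As a rough illustration of items 1 and 2 above, the following Python sketch shows how a measured lead time could feed a trigger-distance calculation. It is not the kernel implementation described in this thesis. The class `LeadTimeEstimator`, both functions, the smoothing constant, and the use of a mean absolute deviation as a stand-in for the standard deviation of the lead time are all assumptions made for illustration. The formulas are the textbook reorder-point forms (expected demand over the lead time plus a safety stock), and `max_pages` stands in for the upper limit placed on the trigger distance.

```python
import math
from dataclasses import dataclass


@dataclass
class LeadTimeEstimator:
    """Hypothetical helper: smoothed mean and deviation of observed lead times."""
    alpha: float = 0.125   # assumed smoothing constant
    mean: float = 0.0      # smoothed lead time, in seconds
    dev: float = 0.0       # smoothed absolute deviation, in seconds
    samples: int = 0

    def update(self, lead_time: float) -> None:
        """Fold one measured disk access time into the running estimates."""
        if self.samples == 0:
            self.mean = lead_time
            self.dev = 0.0
        else:
            err = lead_time - self.mean
            self.mean += self.alpha * err
            self.dev += self.alpha * (abs(err) - self.dev)
        self.samples += 1


def trigger_distance_ets(rate, lead_time, time_supply, max_pages):
    """Equal time supply (textbook form, assumed here):
    demand over the lead time plus rate * time_supply pages of safety stock."""
    pages = rate * (lead_time + time_supply)
    return min(max_pages, math.ceil(pages))


def trigger_distance_esf(rate, lead_time, lead_dev, safety_factor, max_pages):
    """Equal safety factor (textbook form, assumed here):
    demand over the lead time plus safety_factor times the deviation
    of lead-time demand, approximated as rate * lead_dev."""
    pages = rate * lead_time + safety_factor * rate * lead_dev
    return min(max_pages, math.ceil(pages))
```

For example, with an access rate of 1024 pages/sec, a lead time of 1/128 s, and a time supply of 1/256 s, `trigger_distance_ets` yields 12 pages, provided the cap is at least that large. Because the ESF form depends on the deviation estimate, it is the one most sensitive to how accurately the lead time and its variation are measured, which is consistent with the remark in item 2.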


Bibliography

[1] Rakesh Barve, Mahesh Kallahalla, Peter J. Varman, and Jeffrey Scott Vitter. Competitive parallel disk prefetching and buffer management. J. Algorithms, 36(2):152–181, 2000.

[2] Daniel Bovet and Marco Cesati. Understanding the Linux Kernel. O’Reilly & Associates, Inc., third edition, 2008.

[3] Ali R. Butt, Chris Gniady, and Y. Charlie Hu. The performance impact of kernel prefetching on buffer cache replacement algorithms. SIGMETRICS Perform. Eval. Rev., 33(1):157–168, 2005.

[4] Pei Cao. Application-controlled file caching and prefetching. PhD thesis, Princeton University, Princeton, NJ, USA, 1996.

[5] Zhifeng Chen, Yan Zhang, Yuanyuan Zhou, Heidi Scott, and Berni Schiefer. Empirical evaluation of multi-level buffer cache collaboration for storage systems. SIGMETRICS Perform. Eval. Rev., 33(1):145–156, 2005.

[6] Sunil Chopra, Peter Meindl, and D.V. Kalra. Supply Chain Management: Strategy, Planning, and Operation. Prentice Hall, third edition, 2006.

[7] Brent Chun, David Culler, Timothy Roscoe, Andy Bavier, Larry Peterson, Mike Wawrzoniak, and Mic Bowman. Planetlab: an overlay testbed for broad-coverage services. SIGCOMM Comput. Commun. Rev., 33(3):3–12, 2003.

[8] Peter J. Denning. The locality principle. Commun. ACM, 48(7):19–24, 2005.