BENCHMARKING CRITERIA FOR
FILE SYSTEM BENCHMARKS
WASIM AHMAD BHAT*
Research Scholar, P. G. Department of Computer Sciences, University of Kashmir, India
S. M. K. QUADRI
Head, P. G. Department of Computer Sciences, University of Kashmir, India
ABSTRACT
Comparing products based solely on technical merits and specifications is rarely useful for predicting actual performance, so the most common approach is to empirically evaluate their performance using some workload and performance gathering tool. Benchmarking file systems is a process in which some specific workload is run on a specific system in order to get performance data. This way file system performance is evaluated accurately. However, the benchmarks proposed in file system research papers suffer from several problems. Although many design criteria have been proposed for development of good file system benchmarks, very few benchmarks follow them completely. In this paper, we try to summarize all the generally accepted design criteria proposed in various research papers for file system benchmarks based on the various problems that they had identified. The goal of this paper is to remove the ambiguity among the design criteria of file system benchmarks by classifying them as per their perspective of application and provide a refined and fresh list.
Keywords: Benchmark; File System; Criteria.
1. INTRODUCTION
Whenever new software or hardware is designed and developed, the first thing people are interested in is its performance. The performance data of such software or hardware has a significant impact on its value. Benchmarks are used to obtain this performance data and thus may add to or subtract from the value of that software or hardware. Generally, this performance data may be used by consumers in purchasing decisions, or by researchers to help determine a system’s worth. When the results of an empirical performance evaluation of a system are presented, the results and their implications must be clear to the reader. This includes accurately depicting behavior under realistic workloads and in worst-case scenarios, as well as explaining the reasoning behind the benchmarking methodology and the observed system behavior. In addition, the reader should be able to verify the benchmark results and compare the performance of one system with that of another. To accomplish these goals, much thought must go into choosing suitable benchmarks and configurations, and accurate results must be conveyed. Users could test performance themselves using real workloads, which would transfer the responsibility of benchmarking from the author to the user. However, this is usually impractical because testing multiple systems is time consuming: exposing a system to real workloads implies learning how to configure it properly, possibly migrating data and other settings to the new system, and dealing with its bugs. In addition, many systems developed for research purposes are not released to the public. Although rare, we have seen performance measured using actual workloads when systems are created for in-house use [1] or are made by a company to be deployed [2]. Hence, the next best alternative is for authors to run workloads that are representative of real-world use on commodity hardware.
Benchmarking file and storage systems is a complex case of benchmarking and thus requires many parameters to be looked at. Every file system has a single purpose, to mediate access to data on secondary storage devices via a uniform notion of files, but file systems differ in many ways, such as the type of underlying media (e.g., magnetic disk, optical disk, solid state memory, network storage, volatile RAM, etc.), storage environment (e.g., RAID, LVM, virtualization, etc.), the workloads for which the system is optimized, and their features (e.g., journals, encryption, compression, etc.) [3]. In addition, complex interactions exist between file systems,
I/O devices, specialized caches (e.g., buffer cache, disk cache), kernel daemons (e.g., kflushd in Linux), and other OS components. Some operations may be performed asynchronously, and this activity is not always captured in benchmark results. Because of this complexity, many factors must be taken into account when performing benchmarks and analyzing the results.
In this paper, we try to summarize all the generally accepted design criteria proposed in various research papers for file system benchmarks based on the various problems that they had identified. The goal of this paper is to remove the ambiguity among the design criteria of file system benchmarks by classifying them as per their perspective of application and provide a refined and fresh list.
2. FILE SYSTEM BENCHMARKS
In 1972, Lucas [4] stated three reasons to obtain performance data:
Selection Evaluation (“which system is best for me”),
Performance Monitoring (“how can I tweak the system to improve performance”), and
Performance Projection (“how well will this idea for a system perform”).
He further states that benchmarks are excellent for selection evaluation, adequate for performance monitoring, and insufficient for performance projection. Thus, benchmarks serve two classes of consumers. One class consists of customers looking to buy a new system. Benchmarking results help them decide which system will perform best under their workload. It is this class that has spawned benchmarks such as IOStone [5], which yield only one number as a final result. This type of benchmark is fairly useless, because only customers whose workload at least approximates the benchmark’s target workload can use the result, and then only for relative comparisons. System designers comprise the other class; they use benchmarks to point them towards possible areas for improvement, either in the current system or in the design of a new system. A benchmark that yields only one number is of no use to this class of benchmarkers.
Benchmarks may be categorized in two ways:
One way is to categorize a benchmark as being either a synthetic or an application benchmark;
The other way is as a macro- or micro- benchmark.
Application benchmarks consist of programs and utilities that a user can actually use. For example, SPECint92 [6] consists of six applications including a C compiler, a Lisp interpreter, and a spreadsheet. Synthetic benchmarks, on the other hand, model a workload by executing various operations in a mix consistent with the target workload; the NFS benchmarks (nfsstone, nhfsstone, and LADDIS) allow the user to input a target mix of operations, i.e., what percentage of the workload should consist of create operations, getattr operations, etc. [7][8].
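To illustrate how a target mix of operations can drive a synthetic benchmark, the following is a minimal Python sketch. It is not the code of nfsstone, nhfsstone, or LADDIS: the function name run_mix, the four operation names, and the fixed request sizes are hypothetical choices made for illustration, and the sketch exercises a local temporary directory rather than an NFS server.

```python
import os
import random
import tempfile


def run_mix(mix, total_ops, seed=0):
    """Issue total_ops operations chosen according to the weights in mix.

    mix maps an operation name ("create", "getattr", "read", "write")
    to its share of the workload. Returns how many of each were issued.
    """
    rng = random.Random(seed)          # fixed seed makes the run repeatable
    names = list(mix)
    weights = [mix[name] for name in names]
    counts = {name: 0 for name in names}

    with tempfile.TemporaryDirectory() as root:
        # Start with one file so that non-create operations have a target.
        first = os.path.join(root, "f0")
        with open(first, "wb") as f:
            f.write(b"x" * 4096)
        files = [first]

        for _ in range(total_ops):
            op = rng.choices(names, weights=weights)[0]
            target = rng.choice(files)
            if op == "create":
                path = os.path.join(root, "f%d" % len(files))
                open(path, "wb").close()
                files.append(path)
            elif op == "getattr":
                os.stat(target)
            elif op == "read":
                with open(target, "rb") as f:
                    f.read(4096)
            elif op == "write":
                with open(target, "ab") as f:
                    f.write(b"y" * 512)
            else:
                raise ValueError("unknown operation: " + op)
            counts[op] += 1
    return counts
```

A call such as run_mix({"create": 10, "getattr": 50, "read": 30, "write": 10}, 1000) issues roughly those percentages of each operation; the percentages here are illustrative only and do not reproduce any published mix.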
Synthetic benchmarks are more flexible than application benchmarks: they usually have a larger number of parameters that might allow them to scale better with technology and to increase the number of different workloads they can model. However, the problem with synthetic benchmarks is that they do not measure any real work. This makes their results questionable because the operations completed in a synthetic benchmark might not take the same amount of time in a real application. Either the synthetic benchmark might add overhead that does not exist in a real application, or a real application might incur overhead not modeled in the benchmark. This issue remains unresolved, although conventional wisdom so far disregards it.
Macro-benchmarks measure the entire system, and usually model some workload; they can be either synthetic or application benchmarks.
Micro-benchmarks measure a specific part of the system: They can be thought of as a subset of synthetic benchmarks in that they are artificial; however, they do not try to model any real workload whatsoever. An example of a micro-benchmark is the create micro-benchmark from the original LFS paper: It timed how long the system took to create 10,000 files [9].
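A create micro-benchmark of this kind can be sketched in a few lines of Python. This is not the code used in the LFS paper: the function name time_creates is hypothetical, the files are created empty in a temporary directory, and nothing is forced to disk, all of which are assumptions made to keep the sketch short.

```python
import os
import tempfile
import time


def time_creates(num_files=10_000):
    """Create num_files empty files; return (elapsed seconds, files created)."""
    with tempfile.TemporaryDirectory() as root:
        start = time.perf_counter()
        for i in range(num_files):
            # O_EXCL makes the call fail instead of silently reusing a file.
            fd = os.open(os.path.join(root, "file%05d" % i),
                         os.O_CREAT | os.O_WRONLY | os.O_EXCL, 0o644)
            os.close(fd)
        elapsed = time.perf_counter() - start
        created = len(os.listdir(root))   # counted outside the timed region
    return elapsed, created
```

Because only the create path is timed, the result says nothing about any real workload, which is exactly the property that distinguishes a micro-benchmark.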
3. BENCHMARKING PROBLEMS & MITIGATION
system might perform on a different workload or that point a system designer towards possible areas for improvement.
Tang [12] provides an overview and critique of several file system benchmarks. Most macro benchmark programs report only a single result, either the elapsed time to execute the benchmark, or the file system throughput achieved by the benchmark. By providing only a single number as the evaluation of a file system’s performance, these benchmarks are essentially claiming to be able to provide a one-dimensional ordering of different file systems. This claim doesn’t hold up when we examine how different file systems perform with different workloads. In fact, the most that a benchmark can do is predict the performance of a file system with a workload that is similar to the benchmark workload. This highlights the central deficiency of existing benchmarking methodology. Macro benchmarks can determine the system on which a particular workload will perform best, but they provide little information about the underlying causes for such performance differences.
Micro benchmarks, in contrast, are useful for understanding the detailed performance differences between systems (e.g., one system may provide higher I/O throughput than another). Sometimes, micro benchmarks may show that one file system outperforms another in all areas. More commonly, however, different file systems offer different advantages. In such cases, the micro benchmark results by themselves provide no insight into which systems are best suited to particular workloads. Most users turn to benchmarks, or benchmark results, in order to answer the question, “How will my workload perform on this system?” Micro benchmarks provide detailed information about file system performance, but in isolation they are seldom sufficient to answer this question: without a detailed understanding of the interactions between a workload and the file system, they provide little insight into the workload’s performance on the target file system. Macro benchmarks, in contrast, are designed to answer this very question, but the user is still left with the difficult task of determining whether publicly available macro benchmarks are similar to his own workloads. Thus, what is needed is a technique that lets researchers make the environment in which they execute benchmarks more realistic, together with new benchmarking strategies that let users determine how well their own workload will perform on a file system of interest. In isolation, neither type of benchmark fully illuminates the behavior of a file system. Traeger et al. [3] recommend using both micro and macro benchmarks for the performance evaluation of file systems in order to obtain accurate, fair, and useful comparison results.
3.1. Considerations Apparent from Analysis of Various Benchmarks
Tang and Seltzer [13] analyzed various benchmarks and showed that current benchmarks are inadequate for measuring file system performance, and that they suffer from the following problems:
Several benchmarks measure only a subset of the file system functionality, typically data throughput. Metadata operations are usually ignored, even though they constitute a large percentage of the requests made to a file system [14] [15]. Running nfswatch, a utility that monitors what packets are sent to the server, shows that the majority of requests sent to a server are meta-data requests.
Several benchmarks measure only peak performance, rather than what happens when a real workload is running on a system.
Most of the benchmarks do not scale with technology to stress today’s systems as well as yesterday’s systems.
By trying to model a specific workload (e.g., system development, scientific calculation), several benchmarks end up either not modeling any actual workload or being useful to only a very narrow group (while perhaps being widely used).
A few benchmarks present meaningless results, which cannot be used to predict the performance of a system on realistic workloads or to point system designers towards possible areas for improvement.
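The first problem in the list above, ignoring metadata operations, can be addressed by timing those operations separately from data transfer. The following Python sketch is one hedged way to do so; the function name metadata_rates and the choice of create, stat, rename, and unlink as representative metadata operations are our assumptions, not part of any benchmark cited here.

```python
import os
import tempfile
import time


def metadata_rates(num_files=1000):
    """Return operations per second for create, stat, rename and unlink."""
    rates = {}
    with tempfile.TemporaryDirectory() as root:
        old = [os.path.join(root, "a%d" % i) for i in range(num_files)]
        new = [os.path.join(root, "b%d" % i) for i in range(num_files)]

        def timed(name, action, items):
            # Time one kind of operation over all items and record its rate.
            start = time.perf_counter()
            for item in items:
                action(item)
            elapsed = time.perf_counter() - start
            rates[name] = len(items) / elapsed

        timed("create", lambda p: open(p, "wb").close(), old)
        timed("stat", os.stat, old)
        timed("rename", lambda pair: os.rename(*pair), list(zip(old, new)))
        timed("unlink", os.remove, new)
    return rates
```

No file data is read or written, so the four reported rates reflect only the metadata path (and whatever caching sits in front of it).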
3.2. Importing I/O Benchmark Criteria
Chen laid out some criteria for judging I/O systems which can be adapted for file systems as well [16]. Chen states that an I/O benchmark should be:
Prescriptive: it should point system designers towards possible areas for improvement.
I/O bound: the benchmark should measure the I/O system and not, for example, the CPU.
Scalable with advancing technology.
Comparable between different systems.
General: applicable to a wide variety of workloads.
Tightly specified: no loopholes; clarity in what needs to be reported.
3.3. Think Like a File System Designer
A file system benchmark designer should think like a file system designer in order to modularize his benchmark design accordingly and validate every module of the file system. The following decisions must be made when designing a file system:
What meta-data (file system control structures) is there?
How are blocks allocated to a file? Where is meta-data placed in relation to data? What model of locality is used? For example, FFS tries to have both spatial and logical locality, placing blocks for a single file near one another and placing files in the same directory close together as well.
How are files named? What algorithm is used for pathname resolution?
What caching is there, and how are the caches maintained (LRU versus random caching algorithms, determining when data should be flushed to disk)?
How is disk scheduling done? Can reads and writes be clustered? Can requests be pulled off of the disk queue or be rescheduled?
How is disk space managed (a free map, blocks versus sectors, etc.)? Is there some method used to minimize disk space fragmentation?
What semantic guarantees are made to the user? For example, when a create system call returns to the user, does the file exist on disk?
How does a file system recover from a crash? Can it recover? How long does it take to recover?
Can the file system handle multiple users, perhaps accessing or even changing the same file?
How does the file system provide protection for a user’s data? How well does it work?
The designers of a file system need to solve all of these problems, and a file system benchmark needs to be able to measure how well their solution works. A file system benchmark should also determine what performance gain is due to a clever algorithm in the file system versus a clever disk.
3.4. Which Metric Evaluates a File System?
A benchmark writer needs to decide which metric to use in evaluating a system. The most common metric is throughput: how many operations can be completed in a certain amount of time. Throughput is usually expressed in KBytes per second, or IOStones per second, or in general, something per second. Often, however, users care more about latency, i.e., how long it takes to do one operation. The user cares about how long the system takes to respond to a keystroke or to list a directory: “how long do I have to wait?” [17]. Typically, latency is expressed as the average amount of time needed to complete some operation. While throughput and latency are the most common metrics used, they are definitely not the only ones. Other metrics include reliability, security, and efficiency of disk space usage, although throughput and latency are more general.
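The difference between the two metrics can be made concrete with a short Python sketch that derives both from the same run. The function name measure_reads and its parameters are hypothetical; the sketch reads an existing file sequentially and makes no attempt to bypass the buffer cache, so on a recently written file it will largely measure memory rather than disk.

```python
import time


def measure_reads(path, request_size, num_requests):
    """Read up to num_requests blocks of request_size bytes from path.

    Returns (throughput in KBytes per second, mean latency in seconds).
    """
    latencies = []
    total_bytes = 0
    # buffering=0 means each read() call maps to one read system call.
    with open(path, "rb", buffering=0) as f:
        for _ in range(num_requests):
            start = time.perf_counter()
            data = f.read(request_size)
            elapsed = time.perf_counter() - start
            if not data:
                break                      # reached end of file
            latencies.append(elapsed)
            total_bytes += len(data)

    if not latencies:
        return 0.0, 0.0                    # nothing was read
    total_time = sum(latencies)
    mean_latency = total_time / len(latencies)
    # Guard against a timer too coarse to register any elapsed time.
    throughput = (total_bytes / 1024) / total_time if total_time > 0 else float("inf")
    return throughput, mean_latency
```

Throughput is an aggregate over the whole run, whereas the per-request latencies could also be summarized by their maximum or a high percentile when the question is “how long do I have to wait?”.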
3.5. How to Validate the Results?
Traeger et al. [3] recommended a list of guidelines to consider when evaluating the performance of a file or storage system, which indirectly allow benchmarking results to be validated. The two underlying themes of these guidelines are the following.
Explain What Was Done in as Much Detail as Possible. For example, if one decides to create one’s own benchmark, the paper should detail what was done. If replaying traces, one should describe where they came from, how they were captured, and how they were replayed (what tool? what speed?). This can help others understand and validate the results.
In Addition to Saying What Was Done, Say Why It Was Done That Way. For example, while it is important to note that one is using ext2 as a baseline for the analysis, it is just as important to discuss why it is a fair comparison. Similarly, it is useful for readers to know why one ran that random-read benchmark so that they know what conclusions to draw from the results.
3.6. Measure Both Meta-data & User-data Performance
3.7. Provide Multi-dimensional Results
File system benchmarks, in general, provide a single value or a limited number of values that are intended to represent the overall performance of the file system. However, this is somewhat like trying to reduce a vector to a scalar in the sense that a significant amount of important detail can and does get lost in the translation. In order to get an accurate assessment of the performance of a storage subsystem it is necessary to perform a benchmark over a range of values of specific parameters.
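As a sketch of what benchmarking over a range of parameter values might look like, the following Python function times sequential writes for every combination of file size and request size and returns the whole grid rather than a single figure. The function name sweep_writes and the choice of these two parameters are our assumptions; a real study would choose the parameters relevant to the storage subsystem under test.

```python
import itertools
import os
import tempfile
import time


def sweep_writes(file_sizes, request_sizes):
    """Time sequential writes for every (file size, request size) pair.

    Returns a dict mapping (file_size, request_size) to throughput in
    bytes per second, i.e. a surface of results instead of one number.
    """
    results = {}
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "sweep.dat")
        for file_size, request_size in itertools.product(file_sizes, request_sizes):
            buf = b"\0" * request_size
            start = time.perf_counter()
            with open(path, "wb", buffering=0) as f:
                written = 0
                while written < file_size:
                    # The last request may be shorter than request_size.
                    written += f.write(buf[: file_size - written])
                os.fsync(f.fileno())       # include the cost of flushing to the device
            elapsed = time.perf_counter() - start
            results[(file_size, request_size)] = written / elapsed
            os.remove(path)
    return results
```

The returned dictionary can be tabulated or plotted with file size on one axis and request size on the other, which preserves the detail that a single averaged value would discard.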
3.8. Consider all your Benchmarking Perspectives
The perspective is the point of view from which the performance is measured. Three of the more generally accepted perspectives are:
1. Application
2. System
3. Storage Subsystem
The Application perspective is what most file system benchmarks represent. From this perspective all of the underlying system services and hardware functions are hidden. This perspective includes all the cumulative effects of other applications running at the same time as the benchmark run. This is also true for applications running on other machines that may be simultaneously accessing the storage subsystem under test. From this perspective the results of a benchmark can be skewed due to undesirable interactions from these other applications and other machines.
The System perspective is obtained by running system-monitoring tools (such as sar, osview, or filemon) during a benchmark run. These tools provide coarse-grained real-time monitoring of the system I/O activity for such high-level operations as file reads and writes, as well as the number of operations actually sent to the storage subsystem on a device-by-device basis. From this perspective it is possible to see and measure the effect of other applications that are running concurrently with the benchmark program.
The Storage Subsystem perspective is the most difficult to monitor since there are not many tools available to collect performance data from the storage subsystem.
All three perspectives should be considered, and performance data from each of them should be used to derive the final conclusion.
3.9. Take Multilevel Caches into Consideration
There are several levels at which caching is used to mitigate performance issues with the underlying storage layers. These include the file system buffer cache, disk array controller caches, and disk drive caches. The file system, with its buffer cache, is at the top layer of the cache hierarchy. The file system buffer cache is generally a significant amount of physical system memory that is used to hold large chunks of data from files being accessed through a file system manager. For example, when an application or benchmark program issues a read system call, most file system managers will read data into the file system buffer cache and then copy the requested data into the user buffer. Similarly, for write operations, the data is copied from the user buffer into the file system buffer cache and later written to the storage media. For normal applications this is acceptable behavior, but when running benchmarks it is necessary to understand when the cache is being used and when it is not. Otherwise, the results of the benchmark can be rendered meaningless. The disk array controller and disk drives have separate caches that are neither connected to nor controlled by the file system manager or the device drivers. These caches are under the control of the disk array or disk drive controller, and many different control algorithms determine how they are used and how effective they are. The configuration and usage modes of these caches are purely vendor-dependent and model-specific. It is important to understand how such a cache, if present, is being used during a benchmark run so that its effects can be taken into account when setting up the benchmark runtime parameters and/or interpreting the results.
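The effect of the top-level cache on write measurements can be observed with a small Python sketch that times the same write twice, once letting the data rest in the file system buffer cache and once asking the operating system to flush it. The function names timed_write and compare_cached_and_flushed are hypothetical. Note the limits of the sketch: os.fsync addresses only the operating system's cache, and whether the data also leaves the disk array controller or disk drive caches depends on the device, as discussed above.

```python
import os
import tempfile
import time


def timed_write(path, data, flush_to_disk):
    """Write data to path and return the elapsed time in seconds.

    With flush_to_disk=False the call may return as soon as the data is
    in the buffer cache; with True, the OS is asked to write it out first.
    """
    start = time.perf_counter()
    with open(path, "wb") as f:
        f.write(data)
        if flush_to_disk:
            f.flush()              # drain Python's user-space buffer
            os.fsync(f.fileno())   # ask the OS to write its cached data out
    return time.perf_counter() - start


def compare_cached_and_flushed(size=1 << 20):
    """Return (cached seconds, flushed seconds) for a write of size bytes."""
    data = b"\0" * size
    with tempfile.TemporaryDirectory() as root:
        cached = timed_write(os.path.join(root, "cached.dat"), data, False)
        flushed = timed_write(os.path.join(root, "flushed.dat"), data, True)
    return cached, flushed
```

On many systems the flushed write takes noticeably longer, but the ratio depends on the platform and the device, so a benchmark report should state which of the two modes was measured.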
3.10. File System Aging Does Have an Effect!
necessary to be able to test and/or monitor a file system’s performance as it ages in order to determine when defragmentation is really necessary.
4. CONCLUSION
In this paper we discussed the use of benchmarks and presented a generally accepted classification of file system benchmarks. Further, we discussed the problems from which current file system benchmarks suffer. Finally, we summarized the generally accepted design criteria proposed in various research papers for file system benchmarks, based on the problems that those papers identified, and tried to classify them on the basis of their perspective of application.
Several issues are left open in this paper: for example, no existing benchmarks have been evaluated against these criteria, and no techniques have been identified to implement them.
References
[1] Ghemawat, S.; Gobioff, H.; Leung, S. T. (2003). The Google file system, In Proceedings of the 19th ACM Symposium on Operating Systems Principles (ACM SIGOPS), Bolton Landing, NY, 29–43.
[2] Schmuck, F.; Haskin, R. (2002). GPFS: A shared-disk file system for large computing clusters, In Proceedings of the 1st USENIX Conference on File and Storage Technologies, Monterey, CA, 231–244.
[3] Traeger, A.; Zadok, E.; Joukov, N.; Wright, C. P. (2008). A Nine Year Study of File System and Storage Benchmarking, ACM Transactions on Storage, Volume 4, Issue 2.
[4] Lucas, H. C. Jr. (1972). Performance Evaluation and Monitoring, Computing Surveys, 79-91.
[5] Park, A.; Becker, J. C. (1990). IOStone: A Synthetic File System Benchmark, Computer Architecture News 18, 2, 45-52.
[6] Case, B. (1992). Updated SPEC Benchmarks Released: SPEC92, New Multiprocessor Benchmarks Now Available, Microprocessor Report, 14-19.
[7] Molloy, M. K. (1992). Anatomy of the NHFSSTONES Benchmark, Performance Evaluation Review 19.
[8] Nelson, B.; Lyon, B.; Wittle, M.; Keith, B. (1992). LADDIS – A Multi-Vendor and Vendor-Neutral NFS Benchmark, UniForum Conference.
[9] Rosenblum, M.; Ousterhout, J. K. (1992). The Design and Implementation of a Log-Structured File System, ACM Transactions on Computer Systems, 26-52.
[10] Ousterhout, J. K. (1990). Why Aren’t Operating Systems Getting Faster As Fast As Hardware? Proceedings of the 1990 USENIX Summer Technical Conference, 247-256.
[11] Seltzer, M.; Smith, K.; Balakrishnan, H.; Chang, J.; McMains, S.; Padmanabhan, V. (1995). File System Logging versus Clustering: A Performance Evaluation, Proceedings of the 1995 USENIX Technical Conference, 249-264.
[12] Tang, D. (1995). Benchmarking filesystems. Tech. Rep. TR-19-95, Harvard University.
[13] Tang, D.; Seltzer, M. (1994). Lies, damned lies, and file system benchmarks, Tech. Rep. TR-34-94, Harvard University.
[14] Baker, M. G.; Hartman, J. H.; Kupfer, M. D.; Shirriff, K. W.; Ousterhout, J. K. (1991). Measurements of a Distributed File System, Proceedings of the 13th Symposium on Operating Systems, 198-212.
[15] Blackwell, T.; Harris, J.; Seltzer, M. (1995). Heuristic Cleaning Algorithms for Log-Structured File Systems, Proceedings of the 1995 USENIX Technical Conference, 277-288.
[16] Chen, P. M.; Patterson, D. A. (1992). A New Approach to I/O Benchmarks – Adaptive Evaluation, Predicted Performance, UCB/Computer Science Dept. 92/679, University of California at Berkeley.
[17] Mogul, J. (1992). SPECmarks Are Leading Us Astray, The Third Workshop on Workstation Operating Systems, 160-163.
[18] Bhat, W. A.; Quadri, S. M. K. (2010). Review of FAT Data Structure of FAT32 file system, Oriental Journal of Computer Science & Technology, Volume 3, No 1.
[19] Ruwart, T. M. (2001). File system performance benchmarks, then, now, and tomorrow. In Proceedings of the 14th IEEE Symposium on Mass Storage Systems, San Diego, CA.
[20] Smith, K. A. et al. (1997). File System Aging – Increasing the Relevance of File System Benchmarks, Proceedings of 1997 ACM SIGMETRICS Conference, June 1997, ACM.
[21] Bancroft, M. et al. (1999). Functionality and Performance Evaluation of File Systems for Storage Area Networks (SAN), Proceedings