Filesystems for Embedded Computing
Kurt Rosenfeld
December 14, 2005
Abstract
Embedded computing systems often need some of the services offered by traditional filesystems. But embedded environments differ sufficiently from workstation and server environments to motivate the development of embedded filesystems. Two motivating differences are the use of flash memory instead of hard disks, and the lack of need for the full set of conventional filesystem semantics.
This document presents three embedded filesystems and discusses their relative merits. The three filesystems reviewed did not appear at the same time, and therefore it is possible to see them as a sequence of improvements. Although the approaches are quite different, and the experimental data presented in the papers does not enable direct comparison, where possible, the experimental data is quantitatively compared.
Finally, a novel embedded filesystem concept is proposed that could further improve performance beyond what can be obtained from any of the three existing systems that are reviewed.
1
Introduction
Embedded computing is the class of computing systems that does not include conventional workstations, servers, and laptops. Typical applications are cellphones, digital cameras, and portable music players. Embedded systems tend to be severely constrained in volume, power consumption, and cost. Fortunately, these systems are not usually expected to have performance competitive with full-size computers. Also fortunate is the lack of need for compliance with an existing filesystem API.
1.1
Flash Memory
Flash memory is almost universally used for nonvolatile storage in embedded systems. The reasons for this are power, weight, and reliability. Flash memories have at least two modes, active and standby, which they can switch between quickly. In active mode, flash memories use an order of magnitude less power than hard drives. Comparing standby mode power consumptions, flash uses two orders of magnitude less power. The solid state memories weigh less than 1 gram, whereas the lightest storage devices with a spinning magnetic platter weigh 16 grams. Finally, embedded systems are often mobile and/or in harsh environments that include vibration and impact. Hard drives are unreliable in this type of environment. Flash is not affected.
Although these are compelling reasons to choose flash over a hard drive in an embedded application, flash has disadvantages of its own. First, it is impossible to flip bits from the 0 state to the 1 state without erasing (resetting) the entire segment (see footnote 1) that contains the bits. Most devices support reprogramming, which means that a value Y can be written “on top of” an existing value X, but instead of the new value being Y, as it would be in RAM, the new value when flash is rewritten is X AND Y, the bitwise AND of the two. The second disadvantage is that each segment of flash can only be erased a finite number of times. This limit is typically 10^5. Beyond this limit the segment is ruined. This is a significant concern for flash filesystem designers. Wear leveling refers to a deliberate strategy that avoids over-using particular segments. Concern for the longevity of the memory device is a factor that comes into play with flash but not with magnetic disks.
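The reprogramming and erase behavior described above can be illustrated with a toy model. This is only a sketch: the `Segment` class and its method names are hypothetical and do not correspond to any real flash driver interface, and the erase limit simply reuses the typical figure quoted above.

```python
ERASED = 0xFF          # every bit reads as 1 after an erase
ERASE_LIMIT = 10**5    # typical endurance figure quoted in the text


class Segment:
    """Toy model of one flash segment (hypothetical, not a real driver API)."""

    def __init__(self, size):
        self.cells = bytearray([ERASED] * size)
        self.erase_count = 0

    def program(self, offset, value):
        # Programming can only clear bits (1 -> 0), so the stored result
        # is the bitwise AND of the old contents and the new value.
        self.cells[offset] &= value
        return self.cells[offset]

    def erase(self):
        # Erasing is the only way to set bits back to 1, and it wears the segment.
        if self.erase_count >= ERASE_LIMIT:
            raise RuntimeError("segment worn out")
        self.cells[:] = bytes([ERASED]) * len(self.cells)
        self.erase_count += 1


seg = Segment(size=4)
seg.program(0, 0b11110000)            # X
stored = seg.program(0, 0b10101010)   # Y written on top of X
print(bin(stored))                    # X AND Y, not Y
```

Writing Y over X leaves only the bits that are 1 in both values. Recovering a byte of all 1s requires erasing the whole segment, which is what consumes the limited erase budget.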
The performance of flash is very different from that of magnetic disks. For a hard disk, write speeds and read speeds are nearly equal. Flash is an order of magnitude faster than a hard drive for reads, and an order of magnitude slower than a hard drive for writes. Most dramatic is the time to erase a segment: up to 4 seconds![3]
Hard drives are always accessed using I/O methods. Even on systems that have memory-mapped I/O, hard drives do not appear as main memory to the higher layers. In contrast, many embedded systems are configured so that their flash storage does appear as main memory. This is practical when the flash memory is programmed all at once with an image and then is used as read-only. The essential difference is that when flash is treated as an I/O device, there is a device driver and perhaps a file system between the application and the flash. If the flash device were directly mapped into the memory space of the application and used in read/write mode, the application would have to manage its own indirection mechanism, since the location of particular data could change. Furthermore, the application would most likely be responsible for wear leveling. This would be too much functionality for one module to have. For practical purposes, all systems that use flash in read/write mode interface with the chips as I/O. With a mechanism like mmap() in UNIX, the storage device may appear to the application (process) as main memory, but this is just an illusion provided by the operating system, which nevertheless treats the flash chips as a peripheral device accessed using I/O.
Footnote 1: The term segment is used in this document to mean a contiguous region of the address space of the flash memory that is a single unit from the standpoint of erasing. Segments are also called blocks and erase units in some documents, but we try to avoid ambiguity by staying with one term. Segment sizes are on the order of 64KB.
1.2
Wear Leveling
It is not clear what an optimal wear leveling strategy would be. Assuming that a count is kept of how many times each segment has been erased, we can sort the free segments by this erase count. Two possible wear leveling strategies for allocating segments are
1. always choose a segment in the bottom quartile of erase count, or
2. never choose a segment in the top quartile of erase count.
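The two allocation strategies can be sketched as filters over the free segments. This is an illustrative sketch only: the function names, the dictionary of erase counts, and the rounding of quartile boundaries are assumptions made here, not something taken from the surveyed papers.

```python
def bottom_quartile_candidates(erase_counts):
    """Strategy 1: free segments whose erase count is in the lowest quarter."""
    ranked = sorted(erase_counts, key=erase_counts.get)
    keep = max(1, len(ranked) // 4)       # always leave at least one candidate
    return ranked[:keep]


def not_top_quartile_candidates(erase_counts):
    """Strategy 2: free segments outside the most-erased quarter."""
    ranked = sorted(erase_counts, key=erase_counts.get)
    drop = len(ranked) // 4
    return ranked[:len(ranked) - drop]


# Hypothetical erase counts for eight free segments: segment id -> erase count.
free_segments = {0: 10, 1: 500, 2: 40, 3: 900, 4: 75, 5: 60, 6: 20, 7: 300}
least_worn = bottom_quartile_candidates(free_segments)
acceptable = not_top_quartile_candidates(free_segments)
print(least_worn, acceptable)
```

The first strategy restricts the allocator to a small set of lightly worn segments, while the second leaves it free to pick among most segments for other reasons, such as locality, and only protects the most worn ones.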
If flash chips are used actively in read/write mode, eventually the memory cells will die. Wear leveling determines the order in which the cells will fail, and how soon they will fail. What happens when individual cells fail depends on the resiliency of the layers above. Two reasonable strategies for planned cell death are
1. postpone the first failure as long as possible
2. postpone as long as possible the time when the number of usable cells dips below N.
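The two objectives can be stated as metrics over the times at which individual units fail. The sketch below assumes a list of per-segment failure times in arbitrary units; the function names and the sample numbers are hypothetical and serve only to show that the two objectives can rank wear profiles differently.

```python
def first_failure(failure_times):
    """Objective 1: the time at which the first unit fails."""
    return min(failure_times)


def time_usable_below(failure_times, n):
    """Objective 2: the time at which fewer than n units remain usable."""
    ordered = sorted(failure_times)
    if not 1 <= n <= len(ordered):
        raise ValueError("n must be between 1 and the number of units")
    # With k failures, len - k units remain; the count dips below n
    # at failure number len - n + 1, which is index len - n.
    return ordered[len(ordered) - n]


# Hypothetical failure times for six units under two wear profiles.
even_wear = [98, 99, 100, 100, 101, 102]
uneven_wear = [40, 70, 100, 130, 160, 200]

print(first_failure(even_wear), first_failure(uneven_wear))
print(time_usable_below(even_wear, 2), time_usable_below(uneven_wear, 2))
```

In this made-up example the even profile postpones the first failure much longer, while the uneven profile keeps at least two units alive longer, so the preferred strategy depends on how much failure the layers above can tolerate.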
1.3
Performance Metrics
There is a multitude of different parameters that can express the performance of a storage system. The three file systems surveyed here provide different APIs.
2
Kawaguchi
Kawaguchi, Nishioka, and Motoda[4] approach the flash storage problem at the device driver level. Their objective is to present the flash storage hardware as a regular block device to the filesystem layer. They use the device driver as an abstraction layer between the UNIX VFS and the flash memory hardware. To the filesystem, their device appears as a standard block device. However, the mapping of block numbers to physical flash memory addresses is determined by the authors’ device driver. Their work centers on design of this mapping and how to implement and maintain it.
2.1
Segment Cleaning for Reclamation
Kawaguchi, Nishioka, and Motoda apply the concepts of the log-structured filesystem[6], but with their own constraints and goals. The essential mechanism that they borrow is segment cleaning.[7] Segment cleaning is a process that separates the live data in the log from the invalidated data. Rosenblum and Ousterhout use segment cleaning to maintain large contiguous extents on the hard disk, and thus to avoid seeking when performing writes. Their use of it is motivated by the huge disparity between random block access time and sequential block access time in hard disks. Flash memories do not have this characteristic. Kawaguchi’s use of segment cleaning is to relocate the live data to other segments, thereby allowing the segment to be erased. The segment cannot be erased until all of the live data is copied elsewhere and metadata is modified to point to the new location.
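The cleaning step described above (copy the live data, repoint the metadata, then erase) can be sketched as follows. The `FlashLog` class, its slot layout, and its method names are hypothetical simplifications chosen for illustration and are not the data structures of the Kawaguchi driver.

```python
class FlashLog:
    """Toy block device layered over flash segments (hypothetical layout)."""

    def __init__(self, num_segments, slots_per_segment):
        # Each slot holds (logical_block, data), or None while unwritten.
        self.segments = [[None] * slots_per_segment for _ in range(num_segments)]
        self.mapping = {}                     # logical block -> (segment, slot)
        self.erase_counts = [0] * num_segments

    def write(self, segment, block, data):
        slot = self.segments[segment].index(None)    # next unwritten slot
        self.segments[segment][slot] = (block, data)
        self.mapping[block] = (segment, slot)        # any older copy is now invalid

    def read(self, block):
        segment, slot = self.mapping[block]
        return self.segments[segment][slot][1]

    def clean(self, victim, destination):
        """Copy live data out of `victim`, repoint the map, then erase it."""
        if victim == destination:
            raise ValueError("destination must be a different segment")
        for slot, entry in enumerate(self.segments[victim]):
            if entry is None:
                continue
            block, data = entry
            if self.mapping.get(block) == (victim, slot):    # still live
                self.write(destination, block, data)
        self.segments[victim] = [None] * len(self.segments[victim])
        self.erase_counts[victim] += 1


dev = FlashLog(num_segments=2, slots_per_segment=4)
dev.write(0, block=7, data=b"old")
dev.write(0, block=8, data=b"keep")
dev.write(0, block=7, data=b"new")     # invalidates the first copy of block 7
dev.clean(victim=0, destination=1)
print(dev.read(7), dev.read(8), dev.mapping)
```

Only two of the three slots written to segment 0 are copied, because the overwritten copy of block 7 is no longer referenced by the mapping. The erase happens last, after every live block has a valid copy elsewhere.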
Despite major differences between flash and disk, another characteristic that makes both of them well-served by the log-structured filesystem is the high cost of writing relative to reading. For servers, the original rationale[8] was that server operating systems could and would perform extensive read-caching, thereby answering most read requests from RAM. But write() calls really require an immediate write to the disk. Caching write()s in RAM would cause inconsistency in the event of a power outage or system crash. So Rosenblum and Ousterhout predicted that servers would have an increasingly asymmetrical workload in terms of what the disk actually sees. Writes were therefore optimized to have the lowest cost possible, maybe even at the expense of reads. Classic UNIX FFS[5] was already very good at achieving a high probability of sequential blocks in large files, and therefore had good read speed already. Flash has a much higher write cost than read cost due to the difference between read speed and write speed mentioned previously. A data layout that slows reads by a factor of 10 but speeds writes by a factor of 2 would easily be a net gain for an embedded flash storage application.
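The closing claim can be checked with simple arithmetic. The sketch below assumes that a flash write costs about 100 times as much as a flash read, which is inferred here from the order-of-magnitude comparisons with hard drives given earlier; the cost units, function names, and workload sizes are all hypothetical.

```python
READ_COST = 1.0      # assumed baseline: time to read one unit from flash
WRITE_COST = 100.0   # assumed: writes about 100x slower than reads


def total_cost(reads, writes, read_slowdown=1.0, write_speedup=1.0):
    """Relative time spent on a workload of `reads` and `writes` units."""
    return reads * READ_COST * read_slowdown + writes * WRITE_COST / write_speedup


def break_even_read_write_ratio(read_slowdown, write_speedup):
    """Reads per write at which the new layout costs the same as the old one."""
    saved_per_write = WRITE_COST - WRITE_COST / write_speedup
    lost_per_read = READ_COST * read_slowdown - READ_COST
    return saved_per_write / lost_per_read


baseline = total_cost(reads=100, writes=100)
new_layout = total_cost(reads=100, writes=100, read_slowdown=10, write_speedup=2)
ratio = break_even_read_write_ratio(read_slowdown=10, write_speedup=2)
print(baseline, new_layout, ratio)
```

Under these assumptions, a workload with equal volumes of reads and writes drops from 10100 to 6000 cost units, and the layout remains a net gain as long as reads amount to less than about 5.6 times the writes. A heavily read-dominated workload would not benefit.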
2.2
Standard UNIX Interface
The benefit of the Kawaguchi scheme is that standard filesystems such as FFS can be used on top of their block device abstraction. The filesystem code at the bottom of VFS can call bread() and bwrite() on the device with the standard semantics of this interface, so implementing the Kawaguchi system does not propagate any kernel changes outside their driver.
2.3
Performance
When the filesystem is not overly full, the full write speed of the flash chip is available. However, as the 90%-full trace in Figure 1 shows, write speed is harmed by lack of free space. The reduction in write speed is caused by the segment cleaning code being activated when the number of free segments drops below a certain threshold. Despite the log-structured file system’s strategy of separating the active data from the inactive data, when a segment needs to be reclaimed (erased) to maintain the pool of available extents for writing, there may still be a significant amount of live data that needs to be copied elsewhere, even if the segment with the smallest amount of live data is chosen for reclamation. It is this copying overhead that causes the write performance to drop when the system gets full.
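A rough way to quantify the copying overhead is sketched below. It assumes, as a simplification, that every reclaimed segment holds the same fraction u of live data and it ignores erase time. The function names and sample values are hypothetical, and the live fraction of a reclaimed segment is not the same thing as the overall fullness shown in Figure 1.

```python
def pick_victim(live_fraction):
    """Greedy choice: the segment currently holding the least live data."""
    return min(live_fraction, key=live_fraction.get)


def writes_per_user_byte(u):
    """Flash bytes written per byte of new user data.

    Reclaiming a segment of size S with live fraction u costs u*S bytes of
    copying and yields (1 - u)*S bytes of room for new data, so a total of
    S bytes is written for every (1 - u)*S bytes of user data.
    """
    if not 0 <= u < 1:
        raise ValueError("u must satisfy 0 <= u < 1")
    return 1 / (1 - u)


# Hypothetical live fractions for four segments: segment id -> fraction live.
segments = {0: 0.75, 1: 0.5, 2: 0.9, 3: 0.6}
victim = pick_victim(segments)
print(victim, writes_per_user_byte(segments[victim]))
```

With half of the victim still live, every byte of user data costs two bytes of flash writes. At a live fraction of 0.75 the cost doubles again to four, which shows how quickly throughput can fall as free space runs out.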
3
TFFS: Transactional Flash File System
Eran Gal and Sivan Toledo presented TFFS[1] at USENIX 2005. They target embedded devices with extremely limited resources. Their file system offers atomic transactions. The file system requires less than 200 bytes of RAM.
[Figure 1 plot. Axes: Write Throughput (KByte/s) and Cumulative MBytes Written. Traces: 30%, 60%, and 90% full.]
Figure 1: Kawaguchi write performance for three levels of fullness
3.1
Reduced Semantics
TFFS does not aim to provide full POSIX file system semantics. Instead, an API is provided that has just the essential features anticipated for embedded applications. Particularly, TFFS does away with hierarchical filenames, file truncation, and changing the attributes of an existing file.
3.2
Mixed transactional/non-transactional API
The API is mainly based on transactions. But non-transactional operations are also supported. If a non-transactional method is used while transactional methods are pending, the transactional semantics may become corrupted.[2] The justification for allowing this dangerous flexibility is efficiency. Transactions require a queue to be maintained, which consumes system resources. TFFS leaves it up to the judgment of the user to decide when to use transactions and when to use the straight atomic methods. The authors’ suggestion is to designate particular files as either transactional or non-transactional. In any case, this is not enforced by the file system.
[Figure 2 image: excerpt reproduced from Gal and Toledo[1], including their Figure 9 and Figure 10, which plot reclamation efficiency (in %) for the Fax, Phone, and Recorder workloads under different content scenarios and different device/file-system configurations.]
Figure 2: Percentage of non-live data on segment at the moment it is reclaimed. Data are shown for 8192KB and 448KB flash. Three different simulated workloads were used.
3.3
Implementation of Transactions
One segment of the flash memory is dedicated to use as a log for transactions. This structure makes heavy use of the ability to rewrite bytes without erasing them. As transactions are underway, status bits are changed from the 1 state to the 0 state. As is required for the transactional model, the system can be stopped and examined at any time and the data structures will be consistent. The log is consulted at startup time, so if there was a loss of power or a crash during an operation that requires time, such as erasing a segment, the operation will be repeated. The replaying of the log is standard for transactional database systems, but apparently unusual for something as small as an embedded microcontroller.
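The idea of recording progress by clearing status bits can be sketched as follows. The status values, the function names, and the log representation are hypothetical and are not the actual TFFS log format; the sketch only shows how a status byte can advance without an erase and how unfinished operations can be found at startup.

```python
ERASED = 0xFF           # unused log entry: all bits still 1

# Hypothetical status values; each stage clears one more bit.
STARTED = 0b11111110
FINISHED = 0b11111100


def advance(status, new_state):
    """Reprogram a status byte in place; legal only if no bit goes 0 -> 1."""
    if (status & new_state) != new_state:
        raise ValueError("transition would require an erase")
    return status & new_state


def pending_operations(log):
    """Entries that were started but never marked finished."""
    return [op for op, status in log if status == STARTED]


# A log as it might look after a power loss during the second operation.
log = [
    ("erase segment 3", advance(advance(ERASED, STARTED), FINISHED)),
    ("erase segment 5", advance(ERASED, STARTED)),
]
to_repeat = pending_operations(log)
print(to_repeat)
```

Because each transition only clears bits, the log entry can be updated in place, and an interrupted update leaves the entry in a state that recovery code can still interpret.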
3.4
Performance
The performance of TFFS in terms of read and write speeds is not stated directly in the paper. However, an equally useful metric is given, reproduced here in Figure 2. Reclamation efficiency is the average fraction of inactive data in segments of flash at the time they are chosen to be erased.
4
JFFS: The Journalling Flash File System
JFFS[9] is a log-structured file system for flash memory that was originally developed by Axis Communications AB in Sweden. It was intended for internal use in their product line, which includes video cameras with embedded webservers. Their cameras run Linux. JFFS exists in two distinct versions: the original JFFS, and JFFS2, developed by Red Hat.
The original JFFS was aimed at providing a more or less complete set of POSIX filesystem semantics on flash memory with wear leveling, but keeping complexity to a minimum. This could be achieved by running a standard POSIX-compliant filesystem on top of the Kawaguchi system, but JFFS’ designers considered this wasteful. Instead they mapped POSIX semantics rather directly onto the flash device. In theory they could have directly mapped filesystem blocks onto flash segments, but the usual factors made this undesirable:
1. slow erase speed,
2. finite erase cycles, and
3. large erase units (segments).
4.1
Implementation
The design that was chosen for JFFS draws once again on the classic log-structured file system paper.[6] JFFS is notable for its simplicity. There is only one type of structure in the log. It is a structure that closely resembles the regular UNIX inode. Indeed, each instance of this structure appears as an inode to the VFS. As data is appended to the end of the file, new inodes are written at the tail of the log. The inodes belonging to a single file are all strung together in a linked list. Also present in the pseudo-inode structure are additional metadata flags. These flags allow inodes to be invalidated without immediately copying the entire segment and rewriting it.
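The single-structure log described above can be sketched as an append-only list of nodes, each linked to the previous node of the same file. The field names, the `Log` class, and the way a file is reassembled are hypothetical simplifications and do not reproduce the on-flash node format of JFFS.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    inode: int              # which file this node belongs to
    offset: int             # where in the file the data goes
    data: bytes
    prev: Optional[int]     # log index of the file's previous node
    obsolete: bool = False  # flag that can be set without rewriting the node


class Log:
    def __init__(self):
        self.nodes = []
        self.latest = {}    # inode number -> log index of its newest node

    def append(self, inode, offset, data):
        # New nodes always go at the tail and link back to the file's last node.
        self.nodes.append(Node(inode, offset, data, self.latest.get(inode)))
        self.latest[inode] = len(self.nodes) - 1

    def read_file(self, inode):
        # Follow the linked list from the newest node back to the oldest.
        chain = []
        index = self.latest.get(inode)
        while index is not None:
            node = self.nodes[index]
            if not node.obsolete:
                chain.append(node)
            index = node.prev
        # Apply the nodes oldest first so that newer data wins.
        content = bytearray()
        for node in reversed(chain):
            end = node.offset + len(node.data)
            if len(content) < end:
                content.extend(b"\0" * (end - len(content)))
            content[node.offset:end] = node.data
        return bytes(content)


log = Log()
log.append(inode=1, offset=0, data=b"hello ")
log.append(inode=2, offset=0, data=b"other")
log.append(inode=1, offset=6, data=b"world")
log.append(inode=1, offset=0, data=b"HELLO ")
log.nodes[0].obsolete = True       # superseded by the newest node
print(log.read_file(1))
```

Marking the first node obsolete only flips a flag on it. The space it occupies is reclaimed later, when the garbage collector reaches the segment that contains it.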
A garbage collecting routine runs when needed to reclaim dead space. This simple cleaning process always begins at the head of the log and cleans all segments in its path. This is inefficient, and was part of the motivation for JFFS2. But the original JFFS worked and was really simple. The authors point out that it had perfect wear leveling, but that was because every segment is erased on each pass of the garbage collector through the log. This excessive cleaning would definitely lead to suboptimal flash memory life under circumstances where the filesystem is mostly full and lots of little write() calls are being done. The garbage collector will keep waking up and plowing senselessly through segments of flash. In JFFS there is nothing subtle like log threading.
JFFS2 made the cleaning process more finely granular. Segments can be independently chosen as cleaning candidates. Otherwise JFFS2 is similar to JFFS. Although robustness and performance were improved in JFFS2, some of JFFS’ most attractive feature, its simplicity, was lost.
5
Comparison of Features
All three of the surveyed papers draw heavily from the Rosenblum and Ousterhout paper. Kawaguchi creates a standard UNIX block device out of flash chips. This approach gives the system integrator maximum flexibility and convenience. A regular UNIX filesystem like FFS can be used on top of the block device interface that Kawaguchi provides. Gal and Toledo provide an integrated file system API and low-level driver. Their system, TFFS, is optimized for having low resource requirements and for being reliable. To achieve these goals they offer only a stripped-down API but employ novel data structures to maintain consistency no matter what. TFFS emphasizes reliability over performance. JFFS supports the standard UNIX API and semantics with minimal resources. But the efficiency is somewhat low, and it does not maximize the life of the flash memory. Furthermore, it was shown by the developers of JFFS2 that JFFS has consistency problems in the event of power outage or system crash.[10]
6
Further Improvement: Block Recycling
One of the main goals of a flash file system is the minimization of erase actions. This is important for performance and for the longevity of the flash device. In the filesystems surveyed in this report, use was made of the fact that bytes can be reprogrammed without an erase. If the new data can be represented as a bitwise AND of the old data and a mask, then the storage location can be reprogrammed to the new value without an erase. The papers only apply this to metadata. I propose a flash filesystem that would attempt to extend this maneuver to all data.
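The condition for recycling a region without an erase can be written as a short check. This is a minimal sketch; the function name is hypothetical. New data equals the old data ANDed with some mask exactly when the new data has no 1 bit where the old data has a 0 bit.

```python
def can_reprogram(old, new):
    """True if `new` can be written over `old` without an erase.

    Programming can only clear bits, so each new byte must satisfy
    (old_byte AND new_byte) == new_byte.
    """
    return len(old) == len(new) and all((o & n) == n for o, n in zip(old, new))


print(can_reprogram(b"\xff\xf0", b"\x0f\x30"))   # only clears bits
print(can_reprogram(b"\x0f", b"\xf0"))           # would need 0 -> 1 transitions
```

A freshly erased region (all 0xFF) accepts any data, and a region of all zeros accepts only zeros, which is why the encoding of the data affects how often a region can be reused.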
A metric that would capture the goal of this new filesystem is the average number of different useful data states that a segment goes through before being erased. An extremely compact indirection structure would be necessary for mapping the logical units of storage onto the addresses of the flash memory. The design and implementation of this structure will be critical, or else it could be a big waste of space.
An obvious first observation is that for storing text, we might want to code the most common letter as 0x00, and the next eight most common letters using bytes that have one 0-bit, and so forth. This would increase the probability that a region of an inactive extent in the log could be recycled without an erase.
A more sophisticated approach would analyze larger common patterns, and code those patterns using multibyte patterns that can be reprogrammed to become other common patterns. This scheme looks a lot like dictionary-based compression. But the goal here is not compression. In fact, even if the coding caused a net increase in storage size, it would be worthwhile if it significantly increased the lifespan of the flash memory chip. And as mentioned previously, reducing erase actions also improves system performance. In conclusion, the main goal of a flash file system is to minimize the number of erase actions.
References
[1] Eran Gal and Sivan Toledo. A transactional flash file system for microcontrollers. USENIX 2005.
[2] Eran Gal and Sivan Toledo. A transactional flash file system for microcontrollers. USENIX 2005, page 9.
[3] Intel. P30 StrataFlash Embedded Memory, November 2005.
[4] A. Kawaguchi, S. Nishioka, and H. Motoda. A flash-memory based file system. Proceedings of the USENIX 1995 Technical Conference, 1995.
[5] Marshall K. McKusick, William N. Joy, Samuel J. Leffler, and Robert S. Fabry. A fast file system for UNIX. ACM Transactions on Computer Systems, 2(3):181–197, 1984.
[6] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):26–52, 1992.
[7] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):4, 1992.
[8] Mendel Rosenblum and John K. Ousterhout. The design and implementation of a log-structured file system. ACM Transactions on Computer Systems, 10(1):1, 1992.
[9] David Woodhouse. JFFS: The journalling flash file system. In Ottawa Linux Symposium. RedHat Inc., 2001.
[10] David Woodhouse. JFFS: The journalling flash file system. In Ottawa Linux Symposium. RedHat Inc., 2001.