
2.6 Disk technology

2.6.2 Solid State Drives (SSD)

With the latency and throughput of the traditional electromechanical hard drive becoming increasingly problematic relative to the more rapidly improving CPU and memory performance, the need for a better solution arises. Especially with the advent of multi-core CPUs, where each core typically executes an application or thread with its own I/O needs, solid random-I/O performance becomes essential, as the I/O patterns of the cores are likely to differ and therefore interfere. An answer can be found in semiconductor memory that does not lose its state when power is turned off: non-volatile memory (NVM). Such memory has been around since the 1970s, in the form of EPROM (erasable programmable read-only memory) and even battery-backed DRAM chips. However, none of these technologies ever became a reliable and cost-effective alternative to magnetic disk.

Only with the advent of NAND flash [Ass95] did the viability of a new mass storage device, capable of replacing the magnetic hard drive, arise. By incorporating NAND flash storage behind a traditional hard disk interface, the solid state drive came into existence. Since roughly 2005, when NAND flash process technology surpassed DRAM's 90nm, the technology has become mature and economical enough for mainstream adoption, delivering what is arguably the biggest boost to commodity hardware performance of the 21st century. Table 2.5 lists some milestones in both commodity and high-end SSDs. A brief overview of SSDs and the underlying NAND flash technology follows. An in-depth treatment of the topic can be found in [MME13].

Figure 2.14: NAND floating gate (left: control gate, ONO oxide, floating gate, tunnel oxide, source and drain on a P-Si substrate) and NAND Flash SSD (right: flash controller driving NAND flash chips over 8-bit flash channels).

Table 2.5 shows that SSDs succeed at significantly reducing access latency compared to a traditional HDD (see the earlier Table 2.4). However, we also see that this latency has not been improving: the novel technology does bring a performance boost, but once again latency seems hard to improve, while we clearly see improvements in capacity and bandwidth. Besides latency, SSDs have one more advantage over HDDs: being built from semiconductor technology, they are easier to parallelize. The result is that not only has read/write bandwidth been increasing, but also random I/O throughput, i.e. the IOPS column, which represents the number of fixed-size I/Os (typically 4-8KB pages) per second. There is thus an increasing trend in the number of I/O requests that SSDs can service concurrently, without suffering from the random seek latency seen in HDDs. Random I/O bandwidth is still worse than the optimal (sequential) bandwidth numbers found in the table, i.e. 50000 4KB IOPS is roughly 195MB/s for the Intel 520, compared to 550MB/s peak sequential bandwidth, but the gap is significantly smaller than for HDDs.
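As a sanity check on these numbers, converting IOPS to bandwidth is a simple multiplication. A minimal sketch, using the Intel 520 figures quoted above and binary megabytes (the numbers are illustrative only):

#include <stdio.h>

int main(void) {
    double iops       = 50000;        /* random 4KB I/Os per second */
    double page_bytes = 4096;         /* 4KB I/O size */
    double mb         = 1024 * 1024;  /* binary megabyte */

    /* random-I/O bandwidth = IOPS x I/O size */
    printf("random: %.0f MB/s (vs. 550 MB/s peak sequential)\n",
           iops * page_bytes / mb);
    return 0;
}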

Looking at factors other than performance, SSDs have two main disadvantages compared to HDDs: price per GB and capacity, with SSDs being around 5-10 times more expensive per GB, and HDDs typically offering 4-8 times the storage capacity. In many other areas, such as mean time before failure (MTBF), power consumption, shock resistance, noise, weight and physical size, SSDs beat HDDs, positioning them as a solid alternative. To maximize the benefits of this new technology, it helps to understand some of its workings and peculiarities, which are discussed in the following sections.

Flash Storage

To retain digital information, NAND flash relies on a floating gate transistor (FGMOS) [KS67] with two overlapping gates rather than one, as depicted in Figure 2.14 (left). The floating gate (FG) is entirely surrounded by oxide, thereby isolating it and providing a "trap" where electrons can be stored for years, even when disconnected from power. These trapped electrons influence the conductivity of the tunnel oxide between source and drain, which can then be used to read the state by applying a voltage to the source and sensing either a low or high voltage on the bit line. To program the floating gate, the control gate can be used in conjunction with the source line to either charge (i.e. program) the FG by drawing electrons up from the tunnel oxide, or release them to discharge the FG (i.e. erase).

We can distinguish two types of NAND flash. In single level cell (SLC) NAND, each floating gate distinguishes only two voltage levels, thereby being able to represent only one bit of information. In multi level cell (MLC) NAND, more than two voltage levels are differentiated, representing more bits of information. For example, four levels representing two bits of information is very common, but even triple level cell (TLC), which differentiates between eight levels to store three bits per cell, is becoming popular. The advantage of MLC is that it requires the same number of transistors while providing several times the storage capacity of SLC, thereby reducing the cost per gigabyte. The biggest disadvantage is that both programming and reading an MLC cell take more time than an SLC cell. Furthermore, floating gates have the property that they become unreliable after a certain number of program/erase cycles, as the insulating material around the gates wears out, reducing their capability to hold a charge. For MLC NAND, this number is much lower (roughly 5000-10000 cycles) than for SLC NAND, which lasts for around 100000 cycles.
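The relation between voltage levels and capacity is simply levels = 2^(bits per cell); the small sketch below makes the SLC/MLC/TLC progression explicit (for illustration only, not tied to any particular chip):

#include <stdio.h>

int main(void) {
    const char *names[] = { "SLC", "MLC", "TLC" };

    /* each extra bit per cell doubles the number of voltage levels
       that the sense circuitry must be able to tell apart */
    for (int bits = 1; bits <= 3; bits++)
        printf("%s: %d bit(s)/cell, %d voltage levels\n",
               names[bits - 1], bits, 1 << bits);
    return 0;
}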

Floating gate cells are replicated to build larger memory chips, which can then be used to build an SSD, as in Figure 2.14. A bus connects each memory chip to the controller, which is responsible for processing and scheduling incoming read and write requests. As with DRAM, each channel uses a double data rate (DDR) interface, and typically multiple channels are employed to boost performance by increasing parallelism.

A flash chip is composed of one or more sub-chips, or dies, each of which might have multiple planes. Each plane is organized as a collection of blocks (typically 128KB-512KB), which are the basic unit of the erase operation. Each individual block can only endure a limited number of erase cycles. A block consists of several pages (4-8KB), which are the units of operation for reads and writes. Each page consists of a user area, used for data storage, and a spare area, used for status information and error correction codes (ECC), which allow for the correction of minor errors during reading. The page size depends on the number of NAND chips on the SSD (i.e. storage capacity) and the number of channels (i.e. parallelism); if these are increased, both page size and throughput increase along with them. As flash densities increase due to shrinking process technology, we can expect to see a related growth in page sizes. A negative side effect is that the durability of the smaller cells gets worse as well.
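This chip/die/plane/block/page hierarchy maps naturally onto nested data structures. A sketch in C, with illustrative sizes taken from the ranges above rather than from any specific drive:

#include <stdint.h>

#define PAGE_USER_BYTES  4096  /* user area: data storage */
#define PAGE_SPARE_BYTES  128  /* spare area: status info + ECC (varies per chip) */
#define PAGES_PER_BLOCK    64  /* 64 x 4KB pages = 256KB erase block */

struct flash_page {                        /* unit of read and write */
    uint8_t user[PAGE_USER_BYTES];
    uint8_t spare[PAGE_SPARE_BYTES];
};

struct flash_block {                       /* unit of erase, limited endurance */
    struct flash_page pages[PAGES_PER_BLOCK];
    uint32_t          erase_count;
};

struct flash_plane { struct flash_block *blocks; uint32_t nblocks; };
struct flash_die   { struct flash_plane *planes; uint32_t nplanes; };
struct flash_chip  { struct flash_die   *dies;   uint32_t ndies;   };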

Flash Controller

The flash controller is a simple CPU with its own DRAM. It provides the interface to the host for reading and writing data, and is also responsible for interfacing with the NAND storage and performing error correction during reads. One of its most important components is the flash translation layer (FTL), which provides a mapping between logical blocks, as seen by the host, and physical blocks, containing the actual data. This is illustrated in Figure 2.15. The FTL has three main responsibilities, each of which is discussed below: garbage collection, bad block management and wear leveling.
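Conceptually, the heart of the FTL is a translation table consulted on every host request. A minimal block-level sketch of this lookup (the structure and names are hypothetical; real FTLs are far more elaborate):

#include <stdint.h>

#define FTL_UNMAPPED 0xFFFFFFFFu

struct ftl {
    uint32_t *l2p;      /* logical block number -> physical block number */
    uint32_t  nblocks;  /* number of logical blocks exposed to the host */
};

/* Translate a host-visible logical block number into the physical
   block that currently holds its data. */
static uint32_t ftl_lookup(const struct ftl *f, uint32_t logical)
{
    if (logical >= f->nblocks)
        return FTL_UNMAPPED;
    return f->l2p[logical];  /* FTL_UNMAPPED if never written */
}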

Figure 2.15: Flash Translation Layer (FTL) block mapping (legend: logical, physical, erasable, overflow, reserved and bad blocks; A = available).

Due to the nature of flash chips, bits cannot be changed arbitrarily at the individual level. This implies that a page, once written, cannot be written to again until it is erased (all bits set to 1, in the case of NAND flash). Therefore, small writes result in a page being fully rewritten into a fresh empty page. When this happens, the old page is marked invalid and ready to be garbage collected by the FTL. If no empty page can be found within the original parent block, an overflow block needs to be allocated.
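In code, this out-of-place update discipline might look as follows; a simplified sketch with hypothetical types and helpers (real FTLs issue raw NAND program commands rather than memcpy):

#include <stdint.h>
#include <stddef.h>
#include <string.h>

#define PAGE_BYTES      4096
#define PAGES_PER_BLOCK   64

enum page_state { PAGE_FREE, PAGE_VALID, PAGE_INVALID };

struct page  { enum page_state state; uint8_t data[PAGE_BYTES]; };
struct block { struct page pages[PAGES_PER_BLOCK]; };

/* Return a free page in blk, or NULL if none is left. */
static struct page *find_free_page(struct block *blk)
{
    for (int i = 0; i < PAGES_PER_BLOCK; i++)
        if (blk->pages[i].state == PAGE_FREE)
            return &blk->pages[i];
    return NULL;
}

/* Out-of-place update: a written page cannot be rewritten until its
   block is erased, so the new image goes to a fresh page; the stale
   page is only marked invalid, to be reclaimed by garbage collection. */
static struct page *ftl_update_page(struct block *parent,
                                    struct block *overflow,
                                    struct page *old,
                                    const uint8_t *data)
{
    struct page *fresh = find_free_page(parent);
    if (fresh == NULL)                      /* parent full: use overflow block */
        fresh = find_free_page(overflow);
    memcpy(fresh->data, data, PAGE_BYTES);  /* stands in for the NAND program op */
    fresh->state = PAGE_VALID;
    if (old != NULL)
        old->state = PAGE_INVALID;          /* old copy now garbage */
    return fresh;
}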

Garbage collection is responsible for reclaiming free space by selecting candidate blocks to be rewritten and erased. To do this, it copies any valid pages in the target block into an already erased block, after which it can erase the target block, reclaiming the space of the invalid pages it contained. Garbage collection can be performed in the background when the drive is idle, or at write time, which is better for write-intensive environments where the drive is rarely idle.
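Continuing the sketch above (same hypothetical types), the core of the reclamation step could look like this:

/* Reclaim a victim block: relocate its remaining valid pages into an
   already erased spare block, then erase the victim wholesale. The
   relocation copies are extra writes the host never asked for. */
static void gc_reclaim(struct block *victim, struct block *spare)
{
    for (int i = 0; i < PAGES_PER_BLOCK; i++) {
        if (victim->pages[i].state != PAGE_VALID)
            continue;
        struct page *dst = find_free_page(spare);
        memcpy(dst->data, victim->pages[i].data, PAGE_BYTES);
        dst->state = PAGE_VALID;
    }
    for (int i = 0; i < PAGES_PER_BLOCK; i++)  /* the block-wide erase */
        victim->pages[i].state = PAGE_FREE;
}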

A complicating factor for the FTL is that the floating gates in a block wear with each erase. To improve durability and reliability, the FTL tries to select destination blocks intelligently, aiming to distribute writes evenly among the available blocks and avoid skew in block wear. This implies that the contents of blocks with a low write count (i.e. those holding read-mostly data) might end up being moved around, to allow data that are changed more frequently to be written to those blocks.
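A simple wear-leveling policy, again in the spirit of the sketches above, steers each write to the block that has been erased the least (per-block erase counters are assumed):

#include <stdint.h>

struct wl_block { uint32_t erase_count; /* wear counter, bumped on each erase */ };

/* Pick the least-worn block as the destination for the next write,
   spreading erase cycles evenly instead of hammering a few blocks. */
static struct wl_block *wl_pick_destination(struct wl_block *blocks, int n)
{
    struct wl_block *best = &blocks[0];
    for (int i = 1; i < n; i++)
        if (blocks[i].erase_count < best->erase_count)
            best = &blocks[i];
    return best;
}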

Both garbage collection and wear leveling involve moving data around, introducing extra writes that are unrelated to the actual data that the host is trying to write. This phenomenon is called write amplification, defined as

    Write Amplification = Data Written To Flash Memory / Data Written By Host    (2.1)
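For example, if garbage collection has to relocate three valid 4KB pages to free up space for a single 4KB host write, 16KB reaches the flash for 4KB of useful data: a write amplification of 4.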

Write amplification is undesirable, as it consumes extra bandwidth towards the storage layer, reducing effective random write throughput. Sequential writes do not suffer from write amplification.

No matter how smart the wear leveling is, an intrinsic limitation of NAND flash is the presence of bad blocks, i.e. blocks containing one or more locations whose reliability cannot be guaranteed. These can exist either from factory production errors or due to wear. Bad block management is responsible for identifying these physical blocks and remapping their logical block to a spare physical one. Bad blocks are remapped to reserved spare blocks on the drive, which are made available by over-provisioning, i.e. a difference between the physical capacity of a drive and the logical capacity presented to the OS and the user. Over-provisioning is not only used to accommodate bad blocks, but also to provide sufficient free space to minimize the negative impact of write amplification during garbage collection and wear leveling.
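For example, a drive built from 512GiB (roughly 550GB) of raw NAND but presented to the OS as 512GB keeps about 7% of its capacity hidden for spare blocks and garbage collection headroom; enterprise drives typically reserve considerably more.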

The above complications put a lot of responsibility and complexity in the FTL, which can become somewhat of a black box with respect to SSD performance, especially where writes are involved. In general, however, it is safe to state that writes, especially small ones, should be kept to a minimum whenever possible, with rewrites of large sequential regions presenting the ideal scenario. Big rewrites also ensure a significant amount of empty space, minimizing the negative impact of write amplification. These properties are inherently present in a log-structured file system [RO92], like Linux's LogFS or the flash-specific JFFS. Journaling file systems, like the widely used ext3 and ext4 on Linux, are actually especially ill-suited for flash, as the journaling introduces many small writes.
