Flash SSD - Flash-aware Database Management Systems

2. Background

2.3. Flash

2.3.2. Flash SSD

The market nowadays offers multiple classes of storage devices based on NAND Flash memory. These include (i) removable/portable devices, such as eMMC Flash cards (e.g., SDHC, miniSD) and USB Flash drives; (ii) embedded storage, e.g., Flash memory chips in smartphones, routers, etc.; (iii) Flash Solid State Disks (Flash SSDs, or just SSDs hereafter) - mass storage devices for desktop PCs, laptops and servers; and (iv) NVDIMMs - a kind of non-volatile RAM made by combining a Flash SSD and traditional DRAM in one module. Each class, in turn, is represented by a variety of devices differing in form factor, physical interface, software protocol, hardware architecture and firmware. As this work is dedicated to the optimal use of Flash storage for DBMSs on servers, we are interested only in SSDs. In this section we briefly describe the architecture, working principles, characteristics and diversity of modern SSDs.

An SSD is a storage device that uses NAND Flash memory as its persistent medium, but internally hides most of the native behavioral characteristics of that memory in order to expose the standard block device interface. In other words, an SSD creates an abstraction layer over the Flash memory that emulates the behavior of a traditional HDD. Since the internals behind this abstraction are completely hidden from the device users (OS, FS, DBMS, etc.), SSDs are often referred to as black-box devices. The software layer (firmware) responsible for the abstraction is called the Flash Translation Layer (FTL); it runs either on the on-device controller (most products) or on the host system as a device driver (very few products). The SSD controller (Figure 2.7) may range from a simple single-core controller (e.g., ARM) to multi-core controllers combined with FPGA or GPU units. To execute the firmware code and cache frequently accessed metadata, SSDs commonly have a small SRAM module (e.g., 128KB), while for caching other FTL metadata and buffering request data (read and write cache) the controller uses an on-device DRAM module (e.g., 1GB in enterprise SSDs).

A common SSD nowadays has multiple memory chips (Figure 2.7), where each chip has either two or four NAND dies (dual- or quad-die NAND chips). Each die, in turn, has two or four separate planes of blocks, with one page buffer register and one cache buffer register per plane. The fixed-size blocks typically contain 64, 128 or 256 Flash pages. A page is logically divided into a main area (e.g., 2-8KB), which is available to the host for storing user data, and a small (e.g., 64-256 bytes) out-of-band (OOB) area, which is used only internally by FTL algorithms. Groups of chips are connected to the SSD controller via a shared I/O bus (channel).

Embedded Multi-Media Controller (eMMC) Flash - flash storage, where the NAND Flash memory and the controller are integrated on the same silicon wafer.

Figure 2.7.: Simplified architecture of SSD. (Components shown: controller (ARM/FPGA) with SRAM and DRAM; groups of NAND chips attached via I/O buses; each chip consists of dies, each die of planes, each plane of blocks; host interface: SATA, PCI, DIMM, M.2.)

This architecture supports multiple levels of I/O parallelism. First, different operations can be executed simultaneously on chips belonging to different channels. Further, each die can execute I/O operations independently, but must coordinate data transfers over the shared I/O bus with the other dies in its channel group (interleaving of operations). The smallest unit of parallelism is the plane. Operations can be executed on all planes of a die simultaneously, but they must be of the same type (e.g., a read operation on all four planes of a certain die), and the addresses of the accessed pages must conform to the requirements of multi-plane operations (e.g., have identical offsets within each plane). Altogether, this gives the SSD many possibilities to parallelize the execution of incoming I/O requests. For instance, if the SSD consists of 16 quad-die chips (64 dies in total), where each die has two planes, then the controller can theoretically execute as many as 128 operations simultaneously. In practice, however, the available I/O parallelism of SSDs is typically underutilized. The main reasons for this and the proposed solutions are discussed in Chapters 5 and 6.
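The arithmetic of the example above, together with the multi-plane constraint, can be sketched in a few lines of Python. This is only an illustration: the names `max_parallel_operations`, `PlaneOp` and `is_valid_multiplane` are hypothetical, and the validity check models just the two example requirements mentioned in the text (same operation type, identical page offsets); real devices may impose further vendor-specific rules.

```python
from dataclasses import dataclass


def max_parallel_operations(chips, dies_per_chip, planes_per_die):
    """Theoretical upper bound on simultaneous operations, counted per plane."""
    return chips * dies_per_chip * planes_per_die


@dataclass(frozen=True)
class PlaneOp:
    kind: str         # "read", "program" or "erase"
    plane: int        # plane index within one die
    page_offset: int  # offset of the accessed page within its plane


def is_valid_multiplane(ops):
    """Simplified check whether ops may be combined into one multi-plane command."""
    if not ops:
        return False
    same_kind = len({op.kind for op in ops}) == 1
    distinct_planes = len({op.plane for op in ops}) == len(ops)
    same_offset = len({op.page_offset for op in ops}) == 1
    return same_kind and distinct_planes and same_offset


# 16 quad-die chips with two planes per die, as in the text.
print(16 * 4)                             # 64 dies
print(max_parallel_operations(16, 4, 2))  # 128
```

Two reads at the same page offset on planes 0 and 1 pass the check, whereas mixing a read with a program operation, or using different offsets, does not.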

Today’s storage market offers SSDs in a variety of form factors, with about a dozen possible bus interfaces (connectors), supporting multiple data transport standards and protocols. We summarize the most popular available options in Tables 2.2 and 2.3. Because the alternatives of physical and logical interfaces used by SSD manufacturers are largely orthogonal to the current research, and because a comprehensive analysis would be complex and voluminous, we omit a detailed description of those options.

Table 2.2.: Variety of SSDs on the storage market (Part 1). (Cell layout lost in extraction. Bus interfaces (connectors): SATA, mSATA, SAS, SATAe, PCIe: x1, x2, x4, x8, x16, x20. Form factors: 2.5’’, 3.5’’, PCB, PCIe cards. Transports*: SATA 3G, SATA 6G, SCSI, PCIe: 1.0, 2.0, 3.0. Protocols/drivers: AHCI, SCSI, NVMe.)

FTL

The performance characteristics of SSDs are essentially defined by three major aspects - the hardware architecture (e.g., Flash memory, controller, physical interface), the FTL and the communication protocol (e.g., ATA, SCSI, NVMe). The role of the FTL in the overall SSD architecture is probably the most important, since it defines to what extent the performance potential of the underlying Flash memory can be utilized. Consider that there are only a few (about 5) manufacturers of Flash memory chips and about a dozen alternative hardware interfaces and protocols, yet there are several hundred manufacturers of SSDs. Thus, the software layer - the FTL - is the key aspect that allows them to coexist and compete with each other. This is also the reason why SSD manufacturers never publish the details of their FTLs and treat them as top-secret information. Both aspects - the importance of the FTL and the lack of information from industry - have forced academia to put a lot of effort into the research and development of FTLs over the past two decades, and the topic remains relevant and actively researched today.

The research community has proposed dozens of different FTL variants (often called FTL schemes). It is important to mention that none of the proposed FTL schemes can be seen as ultimately better than the others, and even a pair-wise comparison cannot clearly identify the generally better of two schemes. There are several reasons for this. First, an FTL scheme is characterized by multiple criteria, such as performance and predicted device longevity, memory and storage footprints, computational complexity and reliability. These are usually mutually dependent, which means that optimizing one parameter (e.g., better longevity guarantees) leads to the degradation of another (e.g., decreased performance and higher complexity). Since customers prioritize their storage requirements differently, a certain FTL scheme might be the best choice for one system while being less useful for another. The second reason is that the performance and longevity of a given FTL are highly workload-dependent. The read/write ratio of I/O requests, their frequency, size, locality and data skew are the typical workload parameters whose interplay makes one (or several) FTL scheme(s) preferable.

Table 2.3.: Variety of SSDs on the storage market (Part 2). (Cell layout lost in extraction. Bus interfaces (connectors): m.2, u.2, DDR DIMM, FC. Form factors: PCB, 2.5’’, 3.5’’, DIMM. Transports*: SATA 6G, PCIe, SATA, SAS, DDR3, FC. Protocols/drivers: AHCI, SCSI, NVMe, FCP-SCSI, FC-NVMe.)

* Maximal bandwidth of transport protocols:
SATA 3G = 3 Gb/s
SATA 6G = 6 Gb/s
PCIe 1.0 = 250 MB/s per lane per direction
PCIe 2.0 = 500 MB/s per lane per direction
PCIe 3.0 = 1 GB/s per lane per direction
SAS-3 = 12 Gb/s

An FTL scheme consists of multiple algorithms (Figure 2.8), each of which addresses a certain property of Flash memory. Address translation and garbage collection (GC) deal with the erase-before-overwrite property. Wear-leveling (WL) and bad block management (BBM) address the wear property of Flash memory. Error detection and correction (ECC) techniques are needed because of the different types of errors that occur on Flash memory. Queuing and scheduling of requests are required to exploit the latency asymmetry and the available degree of parallelism. Below we provide a brief description of those FTL algorithms and their main variants.

Figure 2.8.: Main tasks of FTL. (Tasks shown: bad-block management (BBM); error detection and correction (ECC): BCH, BLCH; wear-levelling (WL): static, dynamic; garbage collection (GC): active, passive; address translation: page-based, block-based, hybrid; queuing/scheduling/caching of requests; collecting and maintaining metadata and statistics.)

Address Translation and Garbage Collection

One of the main challenges FTL designers face is how to mask the erase-before-overwrite property of Flash memory, so that the SSD can be accessed through the traditional block device interface and behaves like an HDD. The naive approach would be the following: each time the SSD receives a write request that modifies previously written data, the FTL first silently erases the corresponding Flash block (i.e., the block where the old data resides), and then writes the modified data to the original place. This would work perfectly if data were always updated in multiples of whole Flash blocks. For instance, if a block consists of 128 4KB Flash pages (512KB in total), each update would have to overwrite all 128 pages. While there are some application scenarios with such an update pattern (e.g., circular DBMS logs written in chunks equal to a Flash block), applications in general tend to update much smaller chunks of data. Common units of read and write I/Os are an OS page (e.g., 4KB), a DBMS page (e.g., 4KB-32KB) or just a single disk sector (512B). Consider, for instance, the case where just one Flash page needs to be updated. Since the corresponding Flash block also contains many other pages (e.g., 127 others in a 128-page block), the erase operation would destroy their content. Thus, before erasing the block, the FTL must read and temporarily store (e.g., in on-device DRAM) all pages of the block except the modified one, then perform the erase operation, and afterwards write back those pages from the temporary store together with the modified page. Hence, for a single page update the FTL would need to perform 127 read, one erase and 128 write operations, which would obviously introduce a huge overhead and make even the fastest Flash memory slower than the slowest HDD today in terms of write request latency.
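The operation count of the naive approach can be reproduced with a short Python sketch. The function names are hypothetical, and the latencies are assumptions chosen only for illustration: the program latency of 500µs is an assumed typical value for a page program operation, while the read (50µs) and erase (2ms) latencies are illustrative values not taken from this text.

```python
def naive_update_ops(pages_per_block, pages_updated=1):
    """Operations needed to update pages in place via read-erase-rewrite."""
    reads = pages_per_block - pages_updated  # save all untouched pages
    erases = 1                               # erase the whole block
    writes = pages_per_block                 # write back saved + modified pages
    return reads, erases, writes


def naive_update_latency_us(pages_per_block, read_us=50, program_us=500, erase_us=2000):
    """Serial latency estimate in microseconds; all latencies are assumed values."""
    reads, erases, writes = naive_update_ops(pages_per_block)
    return reads * read_us + erases * erase_us + writes * program_us


print(naive_update_ops(128))         # (127, 1, 128)
print(naive_update_latency_us(128))  # 72350, i.e. about 72 ms
```

Under these assumed latencies, a single 4KB update costs roughly 72ms when the operations are executed one after another, compared with about 0.5ms for a single page program.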

Apart from the latency overhead, the naive approach can easily lead to a significant Flash wear problem. Typically, every workload exhibits a certain locality of requests, which means that some portions of the data are accessed and modified more frequently than others. To describe the access patterns of database workloads (especially OLTP workloads), analysts therefore often use the Pareto 80/20 rule, meaning that 80% of all user requests touch only 20% of the data. With such a workload under the naive approach, Flash blocks containing pages with hot data would be erased frequently, while blocks with outdated or static data (cold blocks) might undergo only a few erases over the same period. This skew in erase counts would result in uneven wear of the Flash memory, and the regions holding hot blocks might soon become unusable (due to burn-out), leading to premature failure of the whole SSD. Thus, both issues make the naive approach completely unsuitable for SSDs.
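A back-of-the-envelope calculation shows how strong this skew is. The sketch below is a simplification under stated assumptions: every update triggers exactly one erase of the block it hits (the naive approach), hot data occupies exactly 20% of the blocks, and updates are spread uniformly within the hot and within the cold set. The function name `erase_skew` and the block and update counts are hypothetical.

```python
def erase_skew(total_blocks, total_updates, hot_data_pct=20, hot_request_pct=80):
    """Average erases per hot and per cold block under the naive in-place scheme."""
    hot_blocks = total_blocks * hot_data_pct // 100
    cold_blocks = total_blocks - hot_blocks
    hot_updates = total_updates * hot_request_pct // 100
    cold_updates = total_updates - hot_updates
    hot_erases = hot_updates / hot_blocks     # one erase per update
    cold_erases = cold_updates / cold_blocks
    return hot_erases, cold_erases, hot_erases / cold_erases


hot, cold, ratio = erase_skew(total_blocks=1000, total_updates=1_000_000)
print(hot, cold, ratio)  # 4000.0 250.0 16.0
```

Under the 80/20 rule a hot block is thus erased on average 16 times as often as a cold one, independent of the absolute number of blocks and updates.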

The common solution applied in all modern SSDs is to implement a variant of an out-of-place update strategy. Its basic idea is to decouple the way the host system (OS, FS, DBMS, etc.) organizes the data (logical data placement) from the way the SSD does so (physical data placement). This is done via an address indirection layer, which is typically realized as an address translation table. For every logical address (at a certain granularity), it stores the corresponding physical address pointing to the place where the data actually resides on Flash memory. When the host submits a write request to the SSD (e.g., write a 4KB page at address 100), the FTL decides on its own (how is described later) where to place the data on Flash memory (e.g., write the data to physical page 478), and then updates the corresponding mapping information (e.g., 100 -> 478). If the host modifies the same data later on, it is again written to a new location on Flash (e.g., 100 -> 1987). In other words, the data is written/updated out of its original place (out-of-place). To read data with a certain logical address, the FTL consults its translation table to obtain the physical address where the data currently resides.
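The indirection described above can be illustrated with a minimal Python sketch that replays the example mappings 100 -> 478 and 100 -> 1987. The class `TranslationTable` and its methods are hypothetical names, and, as a simplification, the caller passes in the physical page chosen for each write, whereas a real FTL selects it internally.

```python
class TranslationTable:
    """Toy logical-to-physical address translation table."""

    def __init__(self):
        self.l2p = {}       # logical address -> current physical page
        self.stale = set()  # physical pages holding superseded versions

    def write(self, logical, physical):
        """Record that the data of `logical` now resides at `physical`."""
        old = self.l2p.get(logical)
        if old is not None:
            self.stale.add(old)  # old version stays on Flash, but is outdated
        self.l2p[logical] = physical

    def read(self, logical):
        """Return the physical page that currently holds `logical`."""
        return self.l2p[logical]


table = TranslationTable()
table.write(100, 478)   # first write:  100 -> 478
table.write(100, 1987)  # later update: 100 -> 1987 (out-of-place)
print(table.read(100))  # 1987
print(table.stale)      # {478}
```

The host keeps using logical address 100 throughout; only the table entry changes, and the previous physical page is remembered as holding an outdated version.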

This address indirection serves two main purposes. First, the FTL designers gain complete freedom regarding where and how to place the data on Flash memory, which, in turn, gives them the possibility to mitigate the performance and wear issues resulting from the erase-before-overwrite principle. Second, since address translation is completely invisible to the system outside the SSD, the host loses control over the physical data placement, and is thereby also freed from the responsibility of dealing with the constraints of Flash memory, such as the erase-before-overwrite principle and memory wear. Thus, address indirection is the key strategy of the FTL to mask all native behavioral characteristics of Flash memory and make the SSD behave like a traditional HDD.

Because every update places the data in a new location on Flash memory, the previous version of the data in its original place becomes outdated, and thus invalid. To delete this data, the FTL must erase the corresponding Flash block. However, for performance reasons the FTL typically tries to postpone those erase operations, and thus the invalidated versions of data items remain stored on Flash memory for a certain time along with their up-to-date versions. To avoid running out of space because of the accumulated outdated data, the FTL periodically runs a process called garbage collection (GC), which selects blocks containing invalidated data and erases them, after moving the valid data in those blocks to new locations. Finding victim blocks, performing multiple read/write operations to copy valid pages to new destinations and erasing the blocks incur significant computational, memory and especially I/O overhead, which makes GC the most resource- and time-consuming process of the FTL. Since every die can perform only one operation at a time (e.g., one multi-plane read operation), the FTL must efficiently coordinate the usage of Flash memory by executing on every available die either a GC operation or a read/write operation submitted by the host. While GC holds exclusive control over a certain part of Flash memory, incoming user requests to those dies are postponed. The interleaving of GC with the processing of user requests results in two significant issues that all SSD manufacturers face. First, it introduces additional delay to user requests, and thus reduces the I/O throughput of the device. For instance, although the latency of a page program operation on Flash memory is typically in the range of 250µs to 750µs, a single write request can take as much as 80ms (Figure 2.9) or more (up to 680ms were measured in [16]) when it interferes with GC. Second, the amount of work performed by GC varies significantly depending on several factors, such as the current load, access patterns, the state of the SSD, etc. Thus, the interference with user requests is also variable, which often results in significant fluctuations in the I/O throughput provided by the SSD.
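To make the variability of the GC workload concrete, the following Python sketch estimates how long a die is occupied while one victim block is collected. It is a rough model under assumptions: each valid page costs one read plus one program, the block is erased once, all operations run serially on one die, and the latencies are illustrative values (the program latency lies within the 250µs to 750µs range stated above; the read and erase latencies are assumed). The function names are hypothetical.

```python
def collect_block(page_states):
    """Count the work needed to reclaim one block.

    page_states holds one of "valid", "invalid" or "free" per page.
    """
    copybacks = sum(1 for state in page_states if state == "valid")
    return {"copybacks": copybacks, "erases": 1}


def gc_busy_time_us(copybacks, read_us=50, program_us=500, erase_us=2000):
    """Time the die is busy with GC, in microseconds (assumed latencies)."""
    return copybacks * (read_us + program_us) + erase_us


mostly_stale = ["valid"] * 5 + ["invalid"] * 123
less_stale = ["valid"] * 25 + ["invalid"] * 103
print(collect_block(mostly_stale))  # {'copybacks': 5, 'erases': 1}
print(gc_busy_time_us(5))           # 4750
print(gc_busy_time_us(25))          # 15750
```

A host request arriving at the die during this interval would have to wait, which is one way in which a sub-millisecond program operation can turn into a write request lasting many milliseconds.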

The granularity of the address translation table, the algorithms responsible for finding a new location for modified data, and the GC algorithms that define which blocks to select for erasing and where to place their valid data are the most significant and fundamental characteristics of all FTL schemes. There are basically three groups of FTL schemes: page-level, block-level and hybrid. They are named after the granularity of the address translation table, because this parameter also influences the other algorithms of the FTL.

Page-Level FTL Scheme

Figure 2.9.: I/O bandwidth fluctuations of an enterprise SSD. Source: Petrov et al. [77]. (Plot: I/O latency [ms] over the number of I/O requests; ≈ 10000 x 4KB Rand READ.)

In a page-level FTL scheme the address translation table has an entry for every Flash page of the underlying memory. The simplest implementation of such a table would be a one-dimensional array whose size equals the number of Flash pages in the SSD. The offsets in the array would represent the immutable logical addresses of pages (logical page number - LPN), while the elements at these offsets would store the current physical addresses of those pages (physical page number - PPN). Because the page is the smallest addressable unit in NAND Flash memory, having a slot in the address translation table for every page gives this FTL scheme the highest possible flexibility in placing the data. Consider the example in Figure 2.10, which assumes a simple implementation of a page-level FTL scheme. Here, two consecutive write requests modify the pages with LPNs 578 and 579. Each of those pages can, in principle, be written to an arbitrary free location on the SSD. Typically, however, the FTL keeps track of only one current block on every die and appends the incoming pages to it. Once this block is full, the FTL selects another block and continues. In our example, the first page is written to the current block 7, into the Flash page with PPN 1022. Then, the FTL finds a new empty block (PBN = 1), which becomes the current one, and writes the second page into this block (PPN = 190). The mapping table is updated after every request, and the pages holding the original versions of the modified data are marked as invalid (pages with PPNs 2026 and 356).
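The append-to-current-block behavior can be sketched as follows. This is a toy model under assumptions, not the layout of Figure 2.10: it covers a single die, numbers physical pages linearly as block number times pages per block plus offset, takes free blocks in ascending order, and uses a tiny geometry so that the block switch is visible. The class name `PageLevelFTL` and its fields are hypothetical.

```python
class PageLevelFTL:
    """Toy page-level FTL for one die with a single current block."""

    def __init__(self, num_blocks, pages_per_block):
        self.pages_per_block = pages_per_block
        self.l2p = {}         # LPN -> PPN
        self.invalid = set()  # PPNs holding outdated page versions
        self.free_blocks = list(range(num_blocks))
        self.current_block = self.free_blocks.pop(0)
        self.next_offset = 0  # next free page in the current block

    def write(self, lpn):
        """Append the page to the current block and update the mapping."""
        if self.next_offset == self.pages_per_block:  # current block is full
            if not self.free_blocks:
                raise RuntimeError("no free block left; GC would be required")
            self.current_block = self.free_blocks.pop(0)
            self.next_offset = 0
        ppn = self.current_block * self.pages_per_block + self.next_offset
        self.next_offset += 1
        old_ppn = self.l2p.get(lpn)
        if old_ppn is not None:
            self.invalid.add(old_ppn)  # mark the previous version as invalid
        self.l2p[lpn] = ppn
        return ppn


ftl = PageLevelFTL(num_blocks=4, pages_per_block=2)
print(ftl.write(578))  # 0 (block 0, offset 0)
print(ftl.write(579))  # 1 (block 0, offset 1; block 0 is now full)
print(ftl.write(578))  # 2 (new current block 1, offset 0)
print(ftl.invalid)     # {0}
```

The third write shows both effects described in the text: the full block forces a switch to a new current block, and the old version of LPN 578 is merely marked invalid rather than erased.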

When only a few empty blocks are left on the SSD, the FTL triggers garbage collection. GC selects a victim block (e.g., the one with the largest number of invalid pages), copies its valid pages to the current block, and subsequently erases the victim block and adds its address to the pool of free blocks. This process continues until the number of free blocks reaches a certain threshold. It is easy to see that in this FTL scheme the amount of work performed by GC is proportional to the number of valid pages in the victim blocks. For instance, for a block with 5 valid pages the garbage collector performs 5 copybacks and one erase operation, while for a block with 25 valid pages 25 copybacks and one erase are needed. Thus, the key factor for reducing the overhead of GC, and thereby improving the performance characteristics of an SSD, is reducing the average number of valid

(Figure 2.10 residue: logical-to-physical address mapping (L2PAM) table indexed by LPN (0 ... N) and storing PPNs; write(LPN=578) and write(LPN=579) replace the entries 2026 and 356 with 1022 and 190; physical blocks PBN=0, PBN=1, PBN=2, PBN=7 and PBN=15 are shown.)

Write page to the current block. If it becomes
