2. Background
2.2. DBMS Storage Alternatives and I/O Stack
2.2.2. Raw Storage
Most commercial database systems on the market today can operate without a file system as the storage abstraction. This storage alternative is also known as "raw storage" (Figure 2.2-C). The rationale behind it is manifold.
First, the DBMS gets more control over the physical data placement. As already mentioned, the performance of HDDs varies drastically depending on the access pattern, e.g., sequential accesses can be up to 100x faster than random ones. To leverage this property, the data should be placed on the drive in a way that "sequentializes" the access pattern for the current workload as much as possible. For this task raw storage has a significant advantage over cooked storage: the direct mapping from logical to physical block addresses. This mapping requires no address translation table; it is realized by a simple constant-offset calculation⁴. This ensures that blocks with contiguous LBAs also have contiguous physical addresses on the storage⁵. Thus, by managing the logical address space, the DBMS fully controls the physical data placement. In contrast, under the cooked storage alternative the file system manages the physical address space on its own, thereby creating an additional level of address translation between the DBMS and the physical storage. As a result, the DBMS can influence the physical data placement only indirectly and only to a certain extent.
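The constant-offset mapping can be illustrated with a minimal sketch. The function name, the 4 KiB block size, and the 1 MiB partition offset are assumptions chosen for illustration; only the formula itself comes from the text.

```python
def lba_to_byte_offset(lba: int, block_size: int, partition_offset: int) -> int:
    """Map a logical block address to a byte offset on the raw device.

    Implements offset = partition_offset + (LBA * block_size).
    No translation table is consulted; the mapping is pure arithmetic.
    """
    if lba < 0:
        raise ValueError("LBA must be non-negative")
    return partition_offset + lba * block_size


# Assumed example values (not taken from the text).
BLOCK_SIZE = 4096             # bytes per block
PARTITION_OFFSET = 1_048_576  # partition starts 1 MiB into the device

# Contiguous LBAs map to byte offsets that are exactly one block apart.
offsets = [lba_to_byte_offset(lba, BLOCK_SIZE, PARTITION_OFFSET) for lba in range(4)]
gaps = [later - earlier for earlier, later in zip(offsets, offsets[1:])]
```

Every entry in `gaps` equals `BLOCK_SIZE`, which is the contiguity property the DBMS relies on when it lays out data through the logical address space.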
Another advantage of raw storage is the lean I/O stack with its minimal memory and computational overhead. As already mentioned, caching (buffering), prefetching and recovery are typical functions of the file system that become redundant and needlessly expensive for applications like DBMSs. Moreover, eliminating FS/OS caching gives the DBMS under raw storage better control over the point in time at which a given piece of data is physically persisted, which resolves the issues with the DBMS's durability guarantees that arise in the cooked storage stack.
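To make the durability issue concrete, the following sketch shows what an application on cooked storage typically has to do so that a write does not linger in the OS page cache. It is an illustration under stated assumptions, not the mechanism of any particular DBMS: the helper names, the file name, and the 8 KiB page size are hypothetical, and only standard `os` calls are used.

```python
import os
import tempfile


def durable_write(path: str, payload: bytes) -> None:
    """Write payload to path and ask the OS to flush it to the device."""
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o644)
    try:
        view = memoryview(payload)
        while view:
            # os.write may write fewer bytes than requested, so loop.
            written = os.write(fd, view)
            view = view[written:]
        # Without this call the data may sit in the OS cache for an
        # unspecified time after write() has already returned.
        os.fsync(fd)
    finally:
        os.close(fd)


def demo_page_size_on_disk() -> int:
    """Write one assumed 8 KiB page to a temporary file and return its size."""
    with tempfile.TemporaryDirectory() as directory:
        path = os.path.join(directory, "page.bin")  # hypothetical file name
        durable_write(path, b"\x00" * 8192)
        return os.path.getsize(path)
```

The explicit `os.fsync` is the extra step the file system cache makes necessary; under raw storage that layer of caching is absent from the path between the DBMS buffer and the device.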
A general performance comparison of the two storage alternatives is, however, difficult to perform, as many factors have a significant influence on it. The DBMS and its configuration, the file system, the workload, the storage, and the amount of RAM are the main parameters, and together they define a huge space of possible system configurations. For instance, under a workload with a completely random access pattern, the direct control over data placement in raw storage brings no advantage over the cooked alternative; raw storage will nevertheless outperform the latter due to its reduced memory and computational overhead. Yet, for CPU-bound DBMSs with ample memory, the performance advantage of raw storage would be significantly smaller than for an I/O-bound system. Some performance numbers can be found in [73], [59], [68], as well as in Chapter 5 of this thesis.
⁴ offset = partition_offset + (LBA * block_size)
⁵ Except in cases where the HDD internally remaps "dead" sectors to reserved sectors.
Despite these performance advantages of the raw storage alternative on HDDs, many database setups today still opt for the cooked I/O stack, with the file system between the DBMS and the storage. The reasons for this are diverse.
One of them is that database data stored in files can easily be accessed by applications other than the DBMS. For instance, separate backup applications typically work only with file abstractions and not with raw disk partitions. Also, under raw storage the whole disk partition must be dedicated solely to the DBMS, which might under certain circumstances result in poor space utilization. Assume, for example, a single HDD with 5TB capacity and an initial database size of just 100GB. If the database is relatively static or grows very slowly, then creating a 200GB raw partition and dedicating it to the DBMS might work well. However, what if the database grows faster, e.g., we estimate that it will reach approximately 1TB within one year? How large should the database partition be: 1TB, 2TB or the whole drive? If we opt for 2TB, then 1TB of our HDD is "reserved" for the second year and cannot be used by other applications (not even for temporary storage of data). And what if the growth rate of the data cannot be estimated in advance? Note that resizing non-empty disk partitions is a "no go" option due to the high risk of data corruption or loss. However, these two issues are typically relevant only for small setups. Large and enterprise database instances typically run on separate machines dedicated strictly to the DBMS anyway (e.g., database servers, nodes in data centers). Further, modern DBMSs (commercial and open-source) typically provide enough mechanisms of their own to guarantee high reliability and availability, so that there is no need for third-party backup applications.
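The space-utilization example above can be worked through as a short calculation. This is only a sketch of the arithmetic: the function name is hypothetical, and decimal units (1TB = 1000GB) are assumed.

```python
def idle_reserved_gb(partition_gb: int, database_gb: int) -> int:
    """Space inside the raw partition that is reserved but not yet used."""
    if database_gb > partition_gb:
        raise ValueError("database no longer fits into the partition")
    return partition_gb - database_gb


# Figures from the example in the text, assuming 1TB = 1000GB.
DRIVE_GB = 5000
PARTITION_GB = 2000

idle_at_start = idle_reserved_gb(PARTITION_GB, 100)       # database is 100GB
idle_after_year = idle_reserved_gb(PARTITION_GB, 1000)    # database grew to 1TB
available_to_others = DRIVE_GB - PARTITION_GB             # outside the partition
```

With a 2TB partition, 1900GB sit idle at the start and 1000GB are still idle after the first year, while only the 3000GB outside the partition remain usable by other applications.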
Another frequently mentioned reason for choosing cooked over raw storage can be summed up in one word: simplicity. By shifting the responsibility for physical space management to the file system, the DBMS storage manager becomes leaner. Moreover, to achieve more efficient data placement on raw storage, the DBMS needs to know some physical characteristics of that storage (e.g., the track alignment of an HDD [86]). This makes the data placement strategy device-specific to a certain degree, which is often seen as a disadvantage compared to the abstraction level provided by the file system. However, these reasons are rather subjective. The additional code complexity that raw storage support adds to the storage manager is usually overestimated.
Yet, the main reason for the low popularity of the raw storage alternative is the widespread use of virtual storage techniques, such as RAID, SAN and LVM. All of them create an additional level of abstraction over the physical storage, which, in turn, makes it difficult for the DBMS to control the physical data placement. Since these issues are also relevant for the proposed approach on Flash SSDs, we will address them in detail later in this work.
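As an illustration of how such a layer changes the picture, the sketch below maps logical blocks to physical locations under plain RAID-0 striping. It is not taken from the thesis; the function name and the parameters (two disks, a stripe unit of two blocks) are assumptions, and real arrays, SANs and volume managers use more elaborate layouts.

```python
from typing import List, Tuple


def raid0_locate(lba: int, num_disks: int, stripe_unit_blocks: int) -> Tuple[int, int]:
    """Return (disk index, block index on that disk) for a logical block.

    Standard RAID-0 layout: consecutive stripe units are placed on
    consecutive disks in round-robin order.
    """
    stripe_unit = lba // stripe_unit_blocks      # which stripe unit holds the block
    disk = stripe_unit % num_disks               # round-robin over the disks
    block_on_disk = (stripe_unit // num_disks) * stripe_unit_blocks + lba % stripe_unit_blocks
    return disk, block_on_disk


# Assumed configuration: 2 disks, stripe unit of 2 blocks.
placement: List[Tuple[int, int]] = [
    raid0_locate(lba, num_disks=2, stripe_unit_blocks=2) for lba in range(8)
]
```

Here logical blocks 0 to 7 land on disks 0, 0, 1, 1, 0, 0, 1, 1, so a logically sequential run is no longer physically contiguous on a single device, and the simple offset mapping that raw storage relies on no longer describes the actual placement.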