Spatial Data Management over Flash Memory


Ioannis Koltsidas¹ and Stratis D. Viglas²

¹ IBM Research, Zurich, Switzerland

² School of Informatics, University of Edinburgh, UK

Abstract. We present desiderata for improved I/O performance of spatial data structures over flash memory and hybrid flash-magnetic storage configurations. We target the organization of the data structures and the management of the in-memory buffer pool when dealing with spatial data. Our proposals are based on the fundamentals of flash memory, thereby increasing the likelihood of being immune to future trends.

1 Introduction

Flash memory has moved from being used for short-term, low-volume storage and data transfer to becoming the primary alternative to magnetic disks for long-term, high-volume, persistent storage. Appearing in systems as disparate as PDAs, handheld tablets, laptops and desktops, enterprise servers and clusters, it is one of the most ubiquitous storage media on the market. At the same time, the proliferation of location-based services means that spatial data management techniques are gaining increasing traction. As a result, access to and processing of spatial data moves from highly customized application scenarios to commodity ones. The combination of these two trends necessitates revisiting spatial data management for flash memory at both the server and the client levels. In this paper we present ideas that we posit are imperatives when it comes to high-performing techniques for flash-resident spatial data.

D. Pfoser et al. (Eds.): SSTD 2011, LNCS 6849, pp. 449–453, 2011.

Solid state drives, or SSDs, are arrays of flash memory chips packaged in a single enclosure with a controller. An SSD is presented to the operating system as a single storage device using the same interface as traditional hard disk drives, or HDDs (e.g., SATA). The similarities between the two types of storage medium, however, stop here. The key characteristics of SSDs are (a) the lack of mechanical moving parts; (b) the asymmetry between their read and write latencies; and (c) their erase-before-write limitation. The first characteristic means that there is no difference between sequential and random access latencies. Performance is not penalized by seek time or rotational delay, as is the case for HDDs. The second characteristic has to do with the physical characteristics of flash memory, which make reading the value of a flash memory cell faster than changing it; therefore, reads are in general faster than writes. The discrepancy between read and write latencies is further influenced by the underlying technology of flash memory and, in particular, by how many bits are stored in each cell: a single-level cell (SLC) stores one bit, whereas a multi-level cell (MLC) can store two bits. SLC devices inherently have higher performance but lower density; on the other hand, MLC devices have higher density but lower performance. Finally, the third characteristic stems from the inherent properties of SSDs: to update an already written flash page, the controller must first erase it and then overwrite it. Erasures are performed at an erase unit granularity, where each erase unit is a flash block, i.e., a number of contiguous flash pages. A garbage collection mechanism is required to reclaim flash blocks as pages become invalid due to user overwrites. This is especially detrimental to the performance of random write workloads: random writes result in continuously erasing blocks and moving data at the flash level. It is not uncommon for the random write throughput to be five times lower than the random read throughput for SLC devices, or even up to two orders of magnitude lower for MLC devices. Though approaches for efficient spatial data structures have been proposed [1,5], data management over flash memory clearly requires rethinking our priorities rather than merely improving performance in specific cases. To complicate the situation, existing HDD-based hardware is not going to disappear any time soon; the trend is towards augmenting HDD storage with flash memory. It is therefore imperative to design towards hybrid storage configurations employing both SSDs and HDDs.
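To make the erase-before-write limitation concrete, the following is a minimal sketch of a single erase unit. It is a deliberately simplified model, not a description of any real controller: the class `FlashBlock`, its method names, and the four-page block size are all assumptions made for illustration. Real controllers write out of place and rely on garbage collection, which defers this cost rather than removing it.

```python
class FlashBlock:
    """Toy erase unit: a page can be programmed only while it is erased."""

    def __init__(self, pages_per_block: int = 4) -> None:
        self.pages = [None] * pages_per_block  # None marks an erased page
        self.erases = 0     # number of block erasures so far
        self.programs = 0   # number of page writes (programs) so far

    def program(self, index: int, data: str) -> None:
        if self.pages[index] is not None:
            raise RuntimeError("page must be erased before it is written")
        self.pages[index] = data
        self.programs += 1

    def erase(self) -> None:
        # Erasure works on the whole block, never on a single page.
        self.pages = [None] * len(self.pages)
        self.erases += 1

    def update_in_place(self, index: int, data: str) -> None:
        """Naive overwrite: save live pages, erase the block, rewrite them all."""
        live = list(self.pages)
        live[index] = data
        self.erase()
        for i, content in enumerate(live):
            if content is not None:
                self.program(i, content)


block = FlashBlock(pages_per_block=4)
for i, content in enumerate(["a", "b", "c", "d"]):
    block.program(i, content)

block.update_in_place(2, "x")
print(block.pages, block.erases, block.programs)
```

In this toy model, changing one page of a full four-page block costs one erasure and four page writes on top of the four initial ones, which is the kind of data movement that makes random writes expensive.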

We propose ways to tackle spatial data management in flash and hybrid setups for the applications of the foreseeable future. We cannot claim that these ideas will be future-proof; however, we solidly ground them on the inherent characteristics of flash memory, thereby increasing their potential for applicability. We do not require any flash features to be exposed apart from a standard I/O interface. Our proposals revolve around two axes: the organization of spatial data structures in SSD-only as well as in hybrid configurations, and the management of pages belonging to spatial data structures once the pages have been brought into main memory. In what follows we present each axis in turn.

2 Data Structure Organization

One of the key goals of secondary storage data structures is performance guarantees. For instance, the best performing spatial data structures are balanced trees. Their guarantee is that all paths from root to leaves are of equal length. This is achieved through bottom-up management algorithms: insertions and deletions cause splits and merges at the leaves; these splits and merges are applied recursively during the bottom-up traversal and propagate higher up in the tree, possibly increasing or decreasing its height. Such algorithms make perfect sense for structures over HDDs, where the read and write costs are uniform.

Consider, however, the read/write discrepancy of SSDs and an insertion like the one of Fig. 1. Typically, as shown in the left part, an insertion into a full leaf L causes a split of the leaf into itself and a sibling S, an update at the original leaf's parent P with the new bounding box for L, and the insertion of the bounding box for S into P. To balance the tree we must perform three disk writes. An alternative, instead of splitting L into L and S, is to allow L to overflow into S and not update P, since L's bounding box has not changed (right part of Fig. 1). Thus, instead of performing three writes, we perform one: only for S.

Fig. 1. Balanced vs. unbalanced insertions. Panels: original setup, balanced insertion (left), unbalanced insertion (right); nodes shown are the parent P, the leaf L, and the sibling S.

Assume now that the ratio between write and read costs is x, i.e., reading is x times faster than writing. Depending on the application, the cost associated with each type of operation may be any combination of the latency and throughput of the SSD for that type of operation. Saving two writes during insertion and causing the imbalance means that we have saved 2x time units: any operation that reads L would have read it anyway; any operation that also has to access S would have accessed it anyway, albeit through a direct child pointer from P. Thus, the structure overall has x time units to spare with respect to its optimal form (i.e., the fully balanced one). We therefore allow this imbalance and maintain a counter c at L that measures how many times the overflow pointer to S has been traversed. As long as c < x, the data structure is still more efficient; once c = x we can rebalance it by propagating the bounding box from S to P, thereby spending the write to P that we deferred in the first place.

This idea can work across all levels of the tree, though one must be careful to avoid long overflow chains. In such cases rebalancing can be prioritized using global rather than local metrics. In a typical scenario, the savings will be substantial. As the tree structure grows, the likelihood of a single leaf being "bombarded" with read or write requests decreases in typical scenarios. Thus, the likelihood of the tree's imbalance paying off increases. Though generally applicable to all tree structures, catering for imbalance is more likely to pay off for spatial data and location-based applications: the update rate of the spatial data is low, while the point queries themselves are highly volatile as users move about the area.
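The deferred-rebalancing rule can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: costs are counted in read units with one write costing x reads, each overflow traversal is charged one extra read (an upper bound), and bookkeeping on L itself is not charged, following the accounting in the text. The names `Node`, `CostMeter`, `unbalanced_insert`, and `traverse_overflow` are hypothetical. One way to read the budget: the balanced insertion costs 3x; the unbalanced one costs x up front, plus one read per traversal, plus x for the deferred write to P, so the two meet exactly when c = x.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Node:
    name: str
    children: List["Node"] = field(default_factory=list)
    overflow: Optional["Node"] = None  # overflow pointer from L to S
    c: int = 0                         # times the overflow pointer was traversed


class CostMeter:
    """Accumulates cost in read units; one write costs x reads (assumed model)."""

    def __init__(self, x: int) -> None:
        self.x = x
        self.total = 0

    def read(self, n: int = 1) -> None:
        self.total += n

    def write(self, n: int = 1) -> None:
        self.total += n * self.x


def unbalanced_insert(leaf: Node, meter: CostMeter) -> Node:
    """Deferred split: L overflows into S and only S is written."""
    sibling = Node(leaf.name + "'")
    leaf.overflow = sibling
    meter.write(1)
    return sibling


def traverse_overflow(parent: Node, leaf: Node, meter: CostMeter) -> Node:
    """Follow L's overflow pointer to S, rebalancing once c reaches x."""
    sibling = leaf.overflow
    if sibling is None:
        raise ValueError("leaf has no overflow sibling")
    meter.read(1)  # extra hop through L
    leaf.c += 1
    if leaf.c >= meter.x:
        parent.children.append(sibling)  # propagate S's bounding box to P
        meter.write(1)                   # the deferred write to P
        leaf.overflow, leaf.c = None, 0
    return sibling


x = 5
meter = CostMeter(x)
P, L = Node("P"), Node("L")
P.children.append(L)
S = unbalanced_insert(L, meter)
costs = []
for _ in range(x):
    traverse_overflow(P, L, meter)
    costs.append(meter.total)
print(costs, 3 * x)
```

With x = 5 the running cost stays below the 15 units of a balanced insertion for the first four traversals and reaches exactly 15 at the fifth, when S becomes a regular child of P.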

Observations like the one that led to unbalanced structures can be generalized to hybrid configurations. In more detail, for a tree structure, the likelihood of having to write data increases as we descend the tree. One might then consider placing the different parts of the tree structure on different media. The top levels of the tree have a read-intensive workload, used to direct searches, whereas the bottom levels might have an update-intensive workload; moreover, the likelihood of an update being propagated upstream decreases with the height of the tree. It might then be advantageous to place the write-intensive leaves of the tree on the HDD and the read-intensive index nodes on the SSD [3].


unbalanced, this will result in further performance boosts.
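The placement idea above can be illustrated with a simple cost comparison. The sketch below is an assumption-laden toy, not the method of [3]: the `Device` type, the `place` function, the per-operation costs, and the access counts are all invented for illustration. It places a page on whichever medium minimizes the expected cost of its observed reads and writes.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Device:
    name: str
    read_cost: float   # cost of one random page read (arbitrary units)
    write_cost: float  # cost of one random page write (arbitrary units)


def expected_cost(device: Device, reads: int, writes: int) -> float:
    """Total cost of serving the given accesses from this device."""
    return reads * device.read_cost + writes * device.write_cost


def place(reads: int, writes: int, devices: Sequence[Device]) -> Device:
    """Pick the device with the lowest expected cost for this access mix."""
    return min(devices, key=lambda d: expected_cost(d, reads, writes))


# Illustrative numbers only: cheap reads and costly writes on flash,
# uniform but slower random access on the magnetic disk.
SSD = Device("ssd", read_cost=1.0, write_cost=10.0)
HDD = Device("hdd", read_cost=8.0, write_cost=8.0)

index_node = place(reads=1000, writes=5, devices=[SSD, HDD])
leaf_node = place(reads=100, writes=400, devices=[SSD, HDD])
print(index_node.name, leaf_node.name)
```

Under these assumed costs the read-heavy index node lands on the SSD and the write-heavy leaf on the HDD, matching the intuition in the text; with different cost ratios or access mixes the outcome changes accordingly.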

3 Buffer Management

The next issue to focus on is buffering spatial data from the SSD into main memory. As is usually the case, we assume demand paging: data pages can only be processed in main memory and they are brought from the SSD into main memory only on reference. Main memory is of limited capacity and smaller than the capacity of the SSD; thus, only a subset of the flash-resident data pages can be kept in main memory at any point in time. Once the memory is full, data pages will be evicted to make room for new ones.

There has been a substantial body of research targeting general buffer management for SSDs [2,4]. Tailoring these approaches to spatial data, in conjunction with the I/O asymmetry of SSDs, offers further room for improvement. Consider a location-based application which continuously accesses a spatial data index for information (e.g., points of interest close to the user's location). Such an application will generate multiple hot paths in the index, tracking the user's motion. As the user moves about, the paths followed from the root of the tree to the leaves will be constantly changing. Therefore, using simple reference counts as the eviction metric, though correct, leaves room for improvement. By predicting the motion we can make more informed decisions regarding which pages to keep in the buffer pool and which have a lower probability of being referenced in the future. By selectively caching the parts of the indexed space that are more likely to receive updates, many flash writes, and thereby erasures, can be avoided. Coupled with cost-based replacement policies [3], such an approach can improve I/O performance. One might even envision horizontal or vertical tree partitioning and placement based on access frequency and storage capabilities.
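A motion-aware, cost-based eviction choice could look like the sketch below. Everything in it is an assumption for illustration: the linear extrapolation used as the motion model, the inverse-distance estimate of reference likelihood, the cost constants, and the names `Page`, `predict`, `eviction_cost`, and `choose_victim`. It is not one of the cited policies [2,3,4]. The cost of evicting a page is taken to be the expected cost of reading it back plus, for a dirty page, the flash write needed to flush it.

```python
import math
from dataclasses import dataclass
from typing import Dict, Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


@dataclass
class Page:
    page_id: int
    mbr: Box             # bounding box of the region the page indexes
    dirty: bool = False  # True if eviction would force a flash write


def predict(prev: Point, curr: Point) -> Point:
    """Linear extrapolation one step ahead (assumed motion model)."""
    return (2 * curr[0] - prev[0], 2 * curr[1] - prev[1])


def distance_to_box(p: Point, box: Box) -> float:
    """Euclidean distance from a point to a box; zero if the point is inside."""
    xmin, ymin, xmax, ymax = box
    dx = max(xmin - p[0], 0.0, p[0] - xmax)
    dy = max(ymin - p[1], 0.0, p[1] - ymax)
    return math.hypot(dx, dy)


def eviction_cost(page: Page, target: Point,
                  read_cost: float = 1.0, write_cost: float = 10.0) -> float:
    """Expected re-read cost plus the flush cost of a dirty page."""
    p_ref = 1.0 / (1.0 + distance_to_box(target, page.mbr))
    return p_ref * read_cost + (write_cost if page.dirty else 0.0)


def choose_victim(pool: Dict[int, Page], prev: Point, curr: Point) -> int:
    """Evict the page that is cheapest to lose given the predicted position."""
    target = predict(prev, curr)
    return min(pool.values(), key=lambda pg: eviction_cost(pg, target)).page_id


pool = {
    1: Page(1, (0, 0, 1, 1)),                    # clean, behind the user
    2: Page(2, (4, 4, 5, 5)),                    # clean, where the user is heading
    3: Page(3, (-10, -10, -9, -9), dirty=True),  # far away but dirty
}
victim = choose_victim(pool, prev=(2.0, 2.0), curr=(3.0, 3.0))
print(victim)
```

Here the user is predicted to reach (4, 4), so the clean page left behind is evicted first, while the distant dirty page is retained because flushing it would cost a flash write. How heavily to weigh the write cost against predicted locality is a tuning decision this sketch does not settle.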

4 Conclusions and Outlook

While SSDs are gaining traction in the research community and industry alike, surprisingly enough, one of the areas that is under-represented is spatial data management over flash memory. We have presented some initial ideas on improving the performance of flash-resident spatial data structures.

One of the aspects we did not address is endurance and wear-leveling. More precisely, each flash cell can only be written a limited number of times. Thus, the controller spreads writes evenly throughout the flash chips and maintains a mapping between logical block identifiers and physical ones, through a software layer called the flash translation layer, or FTL. While FTL algorithms are typically inaccessible to the user, some SSD manufacturers are now pushing more of that functionality to the OS driver level; in addition, SSD-specific commands have lately been introduced, giving the filesystem more direct control over some flash operations (e.g., the TRIM command, which tells the device which blocks are no longer in use and may be erased). In light of that, the ideas presented in the previous sections can be more efficiently coupled with the SSD and result in further I/O savings.


References

1. Emrich, T., et al.: On the impact of flash SSDs on spatial indexing. In: DAMON 2010 (2010)

2. Kim, H., Ahn, S.: BPLRU: a buffer management scheme for improving random writes in flash storage. In: FAST (2008)

3. Koltsidas, I., Viglas, S.D.: Flashing up the storage layer. Proc. VLDB Endow. 1(1), 514–525 (2008)

4. Park, S.-Y., et al.: CFLRU: a replacement algorithm for flash memory. In: CASES (2006)

5. Wu, C.-H., et al.: An efficient R-tree implementation over flash-memory storage systems. In: GIS (2003)