AN ALGORITHM FOR DISK SPACE MANAGEMENT. whose workload consists of queries that retrieve information. These

(1)

AN ALGORITHM FOR DISK SPACE MANAGEMENT

TO MINIMIZE SEEKS

SHAHRAM GHANDEHARIZADEH AND DOUG IERARDI

Abstract. The past decade has witnessed a proliferation of

reposito-ries whose workload consists of quereposito-ries that retrieve information. These repositories provide on-line access to vast amount of data and serve as an integral component of many applications, e.g., library information systems, scientic applications, and the entertainment industry. Their storage subsystems are expected to be hierarchical, consisting of mem-ory, magnetic disk drives, optical disk drives, and tape libraries. The database itself resides permanently on the tape. Files are staged and selectively cached on either the magnetic or optical disk drives, and later deleted when the available space of a device is exhausted. This behav-ior will generally cause fragmentation of the disk space over a period of time, resulting in a non-contiguous layout of disk{resident les. As a consequence, the disk is required to reposition its read head multiple times (incurring seek operations) whenever a resident le is retrieved in a sequential manner, thus reducing the overall performance of the system.

This brief study describes an algorithm to manage the storage and placement of les cached on mechanical devices. It acts in an online manner to maintain a dynamically changing working set of disk{resident les, while guaranteeing both maximum utilization of disk space and a maximum of dlgne seeks for the continuous retrieval of each disk{

resident le of n blocks. We present two variants of this algorithm,

which explore tradeo s in reorganization overhead and main memory utilization.

Keywords: data processing, le systems, caching

(2)

A recent trend in the area of databases has been an increase in the number of repositories whose primary function is to disseminate information. These systems are expected to play a major role in library information systems, scientic applications, the entertainment industry, health care information systems, and knowledge-based systems. These systems typically provide on-line access to vast amount of data. The large size of their databases has led to the use of hierarchical storage structures consisting of a combination of fast and slow devices the database resides permanently on the slowest device in the hierarchy, and the system controls the placement of data in order to hide their high latency using its faster devices, such as magnetic disks. Data is swapped in and out of the disks based on its expected future access patterns, with the objective of minimizing the frequency of access to slower devices.

A hierarchical storage structure may consist of mechanical devices, such as magnetic disks, optical disks, and tape devices. These devices exhibit the following properties: (1) Data is stored on the medium in a linear manner. (2) The read head of the device may be moved to physical location on the medium this operation is termed a seek. (3) Once the seek is complete, a device can read sequentially from the medium. The time required for a seek on these devices is often substantial relative to the time required to read data from the medium. In this sense, seek operations are wasteful and should be minimized in order to maximize the amount of useful work that is performed.

This study describes an algorithm to manage the storage and placement of objects or les on such devices. The goal of these algorithms is to maximize the contiguous layout of stored objects, and thus to minimize the number of seeks performed when retrieving an object in a sequential manner. This sort of continuous retrieval of large objects is common in scientic and mul-timedia databases Be95]. To simplify the discussion, we shall assume that objects will be cached on a magnetic disk.

All objects in the system are assumed to be read{only. A copy of each re-sides permanently on the slowest device in the hierarchy. Since we focus only on the space management issue, we shall assume that there is an external module which determines the set of objects to be cached on the disk. This module issues a (potentially innite) sequence of requests to materialize and free objects on the device. (See GIZ94] for details of such a module.) Its criteria for choosing these objects to be cached on the disk is unspecied, although it is presumably chosen to t the system and its workload. We fur-ther assume that whenever a request to materialize an object ofnblocks is

issued, there are in fact at leastnfree blocks on the disk thus, the external

module has selected victims to be deleted in order to accommodate the new objects that are to be materialized, and has issues the necessary requests to maintain its chosen working set in an appropriate order. However, there are

(3)

no further restrictions on this module. So if there are a total of cblocks on

the disk, then it may choose to cache any set of objects of requiring at most

c blocks on the disk.

Given such a module, it is the duty of the storage manager to maintain the requested working set on disk. Its performance will be judged by two criteria: (1) the time required by a user process to do a sequential read of any disk{resident object, in the worst case, and (2) the additional time required to maintain this disk organization, over and above the number required to materialize les. The algorithm ofx3 below guarantees that any

disk{resident object of nblocks can be retrieved sequentially with at most dlgne seeks, requires fewer than n additional disk reads and writes when

materializing an object. The two algorithms of x2 andx3 present a tradeo

between the number of additional maintainance operations incurred, and the complexity of the disk's directory structure (a record of the state of the disk).

The algorithm makes use of well known methods for managing dynamic main memory data{structures, as in Mel84] and CLR91], but applied to secondary devices. These points are discussed further in x4. Empirical

results of a comparison between this algorithm and other common strategies may be found in GIZ94].

1.1.

The model.

Assume that the storage medium has been partitioned into physical blocks of a xed size, where a block is the minimum unit of space allocation for the physical device. All blocks are assumed to have the same size. Let B

0 B

1

:::B

C;1 denote the ordered sequence of physical blocks

on the disk, where C is the capacity of the device in blocks. Assume also a

xed collection of objects (or les). Each objectois comprised of an ordered

sequenceho 1

:::o n

iofnpages, for somen>0 (its size, denotedjoj). Each

physical block of the disk can hold the data in exactly one page. An object is disk{resident if all of its pages reside on the disk. The assignment of pages to blocks will be given by a partial function `, that assigns no two pages

to the same physical block. Let R

` be the set of disk{resident objects for `. We shall assume that every object that is not disk{resident has no pages

residing on the disk. So the algorithm cannot take advantage of \partially resident" objects, and whenever an object is non-resident, all of its pages must be materialized on the disk. With respect to a given assignment`, a

disk block is free if ` does not assign a page of any disk{resident object to

it.

Let o be any disk resident object, and o

i a page of

o. then the retrieval

of o

i incurs a seek under the current layout

` if i > 0 and its predecessor o

i;1 is not assigned to the block immediately preceeding `(o

i). That is, in

a continuous retrieval of the pages of o in sequence, the device is forced to

seek to block `(o

i) after reading `(o

i;1).

o's internal fragmentation under a

(4)

1.2.

Basic operations and their costs.

Materialization of an object o

on the disk implies that the set of resident objects and their layout both change. To simplify the analysis for this paper, we assume no cost for writing a page of o to disk during materialization, because any algorithm

must incur this cost. Instead we focus on two quantities that the algorithm intends to minimize: (1) the internal fragmentation of an object, or the number of seeks incurred during a continuous retrieval of that object, and (2) the number of pages that are copied from one physical block to another, or the cost of reorganizing or compacting disk space.

It is assumed that the various data{structures which record the state of the system (such as directories and the free list) are maintained in main memory. We do not address issues related to crash recovery in this paper.

2. The Basic Algorithm

To manage the disk's space, we rst impose an orderedd-ary tree structure

on the sequence of physical blocks, in which leaves correspond to blocks on the disk, and their order corresponds to the blocks' physical sequence. To simplify the description of the algorithm, we take d= 2. Generalization to

the case d>2 is straightforward.

The tree structure imposed on the physical blocks is built up in the follow-ing bottom{up manner. LetBij] denote the contiguous sequence of blocks B

i

:::B

j. For each integer

h= 0:::blgCc, and eachi= 0:::bC=2 h c, the interval Bi2 h (i+ 1)2 h ;1]

is the ith section of height h. The sections of height 0, consisting of single

blocks, are the leaves. Each section of height h > 0 contains exactly two

sections of height h;1, which are its children. By taking these sections

as internal nodes, the blocks of the disk are now organized into an ordered forest of complete binary trees. Sections that are siblings in this tree also called \buddies" Kno65]. So each contiguous section of height h can be

decomposed into a pair of adjacent sections of heighth;1. Conversely, each

pair of buddies of heighth;1 can be combined to form a single contiguous

section of heighth. There is at most one section of each height that has no

buddy each is the root of a complete binary tree in this forest. (Below we call these sections \roots".)

We say that a section is occupied by the object oif some subsequence of o's pages are laid out contiguously in the blocks of this section. By \the

sections of o", we mean the maximal sections occupied by o (i.e. those of

maximal height under containment).

2.1.

The free list.

A section is free if every page in that section is free. For any layout, the free list is just a list of all maximal free sections, i.e.

those free sections that are not contained in other free sections. The current free list will be recorded in a memory-resident data structure, maintained as

(5)

a sequence of lists | one for each possible section height from 0 toblgCc.

The address of each maximal section of height h is enqueued in a list that

handles sections of heighth only.

2.2.

Invariant properties.

For any n(0n<C), let n

h denote the hth

bit in the binary expansion n, so that n= blgCc X h=0 n h2 h :

In its simplest version, the algorithm for managing data on the disk will maintain the following properties.

1. If an object occupies a section, then all of its pages are stored contigu-ously in the blocks of that section.

2. Let obe any disk{resident object and n=joj. Then ohas exactly n h

maximal section of height hblgCc.

3. Suppose that there are f free blocks on the disk. Then the free list

contains exactly f

h maximal free sections of height

h, for each h blgCc.

For example, suppose that the disk has a capacity of 258 blocks. So the sections range in height from 0 to 8. There are two roots, one of a complete binary tree of height 8, and another of height 2. If an objectoof 13 pages is

resident, then the properties above require that the pages ofooccupy three

sections: one of height 3 (with 8 blocks), one section of height 2 (with 4 blocks) and one of height 0 (with 1 block). At most 2 pages of o will incur

a seek. Similarly, if the free list contains 32 free blocks, then these must all occupy a single section of height 4.

To record the state of the system (e.g. a directory), it suces to record the set of disk resident objects, their maximal sections, and the sections on the free list. Since each maximal section can be specied by its starting address and its height, the disposition of a resident object ofnblocks can be

recorded in O(lgn) space, and the sections on the free list can be recorded

in O(lgC) space. So a record of the entire system requires space at most X

o2R

O(lgjoj) +O(lgC)O(jRjlgC):

2.3.

Materialization.

Materialization of an object is handled by the fol-lowing algorithm. Assume that the properties enumerated in the previous section hold, and a request is made to materialize objecto of npages. Let n = P blgCc h=0 n h2

h Partition the blocks of

o into intervals, with n

h intervals

of size 2h, for each h.

For each height h = blgCc:::0, if n h

> 0 the following algorithm is

used to allocate a section of height h.

(6)

2. Otherwise, recursively allocate one section of height h+ 1. Partition

this into 2 sections of height h. Enqueue one of these on the free list,

and return the other.

Once all sections are allocated, each interval ofois copied contiguously into

a section of the appropriate height.

Lemma 2.1.

The allocation algorithm preserves all the properties of x2.2.

It requires at most O(lgC) processing time, and writes exactlyn disk blocks

during the operation.

Proof. If there are f free blocks and f = P

h f

h2

h, then for each

h there

are f

h maximal free sections of height

h on the free list. The algorithm

above simply mimics the algorithm for computing the dierence off andn

as binary numerals. The properties enumerated above follow immediately from this observation.

2.4.

Deletion.

The following steps are taken to remove the object o from

the disk-resident set, and to reclaim its space. First, all of the sections of

o are enqueued on the appropriate free lists. Then, for each height h =

0:::blgCc the space is compacted. First, in memory, a new layout is

determined using the following algorithm. While there are more than d

sections on the list, these steps are repeated:

1. First choose any 2 of these sections. Call thesef 1 and

f

2. From these,

choose one that is not a root. Without loss of generality, assume that this is f 1, and let b 2 be its buddy. 2. Then (a) Removef

2 from the free list.

(b) Record that the page stored at blockb

2 is to be copied to f

2.

(c) Addb

2 to the free list.

3. Now f

1 and its buddy are both on the free list at height

h. Remove

them from the list, merge them, and place the resulting section of height h+ 1 on the next free list.

Once the new layout is determined, it is realized on the disk by copying pages directly to the nal locations that were computed by the algorithm. Of course, if the algorithm determined that a page was to be copied multiple times | rst to one block, and from there to another | the copying phase merely moves it directly from its initial to nal location.

Lemma 2.2.

The deletion algorithm preserves all the properties of x2.2.

The proof again follows from the observation that freeing the space allocated to an object is formally equivalent to adding ntof, as binary numerals.

3. A \Lazy" Variant of the Algorithm

The implementation of the basic operations presented above yields a rather high cost for the deletion of objects, even in an amortized sense.

(7)

In the worst case, the deletion of an object can cause a cascade of com-pactions involving sections of larger height. For example, in the case where each of the sublists at heights 0:::khas exactly one section, freeing a

sin-gle block can require copying nearly 2k +1 additional blocks. Thus, because

the copying costs associated with larger sections is also larger, the cost of a deletion cannot be bounded immediately by the size of the object deleted.

These costs can be bounded by a simple variant of the algorthm above in which the compaction of the deallocation procedure is made \lazy". In other words, once a new layout for the data on the disk has been determined after deallocation of an object, the data is not immediately reorganized. Instead, a record is kept of the changes needed to realized this new organization. As a consequence, a deletion will not initiate any disk activity and, as argued below, the additional overhead incurred by materializing an object will be proportional to its size.

To realize this, each section on the free list will be designated either dirty or clean: If no physical block of a section contains valid data (i.e. a page of a disk{resident object that is not stored in some other block) then the section is clean otherwise it is dirty. Each dirty section also carries with it a list of the target blocks or sections that are to receive the valid data that its own blocks or subsections contains. The materialization and deletion procedures above are then modied as follows.

3.1.

Deletion.

Whenever sections are merged during the deallocation pro-cedure, their respective lists are concatenated, and the physical movement of data to new blocks is postponed. Hence the invariant properties 2 and 3 ofx2.2 may in fact be violatedonthedisk | since there may be more than

one unoccupied section of each height | yet the property is always main-tained on the memory{resident free list. Since the deletion procedure aects only these data structures, there is no copying of disk{resident data. 3.2.

Materialization.

When an object is materialized, the algorithm ofx2

section is used, but with a minor modication. Before the object is written to disk, the valid data occupying the sections allocated to the object are rst moved to the target locations recorded previously.

Theorem 3.1.

When modied as above, the storage management algorithm can dynamically maintain a disk-resident set of objects. For each object o

of n blocks, the algorithm guarantees: (1) that while o is disk-resident, at

most dlgne seeks are required whenever o is retreived in its entirety and

(2) that materialization of o on disk requires fewer than n additional disk

reads and writes (3) that deletion of o from the disk requires no additional

disk activity.

Proof. The rst guarantee follows from the fact that invariant property 1 continues to hold, both in memory and on disk. Hence each objectohas at

most dlgjoje sections. The second point follows from the observation that

(8)

the number of blocks contained ino itself. The last guarantee is immediate

from the description of the deallocation procedure.

Note, however, that cost incurred by adopting this lazy behavior is a larger size for the memory{resident free list.

4. Conclusions

As noted in x1, the algorithms proposed above resemble main memory

algorithms for dynamic data structures, like the binomial heap CLR91], and those for dealing with decomposible search problems, as in Mel84]. It also has some relation to the buddy system proposed in Kno65, LD91] for ecient main memory (DRAM) storage allocators. However, the latter is required to allocate a contiguous chunk of memory for each object materi-alized. This results in fragmentation, and and motives the need for either a re-organization process or a garbage collector GR93].

We conjecture that the lazy algorithm proposed above is in fact provides an optimal organization for bounding the number of seeks obtained when caching a working set of objects on a linear storage medium, while both minimizing the overhead required to maintain the organization as working set changes and maximizing the utilization of the medium (the amount of data that is cached). More precisely,

Conjecture 4.1.

Under the assumptions of x1, any storage management

scheme which guarantees that every disk{resident object oincurs fewer than dlgjojeseeks for sequential access, either requires more thanjoj;1 additional

disk writes for materializing some object o, or cannot fully utilize the space

of the device, from some sequence of requests.

Acknowledgements

This research was supported in part by the National Science Foundation under grants IRI-9110522, IRI-9258362 (NYI award), CDA-9216321, and CCR-9207422 by a DOD contract and by an unrestricted cash/equipment gift from Hewlett-Packard.

References

Be95] T.E. Bell. Harvesting remote{sensing data.IEEE Spectrum, Volume 32, Num-ber 3, March 1995.

CLR91] T.H. Cormen, C.E. Leiserson and R.L. Rivest. Introduction to Algorithms, chapter 20, pages 400{419. The MIT Press, 1991.

GIZ94] S. Ghandeharizadeh, D. Ierardi and R. Zimmerman.Management of Space in Hierarchical Storage Devices, Technical Report USC{CS{94{598, University of Southern California, Department of Computer Science. November, 1994. GR93] J. Gray and A. Reuter. Transaction Processing: Concepts and Techniques,

chapter 13, pages 670{671. Morgan Kaufmann, 1993.

Kno65] K. C. Knowlton. A fast storage allocator. Communications of the ACM, 8(10):623{625, October 1965.

(9)

LD91] H. R. Lewis and L. Denenberg. Data Structures & Their Algorithms, chap-ter 10, pages 367{372. Harper Collins, 1991.

Mel84] K. Melhorn.Data Structures and Algorithms 3: Multi{dimensional Searching and Computational Geometry, chapter VII, pages 1{78. Springer Verlag, 1984.

Department of Computer Science, University of Southern California, Los Angeles, CA 90089-0781