
Kun Fu and Roger Zimmermann

Integrated Media Systems Center, University of Southern California
Los Angeles, CA, USA 90089-2561
[kunfu, rzimmerm]@usc.edu

ABSTRACT

Presently, IP-networked real-time streaming media storage has become increasingly common as an integral part of many applications. In recent years, a considerable amount of research has focused on scalability issues in storage systems. Random placement of data blocks has proven to be an effective approach to balance heterogeneous workloads in multi-disk environments. However, the main disadvantage of this technique is that statistical variation can still result in short term load imbalances in disk utilization, which in turn cause large variances in latencies. In this paper, we propose a packet level randomization (PLR) technique to address this challenge. We analytically quantify the exact performance trade-off between our PLR approach and the traditional block level randomization (BLR) technique. Our preliminary results show that the PLR technique outperforms the BLR approach and achieves much better load balancing in multi-disk storage systems.

Keywords: Randomized load balancing, scalable storage system, continuous media server

1. INTRODUCTION

Large scale digital continuous media (CM) servers are currently being deployed for a number of different applications. For example, cable television providers now offer video on demand services in all the major markets in the United States. A large library of full length feature movies and television shows requires a vast amount of storage space and transmission bandwidth. Magnetic disk drives are usually the storage devices of choice for such streaming servers, and they are generally aggregated into arrays to support many concurrent users. Multi-disk continuous media server designs can largely be classified into two different paradigms: (1) data blocks are striped in a round-robin manner across the disks and blocks are retrieved in cycles or rounds on behalf of all streams; (2) data blocks are placed randomly across all disks and data retrieval is based on a deadline for each block. The first paradigm attempts to guarantee the retrieval or storage of all data and is often referred to as deterministic. With the second paradigm, by its very nature of randomly assigning blocks to disks, no absolute guarantees can be made. For example, a disk may briefly be overloaded, resulting in one or more missed deadlines. This approach is often called statistical.

One might at first be tempted to declare the deterministic approach intuitively superior. However, the statistical approach has many advantages. For example, the resource utilization achieved can be much higher because the deterministic approach must use worst case values for all parameters such as seek times, disk transfer rates, and stream data rates, whereas the statistical approach may use average values. Moreover, the statistical approach can be implemented on widely available platforms such as Windows or Linux, which do not provide hard real-time guarantees. It also lends itself very naturally to supporting a variety of media types that require different data rates (both constant (CBR) and variable (VBR)) as well as interactive functions such as pause, fast-forward and fast-rewind. Moreover, it has been shown that the performance of a system based on the statistical method is on par with that of a deterministic system [1]. Finally, a recent study [2] argues that the statistical approach can support on-line data reorganization much more efficiently than the deterministic method. Note that efficient on-line data reorganization is crucial for a scalable storage system, since data rearrangement is inevitable due to disk addition and removal operations.

Even though the random/deadline-driven design is very resource efficient and ensures a balanced load across all disk devices over the long term, there is a non-negligible chance that retrieval deadlines are violated. Short term load fluctuations occur because occasionally consecutive data blocks may be assigned to the same disk drive by the random location generator. During the playback of such a media file, the block retrieval request queue may temporarily hold too many requests such that not all of them can be served by their required deadline. In this paper we introduce a novel packet-level randomization (PLR) technique which significantly reduces the occurrence of deadline violations with random data placement.

This research has been funded in part by NSF grants EEC-9529152 (IMSC ERC), IIS-0082826 and CMS-0219463, and unrestricted cash/equipment gifts from Intel, Hewlett-Packard and the Lord Foundation.

The remainder of this paper is organized as follows. Section 2 relates our work to prior research. Section 3 presents our proposed design approach. The performance analysis is contained in Section 4. Finally, Section 5 outlines our future plans.

2. RELATED WORK

There are three basic mechanisms to achieve load balancing in striped multi-disk multimedia storage systems. One approach is to use large stripes that access many consecutive blocks from all disks with a single request for each active stream. This approach provides perfect load balancing because it guarantees that the number of disk I/Os is the same on all the devices. However, it results in extremely large data requests when the number of disks is large. Furthermore, it does not efficiently support unpredictable access patterns. The second approach is to use small requests accessing just one block on a single disk, with sequential requests cycling over all the disks. This technique does not support unpredictable access patterns well either. The third technique randomly allocates data blocks to disks [1] and therefore supports unpredictable workloads efficiently. With the third approach, there has been a proposal to counteract the short term load imbalance problem by random data replication [1], which requires more space and adds more complexity to the design and implementation. To our knowledge, no previous work has identified and quantified the trade-off between randomization at the packet and block levels when fine-grained load balancing is desired or required.

3. DESIGN APPROACH

We assume a multi-disk, multi-node streaming media server cluster design similar to the ones introduced in our previous research activities. Our first prototype implementation, called Yima [6], was a scalable real-time streaming architecture to support applications such as video-on-demand and distance learning on a large scale. Our current generation system, termed the High-performance Data Recording Architecture (HYDRA) [7], improves and extends Yima with real-time recording capabilities. In the HYDRA architecture each cluster node is an off-the-shelf personal computer with attached storage devices and, for example, a Fast Ethernet connection. The HYDRA server software manages the storage and network resources to provide real-time service to the various clients that are requesting media streams.

For load-balancing purposes, without requiring data replication, a multimedia object $X$ is commonly striped into blocks, e.g., $X_0, X_1, \ldots, X_{n-1}$, across the disk drives that form the storage system. Because of its many advantages, we assume a random assignment of blocks to the disk drives.

3.1. Packets versus Blocks

Packet-switched networks such as the Internet optimally transmit relatively small quanta of data per packet (for example 1400 bytes). On the other hand, magnetic disk drives operate very inefficiently when data is read or written in small amounts. This is because disk drives are mechanical devices that require a transceiver head to be positioned in the correct location over a spinning platter before any data can be transferred. The seek time and rotational latency, even though unavoidable, are pure overhead. Figure 2 shows the relative overhead experienced with a current generation disk drive (Seagate Cheetah X15) as a function of the retrieval block size. The overhead was calculated based on the seek time needed to traverse half of the disk’s surface plus the average rotational latency. As illustrated, only large block sizes beyond one or two megabytes allow a significant fraction of the maximum bandwidth to be used (this fraction is also called the effective bandwidth). Consequently, media packets need to be aggregated into larger data blocks for efficient storage and retrieval. Traditionally this has been accomplished as follows.

Block-Level Randomization (BLR): Media packets are aggregated in sequence into blocks (see Figure 1a). For example, if $n$ packets fit into one block then the data distribution algorithm will place the first $n$ sequential packets into block $X_0$, the next $n$ packets into block $X_1$, and so on. As a result, each block contains sequentially numbered packets. Blocks are then assigned randomly to the available disk drives. We term this technique Block-Level Randomization (BLR). During retrieval, the deadline of the first packet in each block is essentially the retrieval deadline for the whole block. The advantage of BLR is that only one buffer at a time per stream needs to be available in memory across all the storage nodes.
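For concreteness, the aggregation-then-placement policy can be sketched in a few lines of Python. This is an illustrative sketch only; the function name, parameters, and the dictionary-based placement map are hypothetical choices for exposition, not code from the HYDRA implementation.

```python
import random

def blr_place(num_packets, packets_per_block, num_disks, seed=42):
    """Block-Level Randomization (BLR): aggregate sequential packets
    into blocks, then assign each whole block to a random disk."""
    rng = random.Random(seed)
    placement = {}  # packet index -> disk index
    num_blocks = -(-num_packets // packets_per_block)  # ceiling division
    for b in range(num_blocks):
        disk = rng.randrange(num_disks)  # one random disk per block
        first = b * packets_per_block
        for p in range(first, min(first + packets_per_block, num_packets)):
            placement[p] = disk  # all packets of the block share that disk
    return placement
```

Note how consecutive packets always land on the same disk; this is precisely why several consecutive blocks can, by chance, pile up on one drive and cause a short term overload.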


[Figure 1 diagram: (a) packets are aggregated sequentially into blocks ("Packet to Block Aggregation"), then blocks are mapped randomly to disks ("Randomized Block to Disk Mapping"); (b) packets are mapped randomly to disks ("Randomized Packet to Disk Mapping"), then aggregated into blocks on each node.]

Figure 1. Two different randomization schemes that can be applied in a recording system, e.g., HYDRA: Figure 1a shows Block Level Randomization (BLR) and Figure 1b shows Packet Level Randomization (PLR). Note that in this example, each block contains three packets.

[Figure 2 plot: overhead (%) versus retrieval block size (KB) for a Seagate ST336752LC drive.]

Figure 2. Overhead in terms of seek time and rotational latency as a percentage of the total retrieval time, which includes the block transfer time (Seagate Cheetah X15).
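The shape of the curve in Figure 2 follows from a simple model: overhead = (seek + rotational latency) / (seek + rotational latency + transfer time). The short Python sketch below reproduces this calculation; the seek time, rotational latency, and transfer rate are assumed, Cheetah X15-class ballpark figures, not measurements from this paper.

```python
SEEK_MS = 3.6     # assumed seek time to traverse half the disk surface
ROT_MS = 2.0      # average rotational latency: 0.5 * 60,000 ms / 15,000 RPM
RATE_MB_S = 50.0  # assumed sustained transfer rate

def overhead_percent(block_kb):
    """Seek + rotational latency as a percentage of total retrieval time."""
    transfer_ms = (block_kb / 1024.0) / RATE_MB_S * 1000.0
    waste_ms = SEEK_MS + ROT_MS
    return 100.0 * waste_ms / (waste_ms + transfer_ms)

for kb in (64, 256, 1024, 2048):
    print(f"{kb:5d} KB -> {overhead_percent(kb):5.1f}% overhead")
```

With these assumed parameters a 64 KB block wastes roughly 80% of the retrieval time on positioning, while a 2 MB block wastes only about 12%, mirroring the trend in Figure 2.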

BLR ensures a balanced load across all disk devices over the long term. However, there is a non-negligible chance that retrieval deadlines are violated. Short term load fluctuations occur because occasionally consecutive data blocks may be assigned to the same disk drive by the random placement algorithm. During the playback of such a media file, the disk retrieval request queue may temporarily hold too many requests such that not all of them can be served by their required deadline. One way to avoid this scenario is to reduce the maximum allowed system utilization. For example, an admission control mechanism may restrict the maximum load to, say, 80%, which greatly reduces the chance of overloads. However, this approach wastes approximately 20% of the available resources most of the time. To allow high disk utilization while still reducing the probability of hot-spots, we propose a novel technique to achieve better load balancing as follows.

Packet-Level Randomization (PLR): With this scheme, each media packet is randomly assigned to one of the storage nodes, where the packets are further collected into blocks (Figure 1b). One advantage of this technique is that during playback data is retrieved randomly from all storage nodes at the granularity of a packet. Therefore, load balancing is achieved at a very small data granularity. The disadvantage is that memory buffers need to be allocated concurrently on all nodes for each stream. However, we believe the benefits of PLR outweigh this disadvantage, since the cost of memory is continually decreasing. Therefore, packet-level randomization is a promising technique for high-performance media servers. In the next section we evaluate the load-balancing properties of PLR and BLR in terms of the uniformity of the data distribution on each disk.
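The corresponding placement policy for PLR is even simpler. Again, this is a hypothetical sketch in the same style as the BLR sketch above, not code from HYDRA:

```python
import random

def plr_place(num_packets, num_disks, seed=42):
    """Packet-Level Randomization (PLR): each packet is independently
    sent to a random storage node, where packets are then collected
    into blocks."""
    rng = random.Random(seed)
    return {p: rng.randrange(num_disks) for p in range(num_packets)}
```

Because every packet is an independent random choice, no run of consecutive packets can gravitate to a single disk the way whole blocks can under BLR.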

4. PERFORMANCE ANALYSIS

With PLR, let the random variable $Y$ denote the number of packets assigned to a specific disk. As proposed in [2], the uniformity of the data distribution can be measured by the standard deviation $\sigma_Y$ and the coefficient of variation $CV_Y$ of $Y$. Table 1 lists the notation used throughout this section.




Term        Definition                                                  Units
$D$         Total number of disks
$S$         Total data size                                             MB
$S_p$       Packet size                                                 MB
$S_b$       Block size                                                  MB
$N_b$       Total data size in blocks
$N_p$       Total data size in packets
$X_{PLR}$   The amount of data assigned to a disk in PLR               MB
$X_{BLR}$   The amount of data assigned to a disk in BLR               MB
$Y$         The number of packets assigned to a disk in PLR
$Z$         The number of blocks assigned to a disk in BLR
$R$         Ratio of block size to packet size, i.e., $R = S_b / S_p$

Table 1. List of terms used repeatedly in this study and their respective definitions.

By definition, $CV_Y$ can be derived by dividing $\sigma_Y$ by the mean value of $Y$, $\mu_Y$. If we consider the random assignment of a packet to a disk as a Bernoulli trial, which has probability $\frac{1}{D}$ of successfully allocating the packet to a specific disk, then the total number of successful Bernoulli trials maps naturally to $Y$. Intuitively, $Y$ follows a Binomial distribution, where the number of Bernoulli trials is the total number of packets to be assigned, denoted as $N_p$. Note that $N_p$ can be computed as $N_p = \frac{S}{S_p}$, where $S$ is the total data size and $S_p$ is the packet size. Therefore, the mean, standard deviation and coefficient of variation of $Y$ can be obtained as shown in Equation 1.

$$\mu_Y = \frac{N_p}{D}, \qquad \sigma_Y = \sqrt{N_p \cdot \frac{1}{D}\left(1 - \frac{1}{D}\right)}, \qquad CV_Y = \frac{\sigma_Y}{\mu_Y} \times 100\% = \sqrt{\frac{D - 1}{N_p}} \times 100\% \qquad (1)$$

With BLR, let the random variable $Z$ denote the number of blocks assigned to a disk. Furthermore, $N_b$ represents the total data size in blocks; $N_b$ can be calculated as $N_b = \frac{S}{S_b}$, where $S_b$ denotes the block size. Similar to PLR, we obtain the mean, standard deviation and coefficient of variation of $Z$ as follows.

$$\mu_Z = \frac{N_b}{D}, \qquad \sigma_Z = \sqrt{N_b \cdot \frac{1}{D}\left(1 - \frac{1}{D}\right)}, \qquad CV_Z = \frac{\sigma_Z}{\mu_Z} \times 100\% = \sqrt{\frac{D - 1}{N_b}} \times 100\% \qquad (2)$$

Let $X_{PLR}$ and $X_{BLR}$ denote the amount of data assigned to a disk with PLR and BLR, respectively. Then, $X_{PLR}$ and $X_{BLR}$ can be calculated by

$$X_{PLR} = Y \cdot S_p, \qquad X_{BLR} = Z \cdot S_b \qquad (3)$$

Using Equations 3 and 1, we obtain the mean, standard deviation and coefficient of variation of $X_{PLR}$ as

$$\mu_{X_{PLR}} = \frac{S}{D}, \qquad \sigma_{X_{PLR}} = S_p \sqrt{N_p \cdot \frac{1}{D}\left(1 - \frac{1}{D}\right)}, \qquad CV_{X_{PLR}} = \sqrt{\frac{D - 1}{N_p}} \times 100\% \qquad (4)$$

Similarly, with Equations 3 and 2, we can obtain the mean, standard deviation and coefficient of variation of $X_{BLR}$ as

$$\mu_{X_{BLR}} = \frac{S}{D}, \qquad \sigma_{X_{BLR}} = S_b \sqrt{N_b \cdot \frac{1}{D}\left(1 - \frac{1}{D}\right)}, \qquad CV_{X_{BLR}} = \sqrt{\frac{D - 1}{N_b}} \times 100\% \qquad (5)$$

Finally, from Equations 4 and 5, we obtain:

$$\mu_{X_{BLR}} = \mu_{X_{PLR}} \qquad (6)$$

and

$$\frac{\sigma_{X_{BLR}}}{\sigma_{X_{PLR}}} = \frac{CV_{X_{BLR}}}{CV_{X_{PLR}}} = \sqrt{R} \qquad (7)$$


where $R$ represents the ratio of block size to packet size, i.e., $R = \frac{S_b}{S_p}$. As shown in Equation 6, the mean values of $X_{BLR}$ and $X_{PLR}$ are the same, which confirms that both the BLR and PLR schemes achieve the same level of load balancing in the long run, as expected. However, with respect to short term load balancing, PLR has a significant advantage over BLR, as indicated by Equation 7. Note the interesting fact that the ratio of $\sigma_{X_{BLR}}$ and $\sigma_{X_{PLR}}$ is determined solely by the ratio of block size to packet size. Next, we discuss the performance advantages of PLR in detail.
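Equation 7 is easy to verify numerically. The following Monte Carlo sketch (with arbitrary, illustrative parameter values) estimates the per-disk load standard deviation under both schemes and compares the ratio with $\sqrt{R}$:

```python
import math
import random
from statistics import pstdev

def avg_sigma(num_packets, num_disks, packets_per_block, scheme, trials=100):
    """Average per-disk load standard deviation (in packets) over many trials."""
    rng = random.Random(1)
    total = 0.0
    for _ in range(trials):
        load = [0] * num_disks
        if scheme == "PLR":
            for _ in range(num_packets):   # each packet picks a disk independently
                load[rng.randrange(num_disks)] += 1
        else:                              # BLR: a whole block lands on one disk
            for _ in range(num_packets // packets_per_block):
                load[rng.randrange(num_disks)] += packets_per_block
        total += pstdev(load)
    return total / trials

R = 100  # block-to-packet size ratio (packets per block)
sigma_blr = avg_sigma(20_000, 16, R, "BLR")
sigma_plr = avg_sigma(20_000, 16, R, "PLR")
print(f"simulated sigma_BLR/sigma_PLR = {sigma_blr / sigma_plr:.2f}; "
      f"sqrt(R) = {math.sqrt(R):.2f}")
```

With these parameters the simulated ratio comes out close to $\sqrt{100} = 10$, matching Equation 7.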

[Figure 3 plot: ratio of load imbalance in BLR and PLR versus the number of packets in a disk block.]

Figure 3. Ratio of load imbalance, $\frac{\sigma_{X_{BLR}}}{\sigma_{X_{PLR}}}$, for different block to packet size ratios $R = \frac{S_b}{S_p}$.

Impact of Block to Packet Size Ratio $R$: Figure 3 shows the ratio of load imbalance $\frac{\sigma_{X_{BLR}}}{\sigma_{X_{PLR}}}$ as a function of the block to packet size ratio $R = \frac{S_b}{S_p}$. We observe that as $R$ increases, the ratio increases sharply, in proportion to $\sqrt{R}$. The figure clearly indicates the significant performance gain of PLR over BLR. In fact, many hundreds of packets in one block is not unusual; for example, a packet size of $S_p = 1{,}400$ bytes (the example packet size of Section 3.1) combined with a block size of $S_b = 1$ MB, a quite common configuration in current generation streaming servers, yields $R \approx 750$. This figure also suggests that, as the block size increases, short term load balancing with BLR degrades quickly. Recall that with larger disk blocks the disk overhead decreases (see Figure 2). Therefore, in BLR there is a trade-off in determining the optimal block size, as the sketch below illustrates.
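The two halves of this trade-off can be tabulated together. The sketch below combines the assumed disk overhead model from the Figure 2 sketch with Equation 7, using the 1,400 byte packet size mentioned in Section 3.1; all drive parameters remain illustrative assumptions:

```python
import math

PACKET_BYTES = 1400   # example packet size from Section 3.1
WASTE_MS = 5.6        # assumed seek + rotational latency (see Figure 2 sketch)
RATE_MB_S = 50.0      # assumed sustained transfer rate

for block_kb in (64, 256, 1024, 2048):
    transfer_ms = block_kb / 1024.0 / RATE_MB_S * 1000.0
    overhead = 100.0 * WASTE_MS / (WASTE_MS + transfer_ms)  # Figure 2 model
    R = block_kb * 1024 / PACKET_BYTES                      # block/packet ratio
    print(f"{block_kb:5d} KB: overhead {overhead:5.1f}%, "
          f"sigma_BLR/sigma_PLR = {math.sqrt(R):5.1f}")
```

Larger blocks drive the overhead column down but the imbalance column up; PLR sidesteps this dilemma because, per Equation 4, its imbalance does not depend on the block size.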

5. CONCLUSIONS

Load balancing is important to ensure overall good performance in a scalable multimedia storage system. Periods of load imbalance can result in poor end-to-end performance for client sessions. This paper identifies and quantifies the performance benefit of the packet level randomization (PLR) scheme over the traditional block level randomization (BLR) scheme. We intend to enhance PLR to support dynamic disk scaling operations. Furthermore, we plan to implement the PLR approach in our streaming prototype, HYDRA, and evaluate its performance with real measurements.

REFERENCES

1. S. Berson, S. Ghandeharizadeh, R. Muntz, and X. Ju, “Staggered Striping in Multimedia Information Systems,” in Proceedings of the ACM SIGMOD International Conference on Management of Data, 1994.

2. J. R. Santos and R. R. Muntz, “Performance Analysis of the RIO Multimedia Storage System with Heterogeneous Disk Configurations,” in ACM Multimedia Conference, (Bristol, UK), 1998.

3. J. R. Santos, R. R. Muntz, and B. Ribeiro-Neto, “Comparing Random Data Allocation and Data Striping in Multimedia Servers,” in Proceedings of the SIGMETRICS Conference, (Santa Clara, California), June 17-21 2000.

4. A. Goel, C. Shahabi, S.-Y. D. Yao, and R. Zimmermann, “SCADDAR: An Efficient Randomized Technique to Reorganize Continuous Media Blocks,” in Proceedings of the 18th International Conference on Data Engineering, pp. 473–482, February 2002.

5. F. Tobagi, J. Pang, R. Baird, and M. Gang, “Streaming RAID: A Disk Array Management System for Video Files,” in First ACM Conference on Multimedia, August 1993.

6. C. Shahabi, R. Zimmermann, K. Fu, and S.-Y. D. Yao, “Yima: A Second Generation Continuous Media Server,” IEEE Computer 35, pp. 56–64, June 2002.

7. R. Zimmermann, K. Fu, and W.-S. Ku, “Design of a Large Scale Data Stream Recorder,” in Proceedings of the 5th International Conference on Enterprise Information Systems (ICEIS 2003), (Angers, France), April 23-26, 2003.
