Parallelization-aware Caching - On I/O Performance and Cost Efficiency of Cloud Storage: A Clie

In this section, we first analyze the impact of parallelized prefetching on caching with an illustrative example and then describe our cache replacement policy.

4.4.1 Impact of Parallelized Prefetching

Parallelized prefetching changes the relative cost of accessing objects from the cloud. Specifically, when correlated objects can be prefetched in parallel, the access cost is amortized across them, so the cost per object is lower than when each object is fetched individually. A direct implication for caching is that the cost of fetching an object that belongs to a cluster upon a cache miss is significantly smaller (i.e., a lower miss penalty). This changes the equation for making a caching decision: evicting a low-cost object is the wiser choice. Without such awareness, simply combining parallelized prefetching with traditional caching algorithms, such as LRU, ARC [109], and GreedyDual-Size (GDS) [41], would be sub-optimal.
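The amortization argument above can be made concrete with a few lines of arithmetic. This is a minimal sketch with hypothetical numbers (four correlated objects, 2 time units each), and it assumes ideal parallelism, in which a cluster finishes downloading when its slowest member does; a real link may share bandwidth and do worse.

```python
# Hypothetical numbers: four correlated objects, each taking 2 time units
# to download when fetched on its own.
individual_latencies = [2, 2, 2, 2]

# Without prefetching, every object is fetched on its own cache miss.
sequential_total = sum(individual_latencies)

# Assumption: ideal parallelism, so the whole cluster arrives when its
# slowest member does.
parallel_total = max(individual_latencies)

# Miss penalty attributed to each object once the cost is amortized.
amortized_per_object = parallel_total / len(individual_latencies)

print(sequential_total, parallel_total, amortized_per_object)
```

Under these assumptions, the per-object penalty falls from 2 time units to 0.5, which is why a clustered object is a cheaper eviction candidate than its individual latency suggests.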

4.4.2 An Illustrative Example

To illustrate the impact of parallelized prefetching on caching, we give a simple example in Table 4.1 to show the difference between the caching scheme of Pacaca and the traditional LRU caching scheme, which is widely adopted in current cloud-based storage systems [3, 114, 7, 8, 9, 125, 32, 140]. In the example, both schemes handle the same access stream in the scenario of parallelized prefetching (downloading all the objects of a cluster in parallel upon a related cache miss). Table 4.2 describes the sizes, latencies, and the access costs of the objects and clusters.

Table 4.1. An Illustrative Example of Pacaca’s Caching Scheme

| Step | Access | LRU cache | LRU lat. | Pacaca cache | Pacaca lat. |
|---|---|---|---|---|---|
| 1 | A | [A] | 7 | [A] | 7 |
| 2 | B1 | [B1, B2, B3, B4, A] | 2 | [A, {B1, B2, B3, B4}] | 2 |
| 3 | B2 | [B2, B1, B3, B4, A] | 0 | [A, {B1, B2, B3, B4}] | 0 |
| 4 | B3 | [B3, B2, B1, B4, A] | 0 | [A, {B1, B2, B3, B4}] | 0 |
| 5 | B4 | [B4, B3, B2, B1, A] | 0 | [A, {B1, B2, B3, B4}] | 0 |
| 6 | C1 | [C1, C2, C3, C4, B4, B3, B2, B1] | 2 | [A, {C1, C2, C3, C4}] | 2 |
| 7 | C2 | [C2, C1, C3, C4, B4, B3, B2, B1] | 0 | [A, {C1, C2, C3, C4}] | 0 |
| 8 | C3 | [C3, C2, C1, C4, B4, B3, B2, B1] | 0 | [A, {C1, C2, C3, C4}] | 0 |
| 9 | C4 | [C4, C3, C2, C1, B4, B3, B2, B1] | 0 | [A, {C1, C2, C3, C4}] | 0 |
| 10 | B1 | [B1, C4, C3, C2, C1, B4, B3, B2] | 0 | [A, {B1, B2, B3, B4}] | 2 |
| 11 | B2 | [B2, B1, C4, C3, C2, C1, B4, B3] | 0 | [A, {B1, B2, B3, B4}] | 0 |
| 12 | B3 | [B3, B2, B1, C4, C3, C2, C1, B4] | 0 | [A, {B1, B2, B3, B4}] | 0 |
| 13 | B4 | [B4, B3, B2, B1, C4, C3, C2, C1] | 0 | [A, {B1, B2, B3, B4}] | 0 |
| 14 | A | [A, B4, B3, B2, B1] | 7 | [A, {B1, B2, B3, B4}] | 0 |
| Total time | | | 18 | | 13 |

Note: This example illustrates the advantages of the caching scheme of Pacaca over the traditional LRU caching scheme in the scenario of parallelized prefetching, in which all the objects of a cluster are downloaded in parallel upon a related cache miss. In this example, the cache space is set to 16, and the cache is empty before Step 1. The objects shown in the cache from left to right have caching priorities from high to low. The objects of the lowest caching priority have the least “value” to be held in cache. Objects are downloaded from the cloud only in the steps with a nonzero latency. The sizes, downloading latencies, and costs of the objects and clusters are shown in Table 4.2.

Table 4.2. Access Costs of the Objects/Clusters

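| Object/Cluster | Size | Latency | Latency/Size |
|---|---|---|---|
| A | 8 | 7 | 0.875 |
| {B1, B2, B3, B4} | 8 | 2 | 0.25 |
| B1 | 2 | 2 | 1 |
| B2 | 2 | 2 | 1 |
| B3 | 2 | 2 | 1 |
| B4 | 2 | 2 | 1 |
| {C1, C2, C3, C4} | 8 | 2 | 0.25 |
| C1 | 2 | 2 | 1 |
| C2 | 2 | 2 | 1 |
| C3 | 2 | 2 | 1 |
| C4 | 2 | 2 | 1 |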

Note: {B1, B2, B3, B4} denotes the cluster containing objects B1, B2, B3, and B4; {C1, C2, C3, C4} denotes the cluster containing objects C1, C2, C3, and C4. The latency of a cluster is the time units of downloading the objects of the cluster in parallel. The cost of each object or cluster is calculated by latency/size.
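The Latency/Size column of Table 4.2 can be recomputed directly from the sizes and latencies. The sketch below does so for a representative subset of the rows; the dictionary layout and the function name `cost` are illustrative choices, not part of Pacaca.

```python
# (size, latency) pairs copied from Table 4.2; one object per cluster is
# enough because the remaining rows are identical.
entries = {
    "A": (8, 7),
    "{B1, B2, B3, B4}": (8, 2),
    "B1": (2, 2),
    "{C1, C2, C3, C4}": (8, 2),
    "C1": (2, 2),
}


def cost(size, latency):
    """Miss penalty per size unit, as defined in the table note."""
    return latency / size


costs = {name: cost(size, latency) for name, (size, latency) in entries.items()}

# List the cheapest eviction candidates first.
for name, value in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{name}: {value}")
```

A cluster fetched in parallel costs 0.25 per size unit, whereas any one of its members fetched alone costs 1, and the large object A costs 0.875.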

As Table 4.1 shows, Pacaca completes the access stream in less total time than LRU (13 time units vs. 18 time units). Initially, the two caches hold the same content. At Step 6, Pacaca and LRU begin to make distinct caching decisions. LRU bases its caching decisions only on the recency of each object; finding that object A has been accessed less recently than the other objects, it evicts object A. This decision leads to a cache miss on object A at a later time (Step 14), causing a high miss penalty (7 time units). By contrast, knowing that objects B1, B2, B3, and B4 are correlated and can be fetched in parallel as the cluster {B1, B2, B3, B4}, Pacaca estimates that the miss penalty of the cluster is lower than that of object A (0.25 cost units vs. 0.875 cost units). Thus, Pacaca evicts the cluster {B1, B2, B3, B4}, which leads to a much lower penalty (2 time units) at Step 10.

This example clearly illustrates the impact of parallelized prefetching on caching and demonstrates the importance of considering parallelism and object correlations when deciding the victim objects.
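The LRU column of Table 4.1 can be replayed in a few lines. The following is a minimal sketch, not the system's implementation: the names `replay_lru`, `SIZE`, `CLUSTER`, and `CLUSTER_LATENCY` are hypothetical, and the sketch assumes that a miss always downloads every member of the cluster that is not yet cached and pays the cluster's parallel latency.

```python
from collections import OrderedDict

# Object sizes and cluster latencies taken from Table 4.2.
SIZE = {"A": 8, **{f"{c}{i}": 2 for c in "BC" for i in range(1, 5)}}
CLUSTER = {"A": ["A"]}
for c in "BC":
    members = [f"{c}{i}" for i in range(1, 5)]
    for m in members:
        CLUSTER[m] = members
CLUSTER_LATENCY = {"A": 7, "B": 2, "C": 2}  # keyed by the first letter
CAPACITY = 16

STREAM = ["A", "B1", "B2", "B3", "B4", "C1", "C2", "C3", "C4",
          "B1", "B2", "B3", "B4", "A"]


def replay_lru(stream):
    cache = OrderedDict()  # iteration order: least recently used first
    used = 0
    total = 0
    for x in stream:
        if x in cache:                 # hit: refresh recency only
            cache.move_to_end(x)
            continue
        missing = [m for m in CLUSTER[x] if m not in cache]
        need = sum(SIZE[m] for m in missing)
        while used + need > CAPACITY:  # evict from the LRU end
            victim, _ = cache.popitem(last=False)
            used -= SIZE[victim]
        # Insert so that the requested object ends up most recently used.
        for m in reversed([x] + [m for m in missing if m != x]):
            cache[m] = True
            used += SIZE[m]
        total += CLUSTER_LATENCY[x[0]]
    # Report the cache from highest to lowest priority, as in Table 4.1.
    return total, list(reversed(cache))


total, final_cache = replay_lru(STREAM)
print(total, final_cache)
```

The replay evicts A at Step 6 and pays for it again at Step 14, which accounts for the 18 time units reported for LRU.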

4.4.3 Cache Replacement Policy

Pacaca adopts a cost-aware cache replacement algorithm based on GDS [41]. Our augmented algorithm is capable of recognizing clusters of objects. The objects in a cluster are fetched together in parallel when a related cache miss happens. We use a cluster as the basic unit for cost estimation. An object that has no correlated objects is treated as a special cluster containing a single object.

Figure 4.2 shows the algorithm of the caching scheme. Each cluster is associated with a value H that determines its caching priority (lines 5 and 11). The cluster with the lowest H value is selected as the victim and is evicted first (lines 7-9). The H value is calculated as H(c) = L + Lat(c)/Size(c), which includes two components:

• L is a global inflation value that tracks the H value of the most recently evicted cluster. Since the cluster with the lowest H value is always selected as the victim (lines 7-9), L grows over time, and the L component of H(c) indicates the access recency of the cluster. Thus, a cluster whose H value was computed with a low L has not been accessed recently.

1  initialize L = 0
2  upon the request of object x
3      let c be the cluster containing x
4      if cache hit
5          H(c) = L + Lat(c)/Size(c)
6      if cache miss
7          while not enough cache space
8              update L = min(H)
9              evict cluster d such that H(d) = L
10         parallelized prefetching for cluster c
11         H(c) = L + Lat(c)/Size(c)

Figure 4.2. Cache Replacement Algorithm

• Lat(c)/Size(c) evaluates the cost of the cluster, considering the miss penalty of the cluster per size unit. It incorporates the time of fetching the cluster in a parallelized way, Lat(c), and the size of the cluster, Size(c).

From this function, we can see that a cluster that has not been accessed for a long time and has a low miss penalty is of less value for caching. Such a caching policy incorporates several factors: not only access recency but also the parallelization-aware miss penalty and the cluster size.
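The replacement policy of Figure 4.2 can be sketched in Python and run on the access stream of Table 4.1. This is an illustrative sketch under simplifying assumptions, not Pacaca's implementation: the class name `ClusterGDS` and its fields are hypothetical, each request is mapped to its cluster by the first letter of the object name, and Lat(c) and Size(c) cover the whole cluster.

```python
class ClusterGDS:
    """GDS-style cache whose unit of accounting is a cluster."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0
        self.L = 0.0     # global inflation value (line 1)
        self.H = {}      # cached cluster -> H value
        self.size = {}   # cached cluster -> size

    def access(self, cluster, size, latency):
        """Serve one request; return the latency paid (0 on a hit)."""
        if cluster in self.H:                      # cache hit (lines 4-5)
            self.H[cluster] = self.L + latency / size
            return 0
        if size > self.capacity:                   # too large to cache
            return latency
        while self.used + size > self.capacity:    # lines 7-9
            victim = min(self.H, key=self.H.get)
            self.L = self.H.pop(victim)
            self.used -= self.size.pop(victim)
        self.H[cluster] = self.L + latency / size  # lines 10-11
        self.size[cluster] = size
        self.used += size
        return latency


# (size, parallel-fetch latency) per cluster, from Table 4.2.
CLUSTERS = {"A": (8, 7), "B": (8, 2), "C": (8, 2)}
STREAM = ["A", "B1", "B2", "B3", "B4", "C1", "C2", "C3", "C4",
          "B1", "B2", "B3", "B4", "A"]

cache = ClusterGDS(capacity=16)
total = sum(cache.access(x[0], *CLUSTERS[x[0]]) for x in STREAM)
print(total, cache.L, cache.H)
```

On this stream the sketch evicts cluster B at Step 6 and cluster C at Step 10 while keeping A cached throughout, for a total of 13 time units, which matches the Pacaca column of Table 4.1.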

It is worth noting that the latency function, Lat(c), and the size function, Size(c), should involve only the objects that have been accessed on demand rather than the entire originally identified cluster. This is because some prefetched objects may already have been evicted due to mis-prediction, while others have not yet reached their expiration time and are still waiting to be accessed (see Section 4.3.2). Similarly, when a victim cluster is evicted, only the objects that have been accessed on demand are evicted. Prefetched objects that are detected to have expired are evicted by the mis-prefetching handler (see Section 4.3.2 and Section 4.5).
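One way to picture this on-demand accounting is shown below. It is a sketch under stated assumptions rather than the actual data structure: the classes `CachedObject` and `Cluster` and the flag `on_demand` are hypothetical, and Lat(c) over a subset of objects is approximated by the slowest individual latency, which assumes ideal parallelism.

```python
from dataclasses import dataclass, field


@dataclass
class CachedObject:
    size: int
    latency: float           # latency of fetching this object alone
    on_demand: bool = False  # True once the application has requested it


@dataclass
class Cluster:
    objects: dict = field(default_factory=dict)  # name -> CachedObject

    def demanded(self):
        return [o for o in self.objects.values() if o.on_demand]

    def size(self):
        # Size(c): only objects that were accessed on demand count.
        return sum(o.size for o in self.demanded())

    def lat(self):
        # Lat(c): assumed to be the slowest on-demand object (ideal parallelism).
        return max((o.latency for o in self.demanded()), default=0.0)

    def cost(self):
        s = self.size()
        return self.lat() / s if s else 0.0

    def evict_demanded(self):
        """Evict on-demand objects; return the names that remain cached."""
        self.objects = {n: o for n, o in self.objects.items() if not o.on_demand}
        return list(self.objects)


# Hypothetical state: B1 and B2 were requested; B3 and B4 are still only prefetched.
b = Cluster({f"B{i}": CachedObject(size=2, latency=2, on_demand=(i <= 2))
             for i in range(1, 5)})
size_b, lat_b, cost_b = b.size(), b.lat(), b.cost()
left_for_handler = b.evict_demanded()
print(size_b, lat_b, cost_b, left_for_handler)
```

In this state the cluster is charged for 4 size units rather than 8, and evicting it leaves B3 and B4 in the cache until they are accessed or until the mis-prefetching handler removes them.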
