Tornado: A Capability-Aware Peer-to-Peer Storage Network

(1)

Tornado: A Capability-Aware Peer-to-Peer Storage Network

Hung-Chang Hsiao [email protected]

Chung-Ta King*

[email protected]

Department of Computer Science National Tsing Hua University

Hsinchu, Taiwan 300

Abstract

Peer-to-peer storage networks aim at aggregating the unused storage in today’s resource-abundant computers to form a large, shared storage space. To lay over the extremely variant ma- chines, networks and administrative organizations, peer-to- peer storage networks must be aware of the capabilities of the constituent components to leverage their resources, performance and reliability. This paper reports our design of such a peer-to-peer storage network, called Tornado. Tornado is built on top of two concepts. The first is the virtual home concept, which adds an extra level of abstraction between data and storage nodes to mask the underlying heterogeneity. The second concept is the classification of the storage nodes into “good”

and “bad” according to their static and dynamic capabilities.

Only “good” peers can host virtual homes, whereby introduc- ing quality of services into the storage network. We evaluate Tornado via simulation. The results show that Tornado is comparable with previous systems, where each route takes at most ª^log^Nº hops, a new node takes ª^{log N}º² messages to join, and the memory overhead in each node is ^{O log}

(

^N

)

. Moreover, Tornado is able to provide comprehensive services with features scattered in different systems previously, and takes account of and exploits the heterogeneity in the underlying network environment.

1. Introduction

Research on the peer-to-peer (P2P) storage network [4] has attracted much attention recently. One reason is its ability of aggregating unused storage in today’s resource abundant computers to construct a global-scaled, shared storage space. In a P2P storage network, nodes contribute portions of their unused storage to the storage network. They may come from different administrative domains and may dynamically join and leave the storage system. Functionally, the peer nodes are identical, which can act as a client, a server and a router. Examples of P2P storage networks include Freenet [2], OceanStore [5], PAST [7] and CFS [3].

The core designs [6], [8], [9], [11] of most proposed P2P storage systems adopt a hash-based scheme for data naming and routing to accommodate network dynamics and enable self- administration. The basic idea is to name each peer node and each published data item via a hashing function. A data item with the hashing key k is managed by the peer node whose hashing key is closest to k. To fetch that data item, the request message is routed through the intermediate peer nodes whose hashing keys are closer and closer to k. If a uniform hashing

function (e.g., SHA-1) is employed, the number of data items allocated to each node will be nearly equal. In essence, this hash-based scheme treats all the peer nodes equal. It (pur- posely) ignores the heterogeneity in the underlying systems.

In practice, P2P storage networks are most likely overlaid on top of systems with extreme heterogeneity in hardware and software configurations, network dynamics and administration practices. The participating systems may have vast varying capabilities. For example, some nodes may be large servers with plentiful resources accessed through a reliable and high- speed network; some nodes may be PDAs with wireless connections that have limited resources and unreliable connections.

If the P2P storage network is aware of the capabilities of the constituent systems, then it can leverage their resources, performance and reliability.

In this paper, we discuss how heterogeneity in the underlying systems can be exploited in a hash-based P2P storage network.

We describe the design of a P2P storage network called Tor- nado. Tornado is built on top of two concepts. The first is the virtual home concept, which adds an extra level of abstraction between data and storage nodes to mask the underlying heterogeneity. The second concept is the classification of the storage nodes into “good” and “bad” according to their static and dynamic capabilities. “Good” and active peers can host virtual homes and help forward messages issued from other nodes that may not be good peers. In this way, Tornado can leverage the reliability and performance of the good peers, and introduces quality of services into the storage network. When a good peer becomes overloaded, Tornado seeks another good but inactive peer to relieve the load. Tornado incorporates caching and redundant replication to optimize data accesses and facilitate data availability. Tornado also employs directories [4], which map the hashing address of a data item directly to the address of the node storing this data item, to further shorten data access time.

We evaluate Tornado via simulation and the results show the followings. (1) Each route in Tornado takes at most ª^log^Nº hops to reach its destination. (2) It takes at most ª^{log N}º²^messages for a node to join Tornado. (3) Caches and directories help alleviate the load of nodes and optimize data accesses. (4) Directories can accommodate an increasing number of data items. The performance benefit of caches, however, is affected by the availability of free storage space. (5) Tornado can maintain high data availability and meanwhile exploit physical network locality. These performance results are comparable with previous systems [6], [8], [9], [11]. Moreover, Tornado incorporates several features, such as the use of directories, the ex- ploitation of network locality and the use of fully adaptive routing path, to provide comprehensive services and boost the

* This work was supported in part by the National Science Council, R.O.C., under Grant NSC 90-2213-E-007-076 and by the Ministry of Education, R.O.C., under Grant MOE 89-E-FA04-1-4.

(2)

performance and reliability. These features were previously scattered in different systems. Most importantly, Tornado takes account of and exploits the heterogeneity in the underlying environment.

The remainder of the paper is organized as follows. Section 2 presents the related works. The design of Tornado is given in Section 3. Section 4 discusses the simulation methodology and results. Conclusions of the paper are given in Section 5, with possible future research directions.

2. Related Works

CAN (Content-Addressable Network) [6] partitions the ad- dressing space of peer nodes into n dimensions. Each node is mapped to a coordinate point in the n-dimensional space via n hashing functions. Two nodes are neighbors if their coordi- nates are different only in one dimension. A message is routed greedily towards the neighbor that has the coordinate numerically closest to the requested key. A message has an overhead of ^O

( )

ⁿ ^headers.

Pastry [8] and Tapestry [11] implicitly partition the hashing space of the peers into several ordered segments. A message is routed by going through each segment in order. The forward- ing peer in each segment has a section of b bits in its ID identi- cal to the same b-bit section of the destination address. Con- ceptually, Pastry and Tapestry can be represented as an

( )

^N^b ^b

Olog₂ ⋅2 -way tree-based data routing and locating infrastructure. Each node is virtually associated with a single tree.

Two nodes in a tree have an edge if they share the same ib-bit (where i = ¹^,²^,³^,^"^,^O

( )

^log^N²^b ) section of their hash keys. It follows that the dimension-order routing in Pastry and Tapes- try enforces a route that follows the tree edges with the in- creasing tree levels, i.e., a b-bit section is matched after ad- vancing to the next tree level. This thus limits the available path selection (denoted the routing adaptivity) to ^O

( )

log^N2^b per each tree node. This in turn may reduce the system reliability and performance.

Chord [9] does not partition the addressing space of the peers.

Instead, each node in Chord maintains a finger table consisting of several successors. An immediate successor of a node s is the node with the smallest key greater than s. To send a mes- sage to a node k, node s tries to forward the message to the predecessor of k. The predecessor then forwards the message to k. Chord implicitly associates an ^O

( )

^log²^N -way tree with each physical node. Two nodes in the tree have an edge if the difference λ of their hash keys is λ≥2ⁱ and there does not exist δ ^{such that}λ>δ ≥²ⁱ, where i=1,2,3", ^{O log}

(

^N

)

^.

To leverage the performance, each node should maintain its successors in its finger table in order to maintain the routing adaptivity in ^O

( )

^log²^N . Otherwise, a message may not be efficiently routed to its destination. Maintaining the finger tables will be cumbersome and inefficient in a dynamic network, since it relies on the routing. Apparently, Chord cannot exploit network locality by using only successors of finger tables.

Tornado is an ^O

( )

^log²^N ^⋅²^b-way tree-based protocol, where b is a constant. It does not decompose the addressing space of the peers into dimensions. It thus can fully exploit the routing adaptivity in ^O

( )

^log²^N ^⋅²^b. This allows flexible selection of the leaders to serve as the children of a tree node. This in turn enables topology-awareness and the use of the proximity routing. The ability to use multiple routing paths not only leverages the system performance but boosts reliability. In addition, Tor- nado takes into account node capability through the virtual home concept.

Tornado first maps data items to virtual homes, which are then mapped to physical, “good” nodes. This gives an extra level of abstraction to mask the underlying heterogeneity. Several virtual homes may be mapped to a physical node, and their hashing values are not statically bound to a physical node. Accord- ing to the ability of a node and the current workloads, the virtual homes represented by various hashing values may be migrated to another active and good peer node. Each active physical node in Tornado is thus responsible for managing the storage of several peer nodes.

3. Tornado

3.1 Virtual Home Concept

Each data item in Tornado has a virtual home, which is a logi- cal entity that stores and manages this data item. The virtual home represents a placeholder for the data item, where the data can be found. A virtual home may contain several data items.

A physical node that participates in Tornado can host zero, one, or more virtual homes. If a peer node hosts one or more virtual home, we call it active. If the node does not host any virtual home, it is inactive. The hosting node should provide the physical resources, e.g., CPU cycles and storage, required by the virtual homes. We can think of the virtual home concept as an additional layer of abstraction in mapping from data items to their storage nodes.

To cope with system heterogeneity and take account of ma- chine capabilities, the peer nodes are designated as “good” or

“bad” according to their static and dynamic capabilities. There is no definite distinction between “good” or “bad” nodes. In general, a good node has plentiful resources, is reliable, and has access to a reliable and high-speed network. To leverage the reliability and performance of good nodes, Tornado en- sures that each active node in the system is a good peer. A good peer contributes its resources to the storage space and can host multiple virtual homes. If an active node becomes overloaded and turns “bad”, it will try to seek an inactive good peer and migrate some of the virtual homes to the latter.

The concept of virtual homes differentiates Tornado from previous works [6], [8], [9], [11] that ignore the heterogeneity of the constituent nodes. The virtual home concept also makes Tornado different from systems that adopt the randomness routing, e.g., Freenet [2]. In these systems, messages randomly visit a predefined number of nodes. There is no guarantee that the messages will reach nodes that store the requested data items. Our scheme is also unlike the route flooding method used in Gnutella, where the messages are breath-flooded to each connected node. Note that the participating nodes may fail, which cause the stored data items unavailable. We will address this issue in Section 3.4.2.

3.2 Virtual Home and Data Naming

Similar to previous works [6], [8], [9], [11], Tornado adopts the hashing scheme to name each data item. However, rather than assigning a unique hashing key to each physical node, Tornado applies the hashing to each virtual home. For each data item in the storage infrastructure, there is a unique hashing key representing it. The collection of the hashing keys is called the data addressing space. Similarly, a unique hashing key is used to represent each virtual home in the system, and the resultant hashing keys form the virtual home addressing space. Usually, the data addressing space is larger than the virtual home addressing space. In addition to data and virtual home addressing spaces, the physical addressing space de-

(3)

notes the addresses of active physical nodes in the system. This space collects relevant address information of active nodes, e.g., their IP addresses and port numbers. As mentioned above, a physical node may maintain several virtual homes.

The allocation of data items is done by mapping a data item to the virtual home that has the key “numerically closest”. If the system state of the infrastructure is stable, then each virtual home will be ideally allocated the same amount of data items due to the use of a uniform hash function. To access a data item, the request can be sent to the home with a key numerically closest to the key of the requested data item.

The uniform hashing scheme helps evenly distribute the data items to the virtual homes. However, physical nodes may not have equal loads because they may host different numbers of virtual homes. On the other hand, since each virtual home manages a similar amount of data items, the load of an active physical node may be estimated based on the number of virtual homes it hosts. Tornado can in turn determine how to employ good peers to provide a reliable and efficient storage.

3.3 Per Physical Node Components

In Tornado, each active node contains a set of virtual homes. It also maintains a virtual-to-physical address mapping table that associates the hashing addresses of virtual homes and the address of the physical node, e.g., its IP address and port number.

A virtual home consists of neighbor tables, routing tables, directory maps, and data storages. The data storage of a virtual home supports both temporary and permanent space for data items, where the temporary space is used as caches.

3.3.1 Neighbor Table

The neighbor table of a virtual home x maintains a set of vir- tual home IDs that are “numerically closest” to x. Logically, homes with numerically close keys are clustered together. A route towards a virtual home advances incrementally to those clusters of virtual homes with similar keys. This creates a

“logical” network locality. Since data are allocated according to the hashing key, this also introduces a logical data locality.

Note that the neighbor table of a virtual home contains the virtual homes whose IDs are greater than its own ID. This enforces messages to move towards homes with larger IDs. Fig- ure 1(a) depicts the neighbor table, where each entry is a virtual home ID pointing to a logically neighboring home. Note that virtual homes are addressed by keys with m bits.

For maintaining the network connectivity, a physical node helps each hosted virtual home to periodically monitor its neighboring homes by consulting the neighbor table. If a physical node fails in Tornado, the data items maintained by

the virtual homes that it hosts will be moved gradually to other physical nodes (see Section 3.4.2). Hence, the neighbor table not only maintains the logical network and data locality but also provides the mechanism for tolerating faults.

3.3.2 Routing Table

The routing table is the core component of Tornado. It consists of several routing levels. Each routing level conceptually gov- erns a range of the virtual home addressing space, and comprises of a virtual home ID (see Figure 2(b)). The virtual home addressing space governed by the A-th routing level with re- spect to a virtual home whose ID is x (denoted home x) is de- fined as

[ )

[ ] [ ( ) )

( )

[ ] [ ( ) )

°¯

°®

<

− ℜ

<

+

≤

≤ ℜ +

− ℜ ℜ ℜ +

−

ℜ

>

+ ℜ

<

≤

−

≤ ℜ

+

− ℜ

−

ℜ

<

+

≤

−

≤ +

−

ș 0 ș and 0

, 0

and 1

ș ș and

0 ,

0 and 1

ș ș 0 ,

ϑ ϑ

ϑ

ϑ ϑ ϑ ϑ

ϑ

ϑ ϑ ϑ ϑ

ϑ

if ș % , ,

ș %

if ș%

ș, , ș if ș,

, (1)

In Equation (1), ℜ is the size of the virtual home addressing space, ϑ^is⁽x⁺₂^ℜA⁾^mod^ℜ, and θ is 2^ª^log^ℜ^º⁻^A.

Conceptually, each routing level will be assigned a few virtual homes, the leaders. Leaders are responsible for forwarding requests to the homes whose IDs fall in the region of the home addressing space governed by them. Note that leaders in a higher routing level are responsible for a smaller region in the home addressing space. The size of address region will be ex- ponentially decreased as the routing level increases. A request will be routed from the leaders in lower routing levels to those in higher routing levels and moved closer to the destination. In other words, the route goes through the homes with keys numerically closer to that of the destination. In this way, a request can be sent to its destination in a logarithmic fashion.

To prevent homes from overloading while forwarding requests, the home IDs of leaders in the A-th routing level are chosen with a numerical difference of _A

2

ℜ . Consequently, leaders of different nodes at the same routing level are different. This helps to distribute the load for relaying requests to different homes.

The density of the routing table represents the logical network locality. Given a virtual home x, a relatively sparse routing table results in a poorer network locality for the homes nu- merically close to x. This also induces poor data locality. As a result, those homes with IDs near or equal to x need to main- tain a stronger network connectivity with other virtual homes.

It is sometimes even necessary to maintain a larger amount of replicas for the stored data items to leverage data availability and distribute loads.

m V irtual H om e ID

(a )

m V irtual H om e ID

(b ) L evel 1 L evel 2 L evel 3

V irtual H om e ID IP + P ort

(c) m

(d) IP + P ort

(e) n D ata ID

T im e contrac t

n D ata ID P erm anent bit

T im e contract D ata address pointer

(f) m

V irtual ho m e ID

Figure 1. The data structures used: (a) the neighbor table, (b) the routing table, (c) the VP mapping table, (d) the inactive list, (e) the directory map, and (f) the hybrid storage space containing the cache and permanent home store

(4)

In Tornado, the size of a routing table is limited by having ª^log^ℜº⁻¹ routing levels in total. The memory overhead per virtual home scales logarithmically to the entire system size, and thus a better scalability is achieved.

3.3.3 Virtual-to-Physical Address Mapping Table

Each physical node maintains a virtual-to-physical address mapping table (see Figure 1(c)), denoted as a VP table. The table associates the virtual home ID to its real network address, e.g., the IP address and port number of the hosting active node.

Since a virtual home will be assigned to an active node, to retrieve data items from a virtual home should consult the VP table to resolve the network address of the associated physical node. The request can then be forwarded to the resolved address.

3.3.4 Inactive List

Each active node also maintains an inactive list that comprises of a set of inactive nodes whose virtual homes are allocated in the active node (see Figure 1(d)). Note that a node just joined Tornado is pessimistically assumed to be a “bad” peer. Each active node periodically monitors the nodes maintained in its inactive list. If it is overloaded and can discover a good peer from the inactive list, it will migrate some number of virtual homes to the latter. The monitoring can be accomplished by manipulating the profiles of nodes whose addresses are stored in the inactive list.

3.3.5 Directory Map

Each entry in the directory map comprises of a data ID, a time contract and a virtual home address. A valid directory entry indicates a shortcut to access the corresponding data item from its virtual home using the supplied IP address by consulting the corresponding VP table. The directory map is shown in Figure 1(e), where each data item is located by an n-bit hashing ad- dress.

Although the directory map helps locate data items, a virtual home may fail and cannot maintain those data items numerically closest to it. We thus associate each entry with a time contract to provide a consistent view for the shortcuts that each node maintains. The time contract is application-specific and specified by the data owner.

3.3.6 Permanent Store and Cache

Each virtual home in Tornado provides some storage for storing data items. The storage provides both the permanent home space and the cache. The permanent space stores those data items designated to the virtual home. Once the permanent store is allocated to a data item, no replacement is allowed unless the associated time contract is expired. If there is still space available in the storage, the extra space can be used for caching data items whose virtual home is somewhere else. Thus, the cache space is highly dependent on the available free space in the mixed storage. In Tornado, the cache simply adopts the least-recently-used (LRU) policy.

Figure 1(f) shows the storage for the permanent home space and cache. Each entry in the storage consists of a permanent bit, the hashing address of the stored data item, the time contract and the associated data address pointer. The permanent bit indicates whether the entry stores a permanent data item or a cached copy. The time contract is an application- programmable value, which denotes the time-to-live (TTL) value of the associated data item. The data address pointer

specifies the local memory/disk address that actually stores the data item.

To achieve high data availability, a data producer can specify the number of replicas, said k, for a particular data item. Tor- nado adopts the limited vectors [4] scheme that additionally associates k−1 vectors with each entry in the data storage.

These vectors point to the k−1 replication nodes. Note that the

−1

k nodes are the virtual homes with the node IDs “numerically closest” to the local node. Once the virtual home fails, the missing data items can still be accessed from the replication nodes with a high probability. Since a data producer will periodically republish its data items to refresh the associated TTL, one of the replication nodes will receive the refreshing message. It then replaces the old, failed virtual home and becomes the new home of the data item. Subsequent requests designat- ing to the old virtual home will eventually be sent to the new virtual home.

3.4 The Algorithms

Due to space constrains, we roughly present the design of algorithms as follows. The detailed can refer to [10].

3.4.1 System Adaptation

Physical nodes may dynamically join and leave Tornado. Tor- nado must efficiently adapt to the dynamic changes in the storage network.

When a node joins Tornado, it will contact an active node to allocate necessary data structures, including the neighbor table, the routing table, the directory map and the data store. The contacted active node will also store the IP address and port number of the joining node in the inactive list. The active node periodically monitors those nodes in the inactive list. If it is overloaded, it will select a good node from its inactive list to distribute the load.

3.4.1.1 Inserting a Physical Node

A node with the hashing key i intending to join the storage infrastructure should first contact a randomly chosen virtual home y hosted by a randomly selected active node via an out- of-band mechanism, e.g., a secure e-mail system. Home y then explores a route towards a virtual home whose ID is numeri- cally closest to i. Let the latter be the virtual home v. Note that a virtual home with an ID n₁ is “numerically closest” to a virtual home with an ID n₂ at the A-th routing level if there does not exist n that satisfies

ª º

2^log^ℜ⁻^A

≤

∂

<

n , (2)

where

otherwise n n if n n

n

n ₁ ₂

2 1

,

, ≥

°¯

°®

−

= ℜ

∂ , and n is the ID of any virtual

home.

Home v then generates ª^log^ℜº⁻¹ messages, each with a destination that is numerically close to =( + ^ℜ)modℜ

2^A

ϑ i , where

ª^log º ¹

,..., 3 , 2 ,

1 ℜ −

=

A . Meanwhile, the active node A locally allocates the data structures (i.e., home i) to host the storage space of the inserting node i and stores i’s address into its inac- tive list. Next, home v forwards the joining request to a leader

x1 with a key numerically closest to ϑ, where A=1. Once the leader x1 receives such a request, it forwards its highest and valid routing level, together with its neighbor table, to home v.

Home x1 also determines whether the ID i of the newly created virtual home can replace any of the routing entries and neighbor links in its routing and neighbor tables with the corresponding entries in the VP table of the hosted active node.

(5)

Once home v receives the partial routing entries and neighbor table for the inserting node i from home x1, it helps fill the routing table of home i with the received routing entries and determines which neighbor homes of x1 can be its neighbors.

Additionally, the corresponding addresses are inserted into the VP table of the active node hosting home i for the entries up- dated in home i’s routing and neighbor tables. Similarly, the joining request for home i will route to homes x₂,x₃,",

ª^log^ℜº⁻¹

x via v. Their corresponding routing levels and neighbor tables will be sent to v to help construct home i’s routing and neighbor tables with the associated entries of the VP table. These visited homes will also update their routing, neighbor and VP tables if necessary. From the above discus- sion, we can see that node join takes an overhead of O

(

logℜ

)

²

messages.

3.4.1.2 Removing a Physical Node

Each active node in Tornado needs to periodically monitor the neighbors for each virtual home it maintains. If a virtual home cannot connect to a neighbor, it will remove that neighbor from its neighbor table and then find another. Note that the neighbor table of home x should maintain the neighbors whose IDs are greater than x. Tornado accomplishes this by sending a message with the destination address x from home x to dis- cover the neighbor home with an ID closest to x. Each home receiving such a message will help forward the message by consulting its routing and neighbor tables, although the forwarding cannot use the routing entries or neighbor links desig- nating the destination address x. Since homes with similar IDs are logically clustered together, the request’s path length is expected to be small. Note that if the address of an inactive virtual home appears in the inactive list of an active node, such an address will be removed from the inactive list.

3.4.1.3 Self-Healing

Since nodes may dynamically join and leave the Tornado network, each physical node in Tornado thus needs to update the routing and neighbor tables of each hosted virtual home to reflect the dynamic network states. The update helps optimize the routing overhead and consequently improves routing effi- ciency for accessing data items (see Section 3.4.2), nodes insertion and the search of neighbor nodes.

Three possible events may trigger an active node to update the routing and neighbor tables of a home. First, an active node detects that several leaders in the routing table stop forwarding messages. This may be due to the failure of communication links or the failure of active nodes hosting that home. Second, a hosted home cannot communicate with its neighbor nodes.

Third, the time for a periodical update is expired. In either case, an active node should help a home to reconstruct its routing table. The repair of the neighbor table adopts an approach similar to that of node joining, except that the destination is designated to the home to be repaired. Again, such a routing cannot utilize the routing entries or neighbor links whose IDs are identical to the repaired home.

3.4.1.4 Migrating Virtual Homes

As mentioned above, when a node joins Tornado, there is a virtual home created for that joining node. The virtual home will be allocated to an active Tornado node with a key numerically closest to the created virtual home. An active Tornado node, however, may be overloaded with excessive virtual homes. The active node may spawn another inactive peer to share its load. Such a spawned peer should be a good peer that

can provide reliable communication and responsive computa- tion.

In Tornado, each active node x should monitor those nodes whose virtual homes are temporarily managed by x, i.e., those nodes appear in the inactive list in x. The monitoring deter- mines whether an inactive peer can (1) provide reliable and agile communication, (2) perform computational-intensive operations and (3) contribute its storage spaces for virtual homes. These can be accomplished by manipulating the profile of the inactive node and are unspecified in this work.

Note that different nodes may have different Threshold values, where Threshold denotes the maximum number of virtual homes a physical node can maintain. A newly spawn active node will create a home space for each migrated home. Each migrated home performs the node insertion operations to set up its routing and neighbor tables, and update the routing, neighbor and VP tables of the node already active. The new active node also creates an inactive list to monitor the migrated homes representing the inactive peers. Meanwhile, the active node originally hosting the migrated homes deletes the routing tables, the neighbor tables, the VP tables, the directory maps, the data stores and the associated entries in the inactive list for the migrated homes hosted. Another thing to note is that migrated homes may appear in the routing and neighbor tables of several virtual homes. Due to the inconsistent virtual-to- physical address mapping in the VP tables of their active nodes, they will fail to communicate with the migrated nodes. The migrated homes thus will be gradually removed from the routing, neighbor and VP tables of these virtual homes. Also removed are the entries in the inactive lists of their active nodes.

Since an active node periodically update its VP table and the routing and neighbor tables of each hosted virtual home, this guarantees that the tables of each virtual home will have the update-to-date states in a stable storage network. Possibly, each entry in the routing, neighbor and VP tables is associated with a time-to-live value. Once such a value is expired, the entry is invalidated and thus stale routes towards virtual homes are removed.

3.4.2 Data Accessing

3.4.2.1 Retrieving

Suppose a virtual home with the ID r wants to retrieve a data item with the key d. A request message will be forwarded to the homes specified by the various routing levels in the intermediate homes (i.e., the leaders) visited. Such a message will first be forwarded to a home, x1, indicated by the first routing level of the requesting home x0=r. Then x1 will consult its local routing table and forward the request to a home, x2, specified by its second routing level. In this way, x₃,x₄,",x_ª_log_ℜ_º₋₁ are visited. Finally, the message is sent to the home whose address is numerically closest to the key of the data item, as indicated by the lowest routing level of x_ªlogℜ_º−1.

It is possible that there is no valid route to advance to the next hop in the next routing level. In this case, the route should be forwarded to a home that can provide a valid route via either neighbor links or homes indicated by the current routing level.

To further improve the access performance, three optimiza- tions are included. First, if a request can be satisfied by the cache of an intermediate home, the request will be immediately served and responded to the requester. Second, if the requested data item is not in the local cache, the intermediate home will consult its local directory map. If its directory map has a valid entry, the request will be directly forwarded to its destination.

(6)

An entry is valid in the directory map if it is not expired and the address tag matches the ID of the requested data item. Fi- nally, the request can be sent to the virtual home indicated by the highest routing level in the routing table of a visited home.

When a requested data item is returned, it follows a reverse route. The data item will be cached in the local cache of the requesting virtual home and each intermediate home. Also stored are the data ID, as well as the home ID, the corresponding IP address and port number of the physical node, if the corresponding directory entry can be found in the replying home. Note that the home ID cached is not the ID of the replying home, but the ID of the home of the requested data item.

3.4.2.2 Publishing

Similar to retrieving, a virtual home publishing a data item needs to write the data item to the virtual home whose node ID is numerically closest to the ID of the published data item.

First, the publisher should determine whether the data item has been published previously or the hashing key representing the data item has been used. Meanwhile, the virtual home for storing such a data item should determine whether there is available space and whether the numerically closed home is capable of performing replication. If the key collides or the space of the remote home and the replication homes is not available, the publishing operation will be aborted. Otherwise, the data item can be written to its virtual home and the replication homes.

As mentioned above, Tornado leverages data availability by constantly replicating and maintaining k replicas for each data item via the limited vectors approach. The probability of com- pletely losing a given data item is thus _k

p

1 , where p is the ratio to lose a particular replica. Once a virtual home receives a publishing request, it will first construct k-1 routes to k-1 vir- tual homes, whose IDs are numerically closest to itself. Note that a route cannot visit the homes that have been chosen for the replication. To publish the replicas from the virtual home, k-1 publishing requests with the hashing keys obtained from the k-1 routes are sent to the replication homes. Meanwhile, the virtual home stores these k-1 hashing values in the associ- ated vectors. The virtual home will periodically monitor these replicas via the associated k-1 vectors. Since a data owner will periodically republish data items it generated, the corresponding virtual home will also need to periodically republishing replicas to the k-1 nodes. This guarantees that there always exists an active virtual home for each data item. On the other hand, if the virtual home fails, subsequent requests to the virtual home will be forwarded to one of its replicas. This is done easily with Tornado’s routing infrastructure, because one of the virtual homes responsible for the replications will have the numerically closest home ID to the requested data ID.

3.4.2.3 Leasing

Tornado uses leasing to provide a relaxed data consistency model. Each data item is associated with a time contract that specifies its lifetime. The control of a data item’s lifetime is left to the application. This scheme is very similar to that used in the World Wide Web, in which a web page can be associated with a TTL value that indicates for how long it can stay fresh in the local cache. Once the TTL is expired, the web page must be retrieved again from its original web server for the most up-to-date copy.

Tornado adopts a similar approach, but concentrates more on aggressively pushing data items into the distributed storage network. The published data items are maintained by an

anonymous active node beyond administrative boundaries and can be randomly replicated in any peer node with encryption.

3.5 Exploiting Physical Network Locality

As mentioned above, the network locality exploited is logical.

Virtual homes with closer keys may not follow the physical network locality. Since each home maintains multiple leaders in each routing level, Tornado can exploit physical network locality by choosing an appropriate leader to forward requests.

An appropriate leader for a given routing level in home x is the home with the minimal routing cost from x. The routing cost can be the transmission latency, the number of hops and the bandwidth between two nodes. For Tornado, this only entails a modification by varying the “numerically closest” to “numerically closer with the minimal routing cost”. We will show that this simple modification can introduce network locality and help route requests by consulting the nearby physical nodes (see Section 4.4).

Similar to the proximity routing mentioned above, Tornado could also be constructed to approximate the physical network topology. This can be done by simply choosing the appropriate leaders for the newly joining nodes.

4. Performance Evaluation 4.1 Impact of System Size

We evaluate Tornado via simulation. In default, each routing level in a routing table of a virtual home consists of two leaders. There are four neighbor nodes per home. We simulate capabilities of the physical nodes by randomly varying the number of virtual homes that a physical node can host from 1 to 5. The path length required is first investigated. Three system configurations Ϋ “Stable”, “Refresh” and “Never Re- fresh” Ϋ are studied. “Stable” is an ideal configuration where each virtual home has the optimized routing table (i.e., a leader of a given home’s routing level has the numerically closest key to the corresponding ϑ ). “Refresh” simulates the case in which each physical node periodically updates the routing table of each virtual home that it hosts. The number of updates is moderate (i.e., 10 updates) for each home as the system size increases. “Never Refresh” denotes the case in which the physical node does not help the update of the hosted homes.

We randomly select a set of virtual homes in the network and assign a group of randomly generated key values to each home in the set. The path lengths are measured and averaged when a

Figure 2. The number of hops required versus the number of physical nodes for various system configurations

(7)

physical node routes a message towards its assigned key values.

The number of hops reported here denotes the application- level hops between physical nodes.

Figure 2 presents the simulation results. We can see that if the virtual homes never refresh their routing tables, the path length required will be linearly increased with the system size. How- ever, if the virtual homes update their routing tables periodically as in Refresh, the path lengths will be comparable to those of Stable. Additionally, the path lengths of Refresh are logarithmically increased as the system size.

4.2 Performance of Data Accessing

The performance of data accessing relies on the available storage in each physical node. What we are interested in is the relative performance of various storage designs, i.e., the system without caches and directories (denoted “W/O”), the system with caches only (denoted “Cache”) and the system with the support of caches and directories (denoted “Directory + Cache”). The directory, the cache and the data store share the storage allocated to each home. In this experiment, the directory can provide the index space to accommodate all the data items in the system. This is reasonable since each directory entry is the data descriptor about its hashing address and the corresponding IP address. Thus, a directory entry can be quite small compared with a data item. Due to a lack of P2P storage workloads, we model a representative Web-like traffic, where 90% of requests access 10% of data items [1]. Similar to Sec- tion 4.1, the requests are randomly generated.

Figure 3 presents the average number of hops required versus the memory pressure¹ for “W/O”, “Cache” and “Directory + Cache”. Obviously, “Directory + Cache” outperforms “W/O”

and ‘Cache”. “Directory + Cache” does not increase the hop count since directories provide shortcut paths between two nodes. The performance of “Cache” degrades as the memory pressure increases. This is because the produced data items gradually consume the available memory space. For “W/O”, there is no optimization support and thus W/O is not affected by the memory pressure.

1 Memory pressure is defined as follows.

pace storage s ble lly availa of initia total size

s data item of unique total size pressure

memory = .

4.3 Impact of Failures

To study the performance impact of failures, the system size is initially set to 320000 physical nodes. We randomly remove one node at a time until the system size is reduced to 625 physical nodes. Similar to Section 4.1, the path lengths are measured and averaged by randomly assigning requests to physical nodes. Note that each physical node in Tornado needs to help update the neighbor table of each hosted virtual home, if it finds that the corresponding neighbor links are failed.

Meanwhile, each home also updates the corresponding routing levels of its routing table if the failed neighbor nodes appear in such routing levels.

In Figure 4(a), we can see that the path lengths are logarithmically shortened as the number of physical nodes increases.

Figure 4(b) shows the probability of successful routes. Up-to 98% of messages can be delivered to destinations in Tornado.

Tornado maintains high data availability via replications. We experimented with 0, 5, 15 and 35 replicas for each data item.

The system studied is initialized to 100000 nodes and we randomly remove one physical node at a time until the system size is reduced to 625 physical nodes. The availability is measured by assigning data retrieving request with randomly generated data keys to randomly selected physical nodes. Note that we do not simulate data leasing and thus virtual homes do not republish the data items they produced. The number of identical data

(a)

(b)

Figure 4. (a) the number of hops required versus the number of physical nodes, and (b) the probability of successful routes versus the percentage of physical nodes

Figure 3. The number of hops required for accessing data items versus the memory pressure

(8)

items is thus reduced along with the decreasing number of physical nodes.

Figure 5 shows the simulation results. If there are over three copies replicated for each published data item, nearly 100% of the data items are available and can be retrieved in a system with 20% of nodes failed. A data availability of 90%, 63% and 40% can be maintained for a system respectively encountering

60%, 80% and 90% of failure nodes. Even when 95% of nodes are failed, 22% of the data items are still available. Similar results can be obtained in a system with one or seven replicated copies for each data item.

4.4 Impact of Network locality

Figure 6(a) presents the communication cost for transmitting requests between physical nodes. We randomly assign the communication cost to the communication links from 2 (e.g., 2M bits 802.11b wireless LAN) to 50 (e.g., 100M bits Ethernet). The “average” denotes the average communication cost between any two physical nodes in the network. The

“leader” and “neighbor” represents the average communication costs required for a route between two consecutive leaders, and between leader and neighbor, respectively. The breakdown of the number of visited leaders and neighbors for a route is de- picted in Figure 6(b). The results indicate that Tornado can provide the routes via nearby nodes to reach the destinations and most of the nearby nodes visited are the leaders. Obviously, exploiting the network locality for the routes between leaders is beneficial.

5. Conclusions

In this study, we propose a scalable and reliable P2P storage infrastructure, Tornado. Tornado is based on the virtual home concept to exploit the capabilities of the underlying components to leverage their resources, performance and reliability.

Tornado is reliable and efficient, and only “good” nodes are used for the storage infrastructure. It is self-organizing and is capable of providing fault-tolerant routes for accommodating the dynamics in a storage network. Additionally, it allows loads being distributed to good nodes and maintains high data availability by utilizing their storage spaces.

References

[1] M. F. Arlitt and C. L. Williamson. “Web Server Workload Characterization: The Search for Invariants,” In ACM International Conference on Measurements and Modeling of Computer Systems, pages 126-137, May 1996.

[2] I. Clarke, O. Sandberg, B. Wiley, and T. W. Hong. “Freenet: A Distributed Anonymous Information Storage and Retrieval Sys- tem,” In Workshop on Design Issues in Anonymity and Unob- servability, pages 311-320, July 2000.

[3] F. Dabek, M. F. Kaashoek, D. Karger, R. Morris, and I. Stoica.

“Wide-Area Cooperative Storage with CFS,” In ACM Symposium on Operating Systems Principles, October 2001.

[4] H.-C. Hsiao and C.-T. King. “Modeling and Evaluating Peer-to- Peer Storage Infrastructure,” In IEEE Parallel and Distributed Processing Symposium, April 2002.

[5] J. D. Kubiatowicz et al. “OceanStore: An Architecture for Global- Scale Persistent Storage,” In ACM International Conference on Ar- chitectural Support for Programming Languages and Operating Systems, pages 190-201, Nov. 2000.

[6] S. Ratnasamy, P. Francis, M. Handley, R. Karp, and S. Shenker. “A Scalable Content-Addressable Network,” In ACM SIGCOMM, pages 161-172, August 2001.

[7] A. Rowstron and P. Druschel. “Storage Management and Caching in PAST, A Large-Scale, Persistent Peer-to-Peer Storage Utility,” In ACM Symposium on Operating Systems Principles, Oct. 2001.

[8] A. Rowstron and P. Druschel. “Pastry: Scalable, Distributed Object Location and Routing for Large-Scale Peer-to-Peer Systems,” In IFIP/ACM International Conference on Distributed Systems Plat- forms (Middleware 2001), Nov. 2001.

[9] I. Stoica, R. Morris, D. Karger, M. F. Kaashoek, and H. Balakrish- nan. “Chord: A Scalable Peer-to-Peer Lookup Service for Internet Applications,” In ACM SIGCOMM, pages 149-160, August 2001.

[10] Tornado. http://pads1.cs.nthu.edu.tw/tornado.html.

[11] B. Y. Zhao, J. D. Kubiatowicz, and A. D. Joseph. “Tapestry: An Infrastructure for Fault-Tolerant Wide-Area Location and Routing,”

Technical Report UCB/CSD-01-1141, April 2000.

(a)

(b)

Figure 6. (a) the communication cost required for a route to a leader/neighbor versus the average cost required of a route between any two physical nodes, and (b) the average number of leaders and neighbors visited towards a destination

Figure 5. The data availability versus the number of physical nodes