Reliability of the system - Game Theoretic Analysis of Distributed Systems: Design and Incentiv

In contrast to storage services, our goal is to ensure data durability with a P2P backup system design that requires neither user payments nor external incentive mechanisms to provide safe backup in exchange for as few shared resources as possible. Besides the issues and solutions discussed so far, the reliability of the service depends heavily on the speed of data transfers and on security aspects.

3.3.1 Data transfers

Compared with the amount and placement of redundant data, the organization of data transfers has received less attention from the research community. The authors of [19] analyzed random backup scheduling by modeling peer uptime as a Markovian process. Their study reached a conclusion that is analogous to what we obtain in Chapter 6: in backups, the completion time of random scheduling converges to the optimal value as the system size grows.

BitTorrent [29] uses fragments of a predefined size to avoid unfinished transfers, and adaptive upper limits on parallel transfers to avoid bottlenecks. Many of our design elements are inspired by what is implemented in this file-sharing application.

As mentioned in Chapter 2, we suggest employing a data center as a remedy for a temporary lack of peer resources, such as bandwidth. Little work has been done on hybrid approaches to mitigate the shortcomings of P2P systems. To the best of our knowledge, no prior work tackles data backup and/or repair operations assisted by a central entity.

AmazingStore [138] improves data availability of cloud-based storage services and reduces their costs by augmenting centralized clouds with an efficient client-side storage system. Besides the servers, peers back up data at other peers with different online patterns, both to improve data availability and to serve read requests within the P2P network. The hybrid system therefore mitigates the issue of the centralized point of failure and provides resilience to large-scale failures. FS2You [91] is a peer-assisted system that provides temporary storage and seeding for files in a BitTorrent-like content distribution system with a hybrid structure consisting of peers and servers. FS2You does not guarantee data persistence; while its goal is to minimize bandwidth costs, we focus instead on minimizing the storage costs that will be dominant in the long run for a storage system.

3.3.2 Security

While correlation among peer uptimes is a natural phenomenon, simultaneous data loss on a large number of peers is suspicious. The most probable reason for many peers to lose the data stored on them simultaneously is that they all run the same software, e.g., the same operating system, or that they were created as part of a Sybil attack [40]. Since these correlated peer failures are caused by a bug or vulnerability and pose a high risk to the reliability of the system, we see them as a security issue.

Many works [66, 73, 74, 98, 133] have tackled this issue and proposed placing data on peers that are less likely to be correlated: those that use different software configurations, are connected through different network service providers, and are geographically far from each other.

Along with losses due to peer failures, replicas or fragments can be destroyed on peers accidentally or deliberately. In order to detect these events, data integrity checks must be performed periodically. These verifications also ensure that peers are really storing the data assigned to them, hence enforcing fairness in storage.

While common checksums reveal accidental corruption, detecting malicious data deletion requires more sophisticated operations. Two main approaches have been proposed: either peers create self-verifying data blocks whose signatures are collision-resistant cryptographic hashes of the blocks themselves [117,134], or they issue probabilistic challenges to storage peers via cryptographic protocols, which can be answered only by a peer holding the data block [14,90,104]. With the first option, to detect any data modification the peer has to download the block and recompute the signature to perform the comparison.
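The two approaches can be illustrated with a minimal sketch. It is not the protocol of any of the cited works: the function names are hypothetical, SHA-256 is an assumed choice of hash function, and the challenge check assumes the data owner still holds a local copy of the block to compare against.

```python
import hashlib
import hmac
import secrets

def block_name(block: bytes) -> str:
    """Self-verifying name: the SHA-256 digest of the block content."""
    return hashlib.sha256(block).hexdigest()

def verify_block(name: str, block: bytes) -> bool:
    """Approach 1: download the block, recompute the digest, compare."""
    return hmac.compare_digest(name, block_name(block))

def make_challenge() -> bytes:
    """Approach 2: a fresh random nonce, so old answers cannot be replayed."""
    return secrets.token_bytes(16)

def respond(nonce: bytes, block: bytes) -> bytes:
    """Computed by the storage peer; requires holding the whole block."""
    return hashlib.sha256(nonce + block).digest()

def check_response(nonce: bytes, local_copy: bytes, response: bytes) -> bool:
    """Computed by the owner from its local copy (an assumption of this sketch)."""
    return hmac.compare_digest(respond(nonce, local_copy), response)
```

In this sketch the first approach transfers the whole block for every check, whereas the second transfers only a nonce and a 32-byte digest, which is the reason challenge-based checks are attractive for periodic verification.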

Since in our backup system we can assume a single data owner authorized to read and write its data, data confidentiality and access control can be achieved with standard cryptographic techniques between each data owner and its storing peers. Moreover, for the same reason, consistency guarantees for multiple readers and writers, and anonymity for data publishers and readers, are not needed.

Chapter 4

System design

The goal of a P2P backup system is to store data for users safely on remote peers. Every system participant runs a client application on its device, e.g., computer or set-top box, with shared storage capacity, connected to the Internet. The system design must cope with the unavailability of peers and the unpredictable amount of resources dedicated to the system.

We have to consider many design aspects when building our system. This thesis presents innovative elements regarding the data redundancy scheme, the peer selection with related data placement strategies, and the data transfer scheduling policy. In those fields where known techniques provide reasonable solutions, given the purposes and the assumptions of a P2P backup system, our work refers to and reuses mechanisms published in the literature, e.g., erasure coding and repairs.

4.1 Data backup and retrieval

The system design that we present contains a central server, called tracker, which is operated by the backup service provider to supply various system management operations, such as peer registration and monitoring. We note that these operations can also be implemented in a distributed way using well-known techniques (e.g., DHTs). Such a task is, however, outside the scope of the thesis. Therefore, for simplicity, we assume that all of them are carried out by the tracker.

When a peer has to save new data, the backup phase begins. During it, the data owner:

• establishes its data to be backed up (stored locally indefinitely unless the device of the peer crashes), divides it into fixed-size fragments, then creates additional fragments by encoding the original fragments by erasure coding (Section 4.2);

• queries the tracker for a set of remote peers that are willing to store a fragment of the peer on their devices, i.e., have sufficient unallocated storage capacity, and announces itself at the tracker as a storage candidate in the meanwhile (Section 4.3);

• performs fragment transfers to an online subset of selected peers, striving to complete each of them within a predetermined transfer timeout (Section 4.4).

The phase lasts until the backup is completed: the peer reaches sufficient data redundancy, and thus a safe backup, by having uploaded enough fragments successfully. Afterwards, the rearranging phase starts, during which the peer re-uploads the fragments that were lost on crashed peers.
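The fragmentation and encoding step of the backup phase can be sketched as follows. This is only an illustration: a single XOR parity fragment, the simplest erasure code, stands in for the redundancy scheme of Section 4.2, and all function names and the zero-padding convention are assumptions of the sketch.

```python
from functools import reduce

def split_fixed(data: bytes, frag_size: int) -> list[bytes]:
    """Split non-empty data into fixed-size fragments, zero-padding the last one."""
    frags = [data[i:i + frag_size] for i in range(0, len(data), frag_size)]
    frags[-1] = frags[-1].ljust(frag_size, b"\0")
    return frags

def xor_bytes(a: bytes, b: bytes) -> bytes:
    """Byte-wise XOR of two equal-length fragments."""
    return bytes(x ^ y for x, y in zip(a, b))

def encode(frags: list[bytes]) -> list[bytes]:
    """Append one parity fragment: k original fragments become k + 1 encoded ones."""
    return frags + [reduce(xor_bytes, frags)]

def recover(encoded: list[bytes], lost_index: int) -> bytes:
    """Rebuild the single missing fragment by XOR-ing all surviving fragments."""
    survivors = [f for i, f in enumerate(encoded) if i != lost_index]
    return reduce(xor_bytes, survivors)
```

With this toy code any one of the k + 1 fragments may be lost without losing the backup; tolerating more losses requires a stronger code. The original data length would also have to be recorded so that the padding can be stripped on restore.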

As soon as a peer notices it has lost its local storage due to, e.g., a device crash, the peer initiates a retrieval phase, during which it remains online until its retrieval operation is finished. After it has downloaded enough of its own fragments from remote peers and restored its data, it also downloads fragments of remote peers to store.

The performance metrics we will use to evaluate our novel redundancy, peer selection and transfer scheduling schemes are data loss probability, time to backup and time to retrieve the data.

Definition 4.1.1 Data loss probability The backup gets lost if not enough fragments can be retrieved from remote peers. The amount of time spent before noticing peer crashes and the speed of retrievals have an important impact on the probability of losing data: together they define an interval during which data is at risk, because no maintenance of redundancy is carried out.

Data loss probability (DLP) describes data durability by the likelihood of losing a number of encoded fragments stored on remote peers within a given time frame, so that the remaining fragments are insufficient to restore the original data. Besides the given time duration, the DLP depends on the crash rate of storing remote peers.
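As a rough illustration of how the DLP depends on the time frame and on the crash rate, the sketch below evaluates it under two simplifying assumptions that are not taken from the thesis: storing peers crash independently of each other, and crashes are memoryless with a constant rate. The symbols n (fragments stored remotely), k (fragments needed to restore), and the function names are hypothetical.

```python
from math import comb, exp

def crash_probability(rate: float, window: float) -> float:
    """Probability that one peer crashes within `window`, for a constant crash rate."""
    return 1.0 - exp(-rate * window)

def data_loss_probability(n: int, k: int, p: float) -> float:
    """Probability that more than n - k of the n fragments are lost,
    when each fragment is lost independently with probability p."""
    return sum(
        comb(n, lost) * p**lost * (1 - p) ** (n - lost)
        for lost in range(n - k + 1, n + 1)
    )
```

For example, with n = 3, k = 2 and p = 0.1 the sketch gives 3 · 0.1² · 0.9 + 0.1³ = 0.028. Correlated failures, as discussed in the related work on security, would make the real value higher than this independent-crash estimate.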

Definition 4.1.2 Time-To-Backup and Time-To-Retrieve The Time-To-Backup (TTB) (resp. Time-To-Retrieve (TTR)) of a user is the time elapsed while the user uploads (resp. downloads) a number of encoded fragments to (resp. from) remote storage locations which meets the target redundancy (resp. is sufficient to restore the original data). TTR is definable only if the user has backed up enough fragments to make restore possible before starting to retrieve a subset of them.

We use baselines for backup and restore operations which bound both TTB and TTR. Let us assume an ideal storage system with unlimited capacity and uninterrupted online time that backs up user data. In this case, TTB and TTR only depend on the backup data amount and on the bandwidth capacity and availability of the data owner. We label these ideal values of a user as minTTB and minTTR. A peer $i$ with upload and download bandwidth $u_i$ and $d_i$ starting the backup of an object of size $o$ at time $t$ completes its backup at time $t'$, after having spent $o/u_i$ time online. Analogously, $i$ restores a backup object with the same size at $t''$ after having spent $o/d_i$ time online. We define $\mathit{minTTB}(i, t) = t' - t$ and $\mathit{minTTR}(i, t) = t'' - t$. We use these reference values throughout the paper to compare the relative performance of our P2P application versus that of such an ideal system.
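The baselines can be computed directly from the online periods of the data owner. The sketch below is one possible reading of the definition, with hypothetical names: online periods are given as non-overlapping (begin, end) intervals, and the transfer progresses at full bandwidth whenever the owner is online.

```python
def completion_time(start: float, work: float, online: list[tuple[float, float]]) -> float:
    """Earliest time at which `work` units of online time have accumulated since `start`."""
    remaining = work
    for begin, end in sorted(online):
        begin = max(begin, start)      # ignore online time before the start
        if end <= begin:
            continue
        if end - begin >= remaining:
            return begin + remaining   # the transfer finishes inside this interval
        remaining -= end - begin
    raise ValueError("not enough online time to complete the transfer")

def min_ttb(size: float, upload_bw: float, start: float, online) -> float:
    """minTTB(i, t) = t' - t, where o / u_i online time is needed."""
    return completion_time(start, size / upload_bw, online) - start

def min_ttr(size: float, download_bw: float, start: float, online) -> float:
    """minTTR(i, t) = t'' - t, where o / d_i online time is needed."""
    return completion_time(start, size / download_bw, online) - start
```

For instance, an object of size 100 with upload bandwidth 10 needs 10 time units online; if the owner is online during (0, 4) and (10, 20) and starts at time 0, the backup completes at time 16, so minTTB is 16 rather than 10.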


Note that TTB is generally several times longer than TTR. First, in the retrieval phase, peers are not likely to disconnect from the Internet. Second, most peers have asymmetric lines with fast downlink and slow uplink. Third, backups require uploading redundant data, while restores involve downloading an amount of data equivalent to the original backup, as we show later. Because of this imbalance, we argue that it is reasonable to use a redundancy scheme that trades longer TTR (which affects only users that suffer a crash) for shorter TTB (which affects all users).
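A back-of-the-envelope estimate makes the size of this gap concrete. It ignores disconnections and assumes that remote peers are never the bottleneck; the redundancy factor $r$ (uploaded data divided by original data) and the numeric values are illustrative assumptions, not figures from the thesis.

```latex
\[
\frac{\mathrm{TTB}}{\mathrm{TTR}}
  \approx \frac{r \, o / u_i}{o / d_i}
  = r \, \frac{d_i}{u_i},
\qquad
\text{e.g. } r = 2,\ d_i = 8\,u_i
\ \Rightarrow\
\frac{\mathrm{TTB}}{\mathrm{TTR}} \approx 16 .
\]
```

Disconnections during the backup phase, the first reason given above, would increase the ratio further.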
