Preventing Misbehavior in Cooperative Distributed Systems


SHIN, KYUYONG. Preventing Misbehavior in Cooperative Distributed Systems. (Under the direction of Professor Douglas S. Reeves).

Cooperative distributed systems are becoming increasingly popular as alternatives to the traditional client-server model for many applications, including file sharing, streaming, and distributed computing. In cooperative distributed systems, participants directly cooperate with each other to achieve common goals by sharing resources without the need for any central control. Therefore, in contrast to the client-server model, the system capacity potentially scales as the number of participants in a system increases, providing the participants with information or services with few resource restrictions. The information or services provided by the system can be thought of as a public good, and participants should play a part in the protection and provision of the public good. Thus, cooperation among participants to obtain mutual benefits is the fundamental premise behind the success of such a system.

In spite of the importance of cooperation among participants to protect and support the public goods in cooperative distributed systems, a high level of informational integrity of the goods and behavioral integrity of participants toward the goods is difficult to achieve due to malicious or selfish participants. Because such malicious or selfish behavior was not anticipated at the inception of cooperative distributed applications, they are highly vulnerable to such behavior. To address the problem, in this dissertation, we identify two major threats (i.e., pollution and free-riding) to the protection and provision of the public goods, and propose tailored solutions to those specific threats. In addition, a general, fairness-enforcing incentive mechanism is proposed to foster cooperation among participants, which could be readily used to prevent various kinds of misbehavior in a wide range of cooperative distributed systems.

Firstly, this dissertation investigates the pollution problem in file sharing systems,

and proposes a novel Distributed Hash Table (DHT)-based anti-pollution scheme called

winnowing. Winnowing attempts to achieve a high level of informational integrity of the public goods (i.e., shared files in this case) through cooperation among (benign)


sharing. By incorporating secret sharing into file sharing, the proposed scheme, called Treat-Before-Trick (TBeT), enforces cooperation among participants by preventing uncooperative participants from acquiring the secrets required to complete their work. Therefore, a

high level of behavioral integrity on the part of participants toward the public goods can

be achieved under TBeT.

Finally, this dissertation proposes a general incentive mechanism, named Triangle Chaining (T-Chain), which can be readily and widely used in many cooperative distributed systems to enforce cooperation among participants. T-Chain strongly depends both on the use of light-weight symmetric cryptography to reduce the opportunity


by Kyuyong Shin

A dissertation submitted to the Graduate Faculty of North Carolina State University

in partial fulfillment of the requirements for the Degree of

Doctor of Philosophy

Computer Science

Raleigh, North Carolina

2009

APPROVED BY:

Dr. George N. Rouskas Dr. Peng Ning

Dr. Douglas S. Reeves Dr. Injong Rhee


DEDICATION

To my beautiful and tolerant wife

Nari Jun

and cherished two sons

Jaebeen

and

Jaeui


BIOGRAPHY

Kyuyong Shin was born on April 20, 1973, in South Korea. He married Nari Jun

in 2000 and they have two sons, Jaebeen and Jaeui.

He received his Bachelor’s degree in the Department of Computer Science at the

Korea Military Academy in 1996. After serving in the Korean Army as a platoon leader, he

attended Korea Advanced Institute of Science and Technology (KAIST) from 1998 to 2000,

where he earned his Master’s degree in computer science. After serving in various military assignments as an army officer and assistant professor in Korea, he continued his graduate study in the Ph.D. program in the Department of Computer Science at North Carolina State University, beginning in the fall of 2005.

He lives in Seoul, South Korea, where he is currently working for the Korea Military

Academy as a faculty member. His research interests lie in Peer-to-Peer systems, network


ACKNOWLEDGMENTS

I would like to thank all the people who have helped and inspired me during my doctoral study. First of all, I would like to express my deepest gratitude to my advisor, Dr. Douglas S. Reeves, for his endless concern, inspiration, commitment, and support. Throughout my doctoral research, he continually encouraged me to develop analytical thinking and research skills, and patiently guided me in the right direction.

I would like to thank my co-advisor, Dr. Injong Rhee, for his insightful and constructive guidance in my doctoral research. Also, I am very grateful to the other committee

members, Dr. George N. Rouskas and Dr. Peng Ning, for their continued support, valuable

comments, and encouragement. Special thanks should be given to Dr. Xiaohui Gu for her

substitution during my final oral defense, allowing me the flexibility to schedule the exam.

It has been great to work with Dr. Carla D. Savage as her Teaching Assistant for

several semesters, acquiring invaluable teaching experience here at NCSU. In addition, I

would like to thank the Director of Graduate Program, Dr. David J. Thuente, for his help

during my Ph.D. study. I am also grateful to Ms. Margery Page for her constant help.

This research is supported by various funding sources, including the Secure Open Systems Initiative (SOSI), and I would like to thank them for their support. I would also like to thank the Korean Army and the Korea Military Academy for giving me this invaluable opportunity to study abroad.

I would like to thank my friends and colleagues for their help in various ways

throughout the entire duration of this thesis, including Ahmed Azab, An Liu, Attila Altay Yavuz, Beakcheol Jang, Emre (John) Sezer, Entong Shen, Dr. Hyungjun Kim, John

Sobrero, Juan Du, Jun Bum Lim, Pastor Kwanseok Kim, Dr. Mihui Kim, Min Yeol Lim,

Michele Calcavecchia, Sangwon Hyun, Sangtae Ha, Seong Ik Hong, Qinghua Zang, Wei Wei,

Yao Liu, Young Hee Park, and others whom I may have mistakenly forgotten to mention.

I also would like to thank Korean army officers, Chongkyung Kil, Jung Ki So,

Woopan Lee, Yong Chul Kim, Kiseok Jang, Changho Son, and Hesun Choi for all their

help, support, and valuable time we spent together at NC State. And, I cannot forget to

mention my KMA 52 classmates, Seungjin Han, Sungrok Kang, Sungwoo Kim, Youngchoon

Jeon, Youngho Jo, and their families for the unforgettable memories in the United States.

Finally, but most importantly, thank God for saving my life and always being with


TABLE OF CONTENTS

LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Cooperative Distributed Systems
1.2 Major Threats to Cooperative Distributed Systems
1.3 Challenges to Developing Solutions
1.4 Our Approaches
1.5 Organization of the Dissertation
2 Anti-Pollution Mechanism
2.1 Background
2.1.1 Publishing and Retrieving
2.1.2 Existing pollution attacks
2.2 Sharing in the wild
2.3 Index Filtering (winnowing)
2.3.1 Keyword Publish Verification
2.3.2 Privacy-preserving Object Reputation
2.3.3 Penalizing Uncooperative or Malicious Index Nodes
2.4 Evaluation
2.4.1 Performance of Keyword Publish Verification (Preliminary Phase)
2.4.2 Analytic Model (Privacy-preserving Object Reputation)
2.4.3 Modeling and Simulation Results
2.5 Discussion
2.6 Related Work
2.7 Summary of Winnowing
3 Free-riding Prevention Mechanism
3.2 Background
3.2.1 BitTorrent Overview
3.2.2 Existing Free-riding Strategies
3.2.3 (t, n) Threshold Secret Sharing
3.3 Treat-Before-Trick (TBeT)
3.3.1 Basic Protocol Description
3.3.2 Protocol Enhancements
3.3.3 Optimal Threshold Value
3.3.4 Countering Free-Riding Strategies
3.4.1 Experimental Setup
3.4.2 Optimal Threshold Value (t)
3.4.3 Penalty for Free-Riding
3.4.4 Performance of Compliant Peers
3.4.5 Fairness
3.4.6 Performance of Rational Peers
3.5 Discussion
3.5.1 Computational Overhead
3.5.2 Collusion
3.5.3 Key Disclosure
3.6 Summary of TBeT
4 General Incentive Mechanism
4.1.1 Basic Enhancements to TFT
4.1.2 Use of Indirect Information (Reputation or Accounting)
4.1.3 Use of Coding or Encryption
4.1.4 Summary of Existing Work
4.2 Design of T-Chain
4.2.1 Protocol Overview
4.2.2 Initiation Phase: Figure 4.1(a)
4.2.3 Continuation Phase: Figure 4.1(b)
4.2.4 Termination Phase: Figure 4.1(c)
4.2.5 Incentives Immanent in T-Chain
4.2.6 Protocol Enhancements
4.2.7 Summary of the Proposed Scheme
4.3 Security and Overhead Analysis
4.3.1 Known BitTorrent Free-Riding Attacks
4.3.2 Attacks Unique to T-Chain
4.3.3 Overhead of T-Chain
4.3.4 Fault Tolerance Issues
4.3.5 Optimality
4.4 Evaluation
4.4.1 Experimental Conditions
4.4.2 Performance without Free-Riding (Flash Crowd Case)
4.4.3 Performance with Free-Riding (Flash Crowd Case)
4.4.4 Impact of Collusion Among Free-Riders (Flash Crowd Case)
4.4.5 Performance with Free-Riding (Real Trace Case)
4.4.6 Fairness with Free-Riding (Real Trace Case)
4.4.7 Chain Characteristic
4.4.8 Opportunistic Seeding
4.4.9 Symmetric and Asymmetric Interests
4.4.10 Performance Under High Churn Rates
4.4.11 Summary of Evaluation
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work


LIST OF TABLES

Table 2.1 Summary of Notations Used in Winnowing
Table 3.1 Summary of Notations Used in TBeT


LIST OF FIGURES

Figure 2.1 A high level overview of the publish and retrieve mechanism.
Figure 2.2 File sharing statistics in the wild (Kad network)
Figure 2.3 User statistics in the wild (the Kad network and BitTorrent)
Figure 2.4 Message processing in eMule with winnowing.
Figure 2.5 The measurement results of the keyword publish verification.
Figure 2.6 The relative performance of each system under perfect conditions.
Figure 2.7 The relative performance of each system under realistic conditions.
Figure 2.8 The effects of the active pollution with the Sybil attack.
Figure 2.9 The effects of IP binning with α under the realistic conditions.
Figure 2.10 The effect of collusion attacks.
Figure 2.11 The number of negative votes received by the malicious index node.
Figure 3.1 BitTorrent overview
Figure 3.2 Optimal threshold value t of TBeT.
Figure 3.3 Average download completion time of compliant peers and free-riders (without the large-view-exploit).
Figure 3.4 Average download completion time of compliant peers and free-riders (with the large-view-exploit).
Figure 3.5 The ratio of the average completion time of free-riders over that of compliant peers under the homogeneous (a) and heterogeneous (b) upload speed case.
Figure 3.6 The rate at which free-riders in TBeT get subkeys under the continuous arrival process with heterogeneous upload rate peers.
Figure 3.8 Correlation between the download completion time and the donated upload rate in both systems with the large-view-exploit and 25% free-riders out of 400 leechers. In this experiment, peers join the systems in a flash crowd.
Figure 3.9 The average download completion time of semi-seeders and rational peers after subkey completion under the continuous arrival process. Half of 500 semi-seeders were randomly selected as rational peers.
Figure 3.10 The completion times of compliant peers and free-riders with heterogeneous upload rates in both systems after a key disclosure under the continuous stream model.
Figure 3.11 The instantaneous completion time of compliant peers in TBeT with rekeying, assuming continuous peer arrivals.
Figure 4.1 The initial, intermediate, and terminal transactions in a chain of T-Chain.
Figure 4.2 Average download completion time (a) and average uplink utilization (b) of compliant leechers in BitTorrent, PropShare, and T-Chain under a flash crowd leecher arrival model without free-riders.
Figure 4.3 The effects of the file size (a) and the swarm size (b) under T-Chain.
Figure 4.4 The transfer times of individual file pieces for two specific leechers with the lowest and the highest upload rates.
Figure 4.5 The number of different pieces between each pair of neighbors in a real BitTorrent swarm over 7 days (a) and the effects of initial piece differences among leechers (b).
Figure 4.6 Average download completion time of compliant leechers (a) and free-riders (b) in BitTorrent, PropShare, and T-Chain in a flash crowd (without the large-view-exploit).
Figure 4.7 Average download completion time of compliant leechers (a) and free-riders (b) in BitTorrent, PropShare, and T-Chain in a flash crowd (with the large-view-exploit).
Figure 4.8 The effects of collusion in T-Chain under the flash crowd arrival model.
Figure 4.9 The average download completion time (a) and the average uplink utilization (b) of compliant leechers in BitTorrent, PropShare, and T-Chain under a continuous stream model.
Figure 4.11 The number of active chains as a function of time in a flash crowd model (a) and that in a continuous stream model without free-riders (b).
Figure 4.12 The length of each chain under a flash crowd model with 600 leechers without free-riders.
Figure 4.13 The cumulative number of the chains created by the seeder and by the opportunistic seeding in a flash crowd (a), and the ratio of opportunistic seeding in a real trace as the function of the fraction of free-riders in the system (b).
Figure 4.14 The ratio of pieces received through bilateral exchanges in a continuous stream model with different fractions of free-riders.


Chapter 1

Introduction

With electronic information exchange now comprising the majority of Internet traffic, cooperative distributed systems (e.g., Peer-to-Peer file sharing [64], Content Delivery Networks [2], and Grid Computing [13]) are increasingly popular as alternatives to the traditional client-server model for many applications. For instance, a recent study by a German-based company on the usage of file sharing applications in several countries shows that more than 50% of all Internet traffic is Peer-to-Peer (P2P) related, with peaks of greater than 95% during the nighttime hours [28]. Distributed systems, while extremely popular (thanks to their simplicity, scalability, robustness, and flexibility), raise a variety of technical issues. This is mainly because little consideration was given to security (against malicious or selfish behavior) at their inception. In this dissertation, we examine major threats to cooperative distributed systems, identify challenges in developing solutions to those threats, and propose defense mechanisms that resolve the threats while overcoming the identified challenges.

1.1 Cooperative Distributed Systems

A cooperative distributed system is comprised of many autonomous participants (i.e., computers) which communicate through an overlay network (i.e., a virtual network of computers and logical links) built on top of an existing network. The use of a virtual overlay network makes it possible to temporarily build a simple, flexible, and robust infrastructure for a new service that is not currently available in the existing network. These


infrastructure is dynamically constructed in an ad-hoc manner without fixed, dedicated, or centralized infrastructure. Membership in such systems is also ad-hoc and dynamic in that participants face no requirements for joining or departing the systems (i.e., open membership). Participants directly cooperate with one another only while they remain in the systems to achieve their mutual benefits (e.g., file downloads, information exchanges, or

solving scientific problems), by sharing resources such as computing power, storage space,

or network bandwidth. Unlike the client-server model, system capacity potentially scales

as the number of participants in a system increases, liberating the original service provider

from over-provisioning its resources to handle peak user demands [20, 19, 32, 68, 66].

A cooperative distributed system is a self-organizing system of equal, autonomous, individual participants which aims for the shared usage of distributed resources in a virtually networked environment without the need for central coordination services. Such a concept forms a fundamental design principle for distributed systems and reflects the paradigm shift from coordination to cooperation, from centralization to decentralization, and from control to incentives [70].

The significantly increased Internet access capability, ample storage space, and abundant computing power of individual end users make this paradigm shift promising. That is, the aggregated resources of individual computers at the edge of the networks are more powerful than any dedicated computing or storage service unit (e.g., a supercomputer or storage server), but they are not always in use and are wasted when idle. In a distributed system, each participant donates its resources when idle (i.e., when the resources are not in use) and obtains other resources in return when busy (i.e., when more resources are needed), which is mutually beneficial to the participants. Therefore, if proper incentives are given to the participants, the mutual benefits are perceived by the participants (so that they voluntarily cooperate with each other), and the system is managed by the participants themselves without the need for any central entity or control, then future Internet-based applications, even those requiring exorbitant resources, can be easily realized without resource restrictions.

The information or services provided by a cooperative distributed system to its


files are not diminished by downloads (by participants) and the use of these files by any

one person does not restrict the use by anyone else. Since there is generally no dedicated

service provider, the fundamental premise of the success of cooperative distributed systems

is voluntary cooperation among participants to protect and support public goods.

1.2 Major Threats to Cooperative Distributed Systems

As the usage of cooperative distributed systems increases, the importance of

achieving a high level of information integrity of the public goods shared and behavioral

integrity of participants in the systems cannot be overemphasized. It is required in order to

protect the Internet and its users. When a system lacks such integrity, it often causes losses

of the critical advantages of flexibility, robustness, and scalability offered by the system.

This level of integrity is, however, rarely attained under distributed environments without

a centralized authority governing the entire system. The major threats to these systems

come from the malicious and selfish misbehavior of participants.

One major threat is a deviation from the informational integrity of the public goods

shared in the systems by malicious participants. A high level of informational integrity

is indispensable for the success of a system, in that no one wants to join or remain in

the system for an unproductive service or useless data. Unfortunately, the public goods

shared are often error prone, causing a waste of network bandwidth, user time, and storage

space. For example, Liang et al. [49] showed that up to 80% of the copies of popular

files in the KaZaA network [4] are contaminated. This contamination is caused in part by

deliberate file pollution attacks [49, 18, 46]. Because pollution was not considered in the design of such systems (and they have no central, trusted authority), cooperative distributed

systems are highly vulnerable to such intentional pollution attacks [18]. Since pollution squanders network resources and discourages user participation, the success of such systems is questionable unless pollution is properly addressed. Therefore, there should be a mechanism to

protect the public goods shared in the systems.

Another major threat is the deviation from behavioral integrity of participants

in the systems toward the goods by selfish participants. It has been shown that there is

the desire of some participants to get benefits without making contributions (due to their


share any files (and were therefore free-riders). Handurukande et al. [33] reported more

than 80% of eDonkey users were free-riders in 2006, and Zghaibeh et al. [78] showed that

the volume of free-riders continues to increase in BitTorrent [7, 20] and approached 16.8%

in 2008. Free-riding occurs when the benefits of a good (i.e., consuming the good) can be

enjoyed without contributing to or paying for the production of that good. With free-riding,

some participants will contribute for the benefits of others but receive little or nothing in

return. The result can be a reduction in the willingness to contribute, with a resulting

underproduction (relative to the demand) or lack of production of the good. This problem is known as “the tragedy of the commons,” in which each individual’s attempt to maximize his or her own benefit leads to a reduction of the overall public good [34]. Therefore, a new mechanism

to enforce cooperation among participants is required in order to prevent free-riding.

The best way of preventing such deviation from informational and behavioral integrity in cooperative distributed systems is to introduce highly effective penalty or incentive mechanisms to the systems. To be effective, these mechanisms must be impossible for internal system participants to bypass. Although researchers have put effort into developing such penalty or incentive mechanisms in order to improve the systems, this has proven to be a daunting and difficult task that has not yet been truly successful, as seen in Chapters 2, 3, and 4.

1.3 Challenges to Developing Solutions

As mentioned, participants in a cooperative distributed system directly and voluntarily cooperate with one another to achieve their mutual benefits by sharing resources without a universally trusted authority. Since there is no omniscient oracle in a distributed system, participants in the system must trust each other. When a deviation (e.g., pollution or free-riding) by malicious or selfish participants occurs, however, the lack of a central authority makes it difficult for the rest to detect or penalize the deviants (since no one is in charge), which is aggravated by the open membership. As such, the challenges for a participant are how to determine that it is currently under a selfish or malicious attack, how to penalize the attacker, and to whom (and how) to report the attack so that the attacker is punished by others. In addition, mutual agreements among participants should be obtained to effectively

deal with the deviants without any pre-trusted participant or authority. This is, however,


myriad of participants.

In this dissertation, we look for a way to achieve a high level of informational

integrity of the public goods shared and behavioral integrity of participants in cooperative

distributed systems. Since there is no centralized entity responsible for managing the entire

system in most distributed systems, solutions need to be implemented in a distributed

and cooperative manner. In this sense, our approaches depend on strong collaboration

among benign participants to detect and defeat the malicious and selfish behavior. These

approaches correspond to the intended original philosophy of the distributed approach (i.e.,

by the participants, for the participants).

1.4 Our Approaches

This dissertation attempts to make contributions by developing mechanisms addressing the two major threats to cooperative distributed systems mentioned above, pollution and free-riding. While addressing the two specific threats, we have noted that cooperation among participants is the key to preventing such deviation from informational and behavioral integrity, which leads us to design a general incentive mechanism for cooperation. Therefore, a general incentive mechanism is proposed and evaluated, which can be readily and widely used in many cooperative distributed systems to foster or enforce cooperation among participants. The mechanisms are as follows:

Anti-Pollution Mechanism (Winnowing): Pollution (i.e., sharing of corrupted files,

or contaminating index information with bogus index records) is a de facto problem in many file sharing systems in use today. Pollution degrades the informational integrity of

the systems, squanders network resources, and frustrates participants with unprofitable

downloads (due to corrupted files) and unproductive download trials (due to bogus index

records). In this work, we propose a novel Distributed Hash Table (DHT) based

anti-pollution scheme called winnowing. Winnowing aims to reduce or eliminate decoy index records (pointing to non-existing or corrupted files) held by DHT (i.e., index) nodes in

the system, so that download attempts based on the remaining (clean) index records are

more likely to yield satisfactory results. To achieve this goal, two techniques are used:


of content and metadata pollution. By integrating these techniques, winnowing converges

quickly to a near-optimal solution. Winnowing has the added benefit that it does not reveal

a participant’s download history to other participants.

The publish verification of winnowing has been implemented on top of the latest eMule client [8], and extensive data have been collected from the Kad network [55, 17], the largest DHT-based P2P system in existence, using this modified client. The measurement results have shown that the use of simple (keyword) publish verification would sharply reduce the fraction of bogus index records in the system (up to 35%), minimizing the impact caused by index pollution. The findings from the measurement study are then incorporated into our analytical model of the privacy-preserving object reputation. The model demonstrates the robustness of the proposed object reputation to a variety of pollution attacks, and to attacks against winnowing itself. The results of the analysis are confirmed by means of event-driven simulations, all of which are presented in Chapter 2.

Free-Riding Prevention Mechanism (Treat-Before-Trick): As previously mentioned, free-riders who use others’ resources without sharing their own cause system-wide performance degradation. Free-riding can be thought of as a deviation from behavioral integrity by selfish participants in the system. Various incentive schemes have been proposed recently to encourage participants to cooperate by sharing their resources. They may be classified as monetary-based [73], reciprocity-based [20, 72, 31, 14], and reputation (or credit)-based [44], and are well summarized by Feldman et al. [30]. However, existing techniques to counter free-riders are either complex (and thus not widely deployed) or easy to bypass (and therefore not effective). In this work, we consider the free-riding problem in a major P2P file-sharing system, BitTorrent. We propose a unique scheme to penalize free-riding and reward compliant behavior, based on the use of (t, n) threshold secret sharing [65]. Under the proposed scheme, files are divided and encrypted by the owner (i.e., the seeder). The key used for encryption is divided into n subkeys, any t distinct ones of which are sufficient to recover the key and to decrypt the file pieces. A participant must upload (encrypted) file pieces to obtain the subkeys necessary to decrypt file pieces which have been downloaded (i.e., subkeys are swapped for file pieces). Only the participants who have collected all the encrypted file pieces and at least t subkeys can complete their file downloads. This scheme is called “Treat-Before-Trick” (TBeT).
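The (t, n) threshold property described above can be illustrated with a minimal sketch. It assumes a polynomial-based construction over a prime field (Shamir-style), which is one standard way to realize (t, n) threshold secret sharing; this passage does not say which construction TBeT uses. The names `split_key`, `recover_key`, and `PRIME` are hypothetical and chosen only for illustration.

```python
import secrets

# Assumption: all arithmetic is done modulo a fixed prime larger than the key.
# 2**521 - 1 is a Mersenne prime, comfortably larger than a 128- or 256-bit key.
PRIME = 2**521 - 1


def split_key(key, t, n):
    """Split integer `key` into n subkeys; any t distinct subkeys recover it."""
    if not 1 <= t <= n:
        raise ValueError("need 1 <= t <= n")
    if not 0 <= key < PRIME:
        raise ValueError("key must lie in [0, PRIME)")
    # Random polynomial of degree t-1 whose constant term is the key.
    coeffs = [key] + [secrets.randbelow(PRIME) for _ in range(t - 1)]

    def evaluate(x):
        acc = 0
        for c in reversed(coeffs):  # Horner's rule
            acc = (acc * x + c) % PRIME
        return acc

    # Subkey i is the point (i, f(i)); x = 0 is never handed out.
    return [(x, evaluate(x)) for x in range(1, n + 1)]


def recover_key(subkeys):
    """Recover the key by Lagrange interpolation at x = 0."""
    key = 0
    for i, (xi, yi) in enumerate(subkeys):
        num, den = 1, 1
        for j, (xj, _) in enumerate(subkeys):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        key = (key + yi * num * pow(den, -1, PRIME)) % PRIME
    return key


if __name__ == "__main__":
    file_key = secrets.randbits(128)
    subkeys = split_key(file_key, t=3, n=5)
    print(recover_key(subkeys[:3]) == file_key)
```

In this sketch, a peer holding fewer than t subkeys learns nothing useful about the key, which mirrors the requirement that a participant must keep uploading until it has earned at least t subkeys.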


pieces and necessary subkeys). TBeT counters known free-riding strategies, incentivizes participants to donate more upload bandwidth, and consequently increases the overall system

capacity for compliant participants. No central authority is required under the proposed

approach. TBeT has been implemented as an extension to BitTorrent, and the results of

experimental evaluation are presented in Chapter 3.

A General Incentive Mechanism (Triangle-Chaining): In this work, we propose a simple, completely distributed, but highly efficient fairness-enforcing incentive mechanism for cooperative distributed systems, relying on a low-cost symmetric key scheme. The proposed scheme leverages symmetric and asymmetric interests created at the time of communication to obtain information or services. A participant sends encrypted information to another participant with an implicit request that the receiver should reciprocate with virtually the same amount of work to the participant designated by the sender. The decryption key is given if and only if the request is fulfilled by the receiver, meaning that reciprocation is not optional but mandatory under the proposed approach. Upon fulfilling the request, the receiver becomes another sender of encrypted information in the next transaction, and this process is continuously applied link by link, in a chain-like manner. No centralized monitoring or control is required, and the overhead of encryption and decryption is evenly distributed among participants. The mechanism is called Triangle Chaining, or T-Chain for short.
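The key-for-reciprocation exchange described above can be sketched as a toy, in-memory simulation. Several things here are assumptions made only for illustration: the `Peer` class and its methods are hypothetical, the way the sender learns that the request was fulfilled is reduced to a boolean flag, and `keystream_xor` is a toy hash-based stream cipher standing in for a real symmetric cipher. It is not the protocol as specified in Chapter 4.

```python
import hashlib
import secrets


def keystream_xor(key, data):
    """Toy symmetric cipher: XOR data with a SHA-256 counter-mode keystream.

    Assumption: a stand-in for a real cipher, for illustration only.
    Applying it twice with the same key returns the original data.
    """
    out = bytearray()
    for offset in range(0, len(data), 32):
        pad = hashlib.sha256(key + offset.to_bytes(8, "big")).digest()
        chunk = data[offset:offset + 32]
        out.extend(b ^ p for b, p in zip(chunk, pad))
    return bytes(out)


class Peer:
    """Hypothetical participant in a chain of transactions."""

    def __init__(self, name):
        self.name = name
        self.pending = {}  # transaction id -> ciphertext still awaiting its key
        self.escrow = {}   # transaction id -> key this peer has not yet released
        self.pieces = []   # pieces this peer has been able to decrypt

    def send_encrypted(self, tx, piece, receiver):
        """Send a piece encrypted under a fresh key; the key is withheld."""
        key = secrets.token_bytes(16)
        self.escrow[tx] = key
        receiver.pending[tx] = keystream_xor(key, piece)

    def release_key(self, tx, receiver, fulfilled):
        """Release the key only if the receiver reciprocated as requested."""
        if not fulfilled:
            return False
        key = self.escrow.pop(tx)
        receiver.pieces.append(keystream_xor(key, receiver.pending.pop(tx)))
        return True


if __name__ == "__main__":
    a, b, c = Peer("A"), Peer("B"), Peer("C")
    a.send_encrypted(1, b"piece-1", b)       # A -> B, with request: serve C
    b.send_encrypted(2, b"piece-2", c)       # B reciprocates toward C
    a.release_key(1, b, fulfilled=True)      # B now decrypts piece-1
    print(b.pieces, c.pieces)                # C must reciprocate to get its key
```

In the sketch, B becomes the sender of the next transaction as soon as it serves C, so the chain continues link by link, while a peer that never reciprocates is left holding only ciphertext.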

T-Chain enforces strong fairness among participants, limiting the system resources allotted to uncooperative participants (e.g., free-riders), while ensuring higher resource availability for compliant participants. Our simulation results demonstrate the effectiveness of the proposed mechanism against free-riding: under realistic conditions, free-riders never complete their downloads, and unrealistically strong collusion is required for free-riders to complete their downloads without contribution. Moreover, it is general enough to be applied to any distributed system requiring interactions among potentially untrusted participants. Since T-Chain can easily leverage asymmetric interests (in addition to symmetric ones), it is readily applicable to many diverse Internet applications whose primary direction of interactions is asymmetric, such as live and VOD streaming. To evaluate the effectiveness of our approach, we applied T-Chain to a BitTorrent-like swarm-based content-centric network,


1.5 Organization of the Dissertation

The rest of this dissertation is organized as follows. Chapter 2 discusses the

anti-pollution mechanism and Chapter 3 presents the free-riding prevention mechanism. Chapter 4 details the proposed general incentive mechanism and evaluates how the mechanism

could be applied to address the deviation from behavioral integrity (i.e., free-riding) on top

of BitTorrent. Finally, Chapter 5 concludes this dissertation and points out some future


Chapter 2

Anti-Pollution Mechanism

In this chapter, we investigate an example of malicious behavior in cooperative

distributed systems, pollution, which is a deviation (by malicious participants) from cooperation in the protection of public goods (i.e., files) shared in these systems.

Peer-to-Peer (P2P) systems, a representative family of cooperative distributed

systems, have emerged rapidly as a popular way to exchange information over the Internet.

Their advantages include simplicity, robustness, flexibility, and scalable performance.

Unfortunately, the information shared by current P2P systems has turned out to be error-prone,

causing waste of network bandwidth, user time, and storage space. For example, Liang et

al. [49] showed that up to 80% of the copies of popular files in the KaZaA network [4] are

contaminated. This contamination is caused in part by deliberate file pollution [49, 18, 46].

Because pollution was not anticipated in the design of P2P systems, and they have no

central, trusted authority, P2P applications are highly vulnerable to such intentional

attacks [18]. Since pollution squanders network resources and discourages user participation,

the success of P2P systems is questionable unless pollution is properly addressed.

Much research has been done regarding the prevention of the pollution problem

in P2P systems. Several modeling techniques [26, 43, 46] have been introduced, which are

useful to understand the proliferation of pollution. Some previous work has given general

ideas on how pollution could be reduced [49, 12, 18, 24]. Only peer reputation [40, 22],

object reputation [74, 79], and hybrid reputation [21, 23] approaches, however, are

proposed as practical solutions thus far in this literature. Reputation-based approaches aim to

resolve the pollution problem through cooperation among users sharing their past


pollution. Reasons include the complexity of implementation (e.g., use of complex security

mechanisms), the slow convergence to reach optimal performance (e.g., individual users are

responsible for collecting and evaluating opinions of others), and the cold start problem for

new users (e.g., other users’ opinions are barely useful until a newcomer has accumulated

a sufficient number of its own download experiences). Most of these weaknesses of existing

reputation systems are caused by the separation of the reputation and file sharing services.

That is, the building and acquisition of the reputation for a peer or an object is detached

from publish or look-up processes. In addition, current reputation systems disregard the

privacy concerns of participants, assuming that users willingly share their file download experiences or history (in part or as a whole) with other peers.

This chapter addresses the pollution problem in DHT-based P2P systems. In

DHT-based P2P systems, users store (or retrieve) the information on the files shared in the

system in (or from) DHT as index records (i.e., the description of the shared files or their location). No single user can download wanted files unless the matching index record is first

referenced. If the index records shared in the DHT are clean (i.e., pointing to only

non-polluted files), no system resources will be wasted on unwanted or polluted file downloads.

The principle driving our proposal is that benign users should cooperate in two ways with

one another in cleaning the index records to fight pollution. First, index nodes should

detect false (keyword) publish messages by confirming the contents of the publish message

upon receipt; no information published is accepted at face value. Second, privacy-preserving

object reputation is integrated into DHT as a part of the publish and look-up processes.

That is, index nodes collect feedback from past downloaders about the validity of those

files. This feedback from users is aggregated and inserted into the matching index records

in a novel way to indicate the authenticity of file contents. The results are then provided

to prospective downloaders to help with their decision about which files to download. The

results do not reveal information about specific users or their download history. Careful

consideration is given to reducing the impact of false feedback and malicious index nodes.

We call this scheme “winnowing”. The contributions of this work are:

• Object reputation is integrated into the DHT as a part of the file publish and look-up processes. Information is shared only between users who are interested in the same file contents; this reduces the complexity of reputation implementation, and accelerates


the convergence, as will be shown. Since newcomers can directly refer to the built-in

reputation information included in matching index records, no “cold start” problem

occurs under winnowing.

• Winnowing is privacy-preserving. The feedback from past downloaders is collected

and processed by the DHT (i.e., index nodes). Only the aggregated feedback results

are given to prospective downloaders, in a condensed form (i.e., voting credits). This

does not reveal the details of voters, downloading, or voting to other peers.

• Winnowing is easy to implement, completely distributed, and scalable. It does not

require any trusted third parties or centralized servers, which are difficult to achieve

in a large, self-organizing P2P system. No cryptographic signature is required for user

feedback. Random numbers (cryptographic nonces) are used for purposes of matching

votes to downloads.

We have implemented keyword publish verification on top of the latest version of eMule

client [8]. Using this modified client, the results of keyword publish verification have been

collected from the Kad network [55, 17], the largest DHT-based P2P system in existence.

These results are summarized in Section 2.4.1, and are incorporated into an analytical

model. This model is used to evaluate the performance of the privacy-preserving object

reputation mechanism. The model demonstrates the effectiveness of integrated reputation:

fast convergence to near-optimal performance and robustness to various pollution attacks.

All modeling results are then confirmed by means of event-driven simulations.

The remainder of this chapter is organized as follows: In Section 2.1, some P2P terminology

and a brief overview of the Kad network are introduced. Existing pollution strategies are

discussed in the latter part of the section. Section 2.2 investigates file sharing in the wild,

providing a deep understanding of existing P2P systems. Section 2.3 presents the protocol

description of winnowing with security considerations. The results of the measurement

of keyword publish verification and analysis of privacy-preserving object reputation are

presented in Section 2.4. Section 2.5 analyzes the overhead for implementing winnowing

and discusses limitations of the proposed method followed by possible solutions. Section 2.6

briefly describes and analyzes existing methods of anti-pollution approaches and, finally,


2.1 Background

This work investigates pollution problems in DHT-based file sharing P2P networks.

A specific file being shared, whether a movie or song or something else, is referred to as

a title. There may be different versions of a title, and for one version of a title, multiple

copies could exist if the version is downloaded and shared by many users in the network. There can also be many decoys, which appear to be versions of the title, but which do not in fact exist in the network, or which contain corrupted/lower-quality content. Each file has

metadata, which is structured information that describes the file. Metadata includes the file name, the file size, the file type or format, etc. [49, 17]. A keyword is a single token extracted from the metadata of a file, usually from the file name. In the Kad system, a keyword must

be composed of at least three characters [17]. Each file might have many keywords. A user

is an operator of a P2P client application, whereas a peer (or interchangeably a node) is the client application itself. A user is called a content owner (alternately, a publisher) when the user contributes a file to the P2P system. A user downloading a file from the system is

referred to as a downloader.

2.1.1 Publishing and Retrieving

The Kad network is a widely deployed DHT-based P2P system implemented by

several file sharing applications, such as eMule [27] or aMule [11]. It is based on one of

the well-known DHTs, Kademlia [55]. The Kad network may have more than one million

concurrent peers at any given time [71]. A key is an identifier used to store and retrieve information in and from the DHT, and has the same bit size as all Kad IDs. There are two

types of keys, a content key and a keyword key. A content key is obtained by hashing the entire content of a file, whereas a keyword key is computed by hashing each keyword.
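As a minimal sketch of the two key types, the snippet below derives a content key and a set of keyword keys for one file. MD5 is used only as a stand-in 128-bit hash because it is always present in Python's hashlib; it is not claimed to be the hash Kad uses. The tokenization rule in `extract_keywords` (lowercase alphanumeric tokens, extension dropped) is an assumption made for illustration and not the actual eMule parser; only the three-character minimum comes from the text.

```python
import hashlib
import re


def _hash128(data: bytes) -> int:
    # Stand-in 128-bit hash (MD5), chosen only for availability in hashlib.
    return int.from_bytes(hashlib.md5(data).digest(), "big")


def content_key(content: bytes) -> int:
    """Content key: hash of the entire file content."""
    return _hash128(content)


def extract_keywords(file_name: str) -> list[str]:
    """Hypothetical tokenizer: lowercase alphanumeric tokens of length >= 3."""
    stem = file_name.rsplit(".", 1)[0]          # drop the extension, if any
    tokens = re.split(r"[^0-9a-z]+", stem.lower())
    return [t for t in tokens if len(t) >= 3]   # Kad keywords need >= 3 characters


def keyword_keys(file_name: str) -> dict[str, int]:
    """Keyword keys: one hash per keyword extracted from the file name."""
    return {kw: _hash128(kw.encode("utf-8")) for kw in extract_keywords(file_name)}


if __name__ == "__main__":
    print(f"content key: {content_key(b'example file bytes'):032x}")
    for kw, kk in keyword_keys("Dragon War.mpg").items():
        print(f"keyword key for {kw!r}: {kk:032x}")
```

Both kinds of key land in the same 128-bit space, which is what allows a single DHT to store content index records and keyword index records side by side.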

Publishing is used to index information about a file. This information is stored by the index node in charge of that portion of the Kad ID space. A peer who wants to

publish a file (content owner) first hashes the whole content of the file to get the content

key. The content owner locates the node that has the closest Kad ID to the content key.

For this purpose, the content owner uses iterative routing [17, 69]. Once determined, the

closest node to the content key becomes a content key owner of the file. The content


Figure 2.1: A high level overview of the publish and retrieve mechanism. (a) publish, with messages <kk1, ck, metadata> (keyword publish, where kk1 = HASH(“dragon”)) and <ck, publisher info.> (content publish); (b) retrieve, returning <ck, publisher info.> followed by the file download.

key owner updates its local index table with a content index record <content key, location information>. The location information includes the basic information on the content owner

such as the Kad ID, the IP address, TCP port, etc. Furthermore, the content owner hashes

each keyword of the file into a keyword key. For each of the keyword keys, the content

owner locates the closest node (keyword key owner) to the keyword key, in the same way as before, and publishes the index information of the file. Each keyword key owner updates

its index table with a keyword index record <keyword key, content key, publisher IP list, possible file names, metadata > of the file. For a given key, up to 10 closest nodes can

be selected as key owners (referred to as a tolerance zone), to deal with high churn rates. Content publishing followed by keyword publishing is called a 2-level publishing scheme [17]. In the Kad system, content keys are periodically republished every 5 hours and keyword keys every 24 hours.
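The key-owner selection described above can be sketched as follows. Kademlia measures closeness between two IDs by their bitwise XOR, so the function below ranks known node IDs by XOR distance to a key and keeps up to 10 of them. This is a simplified view: the function name, the 4-bit toy IDs, and the flat list of known nodes are illustrative assumptions, and the iterative routing used to discover those nodes is not modeled.

```python
def closest_nodes(target: int, node_ids: list[int], max_owners: int = 10) -> list[int]:
    """Return up to max_owners node IDs ordered by XOR distance to target."""
    return sorted(node_ids, key=lambda node_id: node_id ^ target)[:max_owners]


if __name__ == "__main__":
    nodes = [0b0001, 0b0100, 0b0111, 0b1000, 0b1110]
    key = 0b0110
    owners = closest_nodes(key, nodes, max_owners=3)
    print([format(n, "04b") for n in owners])
```

Selecting several owners per key is what lets an index record survive when some of the nodes holding it leave the network.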

Retrieving file information from the DHT is the reverse of publishing. Users obtain a content key through a keyword search, and use this content key to retrieve location

information via a content search. The user then attempts to download the file from a

content owner via the location information that is returned.

Figure 2.1 depicts a high level overview of the publish and retrieve mechanism in

a DHT-based P2P system. In this example, we assume that the publisher (or the

downloader) wants to publish (or retrieve) a movie file, “Dragon War.mpg”. To publish the file

(Figure 2.1(a)), first, the publisher hashes the entire content to get the content key (ck)

of the file and sends content publish messages to the content key owner(s). Second, the

publisher hashes each keyword (i.e., “dragon” and “war”) to get a keyword key (kk) and


file (Figure 2.1(b)), first, the downloader hashes a possible keyword (e.g., “dragon” in this

example) to get the keyword key. Then, it sends keyword search messages to the keyword

key owner(s). Second, the downloader selects one content key among all possible content

keys returned by the keyword key owner(s) based on the metadata of each index record and

sends content search messages to the content key owner(s). Finally, the downloader

downloads the file from the content owner(s) based on the publisher (i.e., location) information

returned by the content key owner(s). The downloaded file is immediately republished and

shared until the downloader deletes or moves it.
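The publish and retrieve steps above can be illustrated with a single-process toy index. This is only a sketch under simplifying assumptions: the class `ToyIndex`, its method names, the MD5 stand-in hash, and the example location string are all hypothetical, and the two dictionaries stand in for index tables that the real system spreads across many key owners.

```python
import hashlib
from collections import defaultdict


def h128(data: bytes) -> int:
    # Stand-in 128-bit hash (MD5), chosen only for availability in hashlib.
    return int.from_bytes(hashlib.md5(data).digest(), "big")


class ToyIndex:
    """Single-process stand-in for the content and keyword key owners."""

    def __init__(self) -> None:
        self.content_index = defaultdict(set)   # content key -> publisher locations
        self.keyword_index = defaultdict(dict)  # keyword key -> {content key: metadata}

    def publish(self, content: bytes, keywords: list[str],
                metadata: dict, location: str) -> int:
        ck = h128(content)
        self.content_index[ck].add(location)         # level 1: content publish
        for kw in keywords:                           # level 2: keyword publish
            self.keyword_index[h128(kw.encode())][ck] = metadata
        return ck

    def keyword_search(self, keyword: str) -> dict:
        return dict(self.keyword_index.get(h128(keyword.encode()), {}))

    def content_search(self, ck: int) -> set:
        return set(self.content_index.get(ck, set()))


if __name__ == "__main__":
    index = ToyIndex()
    index.publish(b"movie bytes", ["dragon", "war"],
                  {"name": "Dragon War.mpg"}, "10.0.0.1:4662")
    candidates = index.keyword_search("dragon")   # step 1: keyword search
    chosen = next(iter(candidates))                # step 2: pick one content key
    print(index.content_search(chosen))            # step 3: where to download from
```

Note that the keyword search returns only content keys and metadata; the downloader must still perform the content search before it learns any publisher location.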

2.1.2 Existing pollution attacks

Depending on the strategies adopted by polluters, pollution can be classified into

the categories of content pollution, metadata pollution, and index pollution.

Content pollution [49, 24, 46] changes all or part of the content of a target file by

degrading its quality. This can be accomplished by the addition of white noise, by shuffling

or omitting some of its contents, or by substituting new information. Polluters can easily

create polluted files which have the same content hash value (i.e., the same content key) as

a non-polluted file. This is possible by exploiting the weakness of the current hash function

of the Kad network, MD4 [56].

Metadata pollution [49, 18] tampers with the metadata of a file, rather than

the contents themselves. Users downloading files based on that metadata will be misled

into downloading content differing from what they expect.

Index pollution [51] directly attacks the index structure of the network, instead

of the content or metadata of the target files. Index pollution can be accomplished by

inserting a massive number of false records into the index table of either a content key owner, or a keyword key owner of the target file (i.e., title). With index pollution, users fail

to locate the target file when relying upon the information in a false index record. Since

index nodes in most file-sharing based P2P systems today do not authenticate or verify

publish messages, polluters may easily contaminate the index records [51].

2.2 Sharing in the wild

For better understanding of existing file sharing P2P systems, some relevant


patterns, version popularity, and user characteristics (the fraction of NATed users, IP

distribution, etc.). These results are used in explaining and validating the design of winnowing

in the following two sections.

In these experiments, distribution and access to five mp3 songs were investigated.

The first 4 famous songs (T1 ∼ T4) were selected from the Top 10 Songs [9] in June of

2008, and were very popular. The last song (T5) was selected, for comparison, from the

late 1970’s billboard charts. This song hit #1 at that time, but is considerably less popular

today than the other 4 songs.

To collect results, we modified and inserted a crawling client (0.49a MorphXT

version 11.0 [8]) into the Kad network. First, for the keyword key owners, one keyword per

title (KTi) was extracted from the file name and hashed into the 128-bit keyword key.

Then, the client ID of each crawler (i.e., the matching keyword key owner) was configured

to have the 128-bit value of the hashed keyword. By so doing, each crawler could receive

keyword publish messages for the mapped keyword. Second, for each content key owner, the

most popular content key (i.e., version) among all content keys of each title was selected.

The most popular content key was then configured as the crawler’s client ID, so that content

key publish messages for that content would be received.

Data were collected for 48 hours for keyword publish messages, and for 10 hours

for content publish messages; this spans twice the normal republish cycle.

Figure 2.2: File sharing statistics in the wild (Kad network). (a) total number of keyword publish messages per hour over time, for KT1 to KT5; (b) number of unique content keys published per hour over time, for KT1 to KT5; (c) version popularity: PDF of copies versus content key number ordered by popularity (log-log scale), for T1 to T5.

Figure 2.2(a) shows the total number of keyword publish messages received by

each keyword key owner. It demonstrates the time of day effect on the total number of


order of magnitude, based on the popularity of keywords. Note that the target songs are

selected from titles famous in the USA and the keyword publish messages are collected

from the winnowing clients in the EST time zone. So, this result gives a hint of the diurnal

behavior of Kad users.

Even at the high rate of keyword publishes, we assumed that the number of unique

content key publishes is quite low since there must be many duplicate keyword publishes for

the same content keys. To investigate this, only the keyword publishes of unique content

keys are counted, and Figure 2.2(b) presents the results. As indicated, the number of

keyword publishes for unique content keys is less than 1/5 of total keyword publishes in most

cases and the ratio keeps decreasing as time goes on. This suggests that the proliferation

of existing contents (rather than newly introduced contents) dominates file sharing in the

wild.

Next, the version popularity of each target song is investigated. In the Kad

network, the number of keyword publish messages for a content key represents the popularity

of the version (i.e., content key) because there must be a keyword publish message whenever

a version is downloaded. To check the popularity, content keys are ordered based on the

number of keyword publish messages received during the measurement period. Figure 2.2(c)

demonstrates the PDF on a log-log scale for the number of copies with respect to the top

100 content keys of each title. The near linearity of the curves confirms that the version

popularity of a title in the Kad network follows a Zipf distribution, which strongly indicates

that users select a version (i.e., content key) based on its popularity if there is no other

selection standard. The results agree with the results of previous studies [49, 22] on other

P2P systems.

Figure 2.3: (a) fraction of NATed users per title, T1 to T5 (Kad); (b) number of users per IP/24 range, T1 to T5 (Kad); (c) number of users per IP/24 range (BitTorrent).


Realistic user statistics are valuable in designing an effective reputation

mechanism. First, the fraction of NATed users was investigated; this helps us understand the

distribution of downloaders of a file. In the Kad network, since a NATed publisher includes

its buddy3 information in the content publish message, content key owners can easily figure

out whether the publisher is a NATed one or not. Figure 2.3(a) shows the fraction of NATed

users for the most popular content key of each title, ranging from 38.55% to 46.06%. We

believe that the actual ratio of NATed users is somewhat higher than indicated in that the

titles selected are popular in ARIN (North America) area where most of the public IPv4

addresses are allocated.

Next, the number of users per small IP prefix range (i.e., IP/24) who have

downloaded the same most popular version (content key) of each title was investigated. This is

useful for purposes of identifying colluding malicious nodes. Figure 2.3(b) demonstrates the

number of users in each IP/24 range for the most popular content key of each title. The

average number of users who have downloaded the same version in an IP/24 range was only

1.1. Note that the data have been collected for 10 hours, which is twice the length of the

content republish cycle; therefore, the actual number in a single cycle is lower than that.

A similar investigation was done for the RedHat 9 Torrent tracker trace [5], which represents 5

months of activity in a BitTorrent swarm. As shown in Figure 2.3(c), the same trend was

found in this case.
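The per-prefix counting behind this measurement can be sketched as follows. The snippet groups publisher addresses by IP/24 prefix and builds the same 1, 2, 3, 4, >=5 histogram used in the figure. Treating each distinct IPv4 address as one user is an assumption made here for simplicity, and the addresses in the example are documentation-range placeholders rather than measured data.

```python
import ipaddress
from collections import Counter


def users_per_prefix(ips: list[str], prefix_len: int = 24) -> Counter:
    """Count distinct addresses in each IPv4 prefix of the given length."""
    counts: Counter = Counter()
    for ip in set(ips):  # assumption: one distinct address == one user
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        counts[str(net)] += 1
    return counts


def histogram(counts: Counter, cap: int = 5) -> dict[str, int]:
    """Bucket prefixes by user count: 1, 2, ..., cap-1, and '>=cap'."""
    hist: Counter = Counter()
    for n in counts.values():
        hist[str(n) if n < cap else f">={cap}"] += 1
    return dict(hist)


if __name__ == "__main__":
    sample = ["192.0.2.1", "192.0.2.77", "198.51.100.5", "203.0.113.9"]
    counts = users_per_prefix(sample)
    print(counts)
    print(histogram(counts))
    print(sum(counts.values()) / len(counts))  # average users per IP/24 range
```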

These two results show that the number of users in a small IP range downloading

the same content (i.e., the same version) is relatively low, even considering the high volume

of NATed users in the wild. This will motivate the use of IP prefix-based binning for user

feedback collection and aggregation (Section 2.3.2).

2.3 Index Filtering (winnowing)

In this section we present the basic protocol description of winnowing. Winnowing aims to reduce decoy (i.e., bogus or corrupted) keyword index records in P2P systems. To achieve this goal, as a preliminary phase, keyword key owners verify whether the content

3

A buddy can receive incoming messages for a firewalled or NATed peer, and will manage its communications.


key in each keyword publish message is valid for all keyword publish messages received, and disregard invalid keyword publish messages. Then privacy-preserving reputation is

integrated into DHT. That is, for files which are published and downloaded, keyword key

owners collect and aggregate feedback from the users who have downloaded files via their

keyword index records. The results are then stored in the mapping keyword index records

in condensed form. This helps prospective downloaders to locate clean (unpolluted) files.

No information on the voters or their downloading history is stored in the keyword index

records or revealed to other downloaders; this preserves the privacy of downloaders.

In the discussion below, we initially assume that all index nodes are trustworthy.

This assumption is then relaxed in Section 2.3.3. Careful consideration is given to reducing

the impact of false feedback and malicious index nodes.

Figure 2.4: Message processing in eMule with winnowing, among the downloader, keyword key owner, publisher, and content key owner. (a) keyword publish verification (preliminary phase): CONTENT_PUBLISH_REQ/RES, KEYWORD_PUBLISH_REQ/RES, and CONTENT_KEY_VERI_REQ/RES; (b) privacy-preserving object reputation: KEYWORD_SEARCH_REQ/RES, CONTENT_SEARCH_REQ/RES, CONTENT_DOWNLOAD, followed by AFR and RES or MJR and RES.

2.3.1 Keyword Publish Verification

To reduce index pollution (Section 2.1.2), the keyword key owner must verify the

contents of each keyword publish message it receives. The principle here is that no

information published is accepted at face value; it is confirmed by the keyword key owners before

acceptance. Otherwise, a polluter could generate a random content key which matches

no file in the network, and publish the information <keyword key, random content key, metadata> to the matching keyword key owners. When a user attempts to download a


and effort. If the user cannot locate a file after several attempts, it is likely he or she will

abandon the search and leave the system.

As seen in Figure 2.4(a), verification of the keyword publish message is

accomplished by issuing content search messages, using as a target the content key in the keyword

publish message. A successful reply indicates the content key is valid. Conversely, if there

is no reply to the content search using the given content key within a reasonable period

of time, the content key can be considered to be bogus. Note that search queries succeed

99.9% of the time in the current eMule [71], which means that this assumption of success

when searching for a valid file is realistic.
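The verification step can be sketched as a guard in front of the keyword index. In the snippet below, `content_search` is a hypothetical callable standing in for the DHT content search; it is assumed to return a list of publisher locations, or `None` when no reply arrives within the timeout. The 30-second default is an arbitrary placeholder for the "reasonable period of time" mentioned above.

```python
from typing import Callable, Optional

# Hypothetical signature: (content key, timeout in seconds) -> locations or None.
ContentSearch = Callable[[int, float], Optional[list]]


def verify_keyword_publish(content_key: int, content_search: ContentSearch,
                           timeout_s: float = 30.0) -> bool:
    """Accept a keyword publish only if its content key resolves to a publisher."""
    locations = content_search(content_key, timeout_s)
    return bool(locations)


def handle_keyword_publish(index: dict, keyword_key: int, content_key: int,
                           metadata: dict, content_search: ContentSearch) -> bool:
    """Store the keyword index record only after the content key is verified."""
    if not verify_keyword_publish(content_key, content_search):
        return False  # treated as bogus and disregarded
    index.setdefault(keyword_key, {})[content_key] = metadata
    return True


if __name__ == "__main__":
    known = {42: ["10.0.0.1:4662"]}            # content keys that have a publisher

    def fake_search(ck: int, timeout_s: float) -> Optional[list]:
        return known.get(ck)

    index: dict = {}
    print(handle_keyword_publish(index, 1, 42, {"name": "ok"}, fake_search))
    print(handle_keyword_publish(index, 1, 99, {"name": "decoy"}, fake_search))
    print(index)
```

The check costs one extra content search per keyword publish, and it only filters content keys that have no publisher at all; records that point at existing but polluted files pass through.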

Content key verification may be tricked if the attacker also publishes fake location

(i.e., publisher) information for the random content keys. However, this attack requires

substantial effort and resources from the attacker in that the bogus random content keys

must be re-published every 5 hours (Section 2.1.1). In addition, index pollution is effective

in this case only when the majority of keyword index records are bogus [51]. Moreover, the bogus keyword index records will be further subjected to examination (and rejection) by

means of the privacy-preserving object reputation mechanism.

2.3.2 Privacy-preserving Object Reputation

The publish verification explained above does not prevent content and metadata

pollution (Section 2.1.2). This is because index nodes themselves are unable to judge the

quality or validity of file contents. Therefore, polluters could still store corrupted index information in the P2P system through content and metadata pollution. In addition, in

some cases, bogus index records inserted by smart polluters circumventing keyword publish

verification may exist.

To deal with these problems, privacy-preserving object reputation is proposed.

The collection and dissemination of reputation information is integrated in DHT as a part

of the file publish and look-up processes. In this approach, keyword key owners collect

feedback from users (who have downloaded files via their index records) about the contents

of files they have downloaded. The feedback from users is aggregated and combined into the

matching keyword index records, in a way that indicates the authenticity or validity of the file contents. The aggregated feedback results are then provided to prospective downloaders


Feedback Collection

To collect feedback, each keyword key owner must maintain two kinds of lists:

keyword reference lists (KRLs) and content key voter lists (CKVLs). One KRL is allocated to each keyword key managed by the matching keyword key owner, and one CKVL is

assigned to each content key. When a keyword key owner receives a keyword search query

from a peer P, it generates a random number R 5 and inserts <IP address, R> into the

matching KRL for this keyword key. The reply to P also contains the value of R.

To help eliminate polluted content, some fraction of users will report their file

download experiences to the appropriate keyword key owners. The user feedback report

comprises the automated failure report (AFR) and the manual judgement report (MJR). As seen in Figure 2.4(b), an automated failure report message is generated by the peer for every

failed download trial. This happens without requiring any user action. The report is sent

directly to the mapping keyword key owner if no location information is returned for the

content key attempted. This failure could be caused by either a bogus keyword index record

(not filtered out by the keyword publish verification), or by a stale keyword index record.

If a download attempt is successful, and upon viewing or listening to the file’s contents,

the user may send a manual judgement report by simply marking it as either “clean” or

“polluted” at his or her own discretion. This report is sent directly to the keyword key

owner. Both the AFR and the MJR contain the value of R sent to the peer by the keyword

key owner in response to a previous keyword search query. Casting a vote is mutually

beneficial for compliant users: when a user receives polluted content, the vote from that

user reduces the probability that other compliant users select the same defective content,

and their feedback in turn helps the user select a clean version with higher probability.

If a peer P provides a keyword key owner with feedback (AFR or MJR) about a

download experience, the keyword key owner checks the KRL to see whether the peer has

ever issued a keyword search query before. This can be done based on the peer’s IP address,

and the random number R presented as evidence of the peer’s keyword search. If a match

in the KRL is not found, the feedback is ignored. Otherwise, the keyword key owner checks

the CKVL to see whether the peer has previously voiced its opinion about this content key,

based on the IP address of the voter and R. If so, the feedback is also ignored. Otherwise,

5

A cryptographic nonce used to prove that a voter has actually downloaded a file via one of the keyword index records of this keyword key owner. It also prevents attackers from using spoofed IP addresses outside


the feedback from this peer is aggregated with feedback from other peers as explained in

the next section, and the keyword key owner inserts <IP address, R > into the CKVL.

This prevents the same peer from casting a vote about the same content multiple times as

a result of a single search.
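The KRL and CKVL checks can be sketched for a single keyword key as follows. The class name, method names, and the 16-byte nonce length are assumptions made for illustration; the sketch covers only the accept-or-ignore decision and leaves aggregation of accepted votes to a separate step.

```python
import secrets


class FeedbackCollector:
    """Sketch of one keyword key owner's bookkeeping for a single keyword key."""

    def __init__(self) -> None:
        self.krl = set()   # keyword reference list: (ip, nonce) pairs
        self.ckvl = {}     # content key -> set of (ip, nonce) pairs that voted

    def on_keyword_search(self, ip: str) -> str:
        """Record the search and return the random number R sent with the reply."""
        nonce = secrets.token_hex(16)  # assumed length; the text only says "random"
        self.krl.add((ip, nonce))
        return nonce

    def on_feedback(self, ip: str, nonce: str, content_key: int) -> bool:
        """Return True if the AFR or MJR is accepted for aggregation."""
        if (ip, nonce) not in self.krl:
            return False   # no matching keyword search: feedback is ignored
        voters = self.ckvl.setdefault(content_key, set())
        if (ip, nonce) in voters:
            return False   # this search already voted on this content key
        voters.add((ip, nonce))
        return True


if __name__ == "__main__":
    owner = FeedbackCollector()
    r = owner.on_keyword_search("192.0.2.1")
    print(owner.on_feedback("192.0.2.1", r, content_key=7))     # first vote
    print(owner.on_feedback("192.0.2.1", r, content_key=7))     # repeated vote
    print(owner.on_feedback("198.51.100.2", r, content_key=7))  # wrong IP for R
```

Because the stored pair binds R to the address that issued the search, a vote replayed from a different address fails the KRL lookup even if R itself has leaked.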

Attackers can cast multiple votes about content, in an attempt to manipulate

the feedback mechanism of winnowing. In order to do that, the attackers may perform

different search queries (which yield the same content key) in order to get many voting

opportunities (i.e., different values of R). The number of such multiple malicious votes

(from the same user) is, however, easily regulated by limiting the search frequency, which

is already done in the existing eMule client [76]. More importantly, the effects are further

limited by the feedback aggregation mechanism of winnowing explained in the following.

Feedback Aggregation

In winnowing, each keyword index record has the form <keyword key, content key,

voting credits, ..., metadata>. The voting credits represent the validity of the content key (i.e., version), and reflect the votes (“clean” or “polluted”) previously cast by downloaders

of that content. Each accepted vote is reflected in the matching keyword index record by

increasing (for a positive vote) or decreasing (for a negative vote) the voting credits for a

content key. The voting credits do not reveal who cast votes, nor what individual votes

were cast. This preserves the privacy of the voters participating in the file sharing system.

The way in which voting credits are increased or decreased determines the

effectiveness of this object reputation scheme. The approach advocated here is

Additive-Increase-Multiplicative-Decrease (AIMD). With the AIMD approach, each keyword key owner adds

1 to the voting credits if a positive vote is received, but reduces by half (divides by 2) the

voting credits if a negative vote is received. Winnowing adopts the AIMD approach as

the method of updating voting credits in order to disseminate information about polluted

content quickly. Of course, an attacker could potentially abuse this policy to mislead peers

into believing clean content is actually polluted due to the multiplicative decrease (i.e.,

order-dependent) nature of AIMD. However, the IP prefix binning strategy described below and the imbalanced feedback mechanism (Section 2.3.3) minimize the impact, which will be examined again in Section 2.4.
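A small worked example makes the AIMD rule and its order dependence concrete. The snippet below applies the rule exactly as stated (add 1 for a positive vote, divide by 2 for a negative vote); starting the credits at 0 and ignoring vote weights are assumptions of this sketch.

```python
def update_credits(credits: float, vote_is_positive: bool) -> float:
    """AIMD update: add 1 for a positive vote, halve for a negative vote."""
    return credits + 1.0 if vote_is_positive else credits / 2.0


def replay(votes: list[bool], start: float = 0.0) -> float:
    """Apply a sequence of votes in order and return the final voting credits."""
    credits = start
    for vote in votes:
        credits = update_credits(credits, vote)
    return credits


if __name__ == "__main__":
    # Same five votes in a different order give different credits.
    print(replay([True, True, True, True, False]))  # negative vote arrives last
    print(replay([False, True, True, True, True]))  # negative vote arrives first
```

With four positive votes followed by one negative vote the credits end at 2.0, whereas the same votes with the negative one first end at 4.0. A late negative vote therefore erases half of everything accumulated before it, which is what lets news of pollution spread quickly and also what an attacker would try to exploit.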


designed. That is, malicious nodes will lie (i.e., vote positively for decoy keyword index

records and vote negatively for clean ones). This is referred to here as reverse voting. Such users may also attempt to amplify the impact by forging false identities (i.e., resorting to

the Sybil attack [25]). Winnowing addresses this problem by means of the IP prefix based binning strategy with weighted voting. With this mechanism, one IP address prefix (e.g., IP/24) is mapped into a bin, and the weight of a vote in the same bin decreases as the number of votes in the bin increases. This prevents multiple votes in the same IP range

from counting as much as multiple votes from different IP ranges. The motivation for this mechanism is that it has been found in practice that the majority of users in P2P systems

are benign [50], and the number of users who have downloaded the same file (i.e., version)

in a small IP prefix range is low, as indicated by our measurement results (Section 2.2). For

this purpose, the frequency of the same IP prefix of voters is remembered in the mapping

CKVL to adjust the weight of each vote.
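The binning idea can be sketched as follows. The text specifies only that a vote's weight decreases as the number of votes in its IP/24 bin grows; the particular weighting used here, where the n-th vote from a bin counts 1/n, is an assumption chosen for illustration, as are the class and method names.

```python
import ipaddress
from collections import Counter


class PrefixBins:
    """Track votes per IP prefix bin and return a decreasing weight per vote."""

    def __init__(self, prefix_len: int = 24) -> None:
        self.prefix_len = prefix_len
        self.votes_in_bin: Counter = Counter()

    def weight_for(self, voter_ip: str) -> float:
        """Register one vote from voter_ip and return the weight it should carry."""
        net = ipaddress.ip_network(f"{voter_ip}/{self.prefix_len}", strict=False)
        self.votes_in_bin[net] += 1
        # Assumed weighting: the n-th vote from the same bin counts 1/n.
        return 1.0 / self.votes_in_bin[net]


if __name__ == "__main__":
    bins = PrefixBins()
    for ip in ["192.0.2.1", "192.0.2.200", "192.0.2.33", "198.51.100.7"]:
        print(ip, bins.weight_for(ip))
```

Under this assumed weighting, three votes from one IP/24 range contribute weights 1, 1/2, and 1/3, so an attacker confined to a single prefix gains less from each additional identity, while a lone voter from another range still counts in full.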

2.3.3 Penalizing Uncooperative or Malicious Index Nodes

Thus far, it has been assumed that all keyword key owners are honest. Under

this assumption, the processed voting results for each content key must be reflected into

the mapping keyword index record, and secured by the keyword key owners themselves.

However, this assumption may not hold in practice. In some cases, the keyword key owners

may be lazy or uncooperative, and may simply not act properly as feedback mediators. In

other cases, attackers may insert themselves as a keyword key owner of the target file, so that

they can control (and falsify) the keyword index records as they wish. This will be possible

if an attacker intentionally manipulates its client ID so that the ID matches the hash of

a keyword of the target file. Once the attacker becomes the keyword key owner(s) of the

target title and returns only decoy index records (i.e., provides false information) whenever asked, downloaders may be unable to locate clean copies of a file, or may consistently

download polluted copies of files.

To address this problem, winnowing adopts a so-called imbalanced feedback mechanism. With imbalanced feedback, the size of voting messages differs based on whether the vote is positive (favorable) or negative (unfavorable). The size of a positive vote is

very small (possibly only a few tens of bytes) whereas the size of a negative vote is much


penalizes malicious keyword key owners in that index nodes issuing many decoy keyword

index records will receive a high volume of negative votes. Once compliant users cast these

votes, the attacking keyword key owners cannot avoid the penalty, which consumes their download

bandwidth. Second, the technique makes it more difficult for attackers to

cast many reverse votes for clean keyword index records since that will consume their own

resources (i.e., upload bandwidth).
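The bandwidth asymmetry can be made concrete with a small calculation. Both message sizes below are assumptions chosen only for illustration: the text says a positive vote is a few tens of bytes and a negative vote is much larger, but gives no exact figures here. The function name is likewise hypothetical.

```python
# ASSUMPTION: illustrative message sizes, not values from the source.
POSITIVE_VOTE_BYTES = 32    # "a few tens of bytes"
NEGATIVE_VOTE_BYTES = 4096  # "much" larger; exact size is not given


def vote_traffic(positive: int, negative: int) -> int:
    """Total bytes transferred for the given numbers of positive and negative votes."""
    return positive * POSITIVE_VOTE_BYTES + negative * NEGATIVE_VOTE_BYTES


# Download load on a keyword key owner serving mostly clean records
# (mostly positive feedback) versus one serving mostly decoys.
honest_owner_load = vote_traffic(positive=1000, negative=10)
decoy_owner_load = vote_traffic(positive=10, negative=1000)

# Upload cost to an attacker who casts 1000 reverse (negative) votes
# against clean records.
reverse_voter_cost = vote_traffic(positive=0, negative=1000)
```

Under these assumed sizes, the owner serving decoys receives more than fifty times the vote traffic of the honest owner for the same number of votes, and a reverse voter pays the full cost of each negative vote in its own upload bandwidth.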

The keyword key owner may also fail to provide keyword index records for selected

files (i.e., tell nothing). Note that if all the keyword key owners for a keyword are replaced by such attackers, then users may not be able to find the files whose name (or metadata)

includes that keyword. However, such attacks will be difficult and expensive, given the

degree of redundancy of keyword index records in current systems. For instance, it has

been found that there are an average of 19 keyword key owners for a single keyword in the

Kad network [71], and each title is indexed by a number of keywords. Moreover, use of

multiple hash functions [77] designed for load balancing will further restrict the ability of

attackers to control all of the keyword key owners for a file.

2.4 Evaluation

In this section, the winnowing approach is evaluated. The first part checks the

effectiveness of keyword publish verification in the Kad system. In the second part,

the privacy-preserving object reputation of winnowing is evaluated through analytic modeling, and the

results are compared with simulation. Modeling and simulation allow a larger variety of

conditions and assumptions to be tested than would a deployed solution. They have the

added benefit that they do not require acceptance by and the training of a user community,

which is normally a lengthy process.

2.4.1 Performance of Keyword Publish Verification (Preliminary Phase)

The keyword publish verification of winnowing was implemented on top of an

up-to-date eMule client (0.49a MorphXT version 11.0 [8]) and inserted into the Kad network.

Experiments were conducted under the same settings as in Section 2.2. Each keyword key owner

(i.e., modified winnowing client) verified the content key in each keyword publish message,

issuing content search messages (i.e., CONTENT KEY VERI REQ in Figure 2.4(a)). If