SHIN, KYUYONG Preventing Misbehavior in Cooperative Distributed Systems. (Under the direction of Professor Douglas S. Reeves).
Cooperative distributed systems are becoming increasingly popular as alternatives
to the traditional client-server model for many applications, including file sharing,
streaming, and distributed computing. In cooperative distributed systems, participants directly
cooperate with each other to achieve common goals by sharing resources without the need of any central control. Therefore, in contrast to the client-server model, the system
capacity potentially scales as the number of participants in a system increases, providing the
participants with information or services with few resource restrictions. The information or
services provided by the system can be thought of as a public good, and participants should
play a part in the protection and provision of the public good. Thus, cooperation among participants to obtain mutual benefits is the fundamental premise behind the success of
such a system.
In spite of the importance of cooperation among participants to protect and
support the public goods in cooperative distributed systems, a high level of informational
integrity of the goods and behavioral integrity of participants toward the goods is difficult
to achieve due to malicious or selfish participants. Because such malicious or selfish behavior was not anticipated at the inception of cooperative distributed applications, they
are highly vulnerable to such behavior. To address the problem, in this dissertation, we
identify two major threats (i.e., pollution and free-riding) to the protection and provision of the public goods, and propose tailored solutions to those specific threats. In addition,
a general, fairness-enforcing incentive mechanism is proposed to foster cooperation among
participants, which could be readily used to prevent various forms of misbehavior in a wide range of
cooperative distributed systems.
Firstly, this dissertation investigates the pollution problem in file sharing systems,
and proposes a novel Distributed Hash Table (DHT)-based anti-pollution scheme called
winnowing. Winnowing attempts to achieve a high level of informational integrity of the public goods (i.e., shared files in this case) through cooperation among (benign)
sharing. By incorporating secret sharing into file sharing, the proposed scheme, called Treat-Before-Trick (TBeT), enforces cooperation among participants by preventing uncooperative participants from acquiring the secrets required to complete their work. Therefore, a
high level of behavioral integrity on the part of participants toward the public goods can
be achieved under TBeT.
Finally, this dissertation proposes a general incentive mechanism, named Triangle
Chaining (T-Chain), which can be readily and widely used in many cooperative distributed systems to enforce cooperation among participants. T-Chain depends strongly both on the use of light-weight symmetric cryptography to reduce the opportunity
by Kyuyong Shin
A dissertation submitted to the Graduate Faculty of North Carolina State University
in partial fulfillment of the requirements for the Degree of
Doctor of Philosophy
Computer Science
Raleigh, North Carolina
2009
APPROVED BY:
Dr. George N. Rouskas Dr. Peng Ning
Dr. Douglas S. Reeves Dr. Injong Rhee
DEDICATION
To my beautiful and tolerant wife
Nari Jun
and two cherished sons
Jaebeen
and
Jaeui
BIOGRAPHY
Kyuyong Shin was born on April 20, 1973, in South Korea. He married Nari Jun
in 2000 and they have two sons, Jaebeen and Jaeui.
He received his Bachelor’s degree in the Department of Computer Science at the
Korea Military Academy in 1996. After serving in the Korean Army as a platoon leader, he
attended Korea Advanced Institute of Science and Technology (KAIST) from 1998 to 2000,
where he earned his Master’s degree in computer science. After he experienced various
military assignments as an army officer and assistant professor in Korea, he entered the
Ph.D. program in the Department of Computer Science at North
Carolina State University in the fall of 2005.
He lives in Seoul, South Korea, where he is currently working for the Korea Military
Academy as a faculty member. His research interests lie in Peer-to-Peer systems, network
ACKNOWLEDGMENTS
I would like to thank all people who have helped and inspired me during my
doctoral study. First of all, I would like to express my deepest gratitude to my advisor, Dr.
Douglas S. Reeves, for his endless concern, inspiration, commitment, and support.
Throughout my doctoral research, he continually encouraged me to develop analytical thinking and
research skills, and patiently guided me in the right direction.
I would like to thank my co-advisor, Dr. Injong Rhee, for his insightful and
constructive guidance in my doctoral research. Also, I am very grateful to the other committee
members, Dr. George N. Rouskas and Dr. Peng Ning, for their continued support, valuable
comments, and encouragement. Special thanks should be given to Dr. Xiaohui Gu for her
substitution during my final oral defense, allowing me the flexibility to schedule the exam.
It has been great to work with Dr. Carla D. Savage as her Teaching Assistant for
several semesters, acquiring invaluable teaching experience here at NCSU. In addition, I
would like to thank the Director of Graduate Program, Dr. David J. Thuente, for his help
during my Ph.D. study. I am also grateful to Ms. Margery Page for her constant help.
This research is supported by various funding sources including Secure Open
Systems Initiative (SOSI), and I would like to thank them for their support. I would also like to
thank the Korean Army and the Korea Military Academy for giving me this invaluable
opportunity to study abroad.
I would like to thank my friends and colleagues for their help in various ways
throughout the entire duration of this thesis, including Ahmed Azab, An Liu, Attila
Altay Yavuz, Beakcheol Jang, Emre (John) Sezer, Entong Shen, Dr. Hyungjun Kim, John
Sobrero, Juan Du, Jun Bum Lim, Pastor Kwanseok Kim, Dr. Mihui Kim, Min Yeol Lim,
Michele Calcavecchia, Sangwon Hyun, Sangtae Ha, Seong Ik Hong, Qinghua Zang, Wei Wei,
Yao Liu, Young Hee Park, and others whom I may have mistakenly forgotten to mention.
I also would like to thank Korean army officers, Chongkyung Kil, Jung Ki So,
Woopan Lee, Yong Chul Kim, Kiseok Jang, Changho Son, and Hesun Choi for all their
help, support, and valuable time we spent together at NC State. And, I cannot forget to
mention my KMA 52 classmates, Seungjin Han, Sungrok Kang, Sungwoo Kim, Youngchoon
Jeon, Youngho Jo, and their families for the unforgettable memories in the United States.
Finally, but most importantly, thank God for saving my life and always being with
TABLE OF CONTENTS
LIST OF TABLES
LIST OF FIGURES
1 Introduction
1.1 Cooperative Distributed Systems
1.2 Major Threats to Cooperative Distributed Systems
1.3 Challenges to Developing Solutions
1.4 Our Approaches
1.5 Organization of the Dissertation
2 Anti-Pollution Mechanism
2.1 Background
2.1.1 Publishing and Retrieving
2.1.2 Existing pollution attacks
2.2 Sharing in the wild
2.3 Index Filtering (winnowing)
2.3.1 Keyword Publish Verification
2.3.2 Privacy-preserving Object Reputation
2.3.3 Penalizing Uncooperative or Malicious Index Nodes
2.4 Evaluation
2.4.1 Performance of Keyword Publish Verification (Preliminary Phase)
2.4.2 Analytic Model (Privacy-preserving Object Reputation)
2.4.3 Modeling and Simulation Results
2.5 Discussion
2.6 Related Work
2.7 Summary of Winnowing
3 Free-riding Prevention Mechanism
3.1 Related Work
3.2 Background
3.2.1 BitTorrent Overview
3.2.2 Existing Free-riding Strategies
3.2.3 (t, n) Threshold Secret Sharing
3.3 Treat-Before-Trick (TBeT)
3.3.1 Basic Protocol Description
3.3.2 Protocol Enhancements
3.3.3 Optimal Threshold Value
3.3.4 Countering Free-Riding Strategies
3.4.1 Experimental Setup
3.4.2 Optimal Threshold Value (t)
3.4.3 Penalty for Free-Riding
3.4.4 Performance of Compliant Peers
3.4.5 Fairness
3.4.6 Performance of Rational Peers
3.5 Discussion
3.5.1 Computational Overhead
3.5.2 Collusion
3.5.3 Key Disclosure
3.6 Summary of TBeT
4 General Incentive Mechanism
4.1 Related Work
4.1.1 Basic Enhancements to TFT
4.1.2 Use of Indirect Information (Reputation or Accounting)
4.1.3 Use of Coding or Encryption
4.1.4 Summary of Existing Work
4.2 Design of T-Chain
4.2.1 Protocol Overview
4.2.2 Initiation Phase : Figure 4.1(a)
4.2.3 Continuation Phase : Figure 4.1(b)
4.2.4 Termination Phase : Figure 4.1(c)
4.2.5 Incentives Immanent in T-Chain
4.2.6 Protocol Enhancements
4.2.7 Summary of the Proposed Scheme
4.3 Security and Overhead Analysis
4.3.1 Known BitTorrent Free-Riding Attacks
4.3.2 Attacks Unique to T-Chain
4.3.3 Overhead of T-Chain
4.3.4 Fault Tolerance Issues
4.3.5 Optimality
4.4 Evaluation
4.4.1 Experimental Conditions
4.4.2 Performance without Free-Riding (Flash Crowd Case)
4.4.3 Performance with Free-Riding (Flash Crowd Case)
4.4.4 Impact of Collusion Among Free-Riders (Flash Crowd Case)
4.4.5 Performance with Free-Riding (Real Trace Case)
4.4.6 Fairness with Free-Riding (Real Trace Case)
4.4.7 Chain Characteristic
4.4.8 Opportunistic Seeding
4.4.9 Symmetric and Asymmetric Interests
4.4.10 Performance Under High Churn Rates
4.4.11 Summary of Evaluation
5 Conclusion and Future Work
5.1 Conclusion
5.2 Future Work
LIST OF TABLES
Table 2.1 Summary of Notations Used in Winnowing
Table 3.1 Summary of Notations Used in TBeT
LIST OF FIGURES
Figure 2.1 A high level overview of the publish and retrieve mechanism.
Figure 2.2 File sharing statistics in the wild (Kad network)
Figure 2.3 User statistics in the wild (the Kad network and BitTorrent)
Figure 2.4 Message processing in eMule with winnowing.
Figure 2.5 The measurement results of the keyword publish verification.
Figure 2.6 The relative performance of each system under perfect conditions.
Figure 2.7 The relative performance of each system under realistic conditions.
Figure 2.8 The effects of the active pollution with the Sybil attack.
Figure 2.9 The effects of IP binning with α under the realistic conditions.
Figure 2.10 The effect of collusion attacks.
Figure 2.11 The number of negative votes received by the malicious index node.
Figure 3.1 BitTorrent overview
Figure 3.2 Optimal threshold value t of TBeT.
Figure 3.3 Average download completion time of compliant peers and free-riders (without the large-view-exploit).
Figure 3.4 Average download completion time of compliant peers and free-riders (with the large-view-exploit).
Figure 3.5 The ratio of the average completion time of free-riders over that of compliant peers under the homogeneous (a) and heterogeneous (b) upload speed case.
Figure 3.6 The rate at which free-riders in TBeT get subkeys under the continuous arrival process with heterogeneous upload rate peers.
Figure 3.8 Correlation between the download completion time and the donated upload rate in both systems with the large-view-exploit and 25% free-riders out of 400 leechers. In this experiment, peers join the systems in a flash crowd.
Figure 3.9 The average download completion time of semi-seeders and rational peers after subkey completion under the continuous arrival process. Half of 500 semi-seeders were randomly selected as rational peers.
Figure 3.10 The completion times of compliant peers and free-riders with heterogeneous upload rates in both systems after a key disclosure under the continuous stream model.
Figure 3.11 The instantaneous completion time of compliant peers in TBeT with rekeying, assuming continuous peer arrivals.
Figure 4.1 The initial, intermediate, and terminal transactions in a chain of T-Chain.
Figure 4.2 Average download completion time (a) and average uplink utilization (b) of compliant leechers in BitTorrent, PropShare, and T-Chain under a flash crowd leecher arrival model without free-riders.
Figure 4.3 The effects of the file size (a) and the swarm size (b) under T-Chain.
Figure 4.4 The transfer times of individual file pieces for two specific leechers with the lowest and the highest upload rates.
Figure 4.5 The number of different pieces between each pair of neighbors in a real BitTorrent swarm over 7 days (a) and the effects of initial piece differences among leechers (b).
Figure 4.6 Average download completion time of compliant leechers (a) and free-riders (b) in BitTorrent, PropShare, and T-Chain in a flash crowd (without the large-view-exploit).
Figure 4.7 Average download completion time of compliant leechers (a) and free-riders (b) in BitTorrent, PropShare, and T-Chain in a flash crowd (with the large-view-exploit).
Figure 4.8 The effects of collusion in T-Chain under the flash crowd arrival model.
Figure 4.9 The average download completion time (a) and the average uplink utilization (b) of compliant leechers in BitTorrent, PropShare, and T-Chain under a continuous stream model.
Figure 4.11 The number of active chains as a function of time in a flash crowd model (a) and that in a continuous stream model without free-riders (b).
Figure 4.12 The length of each chain under a flash crowd model with 600 leechers without free-riders.
Figure 4.13 The cumulative number of the chains created by the seeder and by the opportunistic seeding in a flash crowd (a), and the ratio of opportunistic seeding in a real trace as the function of the fraction of free-riders in the system (b).
Figure 4.14 The ratio of pieces received through bilateral exchanges in a continuous stream model with different fractions of free-riders.
Chapter 1
Introduction
With electronic information exchange now comprising the majority of
Internet traffic, cooperative distributed systems (e.g., Peer-to-Peer file sharing [64], Content
Delivery Networks [2], and Grid Computing [13]) are increasingly popular as alternatives
to the traditional client-server model for many applications. For instance, a recent study
done by a German-based company on the usage of file sharing applications in several
countries shows that more than 50% of all Internet traffic is Peer-to-Peer (P2P) related, with
peaks of greater than 95% during the nighttime hours [28]. Distributed systems, while
being extremely popular (thanks to their simplicity, scalability, robustness, and flexibility),
raise a variety of technical issues. This is mainly because little consideration was given to
security (against malicious or selfish behavior) at their inception. In this dissertation, we
examine major threats to cooperative distributed systems, identify challenges in developing
solutions to those threats, and propose some defense mechanisms for resolving the threats
while overcoming the identified challenges.
1.1
Cooperative Distributed Systems
A cooperative distributed system is comprised of many autonomous participants (i.e., computers) which communicate through an overlay network (i.e., a virtual network of computers and logical links) built on top of an existing network. The use of a virtual
overlay network makes it possible to temporarily build a simple, flexible, and robust
infrastructure for a new service that is not currently available in the existing network. This
infrastructure is dynamically constructed in an ad-hoc manner without fixed, dedicated, or centralized infrastructure. Membership in such systems is also ad-hoc and dynamic in that
participants face no requirements for joining or departing the systems (i.e., open membership). Participants directly cooperate with one another only while they remain in the systems to achieve their mutual benefits (e.g., file downloads, information exchanges, or
solving scientific problems), by sharing resources such as computing power, storage space,
or network bandwidth. Unlike the client-server model, system capacity potentially scales
as the number of participants in a system increases, liberating the original service provider
from over-provisioning its resources to handle peak user demands [20, 19, 32, 68, 66].
A cooperative distributed system is a self-organizing system of equal, autonomous, individual participants which aims for the shared usage of distributed resources in a virtually
networked environment without the need for central coordination services. Such a concept
forms a fundamental design principle for distributed systems and reflects the paradigm shift
from coordination to cooperation, from centralization to decentralization, and from control
to incentives [70].
The significantly increased access capability to the Internet, the ample storage
space, and resourceful computing power of individual end users make the paradigm shift
promising. That is, the aggregated system resources of individual computers at the edge of
the networks are more powerful than any dedicated computing or storage service unit (e.g.,
supercomputer or storage server), but they are not always in use, so their capacity is wasted when idle. In a distributed system, each participant donates its resources when idle
(i.e., when the resources are not in use) and obtains other resources in return when busy
(i.e., when more resources are needed), which is mutually beneficial to the participants.
Therefore, if participants are given proper incentives, perceive the mutual benefits
(so that they voluntarily cooperate with each other), and manage the system
themselves without the need for any central entity or control, then
future Internet-based applications, even those requiring exorbitant resources, can be easily
realized without resource restrictions.
The information or services provided by a cooperative distributed system to its
files are not diminished by downloads (by participants) and the use of these files by any
one person does not restrict the use by anyone else. Since there is generally no dedicated
service provider, the fundamental premise of the success of cooperative distributed systems
is voluntary cooperation among participants to protect and support public goods.
1.2
Major Threats to Cooperative Distributed Systems
As the usage of cooperative distributed systems increases, the importance of
achieving a high level of information integrity of the public goods shared and behavioral
integrity of participants in the systems cannot be overemphasized. It is required in order to
protect the Internet and its users. When a system lacks such integrity, it often loses
the critical advantages of flexibility, robustness, and scalability that it would otherwise offer.
This level of integrity is, however, rarely attained under distributed environments without
a centralized authority governing the entire system. The major threats to these systems
come from the malicious and selfish misbehavior of participants.
One major threat is a deviation from the informational integrity of the public goods
shared in the systems by malicious participants. A high level of informational integrity
is indispensable for the success of a system, in that no one wants to join or remain in
the system for an unproductive service or useless data. Unfortunately, the public goods
shared are often error prone, causing a waste of network bandwidth, user time, and storage
space. For example, Liang et al. [49] showed that up to 80% of the copies of popular
files in the KaZaA network [4] are contaminated. This contamination is caused in part by
deliberate file pollution attacks [49, 18, 46]. Because pollution was not considered in the design of such systems (and they have no central, trusted authority), cooperative distributed
systems are highly vulnerable to such intentional pollution attacks [18]. Since pollution
squanders network resources and discourages user participation, the success of such systems
is questionable unless pollution is properly addressed. Therefore, there should be a mechanism to
protect the public goods shared in the systems.
Another major threat is the deviation from behavioral integrity of participants
in the systems toward the goods by selfish participants. It has been shown that
some participants desire to obtain benefits without making contributions (due to their
share any files (and were therefore free-riders). Handurukande et al. [33] reported more
than 80% of eDonkey users were free-riders in 2006, and Zghaibeh et al. [78] showed that
the volume of free-riders continues to increase in BitTorrent [7, 20] and approached 16.8%
in 2008. Free-riding occurs when the benefits of a good (i.e., consuming the good) can be
enjoyed without contributing to or paying for the production of that good. With free-riding,
some participants will contribute for the benefits of others but receive little or nothing in
return. The result can be a reduction in the willingness to contribute, with a resulting
underproduction (relative to the demand) or lack of production of the good. This problem
is known as “the tragedy of the commons,” in which each individual’s attempt to maximize his or
her benefits leads to a reduction of the overall public good [34]. Therefore, a new mechanism
to enforce cooperation among participants is required in order to prevent free-riding.
The best way of preventing such deviation from informational and behavioral
integrity in cooperative distributed systems is to introduce highly effective penalty or incentive
mechanisms to the systems. These mechanisms must be impossible for internal system
participants to bypass, if they are to be effective. Although researchers have put efforts into
developing such penalty or incentive mechanisms in order to improve the systems, this has
proven to be a daunting and difficult task which has not yet been truly successful, as
seen in Chapters 2, 3, and 4.
1.3
Challenges to Developing Solutions
As mentioned, participants in a cooperative distributed system directly and
voluntarily cooperate with one another to achieve their mutual benefits by sharing resources
without a universally trusted authority. Since there is no omniscient oracle in a distributed system, participants in the system must trust each other. When a deviation (e.g., pollution
or free-riding) occurs by malicious or selfish participants, however, the lack of a central
authority makes it difficult for the rest to detect or penalize the deviants (since no one is in charge), a difficulty aggravated by open membership. As such, the challenges are how
a participant can determine that it is currently under a selfish or malicious attack, how it can penalize the
attacker, and to whom (and how) it should report the attack so that the attacker is punished by
others. In addition, mutual agreements among participants should be obtained to effectively
deal with the deviants without any pre-trusted participant or authority. This is, however,
myriad of participants.
In this dissertation, we look for a way to achieve a high level of informational
integrity of the public goods shared and behavioral integrity of participants in cooperative
distributed systems. Since there is no centralized entity responsible for managing the entire
system in most distributed systems, solutions need to be implemented in a distributed
and cooperative manner. In this sense, our approaches depend on strong collaboration
among benign participants to detect and defeat the malicious and selfish behavior. These
approaches correspond to the intended original philosophy of the distributed approach (i.e.,
by the participants, for the participants).
1.4
Our Approaches
This dissertation attempts to make contributions by developing mechanisms
addressing the two major threats to cooperative distributed systems mentioned above, pollution and free-riding. While addressing the two specific threats, we have noted that cooperation among participants is the key to preventing such deviation from informational and behavioral integrity, which leads us to design a general incentive mechanism for cooperation.
Therefore, a general incentive mechanism is proposed and evaluated, which can be readily
and widely used in many cooperative distributed systems to foster or enforce cooperation among participants. The mechanisms are as follows:
Anti-Pollution Mechanism (Winnowing) : Pollution (i.e., sharing of corrupted files,
or contaminating index information with bogus index records) is a de facto problem in many file sharing systems in use today. Pollution degrades the informational integrity of
the systems, squanders network resources, and frustrates participants with unprofitable
downloads (due to corrupted files) and unproductive download trials (due to bogus index
records). In this work, we propose a novel Distributed Hash Table (DHT) based
anti-pollution scheme called winnowing. Winnowing aims to reduce or eliminate decoy index records (pointing to non-existing or corrupted files) held by DHT (i.e., index) nodes in
the system, so that download attempts based on the remaining (clean) index records are
more likely to yield satisfactory results. To achieve this goal, two techniques are used:
of content and metadata pollution. By integrating these techniques, winnowing converges
quickly to a near-optimal solution. Winnowing has the added benefit that it does not reveal
a participant’s download history to other participants.
The publish verification of winnowing has been implemented on top of the latest
eMule client [8], and extensive data have been collected from the Kad network [55, 17], the
largest DHT-based P2P system in existence, using this modified client. The measurement
results have shown that the use of simple (keyword) publish verification would sharply
reduce the fraction of bogus index records in the system (up to 35%), minimizing the impact
caused by index pollution. The findings from the measurement study are then incorporated
into our analytical model of the privacy-preserving object reputation. The model
demonstrates the robustness of the proposed object reputation to a variety of pollution attacks,
and to attacks against winnowing itself. The results of analysis are confirmed by means of
event-driven simulations, all of which are presented in Chapter 2.
Free-Riding Prevention Mechanism (Treat-Before-Trick) : As previously
mentioned, free-riders who use others’ resources without sharing their own cause system-wide
performance degradation. Free-riding can be thought of as a deviation from behavioral
integrity by selfish participants in the system. Various incentive schemes have been proposed
recently to encourage participants to cooperate by sharing their resources. They may be
classified as monetary-based [73], reciprocity-based [20, 72, 31, 14], and reputation (or
credit)-based [44], which are well summarized by Feldman et al. [30]. However, existing
techniques to counter free-riders are either complex (and thus not widely deployed) or easy
to bypass (and therefore not effective). In this work, we consider the free-riding problem
in a major P2P file-sharing system, BitTorrent. We propose a unique scheme to
penalize free-riding and reward compliant behavior, based on the use of (t, n) threshold secret
sharing [65]. Under the proposed scheme, files are divided and encrypted by the owner
(i.e., the seeder). The key used for encryption is divided into n subkeys, any t distinct ones of
which are sufficient to recover the key and to decrypt the file pieces. A participant must
upload (encrypted) file pieces to obtain the subkeys necessary to decrypt file pieces which
have been downloaded (i.e., subkeys are swapped for file pieces). Only the participants who
have collected all the encrypted file pieces and at least t subkeys can complete their file downloads. This scheme is called “Treat-Before-Trick” (TBeT).
pieces and necessary subkeys). TBeT counters known free-riding strategies, incentivizes
participants to donate more upload bandwidth, and consequently increases the overall system
capacity for compliant participants. No central authority is required under the proposed
approach. TBeT has been implemented as an extension to BitTorrent, and the results of
experimental evaluation are presented in Chapter 3.
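The (t, n) threshold property that TBeT relies on can be illustrated with a minimal Python sketch. It assumes Shamir's polynomial-based construction over a prime field, a 128-bit integer key, and the Mersenne prime 2^521 - 1 as the field modulus; the names split_secret and recover_secret and all parameters are illustrative choices rather than details taken from the dissertation. The sketch shows only key splitting and recovery, not how subkeys are exchanged for uploaded pieces.

```python
import secrets

# Field modulus: the Mersenne prime 2^521 - 1 (an assumed, illustrative choice
# that is comfortably larger than a 128-bit key).
PRIME = 2**521 - 1


def split_secret(secret: int, t: int, n: int) -> list[tuple[int, int]]:
    """Split `secret` into n subkeys such that any t distinct ones recover it."""
    if not 1 <= t <= n:
        raise ValueError("require 1 <= t <= n")
    if not 0 <= secret < PRIME:
        raise ValueError("secret must lie in the field")
    # Random polynomial of degree t - 1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(t - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation modulo PRIME
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret


if __name__ == "__main__":
    key = secrets.randbits(128)            # stands in for the file encryption key
    subkeys = split_secret(key, t=3, n=5)
    print(recover_secret(subkeys[:3]) == key)   # any 3 subkeys suffice
```

With fewer than t subkeys the interpolation yields an unrelated field element, which is why a participant that has not earned enough subkeys cannot decrypt the pieces it has downloaded.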
A General Incentive Mechanism (Triangle-Chaining) : In this work, we propose a
simple, completely distributed, but highly efficient fairness-enforcing incentive mechanism
for cooperative distributed systems, relying on a low cost symmetric key scheme. The
proposed scheme leverages symmetric and asymmetric interests created at the time of communication to obtain information or services. A participant sends encrypted information to
another participant with an implicit request that the receiver should reciprocate by performing
virtually the same amount of work for the participant designated by the sender. The decryption
key is given if and only if the request is fulfilled by the receiver, meaning that reciprocation is not optional but mandatory under the proposed approach. Upon fulfilling the request,
the receiver becomes another sender of encrypted information in the next transaction, and
this process is continuously applied link by link, in a chain-like manner. No centralized
monitoring or control is required, and the overhead of encryption and decryption is evenly
distributed among participants. The mechanism is called Triangle Chaining, or T-Chain
for short.
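The chained exchange described above can be sketched as a small Python simulation. Everything in it is a hypothetical illustration: the Peer class, the reciprocate function, and the piece names are invented for this sketch, the cipher is a toy XOR keystream built from SHA-256 (a real deployment would use a standard symmetric cipher), and the way the sender learns that the request was fulfilled is modelled by a direct check on the designated participant, which is an assumption and not the protocol's actual message exchange.

```python
import hashlib
import secrets


def toy_cipher(key: bytes, data: bytes) -> bytes:
    """XOR data with a SHA-256 counter keystream (toy cipher, illustration only).

    Applying it twice with the same key restores the original data.
    """
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream.extend(hashlib.sha256(key + counter.to_bytes(8, "big")).digest())
        counter += 1
    return bytes(d ^ s for d, s in zip(data, stream))


class Peer:
    def __init__(self, name: str, cooperative: bool = True):
        self.name = name
        self.cooperative = cooperative
        self.locked = {}   # piece id -> ciphertext still waiting for its key
        self.pieces = {}   # piece id -> decrypted piece
        self.escrow = {}   # (receiver name, piece id) -> key being withheld

    def send_encrypted(self, receiver: "Peer", piece_id: str, data: bytes) -> None:
        key = secrets.token_bytes(16)
        self.escrow[(receiver.name, piece_id)] = key
        receiver.locked[piece_id] = toy_cipher(key, data)

    def release_key(self, receiver: "Peer", piece_id: str) -> None:
        key = self.escrow.pop((receiver.name, piece_id))
        receiver.pieces[piece_id] = toy_cipher(key, receiver.locked.pop(piece_id))


def reciprocate(sender: Peer, receiver: Peer, payee: Peer,
                owed_id: str, new_id: str, new_data: bytes) -> bool:
    """Receiver unlocks `owed_id` only by sending a new encrypted piece to the
    payee designated by the sender; that upload starts the next link."""
    if not receiver.cooperative:
        return False                        # no reciprocation, key is withheld
    receiver.send_encrypted(payee, new_id, new_data)
    if new_id in payee.locked:              # assumed stand-in for a receipt report
        sender.release_key(receiver, owed_id)
        return True
    return False


def run_demo():
    seeder, a, b = Peer("seeder"), Peer("A"), Peer("B")
    rider = Peer("F", cooperative=False)    # a free-rider
    seeder.send_encrypted(a, "p1", b"piece-one")             # chain initiation
    results = [
        reciprocate(seeder, a, b, "p1", "p2", b"piece-two"),
        reciprocate(a, b, rider, "p2", "p3", b"piece-three"),
        reciprocate(b, rider, a, "p3", "p4", b"piece-four"),
    ]
    return a, b, rider, results
```

In this sketch the chain ends at the free-rider, which is left holding a ciphertext it cannot decrypt, while each compliant peer obtains its key only after uploading to the participant designated by its sender.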
T-Chain enforces strong fairness among participants, limiting the system resources allotted to uncooperative participants (e.g., free-riders), while ensuring higher resource
availability for compliant participants. Our simulation results demonstrate the effectiveness of
the proposed mechanism against free-riding: under realistic conditions, free-riders never
complete their downloads, and unrealistically strong collusion is required for free-riders to
complete their downloads without contribution. Moreover, it is general enough to be applied
to any distributed system requiring interactions among potentially untrusted participants.
Since T-Chain can easily leverage asymmetric interests (in addition to symmetric ones), it
is readily applicable for many diverse Internet applications whose primary direction of
interactions is asymmetric, such as live and VOD streaming. To evaluate the effectiveness of our
approach, we applied T-Chain to a BitTorrent-like, swarm-based, content-centric network,
1.5
Organization of the Dissertation
The rest of this dissertation is organized as follows. Chapter 2 discusses the
anti-pollution mechanism and Chapter 3 presents the free-riding prevention mechanism.
Chapter 4 details the proposed general incentive mechanism and evaluates how the mechanism
could be applied to address the deviation from behavioral integrity (i.e., free-riding) on top
of BitTorrent. Finally, Chapter 5 concludes this dissertation and points out some future
Chapter 2
Anti-Pollution Mechanism
In this chapter, we investigate an example of malicious behavior in cooperative
distributed systems, pollution, which is a deviation (by malicious participants) from cooperation in the protection of public goods (i.e., files) shared in these systems.
Peer-to-Peer (P2P) systems, a representative family of cooperative distributed
systems, have emerged rapidly as a popular way to exchange information over the Internet.
Their advantages include simplicity, robustness, flexibility, and scalable performance.
Unfortunately, the information shared by current P2P systems has turned out to be error-prone,
causing waste of network bandwidth, user time, and storage space. For example, Liang et
al. [49] showed that up to 80% of the copies of popular files in the KaZaA network [4] are
contaminated. This contamination is caused in part by deliberate file pollution [49, 18, 46].
Because pollution was not anticipated in the design of P2P systems, and they have no
central, trusted authority, P2P applications are highly vulnerable to such intentional
attacks [18]. Since pollution squanders network resources and discourages user participation,
the success of P2P systems is questionable unless pollution is properly addressed.
Much research has been done regarding the prevention of the pollution problem
in P2P systems. Several modeling techniques [26, 43, 46] have been introduced, which are
useful to understand the proliferation of pollution. Some previous work has given general
ideas on how pollution could be reduced [49, 12, 18, 24]. Only peer reputation [40, 22],
object reputation [74, 79], and hybrid reputation [21, 23] approaches, however, have been
proposed as practical solutions thus far in the literature. Reputation-based approaches aim to
resolve the pollution problem through cooperation among users sharing their past
pollution. Reasons include the complexity of implementation (e.g., use of complex security
mechanisms), the slow convergence to reach optimal performance (e.g., individual users are
responsible for collecting and evaluating opinions of others), and the cold start problem for
new users (e.g., other users’ opinions are barely useful until a newcomer has accumulated
a sufficient number of its own download experiences). Most of these weaknesses of existing
reputation systems are caused by the separation of the reputation and file sharing services.
That is, the building and acquisition of the reputation for a peer or an object is detached
from publish or look-up processes. In addition, current reputation systems disregard the
privacy concerns of participants, assuming that users willingly share their file download experiences or history (in part or as a whole) with other peers.
This chapter addresses the pollution problem in DHT-based P2P systems. In
DHT-based P2P systems, users store (or retrieve) the information on the files shared in the
system in (or from) DHT as index records (i.e., the description of the shared files or their location). No single user can download wanted files unless the matching index record is first
referenced. If the index records shared in the DHT are clean (i.e., pointing to only
non-polluted files), no system resources will be wasted on unwanted or polluted file downloads.
The principle driving our proposal is that benign users should cooperate in two ways with
one another in cleaning the index records to fight pollution. First, index nodes should
detect false (keyword) publish messages by confirming the contents of the publish message
upon receipt; no information published is accepted at face value. Second, privacy-preserving
object reputation is integrated into DHT as a part of the publish and look-up processes.
That is, index nodes collect feedback from past downloaders about the validity of those
files. This feedback from users is aggregated and inserted into the matching index records
in a novel way to indicate the authenticity of file contents. The results are then provided
to prospective downloaders to help with their decision about which files to download. The
results do not reveal information about specific users or their download history. Careful
consideration is given to reducing the impact of false feedback and malicious index nodes.
We call this scheme “winnowing”. The contributions of this work are:
• Object reputation is integrated into the DHT as a part of the file publish and look-up processes. Information is shared only between users who are interested in the same file contents; this reduces the complexity of reputation implementation, and accelerates
the convergence, as will be shown. Since newcomers can directly refer to the built-in
reputation information included in matching index records, no “cold start” problem
occurs under winnowing.
• Winnowing is privacy-preserving. The feedback from past downloaders is collected
and processed by the DHT (i.e., index nodes). Only the aggregated feedback results
are given to prospective downloaders, in a condensed form (i.e., voting credits). This
does not reveal the details of voters, downloading, or voting to other peers.
• Winnowing is easy to implement, completely distributed, and scalable. It does not
require any trusted third parties or centralized servers, which are difficult to achieve
in a large, self-organizing P2P system. No cryptographic signature is required for user
feedback. Random numbers (cryptographic nonces) are used for purposes of matching
votes to downloads.
We have implemented keyword publish verification on top of the latest version of the eMule
client [8]. Using this modified client, the results of keyword publish verification have been
collected from the Kad network [55, 17], the largest DHT-based P2P system in existence.
These results are summarized in Section 2.4.1, and are incorporated into an analytical
model. This model is used to evaluate the performance of the privacy-preserving object
reputation mechanism. The model demonstrates the effectiveness of integrated reputation:
fast convergence to near-optimal performance and robustness to various pollution attacks.
All modeling results are then confirmed by means of event-driven simulations.
The remainder of this chapter is as follows: In Section 2.1, some P2P terminology
and a brief overview of the Kad network are introduced. Existing pollution strategies are
discussed in the latter part of the section. Section 2.2 investigates file sharing in the wild,
providing a deep understanding of existing P2P systems. Section 2.3 presents the protocol
description of winnowing with security considerations. The results of the measurement
of keyword publish verification and analysis of privacy-preserving object reputation are
presented in Section 2.4. Section 2.5 analyzes the overhead for implementing winnowing
and discusses limitations of the proposed method followed by possible solutions. Section 2.6
briefly describes and analyzes existing methods of anti-pollution approaches and, finally,
2.1 Background
This work investigates pollution problems in DHT-based file sharing P2P networks.
A specific file being shared, whether a movie or song or something else, is referred to as
a title. There may be different versions of a title, and for one version of a title, multiple
copies could exist if the version is downloaded and shared by many users in the network. There can also be many decoys, which appear to be versions of the title, but which do not in fact exist in the network, or which contain corrupted/lower-quality content. Each file has
metadata, which is structured information that describes the file. Metadata includes the file name, the file size, the file type or format, etc. [49, 17]. A keyword is a single token extracted from the metadata of a file, usually from the file name. In the Kad system, a keyword must
be composed of at least three characters [17]. Each file might have many keywords. A user
is an operator of a P2P client application, whereas a peer (or interchangeably a node) is the client application itself. A user is called a content owner (alternately, a publisher) when the user contributes a file to the P2P system. A user downloading a file from the system is
referred to as a downloader.
2.1.1 Publishing and Retrieving
The Kad network is a widely deployed DHT-based P2P system implemented by
several file sharing applications, such as eMule [27] or aMule [11]. It is based on one of
the well-known DHTs, Kademlia [55]. The Kad network may have more than one million
concurrent peers at any given time [71]. A key is an identifier used to store and retrieve information in and from the DHT, and has the same bit size as all Kad IDs. There are two
types of keys, a content key and a keyword key. A content key is obtained by hashing the entire content of a file, whereas a keyword key is computed by hashing each keyword.
Publishing is used to index information about a file. This information is stored by the index node in charge of that portion of the Kad ID space. A peer who wants to
publish a file (content owner) first hashes the whole content of the file to get the content
key. The content owner locates the node that has the closest Kad ID to the content key.
For this purpose, the content owner uses iterative routing [17, 69]. Once determined, the
closest node to the content key becomes a content key owner of the file. The content
key owner updates its local index table with a content index record <content key, location information>. The location information includes the basic information on the content owner
such as the Kad ID, the IP address, TCP port, etc. Furthermore, the content owner hashes
each keyword of the file into a keyword key. For each of the keyword keys, the content
owner locates the closest node (keyword key owner) to the keyword key, in the same way as before, and publishes the index information of the file. Each keyword key owner updates
its index table with a keyword index record <keyword key, content key, publisher IP list, possible file names, metadata> of the file. For a given key, up to 10 closest nodes can
be selected as key owners (referred to as a tolerance zone), to deal with high churn rates. Content publishing followed by keyword publishing is called a 2-level publishing scheme [17]. In the Kad system, content (keyword) keys are periodically republished every 5 (24) hours.
Figure 2.1: A high level overview of the publish and retrieve mechanism. (a) publish; (b) retrieve. Records shown: <kk1, ck, metadata>, <kk2, ck, metadata>, and <ck, publisher info.>, with kk1 = HASH(“dragon”).
Retrieving file information from the DHT is the reverse of publishing. Users obtain a content key through a keyword search, and use this content key to retrieve location
information via a content search. The user then attempts to download the file from a
content owner via the location information that is returned.
Figure 2.1 depicts a high level overview of the publish and retrieve mechanism in
a DHT-based P2P system. In this example, we assume that the publisher (or the
downloader) wants to publish (or retrieve) a movie file, “Dragon War.mpg”. To publish the file
(Figure 2.1(a)), first, the publisher hashes the entire content to get the content key (ck)
of the file and sends content publish messages to the content key owner(s). Second, the
publisher hashes each keyword (i.e., “dragon” and “war”) to get a keyword key (kk) and
file (Figure 2.1(b)), first, the downloader hashes a possible keyword (e.g., “dragon” in this
example) to get the keyword key. Then, it sends keyword search messages to the keyword
key owner(s). Second, the downloader selects one content key among all possible content
keys returned by the keyword key owner(s) based on the metadata of each index record and
sends content search messages to the content key owner(s). Finally, the downloader
downloads the file from the content owner(s) based on the publisher (i.e., location) information
returned by the content key owner(s). The downloaded file is immediately republished and
shared until the downloader deletes or moves it.
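The 2-level publish and retrieve steps described above can be illustrated with a minimal Python sketch. It is only a toy model under stated assumptions: a single in-memory `ToyIndex` object (a hypothetical name) stands in for all index nodes, truncated SHA-256 stands in for the hash function used by Kad, and the keyword tokenizer is simplified to splitting the file name on spaces.

```python
import hashlib
from collections import defaultdict


def toy_key(data):
    # Assumption: a 128-bit key taken from SHA-256, as a stand-in for Kad's hash.
    return hashlib.sha256(data).hexdigest()[:32]


def keywords_of(file_name):
    # Simplified tokenizer: drop the extension, keep tokens of >= 3 characters.
    stem = file_name.rsplit(".", 1)[0]
    return [word.lower() for word in stem.split() if len(word) >= 3]


class ToyIndex:
    """One object playing the role of every content and keyword key owner."""

    def __init__(self):
        self.content_records = defaultdict(list)  # content key -> location info
        self.keyword_records = defaultdict(dict)  # keyword key -> {content key: metadata}

    def publish(self, file_name, content, location):
        ck = toy_key(content)
        self.content_records[ck].append(location)  # level 1: content publish
        for kw in keywords_of(file_name):           # level 2: keyword publish
            kk = toy_key(kw.encode())
            self.keyword_records[kk][ck] = {"name": file_name}
        return ck

    def retrieve(self, keyword):
        kk = toy_key(keyword.lower().encode())
        candidates = self.keyword_records.get(kk, {})  # keyword search
        # Content search: map each candidate content key to its publishers.
        return {ck: self.content_records.get(ck, []) for ck in candidates}
```

In this sketch, publishing “Dragon War.mpg” stores one content index record and two keyword index records (for “dragon” and “war”), and a search for either keyword returns the same content key with its location information.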
2.1.2 Existing pollution attacks
Depending on the strategies adopted by polluters, pollution can be classified into
the categories of content pollution, metadata pollution, and index pollution.
Content pollution [49, 24, 46] changes all or part of the content of a target file by
degrading its quality. This can be accomplished by the addition of white noise, by shuffling
or omitting some of its contents, or by substituting new information. Polluters can easily
create polluted files which have the same content hash value (i.e., the same content key) as
a non-polluted file. This is possible by exploiting the weakness of the current hash function
of the Kad network, MD4 [56].
Metadata pollution [49, 18] tampers with the metadata of a file, rather than
the contents themselves. Users downloading files based on that metadata will be misled
into downloading content differing from what they expect.
Index pollution [51] directly attacks the index structure of the network, instead
of the content or metadata of the target files. Index pollution can be accomplished by
inserting a massive number of false records into the index table of either a content key owner, or a keyword key owner of the target file (i.e., title). With index pollution, users fail
to locate the target file when relying upon the information in a false index record. Since
index nodes in most file-sharing based P2P systems today do not authenticate or verify
publish messages, polluters may easily contaminate the index records [51].
2.2 Sharing in the wild
For better understanding of existing file sharing P2P systems, some relevant
patterns, version popularity, and user characteristics (the fraction of NATed users, IP
distribution, etc.). These results are used in explaining and validating the design of winnowing
in the following two sections.
In these experiments, distribution and access to five mp3 songs were investigated.
The first 4 famous songs (T1 ∼ T4) were selected from the Top 10 Songs [9] in June of
2008, and were very popular. The last song (T5) was selected, for comparison, from the
late 1970’s billboard charts. This song hit #1 at that time, but is considerably less popular
today than the other 4 songs.
To collect results, we modified a crawling client (0.49a MorphXT
version 11.0 [8]) and inserted it into the Kad network. First, for the keyword key owners, one keyword per
title (KTi) was extracted from the file name and hashed into the 128-bit keyword key.
Then, the client ID of each crawler (i.e., the matching keyword key owner) was configured
to have the 128-bit value of the hashed keyword. By so doing, each crawler could receive
keyword publish messages for the mapped keyword. Second, for each content key owner, the
most popular content key (i.e., version) among all content keys of each title was selected.
The most popular content key was then configured as the crawler’s client ID, so that content
key publish messages for that content would be received.
Data were collected for 48 hours for keyword publish messages, and for 10 hours
for content publish messages; this spans twice the normal republish cycle.
Figure 2.2: File sharing statistics in the wild (Kad network). (a) total # of keyword publish messages per hour over time, for KT1 to KT5; (b) # of unique content keys published per hour over time, for KT1 to KT5; (c) version popularity: PDF of copies against content key number ordered by popularity (log-log scale), for T1 to T5.
Figure 2.2(a) shows the total number of keyword publish messages received by
each keyword key owner. It demonstrates the time of day effect on the total number of
order of magnitude, based on the popularity of keywords. Note that the target songs are
selected from titles famous in the USA and the keyword publish messages are collected
from the winnowing clients in the EST time zone. So, this result gives a hint of the diurnal
behavior of Kad users.
Even at the high rate of keyword publishes, we assumed that the number of unique
content key publishes is quite low since there must be many duplicate keyword publishes for
the same content keys. To investigate this, only the keyword publishes of unique content
keys are counted, and Figure 2.2(b) presents the results. As indicated, the number of
keyword publishes for unique content keys is less than 1/5 of total keyword publishes in most
cases and the ratio keeps decreasing as time goes on. This suggests that the proliferation
of existing contents (rather than newly introduced contents) dominates file sharing in the
wild.
Next, the version popularity of each target song is investigated. In the Kad
network, the number of keyword publish messages for a content key represents the popularity
of the version (i.e., content key) because there must be a keyword publish message whenever
a version is downloaded. To check the popularity, content keys are ordered based on the
number of keyword publish messages received during the measurement period. Figure 2.2(c)
demonstrates the PDF on a log-log scale for the number of copies with respect to the top
100 content keys of each title. The near linearity of the curves confirms that the version
popularity of a title in the Kad network follows a Zipf distribution, which strongly indicates
that users select a version (i.e., content key) based on its popularity if there is no other
selection standard. The results agree with the results of previous studies [49, 22] on other
P2P systems.
Figure 2.3: (a) fraction of NATed users per title, T1 to T5 (Kad); (b) frequency of the number of users per IP/24 range, T1 to T5 (Kad); (c) frequency of the number of users per IP/24 range (BitTorrent).
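The "near linearity on a log-log scale" argument for a Zipf distribution can be checked numerically by fitting a straight line to log(frequency) against log(rank). The following Python sketch does this with an ordinary least-squares fit. The input is synthetic (counts proportional to 1/rank), not the measured Kad data, and the function name `loglog_slope` is hypothetical.

```python
import math


def loglog_slope(counts):
    """Least-squares slope of log(relative frequency) against log(rank)."""
    ranked = sorted(counts, reverse=True)
    total = sum(ranked)
    xs = [math.log(rank) for rank in range(1, len(ranked) + 1)]
    ys = [math.log(count / total) for count in ranked]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    num = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    den = sum((x - mean_x) ** 2 for x in xs)
    return num / den


# Synthetic, idealized input: the top 100 versions with counts proportional to 1/rank.
synthetic_counts = [1000.0 / rank for rank in range(1, 101)]
slope = loglog_slope(synthetic_counts)  # close to -1 for this idealized input
```

For measured counts, a slope that stays roughly constant across ranks (i.e., points lying near one line) is what the Zipf interpretation relies on; all counts must be positive for the logarithms to be defined.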
Realistic user statistics are valuable in designing an effective reputation
mechanism. First, the fraction of NATed users was investigated; this helps us understand the
distribution of downloaders of a file. In the Kad network, since a NATed publisher includes
its buddy³ information in the content publish message, content key owners can easily figure
out whether the publisher is a NATed one or not. Figure 2.3(a) shows the fraction of NATed
users for the most popular content key of each title, ranging from 38.55% to 46.06%. We
believe that the actual ratio of NATed users is somewhat higher than indicated in that the
titles selected are popular in ARIN (North America) area where most of the public IPv4
addresses are allocated.
Next, the number of users per small IP prefix range (i.e., IP/24) who have
downloaded the same most popular version (content key) of each title was investigated. This is
useful for purposes of identifying colluding malicious nodes. Figure 2.3(b) demonstrates the
number of users in each IP/24 range for the most popular content key of each title. The
average number of users who have downloaded the same version in an IP/24 range was only
1.1. Note that the data have been collected for 10 hours, which is twice the length of the
content republish cycle; therefore, the actual number in a single cycle is lower than that.
A similar investigation was done for the RedHat 9 Torrent tracker trace [5], which represents 5
months of activity in a BitTorrent swarm. As shown in Figure 2.3(c), the same trend was
found in this case.
These two results show that the number of users in a small IP range downloading
the same content (i.e., the same version) is relatively low, even considering the high volume
of NATed users in the wild. This will motivate the use of IP prefix-based binning for user
feedback collection and aggregation (Section 2.3.2).
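As an illustration of the per-prefix measurement, the sketch below groups publisher addresses into IP/24 ranges and computes the average number of users per range. It assumes that each distinct IPv4 address counts as one user, which is a simplification (several NATed users may share one address), and the function names are hypothetical.

```python
import ipaddress
from collections import Counter


def users_per_prefix(ip_addresses, prefix_len=24):
    """Count distinct IPv4 addresses falling into each prefix of the given length."""
    bins = Counter()
    for ip in set(ip_addresses):
        # strict=False masks the host bits, e.g. 192.0.2.7/24 -> 192.0.2.0/24.
        net = ipaddress.ip_network(f"{ip}/{prefix_len}", strict=False)
        bins[str(net)] += 1
    return bins


def average_users_per_prefix(ip_addresses, prefix_len=24):
    bins = users_per_prefix(ip_addresses, prefix_len)
    return sum(bins.values()) / len(bins) if bins else 0.0


# Made-up sample addresses (documentation ranges), not measured data.
sample = ["192.0.2.7", "192.0.2.9", "198.51.100.4", "203.0.113.20", "203.0.113.20"]
bins = users_per_prefix(sample)
average = average_users_per_prefix(sample)
```

An average close to 1, as reported above for the Kad measurement, means that most IP/24 ranges contain a single downloader of a given version.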
2.3 Index Filtering (winnowing)
In this section we present the basic protocol description of winnowing. Winnowing aims to reduce decoy (i.e., bogus or corrupted) keyword index records in P2P systems. To achieve this goal, as a preliminary phase, keyword key owners verify whether the content
key in each keyword publish message is valid for all keyword publish messages received, and disregard invalid keyword publish messages. Then privacy-preserving reputation is
integrated into the DHT. That is, for files which are published and downloaded, keyword key
owners collect and aggregate feedback from the users who have downloaded files via their
keyword index records. The results are then stored in the mapping keyword index records
in condensed form. This helps prospective downloaders to locate clean (unpolluted) files.
No information on the voters or their downloading history is stored in the keyword index
records or revealed to other downloaders; this preserves the privacy of downloaders.
³ A buddy can receive incoming messages for a firewalled or NATed peer, and will manage its communications.
In the discussion below, we initially assume that all index nodes are trustworthy.
This assumption is then relaxed in Section 2.3.3. Careful consideration is given to reducing
the impact of false feedback and malicious index nodes.
Figure 2.4: Message processing in eMule with winnowing. (a) keyword publish verification (preliminary phase), with messages CONTENT_PUBLISH_REQ/RES, KEYWORD_PUBLISH_REQ/RES, and CONTENT_KEY_VERI_REQ/RES; (b) privacy-preserving object reputation, with messages KEYWORD_SEARCH_REQ/RES, CONTENT_SEARCH_REQ/RES, CONTENT_DOWNLOAD, AFR and RES, and MJR and RES. Participants in both panels: downloader, keyword key owner, publisher, content key owner.
2.3.1 Keyword Publish Verification
To reduce index pollution (Section 2.1.2), the keyword key owner must verify the
contents of each keyword publish message it receives. The principle here is that no
information published is accepted at face value; it is confirmed by the keyword key owners before
acceptance. Otherwise, a polluter could generate a random content key which matches
no file in the network, and publish the information <keyword key, random content key, metadata> to the matching keyword key owners. When a user attempts to download a
and effort. If the user cannot locate a file after several attempts, it is likely he or she will
abandon the search and leave the system.
As seen in Figure 2.4(a), verification of the keyword publish message is
accomplished by issuing content search messages, using as a target the content key in the keyword
publish message. A successful reply indicates the content key is valid. Conversely, if there
is no reply to the content search using the given content key within a reasonable period
of time, the content key can be considered to be bogus. Note that search queries succeed
99.9% of the time in the current eMule [71], which means that this assumption of success
when searching for a valid file is realistic.
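The verification step can be sketched in Python as follows. This is not the modified eMule client: `content_search` is a hypothetical callable standing in for the CONTENT_KEY_VERI_REQ/RES exchange, the message is modeled as a plain dictionary, and the 5-second timeout is an assumed value for the "reasonable period of time".

```python
def verify_keyword_publish(publish_msg, content_search, timeout_s=5.0):
    """Accept a keyword publish only if its content key resolves to a publisher.

    content_search(content_key, timeout_s) is assumed to return a list of
    location records, or an empty list when no reply arrives in time.
    """
    locations = content_search(publish_msg["content_key"], timeout_s)
    return len(locations) > 0


class KeywordKeyOwner:
    def __init__(self, content_search):
        self.content_search = content_search
        self.index = {}  # (keyword key, content key) -> metadata

    def on_keyword_publish(self, msg):
        if not verify_keyword_publish(msg, self.content_search):
            return False  # content key considered bogus; message disregarded
        self.index[(msg["keyword_key"], msg["content_key"])] = msg["metadata"]
        return True
```

With a lookup that knows only one real content key, a publish message carrying a random content key is rejected and never reaches the index table.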
Content key verification may be tricked if the attacker also publishes fake location
(i.e., publisher) information for the random content keys. However, this attack requires
substantial effort and resources from the attacker in that the bogus random content keys
must be re-published every 5 hours (Section 2.1.1). In addition, index pollution is effective
in this case only when the majority of keyword index records are bogus [51]. Moreover, the bogus keyword index records will be further subjected to examination (and rejection) by
means of the privacy-preserving object reputation mechanism.
2.3.2 Privacy-preserving Object Reputation
The publish verification explained above does not prevent content and metadata
pollution (Section 2.1.2). This is because index nodes themselves are unable to judge the
quality or validity of file contents. Therefore, polluters could still store corrupted index information in the P2P system through content and metadata pollution. In addition, in
some cases, bogus index records inserted by smart polluters circumventing keyword publish
verification may exist.
To deal with these problems, privacy-preserving object reputation is proposed.
The collection and dissemination of reputation information is integrated in DHT as a part
of the file publish and look-up processes. In this approach, keyword key owners collect
feedback from users (who have downloaded files via their index records) about the contents
of files they have downloaded. The feedback from users is aggregated and combined into the
matching keyword index records, in a way that indicates the authenticity or validity of the file contents. The aggregated feedback results are then provided to prospective downloaders
Feedback Collection
To collect feedback, each keyword key owner must maintain two kinds of lists:
keyword reference lists (KRLs) and content key voter lists (CKVLs). One KRL is allocated to each keyword key managed by the matching keyword key owner, and one CKVL is
assigned to each content key. When a keyword key owner receives a keyword search query
from a peer P, it generates a random number R⁵ and inserts <IP address, R> into the
matching KRL for this keyword key. The reply to P also contains the value of R.
To help eliminate polluted content, some fraction of users will report their file
download experiences to the appropriate keyword key owners. The user feedback report
comprises the automated failure report (AFR) and the manual judgement report (MJR). As seen in Figure 2.4(b), an automated failure report message is generated by the peer for every
failed download trial. This happens without requiring any user action. The report is sent
directly to the mapping keyword key owner if no location information is returned for the
content key attempted. This failure could be caused by either a bogus keyword index record
(not filtered out by the keyword publish verification), or by a stale keyword index record.
If a download attempt is successful, and upon viewing or listening to the file’s contents,
the user may send a manual judgement report by simply marking it as either “clean” or
“polluted” at his or her own discretion. This report is sent directly to the keyword key
owner. Both the AFR and the MJR contain the value of R sent to the peer by the keyword
key owner in response to a previous keyword search query. Casting a vote is mutually
beneficial for compliant users: when a user receives polluted content, the vote from that
user reduces the probability that other compliant users select the same defective content,
and their feedback in turn helps the user to select good content with higher probability.
If a peer P provides a keyword key owner with feedback (AFR or MJR) about a
download experience, the keyword key owner checks the KRL to see whether the peer has
ever issued a keyword search query before. This can be done based on the peer’s IP address,
and the random number R presented as evidence of the peer’s keyword search. If a match
in the KRL is not found, the feedback is ignored. Otherwise, the keyword key owner checks
the CKVL to see whether the peer has previously voiced its opinion about this content key,
based on the IP address of the voter and R. If so, the feedback is also ignored. Otherwise,
the feedback from this peer is aggregated with feedback from other peers as explained in
the next section, and the keyword key owner inserts <IP address, R> into the CKVL.
This prevents the same peer from casting a vote about the same content multiple times as
a result of a single search.
⁵ A cryptographic nonce used to prove that a voter has actually downloaded a file via one of the keyword index records of this keyword key owner. It also prevents attackers from using spoofed IP addresses outside
Attackers can cast multiple votes about content, in an attempt to manipulate
the feedback mechanism of winnowing. In order to do that, the attackers may perform
different search queries (which yield the same content key) in order to get many voting
opportunities (i.e., different values of R). The number of such multiple malicious votes
(from the same user) is, however, easily regulated by limiting the search frequency, which
is already done in the existing eMule client [76]. More importantly, the effects are further
limited by the feedback aggregation mechanism of winnowing explained in the following.
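The KRL and CKVL checks described in this subsection can be summarized in a short Python sketch. It is a simplified model under stated assumptions: the class name `FeedbackCollector` is hypothetical, lists are kept as in-memory sets, the nonce length is arbitrary, and search-frequency limiting is left out.

```python
import secrets


class FeedbackCollector:
    """Toy model of one keyword key owner's KRL and CKVL bookkeeping."""

    def __init__(self):
        self.krl = {}   # keyword key -> set of (ip, R) issued on keyword search
        self.ckvl = {}  # content key -> set of (ip, R) that have already voted

    def on_keyword_search(self, keyword_key, ip):
        r = secrets.token_hex(8)  # nonce R; the 64-bit length is an assumption
        self.krl.setdefault(keyword_key, set()).add((ip, r))
        return r  # sent back to the peer together with the search reply

    def on_feedback(self, keyword_key, content_key, ip, r):
        """Return True if the report (AFR or MJR) is accepted for aggregation."""
        if (ip, r) not in self.krl.get(keyword_key, set()):
            return False  # no matching keyword search: feedback ignored
        voters = self.ckvl.setdefault(content_key, set())
        if (ip, r) in voters:
            return False  # this peer already voted on this content with this R
        voters.add((ip, r))
        return True
```

A first report that presents the issued R from the same IP address is accepted; a repeated report, a report from a different IP address, or a report with a forged R is ignored.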
Feedback Aggregation
In winnowing, each keyword index record has the form <keyword key, content key,
voting credits, ..., metadata>. The voting credits represent the validity of the content key (i.e., version), and reflect the votes (“clean” or “polluted”) previously cast by downloaders
of that content. Each accepted vote is reflected in the matching keyword index record by
increasing (for a positive vote) or decreasing (for a negative vote) the voting credits for a
content key. The voting credits do not reveal who cast votes, nor what individual votes
were cast. This preserves the privacy of the voters participating in the file sharing system.
The way in which voting credits are increased or decreased determines the
effectiveness of this object reputation scheme. The approach advocated here is
Additive-Increase-Multiplicative-Decrease (AIMD). With the AIMD approach, each keyword key owner adds
1 to the voting credits if a positive vote is received, but reduces by half (divides by 2) the
voting credits if a negative vote is received. Winnowing adopts the AIMD approach as
the method of updating voting credits in order to disseminate information about polluted
content quickly. Of course, an attacker could potentially abuse this policy to mislead peers
into believing clean content is actually polluted due to the multiplicative decrease (i.e.,
order-dependent) nature of AIMD. However, the IP prefix binning strategy in the following and the imbalanced feedback mechanism (Section 2.3.3) minimize the impact, which will be examined again in Section 2.4.
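The AIMD update and its order dependence can be made concrete with a few lines of Python. The sketch assumes unweighted votes and an initial credit of 0, neither of which is fixed by the text above.

```python
def apply_vote(credits, positive):
    """Add 1 for a positive vote; halve the credits for a negative vote."""
    return credits + 1 if positive else credits / 2


def tally(votes, initial=0.0):
    credits = initial  # assumed starting value
    for positive in votes:
        credits = apply_vote(credits, positive)
    return credits


# The same three positive votes and one negative vote, in two different orders.
late_negative = tally([True, True, True, False])   # (0 + 1 + 1 + 1) / 2 = 1.5
early_negative = tally([False, True, True, True])  # 0 / 2 + 1 + 1 + 1 = 3.0
```

A negative vote that arrives after the credits have accumulated removes more credit than one that arrives early, which is why a single well-timed negative vote spreads information about polluted content quickly, and also why the policy could be abused.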
designed. That is, malicious nodes will lie (i.e., vote positively for decoy keyword index
records and vote negatively for clean ones). This is referred to here as reverse voting. Such users may also attempt to amplify the impact by forging false identities (i.e., resorting to
the Sybil attack [25]). Winnowing addresses this problem by means of the IP prefix based binning strategy with weighted voting. With this mechanism, one IP address prefix (e.g., IP/24) is mapped into a bin, and the weight of a vote in the same bin decreases as the number of votes in the bin increases. This prevents multiple votes in the same IP range
from counting as much as multiple votes from different IP ranges. The motivation for this mechanism is that it has been found in practice that the majority of users in P2P systems
are benign [50], and the number of users who have downloaded the same file (i.e., version)
in a small IP prefix range is low, as indicated by our measurement results (Section 2.2). For
this purpose, the frequency of the same IP prefix of voters is remembered in the mapping
CKVL to adjust the weight of each vote.
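The effect of IP prefix based binning can be illustrated with the following Python sketch. The weighting rule is an assumption made only for illustration (the k-th vote from a bin counts 1/k); the text above states only that the weight decreases as the number of votes in the bin grows. The class name `PrefixBinner` is hypothetical.

```python
import ipaddress
from collections import Counter


class PrefixBinner:
    """Tracks votes per IP prefix and returns a decreasing weight per vote."""

    def __init__(self, prefix_len=24):
        self.prefix_len = prefix_len
        self.votes_in_bin = Counter()

    def weight_for(self, voter_ip):
        net = ipaddress.ip_network(f"{voter_ip}/{self.prefix_len}", strict=False)
        self.votes_in_bin[net] += 1
        # Assumed rule: the k-th vote seen from the same bin has weight 1/k.
        return 1.0 / self.votes_in_bin[net]


# Four votes from one /24 range (e.g., Sybil identities behind one prefix).
same_range = PrefixBinner()
same_range_total = sum(same_range.weight_for(f"10.9.9.{i}") for i in range(1, 5))

# Four votes from four different /24 ranges.
spread = PrefixBinner()
spread_total = sum(spread.weight_for(f"10.0.{i}.7") for i in range(1, 5))
```

Under this assumed rule, four votes from one IP/24 range carry a combined weight of 1 + 1/2 + 1/3 + 1/4 (about 2.08), whereas four votes from four different ranges carry a combined weight of 4.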
2.3.3 Penalizing Uncooperative or Malicious Index Nodes
Thus far, it has been assumed that all keyword key owners are honest. Under
this assumption, the processed voting results for each content key must be reflected into
the mapping keyword index record, and secured by the keyword key owners themselves.
However, this assumption may not hold in practice. In some cases, the keyword key owners
may be lazy or uncooperative, and may simply not act properly as feedback mediators. In
other cases, attackers may insert themselves as a keyword key owner of the target file, so that
they can control (and falsify) the keyword index records as they wish. This will be possible
if an attacker intentionally manipulates its client ID so that the ID matches the hash of
a keyword of the target file. Once the attacker becomes the keyword key owner(s) of the
target title and returns only decoy index records (i.e., provides false information) whenever asked, downloaders may be unable to locate clean copies of a file, or may consistently
download polluted copies of files.
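The ID-manipulation attack described above can be sketched as follows. The sketch assumes a Kademlia-style lookup in which the nodes whose IDs are closest to the keyword hash under the XOR metric act as keyword key owners. The hash used here (SHA-256 truncated to 128 bits) is only a stand-in and is not the hash function used by Kad; the names keyword_key and closest_nodes are likewise hypothetical.

```python
import hashlib
import random

def keyword_key(keyword: str) -> int:
    """Hash a keyword to a 128-bit key (stand-in hash, not Kad's actual function)."""
    digest = hashlib.sha256(keyword.lower().encode("utf-8")).digest()[:16]
    return int.from_bytes(digest, "big")

def closest_nodes(key: int, node_ids, k: int):
    """Return the k node IDs closest to key under the XOR metric."""
    return sorted(node_ids, key=lambda node_id: node_id ^ key)[:k]

rng = random.Random(0)
honest_ids = [rng.getrandbits(128) for _ in range(1000)]

target = keyword_key("example")
attacker_id = target  # the attacker chooses its ID to equal the keyword hash

owners = closest_nodes(target, honest_ids + [attacker_id], k=3)
print(owners[0] == attacker_id)
```

Because the attacker's XOR distance to the key is zero, it ranks ahead of every honest node in the lookup for that keyword.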
To address this problem, winnowing adopts a so-called imbalanced feedback mechanism. With imbalanced feedback, the size of voting messages differs based on whether the vote is positive (favorable) or negative (unfavorable). The size of a positive vote is
very small (possibly only a few tens of bytes), whereas the size of a negative vote is much
larger. First, the technique penalizes malicious keyword key owners in that index nodes issuing many decoy keyword
index records will receive a high volume of negative votes. The penalty is inevitable for the
attacking keyword key owners once launched by compliant users, consuming the download
bandwidth of the attackers. Second, the technique makes it more difficult for attackers to
cast many reverse votes for clean keyword index records since that will consume their own
resources (i.e., upload bandwidth).
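The bandwidth asymmetry behind imbalanced feedback can be made concrete with a small sketch. Both message sizes below are assumptions: the text says only that a positive vote may be a few tens of bytes and does not give the size of a negative vote. The function names are hypothetical.

```python
POSITIVE_VOTE_BYTES = 50      # assumed; the text says "a few tens of bytes"
NEGATIVE_VOTE_BYTES = 5_000   # assumed; the text gives no figure

def vote_size(positive: bool) -> int:
    """Size in bytes of a single vote message under the assumed sizes."""
    return POSITIVE_VOTE_BYTES if positive else NEGATIVE_VOTE_BYTES

def owner_download_cost(n_positive: int, n_negative: int) -> int:
    """Bytes a keyword key owner must receive for the given votes."""
    return n_positive * vote_size(True) + n_negative * vote_size(False)

def voter_upload_cost(n_positive: int, n_negative: int) -> int:
    """Bytes a voter must send to cast the given votes."""
    return n_positive * vote_size(True) + n_negative * vote_size(False)

# A decoy-issuing owner that draws 1000 negative votes, versus an honest
# owner that draws 1000 positive votes.
print(owner_download_cost(0, 1000), owner_download_cost(1000, 0))

# An attacker casting 1000 reverse (negative) votes against clean records.
print(voter_upload_cost(0, 1000))
```

With these assumed sizes, the cost of receiving or casting negative votes is two orders of magnitude higher than for positive votes, which is the source of both deterrent effects described above.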
The keyword key owner may also fail to provide keyword index records for selected
files (i.e., tell nothing). Note that if all the keyword key owners for a keyword are replaced by such attackers, then users may not be able to find the files whose name (or metadata)
includes that keyword. However, such attacks will be difficult and expensive, given the
degree of redundancy of keyword index records in current systems. For instance, it has
been found that there are an average of 19 keyword key owners for a single keyword in the
Kad network [71], and each title is indexed by a number of keywords. Moreover, use of
multiple hash functions [77] designed for load balancing will further restrict the ability of
attackers to control all of the keyword key owners for a file.
2.4 Evaluation
In this section, the winnowing approach is evaluated. The first part checks the
effectiveness of keyword publish verification in the Kad system. In the second part, the
privacy-preserving object reputation of winnowing is evaluated through analytic modeling, and the
results are compared with simulation. Modeling and simulation allow a larger variety of
conditions and assumptions to be tested than would a deployed solution. They have the
added benefit that they do not require acceptance by and the training of a user community,
which is normally a lengthy process.
2.4.1 Performance of Keyword Publish Verification (Preliminary Phase)
The keyword publish verification of winnowing was implemented on top of an
up-to-date eMule client (0.49a MorphXT version 11.0 [8]) and inserted into the Kad network.
Experiments were conducted under the same settings as in Section 2.2. Each keyword key owner
(i.e., modified winnowing client) verified the content key in each keyword publish message,
issuing content search messages (i.e., CONTENT KEY VERI REQ in Figure 2.4(a)). If