
Efficient Search in Gnutella-like Small-World Peer-to-Peer Systems


Dongsheng Li, Xicheng Lu, Yijie Wang, Nong Xiao

School of Computer, National University of Defense Technology, 410073 Changsha, China

leedongsh@hotmail.com

Abstract. Gnutella-like peer-to-peer file-sharing systems have been widely deployed on the Internet. However, the search techniques used in existing Gnutella-like peer-to-peer systems are often inefficient. We demonstrate the strong “small-world” property of Gnutella systems and propose an efficient search approach, CSTM, that exploits this property. In CSTM, each peer maintains a local state table containing keyword information about the data on all neighbors within T hops, which it uses to guide queries. A new data structure based on the Bloom Filter is introduced to represent the state table compactly and reduce storage and bandwidth cost. A query cache is also adopted to exploit query locality and build shortcut connections to recently accessed peers. Simulations show that CSTM reduces message cost remarkably while maintaining a short search path length, compared with flooding or random forwarding.

1 Introduction and Related Work

In recent years, peer-to-peer (P2P) computing has emerged as a novel and popular model of computation and has gained significant attention from both industry and academia [1,2,3]. Gnutella-like P2P systems, such as Gnutella, Freenet [4], Morpheus and NeuroGrid [5], are widely deployed on the Internet because of their simplicity and usability.

However, flooding in Gnutella systems consumes too much bandwidth and limits scalability. Recently, much work has been done to improve the performance and scalability of Gnutella-like P2P systems. Markatos [6] studied the characteristics of Gnutella traffic and proposed caching strategies to improve its performance, but still used flooding for search. Lv [7] suggested random walks instead of flooding. Cohen [8] proposed and analyzed three replication strategies and proved that the square-root strategy is optimal. Yang [9] suggested iterative deepening and Directed BFS techniques to reduce message cost. Joseph [5], Yang [9], Adamic [10], and Crespo [11] suggested that each node maintain some kind of metadata that can provide “hints” to guide search. Yang [9] used local indices, in which nodes index the content of their neighbors in the system; however, flooding was still used for queries. Adamic [10] utilized local information such as the connectedness of a node’s neighbors and forwarded queries to neighbors with high degree. Crespo [11] built summaries, organized by topic, of the content reachable via each neighbor of a node.

* This work was supported by the National Natural Science Foundation of China under grants No. 69933030 and 60203016, the National 863 High Technology Plan of China under grants No. 2002AA131010 and 2003AA1Z2060, and the Excellent PhD Dissertation Foundation of China under grant No. 200141.

In this paper we demonstrate the “small-world” property of the Gnutella network and propose a new search approach, CSTM, that exploits this property to improve the performance and scalability of Gnutella-like P2P systems. In CSTM, each peer maintains a local state table containing indices of the keywords on all neighbors within a few hops, which it uses to guide queries. Because of the “small-world” property, the desired data can be found within only a few hops. The Extended Bloom Filter (EBF), which is based on the Bloom Filter [12] technique, is introduced to represent the state table with little storage and bandwidth overhead. In addition, a query cache is adopted to exploit query locality and build shortcut connections to remote peers. Simulations show that CSTM achieves much better performance than flooding and random forwarding in Gnutella-like P2P systems.

The Bloom Filter [12] has been widely used as a summary technique. Summary Cache [13] uses Bloom Filters as compact representations of the local set of cached files. Rhea and Kubiatowicz [14] use Bloom Filters to find nearby replicas quickly in OceanStore systems. CSTM uses Bloom Filters in a way similar to [14], but the two serve different purposes and achieve different tradeoffs. To our knowledge, CSTM is the first to use the Bloom Filter technique to improve search in Gnutella-like P2P systems.

The rest of this paper is organized as follows. Section 2 demonstrates the “small-world” property using data traces crawled from Gnutella networks. Section 3 presents CSTM in detail. Section 4 evaluates the performance of CSTM. Section 5 concludes the paper.

2 “Small-world” Property of Gnutella

The “small-world” phenomenon was first discovered by Stanley Milgram in the late 1960s. “Small-world” networks [15] are highly clustered, yet typically have short path lengths between arbitrary nodes, i.e., a short diameter. In this paper, we model the Gnutella network as an undirected graph and use the concepts of clustering coefficient and characteristic path length proposed by Watts and Strogatz [15] to analyze the characteristics of Gnutella-like peer-to-peer networks.

We measured these properties on actual Gnutella topology data crawled by Clip2 [16] in the summer of 2000. We performed the experiments on many different snapshots of the Gnutella network, selected randomly from the data traces. For each snapshot we calculated the characteristic path length and clustering coefficient and compared them with those of a random network with the same number of nodes and the same average degree per node. Table 1 shows five samples.
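Both metrics can be computed directly from an adjacency list. The following is a minimal sketch under stated assumptions, not the measurement code used for the traces: the function names and the toy graph are illustrative, path lengths are averaged over reachable pairs only, and nodes with fewer than two neighbors are assigned a clustering coefficient of 0 (one common convention).

```python
from collections import deque
from itertools import combinations


def characteristic_path_length(adj):
    """Mean shortest-path length over all reachable ordered node pairs."""
    total, pairs = 0, 0
    for source in adj:
        # Breadth-first search gives hop distances in an unweighted graph.
        dist = {source: 0}
        queue = deque([source])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())
        pairs += len(dist) - 1
    return total / pairs if pairs else 0.0


def clustering_coefficient(adj):
    """Average over nodes of (links among neighbors) / (possible links)."""
    coeffs = []
    for nbrs in adj.values():
        k = len(nbrs)
        if k < 2:
            coeffs.append(0.0)  # assumption: undefined case counted as 0
            continue
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        coeffs.append(links / (k * (k - 1) / 2))
    return sum(coeffs) / len(coeffs)


# Toy undirected graph: triangle A-B-C with a pendant node D attached to C.
toy = {
    "A": {"B", "C"},
    "B": {"A", "C"},
    "C": {"A", "B", "D"},
    "D": {"C"},
}
L = characteristic_path_length(toy)  # 16 hops over 12 ordered pairs
C = clustering_coefficient(toy)      # (1 + 1 + 1/3 + 0) / 4
```

The same two functions would be applied to each Gnutella snapshot and to a random graph with matching node count and average degree to obtain the kind of comparison reported in Table 1.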

As Table 1 shows, all the Gnutella topology snapshots exhibit a strong “small-world” property: the characteristic path length is close to that of the corresponding random network, but the clustering coefficient is much higher (i.e., $L_{Gnutella} \approx L_{random}$ and $C_{Gnutella} \gg C_{random}$).

Table 1. “Small-world” property of Gnutella

The “small-world” property of Gnutella networks can help guide the design of search mechanisms in such systems. For example, the small diameter means that if a peer knows what data is held by its neighbors within several hops, it can forward a query for the desired data directly to the appropriate neighbors and no longer needs flooding. CSTM exploits this property.

3 Approach Description

In Gnutella-like P2P systems, peers submit keywords describing the desired data and receive results from the system. In CSTM, each peer has a local state table that contains keyword information about the data files on all neighbors within T hops (T is a system-wide constant). We store keywords rather than full file names in the state table because keywords support flexible queries.


Fig. 1. Neighbor graph of peers

There are three columns in the state table. For example, Table 2 shows the state table of peer A in Figure 1. The first row of Table 2 shows that the keyword “car” is two hops away from peer A through neighbor B, and the fourth row shows that the keyword “red” is one hop away from peer A through neighbor C. A keyword may be stored on multiple peers, which may be different numbers of hops away from peer A or reachable through different neighbors of peer A, so the same keyword can occupy multiple rows in the state table.

Snapshot   L_Gnutella   L_random   C_Gnutella   C_random
1          3.8643       3.1876     0.04513      0.006789
2          4.2884       3.6546     0.05403      0.004254
3          4.4368       4.1794     0.02311      0.003201
4          3.3728       3.0545     0.01887      0.002456
5          4.6510       3.7397     0.06033      0.008512


Hop information can be used to determine which neighbor a query is forwarded to. For example, a query for “car” will be forwarded to neighbor C first rather than neighbor B, because “car” is one hop away through neighbor C but two hops away through neighbor B. Hop information in the state table can also support queries for multiple keywords. Take peer A as an example. If peer A receives a query for “red car”, it knows that the keywords “red” and “car” are both one hop away through peer C, so it can forward the query to peer C. If peer A receives a query for “black car”, it knows that although both “black” and “car” are on some peers reachable through peer C, they are at different hop distances; the two keywords are therefore not on the same peer through neighbor C, and peer A does not forward the query to C. The state table only maintains information within T hops, i.e., the value in the “hop” column of the state table is no more than T.
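The forwarding decisions above can be sketched with a plain (uncompressed) state table. This is an illustrative sketch only: the rows are those of Table 2, while the function name `rank_neighbors` and the tie-breaking by neighbor name are assumptions introduced here.

```python
from collections import defaultdict

# Rows of Table 2: (keyword, hop, neighbor) as seen from peer A.
STATE_TABLE = [
    ("car", 2, "B"),
    ("car", 3, "B"),
    ("car", 1, "C"),
    ("red", 1, "C"),
    ("black", 3, "C"),
]


def rank_neighbors(state_table, keywords):
    """Return [(neighbor, hop)] sorted by hop count.

    A neighbor qualifies only if every keyword appears at one common hop
    distance through it, which is the test used for multi-keyword queries.
    """
    # hops[neighbor][keyword] -> set of hop counts
    hops = defaultdict(lambda: defaultdict(set))
    for keyword, hop, neighbor in state_table:
        hops[neighbor][keyword].add(hop)

    candidates = []
    for neighbor, per_keyword in hops.items():
        hop_sets = [per_keyword.get(k, set()) for k in keywords]
        common = set.intersection(*hop_sets) if hop_sets else set()
        if common:
            candidates.append((neighbor, min(common)))
    return sorted(candidates, key=lambda pair: (pair[1], pair[0]))


single = rank_neighbors(STATE_TABLE, ["car"])             # C before B
red_car = rank_neighbors(STATE_TABLE, ["red", "car"])     # only C, one hop
black_car = rank_neighbors(STATE_TABLE, ["black", "car"]) # no neighbor
```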

Table 2. State table on peer A

Assuming each peer has M neighbors, the state table stores keyword information for on the order of $M^T$ peers. To reduce the storage and update cost of the state table, a data structure based on the Bloom Filter technique, the Extended Bloom Filter (EBF), is used to represent the state table.

Generally speaking, if peer A accesses some data on peer B, then peer A and peer B may share common interests, and peer A is more likely to access data on peer B later. To exploit this query locality, each peer maintains a query cache, which maps recently accessed keywords to their locations. The query cache uses a timeout mechanism and discards entries that have not been accessed for a certain time. Cache replacement follows an LRU policy.
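A query cache with both behaviors (idle timeout and LRU replacement) might look like the sketch below. The class name `QueryCache`, the 600-second default timeout, and the injectable clock are assumptions made for illustration; the default capacity of 30 matches the cache size used in the simulations of Section 4.

```python
import time
from collections import OrderedDict


class QueryCache:
    """Maps keyword -> peer location with LRU replacement and idle timeout."""

    def __init__(self, capacity=30, timeout=600.0, clock=time.monotonic):
        self.capacity = capacity
        self.timeout = timeout          # hypothetical value, in seconds
        self.clock = clock
        self._entries = OrderedDict()   # keyword -> (location, last_access)

    def put(self, keyword, location):
        self._entries[keyword] = (location, self.clock())
        self._entries.move_to_end(keyword)
        while len(self._entries) > self.capacity:
            self._entries.popitem(last=False)  # evict least recently used

    def get(self, keyword):
        entry = self._entries.get(keyword)
        if entry is None:
            return None
        location, last_access = entry
        now = self.clock()
        if now - last_access > self.timeout:
            del self._entries[keyword]  # not accessed for too long
            return None
        # A hit refreshes both the LRU position and the access time.
        self._entries[keyword] = (location, now)
        self._entries.move_to_end(keyword)
        return location
```

A peer would consult such a cache before the state table and, on a hit, contact the cached peer directly as a shortcut.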

The state table provides information about the peer's neighborhood, and the query cache provides shortcut connections to some remote peers. Combining the two mechanisms can accelerate search.

3.1 Extended Bloom Filter

Bloom Filter (BF) [12] is a compact data structure for the probabilistic representation of a set. Consider a set A={a1, a2, …, an} of n elements. A BF describes the membership information of A using a bit vector V of length m, with all bits initially set to 0. The BF uses k independent hash functions hash1, …, hashk, each with range {1,…,m}.

Keywords   Hop   Neighbor
“car”      2     B
“car”      3     B
“car”      1     C
“red”      1     C
“black”    3     C
…          …     …


For each element a ∈ A, the bits at positions hash1(a), hash2(a), ..., hashk(a) in V are all set to 1. A particular bit might be set to 1 multiple times by various elements. Given a query for b, the BF checks the k bits at positions hash1(b), hash2(b), ..., hashk(b) in V. If any of them is 0, then b is certainly not in the set A. Otherwise the BF conjectures that b is in the set, although there is a certain probability that it is wrong (because all k bits may have been set by other elements).
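A minimal Bloom Filter sketch follows. It is illustrative rather than the authors' code: positions are 0-based (range 0..m-1 instead of 1..m), and the hash functions are taken from the four 32-bit words of an MD5 digest, in the spirit of the hashing described for the k=4 case in this section; the big-endian byte order is an assumption.

```python
import hashlib


class BloomFilter:
    """Bit vector V of length m with up to four MD5-derived hash functions."""

    def __init__(self, m, k=4):
        if not 1 <= k <= 4:
            raise ValueError("this sketch derives at most four hashes from MD5")
        self.m = m
        self.k = k
        self.bits = [0] * m

    def _positions(self, item):
        digest = hashlib.md5(item.encode("utf-8")).digest()  # 16 bytes
        return [
            int.from_bytes(digest[4 * i:4 * i + 4], "big") % self.m
            for i in range(self.k)
        ]

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # False means "certainly absent"; True means "probably present".
        return all(self.bits[pos] for pos in self._positions(item))


bf = BloomFilter(m=1024)
bf.add("car")
```

An added element is always reported as present, while an empty filter reports every element as absent; false positives can only arise once other elements have set the same bits.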

We use a BF (k=4) to represent the keywords on each peer. State tables, however, must also store hop information and thus cannot be represented by a BF directly. We propose a new data structure, the Extended Bloom Filter (EBF), based on the BF technique, to represent state tables compactly. The main idea of EBF is to replace the bit vector V of the Bloom Filter with an array VV of T-bit binary numbers. Each neighbor of a peer is associated with one EBF, which represents the keywords that can be accessed through that neighbor. Each entry of the array VV is a T-bit binary number, and each bit of the number represents keywords the corresponding number of hops away through the neighbor: the highest bit represents keywords one hop away, the second highest bit represents keywords two hops away, and so on, down to the lowest bit, which represents keywords T hops away. We write EBF(A,B) for the EBF of peer A about its neighbor B. Each peer therefore has as many EBFs as it has neighbors.

The four hash values of a keyword key are obtained as follows: first calculate the MD5 signature of key, which produces 128 bits; then divide the 128 bits into four 32-bit numbers; finally take each 32-bit number modulo m to get the four values. MD5 is selected because of its well-known randomness properties and relatively fast implementation. Details of EBF are given in [17].
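The EBF layout and the MD5-based hashing can be sketched as follows. This is a sketch under assumptions, since the full definition is deferred to [17]: the names `md5_hashes` and `ExtendedBloomFilter`, the 0-based positions, and the big-endian interpretation of each 32-bit word are all introduced here for illustration.

```python
import hashlib


def md5_hashes(key, m):
    """Four positions in [0, m) from the four 32-bit words of MD5(key)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return [int.from_bytes(digest[i:i + 4], "big") % m for i in range(0, 16, 4)]


class ExtendedBloomFilter:
    """EBF(A, B): keywords reachable from peer A through neighbor B."""

    def __init__(self, m, T):
        self.m = m
        self.T = T
        self.vv = [0] * m  # each entry is a T-bit number

    def add(self, key, hop):
        if not 1 <= hop <= self.T:
            raise ValueError("hop must be between 1 and T")
        mask = 1 << (self.T - hop)  # hop 1 -> highest bit, hop T -> lowest
        for pos in md5_hashes(key, self.m):
            self.vv[pos] |= mask

    def hops(self, key):
        """Hop counts at which key may be present (false positives possible)."""
        f = (1 << self.T) - 1
        for pos in md5_hashes(key, self.m):
            f &= self.vv[pos]
        return [h for h in range(1, self.T + 1) if f & (1 << (self.T - h))]


# EBF(A, B) from Table 2: "car" is two and three hops away through B.
ebf_ab = ExtendedBloomFilter(m=1024, T=3)
ebf_ab.add("car", 2)
ebf_ab.add("car", 3)
```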

3.2 Query and Update

In a Gnutella-like P2P system, users submit queries to any node together with a stop condition (e.g., the number of results desired). When a peer receives a query for the keyword “key”, it first checks the keywords it holds itself. If it can satisfy the query, it returns the results and the query ends. Otherwise it computes the k hash values h1,…, hk as hash1(key), hash2(key), ..., hashk(key). For each neighbor, it reads the k T-bit binary numbers stored at entries h1,…, hk of the corresponding EBF and combines them with a bitwise AND to obtain the result f. The result f is also a T-bit binary number, and the position of each ‘1’ (counted from left to right) in f indicates how many hops are needed to reach key through that neighbor. Thus the larger f is, the fewer hops are needed to reach the keyword. The peer ranks its neighbors by their f values and sends the query to each neighbor in sequence, checking whether the stop condition has been reached each time a query returns.
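The ranking step can be sketched on raw VV arrays. The example below is illustrative only: the function names, the tiny arrays (m=8, T=3), and the hashed positions [1, 4, 6, 7] are made up for the example, and neighbors whose f is 0 are simply omitted from the ranking, which is an assumption about how non-matching neighbors are treated.

```python
def match_value(vv, positions, T):
    """Bitwise AND of the T-bit entries at the k hashed positions."""
    f = (1 << T) - 1
    for pos in positions:
        f &= vv[pos]
    return f


def nearest_hop(f, T):
    """Hop count indicated by the leftmost 1 bit of f, or None if f == 0."""
    return T - f.bit_length() + 1 if f else None


def rank_by_match(ebfs, positions, T):
    """Neighbors with nonzero f, largest f (fewest hops) first."""
    scored = [(match_value(vv, positions, T), name) for name, vv in ebfs.items()]
    return [name for f, name in sorted(scored, key=lambda s: (-s[0], s[1])) if f]


T = 3
positions = [1, 4, 6, 7]  # hypothetical hash values h1..h4 of the keyword
ebfs = {
    "B": [0, 0b011, 0, 0, 0b011, 0, 0b011, 0b011],  # f = 011: hops 2 and 3
    "C": [0, 0b100, 0, 0, 0b110, 0, 0b100, 0b101],  # f = 100: hop 1
    "D": [0, 0b100, 0, 0, 0b000, 0, 0b100, 0b100],  # f = 000: no match
}
order = rank_by_match(ebfs, positions, T)
```

Because the highest bit corresponds to one hop, comparing the f values as integers ranks neighbor C (f = 100) ahead of neighbor B (f = 011), matching the “car” example of Table 2.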

When a data object is added to or deleted from a peer, the BF of the peer may change as well. If it does, the peer propagates the changes of its BF to its neighbors, which receive the update messages and update their EBFs accordingly. If needed, the neighbors send update messages to their own neighbors in turn. Each time an update message passes through a peer, the position of the updated value in the corresponding EBF is shifted one bit to the right. Update messages are propagated at most T hops from their source. The update process is also needed when peers join or leave the system.
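The exact update algorithm is deferred to [17], so the following is only one plausible reading of the right-shift rule, stated as an assumption: a peer advertises to a neighbor its own BF in the highest bit (one hop for the receiver) together with everything it knows through its other neighbors, shifted right by one bit so that each keyword appears one hop further away and information already T hops away is dropped. The function name `build_advertisement` and the OR-merge across neighbors are hypothetical.

```python
def build_advertisement(own_bits, other_ebfs, T):
    """Values a peer would send to one neighbor (hypothetical reading).

    own_bits   -- the peer's own Bloom filter, a list of 0/1 of length m
    other_ebfs -- the peer's VV arrays for its *other* neighbors
    """
    top = 1 << (T - 1)  # bit meaning "one hop away" for the receiver
    advert = []
    for i, bit in enumerate(own_bits):
        merged = 0
        for vv in other_ebfs:
            merged |= vv[i]
        # Shifting right moves every hop count one further away; the bit
        # that was T hops away falls off the low end and is not propagated.
        advert.append((top if bit else 0) | (merged >> 1))
    return advert


# T = 3, m = 4: the peer holds keywords hashing to positions 0 and 3, and
# knows of others at hop 1 (pos 0), hop 3 (pos 1) and hop 2 (pos 3).
advert = build_advertisement(
    own_bits=[1, 0, 0, 1],
    other_ebfs=[[0b100, 0b001, 0b000, 0b010]],
    T=3,
)
```

In this example the entry at position 1 becomes 0 because the information was already T = 3 hops away, which is how the T-hop propagation limit shows up in the shift.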

P2P networks may contain cycles, which can cause the same query or update message to reach the same node multiple times. To avoid this, the identifier of the source peer and a sequence number are added to each message, and peers discard duplicate messages. Details of the query and update algorithms are given in [17].
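Duplicate suppression of this kind can be sketched as a set of (source identifier, sequence number) pairs. The class name `DuplicateFilter` is hypothetical, and the unbounded set is a simplification; a deployed peer would presumably expire old entries.

```python
class DuplicateFilter:
    """Remembers (source peer id, sequence number) pairs already seen."""

    def __init__(self):
        self._seen = set()

    def first_time(self, source_id, seq_no):
        """Return True the first time a message is seen, False afterwards."""
        key = (source_id, seq_no)
        if key in self._seen:
            return False
        self._seen.add(key)
        return True


seen = DuplicateFilter()
results = [
    seen.first_time("A", 1),  # new message: process and forward
    seen.first_time("A", 1),  # same message arriving via a cycle: discard
    seen.first_time("A", 2),  # next message from the same source
    seen.first_time("B", 1),  # same sequence number, different source
]
```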

4 Performance Evaluations

We implemented CSTM in the open-source NeuroGrid simulator [5], using a Gnutella snapshot topology with 1005 peers as the underlying topology of the simulated Gnutella-like P2P system. Each peer holds four files, and each file has three keywords. Files and keywords are selected from a pool of 10000 files and 5000 keywords.

In the experiments we simulate four algorithms: Gnutella flooding, two-way random forwarding (peers forward each query to two random neighbors), CSTM without a query cache, and CSTM with a query cache (cache size 30). The TTL is 7 for flooding and 13 for random forwarding. We run 20,000 searches in the simulation, each started at a randomly selected node. Each search is for a randomly selected file, with the keywords of the desired file as search terms. After every 2000 searches we probe the system to obtain the average search length and message cost at that point. Simulation results are shown in Figure 2.

[Figure 2 panels: (a) Average path length (hops) and (b) Messages per search, each plotted against No. of searches (×2000) for Gnutella, random forwarding (2), CSTM (T=2) and CSTM (T=2, cache)]

Fig. 2. Average path length and messages per search with the four algorithms

Figure 2(a) shows that the average search length of CSTM is slightly longer than that of flooding but much shorter than that of two-way random forwarding. Figure 2(b) shows that flooding in Gnutella produces far too many messages (about 4000 per search) and that two-way random forwarding also causes about 1500 messages per search, while CSTM causes no more than 250, roughly an order of magnitude fewer than flooding. Figure 2 also shows that the query cache has some self-learning ability: it holds more effective shortcut connections after a large number of searches, which lowers both the average search length and the message cost.

We then evaluate CSTM with different values of the parameter T. Figure 3 presents the simulation results for T = 1, 2 and 3. The results show that as T increases, the search length and message cost decrease dramatically. When T=3, the search length is almost the same as that of Gnutella, while only about 100 messages are produced per search. The query cache again proves very effective.

[Figure 3 panels: average path length (hops) and messages per search, each plotted against No. of searches (×2000) for CSTM with T=1, 2 and 3, with and without the query cache]

Fig. 3. Average path length and message cost with different T

We now evaluate the storage cost of CSTM based on the data traces crawled from the Gnutella network. The storage cost AverageStorage of the state table in CSTM can be computed by formula (1):

$$
\begin{aligned}
\mathit{AverageStorage} &= \mathit{NeighborsPerPeer} \times \mathit{AverageSizeofPerEBF} \\
&= \mathit{NeighborsPerPeer} \times T \times (m/n) \times \mathit{KeysPerFile} \times \mathit{AverageFilePerPeer} \ \text{(bit)}
\end{aligned}
\tag{1}
$$

Adar [18] observed the actual Gnutella network for 24 hours in August 2000 and found 31,395 peers sharing a total of 3,019,405 files, with NeighborsPerPeer less than 4. Thus AverageFilePerPeer is about 96. Taking NeighborsPerPeer=4, KeysPerFile=3, T=7 and m/n=32 as a typical example, the value of AverageStorage calculated from formula (1) is 258048 bits, i.e., about 32KB. Even if AverageFilePerPeer is increased tenfold, the storage cost is only about 320KB. Moreover, formula (1) does not eliminate duplicate files and keywords, so the actual storage cost is much less than the value it gives. We can therefore conclude that the storage cost of CSTM is small.
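The arithmetic behind these figures can be checked with a short script. The function and variable names are illustrative, and NeighborsPerPeer is taken as 4, the upper bound quoted above.

```python
def average_storage_bits(neighbors_per_peer, T, bits_per_element,
                         keys_per_file, files_per_peer):
    """Formula (1): state-table storage per peer, in bits."""
    return (neighbors_per_peer * T * bits_per_element
            * keys_per_file * files_per_peer)


# 3,019,405 shared files over 31,395 peers gives about 96 files per peer.
files_per_peer = 3_019_405 // 31_395

bits = average_storage_bits(
    neighbors_per_peer=4,   # upper bound quoted in the text
    T=7,
    bits_per_element=32,    # m/n
    keys_per_file=3,
    files_per_peer=files_per_peer,
)
kib = bits / 8 / 1024       # 31.5, i.e. about 32KB

# Ten times as many files per peer scales the cost linearly.
kib_tenfold = average_storage_bits(4, 7, 32, 3, 10 * files_per_peer) / 8 / 1024
```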

5 Conclusions

Experiments show that Gnutella networks exhibit a strong “small-world” property, which has an important effect on search performance in such systems. A new search approach, CSTM, is proposed to exploit the “small-world” property of Gnutella-like P2P systems. Compared with flooding and random forwarding, the approach reduces message cost remarkably while maintaining a short search path length, and can thus considerably improve the performance and scalability of Gnutella-like P2P systems.


References

1. Clark, D.: Face-to-face with peer-to-peer networking. IEEE Computer, Vol. 34, No. 1, IEEE Press (2001) 18–21

2. Schoder, D., Fischbach, K.: Peer-to-peer prospects. Communications of the ACM, Vol. 46, No. 2 (2003) 27–29

3. Li, D., Fang, X., Wang, Y., Lu, X., et al.: A scalable peer-to-peer network with constant degree. Lecture Notes in Computer Science, Vol. 2834, Springer-Verlag, Berlin Heidelberg New York (2003) 414–424

4. Clarke, I., Sandberg, O., Wiley, B., Hong, T.: Freenet: a distributed anonymous information storage and retrieval system. Proc. of the Workshop on Design Issues in Anonymity and Unobservability, Berkeley, CA (2000) 311–320

5. Joseph, S.R.H.: NeuroGrid: semantically routing queries in peer-to-peer networks. Proc. of International Workshop on Peer-to-Peer Computing, Pisa, Italy (2002)

6. Markatos, E.P.: Tracing a large-scale peer-to-peer system: an hour in the life of Gnutella. Proc. of CCGrid2002, Berlin, Germany (2002)

7. Lv, Q., Cao, P., Cohen, E., Li, K., Shenker, S.: Search and replication in unstructured peer-to-peer networks. Proc. of the 16th annual ACM International Conference on Supercomputing (ICS), New York (2002)

8. Cohen, E., Shenker, S.: Replication strategies in unstructured peer-to-peer networks. Proc. of ACM Sigcomm’2002, ACM Press, Pittsburgh (2002)

9. Yang, B., Garcia-Molina, H.: Efficient search in peer-to-peer networks. Proc. of the 22nd IEEE ICDCS, Vienna, Austria (2002)

10. Adamic, L.A., Huberman, B., Lukose, R., Puniyani, A.: Search in power law networks. Phys. Rev. E, Vol. 64, No. 4 (2001) 46135–46143

11. Crespo, A., Garcia-Molina, H.: Routing indices for peer-to-peer systems. Proc. of the 22nd IEEE ICDCS, Vienna, Austria (2002)

12. Bloom, B.: Space/time trade-offs in hash coding with allowable errors. Communications of the ACM, Vol. 13, No. 7 (1970) 422–426

13. Fan, L., Cao, P., Almeida, J., Broder, A.: Summary cache: a scalable wide-area web cache sharing protocol. Proc. of ACM SIGCOMM’1998, ACM Press (1998) 254–265

14. Rhea, S.C., Kubiatowicz, J.: Probabilistic location and routing. Proc. of IEEE Infocom’2002, IEEE Computer Society Press, New York (2002)

15. Watts, D.J., Strogatz, S.H.: Collective dynamics of small-world networks. Nature, Vol. 393 (1998) 440–442

16. Clip2 Company. http://www.clip2.com/

17. Li, D., et al.: Efficient Search in Gnutella-like “Small-World” Peer-to-Peer Systems. Tech. Rept. PDL-2002-11-2, National University of Defense Technology, Changsha City, China (2002)

18. Adar, E., Huberman, B.A.: Free riding on Gnutella. First Monday, Vol. 5, No. 10 (2000)

