Many previous empirical BitTorrent studies [14, 6, 23] use reliable session identifiers provided by trackers to identify BitTorrent sessions. However, many P2P traces are collected by deploying instrumented clients that contact peers in a P2P system, so reliable session identifiers are not always available in those traces. Moreover, due to the dynamic nature of P2P systems and to measurement failures, instrumented clients cannot always contact every peer. Without reliable tracker logs, it is difficult to judge whether a peer is still in the same session when the time interval between two consecutive observations of that peer is very long. Simply ignoring these large observation gaps and treating all observations of a peer as a single session can overestimate the peer’s session length and, as a consequence, distort the analysis results. Here we use an approach similar to the one we used to examine the influence of different peer identification intervals: when identifying sessions, if the time interval between two consecutive observations of a peer is larger than the session identification interval, we consider the two observations to belong to two different sessions. We again analyze the peer arrival rate, the session length, and the download speed of the trace, this time for different session identification intervals.
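The session identification rule described above can be illustrated with a minimal sketch. This is not the code used to produce the results in this report; the function name `split_sessions`, the representation of observations as timestamps in minutes, and the sample data are assumptions made for illustration only.

```python
from typing import Iterable, List, Tuple


def split_sessions(timestamps: Iterable[float], max_gap: float) -> List[Tuple[float, float]]:
    """Group the observation times of ONE peer into (start, end) sessions.

    A new session starts whenever the gap to the previous observation is
    strictly larger than max_gap (the session identification interval).
    """
    sessions: List[List[float]] = []
    for t in sorted(timestamps):
        if sessions and t - sessions[-1][1] <= max_gap:
            sessions[-1][1] = t      # still the same session: extend it
        else:
            sessions.append([t, t])  # first observation, or gap too large
    return [(start, end) for start, end in sessions]


# Hypothetical observation times of one peer, in minutes since trace start.
observations = [0, 5, 12, 50, 55, 700]

for gap in (10, 60, 1440):  # 10 minutes, 1 hour, 1 day
    sessions = split_sessions(observations, gap)
    lengths = [end - start for start, end in sessions]
    print(gap, len(sessions), lengths)
```

With these made-up observations, a 10-minute interval yields three short sessions, a 1-hour interval yields two, and a 1-day interval merges everything into a single 700-minute session. Each additional session also counts as an additional arrival.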
We find that the session identification interval can significantly influence the analysis results. Smaller intervals lead to higher arrival rates than larger intervals, and to shorter session lengths as well, as shown in Figures 23 and 24, respectively. The reason is that, with a small session identification interval, a series of observations of a peer is likely to be split into multiple sessions, each of which is shorter than a single session containing all the observations of that peer. Using small session identification intervals also leads to a higher peer download speed, but the difference in download speed is not significant, as shown in Figure 25. Tables 37, 38, and 39 provide the statistics of the peer arrival rate, session length, and download speed, respectively, of traces derived with various session identification intervals.
Tables 40, 41, and 42 show the significance values (p-values) from the goodness-of-fit (GOF) tests and the parameters of the fitted distributions for the peer arrival rate and the session length resulting from different session identification intervals.
Interval     Max    Mean  StDev  Q1  Median  Q3   IQR
10 minutes   6,838  97    118    26  60      130  104
30 minutes   5,403  95    123    24  56      126  102
1 hour       4,713  94    126    24  55      122  99
10 hours     2,901  93    132    24  53      116  92
1 day        2,901  91    131    24  52      112  88
Table 39: Peer Download Speed Statistics with Various Session Identification Intervals (minutes).
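The columns of the statistics tables (Max, Mean, StDev, Q1, Median, Q3, IQR) can be computed along the following lines. This is a hedged sketch: the report does not state which quantile method or which standard deviation estimator was used, so the choice of the "inclusive" quantile method and of the sample standard deviation are assumptions, as is the helper name `summarize`.

```python
import statistics
from typing import Dict, Sequence


def summarize(values: Sequence[float]) -> Dict[str, float]:
    """Return the summary statistics reported per interval in the tables."""
    # quantiles(n=4) returns the three cut points Q1, median, Q3.
    # The "inclusive" method is an assumption, not taken from the report.
    q1, median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),  # sample standard deviation (assumed)
        "q1": q1,
        "median": median,
        "q3": q3,
        "iqr": q3 - q1,                     # interquartile range
    }


# Hypothetical download speeds, for illustration only.
print(summarize([1, 2, 3, 4, 5]))
```

Because the interquartile range ignores the tails, it is less sensitive than the maximum to a few extreme sessions, which is consistent with the maximum varying far more across intervals than the quartiles do.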
[Plot: CDF (0 to 1) versus peer arrival rate (1 to 10000), one curve per interval: 10 minutes, 30 minutes, 1 hour, 10 hours, 1 day.]
Figure 23: CDF of peer arrival rate resulting from various session identification intervals (horizontal axis in logarithmic scale).
[Plot: CDF (0 to 1) versus session length in minutes (1 to 10000), one curve per interval: 10 minutes, 30 minutes, 1 hour, 10 hours, 1 day.]
Figure 24: CDF of session length resulting from various session identification intervals (horizontal axis in logarithmic scale).
          Exponential     Weibull         Pareto          Log-Normal      Gamma
Interval  KS     AD       KS     AD       KS     AD       KS     AD       KS     AD
10m       0.441  0.597    0.451  0.604    0.000  0.000    0.302  0.468    0.459  0.612
30m       0.179  0.398    0.354  0.574    0.000  0.002    0.324  0.568    0.366  0.560
1h        0.217  0.420    0.407  0.561    0.000  0.001    0.472  0.635    0.367  0.542
10h       0.304  0.474    0.410  0.563    0.000  0.000    0.494  0.654    0.382  0.527
1d        0.296  0.471    0.416  0.560    0.000  0.000    0.486  0.640    0.359  0.522
Table 40: p-values of the KS and AD tests for arrival rates with different session identification intervals.
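To make the goodness-of-fit procedure more concrete, the following sketch computes a Kolmogorov-Smirnov (KS) statistic for an exponential fit, together with an approximate p-value. It is an illustration under stated assumptions, not the procedure used to produce the tables: the function names are hypothetical, the exponential parameter is fitted by maximum likelihood (the sample mean), and the p-value uses the standard asymptotic Kolmogorov series with a common small-sample correction. That p-value is only approximate when the parameter is estimated from the same data. The Anderson-Darling (AD) test is not sketched here.

```python
import math
from typing import Sequence, Tuple


def ks_statistic_exponential(samples: Sequence[float]) -> Tuple[float, float]:
    """Fit Exp(mean) by maximum likelihood and return (D, fitted mean).

    D is the largest distance between the empirical CDF and the fitted CDF.
    """
    xs = sorted(samples)
    n = len(xs)
    mean = sum(xs) / n
    d = 0.0
    for i, x in enumerate(xs, start=1):
        cdf = 1.0 - math.exp(-x / mean)
        # The empirical CDF jumps from (i - 1)/n to i/n at x: check both sides.
        d = max(d, i / n - cdf, cdf - (i - 1) / n)
    return d, mean


def ks_pvalue_asymptotic(d: float, n: int, terms: int = 100) -> float:
    """Approximate KS p-value from the asymptotic Kolmogorov distribution."""
    lam = (math.sqrt(n) + 0.12 + 0.11 / math.sqrt(n)) * d
    if lam < 0.2:
        return 1.0  # the series converges slowly here; the true value is ~1
    total = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2.0 * total))


# Hypothetical sample, for illustration only.
d, mean = ks_statistic_exponential([1.0, 2.0, 3.0])
print(d, mean, ks_pvalue_asymptotic(d, 3))
```

A large p-value means that the test does not reject the fitted distribution. A p-value near zero, as reported for the Pareto distribution in every row, means that the fit is rejected.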
[Plot: CDF (0 to 1) versus download speed in kbps (1 to 10000), one curve per interval: 10 minutes, 30 minutes, 1 hour, 10 hours, 1 day.]
Figure 25: CDF of download speed resulting from various session identification intervals (horizontal axis in logarithmic scale).
Interval  Exp(µ)   Wbl(κ, λ)       Pareto           LogN(µ, σ)    Gam(κ, λ)
10m       446.39   446.13  1.00    -0.08  481.75    5.49  1.40    0.95  471.38
30m       250.52   216.78  0.78    0.38   165.11    4.66  1.49    0.70  358.38
1h        163.18   141.23  0.80    0.41   100.83    4.29  1.34    0.74  219.24
10h       105.97   95.81   0.85    0.31   72.42     3.96  1.21    0.84  125.77
1d        102.10   92.15   0.85    0.32   69.57     3.92  1.22    0.84  121.76
Table 41: Parameters of the fitted distributions for arrival rates with different session identification intervals.
          Exponential     Weibull         Pareto          Log-Normal      Gamma
Interval  KS     AD       KS     AD       KS     AD       KS     AD       KS     AD
10m       0.022  0.084    0.300  0.406    0.000  0.001    0.406  0.563    0.198  0.298
30m       0.114  0.235    0.410  0.545    0.000  0.001    0.486  0.623    0.324  0.461
1h        0.196  0.357    0.443  0.583    0.000  0.001    0.462  0.601    0.394  0.547
10h       0.296  0.436    0.459  0.603    0.000  0.001    0.381  0.551    0.493  0.605
1d        0.271  0.419    0.479  0.610    0.000  0.001    0.398  0.560    0.497  0.617
Table 42: p-values of the KS and AD tests for session length with different session identification intervals.
7 Related Work
This work is motivated by a number of archival efforts in other computer science disciplines. For the cluster computing community, the Parallel Workloads Archive (PWA) [2] has become the de facto standard source of traces from parallel production environments. Similarly, the Grid Workloads Archive (GWA) [3] collects Grid traces. For the Internet community, the Internet Traffic Archive (ITA) [8] and CAIDA [7] collect a large number of Internet traces to study the characteristics and usage patterns of networks. For the wireless network community, CRAWDAD [25] archives data from a wide range of protocols and imposes a structured metadata format. For the availability research community, the Repository of Availability Traces [10] and the Failure Trace Archive [9] collect availability traces from a wide range of distributed systems.
For the P2P research community, there have been few efforts [16] to share P2P traces. Compared with these efforts, the P2P Trace Archive is currently the largest archive of P2P traces, and it is the only one that represents all traces in a unified format.
8 Conclusion and Ongoing Work
Peer-to-Peer systems have gained phenomenal popularity over the last decade, and they serve millions of users. However, publicly available traces collected from real P2P systems are rare, which hampers the in-depth study of these systems. To address this situation, in this work we introduce the Peer-to-Peer Trace Archive, which facilitates the exchange of P2P traces. Through a comparative trace analysis, we show that the characteristics and usage patterns differ significantly across P2P systems, and that some characteristics evolve rapidly over the years. We also find that the way peers and sessions are identified in BitTorrent traces can significantly influence the analysis results.
Currently, the P2P Trace Archive mainly hosts traces collected from P2P file-sharing systems. In the future, we plan to collect traces from other types of P2P applications, such as P2P live streaming and Video-on-Demand systems and massively multi-player online games. We invite the research community to contribute traces to extend the Archive.
9 Acknowledgements
The research leading to this contribution has received funding from the European Community’s Seventh Framework Programme in the P2P-Next project under grant no. 216217.
References
[1] ipoque internet studies, 2006-2009. [Online] Available: www.ipoque.com/resources/internet-studies/.
[2] The Parallel Workloads Archive, Jul 2007. [Online] Available: http://www.cs.huji.ac.il/labs/parallel/workload/.
[3] A. Iosup and D. Epema. The grid workloads archive, Jul 2007. [Online] Available: http://gwa.ewi.tudelft.nl/.
[4] N. Andrade, M. Mowbray, A. Lima, G. Wagner, and M. Ripeanu. Influences on cooperation in bittorrent communities. In P2PECON ’05: Proceedings of the 2005 ACM SIGCOMM workshop on Economics of peer-to-peer systems, pages 111–115, New York, NY, USA, 2005. ACM.
[5] M. Barbaro and T. Zeller. A face is exposed for AOL searcher no. 4417749. New York Times article, Aug 9 2006. [Online] Available: http://www.nytimes.com/2006/08/09/technology/09aol.html.
[6] A. Bellissimo, B. N. Levine, and P. Shenoy. Exploring the use of bittorrent as the basis for a large trace repository. Technical report, 2004.
[7] CAIDA Team. The Cooperative Association for Internet Data Analysis, Mar 2009.
[8] P. Danzig, J. Mogul, V. Paxson, and M. Schwartz. The Internet Traffic Archive, Mar 2009.
[9] D. Epema, A. Iosup, et al. Failure trace archive (FTA), 2009.
[10] B. Godfrey and I. Stoica. Repository of availability traces (RAT), Aug 2007.
[11] S. Guha, N. Daswani, and R. Jain. An experimental study of the skype peer-to-peer voip system, 2006.
[12] S. B. Handurukande, A.-M. Kermarrec, F. Le Fessant, L. Massoulié, and S. Patarin. Peer sharing behaviour in the edonkey network, and implications for the design of server-less file sharing systems. In EuroSys, pages 359–371, 2006.
[13] A. Iosup, P. Garbacki, J. A. Pouwelse, and D. H. J. Epema. Correlating topology and path characteristics of overlay networks and the internet. In IEEE/ACM Int’l. Symp. on Cluster Computing and the Grid (CCGrid) Workshops, GP2PC, page 10, 2006.
[14] M. Izal, G. Urvoy-Keller, E. W. Biersack, P. A. Felber, A. Al Hamra, and L. Garcés-Erice. Dissecting bittorrent: Five months in a torrent’s lifetime. Pages 1–11, 2004.
[15] A. Klemm, C. Lindemann, M. K. Vernon, and O. P. Waldhorst. Characterizing the query behavior in peer-to-peer file sharing systems. In IMC ’04: Proceedings of the 4th ACM SIGCOMM conference on Internet measurement, pages 55–67, New York, NY, USA, 2004. ACM.
[16] Laboratory for Advanced System Software, University of Massachusetts Amherst. UMass Trace Repository, Mar 2009.
[17] F. Le Fessant, S. B. Handurukande, A.-M. Kermarrec, and L. Massoulié. Clustering in peer-to-peer file sharing workloads. In IPTPS, pages 217–226, 2004.
[18] G. Maier, A. Feldmann, V. Paxson, and M. Allman. On dominant characteristics of residential broadband internet traffic. In IMC ’09: Proceedings of the 9th ACM SIGCOMM conference on Internet measurement conference, pages 90–102, New York, NY, USA, 2009. ACM.
[19] A. Narayanan and V. Shmatikov. How to break anonymity of the netflix prize dataset, 2006.
[20] A. Parker. The true picture of peer-to-peer filesharing, 2004. [Online] Available: http://www.cachelogic.com/.
[21] J. A. Pouwelse, P. Garbacki, D. H. J. Epema, and H. J. Sips. The bittorrent p2p file-sharing system: Measurements and analysis. In IPTPS, volume 3640 of LNCS, pages 205–216. Springer, 2005.
[22] J. Roozenburg. Secure decentralized swarm discovery in tribler, 2006.
[23] D. Stutzbach and R. Rejaie. Understanding churn in peer-to-peer networks. In IMC ’06: Proceedings of the 6th ACM SIGCOMM conference on Internet measurement, pages 189–202, New York, NY, USA, 2006. ACM.
[24] L. Vu, I. Gupta, J. Liang, and K. Nahrstedt. Measurement and modeling a large-scale overlay for multimedia streaming. In Proc. QShine, 2007.
[25] J. Yeo, D. Kotz, and T. Henderson. CRAWDAD: a community resource for archiving wireless data at Dartmouth. SIGCOMM Comput. Commun. Rev., 36(2):21–22, 2006.
[26] B. Zhang, A. Iosup, P. Garbacki, and J. Pouwelse. A unified format for traces of peer-to-peer systems. In High Performance Distributed Computing, 2009.