Ten weeks in the life of an eDonkey server
Frédéric Aidouni, Matthieu Latapy and Clémence Magnien
LIP6 – CNRS and University Pierre & Marie Curie, 104 avenue du Président Kennedy, 75016 Paris, France
ABSTRACT
This paper presents a capture of the queries managed by an eDonkey server during almost 10 weeks, leading to the observation of almost 9 billion messages involving almost 90 million users and more than 275 million distinct files. Acquisition and management of such data raises several challenges, which we discuss, as well as the solutions we developed. We obtain a very rich dataset, orders of magnitude larger than previously available ones, which we provide for public use. We finally present a basic analysis of the obtained data, which already gives evidence of non-trivial features.
1. INTRODUCTION
Collecting live data on running peer-to-peer networks is an important task to grasp their fundamental properties and design new protocols [7, 17, 15, 4, 5]. To this end, eDonkey is appealing: it is one of the currently largest and most popular peer-to-peer systems. Moreover, as it is based on servers in charge of file and source searches, it is possible to capture the traffic of such a server to observe the queries it manages and the answers it provides.
Contribution and context.
We describe here a continuous capture of UDP/IP level traffic on an important eDonkey server during almost ten weeks, from which we extract the application-level queries processed by the server and the answers it gave. This leads to the observation of 8 867 052 380 queries, involving 89 884 526 distinct IP addresses and 275 461 212 distinct fileID. We carefully anonymise and preprocess this data, in order to release it for public use and make it easier to analyse. Its huge size raises unusual and sometimes striking challenges (like for instance counting the number of distinct fileID observed), which we address.

The obtained data surpasses previously available ones regarding several key features: its wide time scale, the number of observed users and files, its rigorous measurement, encoding, and description, and/or the fact that it is released for public use. It also has the distinctive feature of dealing with user behaviors, rather than protocols and algorithms, or traffic analysis, e.g. [1, 13, 16, 9, 19]. In this regard, it is more related to previous measurement-based studies of peer behaviors in various systems, e.g. [8, 14, 20, 6, 3], and should lead to more results of this kind.

As a passive measurement on a server, it is complementary to passive traffic measurements in the network [9, 19, 16] and to client-side passive or active measurements [20, 6, 3] previously conducted on eDonkey. To the best of our knowledge, it is the first significant dataset on eDonkey exchanges released so far (though [11, 5] use similar but much smaller data), and it is the largest peer-to-peer trace ever released. Of course, it also has its own limitations (for instance, it does not contain any information on direct exchanges between clients).
2. MEASUREMENT
Since our goal was to observe real-world exchanges processed by an eDonkey server, we had to capture the traffic on an existing server (with the authorization of its administrator and within legal limits). In this context, it was crucial to avoid any significant overload on either the server itself or its administrator. Likewise, installing dedicated hardware (e.g. a DAG card) was impossible.

Moreover, it is of prime importance to ensure a high level of anonymisation of this kind of data. This anonymisation must be done in real-time during the capture. As IP addresses appear at both the UDP/IP and eDonkey/application levels, this implies that the network traffic must be decoded to application-level traffic in real-time.

Finally, we want the released data to be as useful for the community as possible, and so we want to format it in a way that makes analysis easier. This plays an important role in our encoding strategy described in Section 2.4, with a strong impact on data usability which we illustrate in Section 3.

In order to reach these goals, we set up a measurement procedure in three successive steps, as illustrated in Figure 1. First, we capture the network traffic of the eDonkey server using a dedicated program and send it to our capture machine (Section 2.2). Then this traffic is reconstructed at the IP level and decoded into eDonkey-level traffic, i.e. queries and the corresponding answers (Section 2.3). Finally, these queries are anonymised and formatted (Section 2.4) before being stored as XML documents.
Figure 1: From pcap raw traffic to XML representation.

2.1 The eDonkey protocol

eDonkey is a semi-distributed peer-to-peer file exchange system based on directory servers. These servers index files and users, and their main role is to answer searches for files (based on metadata such as filename, size or filetype) and searches for providers (called sources) of given files.

Files are indexed by an MD4 hash code, the fileID, and are characterised by at least two metadata: name and size. Sources are identified by a clientID, which is their IP address if they are directly reachable, or a 24-bit number otherwise.

eDonkey messages basically fit into four families: management (for instance, queries asking a server for the list of other servers it is aware of); file searches based on metadata, with the server's answers consisting of a list of fileID with the corresponding names, sizes and other metadata; source searches based on fileID, with the server's answers consisting of a list of sources (providers) for the corresponding files; and announcements from clients giving the server the list of files they provide.

An unofficial documentation of the protocol is available [10], as well as the source code of several clients; we do not give more details here and refer to this documentation for further information.

2.2 Traffic capture

Before starting any traffic capture, one has to obtain the agreement of a server administrator. The following guarantees made it possible to reach such an agreement: negligible impact of the capture on the system; use of the collected data for scientific research; and a high level of anonymisation (higher than required by law).

The ideal solution would have been to patch the server source code to add a traffic recording layer. However, as this code is not open-source, this was impossible. We thus had to design a traffic capture system at the IP level, and then decode this traffic into eDonkey messages.

The server is located in a datacenter to which we have no access. Installing dedicated traffic interception hardware was therefore impossible, and we had to build a software solution. To this end, we used libpcap (http://tcpdump.org), a standard Ethernet capture library. We sent a copy of the traffic to a capture machine, in charge of decoding (Section 2.3), anonymising (Section 2.4) and storing.
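For illustration, a minimal sketch of this kind of libpcap-based capture is given below. It is a sketch under our own assumptions, not the program actually deployed: the interface name, the snapshot length, the fact that only UDP traffic is kept at capture time, and the output handling are all ours.

#include <pcap.h>
#include <stdio.h>
#include <stdlib.h>

/* Called by libpcap for every captured packet: here we simply forward
   the raw header and bytes to stdout, to be shipped to the capture
   machine for decoding and anonymisation.                              */
static void handle_packet(u_char *user, const struct pcap_pkthdr *h,
                          const u_char *bytes) {
    fwrite(h, sizeof(*h), 1, stdout);
    fwrite(bytes, 1, h->caplen, stdout);
}

int main(void) {
    char errbuf[PCAP_ERRBUF_SIZE];
    struct bpf_program filter;

    /* "eth0" and the 65535-byte snapshot length are illustrative values. */
    pcap_t *handle = pcap_open_live("eth0", 65535, 0, 1000, errbuf);
    if (!handle) { fprintf(stderr, "pcap: %s\n", errbuf); return 1; }

    /* The real capture recorded all of the server's traffic; this sketch
       keeps UDP only, which is the part decoded in this paper.           */
    if (pcap_compile(handle, &filter, "udp", 1, PCAP_NETMASK_UNKNOWN) == -1
        || pcap_setfilter(handle, &filter) == -1) {
        fprintf(stderr, "filter: %s\n", pcap_geterr(handle));
        return 1;
    }

    pcap_loop(handle, -1, handle_packet, NULL);  /* capture until killed */
    pcap_close(handle);
    return 0;
}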
Figure 2: Ethernet packet losses per second during the capture, and cumulative losses in thousands of packets (inset). Horizontal axes are labelled by the number of weeks elapsed since the beginning of the measurement. By its end, 250 266 packets were lost and 31 555 295 781 were captured.

This approach leads to packet losses during the capture, due to its duration and to the network's bandwidth. Indeed, libpcap uses a buffer in which the kernel stores captured packets. During traffic peaks this buffer may be insufficient and fill up while packets keep arriving; the kernel cannot store these new packets, and some are thus lost. The number of lost packets is recorded in a kernel structure, so we know the amount of losses that occurred; see Figure 2. These losses, although very rare, make TCP flow reconstruction very difficult, as packets are missing inside flows (even without packet losses, TCP conversation reconstruction is not an easy task, as the server receives about 5 000 SYN packets per minute). In this paper, we therefore consider UDP traffic only, which constitutes about half of the captured traffic.

2.3 Decoding eDonkey messages

At the UDP level, our decoding software checks packets and re-assembles the traffic. Among the 14 124 818 158 UDP packets captured, 2 981 are fragments and 169 are not well-formed. This corresponds to 949 873 704 eDonkey messages, which are then decoded.

The captured traffic is generated by many poorly reliable clients of different kinds (and versions), each with its own interpretation of the protocol. Moreover, their source code is intricate, and the protocol embeds complex encoding optimisations. Finally, decoding the server traffic is much harder than programming a client, and required a substantial amount of manual message decoding.

Our decoder operates in two steps: a structural validation of messages (based on their expected length, for example), then, if successful, an attempt at effective decoding. Among the 949 873 704 handled eDonkey messages, only 0.68% were not decoded by our system (78% of these were structurally incorrect, and thus not decodable).
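As a rough illustration of this two-step approach, the sketch below performs a structural check on a UDP datagram before attempting message-specific decoding. The 0xE3 protocol marker comes from the unofficial specification [10]; the opcode values and minimum lengths are placeholders of ours, not the tables used by our actual decoder.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical opcodes, for illustration only. */
enum { OP_SOURCE_SEARCH = 0x01, OP_FILE_SEARCH = 0x02 };

#define EDONKEY_MARKER 0xE3   /* first byte of eDonkey messages [10] */

/* Step 1: structural validation.  Returns 1 if the datagram looks like
   a well-formed eDonkey message of a known type, 0 otherwise.          */
static int validate(const uint8_t *pkt, size_t len) {
    if (len < 2 || pkt[0] != EDONKEY_MARKER)
        return 0;                         /* too short or wrong protocol   */
    switch (pkt[1]) {                     /* opcode-dependent length check */
    case OP_SOURCE_SEARCH: return len >= 2 + 16;  /* at least one fileID   */
    case OP_FILE_SEARCH:   return len >= 2 + 1;   /* non-empty query       */
    default:               return 0;      /* unknown opcode: not decodable */
    }
}

/* Step 2: effective decoding, attempted only on validated messages. */
int decode(const uint8_t *pkt, size_t len) {
    if (!validate(pkt, len))
        return -1;                        /* counted as "not decoded"      */
    /* ... message-specific decoding would go here ...                     */
    return pkt[1];
}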
2.4 Anonymisation

Anonymisation of internet traces is a subtle issue in itself [2]. Since we want to provide the obtained data for public use, we need a very strong anonymisation scheme: clientID, fileID, search strings, filenames and filesizes must all be anonymised (each with a dedicated method, described below). In addition, timestamps are replaced by the time elapsed since the beginning of the capture, to further limit de-anonymisation risks.

Filesizes are stored in kilobytes (originally they were in bytes); this precision reduction seems enough to protect this information, which raises no important privacy issue. Search strings, filenames and server descriptions are encoded by their MD5 hash code, which provides satisfying anonymisation while keeping the dataset coherent.

Anonymising clientID with a hash code is not satisfactory: if one knows the hash function, it is easy to recover the original clientID by applying the function to the 2^32 possible clientID. Shuffling strategies are not strong enough either for this very sensitive data. We therefore chose to encode clientID according to their order of appearance in the captured data: the first one is anonymised with the value 0, the second with 1, and so on. Although computationally expensive (see below), this technique has two advantages: it ensures a very strong anonymisation level, and it makes further use of the dataset much easier, as anonymised clientID are integers between 0 and N-1 (if there are N distinct clientID).

To perform this encoding, we must be able to recognise previously encountered (and anonymised) clientID. We must thus store, throughout the capture, the set of clientID already seen, together with their anonymised values. As each message contains at least one clientID, an overwhelming number of searches (several billions) must be performed in this set, as well as millions of insertions. Classical data structures (like hashtables or trees) are unsatisfactory in this context: they are too slow and/or too space-consuming. Instead, we used the fact that at most 2^32 distinct clientID exist: we used an array of 2^32 integers (hence of total size 16 gigabytes), and stored the anonymisation of each clientID in the clientID-th cell of this array. This has a high cost in central memory, but allowed us to anonymise a clientID with a single direct memory access, hence very efficiently.
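A minimal sketch of this direct-addressing scheme is given below; the sentinel value marking unseen clientID and the exact integer types are our assumptions.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NOT_SEEN 0xFFFFFFFFu            /* sentinel: clientID not met yet  */

static uint32_t *anon_table;            /* one cell per possible clientID  */
static uint32_t  next_id = 0;           /* next anonymised value to assign */

void anon_init(void) {
    size_t cells = (size_t)1 << 32;     /* 2^32 cells of 4 bytes = 16 GB   */
    anon_table = malloc(cells * sizeof(uint32_t));
    if (!anon_table) abort();
    memset(anon_table, 0xFF, cells * sizeof(uint32_t));  /* all NOT_SEEN   */
}

/* Anonymise a clientID with a single array access: return its previously
   assigned value, or assign the next integer if it is seen for the first
   time (order-of-appearance encoding).                                    */
uint32_t anon_client(uint32_t client_id) {
    if (anon_table[client_id] == NOT_SEEN)
        anon_table[client_id] = next_id++;
    return anon_table[client_id];
}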
Figure 3: Size distribution of fileID anonymisation arrays after one week of capture. One can observe abnormally large arrays when they are indexed by the first two bytes (array 0 contains 24 024 elements in this case); using other bytes reduces this significantly.

We also chose to anonymise fileID by their order of appearance. Here again, the number of insertions and searches in the corresponding set is huge, so classical set structures were not relevant in this case either. Moreover, because of the size of fileID (128 bits), we could not use the same solution as for clientID.

A possible solution would be a sorted array containing the fileID together with their anonymisation keys. Arrays are compact structures, and once sorted, dichotomic search is very fast. However, insertion has a prohibitive cost, due to the reorganisation it implies to keep the array sorted.

One may avoid this problem in a simple way, as fileID are hash codes: they are supposed to be uniformly distributed in their coding space. As a consequence, dividing the main array into equally-sized smaller ones, indexed by any part of the fileID, should reduce their sizes uniformly and thus significantly speed up insertions.

In our situation, dividing the array size by a factor of 65 536, by using the first two bytes to index 65 536 arrays, seemed a good solution: as we encounter 88 million distinct fileID in our capture, each array should contain around 1 500 elements, and sorted insertion in such arrays is reasonable.

However, implementing this strategy led to surprising results: anonymisation arrays 0 and 256 had very large sizes, see Figure 3. This shows that, in practice, a disproportionate number of fileID start with these two-byte values, and thus reveals the massive presence of forged fileID [12]. They induce the unbalanced sizes of our anonymisation arrays, which strongly hampers our computations.

We solved this problem by selecting two different bytes of the fileID to index our 65 536 arrays. Figure 3 shows that this approach does not perfectly remove the heterogeneity of array sizes, but it was sufficient for our application.

Finally, the processing method we have described is rather space-consuming, but it is able to decode UDP traffic in real-time, which is crucial in our context.
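The following sketch illustrates this bucketed scheme under our own assumptions: the pair of index bytes, the growth policy and the shift-based insertion are illustrative choices, not necessarily those of our implementation.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536                 /* indexed by two bytes of the fileID */

typedef struct {                       /* one (fileID, anonymised id) pair   */
    uint8_t  id[16];                   /* 128-bit MD4 fileID                 */
    uint32_t anon;
} entry_t;

typedef struct {                       /* one sorted, growable bucket        */
    entry_t *e;
    size_t   n, cap;
} bucket_t;

static bucket_t bucket[NBUCKETS];
static uint32_t next_anon = 0;

/* Index by two bytes that are not the first ones, to avoid the skew
   caused by forged fileID (bytes 7 and 8 here, an arbitrary choice).   */
static unsigned bucket_of(const uint8_t *fid) {
    return ((unsigned)fid[7] << 8) | fid[8];
}

/* Return the anonymised value of fid, assigning a new one if unseen:
   dichotomic search, then sorted insertion (a shift of ~1 500 cells at most). */
uint32_t anon_file(const uint8_t *fid) {
    bucket_t *b = &bucket[bucket_of(fid)];
    size_t lo = 0, hi = b->n;
    while (lo < hi) {                              /* binary search          */
        size_t mid = (lo + hi) / 2;
        int c = memcmp(fid, b->e[mid].id, 16);
        if (c == 0) return b->e[mid].anon;
        if (c < 0) hi = mid; else lo = mid + 1;
    }
    if (b->n == b->cap) {                          /* grow bucket if full    */
        b->cap = b->cap ? 2 * b->cap : 16;
        b->e = realloc(b->e, b->cap * sizeof(entry_t));
    }
    memmove(&b->e[lo + 1], &b->e[lo], (b->n - lo) * sizeof(entry_t));
    memcpy(b->e[lo].id, fid, 16);
    b->e[lo].anon = next_anon++;
    b->n++;
    return b->e[lo].anon;
}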
The final dataset we obtain consists of a series of 8 867 052 380 eDonkey messages (queries from clients and answers to these queries from the server) in XML format. We chose XML as output format because it leads to easy-to-read and rigorously specified text files and, once compressed, does not have a prohibitive space cost. The dataset contains very rich information on users at 89 884 526 distinct IP addresses dealing with 275 461 212 distinct fileID, while preserving the privacy of users. It is publicly available with its formal specification.
3. BASIC ANALYSIS
We present in this section a few basic analyses of the data obtained above. Thanks to our formatting, the computations needed to obtain these results have a reasonable cost. They give more detailed insight into our dataset. Notice however that these statistics are subject to measurement bias [18] and only reflect the content of our data; more careful analysis should be conducted to derive accurate conclusions on the underlying objects.
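To illustrate why the integer re-encoding keeps such computations cheap, the sketch below derives a distribution like the one of Figure 4 (discussed further down) from a stream of (clientID, fileID) provider announcements. The extraction of such pairs from the XML, their deduplication, and the input format are assumptions of ours.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NFILES 275461212u        /* distinct fileID observed in the capture */

int main(void) {
    /* providers[f] = number of clients providing file f.  Because fileID
       are re-encoded as integers 0..NFILES-1, a flat array is enough.     */
    uint32_t *providers = calloc(NFILES, sizeof(uint32_t));
    if (!providers) return 1;

    /* Input: one "clientID fileID" pair per line, assumed deduplicated.   */
    unsigned client, file;
    while (scanf("%u %u", &client, &file) == 2)
        providers[file]++;

    /* Distribution: for each value x, the number of files with x providers. */
    uint32_t max = 0;
    for (uint32_t f = 0; f < NFILES; f++)
        if (providers[f] > max) max = providers[f];
    uint64_t *dist = calloc((size_t)max + 1, sizeof(uint64_t));
    if (!dist) return 1;
    for (uint32_t f = 0; f < NFILES; f++)
        dist[providers[f]]++;
    for (uint32_t x = 1; x <= max; x++)
        if (dist[x]) printf("%u %llu\n", x, (unsigned long long)dist[x]);

    return 0;
}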
Figure 4: Distribution of the number of clients providing each file, i.e. for each value x on the horizontal axis, the number of files provided by x clients.

Figure 5: Distribution of the number of clients asking for each file, i.e. for each value x on the horizontal axis, the number of files searched by x clients.

Figures 4 and 5 present statistics from the file point of view. They clearly confirm the well-known fact that these objects have a very heterogeneous nature: the number of clients providing each file spans several orders of magnitude, as does the number of clients asking for each file. In particular, some files are provided by more than 10 000 clients, and some are searched for by almost 150 000, which is a non-negligible fraction of all observed clients (such statistics may be used to estimate the audience of the files under concern, most probably audio files or movies). On the other hand, a huge number of files are provided by very few clients: more than 3.5 million are provided by only one client, and more than one million by two clients only.

The decrease of the distribution of the number of clients providing each file is reasonably well fitted by a power law (see Figure 4), and so is the number of clients asking for each file. This captures the intrinsic heterogeneity of files regarding the number of clients providing or searching for them. This has important consequences; in particular, it makes little sense to reason in terms of an average client.

Going further, notice that a better fit would be obtained with a combination of several power laws, or more subtle laws. This may indicate that files of different natures coexist in the system, which is indeed true (for instance, audio files vs movies, or pornographic content vs classical one). Our data may help in investigating this, but this is out of the scope of this paper.
Figure 6: Distribution of the number of files provided by each client, i.e. for each value x on the horizontal axis, the number of clients providing x distinct files.

Figure 7: Distribution of the number of files each client asks for, i.e. for each value x on the horizontal axis, the number of clients searching for x distinct files.

Similarly, Figures 6 and 7 present statistics from the client point of view. They also confirm that clients are very heterogeneous regarding the number of files they provide or search for: both numbers span several orders of magnitude, with clients providing more than 5 000 files and/or searching for almost one hundred thousand files, while hundreds of thousands of clients provide or search for only a few files. This accounts for the high heterogeneity of user behaviors regarding their use of peer-to-peer systems.

Notice however that these distributions are far from power laws. The distribution of the number of provided files deviates from a power law for small values, and the number of files asked for clearly has several regimes (a slow slope at the beginning, then a sharper one, and a wide range of values with only few occurrences). This may reveal different kinds of activity, and in particular some clients scanning the network to identify many file sources (which is also indicated by the inhomogeneous distribution of fileID observed in Section 2.4). One may investigate this further by observing the correlations between the numbers of files provided and asked for, for instance, but this is out of the scope of this paper.

Finally, we observe that the distribution of the number of files provided by each client (Figure 6) exhibits an unexpectedly large number of clients providing a few thousand files. This may be due to limitations in client software, like for instance a maximal number of files manageable in a single shared directory on some systems. Likewise, the distribution of the number of files asked for by each client displays a surprisingly singular value: there is a clear peak at 52 files. This may be due to a maximal number of queries allowed by a widely used client software.
Figure 8: File size distribution, i.e. for each encountered file size (horizontal axis), the number of files having this size (vertical axis). Peaks are annotated at 175 MB, 230 MB, 350 MB, 700 MB, 1 GB and 1.4 GB, as well as for small files.
Many other statistics may be observed. For instance, we display in Figure 8 the distribution of the sizes of exchanged files (the answers of the server to some queries indicate the size of found files). One observes many small files (probably music files), and clear peaks at 700 MB (the typical size of a CD-ROM) and at fractions (1/2, 1/3, 1/4) or multiples (2×) of this value. The peak at 1 GB may indicate that users split very large files (DVD images for instance) into 1 GB pieces.

This plot reveals that, even though in principle files exchanged in peer-to-peer systems may have any size, their actual sizes are strongly related to the capacity of common exchange and storage media.
4. CONCLUSION
This paper presents a capture of the queries managed by a live eDonkey server at a scale significantly larger than before, in terms of duration, number of peers observed, and number of files observed. This dataset is available for public use (with its formal specification) in an easy-to-use and rigorous format which significantly reduces the computational cost of its analysis. We present a few basic analyses which give more information on the collected data.

This work may be extended by conducting measurements of TCP eDonkey traffic, and more generally by measuring eDonkey activity using complementary methods (active measurements from clients, for instance). The measurement duration may also be extended further, and likewise the traffic losses may be reduced.

From an analysis point of view, this work opens many directions for further research. For instance, it makes it possible to study and model user behaviors, communities of interest, how files spread among users, etc. Most of these directions were out of reach with previously available data, and they are crucial from both fundamental and applied points of view.
5. REFERENCES

[1] W. Acosta and S. Chandra. Trace driven analysis of the long term evolution of Gnutella peer-to-peer traffic. In PAM, 2007.
[2] M. Allman and V. Paxson. Issues and etiquette concerning use of shared measurement data. In IMC, 2007.
[3] F. Le Fessant, S. Handurukande, A.-M. Kermarrec, and L. Massoulié. Clustering in peer-to-peer file sharing workloads. In IPTPS, 2004.
[4] P. Gauron, P. Fraigniaud, and M. Latapy. Combining the use of clustering and scale-free nature of user exchanges into a simple and efficient P2P system. In Euro-Par, 2005.
[5] J.-L. Guillaume, S. Le-Blond, and M. Latapy. Clustering in P2P exchanges and consequences on performances. In IPTPS, 2005.
[6] S. Handurukande, A.-M. Kermarrec, F. Le Fessant, L. Massoulié, and S. Patarin. Peer sharing behaviour in the eDonkey network, and implications for the design of server-less file sharing systems. In EuroSys, 2006.
[7] S. B. Handurukande, A.-M. Kermarrec, F. Le Fessant, L. Massoulié, and S. Patarin. Peer sharing behaviour in the eDonkey network, and implications for the design of server-less file sharing systems. In EuroSys '06, pages 359–371, New York, NY, USA, 2006. ACM.
[8] D. Hughes, J. Walkerdine, G. Coulson, and S. Gibson. Peer-to-peer: Is deviant behavior the norm on P2P file-sharing networks? IEEE Distributed Systems Online, 7(2), 2006.
[9] T. Karagiannis, A. Broido, M. Faloutsos, and K. Claffy. Transport layer identification of P2P traffic. In IMC, 2004.
[10] Y. Kulbak and D. Bickson. The eMule protocol specification, 2005.
[11] S. Le-Blond, M. Latapy, and J.-L. Guillaume. Statistical analysis of a P2P query graph based on degrees and their time evolution. In IWDC, 2004.
[12] U. Lee, M. Choi, J. Cho, M. Y. Sanadidi, and M. Gerla. Understanding pollution dynamics in P2P file sharing. In IPTPS, 2006.
[13] A. Legout, G. Urvoy-Keller, and P. Michiardi. Rarest first and choke algorithms are enough. In IMC, 2006.
[14] G. Neglia, G. Reina, H. Zhang, D. Towsley, A. Venkataramani, and J. Danaher. Availability in BitTorrent systems. In INFOCOM, 2007.
[15] W. Saddi and F. Guillemin. Measurement based modeling of eDonkey peer-to-peer file sharing system. In International Teletraffic Congress, pages 974–985, 2007.
[16] W. Saddi and F. Guillemin. Measurement based modeling of eDonkey peer-to-peer file sharing system. In ITC, 2007.
[17] S. Saroiu, P. Gummadi, and S. Gribble. A measurement study of peer-to-peer file sharing systems, 2002.
[18] D. Stutzbach, R. Rejaie, N. Duffield, S. Sen, and W. Willinger. On unbiased sampling for unstructured peer-to-peer networks. In IMC, 2006.
[19] K. Tutschku. A measurement-based traffic profile of the eDonkey filesharing service. In PAM, 2004.
[20] M. Zghaibeh and K. Anagnostakis. On the impact of P2P incentive mechanisms on user behavior. In