Ten weeks in the life of an eDonkey server
Frédéric Aidouni, Matthieu Latapy and Clémence Magnien
LIP6 – CNRS and University Pierre & Marie Curie, 104 avenue du Président Kennedy, 75016 Paris, France
ABSTRACT
This paper presents a capture of the queries managed by an eDonkey server during almost 10 weeks, leading to the observation of almost 9 billion messages involving almost 90 million users and more than 275 million distinct files. Acquisition and management of such data raises several challenges, which we discuss, as well as the solutions we developed. We obtain a very rich dataset, orders of magnitude larger than previously available ones, which we provide for public use. We finally present a basic analysis of the obtained data, which already gives evidence of non-trivial features.
1. INTRODUCTION
Collecting live data on running peer-to-peer networks is an important task to grasp their fundamental properties and design new protocols [7, 17, 15, 4, 5]. To this end, eDonkey is appealing: it is one of the currently largest and most popular peer-to-peer systems. Moreover, as it is based on servers in charge of file and source searches, it is possible to capture the traffic of such a server to observe the queries it manages and the answers it provides.
Contribution and context.
We describe here a continuous capture of UDP/IP level traffic on an important eDonkey server during almost ten weeks, from which we extract the application-level queries processed by the server and the answers it gave. This leads to the observation of 8 867 052 380 queries, involving 89 884 526 distinct IP addresses and 275 461 212 distinct fileID. We carefully anonymise and preprocess this data, in order to release it for public use and make it easier to analyse. Its huge size raises unusual and sometimes striking challenges (like for instance counting the number of distinct fileID observed), which we address.

The obtained data surpasses previously available ones regarding several key features: its wide time scale, the number of observed users and files, its rigorous measurement, encoding, and description, and/or the fact that it is released for public use. It also has the distinctive feature of dealing with user behaviors, rather than protocols and algorithms, or traffic analysis, e.g. [1, 13, 16, 9, 19]. In this regard, it is more related to previous measurement-based studies of peer behaviors in various systems, e.g. [8, 14, 20, 6, 3], and should lead to more results of this kind.

As a passive measurement on a server, it is complementary to passive traffic measurements in the network [9, 19, 16] and to client-side passive or active measurements [20, 6, 3] previously conducted on eDonkey. To the best of our knowledge, it is the first significant dataset on eDonkey exchanges released so far (though [11, 5] use similar but much smaller data), and it is the largest peer-to-peer trace ever released. Of course, it also has its own limitations (for instance, it does not contain any information on direct exchanges between clients).
2. MEASUREMENT
Since our goal was to observe real-world exchanges processed by an eDonkey server, we had to capture the traffic on an existing server (with the authorization of its administrator and within legal limits). In this context, it was crucial to avoid any significant overload on either the server itself or its administrator. Likewise, installing dedicated hardware (e.g. a DAG card) was impossible.

Moreover, it is of prime importance to ensure a high level of anonymisation of this kind of data. This anonymisation must be done in real-time during the capture. As IP addresses appear at both the UDP/IP and eDonkey/application levels, this implies that the network traffic must be decoded to application-level traffic in real-time.

Finally, we want the released data to be as useful for the community as possible, and so we want to format it in a way that makes analysis easier. This plays an important role in our encoding strategy described in Section 2.4, with a strong impact on data usability which we illustrate in Section 3.

In order to reach these goals, we set up a measurement procedure in three successive steps, as illustrated in Figure 1. First, we capture the network traffic of the eDonkey server using a dedicated program and send it to our capture machine (Section 2.2). Then this traffic is reconstructed at the IP level and decoded into eDonkey-level traffic, i.e. queries and the corresponding answers (Section 2.3). Finally, these queries are anonymised and formatted (Section 2.4) before being stored as XML documents.
Figure 1: From pcap raw traffic to XML representation.

2.1 The eDonkey protocol

eDonkey is a semi-distributed peer-to-peer file exchange system based on directory servers. These servers index files and users, and their main role is to answer searches for files (based on metadata such as filename, size or filetype) and searches for providers (called sources) of given files.

Files are indexed by an MD4 hash code, the fileID, and are characterised by at least two metadata: name and size. Sources are identified by a clientID, which is their IP address if they are directly reachable, or a 24-bit number otherwise.

eDonkey messages basically fit into four families: management (for instance, queries asking a server for the list of other servers it is aware of); file searches based on metadata, with the server's answers consisting of a list of fileID with the corresponding names, sizes and other metadata; source searches based on fileID, with the server's answers consisting of a list of sources (providers) for the corresponding files; and announcements from clients giving the server the list of files they provide.

An unofficial documentation of the protocol is available [10], as well as the source code of several clients; we do not give more details here and refer to this documentation for further information.

2.2 Traffic capture

Before starting any traffic capture, one has to obtain the agreement of a server administrator. The following guarantees made it possible to reach such an agreement: negligible impact of the capture on the system; use of the collected data for scientific research; and a high level of anonymisation (higher than required by law).

The ideal solution would have been to patch the server source code to add a traffic recording layer. However, as this code is not open-source, this was impossible. We thus had to design a traffic capture system at the IP level, and then decode this traffic into eDonkey messages.

The server is located in a datacenter to which we have no access. Installing dedicated traffic interception hardware was therefore impossible, and we had to build a software solution. To this end, we used libpcap (http://tcpdump.org), a standard Ethernet capture library. We sent a copy of the traffic to a capture machine, in charge of decoding (Section 2.3), anonymising (Section 2.4) and storing.
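For illustration, a minimal sketch of this kind of libpcap-based capture is given below. It is a sketch under our own assumptions, not the program actually deployed: the interface name, the snapshot length, the fact that only UDP traffic is kept at capture time, and the output handling are all ours.

#include <pcap.h>
#include <stdio.h>
#include <stdlib.h>

/* Called by libpcap for every captured packet: here we simply forward
   the raw header and bytes to stdout, to be shipped to the capture
   machine for decoding and anonymisation.                              */
static void handle_packet(u_char *user, const struct pcap_pkthdr *h,
                          const u_char *bytes) {
    fwrite(h, sizeof(*h), 1, stdout);
    fwrite(bytes, 1, h->caplen, stdout);
}

int main(void) {
    char errbuf[PCAP_ERRBUF_SIZE];
    struct bpf_program filter;

    /* "eth0" and the 65535-byte snapshot length are illustrative values. */
    pcap_t *handle = pcap_open_live("eth0", 65535, 0, 1000, errbuf);
    if (!handle) { fprintf(stderr, "pcap: %s\n", errbuf); return 1; }

    /* The real capture recorded all of the server's traffic; this sketch
       keeps UDP only, which is the part decoded in this paper.           */
    if (pcap_compile(handle, &filter, "udp", 1, PCAP_NETMASK_UNKNOWN) == -1
        || pcap_setfilter(handle, &filter) == -1) {
        fprintf(stderr, "filter: %s\n", pcap_geterr(handle));
        return 1;
    }

    pcap_loop(handle, -1, handle_packet, NULL);  /* capture until killed */
    pcap_close(handle);
    return 0;
}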
Figure 2: Ethernet packet losses per second during the capture, and cumulative losses in thousands of packets (inset). Horizontal axes are labelled by the number of weeks elapsed since the beginning of the measurement. By its end, 250 266 packets were lost and 31 555 295 781 were captured.

This approach leads to packet losses during the capture, due to its duration and to the network's bandwidth. Indeed, libpcap uses a buffer in which the kernel stores captured packets. During traffic peaks this buffer may be insufficient and fill up while packets keep arriving; the kernel cannot store these new packets, and some are thus lost. The number of lost packets is recorded in a kernel structure, so we know the amount of losses that occurred; see Figure 2. These losses, although very rare, make TCP flow reconstruction very difficult, as packets are missing inside flows (even without packet losses, TCP conversation reconstruction is not an easy task, as the server receives about 5 000 SYN packets per minute). In this paper, we therefore consider UDP traffic only, which constitutes about half of the captured traffic.

2.3 Decoding eDonkey messages

At the UDP level, our decoding software checks packets and re-assembles the traffic. Among the 14 124 818 158 UDP packets captured, 2 981 are fragments and 169 are not well-formed. This corresponds to 949 873 704 eDonkey messages, which are then decoded.

The captured traffic is generated by many poorly reliable clients of different kinds (and versions), each with its own interpretation of the protocol. Moreover, their source code is intricate, and the protocol embeds complex encoding optimisations. Finally, decoding the server traffic is much harder than programming a client, and required a substantial amount of manual message decoding.

Our decoder operates in two steps: a structural validation of messages (based on their expected length, for example), then, if successful, an attempt at effective decoding. Among the 949 873 704 handled eDonkey messages, only 0.68% were not decoded by our system (78% of these were structurally incorrect, and thus not decodable).
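As a rough illustration of this two-step approach, the sketch below performs a structural check on a UDP datagram before attempting message-specific decoding. The 0xE3 protocol marker comes from the unofficial specification [10]; the opcode values and minimum lengths are placeholders of ours, not the tables used by our actual decoder.

#include <stdint.h>
#include <stddef.h>

/* Hypothetical opcodes, for illustration only. */
enum { OP_SOURCE_SEARCH = 0x01, OP_FILE_SEARCH = 0x02 };

#define EDONKEY_MARKER 0xE3   /* first byte of eDonkey messages [10] */

/* Step 1: structural validation.  Returns 1 if the datagram looks like
   a well-formed eDonkey message of a known type, 0 otherwise.          */
static int validate(const uint8_t *pkt, size_t len) {
    if (len < 2 || pkt[0] != EDONKEY_MARKER)
        return 0;                         /* too short or wrong protocol   */
    switch (pkt[1]) {                     /* opcode-dependent length check */
    case OP_SOURCE_SEARCH: return len >= 2 + 16;  /* at least one fileID   */
    case OP_FILE_SEARCH:   return len >= 2 + 1;   /* non-empty query       */
    default:               return 0;      /* unknown opcode: not decodable */
    }
}

/* Step 2: effective decoding, attempted only on validated messages. */
int decode(const uint8_t *pkt, size_t len) {
    if (!validate(pkt, len))
        return -1;                        /* counted as "not decoded"      */
    /* ... message-specific decoding would go here ...                     */
    return pkt[1];
}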
2.4 Anonymisation

Anonymisation of internet traces is a subtle issue in itself [2]. Since we want to provide the obtained data for public use, we need a very strong anonymisation scheme: clientID, fileID, search strings, filenames and filesizes must all be anonymised (each with a dedicated method, described below). In addition, timestamps are replaced by the time elapsed since the beginning of the capture, to further limit de-anonymisation risks.

Filesizes are stored in kilobytes (originally they were in bytes); this precision reduction seems enough to protect this information, which raises no important privacy issue. Search strings, filenames and server descriptions are encoded by their MD5 hash code, which provides satisfying anonymisation while keeping the dataset coherent.

Anonymising clientID with a hash code is not satisfactory: if one knows the hash function, it is easy to recover the original clientID by applying the function to the 2^32 possible clientID. Shuffling strategies are not strong enough either for this very sensitive data. We therefore chose to encode clientID according to their order of appearance in the captured data: the first one is anonymised with the value 0, the second with 1, and so on. Although computationally expensive (see below), this technique has two advantages: it ensures a very strong anonymisation level, and it makes further use of the dataset much easier, as anonymised clientID are integers between 0 and N-1 (if there are N distinct clientID).

To perform this encoding, we must be able to recognise previously encountered (and anonymised) clientID. We must thus store, throughout the capture, the set of clientID already seen, together with their anonymised values. As each message contains at least one clientID, an overwhelming number of searches (several billions) must be performed in this set, as well as millions of insertions. Classical data structures (like hashtables or trees) are unsatisfactory in this context: they are too slow and/or too space-consuming. Instead, we used the fact that at most 2^32 distinct clientID exist: we used an array of 2^32 integers (hence of total size 16 gigabytes), and stored the anonymisation of each clientID in the clientID-th cell of this array. This has a high cost in central memory, but allowed us to anonymise a clientID with a single direct memory access, hence very efficiently.
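A minimal sketch of this direct-addressing scheme is given below; the sentinel value marking unseen clientID and the exact integer types are our assumptions.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NOT_SEEN 0xFFFFFFFFu            /* sentinel: clientID not met yet  */

static uint32_t *anon_table;            /* one cell per possible clientID  */
static uint32_t  next_id = 0;           /* next anonymised value to assign */

void anon_init(void) {
    size_t cells = (size_t)1 << 32;     /* 2^32 cells of 4 bytes = 16 GB   */
    anon_table = malloc(cells * sizeof(uint32_t));
    if (!anon_table) abort();
    memset(anon_table, 0xFF, cells * sizeof(uint32_t));  /* all NOT_SEEN   */
}

/* Anonymise a clientID with a single array access: return its previously
   assigned value, or assign the next integer if it is seen for the first
   time (order-of-appearance encoding).                                    */
uint32_t anon_client(uint32_t client_id) {
    if (anon_table[client_id] == NOT_SEEN)
        anon_table[client_id] = next_id++;
    return anon_table[client_id];
}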
Figure 3: Size distribution of fileID anonymisation arrays after one week of capture. One can observe abnormally large arrays when they are indexed by the first two bytes (array 0 contains 24 024 elements in this case); using other bytes reduces this significantly.

We also chose to anonymise fileID by their order of appearance. Here again, the number of insertions and searches in the corresponding set is huge, so classical set structures were not relevant in this case either. Moreover, because of the size of fileID (128 bits), we could not use the same solution as for clientID.

A possible solution would be a sorted array containing the fileID together with their anonymisation keys. Arrays are compact structures, and once sorted, dichotomic search is very fast. However, insertion has a prohibitive cost, due to the reorganisation it implies to keep the array sorted.

One may avoid this problem in a simple way, as fileID are hash codes: they are supposed to be uniformly distributed in their coding space. As a consequence, dividing the main array into equally-sized smaller ones, indexed by any part of the fileID, should reduce their sizes uniformly and thus significantly speed up insertions.

In our situation, dividing the array size by a factor of 65 536, by using the first two bytes to index 65 536 arrays, seemed a good solution: as we encounter 88 million distinct fileID in our capture, each array should contain around 1 500 elements, and sorted insertion in such arrays is reasonable.

However, implementing this strategy led to surprising results: anonymisation arrays 0 and 256 had very large sizes, see Figure 3. This shows that, in practice, a disproportionate number of fileID start with these two-byte values, and thus reveals the massive presence of forged fileID [12]. They induce the unbalanced sizes of our anonymisation arrays, which strongly hampers our computations.

We solved this problem by selecting two different bytes of the fileID to index our 65 536 arrays. Figure 3 shows that this approach does not perfectly remove the heterogeneity of array sizes, but it was sufficient for our application.

Finally, the processing method we have described is rather space-consuming, but it is able to decode UDP traffic in real-time, which is crucial in our context.
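The following sketch illustrates this bucketed scheme under our own assumptions: the pair of index bytes, the growth policy and the shift-based insertion are illustrative choices, not necessarily those of our implementation.

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NBUCKETS 65536                 /* indexed by two bytes of the fileID */

typedef struct {                       /* one (fileID, anonymised id) pair   */
    uint8_t  id[16];                   /* 128-bit MD4 fileID                 */
    uint32_t anon;
} entry_t;

typedef struct {                       /* one sorted, growable bucket        */
    entry_t *e;
    size_t   n, cap;
} bucket_t;

static bucket_t bucket[NBUCKETS];
static uint32_t next_anon = 0;

/* Index by two bytes that are not the first ones, to avoid the skew
   caused by forged fileID (bytes 7 and 8 here, an arbitrary choice).   */
static unsigned bucket_of(const uint8_t *fid) {
    return ((unsigned)fid[7] << 8) | fid[8];
}

/* Return the anonymised value of fid, assigning a new one if unseen:
   dichotomic search, then sorted insertion (a shift of ~1 500 cells at most). */
uint32_t anon_file(const uint8_t *fid) {
    bucket_t *b = &bucket[bucket_of(fid)];
    size_t lo = 0, hi = b->n;
    while (lo < hi) {                              /* binary search          */
        size_t mid = (lo + hi) / 2;
        int c = memcmp(fid, b->e[mid].id, 16);
        if (c == 0) return b->e[mid].anon;
        if (c < 0) hi = mid; else lo = mid + 1;
    }
    if (b->n == b->cap) {                          /* grow bucket if full    */
        b->cap = b->cap ? 2 * b->cap : 16;
        b->e = realloc(b->e, b->cap * sizeof(entry_t));
    }
    memmove(&b->e[lo + 1], &b->e[lo], (b->n - lo) * sizeof(entry_t));
    memcpy(b->e[lo].id, fid, 16);
    b->e[lo].anon = next_anon++;
    b->n++;
    return b->e[lo].anon;
}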
The final dataset we obtain consists of a series of 8 867 052 380 eDonkey messages (queries from clients and answers to these queries from the server) in XML format. We chose XML as output format because it leads to easy-to-read and rigorously specified text files and, once compressed, does not have a prohibitive space cost. The dataset contains very rich information on users at 89 884 526 distinct IP addresses dealing with 275 461 212 distinct fileID, while preserving the privacy of users. It is publicly available with its formal specification.
3. BASIC ANALYSIS
We present in this section a few basic analyses of the data obtained above. Thanks to our formatting, the computations needed to obtain these results have a reasonable cost. They give more detailed insight into our dataset. Notice however that these statistics are subject to measurement bias [18] and only reflect the content of our data; more careful analysis should be conducted to derive accurate conclusions on the underlying objects.
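To illustrate why the integer re-encoding keeps such computations cheap, the sketch below derives a distribution like the one of Figure 4 (discussed further down) from a stream of (clientID, fileID) provider announcements. The extraction of such pairs from the XML, their deduplication, and the input format are assumptions of ours.

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

#define NFILES 275461212u        /* distinct fileID observed in the capture */

int main(void) {
    /* providers[f] = number of clients providing file f.  Because fileID
       are re-encoded as integers 0..NFILES-1, a flat array is enough.     */
    uint32_t *providers = calloc(NFILES, sizeof(uint32_t));
    if (!providers) return 1;

    /* Input: one "clientID fileID" pair per line, assumed deduplicated.   */
    unsigned client, file;
    while (scanf("%u %u", &client, &file) == 2)
        providers[file]++;

    /* Distribution: for each value x, the number of files with x providers. */
    uint32_t max = 0;
    for (uint32_t f = 0; f < NFILES; f++)
        if (providers[f] > max) max = providers[f];
    uint64_t *dist = calloc((size_t)max + 1, sizeof(uint64_t));
    if (!dist) return 1;
    for (uint32_t f = 0; f < NFILES; f++)
        dist[providers[f]]++;
    for (uint32_t x = 1; x <= max; x++)
        if (dist[x]) printf("%u %llu\n", x, (unsigned long long)dist[x]);

    return 0;
}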
Figure 4: Distribution of the number of clients providing each file, i.e. for each value x on the horizontal axis, the number of files provided by x clients.

Figure 5: Distribution of the number of clients asking for each file, i.e. for each value x on the horizontal axis, the number of files searched by x clients.

Figures 4 and 5 present statistics from the file point of view. They clearly confirm the well-known fact that these objects have a very heterogeneous nature: the number of clients providing each file spans several orders of magnitude, as does the number of clients asking for each file. In particular, some files are provided by more than 10 000 clients, and some are searched for by almost 150 000, which is a non-negligible fraction of all observed clients (such statistics may be used to estimate the audience of the files under concern, most probably audio files or movies). On the other hand, a huge number of files are provided by very few clients: more than 3.5 million are provided by only one client, and more than one million by two clients only.

The decrease of the distribution of the number of clients providing each file is reasonably well fitted by a power law (see Figure 4), and so is the number of clients asking for each file. This captures the intrinsic heterogeneity of files regarding the number of clients providing or searching for them. This has important consequences; in particular, it makes little sense to reason in terms of an average client.

Going further, notice that a better fit would be obtained with a combination of several power laws, or more subtle laws. This may indicate that files of different natures coexist in the system, which is indeed true (for instance, audio files vs movies, or pornographic content vs classical one). Our data may help in investigating this, but this is out of the scope of this paper.
Figure 6: Distribution of the number of files provided by each client, i.e. for each value x on the horizontal axis, the number of clients providing x distinct files.

Figure 7: Distribution of the number of files each client asks for, i.e. for each value x on the horizontal axis, the number of clients searching for x distinct files.

Similarly, Figures 6 and 7 present statistics from the client point of view. They also confirm that clients are very heterogeneous regarding the number of files they provide or search for: both numbers span several orders of magnitude, with clients providing more than 5 000 files and/or searching for almost one hundred thousand files, while hundreds of thousands of clients provide or search for only a few files. This accounts for the high heterogeneity of user behaviors regarding their use of peer-to-peer systems.

Notice however that these distributions are far from power laws. The distribution of the number of provided files deviates from a power law for small values, and the number of files asked for clearly has several regimes (a slow slope at the beginning, then a sharper one, and a wide range of values with only few occurrences). This may reveal different kinds of activity, and in particular some clients scanning the network to identify many file sources (which is also indicated by the inhomogeneous distribution of fileID observed in Section 2.4). One may investigate this further by observing the correlations between the numbers of files provided and asked for, for instance, but this is out of the scope of this paper.

Finally, we observe that the distribution of the number of files provided by each client (Figure 6) exhibits an unexpectedly large number of clients providing a few thousand files. This may be due to limitations in client software, like for instance a maximal number of files manageable in a single shared directory on some systems. Likewise, the distribution of the number of files asked for by each client displays a surprisingly singular value: there is a clear peak at 52 files. This may be due to a maximal number of queries allowed by a widely used client software.
Figure 8: File size distribution, i.e. for each encountered file size (horizontal axis), the number of files having this size (vertical axis). Peaks are annotated at 175 MB, 230 MB, 350 MB, 700 MB, 1 GB and 1.4 GB, as well as for small files.
Many other statistics may be observed. For instance, we display in Figure 8 the distribution of the sizes of exchanged files (the answers of the server to some queries indicate the size of found files). One observes many small files (probably music files), and clear peaks at 700 MB (the typical size of a CD-ROM) and at fractions (1/2, 1/3, 1/4) or multiples (2×) of this value. The peak at 1 GB may indicate that users split very large files (DVD images for instance) into 1 GB pieces.

This plot reveals that, even though in principle files exchanged in peer-to-peer systems may have any size, their actual sizes are strongly related to the capacity of common exchange and storage media.
4. CONCLUSION
This paper presents a capture of the queries managed by a live eDonkey server at a scale significantly larger than before, in terms of duration, number of peers observed, and number of files observed. This dataset is available for public use (with its formal specification) in an easy-to-use and rigorous format which significantly reduces the computational cost of its analysis. We present a few basic analyses which give more information on the collected data.

This work may be extended by conducting measurements of TCP eDonkey traffic, and more generally by measuring eDonkey activity using complementary methods (active measurements from clients, for instance). The measurement duration may also be extended further, and likewise the traffic losses may be reduced.

From an analysis point of view, this work opens many directions for further research. For instance, it makes it possible to study and model user behaviors, communities of interest, how files spread among users, etc. Most of these directions were out of reach with previously available data, and they are crucial from both fundamental and applied points of view.
5. REFERENCES

[1] W. Acosta and S. Chandra. Trace driven analysis of the long term evolution of Gnutella peer-to-peer traffic. In PAM, 2007.
[2] M. Allman and V. Paxson. Issues and etiquette concerning use of shared measurement data. In IMC, 2007.
[3] F. Le Fessant, S. Handurukande, A.-M. Kermarrec, and L. Massoulié. Clustering in peer-to-peer file sharing workloads. In IPTPS, 2004.
[4] P. Gauron, P. Fraigniaud, and M. Latapy. Combining the use of clustering and scale-free nature of user exchanges into a simple and efficient P2P system. In Euro-Par, 2005.
[5] J.-L. Guillaume, S. Le-Blond, and M. Latapy. Clustering in P2P exchanges and consequences on performances. In IPTPS, 2005.
[6] S. Handurukande, A.-M. Kermarrec, F. Le Fessant, L. Massoulié, and S. Patarin. Peer sharing behaviour in the eDonkey network, and implications for the design of server-less file sharing systems. In EuroSys, 2006.
[7] S. B. Handurukande, A.-M. Kermarrec, F. Le Fessant, L. Massoulié, and S. Patarin. Peer sharing behaviour in the eDonkey network, and implications for the design of server-less file sharing systems. In EuroSys '06, pages 359–371, New York, NY, USA, 2006. ACM.
[8] D. Hughes, J. Walkerdine, G. Coulson, and S. Gibson. Peer-to-peer: Is deviant behavior the norm on P2P file-sharing networks? IEEE Distributed Systems Online, 7(2), 2006.
[9] T. Karagiannis, A. Broido, M. Faloutsos, and K. Claffy. Transport layer identification of P2P traffic. In IMC, 2004.
[10] Y. Kulbak and D. Bickson. The eMule protocol specification, 2005.
[11] S. Le-Blond, M. Latapy, and J.-L. Guillaume. Statistical analysis of a P2P query graph based on degrees and their time evolution. In IWDC, 2004.
[12] U. Lee, M. Choi, J. Cho, M. Y. Sanadidi, and M. Gerla. Understanding pollution dynamics in P2P file sharing. In IPTPS, 2006.
[13] A. Legout, G. Urvoy-Keller, and P. Michiardi. Rarest first and choke algorithms are enough. In IMC, 2006.
[14] G. Neglia, G. Reina, H. Zhang, D. Towsley, A. Venkataramani, and J. Danaher. Availability in BitTorrent systems. In INFOCOM, 2007.
[15] W. Saddi and F. Guillemin. Measurement based modeling of eDonkey peer-to-peer file sharing system. In International Teletraffic Congress, pages 974–985, 2007.
[16] W. Saddi and F. Guillemin. Measurement based modeling of eDonkey peer-to-peer file sharing system. In ITC, 2007.
[17] S. Saroiu, P. Gummadi, and S. Gribble. A measurement study of peer-to-peer file sharing systems, 2002.
[18] D. Stutzbach, R. Rejaie, N. Duffield, S. Sen, and W. Willinger. On unbiased sampling for unstructured peer-to-peer networks. In IMC, 2006.
[19] K. Tutschku. A measurement-based traffic profile of the eDonkey filesharing service. In PAM, 2004.
[20] M. Zghaibeh and K. Anagnostakis. On the impact of P2P incentive mechanisms on user behavior. In