A Comparative Look into Public IXP Datasets
Rowan Klöti, Bernhard Ager, Vasileios Kotronis, George Nomikos, Xenofontas Dimitropoulos
ETH Zurich, Switzerland: {rkloeti,bager,vkotroni}@tik.ee.ethz.ch
FORTH, Greece: {gnomikos,fontas}@ics.forth.gr
ABSTRACT
Internet eXchange Points (IXPs) are core components of the Internet infrastructure where Internet Service Providers (ISPs) meet and exchange traffic. During the last few years, the number and size of IXPs have increased rapidly, driving the flattening and shortening of Internet paths. However, understanding the present status of the IXP ecosystem and its potential role in shaping the future Internet requires rigorous data about IXPs, their presence, status, participants, etc. In this work, we perform the first cross-comparison of three well-known publicly available IXP databases, namely PeeringDB, Euro-IX, and PCH. A key challenge we address is linking IXP identifiers across databases maintained by different organizations. We find different AS-centric versus IXP-centric views provided by the databases as a result of their data collection approaches. In addition, we highlight differences and similarities w.r.t. IXP participants, geographical coverage, and co-location facilities. As a side-product of our linkage heuristics, we make publicly available the union of the three databases, which includes 40.2 % more IXPs and 66.3 % more IXP participants than the commonly-used PeeringDB. We also publish our analysis code to foster reproducibility of our experiments and provide preliminary insights into the accuracy of the union dataset.
Keywords
Internet Exchange Points, Peering, Euro-IX, PeeringDB, PCH
1. INTRODUCTION
A large part of the interconnection between Autonomous Systems (ASes) in the Internet is realized via Internet eXchange Points (IXPs), giving them a major role in the evolution and performance of the Internet. Notably, researchers have recently found that (i) the Internet topology is flattening due to IXP-traversing paths which bypass the classic transit hierarchy [11–13, 16], (ii) more peerings exist in a single large IXP than in previous sets of measurements for the entire Internet [7], and (iii) end-to-end delays and path lengths over IXPs are becoming shorter [8]. Furthermore, IXPs have been proposed as cradles for hosting new technologies, such as Software Defined eXchanges (SDX) [14].

However, the merits and artifacts of the available IXP data have not been thoroughly researched yet. This is in sharp contrast with extensive research on mapping the interconnections between ASes using data from various sources, like RouteViews [2], for more than a decade. A commonly-used source of IXP data in scientific studies, e.g., [9, 10, 17], is PeeringDB [6]. However, in addition to PeeringDB, two other publishers maintain public databases about the global IXP ecosystem, namely the European Internet Exchange Association (Euro-IX) [3] and Packet Clearing House (PCH) [5]. These datasets are contributed and kept up-to-date by different stakeholders, e.g., by the publisher, or through self-reporting by the IXPs and their participants.

In this work, we do the first cross-comparison of the IXP data provided by PeeringDB, Euro-IX, and PCH. We compare in depth several attributes, like IXPs' locations, facilities and participant information. We highlight the similarity of the available data, complementary information, and data discrepancies. We analyze in total data from about 499, 490, and 687 IXPs in PeeringDB, Euro-IX, and PCH, respectively. To compare the data, we introduce heuristics to link identical IXPs across the three datasets. We find an IXP-centric view provided by Euro-IX vs. an AS-centric view provided by PeeringDB, reflecting differences in their often volunteer-based data collection approaches.

Besides, we make the linked datasets and our analysis code publicly available [1] to support reproducibility of our experiments and related research efforts. Experiments where this data can be useful include, but are not limited to, (i) discovering new peer-to-peer links based on membership data and peering policy so as to augment the Internet topology view, e.g., for modeling the effect of augmented routing protocols [18], (ii) investigating the peering ecosystem from a geographical perspective per continent or country, (iii) tracking the historic evolution of IXPs and their features, (iv) pinpointing the big players in a peering setup, and (v) working with new topological paradigms such as IXP multi-graphs [15] in the context of new service provisioning. Compared to using solely PeeringDB, the union of the linked datasets includes data for 40.2 % more active IXPs and 66.3 % more IXP participants.

Finally, we perform a preliminary analysis of the accuracy of the linked datasets and find that even the combined dataset is only 75 % complete when comparing with information from BGP route collectors, indicating the need for further research in this context. Partial verification using data available on IXP websites shows more promising results in terms of accuracy, both for the biggest IXPs and for IXPs that are randomly selected from the combined pool of available IXPs. We would like to note though that the three IXP datasets are collected based on voluntary effort and as such, no formal guarantees about completeness, accuracy or freshness can generally be given.

The rest of this paper is structured as follows. We first discuss differences and similarities in particular w.r.t. the collection methodologies of the PeeringDB, Euro-IX, and PCH datasets in Section 2. Then, we introduce our heuristics to link IXPs across datasets in Section 3. We compare the IXP status, location, and facility information in Section 4 and the IXP participant information in Section 5. We discuss and evaluate the accuracy of the datasets in Section 6. Finally, Section 7 concludes our paper and points to future directions.
2. DATA SOURCES
We analyze and cross-compare the three most extensive publicly available IXP datasets, which are provided by PeeringDB [6], Euro-IX [3], and PCH [5]. The datasets inform primarily about IXPs and their participants in varying levels of detail. In Table 1 we compare the types of information and their level of availability in each of the datasets. Importantly, naming and location information is contained in all datasets, enabling us to identify and link identical IXPs in Section 3. We built custom web crawlers and parsers, which we make publicly available [1]. A crawl typically takes between 10 and 30 minutes, depending on the dataset. We acquired all datasets on September 19, 2014. In the remainder of this section, we discuss intrinsic characteristics of each dataset, shedding light on the underlying methodology used by the three data providers to collect and maintain the data.

PeeringDB [6] is a worldwide database that aims to serve ISPs which wish to participate in the IXP peering ecosystem. The data available consists of 499 IXPs, their facilities and their participants (i.e., peering ASes). PeeringDB has detailed information about all registered IXPs, unlike Euro-IX which only has detailed information about its affiliate IXPs, while data on non-affiliate IXPs is limited to name, location and status. Moreover, PeeringDB provides detailed information about individual participants, i.e., ASes that peer at IXPs. The data is self-reported by both IXPs and participants.
Our second dataset is a list of 490 IXPs provided by the European Internet Exchange Association (Euro-IX) [3]. Its membership consists mostly of European IXPs, which are typically run as cooperative non-profit entities, in contrast to North American Internet exchanges, which are often run as for-profit businesses. Accordingly, European Internet exchanges are generally transparent about peering arrangements. Some of the largest IXPs are located in Europe. Euro-IX supplies information both for affiliated and non-affiliated IXPs. According to the official Euro-IX website [3], “the database information is a combination of both affiliated and non-affiliated IXP content. While the affiliated IXP content is highly accurate, the non-affiliated IXP content is updated on a best effort basis and is nonetheless considered to be quite accurate”. From direct communication with Euro-IX staff, we know that the information is generally provided by the IXPs themselves. About two thirds of the IXPs represented have an account to keep their data up-to-date by self-reporting, while 62 of these IXPs (approximately 14 %) have automated the update procedure, which helps improve data completeness and accuracy. Euro-IX provides a website URL and a contact email for all IXPs, and lists participants (i.e., the ASes which connect to an IXP) for 285 of the IXPs. For a subset of IXPs (we assume these are the ones which are registered members of Euro-IX), more detailed information is available (cf. Table 1). For IXP participants there is limited information, including AS numbers (ASN), name, update time-stamp, IPv6 support capability, and sometimes a URL.

Euro-IX does not provide details about IXPs' individual co-location facilities. However, location information at the city level and, for most IXPs, geographical coordinates are available. Since IXPs can be distributed over several co-location facilities, these location values may not accurately reflect the physical IXP location. For instance, CyrusOne is a distributed (likely not Euro-IX affiliated) IXP in Arizona and Texas with points of presence in Austin, Dallas, Houston, Phoenix and San Antonio, but appears in the Euro-IX database only at Carrollton, a suburb of Dallas, where its corporate headquarters are located. In addition, Euro-IX does not provide information about IP address prefixes assigned to IXPs, which could potentially be used for linking IXPs across databases.
The Packet Clearing House (PCH) is a non-profit research institute concerning itself with Internet routing and traffic exchange, among other areas pertaining to Internet operation and economics. PCH provides an extensive directory of 687 IXPs [5], including many historical ones. Indeed, Chatzis et al. [10] claim that PCH never removes IXPs from the listing, and marks them defunct only after sufficient verification. According to direct communication with PCH staff, 70 % of the IXPs listed are compiled by PCH staff, 25 % are contributed by the Internet community and some 5 % are added by the IXP operators themselves. PCH peers at many IXPs itself; the BGP information PCH obtains over these peerings is then used to derive participant lists. PCH also compiles traffic data from MRTG files (for 24 IXPs); the other data sources do not have automatic traffic information. For 190 subnets (corresponding to nearly as many IXPs) participant data is entered manually. PCH reports on a per-port basis, not a per-participant basis. As such, an ASN can appear multiple times as a member of an IXP. There are also numerous instances of participant entries containing peering IP addresses but no ASNs. We only consider entries with ASNs, as we have no other consistent basis for matching the participants across datasets.
During our data pre-processing and analysis, we observed several artifacts (some quite time consuming) in the datasets, which we report here to simplify future researchers' work.

PeeringDB has two sources of information on connectivity between IXPs and ASes. For each IXP, there is a list of participants, including ASNs. However, for every participant, there is also a list of IXPs. These do not necessarily coincide. A quarter of IXPs present in the PeeringDB dataset have differences between the two sources of information, with more ASNs being listed in the participants' IXP list. This is a consequence of the fact that some participants advertise more than one ASN. The difference in terms of number of participants is 5.7 % on average, although typically no more than a handful of entries. Only 0.5 % of ASNs are responsible for this difference. In general, using the latter data source (participants' IXPs) is preferable due to a slightly higher completeness.

The Euro-IX dataset has 20 IXPs whose participants consist partially, and nine whose participants consist entirely of the reserved ASN “0”. In these cases, the administrator has apparently neglected to enter an ASN. These participants contribute about 2 % of the participant entries present, and there are no other duplicate entries.

We also note that PCH has 39 IXPs which have multiple participant entries with the same ASN, with 237 ASNs duplicated in total. Many others have no associated ASN reported at all. As noted in Section 2.3, this is a result of the port-based reporting used by PCH.
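The artifacts described in this section suggest a small sanitization pass before any cross-dataset comparison. The following is a minimal sketch, assuming participant entries are available as (IXP identifier, ASN) pairs; the function name and the record layout are hypothetical and are not taken from the published analysis code [1].

```python
from typing import Iterable, Optional, Set, Tuple


def sanitize_participants(
    entries: Iterable[Tuple[str, Optional[int]]],
) -> Set[Tuple[str, int]]:
    """Return the unique (ixp_id, asn) links found in raw participant entries.

    Entries without an ASN (e.g. a peering IP address only) and entries
    carrying the reserved ASN 0 are dropped. Duplicates caused by per-port
    reporting collapse automatically because the result is a set.
    """
    links: Set[Tuple[str, int]] = set()
    for ixp_id, asn in entries:
        if asn is None or asn == 0:
            continue
        links.add((ixp_id, asn))
    return links


# Toy input; the identifiers and ASNs are illustrative only.
raw = [
    ("ixp-a", 64500),
    ("ixp-a", 64500),  # same ASN reported on a second port
    ("ixp-a", None),   # peering IP address without an ASN
    ("ixp-b", 0),      # reserved ASN entered as a placeholder
    ("ixp-b", 64501),
]
clean = sanitize_participants(raw)
```

Representing memberships as a set of pairs keeps later set operations (intersection, union) straightforward.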
3. LINKING IXPS ACROSS DATASETS
In this section we describe our methodology for identifying and linking identical IXPs in different datasets as well as other pre-processing steps that were necessary to sanitize the data. We use the term mapping to refer to identical IXPs that have been linked in two datasets. The key challenge is that IXPs lack consistent identifiers across the datasets. There are several cases of IXPs sharing the same name when they are separate entities, and many cases of identical IXPs being represented by different names in the three datasets. An example is ‘SIX’, a name that occurs with minor variations 5 times in PeeringDB (i.e., SIX, S-IX, SIX.SK, SIX SI, SIX NO for Seattle-, Stuttgart-, Slovak-, Slovenian-, Stavanger-IXP respectively). In Euro-IX, there are only three variations of ‘SIX’, two of which do not directly match the ones in PeeringDB, and at least two different IXPs in Euro-IX share the exact name ‘SIX’. In addition, for various reasons (e.g., geographically distributed IXPs), some IXPs exist as single entities in one dataset and as multiple entities in the other.

Due to the large number of IXPs in each dataset, linking all IXPs manually is very tedious and time consuming. Unfortunately, a fully automated approach is not desirable, either, as human expertise is necessary to validate possibly ambiguous mappings.

Table 1: Comparison of information available from the Euro-IX, PeeringDB, and PCH datasets. Available = ✓, mostly available = +, sometimes available = ◦.
Columns (grouped under “IXP” and “Members”): Country and city; Continent; Coordinates; Long Name; Common Name; Status (active); Media Type (Ethernet, etc); Protocols supported; Website; Contact information; Costs; Establishment date; Membership requirements; AS Number; Network internals; Associated members facilities; Detailed facility info; Organization; ASN; IP address (at IXP); Company Name; Company Website; Protocols supported; Date Last Updated; URLs; Network details; Policy information; Approx prefixes; TXT Record; Network status.
Marks per dataset, in column order (cells without a mark are not shown):
Euro-IX: ✓ + ◦ ✓ ✓ ◦ ✓ + ✓ ✓ ◦ ◦ ◦ ◦ ✓ ✓ ◦ ✓ ✓ + ✓
PeeringDB: ✓ ✓ ◦ ✓ ✓ ✓ ✓ ✓ ◦ ✓ ✓ ✓ ✓ ✓ + ✓ + + + + + + +
PCH: + + + + + ✓ ◦ ◦ ◦ ◦ ◦ + ◦ ◦ ◦ ✓
For these reasons, we use a hybrid approach, in which we first automatically produce candidate mappings based on custom heuristics and then we manually verify which candidates actually correspond to the same IXP. Our heuristics to generate candidates for mapping exploit IXP naming and location information and are inclusive in their design. In other words we are conservative in ruling out possible mappings, at the cost of additional manual validation effort.

During our analysis we found that IXPs are sometimes presented at different granularity in the different datasets, e.g., at a facility level in one dataset and as a whole in another. Thus we first merge such sibling IXP records into single entities using the same overall approach as with linking IXPs across datasets. We produce mapping candidates for IXPs that share the same name and location. We explored several schemes for transforming names in order to get good mapping candidates between the different datasets. We apply these name transforming schemes one-by-one, on the original name. After each step, we manually check the produced mappings and remove successfully mapped IXPs from the working datasets. All datasets provide name aliases, which we also take into consideration. Moreover, differences in the location naming convention require additional pre-processing.

Overall, we first merge 26 sibling IXP records into 7 IXPs for a total of 471 IXPs in the Euro-IX dataset, 30 siblings into 12 IXPs for a total of 480 IXPs in the PeeringDB dataset, and 47 siblings into 18 IXPs for a total of 657 IXPs in the PCH dataset. We then use the following heuristics to produce candidates (with the results for Euro-IX/PeeringDB, Euro-IX/PCH, PeeringDB/PCH being respectively reported next to each variant):
1. Directly identical names (214 / 184 / 162 mappings).
2. Converting to lower case (16 / 21 / 26 new mappings).
3. Truncating the name at the second word boundary (2 / 15 / 3 new mappings).
4. Truncating the name at the first word boundary (67 / 101 / 76 new mappings).
5. Removing non-word characters (4 / 8 / 8 new mappings).
6. Various combinations of these, and manual matching (the remaining mappings).

We also explored heuristics based on common IXP member information such as ASNs. However, this turned out to be insufficient in practice due to incomplete reporting of IXP member ASNs (cf. Section 5.1). Another possible attribute that could be explored for linking is assigned IXPs' IP address prefixes. This data is provided by PeeringDB and PCH, but not by Euro-IX. We therefore did not consider it.

Table 2: Intersection and union of the IXP sets (active IXPs) which are present in different combinations of datasets, as well as similarity indexes for the sets.
Euro-IX  PeeringDB  PCH  | Intersection  Union | Jaccard  Overlap
   ✓         ✓       ✓   |     273        673  |  40.6%    73.0%
   ✓         ✓       -   |     355        566  |  62.7%    80.5%
   ✓         -       ✓   |     303        512  |  59.2%    81.0%
   -         ✓       ✓   |     288        566  |  50.9%    77.0%

In total we find 380, 379 and 344 mappings, respectively. Table 2 shows the size of the intersection (the IXPs that match based on the previous process) and the union (all IXPs) of the datasets, as well as the Jaccard index and overlap index between two sets A and B defined as:

$$J(A, B) = \frac{|A \cap B|}{|A \cup B|} \quad \text{and} \quad O(A, B) = \frac{|A \cap B|}{\min(|A|, |B|)}.$$

Intuitively, the Jaccard index indicates the similarity between sets, while the overlap index indicates the degree to which the smaller set is a subset of the larger. We include both in order to indicate the extent to which the difference is simply the result of one dataset being more complete than the other, rather than the datasets being partially orthogonal. For comparing all three sets we use straightforward extensions of the Jaccard and overlap indices, using all three sets as parameters. All mappings have been manually verified and our approach to generate candidates for mapping is inclusive as explained beforehand. We therefore do not expect false mappings, but we could have missed a few mappings in cases where we had insufficient or ambiguous information.

We highlight that the datasets provide a lot of complementary information. We interpret this, as well as the differences in IXP names, as indicators that the datasets do not in general have a common source. We further elaborate on this finding in the next section. In total, we find 441, 480 and 374 active IXPs in the Euro-IX, PeeringDB and PCH datasets (after merging), respectively.
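The two similarity indexes can be computed directly on sets of linked IXP identifiers. The sketch below is illustrative only: the variadic form is an assumed reading of the "straightforward extension" to three sets (intersection of all sets over their union, or over the size of the smallest set), and the toy identifiers are made up.

```python
from typing import AbstractSet


def jaccard(*sets: AbstractSet) -> float:
    """Size of the common intersection divided by the size of the union."""
    union = set().union(*sets)
    inter = set(sets[0]).intersection(*sets[1:])
    return len(inter) / len(union) if union else 0.0


def overlap(*sets: AbstractSet) -> float:
    """Size of the common intersection divided by the size of the smallest set."""
    inter = set(sets[0]).intersection(*sets[1:])
    smallest = min(len(s) for s in sets)
    return len(inter) / smallest if smallest else 0.0


# Toy identifier sets (assumed, not real IXP data).
euro_ix = {"a", "b", "c", "d"}
peeringdb = {"b", "c", "d", "e", "f"}
pch = {"c", "d", "g"}

pair_j = jaccard(euro_ix, peeringdb)         # 3 shared / 6 in the union
pair_o = overlap(euro_ix, peeringdb)         # 3 shared / 4 in the smaller set
triple_j = jaccard(euro_ix, peeringdb, pch)  # 2 shared / 7 in the union
triple_o = overlap(euro_ix, peeringdb, pch)  # 2 shared / 3 in the smallest set
```

Applied to the Euro-IX/PeeringDB row of Table 2, the same definitions give 355/566 ≈ 62.7 % for the Jaccard index and 355/min(441, 480) ≈ 80.5 % for the overlap index; for all three datasets, 273/673 ≈ 40.6 % and 273/374 ≈ 73.0 %.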
If we also consider inactive IXPs (e.g., IXPs marked as “defunct” or “unknown”) there are 471, 480 and 657 IXPs in the Euro-IX, PeeringDB and PCH datasets. Note that 43.1 % of the IXPs present in the PCH dataset are inactive. We make the compiled datasets available in [1]. Compared to the commonly-used PeeringDB, the combined dataset includes information for 40.2 % more active IXPs.

Table 3: IXPs in each database by continent. For each continent, we display the countries and cities with the most IXPs. The values reported are based on raw data before merging sibling IXPs because some IXPs are distributed in multiple cities.
Location                                                  | Number of IXPs
Continent      Country                    City            | Euro-IX  PeeringDB  PCH
Africa         Total                                      |    31       25       30
Asia Pacific   Japan                      Tokyo           |     9        6       11
               Japan                      Total           |    17       14       23
               Indonesia                  Jakarta         |     4        8        9
               Indonesia                  Total           |     6       13       16
               Total                                      |    75       88      116
Australia      Total                                      |    16       20       23
Europe         Russian Federation                         |    24       24       19
               France                     Paris           |     9        8       14
               France                     Total           |    19       20       28
               Germany                                    |    16       16       25
               United Kingdom             London          |     7       12       10
               United Kingdom             Total           |    15       12       22
               Sweden                                     |    13       11       14
               Poland                                     |    11       12       10
               Total                                      |   201      196      200
Middle East    Total                                      |     8        8       10
North America  United States of America   New York        |     8        7       14
               United States of America   Los Angeles     |     5        3       10
               United States of America   Chicago         |     4        4        9
               United States of America   Total           |    92       89      156
               Canada                                     |    13       16       17
               Total                                      |   110      107      179
South America  Brazil                                     |    28       41       36
               Total                                      |    48       55       64
World          Total                                      |   490      499      687
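The candidate-generation heuristics of Section 3 can be sketched as a sequence of name transformations, each applied to the original name and matched together with the location. This is a hedged illustration, not the published code [1]: the record layout, the reading of "word boundary" as whitespace-separated words, and the automatic acceptance of every candidate (the paper verifies candidates manually before removing them) are all assumptions.

```python
import re
from typing import Callable, Dict, List, Tuple

Record = Tuple[str, str]  # (IXP name, city); hypothetical record layout


def first_words(n: int) -> Callable[[str], str]:
    """Truncate a name after its first n whitespace-separated words."""
    return lambda name: " ".join(name.split()[:n])


# Transformations in the order listed in Section 3.
TRANSFORMS: List[Tuple[str, Callable[[str], str]]] = [
    ("identical", lambda name: name),
    ("lower case", str.lower),
    ("second word boundary", first_words(2)),
    ("first word boundary", first_words(1)),
    ("non-word characters removed", lambda name: re.sub(r"\W+", "", name)),
]


def candidate_mappings(
    left: List[Record], right: List[Record]
) -> List[Tuple[Record, Record, str]]:
    """Produce candidate mappings; matched records leave the working sets."""
    remaining_left, remaining_right = list(left), list(right)
    candidates: List[Tuple[Record, Record, str]] = []
    for label, transform in TRANSFORMS:
        # Index the right-hand records by (transformed name, city).
        index: Dict[Tuple[str, str], List[Record]] = {}
        for rec in remaining_right:
            index.setdefault((transform(rec[0]), rec[1]), []).append(rec)
        used_left, used_right = set(), set()
        for rec in remaining_left:
            for other in index.get((transform(rec[0]), rec[1]), []):
                candidates.append((rec, other, label))
                used_left.add(rec)
                used_right.add(other)
        # Assumption: every candidate is accepted, so it is removed here.
        remaining_left = [r for r in remaining_left if r not in used_left]
        remaining_right = [r for r in remaining_right if r not in used_right]
    return candidates


# Toy records; the name variants and cities are illustrative.
dataset_a = [("SIX", "Seattle"), ("DE-CIX Frankfurt", "Frankfurt"),
             ("AMS-IX", "Amsterdam"), ("SIX", "Stuttgart")]
dataset_b = [("SIX", "Seattle"), ("DE-CIX", "Frankfurt"),
             ("ams-ix", "Amsterdam"), ("S-IX", "Stuttgart")]
candidates = candidate_mappings(dataset_a, dataset_b)
labels = [label for _, _, label in candidates]
```

Requiring the location to match keeps the two 'SIX' records apart even though their names are identical, which is the ambiguity the manual verification step is meant to catch.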
4. STATUS, LOCATIONS, AND FACILITIES
In this section we compare the PeeringDB, Euro-IX, and PCH databases with respect to the geographical distribution of IXPs, the co-location facilities that house IXPs, and the IXP status information.
All of the datasets contain information concerning the location of IXPs. Based on this, in Table 3 we show the geographical distribution of the IXPs across the globe, and compare how different regions are represented in each dataset. We observe that the geographical coverage of Euro-IX and PeeringDB is similar, while PCH has somewhat richer coverage in terms of sheer IXP numbers (including inactive IXPs). On the continent level, Europe has the largest share of IXPs, which corresponds to approximately 40 % in the Euro-IX and PeeringDB datasets and 30 % in the PCH dataset. Interestingly, Euro-IX does not have substantially more IXPs represented in Europe than the other datasets. The next largest region is North America, where PCH has much greater numbers than the other datasets; as discussed in Section 4.3, this is largely due to inactive IXPs. PCH also has a greater number of IXPs for the Asia-Pacific region, with Euro-IX having the least. The other regions are broadly similar. The ranking of the largest countries is also similar across the datasets. The largest cities differ more, with only major world cities being consistently at the top of all of the datasets. In line with our expectations, it appears that more affluent regions have a better coverage by IXPs.
Euro-IX provides only the number of facilities for a limited subset of 106 IXPs (22 %), with these IXPs having a mean and median of 6 and 3 facilities, respectively. PCH generally does not provide any facility-related information, although occasionally multiple addresses are listed. In contrast, PeeringDB contains detailed information about facilities, representing them with separate database entities. There are 1,465 facilities listed, 365 of which are in the United States, 126 in Germany, 114 in the United Kingdom, 94 in France and 86 in the Netherlands. The majority of the facilities are not associated with an IXP, while 298 IXPs do not report their facilities. 16 facilities are associated with neither IXP nor ISP entities. These observations suggest that the information on the IXPs' facilities is limited. Besides, 133 of the facilities associated with an IXP have more than one IXP present, while 112 IXPs are present at more than one facility and 13 are present at more than 10. This indicates that large IXPs are in reality geographically distributed entities.

Understanding the drivers and implications of this expansion and transformation that large IXPs undergo is an interesting subject for future work.
The Euro-IX and PCH datasets contain information about the status of IXPs, i.e., whether or not they are currently active. Of all the IXPs in the Euro-IX dataset, 460 are marked as active, 23 as defunct and 7 as under construction. The PCH dataset contains 392 marked active, 90 defunct, 43 planned, 6 deprecated, while 92 have an unknown status. In the PCH dataset 52 entries have the status “not an exchange”. Of the 379 common IXPs between these two datasets, 303 share an active status, while 9 share a defunct status. 10 of the matched entries appear as defunct only in the PCH dataset and 4 only in the Euro-IX dataset. Overall, the status information of the 379 linked IXPs is 82.8 % consistent between the Euro-IX and PCH datasets.

PeeringDB contains no information on the status of IXPs. Still, a total of 28 PeeringDB entries are marked as defunct in at least one of the Euro-IX (21 entries) or PCH (15 entries) datasets. It is noteworthy that of these 28 IXPs only six report zero participants in PeeringDB, while the others usually report between one and 20, with one IXP reporting 43 participants. We also checked the websites of IXPs marked as deprecated in Euro-IX or PCH, but still reported on PeeringDB. The results showed that most websites cannot be reached or have extremely few members. For example, NWIX Missoula reports only 4 active members, LIX (Luxembourg) has merged with LU-CIX, and five websites do not report an active IXP any more.

Lastly, all but two of the IXPs appearing only in the Euro-IX dataset (38) are marked as active. In contrast, half of the 259 IXPs which are only present in the PCH dataset are either defunct (65) or have unknown status (65), and only 56 of these IXPs are marked as active. Many of the PCH-only IXPs are located in North America. Indeed, according to the PCH dataset, North America has the largest number of defunct IXPs, which is likely due to IXPs deployed in the early history of Internet development.
5. IXP PARTICIPANTS
For many use cases, the participants (i.e., peering ASes) of IXPs constitute the most important content of the datasets. Thus, we take a closer look at them in this section.
Excluding IXPs which have no participants listed, the Euro-IX, PeeringDB and PCH datasets have a mean of 44.3, 27.0 and 30.8 participants per IXP, respectively, with corresponding medians of 17, 8 and 15. This suggests that PeeringDB entries have on average considerably fewer IXP participants listed than Euro-IX entries. Fig. 1a shows the distribution of participant counts for the three datasets. We see that, in general, Euro-IX has the largest number of participants per IXP. Euro-IX provides an IXP-centric view as its data is primarily self-reported by IXPs. Besides, IXPs affiliated with Euro-IX typically have a high number of participants (a mean of 104 and a median of 53, contrasting with a mean of 24 and a median of 13 for non-affiliates) as a result of more complete reporting and also because many of the largest IXPs, e.g., LINX, AMS-IX, and DE-CIX, are Euro-IX affiliates. This indicates that large IXPs are generally better represented in the Euro-IX database.

[Figure 1, panels (a) By IXP and (b) By ASN: x-axis in panel (a) is the number of members (logarithmic scale), y-axis is frequency; curves for Euro-IX, PeeringDB and PCH.]
Figure 1: CDFs of the ASes per IXP (Fig. 1a) versus the IXPs per AS (Fig. 1b), for each of the databases. In Fig. 1a, IXPs with no participants are omitted.

On the other hand, 205 Euro-IX IXPs, 104 PeeringDB IXPs and 636 PCH IXPs have no participants listed. 89 % (53 %) of the Euro-IX (PCH) IXPs which have no participants listed are marked as active. Interestingly enough, seven Euro-IX affiliate IXPs have no participants in the Euro-IX database. Of these, only two separate IXPs appear in each one of the other databases. One of these, CyrusOne, has a limited amount of information about their IXP connectivity available in PeeringDB.

We further analyze IXP participants from the perspective of the participating ASes. The Euro-IX dataset contains records of 6,697 ASes, connected to 1.9 IXPs on average. In PeeringDB, there are 3,784 ASes represented; these are connected to an average of 2.8 IXPs. Finally, PCH contains 1,138 ASes, connected to an average of 1.4 IXPs. 2,167 (Euro-IX), 1,999 (PeeringDB) and 201 (PCH) ASes are connected to more than one IXP; 98, 127 and 5 are connected to more than ten, respectively. Table 4 shows the ASNs which are connected to the largest number of IXPs. We see that Packet Clearing House is among the most prolific peers. PCH's ASN 3856 is used to acquire BGP dumps, reflecting its strategy for data acquisition. PCH's ASN 42 is used for hosting anycasted DNS zones. We also note the presence of large CDNs, like Akamai. Fig. 1b shows the distribution of participant counts from the ASes' perspective for the three databases.
The values of IXPs per AS for PeeringDB are generally higher than the values for Euro-IX. These differences likely stem from the mechanisms with which the datasets are formed. In contrast to Euro-IX, PeeringDB provides an AS-centric view as its data is self-reported by ASes.
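The per-IXP and per-AS statistics discussed above follow from counting each side of the membership links. The sketch below is illustrative, assuming the links are already available as a set of (IXP identifier, ASN) pairs; the helper names are hypothetical and the data is a toy example.

```python
from collections import Counter
from statistics import mean, median
from typing import Iterable, List, Set, Tuple


def degree_counts(links: Set[Tuple[str, int]]) -> Tuple[Counter, Counter]:
    """Count participants per IXP and IXPs per ASN from (ixp, asn) links."""
    per_ixp = Counter(ixp for ixp, _ in links)
    per_asn = Counter(asn for _, asn in links)
    return per_ixp, per_asn


def empirical_cdf(values: Iterable[int]) -> List[Tuple[int, float]]:
    """Return (value, fraction of observations <= value) per distinct value."""
    ordered = sorted(values)
    n = len(ordered)
    cdf: List[Tuple[int, float]] = []
    for i, value in enumerate(ordered, start=1):
        if cdf and cdf[-1][0] == value:
            cdf[-1] = (value, i / n)  # repeated value: keep the highest rank
        else:
            cdf.append((value, i / n))
    return cdf


# Toy links; IXPs without participants never appear, as in Fig. 1a.
links = {
    ("ixp-a", 64500), ("ixp-a", 64501), ("ixp-a", 64502),
    ("ixp-b", 64500),
    ("ixp-c", 64500), ("ixp-c", 64501),
}
per_ixp, per_asn = degree_counts(links)
mean_participants = mean(per_ixp.values())
median_participants = median(per_ixp.values())
ixp_cdf = empirical_cdf(per_ixp.values())
```

Because both views are derived from the same set of links, the total count is identical on each side; only the way it is distributed over IXPs or ASes differs.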
We build IXP-to-ASN links for each dataset, which represent (IXP, ASN) memberships, and perform set-theoretic operations on the extracted links using the Jaccard and overlap indexes as introduced in Section 3. In Table 5 we compare the number and similarity of the IXP participants by continent and IXP sizes.
Table 4: The ASNs connecting to the largest number of IXPs (ranked by the sum). The ancillary information is as reported by PeeringDB.
                                                              | Number of IXPs
ASN    Name                      Policy     Network Type      | Euro-IX  PeeringDB  PCH
20940  Akamai Technologies       Open       Content           |    61       91       31
6939   Hurricane Electric        Open       NSP               |    66       84       32
15169  Chief Telecom Inc.        Open       NSP               |    60       76       24
3856   Packet Clearing House     Open       Educ./Research    |    50       74       21
42     Packet Clearing House     Open       Educ./Research    |    44       75       21
8075   Microsoft                 Selective  NSP               |    37       59       22
22822  Limelight Networks        Selective  Content           |    41       39       18
15133  EdgeCast Networks, Inc.   Open       Content           |    25       31       18
16509  Chief Telecom Inc.        Open       NSP               |    21       44        7
10310  Yahoo!                    Selective  Content           |    27       27       14

The Jaccard index of IXP-ASN links between Euro-IX and PeeringDB is at a mere 40 %. Merging PeeringDB with Euro-IX increases the available IXP membership information by 58.9 %. This number goes to 66.3 % when merging PeeringDB both with Euro-IX and PCH. Note that the similarity between the Euro-IX and PeeringDB participant information is greatest in Europe, the region for which both datasets have the largest quantity of membership information (links in Table 5). In the case of Euro-IX, this constitutes well over half of all participant information available. 75 % of the links in Europe (corresponding to 46 % of all links) are contributed by just the Euro-IX affiliated IXPs. Other regions are reported more sparsely, yielding lower similarity: North and South America have Jaccard indexes of 35 % and 38 %, respectively, and other regions have values under 30 %. For the Middle East, the number of participants is so small that the similarity is not meaningful.

As expected, the Jaccard index is much lower for comparisons involving the PCH dataset due to the limited membership data within the PCH dataset. In terms of the overlap index, the PCH dataset has nearly the same (low) similarity to both of the other datasets, but there are some notable differences between regions: PCH is more in line with Euro-IX within Europe, and otherwise closer to PeeringDB. However, these differences are small in regions with a meaningful amount of information.

Looking at the size categories in Table 5, we find that larger IXPs have a greater similarity, across all pairs of datasets. This holds for both the Jaccard and overlap index. Unfortunately, PCH does not provide participant information for the IXPs in the largest size category, namely AMS-IX, DE-CIX (both Frankfurt and Hamburg), LINX, NIX.CZ (Prague), PTT São Paulo, and SIX (Seattle).
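The relation between the Jaccard index, the set sizes, and the gain from merging two datasets can be made explicit with a little arithmetic. The sketch below is an illustration under the definitions of Section 3; the function names are hypothetical. Since J = I / (|A| + |B| - I), the intersection size is I = J (|A| + |B|) / (1 + J).

```python
def intersection_from_jaccard(size_a: float, size_b: float, jaccard: float) -> float:
    """Recover |A ∩ B| from the set sizes and the Jaccard index.

    J = I / (|A| + |B| - I)  =>  I = J * (|A| + |B|) / (1 + J)
    """
    return jaccard * (size_a + size_b) / (1 + jaccard)


def union_gain(size_base: float, size_other: float, intersection: float) -> float:
    """Relative increase in links when `other` is merged into `base`."""
    union = size_base + size_other - intersection
    return union / size_base - 1


# Totals for Euro-IX and PeeringDB from Table 5, with the rounded Jaccard index.
shared = intersection_from_jaccard(12584, 10269, 0.401)
gain = union_gain(10269, 12584, shared)
```

With the rounded index of 40.1 %, this yields roughly 6,500 shared links and a gain of about 59 % over PeeringDB alone, in line with the reported 58.9 %; the small difference stems from rounding of the index.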
6. COMPLETENESS OF THE IXP PARTICIPANT DATA
In this section we do a first analysis of the accuracy of the IXP participant information extracted from the three databases. In particular, we try to answer the question of the completeness of the collected information. We cross-compare the collected lists with IXP participant data extracted from 1) live BGP sessions observed in IXP route collector BGP summary data; and 2) 40 IXP websites.
In Section 3 and Section 4 we showed that by linking the available IXP datasets we can significantly increase the available information about IXPs and their participants. In this section, we extract IXP participant information from BGP summaries collected by PCH at 77 of their route collectors [4] to compare and evaluate the completeness of the participant information in all datasets, including the linked one. The BGP data include information about established sessions with BGP peers over the IXP in contrast to the partially self-reporting origins of the other datasets. Thus, they are a ground truth for BGP peering sessions. PCH tries to openly peer with all other IXP participants. Still, the data may miss participants who do not choose to peer with PCH. We assume that all peer ASes seen by the IXP route collector peer over the IXP fabric. To verify this, we manually scanned the next hop IPs and ASNs within the summary records to determine which ASNs are actually peering at the IXPs by checking for IP addresses from the prefixes assigned to the IXPs. We used BGP data collected on the 19th of Sept 2014, i.e., the same date as the other datasets, and again successfully linked the IXP identifiers of the 77 available PCH BGP route collectors with the IXP identifiers in the other datasets using AS membership and IP address information. The route collectors contain location information in their name (typically an airport code) which we utilized for further verification of the linked identifiers.

| Category | Euro-IX links | PeeringDB links | PCH links | Euro-IX/PeeringDB Jaccard | Euro-IX/PeeringDB Overlap | Euro-IX/PCH Jaccard | Euro-IX/PCH Overlap | PeeringDB/PCH Jaccard | PeeringDB/PCH Overlap |
|---|---|---|---|---|---|---|---|---|---|
| Continent | | | | | | | | | |
| Africa | 247 | 163 | 27 | 23.5% | 47.9% | 2.24% | 22.2% | 9.83% | 63.0% |
| Asia Pacific | 1049 | 1105 | 516 | 28.4% | 45.4% | 22.3% | 55.2% | 22.1% | 56.8% |
| Australia | 353 | 470 | 49 | 20.7% | 39.9% | 6.07% | 46.9% | 8.81% | 85.7% |
| Europe | 7747 | 5370 | 1937 | 46.3% | 77.3% | 22.6% | 92.0% | 29.1% | 85.1% |
| Middle East | 41 | 32 | 27 | 40.4% | 65.6% | 47.8% | 81.5% | 63.9% | 85.2% |
| North America | 2059 | 2436 | 1009 | 35.1% | 56.8% | 25.9% | 62.5% | 27.2% | 73.0% |
| South America | 1088 | 693 | 2 | 38.0% | 70.7% | 0.0918% | 50.0% | 0.289% | 100% |
| Size of IXP | | | | | | | | | |
| Less than 30 | 3375 | 3074 | 246 | 24.2% | 40.8% | 2.52% | 36.2% | 4.96% | 63.8% |
| 30 to 59 | 1948 | 1324 | 277 | 31.5% | 59.2% | 11.1% | 80.5% | 11.4% | 59.2% |
| 60 to 119 | 2837 | 2159 | 855 | 38.9% | 64.8% | 18.8% | 68.3% | 24.8% | 69.9% |
| 120 to 239 | 2064 | 1749 | 1041 | 49.1% | 71.8% | 33.7% | 75.1% | 41.3% | 78.3% |
| 240 or more | 2360 | 1963 | 1155 | 74.3% | 93.9% | 44.1% | 93.2% | 49.5% | 89.4% |
| Total | 12584 | 10269 | 3574 | 40.1% | 63.7% | 20.5% | 77.1% | 25.0% | 77.4% |

Table 5: The number of IXP-to-ASN links by category, and the Jaccard and overlap indexes between each pair of datasets for each category. The categories used are continent and IXP size; the latter is computed by averaging over all the datasets in order to yield a consistent classification scheme for the three datasets.

In Table 6 we report the number of IXP-to-ASN links by dataset for the 77 IXPs with BGP route collectors and the Jaccard and overlap similarity between the reference BGP data and the four other datasets. First, we find that approximately 72 % of the BGP IXP-to-ASN tuples are reported in the linked dataset, while the corresponding figure is 65.8 % for PeeringDB and lower for the other datasets. Moreover, we find that Euro-IX and PeeringDB include many IXP-to-ASN links which are not present in the BGP data. This indicates that the BGP data is not complete, either. In particular, the route collectors report only approximately 56 % of the membership contained in the databases.
The underlying reasons include the fact that not all IXP participants may be willing to peer with a route collector, and that the databases may contain stale data. Besides, the validation dataset used in our study (and in all similar validation studies) is subject to selection bias, i.e., bias due to the IXPs and/or ISPs that provide useful information for validation. Indeed, looking at our set of 77 IXPs we find that the PeeringDB, PCH and Euro-IX datasets are in larger agreement for this validation set than for the overall comparison. For example, PeeringDB and Euro-IX now have a Jaccard similarity of 53.1 % as compared to 40.1 % in the earlier analysis (cf. Table 5). We conclude that the figures presented on dataset completeness in the 77 IXPs may be positively biased. This indicates that the information we have about the completeness of the available IXP participant data, even after linking multiple databases, may still be largely incomplete.
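The check of peer addresses against IXP prefixes described in this section was done manually. As an illustration only, the same idea could be automated along the following lines. The input format (pairs of peer IP address and ASN), the function name `peers_on_ixp_lan`, and the prefixes and ASNs (documentation ranges) are assumptions made for this sketch and do not reflect the format of the PCH summary files.

```python
import ipaddress
from typing import Iterable, Set, Tuple


def peers_on_ixp_lan(
    sessions: Iterable[Tuple[str, int]],
    ixp_prefixes: Iterable[str],
) -> Set[int]:
    """Return the ASNs whose peer address lies inside one of the IXP prefixes.

    `sessions` holds (peer IP, peer ASN) pairs; this is a hypothetical,
    already-parsed form of a BGP summary, not the raw PCH format.
    """
    networks = [ipaddress.ip_network(prefix) for prefix in ixp_prefixes]
    on_lan: Set[int] = set()
    for peer_ip, asn in sessions:
        address = ipaddress.ip_address(peer_ip)
        # An IPv4 address is never "in" an IPv6 network and vice versa,
        # so mixed-family prefix lists are handled without extra checks.
        if any(address in network for network in networks):
            on_lan.add(asn)
    return on_lan


# Hypothetical peering LAN prefixes of one IXP (documentation ranges).
prefixes = ["192.0.2.0/24", "2001:db8::/64"]

# Hypothetical sessions seen by the route collector at that IXP.
sessions = [
    ("192.0.2.10", 64500),    # inside the IPv4 peering LAN
    ("198.51.100.7", 64501),  # outside: not a peer over the IXP fabric
    ("2001:db8::5", 64502),   # inside the IPv6 peering LAN
]

print(sorted(peers_on_ixp_lan(sessions, prefixes)))
```

Only ASNs with at least one session address inside an IXP prefix are retained, which mirrors the assumption that genuine participants peer over the IXP fabric.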
We extracted participant lists from IXPs’ websites as an additional source of cross-verification. In particular, we designed custom crawlers for 40 IXP websites in total, which include (i) the 20 largest IXPs by number of participants, and (ii) 20 randomly selected IXPs. We selected two sets of IXPs to mitigate the problem of the selection bias we discussed above. If the website of an IXP did not list participant information, then we selected a further IXP either by size or randomly from the two lists above. The website data were collected during the 2nd half of August 2015. At the same time we extracted and linked fresh data from Euro-IX, PeeringDB, and PCH for the selected IXPs to compare fairly with website data.

From IXPs’ websites, we extracted in total 6,182 IXP-to-ASN links for the top-20 IXPs and 1,181 links for the 20 random IXPs. We find that 94 % of the links in the top-20 IXPs are reported in the union of PeeringDB, Euro-IX, and PCH. This number changes to 85 % for the 20 random IXPs. In Fig. 2 we show the common information (i.e., the Jaccard index) between the websites and the linked dataset, and the information only in one of the two sources for each of the top-20 IXPs. We order IXPs by the percentage of common links. We see that for most websites the fraction of common links is above 80 %. For many IXPs, we observe that the linked datasets contain more IXP-to-ASN links than the websites of the IXPs. Only 6 % of the links are present only on websites. In contrast, 14 % of the links are present only in the linked dataset. Interestingly, this shows that the union of the three databases contains more information about IXP participants than the websites of the IXPs themselves.
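The three-way split shown per IXP in Fig. 2 can be sketched as follows. This is a minimal illustration under stated assumptions: participants are represented as plain sets of ASNs per IXP, the names `split_shares` and `rank_by_common` are hypothetical, and sorting in descending order of the common share is an assumption, since the text does not state the direction.

```python
from typing import Dict, List, Set, Tuple


def split_shares(website: Set[int], linked: Set[int]) -> Dict[str, float]:
    """Split the union of two participant sets into three shares.

    'both' equals the Jaccard index of the two sets; the other two shares
    are the fractions found in only one source. The shares sum to 1.
    """
    total = len(website | linked)
    if total == 0:
        return {"both": 0.0, "only_linked": 0.0, "only_website": 0.0}
    return {
        "both": len(website & linked) / total,
        "only_linked": len(linked - website) / total,
        "only_website": len(website - linked) / total,
    }


def rank_by_common(per_ixp: Dict[str, Tuple[Set[int], Set[int]]]) -> List[str]:
    """Order IXP names by their share of common links, highest first."""
    return sorted(
        per_ixp,
        key=lambda name: split_shares(*per_ixp[name])["both"],
        reverse=True,
    )


# Hypothetical data: IXP name -> (ASNs on the website, ASNs in linked dataset).
per_ixp = {
    "ixp-a": ({1, 2, 3, 4}, {2, 3, 4, 5, 6}),
    "ixp-b": ({1, 2}, {1, 2, 3}),
}

for name in rank_by_common(per_ixp):
    print(name, split_shares(*per_ixp[name]))
```

Because all three shares use the size of the union as denominator, they can be stacked into a single bar per IXP.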
7. CONCLUSIONS AND FUTURE WORK
The quest for representative datasets is perpetual for the research community. Taking into account the rising interest in IXP-related data, in this work we (i) compared three rich IXP datasets in order to assess their strengths and weaknesses, and (ii) combined them in order to improve the completeness of the publicly available IXP data. Our results show that the three datasets have similar geographical coverage, with PCH having many more IXPs, but also many inactive ones. In addition, PeeringDB has an AS-centric bias, while Euro-IX has an IXP-centric bias due to the nature of the self-reporting methodologies used by the two providers. PCH includes very little information about IXP members. Furthermore, our results show that the datasets have partially common as well as rich complementary information. With respect to complementarity, we show for example that by linking the datasets we increase the number of IXP records by 40.2 % compared to using solely PeeringDB. Even more complementary information is available for IXP member information, which previous studies have also shown to be incomplete in PeeringDB [17, 19]. Finally, to aid future research, we have made the dataset snapshots as well as the mappings we constructed available to the public, together with the code used to construct them [1].

| Dataset | Number of links | Jaccard (vs. BGP) | Overlap (vs. BGP) |
|---|---|---|---|
| BGP | 6,425 | - | - |
| UNION | 8,121 | 46.1% | 71.5% |
| Euro-IX | 6,087 | 42.2% | 61.0% |
| PeeringDB | 5,749 | 45.1% | 65.8% |
| PCH | 3,547 | 35.3% | 73.4% |

Table 6: The number of IXP-to-ASN links by dataset for the 77 IXPs with BGP route collectors; and the Jaccard and overlap indexes between each dataset and the ground truth links extracted from the BGP route collectors. UNION denotes the linked dataset containing PeeringDB, Euro-IX, and PCH.

[Figure 2 plot. Y-axis: % of IXP participants. Legend: In Both; Only in PeeringDB + EuroIX + PCH; Only in IXP websites.]

Figure 2: Common and complementary participant information in IXP websites and in the union of PeeringDB, Euro-IX, and PCH datasets. We show the top-20 IXPs with public participant data in their websites.

Still, our results show that while the datasets are partially consistent, they are also incomplete. In particular, the datasets appear to be largely in agreement on the existence of IXPs, and certain attributes such as their operational status. Some of the datasets offer better quantity for certain geographical regions, e.g., Euro-IX for Europe and PeeringDB for the US. However, the consistency between the datasets w.r.t. the IXP participants is surprisingly low. We have to stress that it is unclear to which degree these differences stem from under-reporting or from over-reporting, such as outdated information. Our study is a first step towards an in-depth analysis of IXP datasets. The study opens a number of questions for future work. We would like to understand how the datasets can be cleverly combined, exploiting their individual strengths to improve the accuracy of the available data. In particular, the ground truth behind the available IXP data is still elusive and hard to determine. Other sources of possible ground truth we did not explore in this work are: (i) IXPs’ looking glass servers, (ii) IXPs’ newsletters, and (iii) event/feeds at IXP websites, which announce new IXP members. A final line of enquiry is understanding the growth trends and consistency of the IXP datasets over time within the evolving Internet peering ecosystem.
Acknowledgments
We want to thank Euro-IX, PeeringDB and Packet Clearing House for providing free, publicly available sources of information on Internet Exchange Points. In particular, we want to thank the staff of Euro-IX and Packet Clearing House for providing us information about how data is collected for those datasets. This work has been partly funded by the European Research Council Grant Agreement no. 338402.
8. REFERENCES
[1] Datasets and Software accompanying the paper. https://bitbucket.org/RKloti/a-comparative-look-into-public-ixp-datasets-partially.git
[2] The Route Views Project.
[3] European Internet Exchange Association. Datasets collected on: 2014-09-19, at 21:58 CEST.
[4] Packet Clearing House (PCH) - Data.
[5] Packet Clearing House - Internet Exchange Directory. https://prefix.pch.net/applications/ixpdir/. Datasets collected on: 2014-09-19, at 21:58 CEST.
[6] PeeringDB. Datasets collected on: 2014-09-19, at 11:22 CEST.
[7] Ager, B., Chatzis, N., Feldmann, A., Sarrar, N., Uhlig, S., and Willinger, W. Anatomy of a Large European IXP. In Proc. of ACM SIGCOMM (2012).
[8] Ahmad, M. Z., and Guha, R. Studying the Effect of Internet eXchange Points on Internet Link Delays. In Proc. of the Spring Simulation Multiconference (2010).
[9] Augustin, B., Krishnamurthy, B., and Willinger, W. IXPs: Mapped? In Proc. of ACM IMC (2009).
[10] Chatzis, N., Smaragdakis, G., Feldmann, A., and Willinger, W. There is More to IXPs Than Meets the Eye. ACM SIGCOMM CCR 43, 5 (Nov. 2013).
[11] Dhamdhere, A., and Dovrolis, C. The Internet is Flat: Modeling the Transition from a Transit Hierarchy to a Peering Mesh. In Proc. of ACM CONEXT (2010).
[12] Gill, P., Arlitt, M., Li, Z., and Mahanti, A. The Flattening Internet Topology: Natural Evolution, Unsightly Barnacles or Contrived Collapse? In Passive and Active Network Measurement. Springer, 2008, pp. 1–10.
[13] Gregori, E., Improta, A., Lenzini, L., and Orsini, C. The Impact of IXPs on the AS-level Topology Structure of the Internet. Comput. Commun. 34, 1 (Jan. 2011).
[14] Gupta, A., Vanbever, L., Shahbaz, M., Donovan, S. P., Schlinker, B., Feamster, N., Rexford, J., Shenker, S., Clark, R., and Katz-Bassett, E. SDX: A Software Defined Internet Exchange. In Proc. of ACM SIGCOMM (2014).
[15] Kotronis, V., Dimitropoulos, X., Klöti, R., Ager, B., Georgopoulos, P., and Schmid, S. Control Exchange Points: Providing QoS-enabled End-to-End Services via SDN-based Inter-domain Routing Orchestration. In Research Track of the 3rd Open Networking Summit (ONS) (2014).
[16] Labovitz, C., Iekel-Johnson, S., McPherson, D., Oberheide, J., and Jahanian, F. Internet Inter-domain Traffic. ACM SIGCOMM CCR 41, 4 (Aug. 2010).
[17] Lodhi, A., Larson, N., Dhamdhere, A., Dovrolis, C., and claffy, k. Using peeringDB to Understand the Peering Ecosystem. ACM SIGCOMM CCR 44, 2 (Apr. 2014).
[18] Lychev, R., Goldberg, S., and Schapira, M. BGP Security in Partial Deployment: Is the Juice Worth the Squeeze? In Proc. of ACM SIGCOMM (2013).
[19] S