Characterising Web Site Link Structure

Abstract

The topological structures of the Internet and the Web have received considerable attention. However, there has been little research on the topological properties of individual web sites. In this paper, we consider whether web sites (as opposed to the entire Web) exhibit structural similarities. To do so, we exhaustively crawled 18 web sites as diverse as governmental departments, commercial companies and university departments in different countries. These web sites ranged in size from a few thousand pages to millions of pages. Statistical analysis of these 18 sites revealed that the internal link structures of the web sites are significantly different when measured with first and second-order topological properties, i.e. properties based on the connectivity of an individual node or a pair of nodes. However, examination of a third-order topological property, which considers the connectivity between three nodes that form a triangle, revealed a strong correspondence across web sites, suggestive of an invariant. Comparison with the Web, the AS Internet, and a citation network showed that this third-order property is not shared across other types of networks. Nor is the property exhibited in generative network models such as that of Barabási and Albert.


arXiv [cs.IR], August

Shi Zhou
Dept. of Computer Science, University College London, Adastral Park, Ipswich IP5 3RE

Ingemar Cox
Depts. of Computer Science and Electrical & Electronic Engineering, University College London, Gower Street, London WC1E 6BT

Vaclav Petricek
Dept. of Computer Science, University College London, Gower Street, London WC1E 6BT


Index Terms – Hypertext systems, Topology, Modeling.

The Web has become a global tool for sharing information. It can be represented as a huge graph which consists of billions of hypertext web pages connected by hyperlinks pointing from one web page to another [4, 11]. Each web page is part of a larger web site, which is loosely defined as a group of web pages whose URL addresses use the same domain name, such as cs.ucl.ac.uk and ieee.org.

Studying and understanding the Web's topological structure is important as it may lead to improved techniques for information retrieval. The link structure of the Web has been used in algorithms like PageRank [16] and HITS [9] to estimate the importance of web pages, and in [8, 3, 10] for community discovery and clustering. These algorithms do not typically use the internal link structure within a web site, but rather rely on external links between web sites. Nevertheless, the internal structure of a web site is important. For example, the statistical properties of a web site's link structure may be used as an informative measure of web site quality, e.g. navigability [20].

There is surprisingly little study of the structural properties of web sites in general. Certainly, it is well known that the graph structure of an individual web site can be used to calculate the mean diameter of the web site, and other metrics, which can then be used to infer properties regarding the navigability of the web site. However, we are unaware of prior work that provides a statistical topological characterisation of all web sites. As such, web sites, as opposed to the Web, are often considered to exhibit an arbitrary statistical topological structure.

However, this study reveals that the topology of web sites is not arbitrary. In fact, examination of the triangle coefficient (the number of triangles of a node) as a function of degree (the number of links of the node) reveals a very strong correlation across web sites, suggestive of a possible invariant of web site link structure. Moreover, this third-order property varies across other networks, such as the Web, the Internet and citation networks. Thus, it appears to strongly characterise web sites.

This paper is organised as follows. In Section 2 we introduce a number of topological metrics which have been used to characterise and compare network structures. In Section 3 we introduce the datasets used in this study. These consist of 18 web sites which vary in size from a few thousand pages to millions of pages. The web sites cover a broad range of entities: 9 government sites from various countries, 3 commercial sites, 3 educational sites and 3 very large sites (IEEE, Wikipedia, Yahoo!). In Section 4 we present our statistical results and discuss the implications. In addition to comparing data across web sites, we also compare with (i) subsets of the Web, (ii) a citation network, (iii) the AS-level Internet network, and (iv) the generative model of Barabási and Albert [2]. Section 5 summarises the key results and discusses how this work can be used to improve generative models of hypertext networks.

We briefly review and define the following topological properties, which are grouped into three orders according to the scope of information required to compute them [12]. These are (i) the 1st-order properties, e.g. degree distribution, (ii) the 2nd-order properties, e.g. degree correlation and rich-club connectivity, and (iii) the 3rd-order properties, e.g. triangle coefficient and clustering coefficient.

2.1 1st-Order Properties

Figure 1. Example of (a) an undirected graph and (b) a directed graph.

The link structure of a web site can be described as an undirected graph in which a node represents a web page and a link denotes the existence of at least one hyperlink connection between two nodes. The connectivity, or degree $k$, of a node is defined as the number of links, or neighbours, the node has. For example in Figure 1a, node A has four neighbours B, C, D and E, and its degree is $k_A = 4$. A web site can also be described as a directed graph in which each link has a direction pointing from one node to another. The in-degree $k_{in}$ of a node is then defined as the number of incoming links and the out-degree $k_{out}$ as the number of outgoing links. For example in Figure 1b, node A has three incoming links from nodes B, C and E, i.e. $k_{in} = 3$, and two outgoing links to nodes C and D, i.e. $k_{out} = 2$. This paper studies web site link structure as undirected graphs unless specifically stated.

The degree of a node measures a node's local connectivity. Topological properties calculated by using the degree of individual nodes are classified as 1st-order properties, e.g. the average degree $\bar{k}$ of nodes in a network.

The most studied topological property for large networks is the degree distribution $P(k)$, which is defined as the probability that a randomly selected node has degree $k$. A random graph [7] is characterised by a Poisson degree distribution where the distribution peaks at the network's average degree. It has been reported that a number of networks [2] follow a power-law degree distribution,

$$P(k) \sim k^{-\gamma}, \quad 2 < \gamma < 3. \qquad (1)$$

This means that most nodes have very few links, while a few nodes have a very large number of links.

2.2 2nd-Order Properties

Topological properties are classified as 2nd-order properties if they are based on the degree information of the two end nodes of a link, such as the joint degree distribution $P(k, k')$ [6], which is the probability that a randomly selected link connects a node of degree $k$ with a node of degree $k'$.
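The degree, the degree distribution and the joint degree distribution can be sketched in a few lines of Python. The edge list below is only an illustration: it is one graph consistent with what the text says about Figure 1a, and the node G with its link B-G is an assumption (the text states only that node B has three neighbours). The function names are likewise hypothetical.

```python
from collections import Counter

# Undirected edge list consistent with the statements about Figure 1a.
# ASSUMPTION: node "G" and the link B-G are invented to give B its third neighbour.
EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E"),
    ("C", "F"), ("C", "H"), ("F", "H"), ("B", "G"),
]


def degrees(edges):
    """Return {node: degree} for an undirected graph without duplicate links."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def degree_distribution(edges):
    """P(k): fraction of nodes that have degree k."""
    deg = degrees(edges)
    counts = Counter(deg.values())
    n = len(deg)
    return {k: c / n for k, c in sorted(counts.items())}


def joint_degree_distribution(edges):
    """P(k, k'): probability that a randomly selected link joins degrees k and k'.

    Every link is counted once in each direction, so the result is
    symmetric in (k, k') and sums to 1.
    """
    deg = degrees(edges)
    pairs = Counter()
    for u, v in edges:
        pairs[(deg[u], deg[v])] += 1
        pairs[(deg[v], deg[u])] += 1
    total = 2 * len(edges)
    return {kk: c / total for kk, c in pairs.items()}
```

Counting each link in both directions is a design choice made here so that $P(k, k')$ is symmetric; other normalisations are possible.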
The 2nd-order properties provide a more complete description of a network's structure than the 1st-order properties. For example, the degree distribution can be obtained from the joint degree distribution:

$$P(k) = \frac{\bar{k}}{k} \sum_{k'} P(k, k').$$

The nearest-neighbours average degree, $k_{nn}$, of $k$-degree nodes [17, 22] is a projection of the joint degree distribution, given by

$$k_{nn}(k) = \frac{\bar{k} \sum_{k'} k' P(k, k')}{k P(k)}. \qquad (2)$$

A network is called an assortative network if $k_{nn}(k)$ increases with $k$, which means nodes tend to attach to similar nodes, i.e. high-degree nodes to high-degree nodes and low-degree nodes to low-degree nodes ('assortative mixing'). Many social networks are assortative networks. A network is a disassortative network if $k_{nn}(k)$ decreases with $k$, i.e. high-degree nodes tend to connect with low-degree nodes and vice versa ('disassortative mixing'). This is the case for most information and communications networks.

A network's degree correlation, or mixing pattern, can be summarised by a single scalar called the assortative coefficient [14, 15],

$$\alpha = \frac{L^{-1} \sum_i s_i d_i - \left[ L^{-1} \sum_i \frac{1}{2}(s_i + d_i) \right]^2}{L^{-1} \sum_i \frac{1}{2}(s_i^2 + d_i^2) - \left[ L^{-1} \sum_i \frac{1}{2}(s_i + d_i) \right]^2}, \qquad (3)$$

where $L$ is the number of links and $s_i$, $d_i$ are the degrees of the end nodes of the $i$th link, with $i = 1, 2, \ldots, L$. The value of $\alpha$ is in the range $[-1, 1]$. For assortative networks $\alpha > 0$ and for disassortative networks $\alpha < 0$.

The rich-club connectivity [26, 5] measures how tightly the high-degree nodes, the rich nodes, interconnect among themselves. If $N_{>k}$ is the number of nodes with degrees larger than $k$ and they share $E_{>k}$ links between themselves, the rich-club connectivity is defined as

$$\phi(k) = \frac{2 E_{>k}}{N_{>k} (N_{>k} - 1)}, \qquad (4)$$

where $N_{>k}(N_{>k} - 1)/2$ is the maximum possible number of links that the $N_{>k}$ nodes can have. For example in Figure 1a, there are five nodes (A, B, C, D and E) with degrees larger than 2 and they have 8 links between them, thus $\phi(2) = \frac{8}{5 \times (5 - 1)/2} = 0.8$, which means the 5 best-connected nodes are 80% fully interconnected. The rich-club connectivity is a 2nd-order property because whether a link belongs to $E_{>k}$ depends on the degrees of the link's two end nodes.

Figure 2. Graphs (a) with a rich-club and (b) without a rich-club.
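As a rough illustration of Equations (2) to (4), the sketch below computes $k_{nn}(k)$, the assortative coefficient and the rich-club connectivity directly from an edge list. The example graph is one reconstruction consistent with the description of Figure 1a; the node G and the link B-G are assumptions, and all function names are hypothetical.

```python
from collections import Counter, defaultdict

# Undirected edge list consistent with the statements about Figure 1a.
# ASSUMPTION: node "G" and the link B-G are invented to give B its third neighbour.
EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E"),
    ("C", "F"), ("C", "H"), ("F", "H"), ("B", "G"),
]


def degrees(edges):
    """Return {node: degree} for an undirected graph without duplicate links."""
    deg = Counter()
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return dict(deg)


def knn(edges):
    """k_nn(k): mean degree of the neighbours of k-degree nodes (Equation 2)."""
    deg = degrees(edges)
    nbr_deg_sum = defaultdict(float)  # summed neighbour degrees, keyed by k
    link_ends = Counter()             # number of link ends at k-degree nodes
    for u, v in edges:
        for a, b in ((u, v), (v, u)):
            nbr_deg_sum[deg[a]] += deg[b]
            link_ends[deg[a]] += 1
    return {k: nbr_deg_sum[k] / link_ends[k] for k in sorted(link_ends)}


def assortative_coefficient(edges):
    """Assortative coefficient alpha of Equation (3)."""
    deg = degrees(edges)
    n_links = len(edges)
    s = [deg[u] for u, _ in edges]
    d = [deg[v] for _, v in edges]
    mean_prod = sum(a * b for a, b in zip(s, d)) / n_links
    mean_half_sum = sum(0.5 * (a + b) for a, b in zip(s, d)) / n_links
    mean_half_sq = sum(0.5 * (a * a + b * b) for a, b in zip(s, d)) / n_links
    return (mean_prod - mean_half_sum ** 2) / (mean_half_sq - mean_half_sum ** 2)


def rich_club(edges, k):
    """Rich-club connectivity phi(k) of Equation (4); None if fewer than 2 rich nodes."""
    deg = degrees(edges)
    rich = {node for node, d in deg.items() if d > k}
    if len(rich) < 2:
        return None
    e_rich = sum(1 for u, v in edges if u in rich and v in rich)
    return 2 * e_rich / (len(rich) * (len(rich) - 1))
```

On this example graph, `rich_club(EDGES, 2)` reproduces the value $\phi(2) = 0.8$ worked out in the text, and a pure star graph gives $\alpha = -1$, the disassortative extreme.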

The rich-club connectivity is a different projection of the joint degree distribution,

$$\phi(k) = \frac{N \bar{k} \sum_{k', k'' = k+1}^{k_{max}} P(k', k'')}{\left[ N \sum_{k' = k+1}^{k_{max}} P(k') \right] \cdot \left[ N \sum_{k' = k+1}^{k_{max}} P(k') - 1 \right]}, \qquad (5)$$

where $N$ is the total number of nodes and $k_{max}$ is the maximum degree in a network. The rich-club connectivity is not trivially related to the degree correlation [24]. For example, the two graphs shown in Figure 2 are both disassortative networks, but for the 3 best-connected nodes in Figure 2a, $\phi = 1$, and in Figure 2b, $\phi = 0$.

2.3 3rd-Order Properties

The 3rd-order properties are based on connectivity information between three nodes that form a triangle.

The triangle coefficient $\Delta$ is defined as the number of triangles a node shares, which is equivalent to the number of links among the node's neighbours [25]. The triangle is the basic unit of network redundancy: the more triangles, the more alternative paths between nodes.

In-triangle and out-triangle coefficients. On a directed graph, a node's neighbours can be divided into two groups: in-neighbours, which are connected with incoming links; and out-neighbours, which are connected with outgoing links. An in-triangle of a node consists of the node and two of its in-neighbours, and an out-triangle consists of the node and two out-neighbours. For example in Figure 1b, node A has two in-triangles, ABC and ACE, and one out-triangle, ACD; therefore node A's in-triangle coefficient $\Delta_{in}$ is 2 and its out-triangle coefficient $\Delta_{out}$ is 1.

A more widely studied 3rd-order property is the clustering coefficient $C$, which is defined as the ratio of actual links among a node's neighbours to the maximal possible number of links they can share [23]. The clustering coefficient of a node can be given as a function of the node's degree and its triangle coefficient,

$$C = \frac{\Delta}{k(k - 1)/2}. \qquad (6)$$

Two nodes with different triangle coefficients can have the same clustering coefficient. For example in Figure 1a, node B has three neighbours and one triangle, and node C has six neighbours and five triangles (CBA, CAD, CAE, CED and CFH). However, their clustering coefficients are the same:

$$C_B = \frac{1}{3(3 - 1)/2} = \frac{5}{6(6 - 1)/2} = C_C.$$

Therefore one should use the triangle coefficient to infer the clustering information of nodes with different degrees.
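The triangle and clustering coefficients above can be checked with a short sketch. The undirected edge list is one graph consistent with the text's description of Figure 1a (node G and the link B-G are assumptions), and the directed arc list is consistent with the description of Figure 1b, where the directions chosen for the links B-C, E-C and C-D are assumptions. The code also assumes that the two neighbours of an in-triangle or out-triangle only need to be linked in either direction. Function names are hypothetical.

```python
from itertools import combinations

# Undirected graph consistent with the statements about Figure 1a.
# ASSUMPTION: node "G" and the link B-G are invented to give B its third neighbour.
EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E"),
    ("C", "F"), ("C", "H"), ("F", "H"), ("B", "G"),
]

# Directed graph consistent with the statements about Figure 1b, as (source, target).
# ASSUMPTION: the directions of B->C, E->C and C->D are not given in the text.
ARCS = [
    ("B", "A"), ("C", "A"), ("E", "A"), ("A", "C"), ("A", "D"),
    ("B", "C"), ("E", "C"), ("C", "D"),
]


def neighbours(edges):
    """Return {node: set of neighbours} for an undirected edge list."""
    nbrs = {}
    for u, v in edges:
        nbrs.setdefault(u, set()).add(v)
        nbrs.setdefault(v, set()).add(u)
    return nbrs


def triangle_coefficient(nbrs, node):
    """Number of links among the node's neighbours."""
    return sum(1 for a, b in combinations(sorted(nbrs[node]), 2) if b in nbrs[a])


def clustering_coefficient(nbrs, node):
    """Equation (6): triangles divided by the maximum possible k(k-1)/2."""
    k = len(nbrs[node])
    if k < 2:
        return 0.0
    return triangle_coefficient(nbrs, node) / (k * (k - 1) / 2)


def in_out_triangles(arcs, node):
    """Return (in-triangle coefficient, out-triangle coefficient) of a node."""
    ins = {u for u, v in arcs if v == node}
    outs = {v for u, v in arcs if u == node}
    linked = {frozenset(arc) for arc in arcs}  # direction ignored between neighbours

    def count(group):
        return sum(1 for a, b in combinations(sorted(group), 2)
                   if frozenset((a, b)) in linked)

    return count(ins), count(outs)
```

With these inputs, nodes B and C have triangle coefficients 1 and 5 but the same clustering coefficient of 1/3, and node A of the directed graph has two in-triangles and one out-triangle, matching the examples in the text.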

Here we briefly summarise the various datasets used in this study.

We exhaustively crawled the 18 web sites of the organisations listed in Table 1: 1) the national audit office or equivalent of Canada (AO-CA), Italy (AO-IT), the United Kingdom (AO-UK) and the United States (AO-US); 2) the foreign office or equivalent of Australia (FO-AU), the Czech Republic (FO-CZ), Germany (FO-DE), Japan (FO-JP) and the UK (FO-UK); 3) commercial web sites, namely HSBC bank in the UK (COM-HSBC), the UK retailer NEXT (COM-NEXT) and the automobile company SKODA (COM-SKODA); 4) educational web sites, namely the Faculty of Arts at the University of Auckland, New Zealand (EDU-AUCK), the Haas School of Business at the University of California at Berkeley (EDU-UCB), and the Department of Computer Science at University College London (EDU-UCL); and 5) three very large web sites with millions of web pages, namely the IEEE (LARGE-IEEE), the Simplified Chinese Wikipedia (LARGE-WIKI) and Yahoo! (LARGE-YAHOO).

Table 1. Properties of the datasets

Dataset           Web site              Number of   Number of   Average  Assortative  Average
                  domain name           nodes       links       degree   coefficient  triangle coef.
AO-CA             cac.gc.ca             12,730      120,485     15.94    -0.35        159.78
AO-IT             corteconti.it         32,614      200,516     11.96    -0.40        186.11
AO-UK             nao.gov.uk            4,027       25,453      11.84    -0.36        89.40
AO-US             gao.gov               19,625      223,998     21.69    -0.63        289.37
FO-AU             dfat.gov.au           29,140      791,039     53.25    -0.78        1,066.30
FO-CZ             mzv.cz                31,246      778,163     45.23    -0.13        1,134.06
FO-DE             auswaertiges-amt.de   46,219      2,234,535   94.10    -0.56        4,439.89
FO-JP             mofa.go.jp            52,206      493,861     17.11    -0.37        177.23
FO-UK             fco.gov.uk            33,280      694,255     36.29    -0.16        884.54
COM-HSBC          hsbc.co.uk            51,043      68,454      2.62     -0.05        7.97
COM-NEXT          next.co.uk            74,989      557,466     14.11    -0.47        182.55
COM-SKODA         skoda-auto.com        49,341      727,119     28.39    -0.30        292.12
EDU-AUCK          arts.auckland.ac.nz   12,457      129,870     17.64    -0.21        258.13
EDU-UCB           haas.berkeley.edu     100,025     373,521     6.90     -0.09        84.85
EDU-UCL           cs.ucl.ac.uk          36,554      229,711     10.81    -0.15        70.34
LARGE-IEEE        ieee.org              1,977,923   5,614,610   5.54     -0.05        57.92
LARGE-WIKI        zh.wikipedia.org      1,913,510   8,249,248   8.12     -0.13        64.54
LARGE-YAHOO       yahoo.com             3,448,289   12,039,165  6.72     -0.08        81.69
Web               -                     43,425      173,696     7.96     -0.12        38.43
Citation network  -                     244,864     897,170     7.33     -0.08        4.20
AS Internet       -                     9,200       28,957      6.30     -0.24        21.37
BA model          -                     10,000      30,000      6.00     -0.02        0.16

We used the Nutch 1.6.0 crawler (http://lucene.apache.org/nutch). Each crawl was started from a web site's homepage and was restricted to the web site's domain as listed in Table 1. The crawler was configured to allow for complete site acquisition and collected all web pages up to a depth of 18. The default parameters were a 5-second delay between requests to the same host, and 10,000 attempts to retrieve pages that fail with a 'soft' error [20]. We discarded hyperlinks pointing to web pages outside the web site's domain and removed self-loops and duplicated hyperlinks.

We are aware of a number of available data sources of the Web. We did not extract web site data from them because they aim to sample the entire Web and contain very incomplete information on the internal link structure of individual web sites. For example, the Stanford WebBase data (http://dbpubs.stanford.edu:8091/~testbed/doc2/WebBase) contains only 400 web pages with NASA's domain name (nasa.gov).

WT10g is a mega dataset of the Web proposed by the annual international Text REtrieval Conference (TREC, http://trec.nist.gov). WT10g is constructed from more than 320 gigabytes of archived data containing 1.7M web pages and the hyperlinks between them. It is reported that WT10g retains properties of the larger Web [21], and it has been used as a data resource for research on Web retrieval and modelling. We randomly sampled 10 subsets of WT10g, each of which contains 50,000 web pages and the links between those pages. In this paper we use the average properties of the 10 WT10g subsets as an approximation of the Web's link structure.

3.3 Citation Network

The citation network [19] data was extracted from the online computer science publication database CiteSeer (http://citeseer.ist.psu.edu/). The CiteSeer data contain 575K entries, from which we extracted 244,864 records having at least one reference (outgoing link) or citation (incoming link).
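The clean-up rules described for the crawled sites (drop links that leave the site's domain, self-loops and duplicated hyperlinks) and the record filter used for the citation data can be sketched as follows. This is a minimal illustration and not the pipeline used in the study: the function names are hypothetical, treating subdomains as part of the site is an assumption, and no URL normalisation is attempted.

```python
from urllib.parse import urlsplit


def internal_links(hyperlinks, domain):
    """Keep distinct (source, target) hyperlinks with both ends inside `domain`.

    ASSUMPTION: a host belongs to the site if it equals `domain` or is a
    subdomain of it. Self-loops and duplicates are dropped.
    """
    def inside(url):
        host = urlsplit(url).hostname or ""
        return host == domain or host.endswith("." + domain)

    kept = set()
    for src, dst in hyperlinks:
        if src != dst and inside(src) and inside(dst):
            kept.add((src, dst))
    return kept


def to_undirected(links):
    """One undirected link per page pair joined by at least one hyperlink."""
    return {frozenset(link) for link in links}


def linked_records(citations):
    """Records with at least one reference (outgoing) or citation (incoming)."""
    records = set()
    for citing, cited in citations:
        records.add(citing)
        records.add(cited)
    return records
```

A hyperlink pair that points in both directions between the same two pages collapses into a single undirected link, which matches the undirected-graph convention used throughout the paper.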

The Internet topology at the autonomous systems (AS) level has been extensively studied in recent years [18, 25, 13, 12]. On the AS Internet, nodes represent Internet service providers and links represent connections between them. In this paper we use the AS Internet dataset ITDK0304 collected by CAIDA [1].

The Barabási and Albert (BA) model [2] has been widely used in the study of complex networks. This model shows that a power-law degree distribution can be produced by two mechanisms: growth, where the network "grows" from a small random graph by attaching new nodes to old nodes in the existing system; and preferential attachment, where a new node is attached preferentially to nodes that are already well connected.
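The two mechanisms can be illustrated with a small generator. This is a sketch under stated assumptions rather than the exact procedure used to produce the BA network in Table 1: the seed graph is taken to be a complete graph on m + 1 nodes (the text only says "a small random graph"), each new node attaches to m distinct existing nodes, and the function name and parameters are hypothetical.

```python
import random


def barabasi_albert(n, m, seed=None):
    """Grow an undirected network with n nodes; each new node adds m links.

    Growth: nodes are added one at a time.
    Preferential attachment: targets are drawn from a list that holds each
    node once per link end, so the chance of being chosen is proportional
    to the node's current degree.
    ASSUMPTION: the seed graph is a complete graph on m + 1 nodes.
    """
    if m < 1 or n <= m:
        raise ValueError("need m >= 1 and n > m")
    rng = random.Random(seed)
    edges = [(i, j) for i in range(m + 1) for j in range(i)]
    link_ends = [node for edge in edges for node in edge]
    for new in range(m + 1, n):
        targets = set()
        while len(targets) < m:          # m distinct targets, no duplicate links
            targets.add(rng.choice(link_ends))
        for t in targets:
            edges.append((new, t))
            link_ends.extend((new, t))
    return edges
```

With this seed graph, the number of links is m(m + 1)/2 + m(n - m - 1), so m = 3 gives an average degree close to 6 for large n.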

Here we summarise our experimental findings. We examine a variety of first, second and third-order topological properties and compare them across the various web sites. We then compare the topological properties of web sites with other networks, specifically the Web, the AS Internet, a citation network, and the generative network model of Barabási and Albert.

4.1.1 1st and 2nd-Order Properties

As shown in Table 1, the size and the average degree of the web sites vary significantly. The foreign office web sites have very large average degrees, whereas the three large web sites with millions of web pages have very small average degrees. Figures 3a, 3b and 3c illustrate the degree distribution $P(k)$, the degree correlation $k_{nn}(k)$, and the rich-club connectivity $\phi(k)$ of the 18 web sites on a log-log scale. Also shown are their average properties, depicted by circles (see the footnote below). It is clear that the 1st and 2nd-order properties of the web sites exhibit huge variations over several orders of magnitude. Thus, the web sites cannot be well characterised by the average of these properties. For example, in Figure 3c, in some web sites the high-degree nodes are almost fully interconnected among themselves, i.e. $\phi \approx 1$, whereas in other web sites the interconnectedness is much looser, with $\phi$ less than 0.001.

Footnote: The average degree distribution $\bar{P}(k)$ is obtained as follows: for a given $k$, if a sufficient number $X$ of the 18 web sites have $P(k) > 0$, then $\bar{P}(k) = X^{-1} \sum_i P_i(k)$, where $i = 1, \ldots, X$. Other average properties are calculated in similar ways.

4.1.2 3rd-Order Properties

Figure 3d shows the complementary cumulative distribution of the triangle coefficient, $P_c(\Delta)$, which is the probability that a node's triangle coefficient is larger than $\Delta$. Figure 3e shows the relationship between triangle coefficient and degree, $\Delta(k)$, i.e. the average triangle coefficient of $k$-degree nodes. Although the web sites do not show an agreement on $P_c(\Delta)$, they do exhibit a clear correspondence on $\Delta(k)$.

Some web sites have sharp spikes on their $\Delta(k)$ curves. These spikes reflect the existence of star-like subgraphs in these web sites, e.g. a web page with a long list of hyperlinks pointing to documents or images. Compared to the large number of web pages contained in a web site, the limited number of such spikes is not statistically significant.

The average over all the web sites of the triangle coefficient as a function of degree is also depicted in Figure 3e (circles). It is a smooth curve which represents all the web sites well. This is suggestive of a structural invariant of web sites.

Figure 3f shows that the web sites exhibit a similar correspondence on the relationship between clustering coefficient and degree, $C(k)$. Note that the average clustering coefficient, depicted by circles, is not a monotonic function of degree. This is because the clustering coefficient is itself a function of the degree and the triangle coefficient. In the following we do not consider $C(k)$ further, as the triangle coefficient, $\Delta(k)$, contains all the information provided by $C(k)$.

Figure 3. Topological properties of the web sites: a) degree distribution, $P(k)$; b) nearest-neighbours average degree of $k$-degree nodes, $k_{nn}(k)$; c) rich-club connectivity as a function of degree, $\phi(k)$; d) complementary cumulative distribution of triangle coefficient, $P_c(\Delta)$; e) correlation between triangle coefficient and degree, $\Delta(k)$; and f) correlation between clustering coefficient and degree, $C(k)$.

Here we compare the topological properties of the average over all web sites with those of other networks, specifically the Web, a citation network, the AS Internet, and the BA model.

Figure 4a shows that the degree distributions of the Web, the citation network, the AS Internet and the BA model can be well described as a power law, $P(k) \sim k^{-\gamma}$ with $2 < \gamma < 3$. However, the average degree distribution of the web sites is very different: for small and for large degrees it can be described as a power law, but over an intermediate range of degrees the distribution increases exponentially with degree.

Figure 4b shows that the citation network and the AS Internet are typical disassortative networks where $k_{nn}$ decreases monotonically with $k$. The BA model is an example of a neutral network where $k_{nn}$ does not change with $k$. For the average of the web sites, and for the Web, $k_{nn}$ first increases and then decreases with $k$, peaking at $k = 30$ and $k = 15$ respectively. For large degrees, the average $k_{nn}$ of the web sites is significantly larger than that of all the other networks.

Figure 4c shows that the AS Internet has the highest rich-club connectivity, with a fully interconnected core, i.e. $\phi(k) = 1$, among its highest-degree nodes. The citation network has the lowest rich-club connectivity. Although the BA model is very different from the web sites when measured by $k_{nn}(k)$, the two exhibit similar rich-club connectivity above a certain degree.

Figure 4d shows that the web sites contain significantly more triangles than all the other networks. The high density of triangles ensures the navigability of the web sites.
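The quantities plotted in Figures 3d, 3e, 4d and 4e, and the per-degree averaging over sites, can be sketched as below. The example graph is one reconstruction consistent with the description of Figure 1a (node G and the link B-G are assumptions). The averaging helper follows the footnoted rule only loosely: `min_sites` is a hypothetical parameter standing in for the minimum number of sites required, and a site is counted at a given degree whenever it has any value there. All function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

# Undirected edge list consistent with the statements about Figure 1a.
# ASSUMPTION: node "G" and the link B-G are invented to give B its third neighbour.
EDGES = [
    ("A", "B"), ("A", "C"), ("A", "D"), ("A", "E"),
    ("B", "C"), ("C", "D"), ("C", "E"), ("D", "E"),
    ("C", "F"), ("C", "H"), ("F", "H"), ("B", "G"),
]


def degree_and_triangles(edges):
    """Return {node: (degree, triangle coefficient)}."""
    nbrs = defaultdict(set)
    for u, v in edges:
        nbrs[u].add(v)
        nbrs[v].add(u)
    nbrs = dict(nbrs)
    result = {}
    for node, ns in nbrs.items():
        tri = sum(1 for a, b in combinations(sorted(ns), 2) if b in nbrs[a])
        result[node] = (len(ns), tri)
    return result


def triangle_vs_degree(edges):
    """Delta(k): mean triangle coefficient of the k-degree nodes."""
    by_degree = defaultdict(list)
    for k, tri in degree_and_triangles(edges).values():
        by_degree[k].append(tri)
    return {k: sum(v) / len(v) for k, v in sorted(by_degree.items())}


def triangle_ccdf(edges):
    """P_c(Delta): fraction of nodes whose triangle coefficient exceeds Delta."""
    tris = [tri for _, tri in degree_and_triangles(edges).values()]
    return {d: sum(1 for t in tris if t > d) / len(tris) for d in sorted(set(tris))}


def average_over_sites(curves, min_sites):
    """Average per-site curves {k: value} over the sites that have a value at k.

    A degree k is kept only if at least `min_sites` sites contribute
    (hypothetical threshold parameter).
    """
    values = defaultdict(list)
    for curve in curves:
        for k, v in curve.items():
            values[k].append(v)
    return {k: sum(v) / len(v) for k, v in sorted(values.items())
            if len(v) >= min_sites}
```

For a real comparison, `triangle_vs_degree` would be evaluated once per crawled site and the resulting curves passed to `average_over_sites`.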

Figure 4e shows that, in general, all the networks exhibit a positive correlation between triangle coefficient and degree. This is because the larger the degree of a node, the more neighbours the node has, and thus the higher the chance of forming triangles. As discussed in Section 4.1.2, all the web sites exhibit a very similar relationship between triangle coefficient and degree, which is well characterised by the average over all the web sites. The average correlation between triangle coefficient and degree of the web sites can be closely fitted by a function of the form

$$f(x) = a \, x^{\, b - c \log(x)}, \quad \text{or equivalently} \quad \log f(x) = -c \log^2(x) + b \log(x) + \log a,$$

with fitted constants $a$, $b$ and $c$.

Figure 4. Comparison between the average of the web sites and (i) the Web, (ii) a citation network, (iii) the AS Internet, and (iv) the BA model: a) degree distribution, $P(k)$; b) nearest-neighbours average degree of $k$-degree nodes, $k_{nn}(k)$; c) rich-club connectivity as a function of degree, $\phi(k)$; d) complementary cumulative distribution of triangle coefficient, $P_c(\Delta)$; e) triangle coefficient as a function of degree, $\Delta(k)$; and f) three triangle properties: triangle coefficient versus degree, $\Delta(k)$; in-triangle coefficient versus in-degree, $\Delta_{in}(k_{in})$; and out-triangle coefficient versus out-degree, $\Delta_{out}(k_{out})$.

It is clear that the relationship between triangle coefficient and degree of the web sites is different from that of the other networks. The BA model exhibits the lowest number of triangles as a function of node degree, followed by the citation network, and then the AS Internet. For small degrees, the Web data closely follows the average over the web sites, but diverges thereafter.

Figure 4f examines the three relationships of (i) triangle coefficient versus degree, $\Delta(k)$, (ii) in-triangle coefficient versus in-degree, $\Delta_{in}(k_{in})$, and (iii) out-triangle coefficient versus out-degree, $\Delta_{out}(k_{out})$, for the citation network and the average over all 18 web sites. That is, here we consider the networks as directed graphs.

For the web sites, these three relationships closely overlap one another. This means the probabilities of forming triangles with a web page's in-neighbours and with its out-neighbours are the same. However, for the citation network, $\Delta_{in}(k_{in})$ is one order of magnitude larger than $\Delta_{out}(k_{out})$ for the same degrees.
This means the probability of a paper forming triangles with its citations (in-neighbours) is significantly larger than that of it forming triangles with its references (out-neighbours).

This structural difference between web sites and the citation network may reflect their different evolution dynamics. For a citation network, when a paper is published all its references existed before the publication of the paper and, of course, cannot be changed. However, a paper can always acquire new citations, and these citations may reference other citations (thus continuing to form triangles). In contrast, for a web site, web pages and their associated hyperlinks can be added, deleted or revised at any time. For web sites, there is no equivalent to a reference that remains static and cannot be changed in the future.

We examined a number of topological properties of hyperlink data crawled from 18 diverse web sites. Our empirical results showed that the link structures of the web sites are significantly different when measured with 1st and 2nd-order topological properties. This is probably to be expected since the web sites are designed for different purposes and developed independently. However, we observed that web sites share a common 3rd-order topological property, the relationship between triangle coefficient and degree. This common relationship is unexpected and suggestive of a topological invariant for web sites. Comparison with the Web, the AS Internet, a citation network and the Barabási-Albert model showed that this third-order property is not shared across other types of networks. Thus, this property appears to strongly characterise web sites. The physical meaning of this 3rd-order property is that, given the number of hyperlinks to and from a particular web page, we can statistically estimate how the web page's neighbouring pages are interlinked; and this statistical estimation is valid for all web sites.

Further evaluation on a wider variety of web sites is needed to verify that this 3rd-order property is an invariant. If so, then the fundamental question is why. Possible explanations include standardised web site design principles, popular web site development tools, or universal evolution dynamics which fundamentally reflect the common nature and function of web sites as a way of organising and disseminating information. The answer to this question may prove valuable for research on a number of issues, such as modelling web sites and other document networks, recommendations for building web sites in the future, optimising search engine algorithms, and understanding the fundamental principles governing the evolution of the Web.

This work is partly supported by the UK Nuffield Foundation grant NAL/01125/G and a grant from the Cambridge-MIT Institute.

References

[1] The Cooperative Association for Internet Data Analysis (CAIDA).
[2] A. Barabási and R. Albert. Emergence of scaling in random networks. Science, 286:509, 1999.
[3] K. Bharat, B.-W. Chang, M. Henzinger, and M. Ruhl. Who links to whom: Mining linkage between web sites. In Proc. of IEEE Intl. Conf. on Data Mining (ICDM), 2001.
[4] A. Broder, R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the web: Experiments and models. In WWW'00: Proc. of the 9th Intl. Conf. on World Wide Web, 2000.
[5] V. Colizza, A. Flammini, M. A. Serrano, and A. Vespignani. Detecting rich-club ordering in complex networks. Nature Physics, 2:110–115, 2006.
[6] S. N. Dorogovtsev and J. F. F. Mendes. Evolution of networks. Adv. Phys., 51(1079), 2002.
[7] P. Erdős and A. Rényi. On the evolution of random graphs. Publ. Math. Inst. Hung. Acad. Sci., 5:17, 1960.
[8] X. He, H. Zha, C. Ding, and H. Simon. Web document clustering using hyperlink structures. Computational Statistics and Data Analysis, 41(1):19–45, 2001.
[9] J. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604–632, 1999.
[10] R. Kumar, P. Raghavan, S. Rajagopalan, and A. Tomkins. Trawling the web for emerging cyber-communities. Computer Networks, 31(11-16):1481–1493, 1999.
[11] M. Levene. An Introduction to Search Engines and Web Navigation. Pearson Education, 2005.
[12] P. Mahadevan, D. Krioukov, K. Fall, and A. Vahdat. Systematic topology analysis and generation using degree correlations. In Proc. of SIGCOMM'06, pages 135–146. ACM Press, New York, 2006.
[13] P. Mahadevan, D. Krioukov, M. Fomenkov, B. Huffaker, X. Dimitropoulos, K. Claffy, and A. Vahdat. The Internet AS-level topology: Three data sources and one definitive metric. Comput. Commun. Rev., 36(1):17–26, 2006.
[14] M. E. J. Newman. Assortative mixing in networks. Phys. Rev. Lett., 89(208701), 2002.
[15] M. E. J. Newman. Mixing patterns in networks. Phys. Rev. E, 67(026126), 2003.
[16] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford Digital Library Technologies Project, 1998.
[17] R. Pastor-Satorras, A. Vázquez, and A. Vespignani. Dynamical and correlation properties of the Internet. Phys. Rev. Lett., 87(258701), 2001.
[18] R. Pastor-Satorras and A. Vespignani. Evolution and Structure of the Internet - A Statistical Physics Approach. Cambridge University Press, Cambridge, 2004.
[19] V. Petricek, I. J. Cox, H. Han, I. Councill, and C. L. Giles. A comparison of on-line computer science citation databases. In ECDL'2005: Proc. of the 9th European Conf. on Research and Advanced Technology for Digital Libraries. Springer, 2005.
[20] V. Petricek, T. Escher, I. J. Cox, and H. Margetts. The web structure of e-government - developing a methodology for quantitative evaluation. In WWW'06: Proc. of the 15th Intl. Conf. on World Wide Web, 2006.
[21] I. Soboroff. Does WT10g look like the web? In ACM SIGIR'02, pages 423–425, 2002.
[22] A. Vázquez, M. Boguñá, Y. Moreno, R. Pastor-Satorras, and A. Vespignani. Topology and correlations in structured scale-free networks. Phys. Rev. E, 67(046111), 2003.
[23] D. J. Watts and S. H. Strogatz. Collective dynamics of 'small-world' networks. Nature, 393:440, 1998.
[24] S. Zhou and R. Mondragón. Structural constraints in complex networks. New J. of Physics, 9(173), 2007.
[25] S. Zhou and R. J. Mondragón. Accurately modelling the Internet topology. Phys. Rev. E, 70(066108), 2004.
[26] S. Zhou and R. J. Mondragón. The rich-club phenomenon in the Internet topology.