Agents, Bookmarks and Clicks: A topical model of Web navigation

Mark R. Meiss,∗ Bruno Gonçalves, José J. Ramasco, Alessandro Flammini, Filippo Menczer

School of Informatics and Computing, Indiana University, Bloomington, IN, USA
Center for Complex Networks and Systems Research, Indiana University, Bloomington, IN, USA
Pervasive Technology Institute, Indiana University, Bloomington, IN, USA
Complex Networks and Systems Lagrange Laboratory (CNLL), ISI Foundation, Turin, Italy

∗ Corresponding author. Email: [email protected]
ABSTRACT
Analysis of aggregate and individual Web traffic has shown that PageRank is a poor model of how people navigate the Web. Using the empirical traffic patterns generated by a thousand users, we characterize several properties of Web traffic that cannot be reproduced by Markovian models. We examine both aggregate statistics capturing collective behavior, such as page and link traffic, and individual statistics, such as entropy and session size. No model currently explains all of these empirical observations simultaneously. We show that all of these traffic patterns can be explained by an agent-based model that takes into account several realistic browsing behaviors. First, agents maintain individual lists of bookmarks (a non-Markovian memory mechanism) that are used as teleportation targets. Second, agents can retreat along visited links, a branching mechanism that also allows us to reproduce behaviors such as the use of a back button and tabbed browsing. Finally, agents are sustained by visiting novel pages of topical interest, with adjacent pages being more topically related to each other than distant ones. This modulates the probability that an agent continues to browse or starts a new session, allowing us to recreate heterogeneous session lengths. The resulting model is capable of reproducing the collective and individual behaviors we observe in the empirical data, reconciling the narrowly focused browsing patterns of individual users with the extreme heterogeneity of aggregate traffic measurements. This result allows us to identify a few salient features that are necessary and sufficient to interpret the browsing patterns observed in our data. In addition to the descriptive and explanatory power of such a model, our results may lead the way to more sophisticated, realistic, and effective ranking and crawling algorithms.
Categories and Subject Descriptors
H.3.4 [Information Storage and Retrieval]: Systems and Software—Information networks; H.4.3 [Information Systems Applications]: Communications Applications—Information browsers; H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia—Navigation
Keywords
Web links, navigation, traffic, clicks, browsing, entropy, sessions, agent-based model, bookmarks, back button, interest, topicality, PageRank, BookRank
1. INTRODUCTION
Despite its simplicity, PageRank [6] has been a remarkably robust model of human Web browsing, characterizing it as a random surfing activity. Such models of Web surfing have allowed us to speculate about how people interact with the Web. As ever more people spend a growing portion of their time online, their Web traces provide an increasingly informative window into human dynamics. The availability of large volumes of Web traffic data enables systematic testing of PageRank's underlying navigation assumptions [20]. Traffic patterns aggregated across users have revealed that some of its key assumptions, namely uniform random walk and uniform random teleportation, are widely violated, making PageRank a poor predictor of traffic. Such results leave open the question of how to design a better Web navigation model. Here we expand on our previous empirical analysis [20, 19] by also considering individual traffic patterns [14]. Our results provide further evidence for the limits of simple (memoryless) Markovian models such as PageRank. They suggest the need for an agent-based model with more realistic features, such as memory and topicality, to account for both individual and aggregate traffic patterns observed in real-world data.

Models of user browsing also have important practical applications. First, the traffic received by pages and Web sites has a direct impact on the financial success of many companies and institutions. Indirectly, understanding traffic patterns has consequences for predicting advertising revenues and for the policies used to establish advertising prices [11]. Second, realistic models of Web navigation could guide the behavior of intelligent crawling algorithms, improving the coverage of important sites by search engines [8, 25]. Finally, improved traffic models may lead to enhanced search ranking algorithms [6, 28, 17].

Contributions and outline. In the remainder of this paper, after some background on related and prior work, we describe a data set collected through a field study of over a thousand users on the main campus of Indiana University. We previously introduced a model of browsing behavior, called BookRank [3], which extends PageRank by adding a memory mechanism. Here we introduce a novel agent-based model, which also accounts for the topical interests of users. We compare the traffic patterns generated by these models with both aggregated and individual Web traffic data from our field study. Our main contributions are summarized below:

• We show that the empirical diversity of pages visited by individual users, as measured by Shannon entropy, is not well predicted by either PageRank or BookRank. This suggests that a typical user has both focused interests and recurrent habits, meaning that the diversity apparent in many aggregate measures of traffic must be a consequence of the diversity across individual interests.

• When we build logical sessions by assembling requests based on referrer information and initiating sessions with link-independent jumps [19], we find that endowing the model with a simple memory mechanism such as bookmarks (as in the BookRank model) is sufficient to correct the mismatch between PageRank and the distributions of aggregate measures of traffic, but not to capture the broad distributions of individual session size and depth.

• We introduce an agent-based navigation model, ABC, with three key, realistic ingredients: (1) bookmarks are managed and used as teleportation targets, defining boundaries between logical sessions and allowing us to capture the diverse popularity of starting pages; (2) a back button is available to account for tabbed browsing and explain the branching observed in empirical traffic; and (3) topical interests drive an agent's decision to continue browsing or start a new session, leading to diverse session sizes. The model also takes into consideration the topical locality of the Web, so that an interesting page is likely to link to other interesting pages.

• Finally, we demonstrate that the novel ingredients of ABC allow it to match or exceed PageRank and BookRank in reproducing the empirical distributions of page traffic, link traffic, and popularity of session starting pages, while outperforming both PageRank and BookRank in modeling user traffic entropy and the size and depth of sessions.
2. BACKGROUND
There have been many empirical studies of Web traffic patterns. The most common approach is the analysis of Web server logs. These have ranged from small samples of users from a few arbitrarily selected Web server logs [16] to large samples of users from the logs of large organizations [14]. One advantage of this methodology is that it allows us to distinguish individual users through their IP addresses (even if they may be anonymized), thus capturing individual traffic patterns [14]. Conversely, the methodology has the drawback of biasing both the sample of users and the sample of the Web graph being observed based on the choice of target server.

An alternative source of Web traffic data is browser toolbars, which gather traffic information based on the surfing activity of many users. While the population is larger in this scenario, it is still biased toward users who have opted to install a particular piece of software. Moreover, traffic information from toolbars is not generally available to researchers. Adar et al. [1] used this approach to study patterns of revisitation to pages, but did not consider whether pages are revisited within the same session or across different sessions. A related approach is to identify a panel of users based on desired characteristics and ask them to install tracking software. This eliminates many sources of bias but incurs significant experimental costs. Such an approach has been used to describe the exploratory behavior of Web surfers [4]. These studies did not propose models to explain the observed traffic patterns.

The methodology adopted in the study reported here captures traffic data directly from a running network. This approach was first adopted by Qiu et al. [27], who used captured HTTP packet traces in the UCLA computer science department to investigate how browsing behavior is driven by search engines. Our study relies on a larger sample of users.

One of the important traffic features we study here is the statistical characterization of browsing sessions. A common assumption is that long pauses correspond to breaks between sessions. Based on this assumption, many researchers have relied on timeouts as a way of defining sessions, a technique we have recently found to be flawed [19]. This has led to the definition of time-independent logical sessions, based on the reconstruction of session trees rooted at pages requested without a referrer. The model presented here is in part aimed at explaining the broad distributions of size and depth empirically observed for these logical sessions.

An aspect of Web traffic that has not received much attention in the literature is the role of page content in driving users' browsing patterns. A notable exception is a study of the correlation between changes in page content and revisit patterns [2].

On the modeling side, the most basic notion of Web navigation is that users move erratically, performing a random walk through the pages of the Web graph. PageRank [6] is a random walk modified by the process of teleportation (random jumps), which models users starting new browsing sessions via a Poissonian process with uniformly random starting points. This Markovian process has no memory, no way to backtrack, and no notion of user interests or page content. The stationary distribution of visitation frequency generated by PageRank can be compared with empirical traffic data. We have shown that the fundamental assumptions underlying PageRank, namely uniform link selection and uniform teleportation sources and targets, are all violated by actual user behavior, making PageRank a poor model of actual users [20]. Such results leave open the question of how to design a better Web navigation model; that is the goal of the present paper, and we use such a random walk as a null model against which to evaluate our alternative.

More realistic models have been introduced in recent years to capture potentially relevant features of real Web browsing behavior, such as the back button [18, 5]. There have also been attempts to model the interplay between user interests and page content in shaping browsing patterns. Huberman et al. proposed a model in which the pages visited by a user have interest values described by a random walk; navigation continues as long as the current page has a value above a threshold [15]. This kind of model is closely related to algorithms designed to improve topical crawlers [21, 24, 25].

We previously proposed a model in which users maintain a list of bookmarks from which they start new sessions, providing memory of previously visited pages [3]. We called this model BookRank, since the bookmark selection is carried out according to a ranking based on the frequency of visits to each bookmark. This model is able to reproduce a fair number of characteristics observed in empirical traffic data, including the page and link traffic distributions. Unfortunately, BookRank fails to account for features related to the navigation patterns of individual users, such as entropy and session characteristics. This failure is not remedied by the introduction of a back button into the model. In the remainder of this paper, we extend the BookRank model to address these shortcomings.
3. EMPIRICAL TRAFFIC DATA

3.1 Data acquisition
The HTTP request data we use in this study was gathered from a dedicated FreeBSD server located in the central routing facility of the Bloomington campus of Indiana University [19]. This system had a 1 Gbps Ethernet port that received a mirror of all outbound network traffic from one of the undergraduate dormitories. This dormitory is home to just over a thousand undergraduates, split roughly evenly between men and women. To the best of our knowledge, this is the largest population sample whose every click has been recorded and studied over an extended period of time.

To identify individual requests, we first capture only packets destined for TCP port 80. While this does eliminate Web traffic running on non-standard ports, it allows for an improved rate of capture that more than offsets the lost data. We make no attempt to capture or analyze encrypted (HTTPS) traffic on TCP port 443. For each packet we capture, we use a regular expression search to determine whether it contains an HTTP GET request. If we do find a request, we analyze the packet further and log the request, noting the MAC address of the client, a timestamp, the virtual host and path requested, the referring URL, and a flag indicating whether the user agent matches a mainstream browser. We record the MAC address only in order to distinguish the traffic of individual users. We thus assume that most computers have a single primary user, which is reasonable: most students own computers, and only a few public workstations are available in the dormitory. Furthermore, as long as users do not replace their network interface cards, this information remains constant.

The aggregate traffic was low enough to permit collection at full rate without dropping packets. While this collection system offers a rare opportunity to capture the complete browsing activity of a large population, we do recognize some potential disadvantages. Because we do not perform TCP stream reassembly, we can only analyze requests that fit in a single Ethernet frame. The vast majority of requests do so, but some GET-based Web services do generate extremely long URLs. Without stream reassembly, we also cannot log the Web server's response to each request, leaving us unaware of failed requests and redirects. A user can spoof the HTTP referrer field; we assume that few students do so. Finally, although they are in a residential setting, the students are at an academic institution and represent a biased sample of the population of Web users at large. This is an inevitable consequence of any local study of a global and diverse system such as the Web.

The click data was collected over a period of about two months, from March 5, 2008 through May 3, 2008. This period included a week-long vacation during which no students were present in the building. During the full data collection period, we detected nearly 408 million HTTP requests from a total of 1,083 unique MAC addresses.

Only a minority of HTTP requests reflect an actual human being trying to fetch a Web page for display. We retain only requests that are likely to be for actual Web pages, as opposed to media files, style sheets, Javascript code, images, and so forth. We make this determination based on the extension of the URL requested, which is imprecise but a reasonable heuristic in the absence of access to the MIME type of the server response. We also filtered out a small subset of users with negligible (mostly automated) activity. Finally, we removed some spoofed requests generated by an anonymization service that attempted to obscure traffic to an adult chat site.
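For illustration, the request filter just described can be sketched as follows. This is a minimal sketch rather than the collection code itself; the regular expression and the extension list are simplified stand-ins for the study's actual heuristics.

    import re

    # Simplified stand-in for the GET detector described above; the real
    # system scans raw packet payloads captured on TCP port 80.
    GET_RE = re.compile(rb"^GET\s+(\S+)\s+HTTP/1\.[01]", re.MULTILINE)

    # Illustrative extension blacklist; the actual list of non-page
    # resource types used in the study is not reproduced here.
    NON_PAGE_EXT = {".css", ".js", ".gif", ".jpg", ".jpeg", ".png", ".ico"}

    def extract_path(payload: bytes):
        """Return the requested path if the payload holds an HTTP GET, else None."""
        m = GET_RE.search(payload)
        return m.group(1).decode("ascii", "replace") if m else None

    def is_page_request(path: str) -> bool:
        """Extension-based heuristic for 'actual Web page' requests."""
        base = path.split("?", 1)[0]   # query parameters are stripped anyway
        dot = base.rfind(".")
        return dot == -1 or base[dot:].lower() not in NON_PAGE_EXT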
Table 1: Approximate dimensions of the filtered and anonymized data set.

    Page requests             29,494,409
    Unique users                     967
    Unique URLs                2,503,002
    Unique target URLs         2,084,031
    Unique source URLs           864,420
    Number of sessions        11,174,254
    Mean sessions per user        11,556

Privacy concerns and our agreement with the Human Subjects Committee of our institution also obliged us to strip all identifiable query parameters from the URLs. Applying this anonymization procedure affects roughly one-third of the remaining requests. This procedure means that two URLs with different CGI variables will be treated as the same. While this is a mistaken assumption for sites in which the identity of the page being requested is a query parameter, it helps in the common case that the parameters affect some content within a largely static framework.

Once we have a filtered set of HTTP requests ("clicks"), we organize each user's clicks into a set of sessions. These sessions are not based on a simple timeout threshold; our prior work demonstrates that most statistics of timeout-based sessions are functions of the particular timeout used, which turns out to be arbitrary [19]. Instead, we organize the clicks into tree-based logical sessions using the referrer information associated with each request, according to an algorithm described formally in our previous work [19]. The basic notions are that new sessions are initiated by requests with an empty referrer field; that each request represents a directed edge from a referring URL to a target URL; and that requests are assigned to the session in which their referring URL was most recently requested.

The session trees built in this way offer several advantages. First, they mimic the multitasking behavior of users in the age of tabbed browsing: a user may have several active sessions at a time. Second, the key properties of these session trees, such as size and depth, are relatively insensitive to an additional timeout constraint introduced for the sake of plausibility [19]. In the current analysis, we impose a half-hour timeout as we form the sessions: a click cannot be associated with a session tree that has not received additional requests within thirty minutes.

Most importantly, the tree structure allows us to infer how users backtrack as they browse. Because modern browsers employ sophisticated caching mechanisms to improve performance, unless overridden by HTTP options, a browser will generally not issue another request for a recently accessed page. This prevents us from observing multiple links pointing to the same page (within a single logical session) and gives us no direct way of determining when the user presses the back button. However, session trees allow us to infer information about backward traffic: if the next request in the tree comes from a URL other than the most recently visited one, the user must have navigated back to that page or opened it in a separate tab.

The dimensions of the resulting data set are shown in Table 1. In § 3.2, we present the most relevant properties of these data for the discussion that follows; a more detailed analysis of the empirical sessions can be found in [19].
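The assembly of logical sessions lends itself to a compact sketch. The version below is a simplification of the algorithm formalized in [19], with hypothetical field names: a request with an empty referrer opens a new tree, each request adds an edge from its referrer to its target, a click joins the session in which its referrer was most recently seen, and a tree idle for more than thirty minutes accepts no further clicks.

    from dataclasses import dataclass, field

    TIMEOUT = 30 * 60  # half-hour timeout, in seconds

    @dataclass
    class Session:
        root: str
        last_time: float
        edges: list = field(default_factory=list)   # (referrer, target) pairs

    def build_sessions(clicks):
        """clicks: time-ordered (timestamp, referrer, target) tuples for one
        user; referrer is None for empty-referrer requests."""
        sessions = []
        last_seen = {}   # URL -> session in which it was most recently requested
        for t, ref, url in clicks:
            sess = last_seen.get(ref) if ref else None
            if sess is None or t - sess.last_time > TIMEOUT:
                sess = Session(root=url, last_time=t)   # start a new session tree
                sessions.append(sess)
            else:
                sess.edges.append((ref, url))
            sess.last_time = t
            last_seen[url] = sess
        return sessions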
3.2 Data descriptors

Any statistical description strives to achieve a compromise between the need to summarize the behavior of the data and the need to describe that behavior accurately. In the case of many human activities, including those on the Web, we know that the data do not behave in a normal (Gaussian) fashion, but rather follow heavy-tailed distributions best approximated by power laws [7, 20]. In many cases, the mean and median are not a sufficient description of the data, as shown by a large and diverging variance and heavy skew. The next best description of any quantity is a histogram of its values. We therefore present these distributions in terms of their estimated probability density functions rather than measures of central tendency. To characterize the properties of our traffic data and evaluate the models proposed later in this paper, we focus on the distributions of the six quantities outlined below.
Page traffic
The total number of visits to each page. Because of caching mechanisms, most revisits to a page by a single user beyond the first visit within each session are not represented in the data.

Link traffic
The total number of times each link between pages has been traversed by a user, as identified by the referring and destination URLs in each request. Again, because of caching behavior, we typically observe only the first click to a destination page within each session.

Empty referrer traffic
The number of times each page is used to initiate a new session. We assume that a request without a referring page corresponds to the user initiating a new session by using a bookmark, opening a link from another application, or manually entering an address.

Entropy
Shannon information entropy. For an individual user j, the entropy is defined as S_j = −Σ_i ρ_ij log ρ_ij, where ρ_ij is the fraction of visits of user j to site i, aggregated across sessions. (A computational sketch follows this list.)

Session size
The number of unique pages visited in a logical session tree.

Session depth
The maximum tree distance between the starting page of a session and any page visited within the same session. (Recall that session graphs have a tree-like structure because requests that go back to a previously visited page are usually served from the browser cache.)

We have already characterized some of these distributions in preliminary work [3, 19]. Another feature sometimes used to characterize random browsing behavior is the distribution of return times, in this case the number of clicks between two consecutive visits to the same page by a given user [14, 3]. However, cache behavior and overlapping sessions mean that this information cannot be retrieved reliably from the empirical data.
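To make the entropy descriptor concrete, the following minimal sketch computes S_j from a user's sequence of visited sites; the choice of the natural logarithm is ours, as the text does not fix the base.

    from collections import Counter
    from math import log

    def user_entropy(visits):
        """S_j = -sum_i rho_ij * log(rho_ij), where rho_ij is the fraction of
        user j's visits that went to site i, aggregated across sessions."""
        counts = Counter(visits)
        total = sum(counts.values())
        return -sum(n / total * log(n / total) for n in counts.values())

    # Two users with the same number of visits but very different focus:
    user_entropy(["a", "b", "c", "d"])   # maximal: log(4) = 1.386...
    user_entropy(["a", "a", "a", "a"])   # minimal: 0.0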
3.3 Reference models

To properly analyze these distributions, we compare them with those generated by two reference models based on PageRank-like modified random walkers with a constant teleportation probability p_t. To obtain a reference model for traffic data that is grounded in individuals, we imagine a population of PageRank random walkers, as many as the users in our study. The first reference model (PageRank) is illustrated in Fig. 1. Each walker browses for as many sessions as there were empirical sessions for the corresponding real-world user. The PageRank sessions are terminated by the constant-probability jumps, so the total number of pages visited by a walker may differ from that of the corresponding user. Teleportation jumps lead to session-starting pages selected uniformly at random.

Figure 1: Schematic illustration of the PageRank model.

Figure 2: Schematic illustration of the BookRank model.

The second reference model (BookRank) is illustrated in Fig. 2. The key realistic ingredient that differentiates this model from PageRank is memory: agents maintain individual lists of bookmarks that are chosen as teleportation targets based on the number of previous visits. Initially, each agent randomly selects a starting page (node). Then, agents navigate the Web by repeating the following steps:

1. With probability 1 − p_t, the agent navigates locally, following a link from the present node selected with uniform probability. Unless previously visited, the new node is added to the bookmark list. The frequency of visits is recorded, and the list of bookmarks is kept ranked from most to least visited.

2. Otherwise, with probability p_t, the agent teleports (jumps) to a previously visited page (bookmark). The bookmark with rank R is chosen with probability P(R) ∝ R^−β.

This mechanism mimics the use of frequency ranking in various features of modern browsers, such as URL completion in the address bar and suggested starting pages in new windows. The functional form of P(R) for the bookmark choice is motivated by data on selection among ranked lists of search results [13].

In our simulations, browsing occurs on scale-free networks with N nodes and degree distribution P(k) ∼ k^−γ, generated according to the growth model of Fortunato et al. [29]. We used a large graph with N = 10^7 nodes to ensure that the network would be larger than the number of pages visited in the empirical data (cf. Table 1). We also set γ = 2.1 to match our data set. This graph is constructed with symmetric links to prevent dangling links; as a result, each node's in-degree is equal to its out-degree.

Within a reference model's session, we simulate the browser's cache by recording traffic for links and pages only when the target page has not been previously visited in the same session. This way we can measure in the models the number of unique pages visited in a session, which we can compare with the empirical session size. We assume that cached pages are reset between sessions.
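To fix ideas, here is a minimal sketch of a single BookRank walker. It is an illustration rather than our simulation code: the graph can be any symmetric adjacency structure without dangling nodes, and the default values of p_t and β are placeholders rather than the values fitted from the data in § 5.

    import random

    def pick_bookmark(bookmarks, beta):
        """Choose a bookmark of rank R (1 = most visited) with P(R) ~ R^(-beta)."""
        ranked = sorted(bookmarks, key=bookmarks.get, reverse=True)
        weights = [(r + 1) ** -beta for r in range(len(ranked))]
        return random.choices(ranked, weights=weights)[0]

    def bookrank_walk(graph, n_clicks, p_t=0.15, beta=1.25):
        """graph: dict mapping each node to its neighbor list (symmetric,
        no dangling nodes). Returns the walker's bookmark visit counts."""
        node = random.choice(list(graph))
        bookmarks = {node: 1}                       # node -> number of visits
        for _ in range(n_clicks):
            if random.random() < p_t:
                node = pick_bookmark(bookmarks, beta)   # teleport: new session
            else:
                node = random.choice(graph[node])       # uniform local link
            bookmarks[node] = bookmarks.get(node, 0) + 1
        return bookmarks

    # Example on a tiny symmetric graph:
    g = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
    visits = bookrank_walk(g, n_clicks=1000)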
Figure 3: Empirical distribution of page traffic versus baselines.

Figure 4: Empirical distribution of link traffic versus baselines.
We first consider the aggregate distribution of traffic received by individual pages, as shown in Fig. 3. The empirical data show a very broad power-law distribution of page traffic, P(T) ∼ T^−α, with an exponent α smaller than 2, consistent with our prior results for host-level traffic [20, 19]. The fact that α < 2 is significant: in this case, both the variance and the mean of the distribution diverge in the limit of an infinite-size network.

Theoretical arguments [26] suggest that PageRank should behave in a similar fashion. If we disregard teleportation, a node of in-degree k may expect a visit if one of its neighbors was visited in the previous step. The traffic it receives will therefore be proportional to its degree, if no degree-degree correlations are present in the graph. This intuition, as well as prior empirical results [20], leads us to expect that PageRank's prediction of the distribution of traffic received by a Web page is described by a power law P(T) ∼ T^−α, where α ≈ 2.1 is the same exponent observed in the distribution of the in-degree [7, 12]. Indeed, this is consistent with the distribution generated by the PageRank reference model in Fig. 3. The traffic generated by BookRank, on the other hand, is biased toward previously visited pages (bookmarks) and therefore has a broader distribution (by three orders of magnitude), in better agreement with the empirical data, as shown in Fig. 3.

The distribution of weights ω across links between pages allows us to consider the diversity of traffic crossing each hyperlink in the Web graph. In Fig. 4, we compare the distribution of link traffic resulting from the reference models with that from the empirical data. The data reveal a very wide power law for P(ω), consistent with our prior results for host-level traffic [20].

The comparison with PageRank and BookRank in Fig. 4 is a vivid illustration of the diversity of links when we consider their probability of actually being clicked. A rough argument may again help make sense of the PageRank reference model's poor performance at reproducing the data. If we disregard teleportation, the traffic to a page is roughly proportional to the page's in-degree. The traffic expected on a link would thus be proportional to the traffic of the originating page and inversely proportional to its out-degree, if we assume that links are chosen uniformly at random. Since a node's in-degree and out-degree are equal in our simulated graphs, this leads to link traffic that is independent of degree and therefore essentially constant across links. This is reflected in the quickly decaying distribution of link traffic for PageRank. In the case of BookRank, the stronger heterogeneity in the probability of visiting pages is reflected in a heterogeneous choice of links, resulting in a broad distribution that fits the empirical data well, as shown in Fig. 4.

Figure 5: Empirical distribution of traffic originating from jumps (page requests with empty referrer) versus baselines.

Figure 6: Empirical distribution of user entropy versus baselines.

Our empirical data in Fig. 5 show clearly that not all pages are equally likely to be chosen as the starting point of a browsing session. Their popularity as starting points is roughly distributed as a power law with an exponent close to 1.8 (consistent with prior results for host-level traffic [20]), implying a diverging variance and mean as the number of sessions considered increases. While not unexpected from a qualitative point of view, this demonstrates how far off the mark is one of the basic hypotheses underlying the PageRank class of browsing processes, namely uniform teleportation. PageRank assumes a uniform probability for a page to be chosen as a starting point, and its failure to reproduce the empirical data is evident in Fig. 5. The bookmarking mechanism, on the other hand, captures the non-uniform popularity of starting pages well, so that the distribution generated by BookRank is a good match to the empirical data, as shown in Fig. 5.

We now turn from the aggregate properties of the system to the characterization of individual users. The simplest hypothesis would be that the broad distributions characterizing aggregate behavior simply reflect extreme variability within the traffic generated by single users, leading to the conclusion that there is no such thing as a "typical" user from the point of view of traffic generation. To capture how diverse the behavior within a group of users is, we adopt the Shannon information entropy of a user as defined above. Entropy directly measures the focus of a user's interests, offering a better probe into single-user behavior than, for instance, the number of distinct pages visited; two users who have visited the same number of pages can have very different entropy. Given an arbitrary number of visits N_v, the entropy is maximal (S = log N_v) when N_v distinct pages are each visited once, and minimal (S = 0) when all visits are to a single page. The distribution of entropy across users is shown in Fig. 6. We observe that the reference PageRank model produces higher entropy than observed in the empirical data. This can be interpreted by the way a PageRank walker picks starting pages with uniform probability, whereas a real user is more likely to start from a previously visited page, and therefore to revisit neighboring pages. BookRank is closer to such repetitive behavior, and indeed we observe lower entropy values in Fig. 6. However, BookRank underestimates the entropy as well as its variability across users.

Figure 7: Empirical distribution of session size (unique pages per session) versus baselines.

Figure 8: Empirical distribution of session depth versus baselines.

Finally, we consider the distributions that characterize logical sessions, namely the size (number of unique pages) and depth (distance from a session's starting page) distributions. Figs. 7 and 8 show that both empirical distributions are rather broad, spanning three orders of magnitude, with a surprisingly large proportion of very long sessions. In contrast, both the PageRank and BookRank reference models generate very short sessions. The probabilistic teleportation mechanism that determines when a PageRank walker starts a new session is incapable of capturing broadly distributed session sizes. In fact, session size is upper-bounded by the length ℓ (number of clicks) of a session, which exhibits a narrow, exponential distribution P(ℓ) ∼ (1 − p_t)^ℓ. Note that the exponentially short sessions are not inconsistent with the high entropy of PageRank walkers (Fig. 6), which results from frequent jumps to random targets rather than from browsing behavior.
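The exponential bound follows from a one-line calculation: a PageRank session survives each click with probability 1 − p_t, so the number of clicks ℓ before a teleportation jump is geometrically distributed:

    P(\ell) = p_t \,(1-p_t)^{\ell} \sim (1-p_t)^{\ell},
    \qquad
    \langle \ell \rangle = \sum_{\ell \ge 0} \ell \, p_t (1-p_t)^{\ell}
                         = \frac{1-p_t}{p_t}.

Both the mean and the variance of ℓ are therefore finite, in sharp contrast with the heavy-tailed empirical session-size distribution.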
4. ABC MODEL
The empirical analysis in the previous section demonstrates that a more sophisticated model of user behavior is needed to capture individual navigation patterns. We build upon the BookRank model by adding two additional ingredients.

First, we provide agents with a back button. A backtracking mechanism is needed to capture the tree-like structure of sessions (see also the top row of Fig. 9). Our data also indicate that the incoming and outgoing traffic of a site are seldom equal. Indeed, the ratio between incoming and outgoing clicks is distributed over many orders of magnitude [20]. This violation of flow conservation cannot be explained by teleportation alone, demonstrating that users' browsing sessions have many branches. Finally, our prior results show that the average node-to-depth ratio of session trees is almost two. All of these observations are consistent with the use of tabs and the back button. Other studies have shown that the back button is used frequently [9, 30]. We therefore use the back button to model any branching behavior.

The second ingredient addresses the fact that the BookRank model fails to predict individual statistics: all agents are identical, session size has a narrow, exponential distribution, and the comparison with the empirical entropy distribution is unsatisfactory. In the real world, the duration of a session depends on the intentions (goals) and interests of a user, and different users have different interests. Visiting relevant pages, those whose topics match the user's interests, will lead to more clicks and thus longer sessions. We therefore introduce agents with distinct interests, and page topicality, into the model. The idea is that an agent spends some attention when navigating to a new page, and attention is gained by visiting pages whose topics match the user's interests. To model this process, we imagine that each agent stores some "energy" (units of attention) while browsing. Visiting a new page incurs a higher energy cost than going back to a previously visited page. Known pages yield no energy, while unseen pages may increase the energy store by a random amount that depends on the page's relevance to the agent. Agents continue to browse until they run out of energy, whereupon they start a new session.
We call the resulting model ABC for its main ingredients: agents, bookmarks, and clicks. Clicks are driven by the topicality of pages and by agent interests, in a way that is in part inspired by the InfoSpiders algorithms for topical crawlers [21, 24, 25]. InfoSpiders were designed to explore the Web graph in an adaptive and intelligent fashion, driven by the similarity between search topics and page content. Better matches led to more energy and more exploration of local link neighborhoods. Irrelevant pages led to agents running out of energy and dying, so that resources would be allocated to more promising neighborhoods. In ABC, this idea is used to model browsing behavior.

Figure 9: Representation of a few typical and representative session trees from the empirical data (top) and from the ABC model (bottom). Animations are available at cnets.indiana.edu/groups/nan/webtraffic.

Figure 10: Schematic illustration of the ABC model.

The ABC model is illustrated in Fig. 10. Each agent starts at a random page with an initial amount of energy E_0. Then, for each time step:

1. If E ≤ 0, the agent starts a new session by teleporting to a bookmark chosen as in BookRank.

2. Otherwise, if E > 0, the agent continues the current session, following a link from the present node. There are two alternatives:

(a) With probability p_b, the back button is used, leading back to the previous page. The agent's energy is decreased by a fixed cost c_b.

(b) Otherwise, with probability 1 − p_b, a forward link is clicked with uniform probability. The agent's energy is updated to E − c_f + ∆, where c_f is a fixed cost and ∆ is a stochastic value representing the new page's relevance to the user. As in BookRank, the bookmark list is updated with new pages and ranked by visit frequency.

The dynamic variable ∆ in the ABC model is a measure of the relevance of a page to a user's interests. The simplest way to model relevance is with a random variable, for example one drawn from a Gaussian distribution. In this case the amount of stored energy behaves as a random walk, and it has been shown that the session duration ℓ (the number of clicks until the random walk reaches E = 0) has a power-law tail P(ℓ) ∼ ℓ^−3/2 [15]. However, our empirical results suggest a larger exponent [19]. More importantly, we know from empirical studies that the content similarity between two Web pages is correlated with their distance in the link graph, and so is the probability that a page is relevant with respect to a given topic [10, 23, 22]. Two neighboring pages are therefore likely to be topically related, and the relevance of a page t to a user is related to the relevance of a page r that links to t. To capture such topical locality, we introduce correlations between the ∆ values of consecutively visited pages. For the starting page we use an initial value ∆ = 1. Then, when a page t is visited for the first time in a given session, ∆_t is determined by

∆_t = ∆_r (1 + ε),

where r is the referrer page, ε is a random variable uniformly distributed in [−η, η], and η is a parameter controlling the degree of topical locality. In a new session we assume a page can again be interesting, and thus provide the agent with energy, even if it was visited in a previous session. However, the same page will yield different energy in different sessions, reflecting changing user interests.
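The per-session dynamics translate into the short routine below. This is a simplified sketch, not the study's simulation code: it reuses pick_bookmark from the BookRank sketch above, tracks the back button with an explicit history stack of our own devising, assumes a symmetric graph without dangling nodes, and uses placeholder parameter defaults (in particular the value of η).

    import random

    def abc_session(graph, bookmarks, E0=0.5, c_f=1.0, c_b=0.5,
                    p_b=0.5, eta=0.15, beta=1.25):
        """Simulate one ABC session; returns the set of unique pages visited."""
        start = pick_bookmark(bookmarks, beta) if bookmarks else random.choice(list(graph))
        bookmarks[start] = bookmarks.get(start, 0) + 1
        node, energy = start, E0
        relevance = {start: 1.0}            # Delta = 1 for the starting page
        history, visited = [], {start}
        while energy > 0:
            if history and random.random() < p_b:
                node = history.pop()        # back button: fixed cost, no gain
                energy -= c_b
            else:
                history.append(node)        # remember referrer for backtracking
                node = random.choice(graph[node])       # uniform forward click
                if node in relevance:       # already seen in this session:
                    energy -= c_f           # known pages yield no energy
                else:                       # first visit in this session:
                    eps = random.uniform(-eta, eta)     # topical locality
                    relevance[node] = relevance[history[-1]] * (1 + eps)
                    energy += relevance[node] - c_f     # E <- E - c_f + Delta
                bookmarks[node] = bookmarks.get(node, 0) + 1
                visited.add(node)
        return visited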
5. MODEL EVALUATION

5.1 Simulation of ABC model
We ran two sets of simulations of the ABC model, in which agents navigate two distinct scale-free graphs. The first (G1) is the artificial network discussed in § 3.3; recall that it has N = 10^7 nodes and a power-law degree distribution with exponent γ = 2.1, matching our data set. The second graph (G2) is derived from an independent, empirical, anonymous traffic data set, obtained by extracting the largest strongly connected component from the traffic network generated by the entire Indiana University population (about 100,000 people) [20]. This way there are no dangling links, while the nodes correspond to actual visited pages and the edges to actual traversed links. G2 is based on three weeks of traffic in November 2009; it has roughly 8 × 10^5 nodes and the same degree distribution, with exponent γ ≈ 2.1.

Within each session, we simulate the browser's cache as discussed in § 3.3, so that we can measure the number of unique pages visited by the model agents and compare it with the empirical session size.

The proposed models have various parameters. In prior work [29], we have shown that the distribution of traffic with empty referrer generated by our models is related to the parameter β (cf. the BookRank description in § 3.3): the distribution is well approximated by a power law P(T) ∼ T^−α, where α = 1 + 1/β. To match the empirical exponent α ≈ 1.8, we set β = 1/(α − 1) = 1.25. We also fit the back button probability p_b = 0.5 from the data.

The ABC model contains a few additional free parameters: the initial energy E_0, the forward and backward costs c_f and c_b, and the topical locality parameter η. The initial energy and the costs are closely related, and together they control session durations. We therefore set E_0 = 0.5 arbitrarily and use an energy balance argument to find suitable values of the costs. Empirically, the average session size is close to two pages. The expected net energy loss per click of an agent is δE = p_b c_b + (1 − p_b)(c_f − ⟨∆⟩), where ⟨∆⟩ = 1 is the expected energy yielded by a new page. By setting c_f = 1 and c_b = 0.5, we obtain an expected session size 1 + (1 − p_b) E_0/δE = 2 (counting the initial page). In general, higher costs lead to shorter sessions and lower entropy.
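For concreteness, with these settings the energy balance works out as:

    \delta E = p_b c_b + (1 - p_b)\,(c_f - \langle\Delta\rangle)
             = 0.5 \cdot 0.5 + 0.5 \cdot (1 - 1) = 0.25,
    \qquad
    1 + (1 - p_b)\,\frac{E_0}{\delta E} = 1 + 0.5 \cdot \frac{0.5}{0.25} = 2,

so a session lasts E_0/δE = 2 clicks on average, one of which is a forward click to a new page, giving the target session size of two unique pages.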
We ran a number of simulations to explore the sensitivity of the model to the parameter η, settling on an intermediate value: smaller values mean that all pages have similar relevance, and the session size and depth distributions become too narrow, while larger values imply more noise (absence of topical locality), and the session distributions become too broad. The results shown below refer to this combination of parameters.

The number of users in the simulation, and the number of sessions for each user, are taken from the empirical data. Because the model is computationally intensive, we partitioned the simulated users into work queues of roughly equal session counts, which we executed in parallel on a high-performance computing cluster.

Figure 11: Distribution of page traffic generated by the ABC model versus data and baseline.

Figure 12: Distribution of link traffic generated by the ABC model versus data and baseline.

5.2 Results

The simulations of the ABC model generate session trees that can be compared visually to those in the empirical data, as shown in Fig. 9. For a more quantitative evaluation of our model, we compare its results with the empirical findings described in § 3. For each of the distributions discussed earlier, we also compare ABC with the reference BookRank model; the latter is simulated on the artificial G1 network.

A first aspect to check is whether the model is able to reproduce the general features of the traffic distributions. In Fig. 11 we plot the number of visits received by each page. Agreement between the ABC model and the data is as good as or better than for the BookRank reference model. Similarly, the distributions of link traffic (Fig. 12) and teleportation traffic (Fig. 13) show that the ABC model reproduces the empirical data as accurately as BookRank.

The good agreement of both the BookRank and ABC models with the data provides further support for our hypothesis that rank-based bookmark choice is a sound cognitive mechanism for capturing session behavior in Web browsing.

Let us now consider how our model captures the behavior of single users. The entropy distribution across users is shown in Fig. 14, where the model predictions are compared with the distribution found in the empirical data.
Figure 13: Distribution of traffic originating from jumps (page requests with empty referrer) generated by the ABC model versus data and baseline.

Figure 14: Distribution of user entropy generated by the ABC model versus data and baseline.

The ABC model yields entropy distributions that are somewhat sensitive to the underlying network, but that in any case fit the empirical entropy data much better than BookRank, in terms of both the location of the peak and the variability across users. This result suggests that bookmark memory, the back button, and topicality are crucial ingredients in explaining the focused habits of real users.

Having characterized traffic patterns by aggregating across user sessions, we can also study the sessions one by one and analyze their statistical properties. In Fig. 15, we show the distribution of session size as generated by the ABC model. The user-interest and topical-locality ingredients account for the broad distribution of session size, capturing that of the empirical data much better than the short sessions generated by the BookRank reference model. Agents visiting relevant pages tend to keep browsing, and relevant pages tend to lead to other interesting pages, explaining the longer sessions. We argue that the diversity apparent in the aggregate measures of traffic is a consequence of this diversity of individual interests rather than the behavior of extremely eclectic users who visit a wide variety of Web sites, as shown by the narrow distribution of entropy.

Figure 15: Distribution of session size (unique pages per session) generated by the ABC model versus data and baseline. The distribution generated by the BookRank model cannot be distinguished from that of the PageRank baseline because they both use an identical exponential model of sessions.

Figure 16: Distribution of session depth generated by the ABC model versus data and baseline.

The entropy distribution discussed above depends not only on session length, but also on how far each user navigates away from the initial bookmark where a session is initiated. One way of analyzing this is through the distribution of session depth, shown in Fig. 16. The agreement between the empirical data and the ABC model is excellent, and significantly better than that observed with the BookRank baseline. Once again, topicality is shown to be a key ingredient in understanding real user behavior on the Web.
6. CONCLUSIONS
Several previous studies have shown that memoryless Markovian processes, such as PageRank, cannot explain many patterns observed in real Web browsing: in particular, the diversity of session starting points, the global diversity of link traffic, and the heterogeneity of session sizes. The picture is further complicated by the fact that, despite such diverse aggregate measurements, individual behaviors are quite focused. These observations call for a non-Markovian agent-based model that can help explain the empirical data by taking into account several realistic browsing behaviors.

Here we proposed three novel ingredients for such a model. First, agents maintain individual lists of bookmarks (a memory mechanism) that are used as teleportation targets. Second, agents have access to a back button (a branching mechanism) that also allows us to reproduce tabbed browsing behavior. Finally, agents have topical interests that are matched by page content, modulating the probability that an agent continues to browse or starts a new session and thus allowing us to capture heterogeneous session sizes.

We have shown that the resulting ABC model is capable of reproducing with remarkable accuracy the aggregate traffic patterns we observe in our empirical data. More importantly, our model offers the first account of a mechanism that can generate key properties of logical sessions. This allows us to argue that the diversity apparent in page, link, and bookmark traffic is a consequence of the diversity of individual interests rather than the behavior of very eclectic users. Our model is able to capture, for the first time, the extreme heterogeneity of aggregate traffic measurements while explaining the narrowly focused browsing patterns of individual users.

Of course, the ABC model is more complex than prior models such as PageRank or even BookRank. However, its greater predictive power suggests that bookmarks, tabbed browsing, and topicality are salient features in interpreting how we browse the Web. In addition to the descriptive and explanatory power of an agent-based model with these ingredients, our results may lead the way to more sophisticated, realistic, and hence more effective ranking and crawling algorithms.

The ABC model relies on several key parameters. While we have attempted to make reasonable, realistic choices for some of these parameters and explored the sensitivity of our model with respect to others, further work is needed to achieve a complete picture of the combined effect of the multiple parameters. We already know, for example, that parameters such as network size, costs, and topical locality play a key role in modulating the balance between individual diversity (entropy) and session size.

While, in its current incarnation, the ABC model is a clear step in the right direction, it still shares some of the limitations present in previous efforts. The most notable example is the uniform choice among the outgoing links of a page, which may be responsible for the imperfect match between the individual entropy values of our model agents and those of actual users.

Future work can also explore intrinsic, node-dependent jump probabilities to model the varying intrinsic relevance that users attribute to sites; for example, well-known sites such as CNN or Wikipedia are likely to be seen as more reliable or credible than unknown personal blogs. Restrictions on the subset of nodes reachable by each user, in the form of disconnected components for individual sessions, could be used to model different areas of interest.
Acknowledgments
The authors would like to thank the Advanced Network Management Laboratory and the Center for Complex Networks and Systems Research, both parts of the Pervasive Technology Institute at Indiana University, and L. J. Camp of the IU School of Informatics and Computing, for support and infrastructure. We also thank the network engineers of Indiana University for their support in deploying and managing the data collection system. This work was produced in part with support from the Institute for Information Infrastructure Protection research program. The I3P is managed by Dartmouth College and supported under Award 2003-TK-TX-0003 from the U.S. DHS, Science and Technology Directorate. BG was supported in part by grant NIH-1R21DA024259 from the National Institutes of Health. JJR is funded by the project 233847-Dynanets of the European Union Commission. This material is based upon work supported by NSF award 0705676. This work was also supported in part by a gift from Google. Opinions, findings, conclusions, recommendations, or points of view in this document are those of the authors and do not necessarily represent the official position of the U.S. Department of Homeland Security, Science and Technology Directorate, I3P, National Science Foundation, Indiana University, Google, or Dartmouth College.
7. REFERENCES

[1] E. Adar, J. Teevan, and S. Dumais. Large scale analysis of web revisitation patterns. In Proc. CHI, 2008.
[2] E. Adar, J. Teevan, and S. Dumais. Resonance on the web: Web dynamics and revisitation patterns. In Proc. CHI, 2009.
[3] B. Gonçalves, M. R. Meiss, J. J. Ramasco, A. Flammini, and F. Menczer. Remembering what we like: Toward an agent-based model of Web traffic. Late Breaking Results, WSDM, 2009.
[4] T. Beauvisage. The dynamics of personal territories on the web. In Proc. 20th ACM Conference on Hypertext and Hypermedia (HT), pages 25–34, 2009.
[5] M. Bouklit and F. Mathieu. BackRank: an alternative for PageRank? In Proc. WWW Special Interest Tracks and Posters, pages 1122–1123, 2005.
[6] S. Brin and L. Page. The anatomy of a large-scale hypertextual Web search engine. Computer Networks, 30(1–7):107–117, 1998.
[7] A. Broder, S. R. Kumar, F. Maghoul, P. Raghavan, S. Rajagopalan, R. Stata, A. Tomkins, and J. Wiener. Graph structure in the Web. Computer Networks, 33(1–6):309–320, 2000.
[8] J. Cho, H. Garcia-Molina, and L. Page. Efficient crawling through URL ordering. Computer Networks, 30(1–7):161–172, 1998.
[9] A. Cockburn and B. McKenzie. What do web users do? An empirical analysis of web use. Int. J. of Human-Computer Studies, 54(6):903–922, 2001.
[10] B. D. Davison. Topical locality in the Web. In Proc. 23rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pages 272–279, 2000.
[11] F. Douglis. What's your PageRank? IEEE Internet Computing, 11(4):3–4, 2007.
[12] S. Fortunato, M. Boguñá, A. Flammini, and F. Menczer. Approximating PageRank from in-degree. In Proc. WAW 2006, volume 4936 of LNCS, pages 59–71. Springer, 2008.
[13] S. Fortunato, A. Flammini, F. Menczer, and A. Vespignani. Topical interests and the mitigation of search engine bias. Proc. Natl. Acad. Sci. USA, 103(34):12684–12689, 2006.
[14] B. Gonçalves and J. J. Ramasco. Human dynamics revealed through web analytics. Phys. Rev. E, 78:026123, 2008.
[15] B. A. Huberman, P. L. T. Pirolli, J. E. Pitkow, and R. M. Lukose. Strong regularities in World Wide Web surfing. Science, 280(5360):95–97, 1998.
[16] L. D. Catledge and J. E. Pitkow. Characterizing browsing strategies in the World-Wide Web. Computer Networks and ISDN Systems, 27:1065–1073, 1995.
[17] Y. Liu, B. Gao, T.-Y. Liu, Y. Zhang, Z. Ma, S. He, and H. Li. BrowseRank: letting Web users vote for page importance. In Proc. SIGIR, pages 451–458, 2008.
[18] F. Mathieu and M. Bouklit. The effect of the back button in a random walk: application for PageRank. In Proc. WWW Alternate Track Papers & Posters, pages 370–371, 2004.
[19] M. Meiss, J. Duncan, B. Gonçalves, J. J. Ramasco, and F. Menczer. What's in a session: tracking individual behavior on the Web. In Proc. 20th ACM Conference on Hypertext and Hypermedia (HT), 2009.
[20] M. Meiss, F. Menczer, S. Fortunato, A. Flammini, and A. Vespignani. Ranking web sites with real user traffic. In Proc. WSDM, pages 65–75, 2008.
[21] F. Menczer. ARACHNID: Adaptive Retrieval Agents Choosing Heuristic Neighborhoods for Information Discovery. In Proc. 14th International Conference on Machine Learning, pages 227–235. Morgan Kaufmann, 1997.
[22] F. Menczer. Lexical and semantic clustering by Web links. Journal of the American Society for Information Science and Technology, 55(14):1261–1269, 2004.
[23] F. Menczer. Mapping the semantics of web text and links. IEEE Internet Computing, 9(3):27–36, 2005.
[24] F. Menczer and R. K. Belew. Adaptive retrieval agents: Internalizing local context and scaling up to the Web. Machine Learning, 39(2–3):203–242, 2000.
[25] F. Menczer, G. Pant, and P. Srinivasan. Topical web crawlers: Evaluating adaptive algorithms. ACM Transactions on Internet Technology, 4(4):378–419, 2004.
[26] J. D. Noh and H. Rieger. Random walks on complex networks. Phys. Rev. Lett., 92:118701, 2004.
[27] F. Qiu, Z. Liu, and J. Cho. Analysis of user web traffic with a focus on search activities. In Proc. 8th International Workshop on the Web and Databases (WebDB), pages 103–108, 2005.
[28] F. Radlinski and T. Joachims. Active exploration for learning rankings from clickthrough data. In Proc. KDD, 2007.
[29] S. Fortunato, A. Flammini, and F. Menczer. Scale-free network growth by ranking. Phys. Rev. Lett., 96:218701, 2006.
[30] L. Tauscher and S. Greenberg. How people revisit web pages: Empirical findings and implications for the design of history systems. Int. J. of Human-Computer Studies, 47(1):97–137, 1997.