What's in a Session: Tracking Individual Behavior on the Web
Mark Meiss, John Duncan, Bruno Gonçalves, José J. Ramasco, Filippo Menczer
School of Informatics, Indiana University, Bloomington, IN, USA
Advanced Network Management Lab, Indiana University, Bloomington, IN, USA
Complex Networks Lagrange Laboratory, ISI Foundation, Torino, Italy
ABSTRACT
We examine the properties of all HTTP requests generated by a thousand undergraduates over a span of two months. Preserving user identity in the data set allows us to discover novel properties of Web traffic that directly affect models of hypertext navigation. We find that the popularity of Web sites—the number of users who contribute to their traffic—lacks any intrinsic mean and may be unbounded. Further, many aspects of the browsing behavior of individual users can be approximated by log-normal distributions even though their aggregate behavior is scale-free. Finally, we show that users' click streams cannot be cleanly segmented into sessions using timeouts, affecting any attempt to model hypertext navigation using statistics of individual sessions. We propose a strictly logical definition of sessions based on browsing activity as revealed by referrer URLs; a user may have several active sessions in their click stream at any one time. We demonstrate that applying a timeout to these logical sessions affects their statistics to a lesser extent than a purely timeout-based mechanism.
Categories and Subject Descriptors
C.2.2 [Computer-Communication Networks]: Network Protocols—HTTP; H.3.4 [Information Storage and Retrieval]: Systems and Software—Information networks; H.5.4 [Information Interfaces and Presentation]: Hypertext/Hypermedia—Navigation, user issues
General Terms
Measurement
Keywords
Web traffic, Web session, popularity, navigation, click stream
1. INTRODUCTION
We report our analysis of the Web traffic of approximately one thousand residential users over a two-month period. This data set preserves the distinctions between individual users, making possible detailed per-user analysis. We believe this is the largest study to date to examine the complete click streams of so many users in their place of residence for an extended period of time, allowing us to observe how actual users navigate a hyperlinked information space while not under direct observation. The first contributions of this work include the discoveries that the popularity of Web sites as measured by distinct visitors is unbounded; that many of the power-law distributions previously observed in Web traffic are aggregates of log-normal distributions at the user level; and that there exist two populations of users who are distinguished by whether or not their Web activity is largely mediated by portal sites.

A second set of contributions concerns our analysis of browsing sessions within the click streams of individual users. The concept of a Web session is critical to modeling real-world navigation of hypertext, understanding the impact of search engines, developing techniques to identify automated navigation and retrieval, and creating means of anonymizing (and de-anonymizing) user activity on the Web. We show that a simple timeout-based approach is inadequate for identifying sessions and present an algorithm for segmenting a click stream into logical sessions based on referrer information. We use the properties of these logical sessions to show that actual users navigate hypertext in ways that violate a stateless random surfer model and require the addition of backtracking or branching.

Finally, we emphasize which aspects of this data present possible opportunities for anomaly detection in Web traffic. Robust anomaly detection using these properties makes it possible to uncover "bots" masquerading as legitimate user agents. It may also undermine the effectiveness of anonymization tools, making it necessary to obscure additional properties of a user's Web surfing to avoid betraying their identity.
Contributions and Outline
In the remainder of this paper, after some background and related work, we describe the source and collection procedures of our Web traffic data. The raw data set includes over 400 million HTTP requests generated by over a thousand residential users over the course of two months, and we believe it to provide the most accurate picture to date of the hypertext browsing behavior of individual users as observed directly from the network.

Our main contributions are organized into three sections:

• We confirm earlier findings of scale-free distributions for various per-site traffic properties aggregated across users. We show this also holds for site popularity as measured by the number of unique visitors. (§4)

• We offer the first characterization of individual traffic patterns involving continuous collection from a large population. We find that properties such as jump frequency, browsing rates, and the use of portals are not scale-free, but rather log-normally distributed. Only when aggregated across users do these properties exhibit scale-free behavior. (§5)

• We investigate the notion of a Web "session," showing that neither a simple timeout nor a rolling average provides a robust definition. We propose an alternative notion of logical session and provide an algorithm for its construction. While logical sessions have no inherent temporal scale, they are amenable to the addition of a timeout with little net effect on their statistical properties. (§6)
2. BACKGROUND
Internet researchers have been quick to recognize that structural analysis of the Web becomes far more useful when combined with actual behavioral data. The link structure of the Web can differ greatly from the set of paths that are actually navigated, and it tells us little about the behavior of individual users. A variety of behavioral data sources exist that can allow researchers to identify these paths and improve Web models accordingly. The earliest efforts have used browser logs to characterize user navigation patterns [4], time spent on pages, bookmark usage, page revisit frequencies, and overlap among user paths [6]. The most direct source of behavioral data comes from the logs of Web servers, which have been used for applications such as personalization [16] and improving caching behavior [21]. More recent efforts involving server logs have met with notable success in describing typical user behavior [10]. Because search engines serve a central role in users' navigation, their log data is particularly useful in improving search results based on user behavior [1, 12].

Other researchers have turned to the Internet itself as a source of data on Web behavior. Network flow data generated by routers, which incorporates high-level details of Internet connections without revealing the contents of individual packets, has been used to identify statistical properties of Web user behavior and to discriminate peer-to-peer traffic from genuine Web activity [14, 15, 7].

The most detailed source of behavioral data consists of actual Web traffic captured from a running network, as we do here. The present study most closely relates to the work of Qiu et al. [19], who used captured HTTP packet traces to
investigate a variety of statistical properties of users' browsing behavior, especially the extent to which they appear to rely on search engines in their navigation of the Web.

Figure 1: System architecture for data collection. (The collection system receives a mirror of the outbound traffic of a dormitory of roughly 1,000 users via a campus router connected to the global Internet.)

We have also used captured HTTP requests in our previous work to describe ways in which PageRank's random-surfer model fails to approximate actual user behavior, which calls into question its use for ranking search results [13]. One way of overcoming these shortcomings is to substitute actual traffic data for ranking pages [11]. However, this may create a feedback cycle in which traffic grows super-linearly with popularity, leading to a situation (sometimes called "Googlearchy") in which a few popular sites dominate the Web and lesser-known sites are difficult to discover [18, 8]. More importantly for the present work, simply accepting traffic data as a given does not further our understanding of user behavior. We can also overcome the deficiencies of the random-surfer model by improving the model itself. This paper offers analysis of key features of observed behavior to support the development of improved agent-based models of Web traffic [9].

The present study also relates to work in anomaly detection and anonymization software for the Web. The Web Tap project, for example, attempted to discover anomalous traffic requests using metrics such as request regularity and inter-request delay time, quantities which we discuss in the present work [2]. The success of systems that aim to preserve the anonymity of Web users is known to depend on a variety of empirical properties of behavioral data, some of which we directly address here [20].
3. DATA DESCRIPTION

3.1 Data Source
The click data we use in this study was gathered from a dedicated FreeBSD server located in the central routing facility of the Bloomington campus of Indiana University (Figure 1). This system had a 1 Gbps Ethernet port that received a mirror of all outbound network traffic from one of the undergraduate dormitories. This dormitory consists of four wings of five floors each and is home to just over a thousand undergraduates. Its population is split roughly evenly between men and women, and its location causes it to have a somewhat greater proportion of music and education students than other campus housing.

To obtain information on individual HTTP requests passing over this interface, we first use a Berkeley Packet Filter to capture only packets destined for TCP port 80. While this eliminates from consideration all Web traffic running on non-standard ports, it does give us access to the great majority of it. We make no attempt to capture or analyze encrypted (HTTPS) traffic using TCP port 443. Once we have obtained a packet destined for port 80, we use a regular expression search against the payload of the packet to determine whether it contains an HTTP GET request.

If we do find an HTTP GET request in the packet, we analyze the packet further to determine the virtual host contacted, the path requested, the referring URL, and the advertised identity of the user agent. We then write a record to our raw data files that contains the MAC address of the client system, a timestamp, the virtual host, the path requested, the referring URL, and a flag indicating whether the user agent matches a mainstream browser (Internet Explorer, Mozilla/Firefox, Safari, or Opera). We maintain a record of the MAC address only in order to distinguish the traffic of individual users. We thus assume that most computers in the building have a single primary user, which is reasonable in light of the connectedness of the student population (only a small number of public workstations are available in the dormitory). Furthermore, as long as the users do not replace the network interface in their computer, this information remains constant.

The aggregate traffic of the dormitory was sufficiently low that our sniffing system could maintain a full rate of collection without dropping packets. While our collection system offers a rare opportunity to capture the complete browsing activity of a large user population, we do recognize some potential disadvantages of our data source. Because we do not perform TCP stream reassembly, we can only analyze HTTP requests that fit in a single 1,500-byte Ethernet frame. While the vast majority of requests do so, some GET-based Web services generate extremely long URLs. Without stream reassembly, we cannot log the Web server's response to each request: some requests will result in redirections or server errors, and we are unable to determine which ones. Finally, a user can spoof the HTTP referrer field; we assume that few students do so, and that those who do generate a small portion of the overall traffic.
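For concreteness, the record-extraction step can be sketched in Python as follows. This is a minimal illustration, assuming the raw payload bytes, the client MAC address, and a timestamp are already available from the packet capture; the field names and the browser-matching pattern are ours, not taken from the production collector.

import re

# Pattern for user agents we flag as mainstream browsers (illustrative).
MAINSTREAM_AGENTS = re.compile(r"MSIE|Mozilla|Firefox|Safari|Opera")

# A GET request line at the start of an HTTP message in the payload.
GET_LINE = re.compile(rb"^GET (\S+) HTTP/1\.[01]\r\n", re.M)

def parse_get_request(payload: bytes, mac: str, ts: float):
    """Extract a log record from the raw TCP payload of a port-80 packet;
    returns None if the payload does not contain an HTTP GET request."""
    m = GET_LINE.search(payload)
    if m is None:
        return None
    path = m.group(1).decode("latin-1")

    def header(name: bytes) -> str:
        # Case-insensitive search for a single header value.
        h = re.search(rb"(?im)^" + name + rb": *([^\r\n]*)", payload)
        return h.group(1).decode("latin-1") if h else ""

    agent = header(b"User-Agent")
    return {
        "mac": mac,                      # used only to distinguish users
        "timestamp": ts,
        "host": header(b"Host"),         # virtual host contacted
        "path": path,                    # path requested
        "referrer": header(b"Referer"),  # note HTTP's "Referer" spelling
        "browser": bool(MAINSTREAM_AGENTS.search(agent)),
    }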
The click data was collected over a period of about two months, from March 5, 2008 through May 3, 2008. This period included a week-long vacation during which no students were present in the building. During the full data collection period, we logged nearly 408 million HTTP requests from a total of 1,083 unique MAC addresses.

Not every HTTP request from a client is indicative of an actual human being trying to fetch a Web page; in fact, such requests actually constitute a minority of all HTTP requests. For this reason, we retain only those URLs that are likely to be requests for actual Web pages, as opposed to media files, style sheets, Javascript code, images, and so forth. This determination is based on the extension of the URL requested, which is imprecise but functions well as a heuristic in the absence of access to the HTTP Content-Type header in the server responses. We also filtered out a small subset of users with negligible activity; their traffic consisted largely of automated Windows Update requests and did not provide meaningful data about user activity. Finally, we also discovered the presence of a poorly written anonymization service that was attempting to obscure traffic to a particular adult chat site by spoofing requests from hundreds of uninvolved clients. These requests were also removed from the data set.

We found that some Web clients issue duplicate HTTP requests (same referring URL and same target URL) in nearly simultaneous bursts. These bursts occur independently of the type of URL being requested and are less than a single second wide. We conjecture that they may involve checking for updated content, but we are unable to confirm this without access to the original HTTP headers. Because this behavior is so rapid that it cannot reflect deliberate activity of individual users, we also removed the duplicate requests from the data set.

Privacy concerns and our agreement with the Human Subjects Committee of our institution also obliged us to try to remove all identifying information from the referring and target URLs. One means of doing so is to strip off all identifiable query parameters from the URLs. Applying this anonymization procedure affects roughly one-third of the remaining requests.

The resulting data set (summarized in Table 1) is the basis for all of the description and analysis that follows.

Table 1: Approximate dimensions of the filtered and anonymized data set.

  Page requests     29.8 million
  Unique users      967
  Web servers       630,000
  Referring hosts   110,000
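The three filtering steps (retaining likely page requests by extension, stripping query parameters, and removing sub-second duplicate bursts) can be sketched as follows; the extension list is a small illustrative subset rather than the full list we used, and the one-second window reflects the burst width noted above.

from urllib.parse import urlsplit, urlunsplit

# Extensions unlikely to denote a page fetch (an illustrative subset).
NON_PAGE_EXTENSIONS = {".css", ".js", ".gif", ".jpg", ".jpeg", ".png",
                       ".ico", ".swf", ".mp3", ".avi", ".wmv"}

def looks_like_page(url: str) -> bool:
    """Heuristic: treat a URL as a page request unless its path ends in a
    known media, style, or script extension."""
    path = urlsplit(url).path.lower()
    dot = path.rfind(".")
    return dot == -1 or path[dot:] not in NON_PAGE_EXTENSIONS

def strip_query(url: str) -> str:
    """Anonymization step: drop query parameters, which may carry
    identifying information."""
    parts = urlsplit(url)
    return urlunsplit((parts.scheme, parts.netloc, parts.path, "", ""))

def drop_duplicate_bursts(records, window=1.0):
    """Remove near-simultaneous duplicates: requests with the same
    (referrer, target) pair within `window` seconds of the last occurrence.
    `records` must be sorted by timestamp."""
    last_seen = {}
    for rec in records:
        key = (rec["referrer"], rec["url"])
        prev = last_seen.get(key)
        last_seen[key] = rec["timestamp"]
        if prev is None or rec["timestamp"] - prev >= window:
            yield rec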
4. HOST-BASED PROPERTIES
Our first priority in analyzing this data set was to verify that its statistics were consistent with those of previous studies. A previous study performed by several of the authors used a similar data collection method to gather completely anonymized click records from the entire Indiana University community of roughly 100,000 users [13]. In that study, we found that the distribution of the number of requests directed to each Web server ("in-strength", or s_in) could be well fitted by a power law Pr(s_in) ∼ s_in^−γ with exponent γ ≈ 1.8. The distribution of s_in in the present data set is consistent with this, as shown in Figure 2A; the distribution is linear on a log-log scale for nearly six orders of magnitude with a slope of roughly 1.75. Similarly, in the previous study, we found that the number of requests citing each Web server as a referrer ("out-strength", or s_out) could be approximated by a power law with γ ≈ 1.7. In the current study, we find that γ ≈ 1.75 for s_out, as shown in Figure 2B. The overall distribution of traffic is thus found to be in concordance with previous results.

Figure 2: Distributions of in-strength (A) and out-strength (B) for each Web server in the data set; both fits have exponent 1.75. In these and the following plots in this paper, power-law distributions are fitted by least-squares regression on log values with log-bin averaging and verified using the cumulative distributions and maximum likelihood methods [5].

The previous study was conducted under conditions of complete anonymity for users, retaining not even information as to whether two requests came from the same or different clients. Because the present data set does attribute each request to a particular user, we were now able to examine the relative popularity of Web servers as measured by the number of distinct users contributing to their traffic. As shown in Figures 3A and 3B, we find that the distribution of the number of users u contributing to the inbound traffic of a Web server is well approximated by a power law Pr(u) ∼ u^−β with β ≈ 2.0; the corresponding distribution for outbound traffic has β ≈ 1.9. Because β < 3 in both cases, the variance diverges as the distribution grows and is bounded only by the finite size of the data collection. Furthermore, if β ≤ 2, as seems to be the case for both incoming and outgoing popularity, the distributions lack any intrinsic mean as well. This implies the lack of any inherent ceiling of popularity for Web sites, regardless of the size of the user population. Indeed, the data show that the social networking site Facebook is a popular destination for almost 100% of the students in our study, handily eclipsing any major search engine or news site.
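The fitting procedure named in the caption of Figure 2 (least-squares regression on log-binned values) can be sketched as follows; this is our own minimal illustration on synthetic Pareto data, and any serious fit should also be verified with the maximum-likelihood methods of [5].

import numpy as np

def power_law_exponent(values, bins_per_decade=10):
    """Estimate gamma for Pr(x) ~ x^-gamma by least-squares regression on
    a log-binned, normalized histogram of positive values."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    n_bins = max(int(np.log10(hi / lo) * bins_per_decade), 1)
    edges = np.logspace(np.log10(lo), np.log10(hi), n_bins + 1)
    counts, _ = np.histogram(values, bins=edges)
    widths = np.diff(edges)
    centers = np.sqrt(edges[:-1] * edges[1:])   # geometric bin centers
    density = counts / (widths * counts.sum())  # normalized density
    mask = density > 0
    slope, _ = np.polyfit(np.log10(centers[mask]),
                          np.log10(density[mask]), 1)
    return -slope

# Synthetic check: a shifted Pareto sample with true exponent 2.5.
samples = np.random.pareto(1.5, 100_000) + 1.0
print(power_law_exponent(samples))  # close to 2.5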
5. USER-BASED PROPERTIES
The behavior of individual users is of critical interest not only for models of traffic, but also for applications such as network anomaly detection and the design of anonymization tools. Because nearly all per-server distributions in Web traffic exhibit scale-free properties and have extremely heavy tails, one might anticipate that the same would be true of Web users. If the statistics that describe user behavior lack well-defined central tendencies, then very little individual behavior can be described as anomalous. However, since any given user has only finite time to devote to Web surfing, we know that user-based distributions must be bounded. The question is whether we can characterize "normal" individual traffic. If we can establish a clear picture of the typical user, unusual users are easy to identify and have a more difficult time maintaining their anonymity.

Figure 3: Distributions of the number of unique users incoming to (A) and outgoing from (B) each Web server in the data set, with exponents 2.0 and 1.9 respectively. These distributions serve as a rough measure of the popularity of a Web site and imply that the potential audience of the most popular sites is essentially unbounded.

We first consider the distribution of sizes of the users' individual click streams, both in terms of the total number of requests generated by each user and the number of empty-referrer requests generated. The second distribution is of interest because it describes the number of times a user has jumped directly to a specific page (e.g., using a bookmark, start page, hyperlink in an e-mail, etc.) instead of navigating there from already viewed pages. The resulting distributions are shown in Figures 4A and 4B. Although the smaller size of these distributions makes fitting more difficult, we do observe reasonably strong log-normal fits, finding that the average user generated around 16,600 requests from 2,500 start pages over the course of two months. We removed from both distributions a small number of users (roughly 50 in each case) whose click streams were very small: under 2,500 requests or under 500 start pages. Most of these were users who did not begin generating any traffic until late in the study, possibly because of new hardware or the approach of final exams.

Figure 4: Distributions of the number of requests made per user (A) and the number of empty-referrer requests made per user (B). We show reference log-normal fits to each distribution, which omit some low-traffic users as described in the text.

We next examine the distribution of the ratio of the number of empty-referrer requests to the total number of requests for each user. This is a rough measure of the "jump percentage" (sometimes referred to as the teleportation parameter) in the surfing behavior of users, which is a value of critical importance to the PageRank algorithm [17]. A strong central tendency would imply that a random surfer has a fairly constant jump probability overall even if the chance of jumping varies strongly from page to page. As shown in Figure 5, we do observe a strong fit to a log-normal distribution with a mean of about 15%, which matches remarkably well the jump probability most often used in PageRank calculations.

Figure 5: Distribution of the proportion of empty-referrer requests made by each user, which roughly corresponds to the chance that a user will jump to a new location instead of continuing to surf. We show a reference log-normal fit to the distribution.

Besides the number of requests generated by each user, it is interesting to inspect the rate at which those requests are generated. Because not every user was active for the full duration of data collection, this cannot be deduced directly from the distribution of total requests. In Figure 6 we show the distribution of the number of requests per second for each user generating an average of at least fifty requests over the time they were active. We again obtain a reasonable fit to a log-normal distribution with a mean of about 0.0037 requests per second, or 320 requests per day.

Figure 6: Distribution of the number of requests per second made by each user, together with a reference log-normal fit to the distribution.

Finally, we consider the ratio of the number of unique referring sites to the number of unique target sites for each user. This ratio serves as a rough measure of the extent to which a user's behavior is typified by searching or surfing. If the number of referring hosts is low as compared to the number of servers contacted, this implies that the user browses the Web through a fairly small number of gateway sites, such as search engines, social networking sites, or a personal bookmark file. If the number of referring hosts is high compared to the number of servers contacted, this implies that a user is more given to surfing behavior: they discover new sites through navigation. We observe in Figure 7 that the distribution is bimodal, implying the existence of two groups of users: one more oriented toward portals and one more oriented toward browsing. Portal users visit on average almost four sites for each referrer, while surfers visit only about 1.5 sites. In support of this characterization, we note that the overall mean ratio is 0.54, but that this drops to 0.37 among users with more than 60% of their traffic connected to Facebook in some way.

Figure 7: Distribution of the ratio of unique referring sites to unique target sites for each user. We approximate this bimodal distribution with two log-normals with means at 0.28 and 0.65.
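The reference log-normal fits shown in Figures 4 through 7 can be sketched by the method of moments in log space; this illustrates the idea but is not necessarily the exact procedure used to produce the figures.

import numpy as np

def fit_lognormal(values):
    """Fit a log-normal by taking the mean and standard deviation of the
    log-transformed values; returns (mu, sigma)."""
    logs = np.log(np.asarray(values, dtype=float))
    return logs.mean(), logs.std()

# Hypothetical usage with per-user request totals:
#   mu, sigma = fit_lognormal(requests_per_user)
# exp(mu) is the typical (median) user, and exp(sigma) is the multiplicative
# spread, i.e., the factor by which a "normal" user may plausibly deviate.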
6. DEFINING SESSIONS
When we contemplate the design of Web applications or modeling the behavior of Web users, we are naturally drawn to the notion of a Web session. The constrained environments in which we most often observe users on the Web make it easy to imagine that a user sits down at the computer, fires up a Web browser, issues a series of HTTP requests, then closes the browser and moves on to other, unrelated tasks. This is certainly the behavior we observe when users visit a research lab to participate in a study or must dial into a modem pool before beginning to surf the Web. The subjects of the present study did not fit these conditions; they have 24-hour access to dedicated network connections, and we observed the traffic they generate in an environment that is both their home and their workplace. This distinction made us suspect that we might face some difficulty in selecting the optimal value for a timeout.

In our first attempt to segment individual click streams into sessions, we settled on a five-minute inactivity timeout as a reasonable start, a decision informed by previous research in the field [19]. We found that each user's click stream split into an average of 520 sessions over the two-month period. A typical session lasted for a bit over ten minutes and included around sixty requests to twelve different Web servers. These values seemed plausible for the population: one can imagine the typical student participating in ten ten-minute Web sessions every day.
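This segmentation is simple to express in code; the following minimal sketch splits one user's sorted sequence of request times (in seconds) at gaps longer than the timeout.

def timeout_sessions(timestamps, timeout=300.0):
    """Split a sorted click stream into sessions at inactivity gaps longer
    than `timeout` seconds (300 s is the five-minute value used above)."""
    sessions, current = [], []
    for t in timestamps:
        if current and t - current[-1] > timeout:
            sessions.append(current)   # gap too long: close the session
            current = []
        current.append(t)
    if current:
        sessions.append(current)
    return sessions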
The straightforward approach of identifying sessions using inactivity timeouts thus seemed promising, so we experimented with a variety of different timeouts to find an optimal value. Because of the log-normal distributions of user activity we had seen, it did not seem unreasonable to suppose that some of the per-session statistics would remain relatively constant as we adjusted the timeout and others would show dramatic changes in the neighborhood of some critical threshold.

The results, shown in Figure 8, show this to be far from the case. All the statistics we examined (mean number of sessions per user, session duration, number of requests, and number of hosts contacted) turn out to exhibit strong and regular dependence on the particular timeout used. They have no large discontinuities, areas of particular stability, or even local maxima and minima. This implies there is no reason based on the data to select one timeout threshold over any other; the choice is purely arbitrary and becomes the prime determiner of all relevant statistics.

Figure 8: Session statistics as a function of timeout. Top left: Mean duration of sessions in seconds. Top right: Mean number of hosts contacted. Bottom left: Mean number of requests. Bottom right: Mean number of sessions per user.

While we did expect some dependence on the timeout value, this result surprised us. We conjectured that the observed behavior might be a side-effect of considering every user's sessions as part of the same distribution; if we were to observe the click streams of individual users, we might see more pronounced clustering of HTTP requests in time.

To test this notion, we picked several users at random. For each one, we found the distribution of "interclick times", defined as the number of seconds elapsed between each pair of successive page requests. A user with n requests in their click stream would thus yield n − 1 interclick times. In each case, the distribution is well approximated by a power law Pr(∆t) ∼ ∆t^−τ over nearly six orders of magnitude, as shown in Figure 9. Moreover, we found τ < 2 in each case. Fitting a power law to the interclick time distribution of every user in the study yields fits with a mean R value of 0.989. The resulting distribution of power-law exponents, shown in Figure 10, is strongly normal with a mean value ⟨τ⟩ ≈ 1.6. This confirmed the finding that interclick times have no central tendency; in fact, scale-free behavior is so pervasive that a user agent exhibiting regularity in the timing of its requests would constitute an anomaly. These results make it clear that a robust definition of Web session cannot be based on a simple timeout.

Figure 9: The distributions of the time between successive requests for four randomly selected users. Each distribution is well approximated by power laws with exponents below two, suggesting both unbounded variance and the lack of a well-defined mean.

Figure 10: The distribution of the exponent τ for the best power-law approximation to the distribution of interclick times for each user. The fit is a normal distribution with mean ⟨τ⟩ = 1.6 and a standard deviation well below 1. These low values of τ indicate unbounded variance and the lack of any central tendency in the time between requests for a Web surfer.

The next natural approach would be to segment a click stream into sessions based on the rolling average of the number of clicks per unit time dropping below a threshold value. This approach turns out to be even more problematic than using a simple timeout, as there are now two parameters to consider: the width of the window for the rolling average and the threshold value. If the window selected is too narrow, then the rolling average will often drop all the way to zero, and the scheme becomes prey to the same problems as the simple timeout. If the window selected is too large, then the rolling average is so insensitive to change that meaningful segmentation is impossible. In the end, the choice is once again arbitrary. Moreover, examination of the moving average click rate for several users shows that the magnitudes of the spikes in the click rate fit a smooth distribution. This makes the selection of a threshold value arbitrary as well: the number of sessions becomes a function of the threshold rather than a feature of the data itself.

Logging both the referring URL and target URL for HTTP requests makes possible a third and more robust approach to constructing user sessions. We expand on the notion of referrer trees as described in [19] to segment a user's click stream into a set of logical sessions based on the following algorithm:

1. Initialize the set of sessions T and the map N : U ↦ T from URLs to sessions.

2. For each request with referrer r and target u:

   (a) If r is the empty referrer, create a new session t with root node u, and set N(u) = t.

   (b) Otherwise, if the session N(r) is well-defined, attach u to N(r) if necessary, and set N(u) = N(r).

   (c) Otherwise, create a new session t with root node r and a single leaf node u, and set N(r) = N(u) = t.

This algorithm assembles requests into sessions based on the referring URL of a request matching the target URL of a previous request. Requests are assigned to the session with the most recent use of the referring URL. Furthermore, each instance of a request with an empty referrer generates a new logical session. A sketch of this construction in code follows the discussion below.

Before we examine the properties of the logical sessions defined by this algorithm, we must highlight the differences between logical sessions and our intuitive notion of session. A logical session does not represent a period of time in which a user opens a Web browser, browses some set of Web sites, and then leaves the computer. It instead connects requests related to the same browsing behavior. If the user opens links in multiple tabs or uses the browser's back button, the subsequent requests will all be part of the same logical session. If the user then jumps directly to a search engine, they will start a new logical session. Tabbed browsing and use of the back button make it entirely possible for a user to have multiple active logical sessions at any point in time. A user who always keeps a popular news site open in a browser tab might have a logical session related to that site that lasts for the entire collection period. Logical sessions thus segment a user's click stream into tasks rather than time intervals. They also enjoy the advantage of being defined without reference to any external parameters, making their properties comparable across data sets and insensitive to the judgment of individual researchers.
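The construction sketched below implements this algorithm in Python. The Session class and its bookkeeping (per-node depths) are our own illustration, kept to the minimum needed for the tree statistics discussed next.

class Session:
    """A logical session: a tree of URLs rooted at a direct jump."""
    def __init__(self, root):
        self.depth = {root: 0}   # URL -> distance from the root

    def attach(self, parent, child):
        # Add `child` under `parent` unless it is already in the tree.
        if child not in self.depth:
            self.depth[child] = self.depth[parent] + 1

    def size(self):
        return len(self.depth)           # number of requests (nodes)

    def tree_depth(self):
        return max(self.depth.values())  # longest chain of clicks

def logical_sessions(requests):
    """Build logical sessions from one user's time-ordered click stream of
    (referrer, target) URL pairs; the empty referrer is ''."""
    sessions = []   # the set of sessions T
    by_url = {}     # the map N: URL -> most recent session using it
    for r, u in requests:
        if r == "":
            t = Session(u)      # (a) jump: new session rooted at u
            sessions.append(t)
        elif r in by_url:
            t = by_url[r]       # (b) known referrer: join its session
            t.attach(r, u)
        else:
            t = Session(r)      # (c) unseen referrer: new session with
            t.attach(r, u)      #     root r and single leaf u
            sessions.append(t)
            by_url[r] = t
        by_url[u] = t           # N(u) points to u's most recent session
    return sessions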
The first statistics of interest for these logical sessions concern their tree structures. The number of nodes in the tree is a count of the number of requests in the session. In Figure 11A, we show the probability density function for the per-user distribution of the average size of a logical session. This distribution is well approximated with a log-normal function, showing that the typical user has a mean of around 6.1 requests per session. The depth of the trees indicates the extent to which users follow chains of links during a Web-related task. Figure 11B shows the distribution of the average depth of the logical sessions for each user. It is again well approximated with a log-normal, showing that a typical user's sessions have a depth of about three links. In other words, an average session usually travels no more than two clicks away from the page on which it began.

Figure 11: Distributions of the mean number of requests per logical session per user (A) and the mean depth of logical sessions per user (B). In both cases, we consider only non-trivial trees. We show reference log-normal fits to each distribution.

The ratio between the number of nodes in each tree and its depth is also of interest. If this ratio is equal to 1, then the tree is just a sequence of clicks, which corresponds to the assumptions of the random walker model used by PageRank. As this ratio grows past 1, the branching factor of the tree increases, the assumptions of PageRank break down, and a random walker must either backtrack or split into multiple agents. Figure 12 shows the distribution of the average node-to-depth ratio for each user, which is well approximated by a normal distribution with mean ⟨µ⟩ = 1.94. We thus see that sessions have structure that cannot be accurately modeled by a stateless random walker: there must be provision for backtracking or branching.

Figure 12: The distribution of the average ratio of the node count to the tree depth for the logical sessions of each user. The fit is a normal distribution with mean µ = 1.94 and a standard deviation well below 1, showing that the branching factor of logical sessions is significantly greater than one.

Although there is a strong central tendency to the mean size and depth of a logical session for each user, the same does not hold for logical sessions in general. In Figure 13, we show the distributions of the node count and depth for logical sessions aggregated across all users. When we remove per-user identifying information in this way, we are once again confronted with heavy-tailed distributions exhibiting unbounded variance. This implies that the detection of automated browsing traffic is a much more tractable task if some form of client identity can be retained.

Figure 13: Distributions of the number of requests per logical session (A) and the depth of each logical session (B), with reference power-law fits (exponents 2.85 and 2.55, respectively).

Even though we have defined sessions logically, they can still be considered from the perspective of time.
If we calculate the difference between the timestamp of the request that first created the session and the timestamp of the most recent request to add a leaf node, we obtain the duration of the logical session. When we examine the distribution of the durations of the sessions of a user, we encounter the same situation as for the case of interclick times: power-law distributions Pr(∆t) ∼ ∆t^−τ for every user. Furthermore, when we consider the exponent of the best power-law fit of each user's data, we find the values are normally distributed with a mean value ⟨τ⟩ ≈ 1.2, as shown in Figure 14. No user has a well-defined mean duration for their logical sessions; as also suggested by the statistics of interclick times, the presence of strong regularity in a user's session behavior would be anomalous.

Figure 14: The distribution of exponent τ for the best power-law approximation to the distribution of logical session duration for each user. The fit is a normal distribution with mean ⟨τ⟩ = 1.2 and a standard deviation well below 1. These low values of τ indicate unbounded variance and the lack of any central tendency in the duration of a logical session.

It is natural to speculate that we can get the best of both worlds by extending the definition of a logical session to include a timeout, as was done in previous work on referrer trees [19]. Such a change is quite straightforward to implement: we simply modify the algorithm so that a request cannot attach to an existing session unless the attachment point was itself added within the timeout. This allows us to have one branch of the browsing tree time out while still allowing attachment on a more active branch.

While the idea is reasonable, we unfortunately find that the addition of such a timeout mechanism once again makes the statistics of the sessions strongly dependent on the particular timeout selected. As shown in Figure 15, the number of sessions per user, mean node count, mean depth, and ratio of nodes to tree depth are all dependent on the timeout. On the other hand, in contrast to sessions defined purely by timeout, this dependence becomes smaller as the timeout increases, suggesting that logical sessions with a timeout of around 15 minutes may be a reasonable compromise for modeling and further analysis.

Figure 15: Logical session statistics as a function of timeout. Top left: Mean node count. Top right: Mean tree depth. Bottom left: Mean ratio of node count to tree depth. Bottom right: Mean number of sessions per user.
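One way to realize the timeout modification in code is sketched below, tracking the time at which each URL last entered a tree; details such as how a stale referrer is re-rooted are our own choices rather than a specification from the text.

def logical_sessions_with_timeout(requests, timeout=900.0):
    """Logical sessions with a timeout (900 s, the 15-minute compromise
    above): a request may attach to an existing session only if its
    attachment point was itself added within the timeout. Input is a
    time-ordered list of (timestamp, referrer, target) triples, with ''
    as the empty referrer. Sessions are kept as plain URL sets here."""
    sessions = []    # each session is the set of URLs in its tree
    last_added = {}  # URL -> (session index, time it last entered a tree)
    for ts, r, u in requests:
        entry = last_added.get(r) if r else None
        if entry is not None and ts - entry[1] <= timeout:
            idx = entry[0]           # attachment point still fresh: join it
            sessions[idx].add(u)
        elif r:
            idx = len(sessions)      # unseen or stale referrer: new session
            sessions.append({r, u})
            last_added[r] = (idx, ts)
        else:
            idx = len(sessions)      # empty referrer: new root
            sessions.append({u})
        last_added[u] = (idx, ts)
    return sessions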
7. CONCLUSIONS
In this paper we have built on the network-sniffing approach to gathering Web traffic that we first explored in [13], extending it to track the behavior of individual users. The resulting data set provides an unprecedented and accurate picture of human browsing behavior in a hypertext information space as manifested by over a thousand undergraduate students in their residences.

The data confirm previous findings about long-tailed distributions in site traffic and reveal that the popularity of sites is likewise unbounded and without any central tendency. They also show that while many aspects of Web traffic have been shown to obey power laws, these power-law distributions often represent the aggregate of distributions that are actually log-normal at the user level. The lack of any regularity in interclick times for Web users leads to the conclusion that sessions cannot be meaningfully defined with a simple timeout, leading to our presentation of logical sessions and an algorithm for deriving them from a click stream. These logical sessions illustrate further drawbacks of the random surfer model and can be modified to incorporate timeouts in a relatively robust way.

These findings have direct bearing on future work in modeling user behavior in hypertext navigation. The stability of the proportion of empty-referrer requests across all users implies that although not every page is equally likely to be the cause of a jump, the overall chance of a jump occurring is constant in the long run. The finding that the branching factor of the logical sessions is definitely greater than one means that plausible agent-based models for random walks must incorporate state, either through backtracking or branching [3].

Our indications as to which distributions show central tendencies and which do not are of critical importance for anomaly detection and anonymization. To appear plausibly human, an agent must not stray too far from the expected rate of requests, proportion of empty-referrer requests, referrer-to-host ratio, and node count and tree depth values for logical sessions. Because these are log-normal distributions, agents cannot deviate more than a multiplicative factor away from their means. At the same time, a clever agent must mimic the heavy-tailed distributions of the spacing between requests and duration of sessions; too much regularity appears artificial.

Although our method of collection does afford us a large volume of data, it suffers from several disadvantages which we are working to overcome in future studies. First, our use of the file extension (if any) in requested URLs is a noisy indicator of whether a request truly represents a page fetch. We are also unable to detect whether any request is actually satisfied; many of the requests may actually result in server errors or redirects. Both of these problems could be largely mitigated without much overhead by capturing the first packet of the server's response, which should indicate an HTTP response code and a content type in the case of successful requests.

This data set is inspiring the development of an agent-based model that replaces the uniform distributions of PageRank with more realistic distributions and incorporates bookmarking behavior to capture the branching behavior observed in logical sessions [9].

Acknowledgments
The authors would like to thank the Advanced Network Management Laboratory at Indiana University and Dr. Jean Camp of the IU School of Informatics for support and infrastructure. We also thank the network engineers of Indiana University for their support in deploying and managing the data collection system. Special thanks are due to Alessandro Flammini for his insight and support during the analysis of this data.
This work was produced in part with support from the Institute for Information Infrastructure Protection research program. The I3P is managed by Dartmouth College and supported under Award Number 2003-TK-TX-0003 from the U.S. DHS, Science and Technology Directorate. This material is based upon work supported by the National Science Foundation under award number 0705676. This work was supported in part by a gift from Google. Opinions, findings, conclusions, recommendations or points of view in this document are those of the authors and do not necessarily represent the official position of the U.S. Department of Homeland Security, Science and Technology Directorate, I3P, National Science Foundation, Indiana University, Google, or Dartmouth College.
8. REFERENCES

[1] E. Agichtein, E. Brill, and S. Dumais. Improving Web search ranking by incorporating user behavior information. In Proc. 29th ACM SIGIR Conf., 2006.
[2] K. Borders and A. Prakash. Web Tap: Detecting covert Web traffic. In Proc. 11th ACM Conference on Computer and Communications Security, pages 110–120, 2004.
[3] M. Bouklit and F. Mathieu. BackRank: An alternative for PageRank? In WWW '05: Special Interest Tracks and Posters of the 14th International Conference on World Wide Web, pages 1122–1123, 2005.
[4] L. D. Catledge and J. E. Pitkow. Characterizing browsing strategies in the World-Wide Web. Computer Networks and ISDN Systems, 27(6):1065–1073, 1995.
[5] A. Clauset, C. R. Shalizi, and M. E. J. Newman. Power-law distributions in empirical data. Technical report, arXiv:0706.1062v1 [physics.data-an], 2007.
[6] A. Cockburn and B. McKenzie. What do Web users do? An empirical analysis of Web use. Intl. Journal of Human-Computer Studies, 54(6):903–922, 2001.
[7] J. Erman, A. Mahanti, M. Arlitt, and C. Williamson. Identifying and discriminating between Web and peer-to-peer traffic in the network core. In WWW, pages 883–892, 2007.
[8] S. Fortunato, A. Flammini, F. Menczer, and A. Vespignani. Topical interests and the mitigation of search engine bias. Proc. Natl. Acad. Sci. USA, 103(34):12684–12689, 2006.
[9] B. Gonçalves, M. Meiss, J. J. Ramasco, A. Flammini, and F. Menczer. Remembering what we like: Toward an agent-based model of Web traffic. In WSDM (Late-breaking papers), 2009.
[10] B. Gonçalves and J. J. Ramasco. Human dynamics revealed through Web analytics. Phys. Rev. E, 78:026123, 2008.
[11] Y. Liu, B. Gao, T. Y. Liu, Y. Zhang, Z. Ma, S. He, and H. Li. BrowseRank: Letting Web users vote for page importance. In SIGIR, 2008.
[12] J. Luxenburger and G. Weikum. Query-log based authority analysis for Web information search. Volume 3306 of Lecture Notes in Computer Science, pages 90–101. Springer Berlin/Heidelberg, 2004.
[13] M. Meiss, F. Menczer, S. Fortunato, A. Flammini, and A. Vespignani. Ranking Web sites with real user traffic. In Proc. 1st ACM International Conference on Web Search and Data Mining (WSDM), 2008.
[14] M. Meiss, F. Menczer, and A. Vespignani. On the lack of typical behavior in the global Web traffic network. In Proc. 14th International World Wide Web Conference, pages 510–518, 2005.
[15] M. Meiss, F. Menczer, and A. Vespignani. Structural analysis of behavioral networks from the Internet. Journal of Physics A, 2008.
[16] B. Mobasher, R. Cooley, and J. Srivastava. Automatic personalization based on Web usage mining. Communications of the ACM, 43(8):141–151, 2000.
[17] L. Page, S. Brin, R. Motwani, and T. Winograd. The PageRank citation ranking: Bringing order to the Web. Technical report, Stanford University Database Group, 1998.
[18] S. Pandey, S. Roy, C. Olston, J. Cho, and S. Chakrabarti. Shuffling a stacked deck: The case for partially randomized ranking of search engine results. In Proc. 31st International Conference on Very Large Databases (VLDB), pages 781–792, 2005.
[19] F. Qiu, Z. Liu, and J. Cho. Analysis of user Web traffic with a focus on search activities. In Proc. 8th International Workshop on the Web and Databases (WebDB), pages 103–108, 2005.
[20] C. Viecco, A. Tsow, and L. J. Camp. Privacy-aware architecture for sharing Web histories. IBM Systems Journal, publication pending.
[21] Q. Yang and H. H. Zhang. Web-log mining for predictive Web caching.