Predicting Online Video Engagement Using Clickstreams
Everaldo Aguiar
University of Notre Dame, Notre Dame, Indiana
[email protected]
Saurabh Nagrecha
University of Notre Dame, Notre Dame, Indiana
[email protected]
Nitesh V. Chawla ∗
University of Notre Dame, Notre Dame, Indiana
[email protected]
ABSTRACT
In the nascent days of e-content delivery, having a superior product was enough to give companies an edge against the competition. With today's fiercely competitive market, one needs to be multiple steps ahead, especially when it comes to understanding consumers. Focusing on a large set of web portals owned and managed by a private communications company, we propose methods by which these sites' clickstream data can be used to provide a deep understanding of their visitors, as well as their interests and preferences. We further expand the use of this data to show that it can be effectively used to predict user engagement with video streams.
Author Keywords
Clickstream; predictive analysis; online video; user engagement.
ACM Classification Keywords
I.5.2. Pattern Recognition: Design Methodology
INTRODUCTION
The constant growth in volume, speed, availability, and functionality of the Web brings with it not only a variety of challenges and risks, but also a number of opportunities. While there have been a series of major advances in the field over time, one that has been given a considerable amount of attention in more recent years is that of personalization.

Data about users' online activity is continuously captured and analyzed. Advanced recommendation systems are now able to tell us what products we might be interested in buying, the books we will enjoy reading, what movies we should watch next, and even which diseases we are at risk of contracting. From a business perspective, the benefits of being able to understand customers at this level of detail are unquestionable.

Methods for capturing user data on the Web are also becoming increasingly efficient. As described in [20], the browsing behavior of individual users can be recorded at the granularity of mouse clicks with little to no work needing to be done. A number of services, both free and proprietary, offer user tracking solutions that can be implemented and deployed within minutes. However, the feedback that one usually gets from these tools often comes in the form of simplistic aggregate statistics that do not offer a deeper understanding of user behavior.

With that in mind, we set out to analyze the application of some of these ideas to a specific context, having as our major goal the understanding of each user as an individual unit.

∗ Corresponding Author
For this study, we were provided a large dataset that describes user clicks generated within a two-month span and across a number of websites managed by a large communications company. This paper describes the process through which we parsed, analyzed, and drew knowledge from that user-generated clickstream dataset. We begin by showing, from a more general perspective, how this type of data can be used to identify particularly interesting trends in user interest. To further illustrate the usefulness of this information, we then describe how we applied methods to predict user engagement with video streams and discuss their accuracy.
Figure 1. Video viewership drop-off by category of content. Viewer retention is plotted at each progress milestone (video start, 25%, 50%, 75%, and complete) for the Community, Entertainment, Food, Health, Home, News, Politics, Sports, Technology, and Weather categories.
Distributing content that entices user engagement and captures large audiences is the ultimate goal of all web media providers. Measuring and forecasting these variables, however, is not an easy task. As Figure 1 illustrates, as time goes by, the number of users that remain tuned to video streams dramatically decreases. For certain categories, the percentage of users that actually watch videos to completion can be as low as 20%.

To address this undesired outcome, we propose the development of clickstream-based models that can learn the individual preferences and characteristics of each user, and utilize this information to predict how "engaged" they will be with a particular video stream. Being able to know, in advance, if a user is likely to exit a video prematurely allows content providers some leeway to implement personalized intervention strategies aimed at maximizing viewership retention.

The remainder of the paper is organized as follows. The next section gives an overview of the most recent related literature. That is followed by a detailed coverage of clickstream data representation and a description of our particular dataset. We then elaborate on the methods applied in this study, the results obtained, and their importance. Finally, the last section draws conclusions about this exercise and argues for the latent potential that resides in user-generated clickstream data.

RELATED WORK
Interest in analyzing the online activities of users is as old as providing consumable content itself. This problem has piqued the interest of multiple fields, namely marketing, psychology, and computer science.

Since user activity provides an immense amount of measurable secondary data, various models to predict multiple aspects of user behavior have been proposed. User interaction has been studied at various levels, from gaze tracking [8] to broader patterns of path traversal within a website [1, 22]. Simple duration and dwell-time [4] can be used to predict when a user exits the site. User classification [21] can be used to identify what the user is specifically looking for and even morph the website [16] according to the custom tastes of that particular user profile. Personalized content based on click history has been implemented and widely adopted by commercial content providers [6, 19].

With the distribution of video content online becoming mainstream, the way we study user engagement has been greatly enriched. Studies like [7] have measured the role of video content quality in influencing user engagement, but did not utilize clickstreams to contextualize the video views. Work on online video engagement for Massive Open Online Courses (MOOCs) [13] has shown that the lessons learned from analyzing video views can be used to improve video authoring, editing, and interface design. It also emphasizes the value of video dropout as a metric for engagement. Though the MOOC work lacks the contextual history of the users, in this paper we leverage similar and many other clickstream features to predict video engagement.
CLICKSTREAM DATA REPRESENTATION
Clickstream data consists of a "virtual trail" that users leave behind while they interact with a given system, website, or application. More specifically, data that describes the state of a user's current session is recorded each time a click is performed, and the aggregation of that produces a clickstream, which can be used to reconstruct all actions taken by the user while he or she utilized that given product.

While applicable to a variety of scenarios, the collection and analysis of clickstreams has become most notably popular in the context of Web-based tools and websites. As highlighted by Srivastava et al. [25], the analysis of such information has potential applications in a number of areas such as website personalization and modification, system improvement, business intelligence, and usage characterization. Our contributions fall mainly within the first and last domains.

Figure 2. A simple illustration of the clickstream of a typical user
Our Dataset
The data we utilized for this study was provided to us by a large U.S.-based communications company that operates in the radio, TV, newspaper, and online media domains. They manage a few dozen websites, all of which are embedded with clickstream-capturing functionality. Next, we give a detailed description of the most important features this dataset contains.

User activity is continuously captured by numerous servers across the country and is then concatenated at the end of the day in the form of daily "dumps". We utilized 59 of these files, covering the period from December 4, 2012 to January 31, 2013. Altogether, these files contain upwards of 65 million click instances.

Each click instance recorded is characterized by a large number of features (161 in this case). Table 1 lists a small subset of the most relevant features and a brief description of each. With that information we are able to determine (1) how users reached the website, (2) what attracted them there, (3) what actions they performed while on the site, and (4) how they eventually exited.

Feature Type | Feature Name        | Description
Nominal      | Browser             | The browser that was used
Nominal      | Channel             | The site that the page view belongs to
Nominal      | City                | The city the user accessed the page from
Nominal      | Cookies             | Whether the user had cookies turned on or not
Nominal      | Country             | The country the user accessed the page from
Nominal      | Domain              | Domain of the user's ISP
Nominal      | Exclude hit         | Identifies web crawlers
Nominal      | First hit page      | URL where the user first landed on the website
Nominal      | Frequency of visits | Denotes hourly, daily, weekly, monthly, or yearly visits
Nominal      | IP                  | The IP address of the user
Nominal      | New visit           | Whether the user is new to the site, based on cookies
Nominal      | Referrer            | The URL of the website that referred this user
Nominal      | Region              | The state or region the user was in
Nominal      | Search keywords     | The search string which led to the particular page
Nominal      | Section             | The section of the website where the click took place
Nominal      | Subsection          | The subsection of the website where the click took place
Numeric      | First hit time      | Timestamp of when the user first landed on the website
Numeric      | Last click          | Timestamp of when the last click was made by the user
Numeric      | Last visit          | When the user last visited the site
Numeric      | Time & Date         | Timestamp of when the click instance happened
Numeric      | Visit number        | The number of times the user has visited the site

Table 1. Dataset features.

Note that while there is no feature that captures the event of a user leaving the website, as is common practice, we work under the assumption that when a user is inactive for a period longer than 30 minutes (i.e., no click events originate from this person during that time), the user has exited the site.

This assumption allows us to group these click events from the original datasets into user sessions, which illustrate the path a user takes while browsing the website and can be used to identify areas that attract more (or less) traffic.

Figure 2 illustrates one individual session chosen at random from our dataset. We can see that the user in this case was referred to our domain through a link that he or she found on a social network website, and that their visit consisted of several hops, most of which happened in the news section.

Furthermore, these sessions can be aggregated, producing a high-level view of the entire website structure by popularity of section, which allows us to visualize which areas of the website are more popular, as well as which links connecting different sections are traversed the most. Take for instance the example illustrated in Figure 3. To generate this particular graph, we isolated the sessions corresponding to a certain newspaper's website, its 12 most popular sections, and the traffic between them. Among other observations, we noticed that the readers of this particular newspaper were often prone to navigating to the sports section and reading multiple articles there.

Lastly, we note that based on information retrieved from specific features of our dataset, it is possible to determine if a user is simply browsing text articles, displaying image galleries, or streaming online video.
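The 30-minute inactivity rule described above can be sketched in a few lines. This is an illustrative reconstruction rather than the pipeline actually used in the study; the `(timestamp, page)` tuple representation and the function name are our own assumptions.

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)

def sessionize(clicks):
    """Group one user's clicks into sessions.

    `clicks` is a list of (timestamp, page) tuples for a single user.
    A gap of more than 30 minutes between consecutive clicks starts a
    new session. Returns a list of sessions, each a list of clicks.
    """
    if not clicks:
        return []
    clicks = sorted(clicks, key=lambda c: c[0])
    sessions = [[clicks[0]]]
    for click in clicks[1:]:
        # Inactivity longer than the timeout means the user "exited".
        if click[0] - sessions[-1][-1][0] > SESSION_TIMEOUT:
            sessions.append([click])
        else:
            sessions[-1].append(click)
    return sessions
```

Aggregating the per-user output of such a routine over all click instances is what produces session graphs like the one in Figure 3.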
The following sections of this paper will describe how we used this fact to aid in the development of predictive models for video viewership engagement.
METHODS

Identification of Video Exit Instances
When a user watches a video, a separate log entry is made corresponding to when he or she completes watching a certain percentage of the video, while a player ID remains constant. This makes the clickstream log reflect a cumulative history of the viewer's progress within that video.

By filtering the data to get only clicks corresponding to video instances, and then by IP address, we obtain the entire video-viewing activity of each IP. From this modified dataset, we isolate an individual "video view" table by specifying the player ID. This table is then sorted chronologically and filtered by session timeout. The last entry corresponds to the viewer's exit point. This gives us a unique session for a visit. In combination with the current session data, and data from cookies, we retrieve the user's unique historical browsing patterns. It should be noted that in the absence of cookies, we treat the user as a fresh incoming visitor. For our analysis, we isolated only the instances where the user exited the video.

Due to the inherently discretized nature of the data collection, we get a coarse-grained estimate of when the user reached a certain percentage of the video. If the last entry shows that the user watched 50% of a video, it can be inferred that the user exited at p%, such that p ∈ [50, 75).

Figure 3. Clickstream network for a news-media website. The various nodes displayed here represent different sections. The direction of the arrows represents user traffic flowing between these sections and the thickness is indicative of the volume of said traffic.
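The coarse exit-point inference described above can be sketched as follows. The list-of-markers input is a hypothetical simplification of the cumulative per-player-ID log; the actual pipeline first filters by video clicks, IP address, player ID, and session timeout.

```python
# Cumulative progress markers logged by the video player.
MILESTONES = (0, 25, 50, 75, 100)

def exit_interval(progress_entries):
    """Infer the coarse exit interval for one video view.

    `progress_entries` holds the percent-complete markers logged for a
    single player ID, e.g. [0, 25, 50]. The highest marker is the last
    point the viewer is known to have reached.
    """
    last = max(progress_entries)
    if last == 100:
        return (100, 100)                        # watched to completion
    nxt = MILESTONES[MILESTONES.index(last) + 1]
    return (last, nxt)                           # exited at p%, p in [last, nxt)
```

For example, a log ending at the 50% marker yields the interval [50, 75), matching the inference in the text.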
Feature Selection

Using various feature selection methods, we reduced the size of our dataset from the original 161 features to the 12 best descriptors. Among these were features like IP, location, content annotations, and referrer information. Out of the 161 features in a typical video exit instance, 40 are mutually redundant, and 32 are constant in value. This motivates the need to find a set of features that best describes the target class (in this case, the percent of the video the user watches before exiting) [14]. We investigated various feature selection methods which support mixed data types and ranked the top features. One would expect these features to encompass measurable user traits which influence their interest in the video. The various feature selection methods aim to remove redundant and irrelevant features using different statistical means, each with its respective strengths. Though a popular choice in machine learning, correlation-based feature selection (CFS) was not considered due to the sparse nature of the data [15]. A more detailed study of these methods can be found in [28, 12]. The feature selection methods employed in this problem are described below:
Chi Squared
The chi-squared (χ²) method measures how much the observed data deviates from the case where the class value and the feature are independent of each other. It evaluates whether feature and class occurrences are randomly related or exhibit some dependence.
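As an illustration of how the χ² score can be computed for one nominal feature, a self-contained sketch follows (the study presumably used an off-the-shelf implementation such as Weka's; this version is our own reconstruction of the statistic):

```python
from collections import Counter

def chi_squared(pairs):
    """Chi-squared statistic for one nominal feature against the class.

    `pairs` is a list of (feature_value, class_label) observations.
    Compares observed co-occurrence counts with the counts expected
    if feature and class were independent.
    """
    n = len(pairs)
    joint = Counter(pairs)
    feat = Counter(f for f, _ in pairs)
    cls = Counter(c for _, c in pairs)
    stat = 0.0
    for f in feat:
        for c in cls:
            # Expected count under independence of feature and class.
            expected = feat[f] * cls[c] / n
            stat += (joint[(f, c)] - expected) ** 2 / expected
    return stat
```

A feature that is independent of the class scores near zero; larger values indicate stronger association and a higher ranking.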
Features           | Chi | IG | GR | oneR | Symm
Time               |  1  |  1 |  7 |  -   |  2
IP                 |  2  |  2 |  9 |  -   |  3
First hit referrer |  3  |  3 |  5 |  2   |  5
First hit page     |  4  |  5 | 10 |  -   |  7
Story title        |  5  |  4 |  2 |  1   |  1
Search engine      |  6  |  7 |  3 |  3   |  8
City               |  7  |  6 |  - |  -   |  9
ISP                |  8  |  8 |  - |  -   | 10
Referrer type      |  9  | 10 |  1 |  -   |  4

Table 2. Feature Selection Rankings (Chi: chi-squared, IG: information gain, GR: gain ratio, oneR: One R, Symm: symmetric uncertainty).
Information Gain
Information gain [23] measures how much of the entropy of the class is removed once the value of the feature is known.
Gain Ratio
Information gain favors attributes with many values over those with fewer values; the gain ratio [24] compensates for this by factoring in the amount of splitting caused by the feature.
One R
One R formulates a set of simple rules relating each feature to the class, and ranks the features based on how accurate these rules are.
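A minimal sketch of the One R scoring idea follows (our own illustration; full One R implementations also discretize numeric features, which is omitted here):

```python
from collections import Counter, defaultdict

def one_r_accuracy(feature, labels):
    """Score one nominal feature with the One R idea.

    For each feature value, the rule predicts the majority class seen
    with that value; the score is the accuracy of that rule on the
    same data. Higher scores mean a better-ranked feature.
    """
    by_value = defaultdict(Counter)
    for v, y in zip(feature, labels):
        by_value[v][y] += 1
    correct = sum(counts.most_common(1)[0][1] for counts in by_value.values())
    return correct / len(labels)
```

A perfectly predictive feature scores 1.0, while a constant feature scores no better than the majority-class rate.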
Symmetric Uncertainty
Symmetric uncertainty [26, 10] targets attributes which correlate well with the class but have little intercorrelation.

The results of these feature selection methods are summarized in Table 2. The attributes in the table are the ones which consistently appear in the top 10%. These are the attributes which influence video exit points the most.

The time of viewing influences at what point people are prone to exit the video. IP address, in conjunction with location and ISP, indicates who is watching the video and thus offers a personalized facet to the prediction. The number of pages viewed by a person and their frequency of visits can be perceived as reflective of the person's interest in the site. The referrer which brought the viewer to the site can influence the engagement of the viewer; a viewer coming from a social network link interacts differently than one who had the site bookmarked in their browser. The entry point is the first page the viewer saw in their current viewing session; this determines their interest in consuming further content. The actual title of the story includes the section which the video is under. As we had observed in Figure 1, users viewing "Technology"-related videos were less likely to exit than those viewing "Entertainment"-related videos.
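The three entropy-based criteria above (information gain, gain ratio, and symmetric uncertainty) all derive from the same entropy computations, as this illustrative sketch shows (function names and the list-based data representation are our own):

```python
import math
from collections import Counter

def entropy(values):
    # Shannon entropy H(X) in bits.
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def conditional_entropy(feature, labels):
    # H(class | feature): class entropy within each feature value,
    # weighted by how often that value occurs.
    n = len(labels)
    groups = {}
    for x, y in zip(feature, labels):
        groups.setdefault(x, []).append(y)
    return sum(len(g) / n * entropy(g) for g in groups.values())

def info_gain(feature, labels):
    # Class entropy removed once the feature is known.
    return entropy(labels) - conditional_entropy(feature, labels)

def gain_ratio(feature, labels):
    # Penalize many-valued features by the entropy of the split itself.
    split_info = entropy(feature)
    return info_gain(feature, labels) / split_info if split_info else 0.0

def symmetric_uncertainty(feature, labels):
    # High when feature and class share information; normalized to [0, 1].
    h_sum = entropy(feature) + entropy(labels)
    return 2 * info_gain(feature, labels) / h_sum if h_sum else 0.0
```

Ranking features by each score independently is what produces the per-column orderings in Table 2.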
Classification
Our aim is to predict how much of the video a user watches before exiting. In our dataset, we find that this is represented by 5 distinct markers, which correspond to the percentage of the video the user watched before exiting. We formulate two classification tasks: to predict what percent of the video is watched, and to predict whether the user exits the video "early" (before reaching 50% of the video).

The first is a prediction task involving 5 classes. Since it is particularly relevant to identify users who exit early in the video, we consider users who exit at the very beginning, or after having viewed only 25% of the video, to have exited "early". As described above, this corresponds to users who viewed 0 to 49% of the video.
Figure 4. Converting the Percentages Classification to Early Exit Classification: The 5-class problem (top) is reduced to a binary classification problem by merging classes (bottom).
We can thus refine the problem as the binary prediction of these "early exits". The classes are then a merger of the previously mentioned 5 classes, with the first two combined to form the "early exits" and the latter 3 representing those who chose not to exit early. This simplification is depicted in the representative expected confusion matrices for both classification tasks, as displayed in Figure 4. We performed both prediction analyses on our data.
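The class merger above amounts to a simple relabeling of the five progress markers (a minimal sketch; the constant and function names are our own):

```python
# The five cumulative progress markers recorded in the clickstream.
MARKERS = (0, 25, 50, 75, 100)

def to_early_exit(exit_marker):
    """Map the 5-class exit marker to the binary 'early exit' label.

    Exits logged at the 0% or 25% marker mean the user left before
    reaching half of the video, i.e., watched 0-49% of it.
    """
    return exit_marker < 50
```

Applying this mapping to the 5-class labels yields the binary dataset used for the "early exit" experiments.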
Naive Bayes
Among the simplest classification algorithms, this probabilistic method is based on Bayes' theorem [2] and strong underlying independence assumptions: each feature is assumed to contribute independently to the class outcome.
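A minimal Naive Bayes for nominal features might look as follows. This is an illustrative sketch with Laplace smoothing; the study's experiments presumably used an existing implementation such as Weka's, and the class interface here is our own assumption.

```python
import math
from collections import Counter, defaultdict

class CategoricalNB:
    """Naive Bayes for nominal features with Laplace smoothing.

    Each feature contributes an independent likelihood term, per the
    independence assumption described above.
    """

    def fit(self, rows, labels):
        self.classes = sorted(set(labels))
        self.n = len(labels)
        self.priors = Counter(labels)
        # counts[class][feature_index][value] = occurrence count
        self.counts = defaultdict(lambda: defaultdict(Counter))
        self.values = defaultdict(set)
        for row, y in zip(rows, labels):
            for i, v in enumerate(row):
                self.counts[y][i][v] += 1
                self.values[i].add(v)
        return self

    def predict(self, row):
        def log_posterior(y):
            lp = math.log(self.priors[y] / self.n)
            for i, v in enumerate(row):
                num = self.counts[y][i][v] + 1          # Laplace smoothing
                den = self.priors[y] + len(self.values[i])
                lp += math.log(num / den)
            return lp
        return max(self.classes, key=log_posterior)
```

Working in log space avoids underflow when many per-feature likelihoods are multiplied together.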
C4.5 Decision trees
C4.5 decision trees [24] work by building a tree structure in which a split operation is performed at each node based on the information gain of each feature of the dataset with respect to the class. At each level, the attribute with the highest information gain is chosen as the basis for the split criterion.
Repeated Incremental Pruning to Produce Error Reduction
RIPPER [5] is a rule-based classification learner. It is algorithmically faster than C4.5, with a complexity of O(n(log n)²) as opposed to C4.5's rule learning, which is of the order O(n³). RIPPER constructs an initial set of rules and then iteratively optimizes it according to a tunable parameter. It is implemented in Weka as the "JRip" class.

Random forests
Random forests [3] combine multiple tree predictors in an ensemble. New instances being classified are pushed down the trees, and each tree reports a classification. The "forest" then decides which label to assign to this new instance based on the aggregate number of votes given by the set of trees.
Decision Tables
Decision Table classifiers [18] are built by concatenating a series of rules derived from the feature set to corresponding class outcomes. The major advantages of this method are that it is easy to interpret and notably efficient.
Random Subspaces
The random subspace method [17] is an ensemble classifier whose individual classifiers operate on random subsets of the feature set. The predictions made by the individual classifiers are combined using the posterior probabilities of each class in the constituent classifiers. This method looks at the classification problem from various perspectives by randomizing the selection of features.
Stacking
Stacking [27] is a meta-classification scheme which employs an ensemble of classifiers and performs the learning task on two levels. First, the classifiers in the ensemble are trained on the data; then the meta-classifier learns from their predictions and the training labels of the data.
Key Performance Indices / Metrics Utilized
Our key performance index is the accuracy of predicting when the user will drop off in the video. To obtain these predictions, we perform 10-fold cross-validation on the available data using various classification methods. In 10-fold cross-validation, the data is randomly partitioned into 10 subsets and predictions are made on each of these. These predictions are then aggregated to provide the overall performance of the classifier, which we measure by accuracy and the area under the Receiver Operating Characteristic curve (AUROC), both of which are described in further detail below. Each of these measures depicts a different aspect of the prediction results.

Dataset    | Classifier | Acc   | AUROC
Multiclass | NB         | 0.416 | -
Multiclass | C4.5       | 0.547 | 0.699
Multiclass | RIPPER     | 0.547 | 0.629
Multiclass | DT         | 0.543 | 0.717
Multiclass | ST         | 0.569 | -

Table 3. Summary of results obtained for each classifier and dataset. The classifiers used are NB: Naive Bayes; C4.5: C4.5 decision tree; RIPPER: Repeated Incremental Pruning to Produce Error Reduction; DT: Decision Table; ST: Stacking using random subspaces of decision trees.
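The 10-fold cross-validation procedure described above can be sketched generically; the classifier-wrapper interface used here is our own assumption, not part of the study's pipeline.

```python
import random

def k_fold_cv(rows, labels, train_and_predict, k=10, seed=0):
    """Generic k-fold cross-validation returning overall accuracy.

    `train_and_predict(train_rows, train_labels, test_rows)` wraps any
    classifier and returns one prediction per test row.
    """
    idx = list(range(len(rows)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]      # k roughly equal partitions
    correct = 0
    for fold in folds:
        held_out = set(fold)
        train = [i for i in idx if i not in held_out]
        preds = train_and_predict([rows[i] for i in train],
                                  [labels[i] for i in train],
                                  [rows[i] for i in fold])
        correct += sum(p == labels[i] for p, i in zip(preds, fold))
    return correct / len(rows)
```

Because every instance is held out exactly once, the aggregated accuracy reflects performance on unseen data rather than training fit.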
Accuracy
The accuracy of a classifier is perhaps the simplest measurement of its performance. It represents the percentage of total instances that were correctly classified, and we would like it to be as high as possible. The baseline for accuracy is that of a perfectly random prediction: for a binary classification problem this is 50%, and for a 5-class problem the baseline accuracy is 20%. Any classifier which delivers statistically greater accuracy than the respective baseline is considered better than a random predictor.
Receiver Operating Characteristics (ROC) curves
A system tuned to increase accuracy is not necessarily a good predictor, as relying on accuracy alone does not provide insight into the nature of the misclassified instances. ROC curves [11] are a way to quickly compare multiple classifiers. The goal of a classifier in ROC space is to be as close to the upper-left corner as possible; if the curve for one classifier is closer to the upper-left corner than that of another, it is considered to have superior performance.
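The area under the ROC curve can be computed directly from classifier scores via the rank-statistic formulation: AUROC equals the probability that a randomly chosen positive instance is scored above a randomly chosen negative one. This is our own sketch; the paper does not specify how its AUROC values were computed.

```python
def auroc(scores, labels):
    """AUROC for binary labels (True = positive class).

    Counts the fraction of (positive, negative) pairs in which the
    positive instance receives the higher score, with ties worth half.
    """
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))
```

A perfect ranking yields 1.0, random scoring averages 0.5, and a fully inverted ranking yields 0.0, mirroring the upper-left-corner intuition above.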
EXPERIMENTAL RESULTS
We evaluated the performance of each of the classifiers used, with 10-fold cross-validation for both the multiclass and the binary classification predictions. Table 3 summarizes the results of all experiments.
Multiclass Prediction
We see that in terms of sheer accuracy, the stacked classifiers performed slightly better than the other methods, achieving an accuracy of 56.9%. In terms of AUROC, however, Naive Bayes performs much better, closely followed by Decision Tables. These simple classifiers may not have the best accuracy, but they outperform the others on AUROC.
Binary Class Prediction
In this second scenario, we associate a semantic meaning with the drop-off percentage point and predict whether the user will exit early or not. This refinement of the problem statement gives us much better performance across the board. The stacked classifiers, for instance, achieve a remarkable accuracy of 84.6% when predicting which users exited their video streams prematurely. As was the case with the multiclass problem, we again saw that Decision Tables and Naive Bayes surpassed the other classifiers in terms of AUROC values. Though the stacked classifiers give greater accuracy, they are not as good as Decision Tables or Naive Bayes at predicting early drop-off. This is still reflective of the general trends observed in the multiclass problem: as we have merely merged classes, the underlying data remains the same.

In both the multiclass and the binary class prediction, we observe that simpler rule-based learners outperform complicated meta-classifiers. This is documented in [9], which shows that stacking does not always outperform the best individual classifier.
Figure 5. ROC curves for the binary class problem. A comparison of various classifiers to predict early exit behavior.
We see that simple classification algorithms can be used to achieve comparable, or even better, performance than complicated meta-classifiers. Besides this performance superiority, it is desirable to use simpler classifiers on grounds of computational complexity, as they are algorithmically more scalable and thus offer faster runtimes.
CONCLUSIONS
We demonstrated how clickstream data can be used to predict "early exits" in online videos. By constructing models to this effect, we were able to identify with high accuracy which video streaming sessions are likely to terminate prematurely. Additionally, we compared and contrasted the performance of a number of classifiers, highlighting those that we found to be particularly fit for this problem. Having knowledge of such information would allow content providers to personalize how their media is distributed so as to increase user retention and, as a result, business value.
REFERENCES
1. Banerjee, A., and Ghosh, J. Clickstream clustering using weighted longest common subsequences. In Proceedings of the Web Mining Workshop at the 1st SIAM Conference on Data Mining, vol. 143 (2001), 144.
2. Bayes, M., and Price, M. An essay towards solving a problem in the doctrine of chances. Philosophical Transactions (1683-1775) (1763), 370-418.
3. Breiman, L. Random forests. Machine Learning 45, 1 (2001), 5-32.
4. Bucklin, R. E., and Sismeiro, C. A model of web site browsing behavior estimated on clickstream data. Journal of Marketing Research (2003), 249-267.
5. Cohen, W. W. Fast effective rule induction. In ICML, vol. 95 (1995), 115-123.
6. Das, A. S., Datar, M., Garg, A., and Rajaram, S. Google news personalization: scalable online collaborative filtering. In Proceedings of the 16th International Conference on World Wide Web, ACM (2007), 271-280.
7. Dobrian, F., Sekar, V., Awan, A., Stoica, I., Joseph, D. A., Ganjam, A., Zhan, J., and Zhang, H. Understanding the impact of video quality on user engagement. SIGCOMM Computer Communication Review 41, 4 (2011), 362.
8. Dreze, X., and Hussherr, F.-X. Internet advertising: Is anybody watching? Journal of Interactive Marketing 17, 4 (2003), 8-23.
9. Džeroski, S., and Ženko, B. Is combining classifiers with stacking better than selecting the best one? Machine Learning 54, 3 (2004), 255-273.
10. Eom, J.-H., and Zhang, B.-T. Machine learning-based text mining for biomedical information analysis. Genomics & Informatics 2, 2 (2004), 99-106.
11. Fawcett, T. An introduction to ROC analysis. Pattern Recognition Letters 27, 8 (2006), 861-874.
12. Forman, G. An extensive empirical study of feature selection metrics for text classification. The Journal of Machine Learning Research 3 (2003), 1289-1305.
13. Kim, J., Guo, P. J., Seaton, D. T., Mitros, P., Gajos, K. Z., and Miller, R. C. Understanding in-video dropouts and interaction peaks in online lecture videos.
14. Guyon, I., and Elisseeff, A. An introduction to variable and feature selection. The Journal of Machine Learning Research 3 (2003), 1157-1182.
15. Hall, M. A. Correlation-based Feature Selection for Machine Learning. PhD thesis, The University of Waikato, 1999.
16. Hauser, J. R., Urban, G. L., Liberali, G., and Braun, M. Website morphing. Marketing Science 28, 2 (2009), 202-223.
17. Ho, T. K. The random subspace method for constructing decision forests. IEEE Transactions on Pattern Analysis and Machine Intelligence 20, 8 (1998), 832-844.
18. Kohavi, R. The power of decision tables. In Machine Learning: ECML-95. Springer, 1995, 174-189.
19. Liu, J., Dolan, P., and Pedersen, E. R. Personalized news recommendation based on click behavior. In Proceedings of the 15th International Conference on Intelligent User Interfaces, ACM (2010), 31-40.
20. Mobasher, B., Cooley, R., and Srivastava, J. Automatic personalization based on web usage mining. Communications of the ACM 43, 8 (2000), 142-151.
21. Moe, W. W. Buying, searching, or browsing: Differentiating between online shoppers using in-store navigational clickstream. Journal of Consumer Psychology 13, 1 (2003), 29-39.
22. Montgomery, A. L., Li, S., Srinivasan, K., and Liechty, J. C. Modeling online browsing and path analysis using clickstream data. Marketing Science 23, 4 (2004), 579-595.
23. Quinlan, J. R. Induction of decision trees. Machine Learning 1, 1 (1986), 81-106.
24. Quinlan, J. R. C4.5: Programs for Machine Learning, vol. 1. Morgan Kaufmann, 1993.
25. Srivastava, J., Cooley, R., Deshpande, M., and Tan, P.-N. Web usage mining: Discovery and applications of usage patterns from web data. ACM SIGKDD Explorations Newsletter 1, 2 (2000), 12-23.
26. Witten, I. H., and Frank, E. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann, 2005.
27. Wolpert, D. H. Stacked generalization. Neural Networks 5, 2 (1992), 241-259.
28. Yang, Y., and Pedersen, J. O. A comparative study on feature selection in text categorization. In ICML (1997), 412-420.