Predictability of Popularity: Gaps between Prediction and Understanding
Benjamin Shulman
Dept. of Computer Science, Cornell University
[email protected]
Amit Sharma
Microsoft Research, New York, NY
[email protected]
Dan Cosley
Information Science, Cornell University
[email protected]
Abstract
Can we predict the future popularity of a song, movie or tweet? Recent work suggests that although it may be hard to predict an item’s popularity when it is first introduced, peeking into its early adopters and properties of their social network makes the problem easier. We test the robustness of such claims by using data from social networks spanning music, books, photos, and URLs. We find a stronger result: not only do predictive models with peeking achieve high accuracy on all datasets, they also generalize well, so much so that models trained on any one dataset perform with comparable accuracy on items from other datasets.

Though practically useful, our models (and those in other work) are intellectually unsatisfying because common formulations of the problem, which involve peeking at the first small-k adopters and predicting whether items end up in the top half of popular items, are both too sensitive to the speed of early adoption and too easy. Most of the predictive power comes from looking at how quickly items reach their first few adopters, while for other features of early adopters and their networks, even the direction of correlation with popularity is not consistent across domains. Problem formulations that examine items that reach k adopters in about the same amount of time reduce the importance of temporal features, but also overall accuracy, highlighting that we understand little about why items become popular while providing a context in which we might build that understanding.

Introduction

How does a book, song, or a movie become popular? The question of how cultural artifacts spread through social networks has captured the imagination of scholars for decades. Many factors are cited as important for an item to spread virally through social networks and become popular: its intrinsic quality (Gladwell 2006a; Simonoff and Sparrow 2000), the characteristics of its initial adopters (Gladwell 2006b), the emotional response it elicits (Berger and Milkman 2012), and so on.
Often, explanations are used to justify the popularity of different items after the fact (Berger 2013), making it hard to apply these explanations to new events (Watts 2011).

Online social networks allow us to observe individual-level traces of how items are transferred between people, allowing more precise modeling of the phenomenon. Predicting the future popularity of an item based on attributes
of the item and the person who introduced it has emerged as a useful problem, both to understand processes of information diffusion and to inform content creation and feed design on social media platforms. For example, Twitter’s managers may want to highlight new tweets that are more likely to become popular, while its users may want to learn from characteristics of popular tweets to improve their own.

In general, even with detailed information about an item’s content or the person sharing it, it is hard to predict which items will become more popular than others (Bakshy et al. 2011; Martin et al. 2016). The problem becomes more tractable when we are allowed to peek into the initial spread of an item. The intuition is that early activity data about the speed of adoption, characteristics of people who adopt it and the connections between them might predict the item’s fate. This intuition shows encouraging results for both predicting the final popularity of an item (Szabo and Huberman 2010; Pinto, Almeida, and Gonçalves 2013; Zhao et al. 2015) and whether an item will end up in the top 50% of popular items (Cheng et al. 2014; Romero, Tan, and Ugander 2013; Weng, Menczer, and Ahn 2013).

Buoyed by these successes, one might conclude that the availability of rich features about the item and social network of early adopters has helped us understand why items become popular. However, past work studies individual datasets and varying versions of the prediction problem, making it hard to compare results. For instance, studies disagree on the direction of the effect of network structural features on item popularity (Lerman and Hogg 2010; Romero, Tan, and Ugander 2013).

In this paper, we try to unify these observations on popularity prediction through studying different problem formulations and kinds of features over a wide range of online social networks.
Using an existing formulation that predicts whether the final popularity of items is above the median based on features of the first five adopters (Cheng et al. 2014), we confirm past work (Szabo and Huberman 2010) showing that features about those adopters and their social network are at best weak predictors of popularity compared to temporal features. For instance, a single temporal heuristic—the average rate of early adoption—is a better predictor than all non-temporal features combined across all four websites. Further, models trained on one dataset and tested on others using temporal features generalize fairly well, while those that use network structural features generalize badly.

In one reading, this is a useful contribution: peeking-based popularity models that include temporal information achieve up to 83% accuracy on Twitter and generalize well across datasets. From a practical standpoint, we encourage content distributors to use temporal features for predicting the future success of items.

Intellectually, however, our finding is not very satisfying. Rather than identifying features that shed light on why items become popular, we mostly see that items that become popular fast are more likely to achieve higher popularity in the end. Rapid adoption may be a signal of quality, interestingness, and eventual popularity—but doesn’t tell us why. The effect might also be driven by cumulative advantage (Frank and Cook 2010; Watts 2011): items that receive attention early have more chances to spread via interfaces that highlight popular or trending items.

An alternative formulation of the problem that reduces the effect of temporal features lets us see just what early adopter and network features tell us.
This formulation, called Temporal Matching, compares items that achieve similar levels of popularity in the same amount of time, rather than the more common formulation of looking at the first k adopters regardless of the time it takes to reach k. Controlling for the average rate of an item’s adoption turns popularity prediction into a hard problem. Using the same features as before, prediction accuracy across all datasets drops below 65%. Such a decrease in accuracy underscores the importance of choosing problem formulations that highlight relevant phenomena in popularity evolution. Current models may fare well on certain formulations, but there is still much to learn about how items become popular.

Formulations of the prediction problem
We start by identifying two key dimensions to consider when defining the popularity prediction task: how much peeking into early activity on an item is allowed, and whether the task is a regression or classification. For ease of exposition, we use item to denote entities that are consumed in online social networks.
Adoption refers to an explicit action or endorsement of an item, such as loving a song, favoriting a photo, rating a book highly or retweeting a URL. Finally, we define popularity of an item as the number of people who have adopted it.
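These definitions can be made concrete: popularity is simply the size of an item’s set of distinct adopters. A minimal sketch (the adoption-log format here is our own hypothetical stand-in for the per-site logs described later, not the authors’ code):

```python
from collections import defaultdict

# Each record is (user, item, timestamp); a hypothetical stand-in
# for the per-site adoption logs (loves, favorites, ratings, retweets).
adoptions = [
    ("alice", "song_1", 10),
    ("bob", "song_1", 12),
    ("alice", "song_2", 15),
    ("bob", "song_1", 20),  # repeat adoptions by the same user count once
]

def popularity(log):
    """Popularity of an item = number of distinct people who adopted it."""
    adopters = defaultdict(set)
    for user, item, _ in log:
        adopters[item].add(user)
    return {item: len(users) for item, users in adopters.items()}

print(popularity(adoptions))  # {'song_1': 2, 'song_2': 1}
```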
Predicting a priori versus peeking into early activity
Predicting popularity a priori for items such as movies (Simonoff and Sparrow 2000) or songs (Pachet and Sony 2012) has long been considered a hard problem. One of the most successful approaches has been to gauge audiences’ interest in an item before it is officially released, such as by measuring the volume of tweets (Asur, Huberman, and others 2010) or search queries (Goel et al. 2010). Such methods can work well for mainstream, popular items for which there might be measurable prior buzz, but are unlikely to be useful for genuinely new items such as tweets or photos uploaded by users.

For such items, popularity prediction is tricky, even when precise data about the content of each tweet and the seed user’s social network is known. On Twitter, models with extensive content features such as the type of content, its source and topic, crowdsourced scores of interestingness, and features about the seed user such as indegree and past popularity of tweets are only able to explain less than half of the variance in popularity (Martin et al. 2016). Further, the content features are usually less important than features of the seed user (Bakshy et al. 2011; Martin et al. 2016; Jenders, Kasneci, and Naumann 2013).

In response, scholars have suggested modified versions of the problem where one peeks into early adoption activity for an item. In studies on networks including Facebook (Cheng et al. 2014), Twitter (Lerman and Hogg 2010; Zhao et al. 2015; Tsur and Rappoport 2012; Kupavskii et al. 2013), Weibo (Yu et al. 2015), Digg (Lerman and Hogg 2010; Szabo and Huberman 2010) and Youtube (Pinto, Almeida, and Gonçalves 2013), early activity data consistently predicts future popularity with reasonable accuracy. In light of these results, we focus on the peeking variant of the problem in this paper.
Classification versus regression
In addition to how much data we look at, we must also specify what to predict. A number of studies have used regression formulations, predicting an item’s exact final popularity: the number of retweets for a URL (Bakshy et al. 2011), votes on a Digg post (Lerman and Hogg 2010) or page views of a Youtube video (Szabo and Huberman 2010). However, we may often be more interested in popularity relative to other items rather than an exact estimate. For example, both marketers and platform owners may want to select ‘up and coming’ items to feature in the interface versus others.

(Such featuring makes some items more salient than others and surely affects the final popularity of both featured and non-featured items; typically, formulations of the problem look at very small slices of early activity, which presumably minimizes these effects.)

These motivations lead nicely to a classification problem where the goal is to predict whether an item will be more popular than a certain percentage of other items. For instance, Romero et al. predict whether the number of adopters of a hashtag on Twitter will double, given a set of hashtags with the same number of initial adopters (Romero, Tan, and Ugander 2013). Cheng et al. generalize this formulation to show that predicting whether an item will double its popularity is equivalent to classifying whether an item becomes more popular than the median and study this question in the case of Facebook photos that received at least five adopters (Cheng et al. 2014). Besides the practical appeal of classifying popular items, classification is also a simpler task than predicting the actual number of adoptions (Bandari, Asur, and Huberman 2012), thus providing a favorable scenario for evaluating the limits of predictability of popularity. Therefore, we focus on the classification problem in this paper.

Study | Problem Formulation | Content | Structural | Early Adopters | Temporal
Bakshy et al. (2011) | Regression (no peeking) | n | – | Y | –
Martin et al. (2016) | Regression (no peeking) | n | – | Y | –
Szabo et al. (2010) | Regression | – | n | – | Y
Tsur et al. (2012) | Regression | Y | Y | – | Y
Pinto et al. (2013) | Regression | – | – | – | Y
Yu et al. (2015) | Regression | – | n | – | Y
Romero et al. (2013) | Classification (k = {…}, n = 50%) | – | Y | – | –
Cheng et al. (2014) | Classification (k = 5, n = 50%) | n | Y | Y | Y
Lerman et al. (2008) | Classification (k = 10, n = 80%) | – | Y | Y | –
Weng et al. (2013) | Classification (k = 50, n = {…}) | – | Y | n | –

Table 1: A taxonomy of problem formulations for popularity prediction, along with importance of feature categories. Y means that the features in the category were useful for prediction, n means they were tried but not as useful, and – means they were not studied. Most studies report temporal and structural features as important predictors.

Our problem: Peeking-based classification
Based on the above discussion, the general peeking-based classification problem can be stated as:
P1:
Given a set of items and data about their early adoptions, which among them are more likely to become popular?
This question has a broad range of formulations based on how we define the early activity period, how much activity we are allowed to peek at, and how we define popular. The early activity period may be defined in terms of time elapsed t since an item’s introduction (Szabo and Huberman 2010), or in terms of a fixed number k of early adoptions (Romero, Tan, and Ugander 2013). Fixing the early activity period in terms of number of adoptions has the useful side effect of filtering out items with fewer than k adoptions overall, both making the problem harder and eliminating unpopular (thus often uninteresting) items. For this reason, most past work on peeking-based classification defines early activity in terms of the number of adoptions k.

The popularity threshold for what is “popular” may also be set at different percentiles (n%). Table 1 summarizes past work based on their choices of problem formulation and choice of (k, n). One common approach is to collect all items that have k or more adoptions, then peek into the first k adoptions and predict whether eventual popularity of items lies above or below the median (Cheng et al. 2014). We call this Balanced Classification since there are guaranteed to be an equal number of high and low popularity items. Another variation is to only consider the top-n percentile of items as high popularity (Lerman and Galstyan 2008), a formulation that is arguably better aligned with most use cases around content promotion than Balanced Classification. However, it is also harder than Balanced Classification; for this reason, and to continue to align with prior work, we focus on Balanced Classification.

While restricting to items with k adoptions helps to level the playing field because it provides a set of comparably popular items to study, it ignores the time taken to reach k adoptions. Based on prior work, our suspicion is that in this formulation temporal features dominate the others.
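The Balanced Classification setup can be sketched as follows (our own toy implementation, not the authors’ code), given a map from items to their adoption timestamps:

```python
import statistics

def balanced_labels(adoption_times, k=5):
    """Balanced Classification: keep items with at least k adoptions,
    peek only at the first k adoption times, and label an item 1 if its
    final popularity is above the median of the retained items."""
    items = {i: ts for i, ts in adoption_times.items() if len(ts) >= k}
    med = statistics.median(len(ts) for ts in items.values())
    labels = {i: int(len(ts) > med) for i, ts in items.items()}
    peeks = {i: sorted(ts)[:k] for i, ts in items.items()}  # visible data
    return peeks, labels

# Toy adoption logs: item -> list of adoption timestamps.
log = {"a": [0, 1, 2, 3, 4],
       "b": list(range(10)),
       "c": list(range(7)),
       "d": [1, 2]}            # fewer than k adoptions: filtered out
peeks, labels = balanced_labels(log)
print(labels)  # {'a': 0, 'b': 1, 'c': 0}
```

Only `peeks` may be used to build features; `labels` is the prediction target.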
To control for this temporal signal, we later introduce a problem formulation where both k and t are fixed. That is, we collect all items that received exactly k adoptions in a given time period t, and then predict which of them would be in the top half of popular items. We call this the Temporally Matched Balanced Classification problem, and as we will see, changing the definition has a profound impact on the quality of the models.

Choosing features
We now turn to the selection of features for prediction. Part of the allure of modeling is that the features that prove important might give information about why some items become popular in ways that could be both practically and scientifically interesting. Features used in prior work can be broadly grouped into four main categories: content, structural, early adopters and temporal (Cheng et al. 2014). Table 1 shows which feature categories were used in prior studies, with Y marking features that were reported to be useful for prediction. While all feature categories have been reported to be important contributors to prediction accuracy in at least some studies, temporal and structural features are frequently reported as important.

Temporal patterns of early adoption—how quickly the early adopters act—are a major predictor of popularity. Szabo and Huberman show that temporal features alone can predict future popularity reliably (Szabo and Huberman 2010). When information about the social network or its users is hard to obtain, utilizing temporal features can be fruitful, achieving error rates as low as 15% in a regression formulation (Pinto, Almeida, and Gonçalves 2013; Zhao et al. 2015). A natural next question is to ask how much these errors can be decreased by adding other features when we do have such information.

Features about the seed user and early resharers—collectively called early adopters—also matter. On Twitter, for example, the number of followers of the seed user and the fraction of her past tweets that received retweets increase the accuracy of predictions (Tsur and Rappoport 2012). Information about other early adopters is also useful for predicting photo cascades in Facebook (Cheng et al. 2014).

The structure of the underlying social network also has predictive power (Lerman and Galstyan 2008; Romero, Tan, and Ugander 2013; Cheng et al. 2014). However, these studies do not agree on the direction of effect of these features. For instance, on Digg, low network density is connected with high popularity (Lerman and Galstyan 2008), but on Twitter, both very low and very high densities are positively correlated with popularity (Romero, Tan, and Ugander 2013). Their intuition is that a lower network density indicates that the item is capable of appealing to a general audience, while a higher network density indicates a tight-knit community supporting the item, both of which can be powerful drivers for an item’s popularity.

Finally, while Tsur et al. report content features to be useful (Tsur and Rappoport 2012), most studies find content features to have little predictive power (Table 1). Even for domains such as songs or movies where item information is readily available, content features are not significantly associated with item popularity (Pachet and Sony 2012). Further, content features do not generalize well; it is hard to compute generalizable content features across different item domains. For these reasons, we do not consider content features in this work.
Features
Based on the above discussion, we use the following categories of features, with the aim of reproducing and extending the features used in past work (Cheng et al. 2014): temporal, structural, and early adopters. To these we add a set of novel features based on preference similarity between early adopters.
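The preference-similarity features, described in detail below, are based on the Jaccard index of two users’ prior adoption sets; a minimal sketch with made-up adoption histories:

```python
def jaccard(a, b):
    """Jaccard index of two adoption sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical prior adoptions of two early adopters.
u1 = {"item1", "item2", "item3"}
u2 = {"item2", "item3", "item4"}
print(jaccard(u1, u2))  # 2 shared out of 4 total -> 0.5
```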
Temporal.
These features have to do with the speed of adoptions during the early adoption period between the first and kth adoption. This leads to a set of features that focus on the rate of adoption:

• time_i: time between the initial adoption and the i-th adoption (2 ≤ i ≤ k). (Zhao et al. 2015; Maity et al. 2015; Weng, Menczer, and Ahn 2013)
• time_{1...k/2}: Mean time between adoptions for the first half (rounded down) of the adoptions.
• time_{k/2...k}: Mean time between adoptions for the last half (rounded up) of the adoptions.

Structural.
These features have to do with the structure of the network around early adopters and can be broken down into two sub-categories: ego network features that relate the early adopters to their local networks, and subgraph features that consider only connections between the early adopters.
Early adopters’ ego network features

• in_i: Indegree of the i-th early adopter (1 ≤ i ≤ k). This is a proxy for the number of people who may be exposed to an early adopter’s activity. For undirected networks, this will simply be the degree, or the number of friends of an early adopter. (Bakshy et al. 2011; Zhao et al. 2015)
• reach: Number of nodes reachable in one step from the early adopters.
• connections: Number of edges from early adopters to the entire graph. (Romero, Tan, and Ugander 2013)

Early adopters’ subgraph features

• indegree_sub: Mean indegree (friends or followers) for each node in the subgraph of early adopters. (Lerman and Galstyan 2008)
• density_sub: Number of edges in the subgraph of early adopters. (Romero, Tan, and Ugander 2013)
• cc_sub: Number of connected components in the subgraph of early adopters. (Romero, Tan, and Ugander 2013)
• dist_sub: Mean distance between connected nodes in the subgraph of early adopters. This is meant to measure how far the item has spread in the initial early adopters, similar to the cascade depth feature by Cheng et al.
• sub_in_i: Indegree of the i-th adopter on the subgraph (1 ≤ i ≤ k). (Lerman and Galstyan 2008)

Features of early adopters.
These features capture information about early adopters, such as their popularity, seniority, or activity level, which might be proxies for their influence. They can be divided into two sub-categories: features of the first user to adopt an item (root), and features averaged over other early adopters (resharers).
Root features

• activity_root: Number of adoptions in the four weeks before the end of the early adoption period. This is similar to a measure used by Cheng et al. which measured the number of days a user was active. (Cheng et al. 2014; Petrovic, Osborne, and Lavrenko 2011; Yang and Counts 2010)
• age_root: Length of time the user has been registered on the social network.
• popularity_root: Number of friends or followers on the social network. (Lerman and Galstyan 2008; Tsur and Rappoport 2012)

Resharer features

• activity_resharer: Mean number of adoptions in the four weeks before the end of the early adoption period.
• age_resharer: Mean length of time the users have been registered on the social network.
• popularity_resharer: Mean number of friends or followers on the social network. (Tsur and Rappoport 2012)

Similarity.
To these previously tested features, we add features related to preference similarity between the early adopters. As with network density, our intuition is that similarity between early adopters may matter in two ways: high similarity may signify a niche item, or one that people with similar interests are likely to adopt, while low similarity might indicate an item that could appeal to a wide variety of people.

Similarity was computed using the Jaccard index of two users’ adoptions that occurred before the end of the early adoption period of the item in question. We computed the median, mean and maximum of similarity between adopters because these give us an idea of the distribution of the affinity of the early adopters; we do not include users who had fewer than five adoptions before the item in question because they are likely to have little overlap. The features we extracted are:

• sim_count: Number of similarities that could be computed between early adopters.
• sim_mean: Mean similarity between early adopters.
• sim_med: Median similarity between early adopters.
• sim_max: Maximum similarity between early adopters.

Dataset | Last.fm | Twitter | Flickr | Goodreads
Number of users | 437k | 737k | 183k | 252k
Number of items | 5.8M | 64k | 10.9M | 1.3M
Number of adoptions | 44M | 2.7M | 33M | 28M
Mean adoptions | 7.6 | 41.8 | 3.0 | 21.4
Median adoptions | 1 | 1 | 1 | 1
Maximum adoptions | 11062 | 82507 | 2762 | 88027

Table 2: Descriptive statistics for users, items, and adoptions in each dataset. We use adoption to mean loving a song on Last.fm, tweeting a URL on Twitter, favoriting a photo on Flickr, and rating a book on Goodreads. The average number of adoptions per item varies quite a bit, but the median popularity of 1 is consistent across datasets.

Data and Method
Datasets from four online social networks
We build models using data from four different online social platforms: Last.fm, Flickr, Goodreads and Twitter. These platforms span a broad range of online activity, including songs, photos, books and URLs; they also have a variety of user interfaces, use cases, and user populations. These variations reduce the risk of overfitting to properties of a particular social network.

• Last.fm: A music-focused social network where users can friend one another and love songs. We consider a dataset of 437k users and the songs they loved from their start date until February 2014 (Sharma and Cosley 2016).
• Flickr: A photo sharing website where users can friend one another and favorite photos. We use data collected over 104 days in 2006 and 2007 (Cha, Mislove, and Gummadi 2009).
• Goodreads: A book rating website where users can friend one another and rate books. The dataset consists of 252k users and their ratings before August 2010. Unlike the other sites, Goodreads users rate books; we consider any rating at or above 4 (out of 5) as an endorsement (adoption) of the book (Huang et al. 2012).
• Twitter:
A social networking site where users can form directed edges with one another and broadcast tweets, messages no longer than 140 characters (as of 2010). The Twitter dataset consists of URLs tweeted by 737k users for three weeks of 2010 (Hodas and Lerman 2014).

All of these websites have an active social network, providing an activity feed that allows users to explore, like, and reshare the items that their friends shared. The Last.fm feed shows songs that friends have listened to or loved, Flickr shows photos that friends have favorited, Goodreads shows books that friends have rated, and Twitter shows tweets with URLs that followees have favorited or retweeted. Thus, like past studies on online social networks such as Facebook, Twitter and Digg, we expect active peer influence processes that should make structural and early adopter features relevant.

Figure 1: Cumulative percentage of adoptions by items for each dataset. Items on the x-axis are sorted by their popularity; the lines show a step pattern because multiple items may have the same number of adoptions. We observe a substantial skew in popularity. For example, the most popular 20% of items account for 60% of adoptions in Flickr and more than 90% of adoptions in other datasets.

Table 2 shows descriptive statistics about the datasets, all of which have more than 150k users and millions of items (with the exception of Twitter with 64k URLs). Twitter has the highest mean adoptions per item (41.8), followed by Goodreads (21.4). The maximum number of adoptions for an item also varies, from more than 80k in Twitter and Goodreads to 2.7k in Flickr. The median number of adoptions is consistent, however: at least half of the items have only 1 adoption. The skew in popularity distribution is better shown in Figure 1. The 20% of the most popular items account for over 60% of adoptions in Flickr and over 90% of the adoptions in the other three websites. On Twitter, the skew is extreme: over 81% of adoptions are on 4% of items.

Figure 2: Boxplot showing the number of adoptions after 28 days (10 for Twitter) for items which have at least 5 adoptions. The bold partial line is the mean number of adoptions. Across datasets, most items receive fewer than 20 adoptions.
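The skew shown in Figure 1 corresponds to a simple top-share statistic; a sketch on a made-up, heavy-tailed distribution of adoption counts (the numbers are illustrative, not from the datasets):

```python
def top_share(adoption_counts, top_frac=0.2):
    """Fraction of all adoptions captured by the top `top_frac` of items."""
    counts = sorted(adoption_counts, reverse=True)
    n_top = max(1, int(len(counts) * top_frac))
    return sum(counts[:n_top]) / sum(counts)

# Hypothetical heavy-tailed popularity distribution over 10 items.
counts = [1000, 500, 100, 50, 10, 5, 2, 1, 1, 1]
print(round(top_share(counts), 2))  # top 20% of items hold ~0.9 of adoptions
```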
Classification methodology
We first operationalize the Balanced Classification formulation on these datasets. As a reminder, k is the number of early adoptions that we peek at for each item, and we predict which of these items will end up more popular than the median item.

We measure the final popularity at a time T days after the first adoption of the item. To be consistent with prior work, we follow Cheng et al. and set k = 5 and T = 28 days for Last.fm, Flickr and Goodreads. Because the Twitter dataset is only three weeks long, we use a smaller T = 10 days. To avoid right-censoring, we include only items that had their first adoption at least T days before the last recorded timestamp in each dataset. The parameter k also acts as a filter, allowing only items with at least k adoptions. Figure 2 shows properties of the data thus constructed.

We classify items based on their popularity after T days, labeling those above the median as 1 and others as 0. For each item, we extract features from the early adoption period, the time between the first and kth adoption. We use 5-fold cross validation to select the items that we train on, then use the trained model to predict final popularity of items in the test set. Since we use median popularity as the classification threshold, the test data has a roughly equal number of items in each class, allowing us to use accuracy as a reasonable evaluation metric. We tried several classification models using Weka (Hall et al. 2009), including logistic regression, random forests and support vector machines. Logistic regression models generally performed best, so we report results for those models unless otherwise specified.

Balanced classification
We start by comparing the predictive power of models using different sets of features across the four datasets on the Balanced Classification problem.
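The paper runs this evaluation in Weka; an equivalent sketch of the loop in Python with scikit-learn, on synthetic data standing in for the real feature tables (so the resulting numbers mean nothing beyond illustrating the protocol):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: one informative "temporal" feature (time to reach
# k adoptions; shorter -> more popular) plus irrelevant noise features.
n = 1000
time_to_k = rng.exponential(scale=2.0, size=n)
noise = rng.normal(size=(n, 3))
X = np.column_stack([time_to_k, noise])
y = (time_to_k < np.median(time_to_k)).astype(int)  # balanced labels

# 5-fold cross-validated accuracy with logistic regression,
# mirroring the evaluation described in the text.
scores = cross_val_score(LogisticRegression(), X, y, cv=5,
                         scoring="accuracy")
print(scores.mean())
```

Because the labels are balanced by construction, accuracy is directly comparable to the 50% random baseline, as in the paper.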
Figure 3: Accuracy for prediction models incorporating different categories of features. The y-axis starts at 50%, the baseline for a random classifier on the balanced formulation. On all datasets, temporal features are the most predictive, almost as accurate as using all available features.
Temporal features dominate
Figure 3 shows the prediction accuracy of the models. Similar to prior work on Facebook that used peeking (Cheng et al. 2014), when using all features we are able to predict whether an item will be above the median popularity around three-fourths of the time: 73% for Goodreads, 75% for Flickr, 81% for Last.fm and 83% for Twitter.

Training models with individual feature categories shows that temporal features are by far the most important. Across all four datasets, a model using only temporal features performs almost as well as the full model. The next best feature category, resharer features, is able to predict 71% on Twitter and less than 60% on the other three datasets. Even a model that uses all non-temporal features, denoted by the “all-temporal” line in Figure 3, is not very good. For Goodreads and Flickr, this model is not much better than a random classifier. For Last.fm and Twitter, accuracy for non-temporal features improves somewhat, but is still at least 10% worse than when including temporal features.

Even a single temporal feature can be more predictive than models constructed from all non-temporal features. Consider the feature time_x, which is the number of days for an item to receive x adoptions. At x = 5 = k, the feature time_5—the time taken for an item to receive 5 adoptions—is the most predictive temporal feature for all datasets. A model based on this single feature achieves more than 70% accuracy on all datasets and accounts for nearly 97% of the accuracy of the full model for each dataset. While past work has highlighted the importance of temporal features as a whole (Szabo and Huberman 2010; Cheng et al. 2014), it is interesting to find that we may not even need multiple temporal features: a single measure is able to predict the final popularity class label for items in all datasets.

Table 3: Prediction accuracy for models trained on one dataset (columns) and tested on each dataset (rows). The diagonals report accuracy on the same dataset, while other cells report accuracy when the model is trained on one dataset and tested on another. The power of temporal features generalizes across domains: testing a model on any dataset, trained on any other dataset, loses no more than 5% accuracy compared to testing a model on the same dataset. For non-temporal features, prediction accuracy decreases substantially when applying models to other datasets.
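The single-feature heuristic discussed above (time_5, the time an item takes to reach k = 5 adoptions) reduces to a simple threshold rule; a sketch on invented adoption times (our own toy illustration, not the authors’ code):

```python
import statistics

def time_to_k(timestamps, k=5):
    """Days from the first to the k-th adoption (timestamps sorted)."""
    return timestamps[k - 1] - timestamps[0]

# Hypothetical items: (first five adoption times in days, final popularity).
items = {
    "fast":   ([0, 1, 1, 2, 2], 400),
    "medium": ([0, 3, 5, 8, 9], 40),
    "slow":   ([0, 10, 15, 22, 27], 6),
}
speeds = {name: time_to_k(ts) for name, (ts, _) in items.items()}
threshold = statistics.median(speeds.values())

# Predict "above-median popularity" for items that reached k adoptions
# strictly faster than the typical item.
pred = {name: int(s < threshold) for name, s in speeds.items()}
print(pred)  # {'fast': 1, 'medium': 0, 'slow': 0}
```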
Cross-domain prediction
The analysis in the previous section confirms past findings about the importance of temporal features across a range of websites. We now extend these results to show that temporal features are not only powerful, they are also general: models learned on one item domain using temporal features are readily transferable to others. In contrast, non-temporal features do not generalize well: even the direction of their effect is not consistent across domains. To show this, we train prediction models separately for each dataset, as before, then apply each model to every dataset.
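The train-on-one, test-on-another protocol can be sketched as follows (scikit-learn on synthetic single-feature data; the dataset names and accuracies are artifacts of the toy setup, not the paper’s results):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_dataset(scale, n=500):
    """Synthetic stand-in for one site: the label is 'reached k adoptions
    faster than the site's median item', with a site-specific time scale."""
    t = rng.exponential(scale=scale, size=(n, 1))
    y = (t[:, 0] < np.median(t)).astype(int)
    return t, y

datasets = {"siteA": make_dataset(1.0), "siteB": make_dataset(1.2)}

# Train on each dataset, then test the fitted model on every dataset.
results = {}
for train_name, (Xtr, ytr) in datasets.items():
    model = LogisticRegression().fit(Xtr, ytr)
    for test_name, (Xte, yte) in datasets.items():
        results[(train_name, test_name)] = model.score(Xte, yte)

for (tr, te), acc in results.items():
    print(f"train={tr} test={te} acc={acc:.2f}")
```

The diagonal entries of `results` correspond to within-site accuracy and the off-diagonal entries to cross-domain accuracy, matching how Table 3 is laid out.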
Temporal features generalize
Table 3 shows the accuracy of models trained only on temporal features from one dataset and tested on all four. Reading across the rows shows that regardless of which social network a model was trained on, its accuracy on test data from another network remains within 5% of the accuracy on test data from the same network.

Such consistent prediction accuracy is impressive, especially because the median time to reach 5 adoptions varies, ranging from 1 day in Flickr to 15 days for Goodreads. This suggests that there are general temporal patterns that are associated with future popularity, at least across these particular networks.
Other features have inconsistent effects
The story is less rosy for non-temporal features. Table 3 shows the cross-domain prediction accuracy for models trained on all non-temporal features (in light of their low accuracy when taken individually, we combine all non-temporal features). Accuracies on the same dataset correspond to the "all-temporal" line in Figure 3; they are generally low and drop further when tested on a different dataset. In particular, models trained on other websites do poorly when tested on Twitter, with the Last.fm and Flickr models performing worse than a random guesser on Twitter data. Meanwhile, a model trained on Twitter is almost 10 percentage points worse than the Last.fm-trained model for predicting popularity on Last.fm.

Not only does prediction accuracy drop across websites, but fitting single-feature logistic regression models for each feature shows that for 12 of the 25 features, the coefficient flips between positive and negative across models fit on different datasets. Similar to the contrasting results found in prior work (Lerman and Hogg 2010; Romero, Tan, and Ugander 2013), we find that all measures of subgraph structure of the early adopters, namely indegree_sub, density_sub, cc_sub, dist_sub, and the sub_in_i features (with two exceptions among the sub_in_i), can predict either higher or lower popularity depending on the dataset. For example, a higher density_sub (the number of edges in the subgraph of early adopters) is associated with higher popularity on Flickr (β coefficient = 0.04), whereas on Last.fm, a higher density is associated with lower popularity (β coefficient = -0.09). Features from the root, resharer and similarity categories show a similarly dichotomous association with final item popularity.

Gaps between prediction and understanding
These results show that not only are non-temporal features weak predictors, but the direction of their effect on popularity is inconsistent across domains. Combining this with our observation that a single temporal heuristic is almost as good a predictor as the full model raises questions about what it is that popularity prediction models are predicting and how they contribute to our understanding of popularity.
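The sign-flip check itself is simple to state: fit a one-feature logistic regression per dataset and compare coefficient signs. A sketch with synthetic data (density_sub stands in for the early-adopter subgraph's edge count; the effect sizes are invented for illustration):

```python
# Sketch of the coefficient sign-flip check: the same feature can carry
# a positive effect on one site and a negative one on another.  Data is
# synthetic; density_sub stands in for the early-adopter subgraph's
# edge count.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def fitted_sign(true_effect, n=1000):
    """Simulate one dataset where density_sub has the given true effect
    on popularity, then return the sign of the fitted coefficient."""
    density_sub = rng.poisson(lam=4, size=n).astype(float)
    logits = true_effect * (density_sub - density_sub.mean())
    y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
    beta = LogisticRegression().fit(density_sub.reshape(-1, 1), y).coef_[0, 0]
    return int(np.sign(beta))

# A "Flickr-like" positive effect and a "Last.fm-like" negative one.
print(fitted_sign(+0.4), fitted_sign(-0.4))  # → 1 -1
```

A feature whose fitted sign depends on the dataset cannot support a general causal story, which is the point the paragraph above makes.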
Temporal features drive predictability
While our work may seem contrary to recent work claiming that early adopters and properties of their social network matter for prediction, many of those findings are consistent with our own. Most prior work that uses peeking finds that temporal features are a key predictor (Tsur and Rappoport 2012; Szabo and Huberman 2010; Pinto, Almeida, and Gonçalves 2013; Yu et al. 2015). Further, even though Cheng et al. conclude that temporal and structural features are major predictors of cascade size, they report that for predicting photos' popularity on Facebook, accuracy for temporal features alone (78%) is nearly as good as the full model (79.5%) (Cheng et al. 2014).

By holding modeling, feature selection and problem formulation consistent, we contribute to this literature by demonstrating the magnitude and generality of the predictive power of temporal features across a range of social networks. Having multiple networks also lets us show that, unlike temporal features, non-temporal features do not generalize well to new contexts. These features might be useful for understanding the particulars of a given website, but it seems likely that they are capturing idiosyncrasies of that site rather than telling us something general about how items become popular in social networks.

Is cumulative advantage the whole story?
If non-temporal features are weakly predictive and not generalizable, and all that matters is the rate of initial adoption, then how do predictive exercises with peeking advance scientific understanding of what drives popularity? In other words, what does it mean to claim that popularity is predictable once we know about initial adopters?

One answer is that early, rapid adoption is a signal of intrinsic features of an item that help to determine its popularity. Items with better content command a higher initial popularity, and thus the predictive power of early temporal features is simply a reflection of content quality or interestingness to the social network in question. Given increasing evidence from multiple domains that content features are at best weakly connected to an item's popularity (Salganik, Dodds, and Watts 2006; Pachet and Sony 2012; Martin et al. 2016), this seems unlikely to be the whole explanation.

Another explanation is that items that get attention early are more likely to be featured in the interface, via feeds, recommendations or ads; they might also be spread through external channels, which could drive up the rate of early adoption. These would be interesting questions to explore. Still, whatever the driving reasons, these models are telling us that once items achieve initial popularity, they are much more likely to become more popular in the future. This is simply a restatement of cumulative advantage, or the rich-get-richer phenomenon (Borghol et al. 2012).

Overall, though, we find that neither our results nor other work say much about why or how items become popular, except that items that share temporal patterns of popular items early on tend to be the ones that are more popular in the future, and that making popularity salient and ordering items by popularity can increase this effect (Salganik, Dodds, and Watts 2006).
While such predictions are practically useful for promoting content, they are not so useful for informing the creation of new content or assessing its value, nor for understanding the mechanisms by which items become popular.
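Cumulative advantage itself is easy to demonstrate in isolation. In this toy Pólya-urn-style simulation (our own illustration, not taken from the cited work), all items are identical in quality, yet popularity ends up skewed and early counts strongly predict final counts:

```python
# Toy rich-get-richer simulation: each adoption goes to an item with
# probability proportional to its current count plus a smoothing
# constant.  Items have no intrinsic quality differences, yet final
# popularity is highly unequal and early counts predict final counts.
import numpy as np

rng = np.random.default_rng(3)
n_items, n_adoptions, alpha = 200, 20000, 1.0

counts = np.zeros(n_items)
early = None
for step in range(n_adoptions):
    probs = (counts + alpha) / (counts + alpha).sum()
    counts[rng.choice(n_items, p=probs)] += 1
    if step == 999:
        early = counts.copy()  # popularity after the first 1000 adoptions

top10_share = np.sort(counts)[-10:].sum() / counts.sum()
corr = np.corrcoef(early, counts)[0, 1]
print(f"share of adoptions held by the top 10 of {n_items} items: {top10_share:.2f}")
print(f"correlation of early and final counts: {corr:.2f}")
```

The high early-to-final correlation is exactly the signal a temporal peeking model exploits, without saying anything about why some items pulled ahead early.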
Temporally matched balanced classification
In this section, we give a problem formulation that lessens the importance of temporal features by conditioning on the average rate of adoption. That is, instead of considering all items with k adoptions, we consider items with k adoptions within about the same amount of time. Given the dominance of cumulative advantage, such a formulation would be better suited for future research on understanding how items become popular, as gains in accuracy will likely shed light on attributes of early adopters, items, and networks that affect their final popularity.

k-t problem formulation

We call this formulation Temporally Matched Balanced Classification, or a k-t formulation of the problem:
P2:
Among items with exactly k adoptions at the end of a fixed time period t, which ones will be above the median popularity at a later time T?
Figure 4: Percent accuracy for fixed t and k using all features and non-temporal features, and for fixed k with all features and non-temporal features. k = 5, T = 28 days for all; t = 15 days for Goodreads, t = 1 day for Flickr, and t = 7 days for Last.fm. Fixing t reduces accuracy substantially compared to when t is not fixed. As expected when controlling for time, non-temporal features now provide most of the explanatory power.

To do this, for each dataset we filtered items to those that had exactly k adoptions in t days. We extracted features of these items as previously described, adding a new temporal feature for each day in t:

• adoptions_i: Number of adoptions on day i of the early adopter period (Szabo and Huberman 2010; Tsur and Rappoport 2012; Pinto, Almeida, and Gonçalves 2013).

As before, we choose k = 5 and T = 28 days. For each dataset, we set t to be the median time it took an item to reach five adoptions: t = 15 for Goodreads, t = 7 for Last.fm, and t = 1 for Flickr. We exclude Twitter due to a lack of data when we filter for both k and t. We again do 5-fold cross-validation, predicting whether each item will be above or below the final median popularity after T days.

Figure 4 shows the results. As we hoped, non-temporal features now provide most of the explanatory power in the full model. Further, comparing the all-temporal series with fixed k and t to the one with only fixed k shows that the absolute accuracy of non-temporal features increases in this formulation. This suggests that de-emphasizing temporal features in prediction might in fact improve our understanding of other features that drive popularity.

Our understanding, however, is limited: even conditioning on a single temporal feature makes for a much harder problem, with overall prediction accuracy below 65% for all datasets even when using all features. There is clearly much room for improvement.
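The filtering step and the per-day adoptions_i features can be sketched as follows. The adoption_times data is a made-up stand-in for the per-item adoption logs, and the helper name kt_features is ours.

```python
# Sketch of the k-t filter: keep items with exactly k adoptions within
# t days, and turn those adoptions into per-day adoptions_i counts.
# adoption_times (item -> adoption days) is hypothetical example data.
import numpy as np

k, t = 5, 7  # e.g. t = 7 days, the Last.fm median time to 5 adoptions

adoption_times = {
    "item_a": [0, 1, 1, 3, 6, 12, 20],  # 5 adoptions by day 7 -> kept
    "item_b": [0, 2, 9, 11, 15, 16],    # only 2 by day 7 -> dropped
    "item_c": [1, 2, 2, 4, 6, 9, 13],   # 5 by day 7 -> kept
}

def kt_features(days, k, t):
    """Per-day adoption counts (adoptions_0 .. adoptions_t) if the item
    has exactly k adoptions by day t, else None (excluded from sample)."""
    first = [d for d in days if d <= t]
    if len(first) != k:
        return None
    return np.bincount(first, minlength=t + 1)

sample = {item: feats for item in adoption_times
          if (feats := kt_features(adoption_times[item], k, t)) is not None}
print(sorted(sample))  # → ['item_a', 'item_c']
```

Because every surviving item reached k adoptions in the same window, a classifier over these features can no longer lean on raw adoption speed, which is the point of the formulation.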
Discussion and Conclusion
Using multiple problem formulations, we show that temporal features matter the most in predicting the popularity of items, given data about initial adopters and our current ability to build explanatory features of those adopters and their networks. Using datasets from a variety of social networks, we show that temporal features are not only better at predicting popularity than all other features combined, but that they readily generalize to new contexts. When we discount temporal phenomena by removing temporal features or adjusting the problem formulation, accuracy decreases substantially.

From a practical point of view, these results provide empirical support for a promising approach where only temporal features are used to predict future popularity (Szabo and Huberman 2010; Zhao et al. 2015), because the drop in accuracy from casting aside non-temporal features is generally small. Maybe creative feature engineering is not worth the effort for the Balanced Classification task. This way of looking at the problem resonates a bit with the Netflix prize, where most of the learners that were folded into the winning model were never implemented in Netflix's actual algorithm, in part because the cost of computing and managing those learners was not worth the incremental gains (Amatriain and Basilico 2012).

Although less valuable than temporal features, the non-temporal features examined so far do have some predictive power on their own. This might be useful when temporal information is unavailable: for example, for very new items (Borghol et al. 2012), or for external observers or datasets where timestamps are unavailable (Cosley et al. 2010).
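A temporal-only predictor of this kind can be as simple as the log-linear extrapolation in the spirit of Szabo and Huberman (2010): final popularity is roughly log-linear in early popularity, so fit that line and extrapolate. The data and constants below are synthetic stand-ins.

```python
# Sketch of a temporal-only predictor in the spirit of Szabo and
# Huberman (2010): fit log(final popularity) as a linear function of
# log(early popularity), then extrapolate.  Data is synthetic.
import numpy as np

rng = np.random.default_rng(4)
n = 500
early = rng.lognormal(mean=1.0, sigma=1.0, size=n)           # adoptions by day t
final = early * rng.lognormal(mean=1.5, sigma=0.3, size=n)   # adoptions by day T

# Fit log N(T) = a * log N(t) + b on the first 400 items, test on the rest.
a, b = np.polyfit(np.log(early[:400]), np.log(final[:400]), deg=1)
pred = np.exp(a * np.log(early[400:]) + b)

rel_err = np.median(np.abs(pred - final[400:]) / final[400:])
print(f"fitted slope: {a:.2f}, median relative error: {rel_err:.2f}")
```

Two fitted constants and an early count already give serviceable forecasts here, which is why casting aside non-temporal features costs so little in practice.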
Encouragingly, non-temporal features increase in accuracy a little in the k-t formulation compared to the fixed-k balanced classification problem, suggesting that making time less salient might allow other factors to become more visible and modelable.

Using k-t models could also bend time to our advantage. Comparing the overall performance and predictive features in models with smaller versus larger t might highlight item, adopter, and network characteristics that predict faster adoption (and eventual popularity). Another way to frame this intuition is that instead of predicting eventual popularity, we should try to predict initial adoption speed.

Deeper thinking about the context of sharing might also be useful. Algorithmic and interface factors, for instance, have been shown to create cumulative advantage effects; it would be interesting to look more deeply into how system features might influence adoption behaviors. Likewise, diffusion models tend to focus attention on sharers rather than receivers of information, but those receivers' preferences, goals and attention budgets likely shape their adoption behaviors (Sharma and Cosley 2015). Thus, consideration of audience-based features might be a way forward.

Most generally, we encourage research in this area to go beyond the low-hanging fruit of time. For building better theories of diffusion, maximizing accuracy with temporal information may act both as a crutch that makes the problem too easy, and as a blindfold that makes it hard to examine what drives those rapid adoptions that predict eventual popularity.

Acknowledgments
This work was supported by the National Science Foundation under grants IIS 0910664 and IIS 1422484, and by a grant from Google for computational resources.
References
Amatriain, X., and Basilico, J. 2012. Netflix recommendations: Beyond the 5 stars. http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html.

Asur, S.; Huberman, B. A.; et al. 2010. Predicting the future with social media. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, volume 1, 492–499.

Bakshy, E.; Hofman, J. M.; Mason, W. A.; and Watts, D. J. 2011. Everyone's an influencer: Quantifying influence on Twitter. In Proceedings of the fourth ACM international conference on Web search and data mining.

Bandari, R.; Asur, S.; and Huberman, B. A. 2012. The pulse of news in social media: Forecasting popularity. In Sixth International AAAI Conference on Weblogs and Social Media, 26–33.

Berger, J., and Milkman, K. L. 2012. What makes online content viral? Journal of Marketing Research.

Berger, J. 2013. Contagious: Why Things Catch On. Simon and Schuster.

Borghol, Y.; Ardon, S.; Carlsson, N.; Eager, D.; and Mahanti, A. 2012. The untold story of the clones: Content-agnostic factors that impact YouTube video popularity. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, 1186–1194.

Cha, M.; Mislove, A.; and Gummadi, K. P. 2009. A measurement-driven analysis of information propagation in the Flickr social network. In Proceedings of the 18th international conference on World Wide Web, 721–730.

Cheng, J.; Adamic, L.; Dow, P. A.; Kleinberg, J. M.; and Leskovec, J. 2014. Can cascades be predicted? In Proceedings of the 23rd international conference on World Wide Web, 925–936.

Cosley, D.; Huttenlocher, D. P.; Kleinberg, J. M.; Lan, X.; and Suri, S. 2010. Sequential influence models in social networks. In Fourth International AAAI Conference on Weblogs and Social Media.

Frank, R. H., and Cook, P. J. 1995. The Winner-Take-All Society: Why the Few at the Top Get So Much More Than the Rest of Us. Random House.

Gladwell, M. 2006a. The formula. The New Yorker.

Gladwell, M. 2006b. The Tipping Point: How Little Things Can Make a Big Difference. Little, Brown.

Goel, S.; Hofman, J. M.; Lahaie, S.; Pennock, D. M.; and Watts, D. J. 2010. Predicting consumer behavior with web search. Proceedings of the National Academy of Sciences.

Jenders, M.; Kasneci, G.; and Naumann, F. 2013. Analyzing and predicting viral tweets. In Proceedings of the 22nd international conference on World Wide Web companion, 657–664.

Kupavskii, A.; Umnov, A.; Gusev, G.; and Serdyukov, P. 2013. Predicting the audience size of a tweet. In Seventh International AAAI Conference on Weblogs and Social Media.

Lerman, K., and Galstyan, A. 2008. Analysis of social voting patterns on Digg. In Proceedings of the first workshop on Online social networks, 7–12.

Lerman, K., and Hogg, T. 2010. Using a model of social dynamics to predict popularity of news. In Proceedings of the 19th international conference on World Wide Web, 621–630.

Maity, S. K.; Gupta, A.; Goyal, P.; and Mukherjee, A. 2015. A stratified learning approach for predicting the popularity of Twitter idioms. In Ninth International AAAI Conference on Web and Social Media.

Martin, T.; Hofman, J. M.; Sharma, A.; Anderson, A.; and Watts, D. J. 2016. Limits to prediction: Predicting success in complex social systems. In Proceedings of the 25th international conference on World Wide Web.

Pachet, F., and Sony, C. 2012. Hit song science. Music Data Mining.

Pinto, H.; Almeida, J. M.; and Gonçalves, M. A. 2013. Using early view patterns to predict the popularity of YouTube videos. In Proceedings of the sixth ACM international conference on Web search and data mining.

Romero, D. M.; Tan, C.; and Ugander, J. 2013. On the interplay between social and topical structure. In Seventh International AAAI Conference on Weblogs and Social Media.

Salganik, M. J.; Dodds, P. S.; and Watts, D. J. 2006. Experimental study of inequality and unpredictability in an artificial cultural market. Science.

Sharma, A., and Cosley, D. 2015. Studying and modeling the connection between people's preferences and content sharing. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, 1246–1257.

Sharma, A., and Cosley, D. 2016. Distinguishing between personal preferences and social influence in online activity feeds. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, 1091–1103.

Simonoff, J. S., and Sparrow, I. R. 2000. Predicting movie grosses: Winners and losers, blockbusters and sleepers. Chance.

Szabo, G., and Huberman, B. A. 2010. Predicting the popularity of online content. Communications of the ACM.

Tsur, O., and Rappoport, A. 2012. What's in a hashtag? Content based prediction of the spread of ideas in microblogging communities. In Proceedings of the fifth ACM international conference on Web search and data mining, 643–652.

Watts, D. J. 2011. Everything Is Obvious: *Once You Know the Answer. Crown Business.

Weng, L.; Menczer, F.; and Ahn, Y.-Y. 2013. Virality prediction and community structure in social networks. Scientific Reports.

Yu, L.; Cui, P.; Wang, F.; Song, C.; and Yang, S. 2015. From micro to macro: Uncovering and predicting information cascading process with behavioral dynamics. In IEEE International Conference on Data Mining.

Zhao, Q.; Erdogdu, M. A.; He, H. Y.; Rajaraman, A.; and Leskovec, J. 2015. SEISMIC: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21st ACM SIGKDD international conference on Knowledge discovery and data mining.