Predictability of Popularity: Gaps between Prediction and Understanding
Benjamin Shulman
Dept. of Computer Science, Cornell University
[email protected]
Amit Sharma
Microsoft Research, New York, NY
[email protected]
Dan Cosley
Information Science, Cornell University
[email protected]
Abstract
Can we predict the future popularity of a song, movie or tweet? Recent work suggests that although it may be hard to predict an item’s popularity when it is first introduced, peeking into its early adopters and properties of their social network makes the problem easier. We test the robustness of such claims by using data from social networks spanning music, books, photos, and URLs. We find a stronger result: not only do predictive models with peeking achieve high accuracy on all datasets, they also generalize well, so much so that models trained on any one dataset perform with comparable accuracy on items from other datasets.

Though practically useful, our models (and those in other work) are intellectually unsatisfying because common formulations of the problem, which involve peeking at the first small-k adopters and predicting whether items end up in the top half of popular items, are both too sensitive to the speed of early adoption and too easy. Most of the predictive power comes from looking at how quickly items reach their first few adopters, while for other features of early adopters and their networks, even the direction of correlation with popularity is not consistent across domains. Problem formulations that examine items that reach k adopters in about the same amount of time reduce the importance of temporal features, but also overall accuracy, highlighting that we understand little about why items become popular while providing a context in which we might build that understanding.

Introduction

How does a book, song, or a movie become popular? The question of how cultural artifacts spread through social networks has captured the imagination of scholars for decades. Many factors are cited as important for an item to spread virally through social networks and become popular: its intrinsic quality (Gladwell 2006a; Simonoff and Sparrow 2000), the characteristics of its initial adopters (Gladwell 2006b), the emotional response it elicits (Berger and Milkman 2012), and so on.
Often, explanations are used to justify the popularity of different items after the fact (Berger 2013), making it hard to apply these explanations to new events (Watts 2011).

Online social networks allow us to observe individual-level traces of how items are transferred between people, allowing more precise modeling of the phenomenon. Predicting the future popularity of an item based on attributes
of the item and the person who introduced it has emerged as a useful problem, both to understand processes of information diffusion and to inform content creation and feed design on social media platforms. For example, Twitter’s managers may want to highlight new tweets that are more likely to become popular, while its users may want to learn from characteristics of popular tweets to improve their own.

In general, even with detailed information about an item’s content or the person sharing it, it is hard to predict which items will become more popular than others (Bakshy et al. 2011; Martin et al. 2016). The problem becomes more tractable when we are allowed to peek into the initial spread of an item. The intuition is that early activity data about the speed of adoption, characteristics of people who adopt it and the connections between them might predict the item’s fate. This intuition shows encouraging results for both predicting the final popularity of an item (Szabo and Huberman 2010; Pinto, Almeida, and Gonçalves 2013; Zhao et al. 2015) and whether an item will end up in the top 50% of popular items (Cheng et al. 2014; Romero, Tan, and Ugander 2013; Weng, Menczer, and Ahn 2013).

Buoyed by these successes, one might conclude that the availability of rich features about the item and social network of early adopters has helped us understand why items become popular. However, past work studies individual datasets and varying versions of the prediction problem, making it hard to compare results. For instance, studies disagree on the direction of the effect of network structural features on item popularity (Lerman and Hogg 2010; Romero, Tan, and Ugander 2013).

In this paper, we try to unify these observations on popularity prediction through studying different problem formulations and kinds of features over a wide range of online social networks.
Using an existing formulation that predicts whether the final popularity of items is above the median based on features of the first five adopters (Cheng et al. 2014), we confirm past work (Szabo and Huberman 2010) showing that features about those adopters and their social network are at best weak predictors of popularity compared to temporal features. For instance, a single temporal heuristic—the average rate of early adoption—is a better predictor than all non-temporal features combined across all four websites. Further, models trained on one dataset and tested on others using temporal features generalize fairly well, while those that use network structural features generalize badly.

In one reading, this is a useful contribution: peeking-based popularity models that include temporal information achieve up to 83% accuracy on Twitter and generalize well across datasets. From a practical standpoint, we encourage content distributors to use temporal features for predicting the future success of items.

Intellectually, however, our finding is not very satisfying. Rather than identifying features that shed light on why items become popular, we mostly see that items that become popular fast are more likely to achieve higher popularity in the end. Rapid adoption may be a signal of quality, interestingness, and eventual popularity—but doesn’t tell us why. The effect might also be driven by cumulative advantage (Frank and Cook 2010; Watts 2011): items that receive attention early have more chances to spread via interfaces that highlight popular or trending items.

An alternative formulation of the problem that reduces the effect of temporal features lets us see just what early adopter and network features tell us.
This formulation, called Temporal Matching, compares items that achieve similar levels of popularity in the same amount of time, rather than the more common formulation of looking at the first k adopters regardless of the time it takes to reach k. Controlling for the average rate of an item’s adoption turns popularity prediction into a hard problem. Using the same features as before, prediction accuracy across all datasets drops below 65%. Such a decrease in accuracy underscores the importance of choosing problem formulations that highlight relevant phenomena in popularity evolution. Current models may fare well on certain formulations, but there is still much to learn about how items become popular.

Formulations of the prediction problem
We start by identifying two key dimensions to consider when defining the popularity prediction task: how much peeking into early activity on an item is allowed, and whether the task is a regression or classification. For ease of exposition, we use item to denote entities that are consumed in online social networks.
Adoption refers to an explicit action or endorsement of an item, such as loving a song, favoriting a photo, rating a book highly or retweeting a URL. Finally, we define popularity of an item as the number of people who have adopted it.
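These definitions can be made concrete: popularity is simply the size of an item’s set of distinct adopters. A minimal sketch (the adoption-log format here is our own hypothetical stand-in for the per-site logs described later, not the authors’ code):

```python
from collections import defaultdict

# Each record is (user, item, timestamp); a hypothetical stand-in
# for the per-site adoption logs (loves, favorites, ratings, retweets).
adoptions = [
    ("alice", "song_1", 10),
    ("bob", "song_1", 12),
    ("alice", "song_2", 15),
    ("bob", "song_1", 20),  # repeat adoptions by the same user count once
]

def popularity(log):
    """Popularity of an item = number of distinct people who adopted it."""
    adopters = defaultdict(set)
    for user, item, _ in log:
        adopters[item].add(user)
    return {item: len(users) for item, users in adopters.items()}

print(popularity(adoptions))  # {'song_1': 2, 'song_2': 1}
```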
Predicting a priori versus peeking into early activity
Predicting popularity a priori for items such as movies (Simonoff and Sparrow 2000) or songs (Pachet and Sony 2012) has long been considered a hard problem. One of the most successful approaches has been to gauge audiences’ interest in an item before it is officially released, such as by measuring the volume of tweets (Asur, Huberman, and others 2010) or search queries (Goel et al. 2010). Such methods can work well for mainstream, popular items for which there might be measurable prior buzz, but are unlikely to be useful for genuinely new items such as tweets or photos uploaded by users.

For such items, popularity prediction is tricky, even when precise data about the content of each tweet and the seed user’s social network is known. On Twitter, models with extensive content features such as the type of content, its source and topic, crowdsourced scores of interestingness, and features about the seed user such as indegree and past popularity of tweets are only able to explain less than half of the variance in popularity (Martin et al. 2016). Further, the content features are usually less important than features of the seed user (Bakshy et al. 2011; Martin et al. 2016; Jenders, Kasneci, and Naumann 2013).

In response, scholars have suggested modified versions of the problem where one peeks into early adoption activity for an item. In studies on networks including Facebook (Cheng et al. 2014), Twitter (Lerman and Hogg 2010; Zhao et al. 2015; Tsur and Rappoport 2012; Kupavskii et al. 2013), Weibo (Yu et al. 2015), Digg (Lerman and Hogg 2010; Szabo and Huberman 2010) and Youtube (Pinto, Almeida, and Gonçalves 2013), early activity data consistently predicts future popularity with reasonable accuracy. In light of these results, we focus on the peeking variant of the problem in this paper.
Classification versus regression
In addition to how much data we look at, we must also specify what to predict. A number of studies have used regression formulations, predicting an item’s exact final popularity: the number of retweets for a URL (Bakshy et al. 2011), votes on a Digg post (Lerman and Hogg 2010) or page views of a Youtube video (Szabo and Huberman 2010). However, we may often be more interested in popularity relative to other items rather than an exact estimate. For example, both marketers and platform owners may want to select ‘up and coming’ items to feature in the interface versus others.

(Such featuring makes some items more salient than others and surely affects the final popularity of both featured and non-featured items; typically, formulations of the problem look at very small slices of early activity, which presumably minimizes these effects.)

These motivations lead nicely to a classification problem where the goal is to predict whether an item will be more popular than a certain percentage of other items. For instance, Romero et al. predict whether the number of adopters of a hashtag on Twitter will double, given a set of hashtags with the same number of initial adopters (Romero, Tan, and Ugander 2013). Cheng et al. generalize this formulation to show that predicting whether an item will double its popularity is equivalent to classifying whether an item becomes more popular than the median and study this question in the case of Facebook photos that received at least five adopters (Cheng et al. 2014). Besides the practical appeal of classifying popular items, classification is also a simpler task than predicting the actual number of adoptions (Bandari, Asur, and Huberman 2012), thus providing a favorable scenario for evaluating the limits of predictability of popularity. Therefore, we focus on the classification problem in this paper.

Study | Problem Formulation | Content | Structural | Early Adopters | Temporal
Bakshy et al. (2011) | Regression (no peeking) | n | – | Y | –
Martin et al. (2016) | Regression (no peeking) | n | – | Y | –
Szabo et al. (2010) | Regression | – | n | – | Y
Tsur et al. (2012) | Regression | Y | Y | – | Y
Pinto et al. (2013) | Regression | – | – | – | Y
Yu et al. (2015) | Regression | – | n | – | Y
Romero et al. (2013) | Classification (k = {…}, n = 50%) | – | Y | – | –
Cheng et al. (2014) | Classification (k = 5, n = 50%) | n | Y | Y | Y
Lerman et al. (2008) | Classification (k = 10, n = 80%) | – | Y | Y | –
Weng et al. (2013) | Classification (k = 50, n = {…}) | – | Y | n | –

Table 1: A taxonomy of problem formulations for popularity prediction, along with importance of feature categories. Y means that the features in the category were useful for prediction, n means they were tried but not as useful, and – means they were not studied. Most studies report temporal and structural features as important predictors.

Our problem: Peeking-based classification
Based on the above discussion, the general peeking-based classification problem can be stated as:
P1:
Given a set of items and data about their early adoptions, which among them are more likely to become popular?
This question has a broad range of formulations based on how we define the early activity period, how much activity we are allowed to peek at, and how we define popular. The early activity period may be defined in terms of time elapsed t since an item’s introduction (Szabo and Huberman 2010), or in terms of a fixed number k of early adoptions (Romero, Tan, and Ugander 2013). Fixing the early activity period in terms of number of adoptions has the useful side effect of filtering out items with fewer than k adoptions overall, both making the problem harder and eliminating unpopular (thus often uninteresting) items. For this reason, most past work on peeking-based classification defines early activity in terms of the number of adoptions k.

The popularity threshold for what is “popular” may also be set at different percentiles (n%). Table 1 summarizes past work based on their choices of problem formulation and choice of (k, n). One common approach is to collect all items that have k or more adoptions, then peek into the first k adoptions and predict whether eventual popularity of items lies above or below the median (Cheng et al. 2014). We call this Balanced Classification since there are guaranteed to be an equal number of high and low popularity items. Another variation is to only consider the top-n percentile of items as high popularity (Lerman and Galstyan 2008), a formulation that is arguably better aligned with most use cases around content promotion than Balanced Classification. However, it is also harder than Balanced Classification; for this reason, and to continue to align with prior work, we focus on Balanced Classification.

While restricting to items with k adoptions helps to level the playing field because it provides a set of comparably popular items to study, it ignores the time taken to reach k adoptions. Based on prior work, our suspicion is that in this formulation temporal features dominate the others.
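The Balanced Classification setup can be sketched as follows (our own toy implementation, not the authors’ code), given a map from items to their adoption timestamps:

```python
import statistics

def balanced_labels(adoption_times, k=5):
    """Balanced Classification: keep items with at least k adoptions,
    peek only at the first k adoption times, and label an item 1 if its
    final popularity is above the median of the retained items."""
    items = {i: ts for i, ts in adoption_times.items() if len(ts) >= k}
    med = statistics.median(len(ts) for ts in items.values())
    labels = {i: int(len(ts) > med) for i, ts in items.items()}
    peeks = {i: sorted(ts)[:k] for i, ts in items.items()}  # visible data
    return peeks, labels

# Toy adoption logs: item -> list of adoption timestamps.
log = {"a": [0, 1, 2, 3, 4],
       "b": list(range(10)),
       "c": list(range(7)),
       "d": [1, 2]}            # fewer than k adoptions: filtered out
peeks, labels = balanced_labels(log)
print(labels)  # {'a': 0, 'b': 1, 'c': 0}
```

Only `peeks` may be used to build features; `labels` is the prediction target.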
To control for this temporal signal, we later introduce a problem formulation where both k and t are fixed. That is, we collect all items that received exactly k adoptions in a given time period t, and then predict which of them would be in the top half of popular items. We call this the Temporally Matched Balanced Classification problem, and as we will see, changing the definition has a profound impact on the quality of the models.

Choosing features
We now turn to the selection of features for prediction. Part of the allure of modeling is that the features that prove important might give information about why some items become popular in ways that could be both practically and scientifically interesting. Features used in prior work can be broadly grouped into four main categories: content, structural, early adopters and temporal (Cheng et al. 2014). Table 1 shows which feature categories were used in prior studies, with Y marking features that were reported to be useful for prediction. While all feature categories have been reported to be important contributors to prediction accuracy in at least some studies, temporal and structural features are frequently reported as important.

Temporal patterns of early adoption—how quickly the early adopters act—are a major predictor of popularity. Szabo and Huberman show that temporal features alone can predict future popularity reliably (Szabo and Huberman 2010). When information about the social network or its users is hard to obtain, utilizing temporal features can be fruitful, achieving error rates as low as 15% in a regression formulation (Pinto, Almeida, and Gonçalves 2013; Zhao et al. 2015). A natural next question is to ask how much these errors can be decreased by adding other features when we do have such information.

Features about the seed user and early resharers—collectively called early adopters—also matter. On Twitter, for example, the number of followers of the seed user and the fraction of her past tweets that received retweets increase the accuracy of predictions (Tsur and Rappoport 2012). Information about other early adopters is also useful for predicting photo cascades in Facebook (Cheng et al. 2014).

The structure of the underlying social network also has predictive power (Lerman and Galstyan 2008; Romero, Tan, and Ugander 2013; Cheng et al. 2014). However, these studies do not agree on the direction of effect of these features. For instance, on Digg, low network density is connected with high popularity (Lerman and Galstyan 2008), but on Twitter, both very low and very high densities are positively correlated with popularity (Romero, Tan, and Ugander 2013). Their intuition is that a lower network density indicates that the item is capable of appealing to a general audience, while a higher network density indicates a tight-knit community supporting the item, both of which can be powerful drivers for an item’s popularity.

Finally, while Tsur et al. report content features to be useful (Tsur and Rappoport 2012), most studies find content features to have little predictive power (Table 1). Even for domains such as songs or movies where item information is readily available, content features are not significantly associated with item popularity (Pachet and Sony 2012). Further, content features do not generalize well; it is hard to compute generalizable content features across different item domains. For these reasons, we do not consider content features in this work.
Features
Based on the above discussion, we use the following categories of features, with the aim of reproducing and extending the features used in past work (Cheng et al. 2014): temporal, structural, and early adopters. To these we add a set of novel features based on preference similarity between early adopters.
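The preference-similarity features, described in detail below, are based on the Jaccard index of two users’ prior adoption sets; a minimal sketch with made-up adoption histories:

```python
def jaccard(a, b):
    """Jaccard index of two adoption sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Hypothetical prior adoptions of two early adopters.
u1 = {"item1", "item2", "item3"}
u2 = {"item2", "item3", "item4"}
print(jaccard(u1, u2))  # 2 shared out of 4 total -> 0.5
```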
Temporal.
These features have to do with the speed of adoptions during the early adoption period between the first and kth adoption. This leads to a set of features that focus on the rate of adoption:

• time_i: time between the initial adoption and the i-th adoption (2 ≤ i ≤ k). (Zhao et al. 2015; Maity et al. 2015; Weng, Menczer, and Ahn 2013)
• time_{1...k/2}: Mean time between adoptions for the first half (rounded down) of the adoptions.
• time_{k/2...k}: Mean time between adoptions for the last half (rounded up) of the adoptions.

Structural.
These features have to do with the structure of the network around early adopters and can be broken down into two sub-categories: ego network features that relate the early adopters to their local networks, and subgraph features that consider only connections between the early adopters.
Early adopters’ ego network features

• in_i: Indegree of the i-th early adopter (1 ≤ i ≤ k). This is a proxy for the number of people who may be exposed to an early adopter’s activity. For undirected networks, this will simply be the degree, or the number of friends of an early adopter. (Bakshy et al. 2011; Zhao et al. 2015)
• reach: Number of nodes reachable in one step from the early adopters.
• connections: Number of edges from early adopters to the entire graph. (Romero, Tan, and Ugander 2013)

Early adopters’ subgraph features

• indegree_sub: Mean indegree (friends or followers) for each node in the subgraph of early adopters. (Lerman and Galstyan 2008)
• density_sub: Number of edges in the subgraph of early adopters. (Romero, Tan, and Ugander 2013)
• cc_sub: Number of connected components in the subgraph of early adopters. (Romero, Tan, and Ugander 2013)
• dist_sub: Mean distance between connected nodes in the subgraph of early adopters. This is meant to measure how far the item has spread in the initial early adopters, similar to the cascade depth feature by Cheng et al.
• sub_in_i: Indegree of the i-th adopter on the subgraph (1 ≤ i ≤ k). (Lerman and Galstyan 2008)

Features of early adopters.
These features capture information about early adopters, such as their popularity, seniority, or activity level, which might be proxies for their influence. They can be divided into two sub-categories: features of the first user to adopt an item (root), and features averaged over other early adopters (resharers).
Root features

• activity_root: Number of adoptions in the four weeks before the end of the early adoption period. This is similar to a measure used by Cheng et al. which measured the number of days a user was active. (Cheng et al. 2014; Petrovic, Osborne, and Lavrenko 2011; Yang and Counts 2010)
• age_root: Length of time the user has been registered on the social network.
• popularity_root: Number of friends or followers on the social network. (Lerman and Galstyan 2008; Tsur and Rappoport 2012)

Resharer features

• activity_resharer: Mean number of adoptions in the four weeks before the end of the early adoption period.
• age_resharer: Mean length of time the users have been registered on the social network.
• popularity_resharer: Mean number of friends or followers on the social network. (Tsur and Rappoport 2012)

Similarity.
To these previously tested features, we add features related to preference similarity between the early adopters. As with network density, our intuition is that similarity between early adopters may matter in two ways: high similarity may signify a niche item, or one that people with similar interests are likely to adopt, while low similarity might indicate an item that could appeal to a wide variety of people.

Similarity was computed using the Jaccard index of two users’ adoptions that occurred before the end of the early adoption period of the item in question. We computed the median, mean and maximum of similarity between adopters because these give us an idea of the distribution of the affinity of the early adopters; we do not include users who had fewer than five adoptions before the item in question because they are likely to have little overlap. The features we extracted are:

• sim_count: Number of similarities that could be computed between early adopters.
• sim_mean: Mean similarity between early adopters.
• sim_med: Median similarity between early adopters.
• sim_max: Maximum similarity between early adopters.

Dataset | Last.fm | Twitter | Flickr | Goodreads
Number of users | 437k | 737k | 183k | 252k
Number of items | 5.8M | 64k | 10.9M | 1.3M
Number of adoptions | 44M | 2.7M | 33M | 28M
Mean adoptions | 7.6 | 41.8 | 3.0 | 21.4
Median adoptions | 1 | 1 | 1 | 1
Maximum adoptions | 11062 | 82507 | 2762 | 88027

Table 2: Descriptive statistics for users, items, and adoptions in each dataset. We use adoption to mean loving a song on Last.fm, tweeting a URL on Twitter, favoriting a photo on Flickr, and rating a book on Goodreads. The average number of adoptions per item varies quite a bit, but the median popularity of 1 is consistent across datasets.

Data and Method
Datasets from four online social networks
We build models using data from four different online social platforms: Last.fm, Flickr, Goodreads and Twitter. These platforms span a broad range of online activity, including songs, photos, books and URLs; they also have a variety of user interfaces, use cases, and user populations. These variations reduce the risk of overfitting to properties of a particular social network.

• Last.fm: A music-focused social network where users can friend one another and love songs. We consider a dataset of 437k users and the songs they loved from their start date until February 2014 (Sharma and Cosley 2016).
• Flickr: A photo sharing website where users can friend one another and favorite photos. We use data collected over 104 days in 2006 and 2007 (Cha, Mislove, and Gummadi 2009).
• Goodreads: A book rating website where users can friend one another and rate books. The dataset consists of 252k users and their ratings before August 2010. Unlike the other sites, Goodreads users rate books; we consider any rating at or above 4 (out of 5) as an endorsement (adoption) of the book (Huang et al. 2012).
• Twitter:
A social networking site where users can form directed edges with one another and broadcast tweets, messages no longer than 140 characters (as of 2010). The Twitter dataset consists of URLs tweeted by 737k users for three weeks of 2010 (Hodas and Lerman 2014).

All of these websites have an active social network, providing an activity feed that allows users to explore, like, and reshare the items that their friends shared. The Last.fm feed shows songs that friends have listened to or loved, Flickr shows photos that friends have favorited, Goodreads shows books that friends have rated, and Twitter shows tweets with URLs that followees have favorited or retweeted. Thus, like past studies on online social networks such as Facebook, Twitter and Digg, we expect active peer influence processes that should make structural and early adopter features relevant.

Figure 1: Cumulative percentage of adoptions by items for each dataset. Items on the x-axis are sorted by their popularity; the lines show a step pattern because multiple items may have the same number of adoptions. We observe a substantial skew in popularity. For example, the most popular 20% of items account for 60% of adoptions in Flickr and more than 90% of adoptions in other datasets.

Table 2 shows descriptive statistics about the datasets, all of which have more than 150k users and millions of items (with the exception of Twitter with 64k URLs). Twitter has the highest mean adoptions per item (41.8), followed by Goodreads (21.4). The maximum number of adoptions for an item also varies, from more than 80k in Twitter and Goodreads to 2.7k in Flickr. The median number of adoptions is consistent, however: at least half of the items have only 1 adoption. The skew in popularity distribution is better shown in Figure 1. The 20% of the most popular items account for over 60% of adoptions in Flickr and over 90% of the adoptions in the other three websites. On Twitter, the skew is extreme: over 81% of adoptions are on 4% of items.

Figure 2: Boxplot showing the number of adoptions after 28 days (10 for Twitter) for items which have at least 5 adoptions. The bold partial line is the mean number of adoptions. Across datasets, most items receive fewer than 20 adoptions.
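The skew shown in Figure 1 corresponds to a simple top-share statistic; a sketch on a made-up, heavy-tailed distribution of adoption counts (the numbers are illustrative, not from the datasets):

```python
def top_share(adoption_counts, top_frac=0.2):
    """Fraction of all adoptions captured by the top `top_frac` of items."""
    counts = sorted(adoption_counts, reverse=True)
    n_top = max(1, int(len(counts) * top_frac))
    return sum(counts[:n_top]) / sum(counts)

# Hypothetical heavy-tailed popularity distribution over 10 items.
counts = [1000, 500, 100, 50, 10, 5, 2, 1, 1, 1]
print(round(top_share(counts), 2))  # top 20% of items hold ~0.9 of adoptions
```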
Classification methodology
We first operationalize the Balanced Classification formulation on these datasets. As a reminder, k is the number of early adoptions that we peek at for each item, and we predict which of these items will end up more popular than the median item.

We measure the final popularity at a time T days after the first adoption of the item. To be consistent with prior work, we follow Cheng et al. and set k = 5 and T = 28 days for Last.fm, Flickr and Goodreads. Because the Twitter dataset is only three weeks long, we use a smaller T = 10 days. To avoid right-censoring, we include only items that had their first adoption at least T days before the last recorded timestamp in each dataset. The parameter k also acts as a filter, allowing only items with at least k adoptions. Figure 2 shows properties of the data thus constructed.

We classify items based on their popularity after T days, labeling those above the median as 1 and others as 0. For each item, we extract features from the early adoption period, the time between the first and kth adoption. We use 5-fold cross validation to select the items that we train on, then use the trained model to predict final popularity of items in the test set. Since we use median popularity as the classification threshold, the test data has a roughly equal number of items in each class, allowing us to use accuracy as a reasonable evaluation metric. We tried several classification models using Weka (Hall et al. 2009), including logistic regression, random forests and support vector machines. Logistic regression models generally performed best, so we report results for those models unless otherwise specified.

Balanced classification
We start by comparing the predictive power of models using different sets of features across the four datasets on the Balanced Classification problem.
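The paper runs this evaluation in Weka; an equivalent sketch of the loop in Python with scikit-learn, on synthetic data standing in for the real feature tables (so the resulting numbers mean nothing beyond illustrating the protocol):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Synthetic stand-in: one informative "temporal" feature (time to reach
# k adoptions; shorter -> more popular) plus irrelevant noise features.
n = 1000
time_to_k = rng.exponential(scale=2.0, size=n)
noise = rng.normal(size=(n, 3))
X = np.column_stack([time_to_k, noise])
y = (time_to_k < np.median(time_to_k)).astype(int)  # balanced labels

# 5-fold cross-validated accuracy with logistic regression,
# mirroring the evaluation described in the text.
scores = cross_val_score(LogisticRegression(), X, y, cv=5,
                         scoring="accuracy")
print(scores.mean())
```

Because the labels are balanced by construction, accuracy is directly comparable to the 50% random baseline, as in the paper.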
Figure 3: Accuracy for prediction models incorporating different categories of features. The y-axis starts at 50%, the baseline for a random classifier on the balanced formulation. On all datasets, temporal features are the most predictive, almost as accurate as using all available features.
Temporal features dominate
Figure 3 shows the prediction accuracy of the models. Similar to prior work on Facebook that used peeking (Cheng et al. 2014), when using all features we are able to predict whether an item will be above the median popularity around three-fourths of the time: 73% for Goodreads, 75% for Flickr, 81% for Last.fm and 83% for Twitter.

Training models with individual feature categories shows that temporal features are by far the most important. Across all four datasets, a model using only temporal features performs almost as well as the full model. The next best feature category, resharer features, is able to predict 71% on Twitter and less than 60% on the other three datasets. Even a model that uses all non-temporal features, denoted by the “all-temporal” line in Figure 3, is not very good. For Goodreads and Flickr, this model is not much better than a random classifier. For Last.fm and Twitter, accuracy for non-temporal features improves somewhat, but is still at least 10% worse than when including temporal features.

Even a single temporal feature can be more predictive than models constructed from all non-temporal features. Consider the feature time_x, which is the number of days for an item to receive x adoptions. At x = 5 = k, the feature time_5—the time taken for an item to receive 5 adoptions—is the most predictive temporal feature for all datasets. A model based on this single feature achieves more than 70% accuracy on all datasets and accounts for nearly 97% of the accuracy of the full model for each dataset. While past work has highlighted the importance of temporal features as a whole (Szabo and Huberman 2010; Cheng et al. 2014), it is interesting to find that we may not even need multiple temporal features: a single measure is able to predict the final popularity class label for items in all datasets.

Table 3: Prediction accuracy for models trained on one dataset (columns) and tested on each dataset (rows). The diagonals report accuracy on the same dataset, while other cells report accuracy when the model is trained on one dataset and tested on another. The power of temporal features generalizes across domains: testing a model on any dataset, trained on any other dataset, loses no more than 5% accuracy compared to testing a model on the same dataset. For non-temporal features, prediction accuracy decreases substantially when applying models to other datasets.
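The single-feature heuristic discussed above (time_5, the time an item takes to reach k = 5 adoptions) reduces to a simple threshold rule; a sketch on invented adoption times (our own toy illustration, not the authors’ code):

```python
import statistics

def time_to_k(timestamps, k=5):
    """Days from the first to the k-th adoption (timestamps sorted)."""
    return timestamps[k - 1] - timestamps[0]

# Hypothetical items: (first five adoption times in days, final popularity).
items = {
    "fast":   ([0, 1, 1, 2, 2], 400),
    "medium": ([0, 3, 5, 8, 9], 40),
    "slow":   ([0, 10, 15, 22, 27], 6),
}
speeds = {name: time_to_k(ts) for name, (ts, _) in items.items()}
threshold = statistics.median(speeds.values())

# Predict "above-median popularity" for items that reached k adoptions
# strictly faster than the typical item.
pred = {name: int(s < threshold) for name, s in speeds.items()}
print(pred)  # {'fast': 1, 'medium': 0, 'slow': 0}
```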
Cross-domain prediction
The analysis in the previous section confirms past findings about the importance of temporal features across a range of websites. We now extend these results to show that temporal features are not only powerful, they are also general: models learned on one item domain using temporal features are readily transferable to others. In contrast, non-temporal features do not generalize well: even the direction of their effect is not consistent across domains. To show this, we train prediction models separately for each dataset, as before, then apply each model to every dataset.
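The train-on-one, test-on-another protocol can be sketched as follows (scikit-learn on synthetic single-feature data; the dataset names and accuracies are artifacts of the toy setup, not the paper’s results):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)

def make_dataset(scale, n=500):
    """Synthetic stand-in for one site: the label is 'reached k adoptions
    faster than the site's median item', with a site-specific time scale."""
    t = rng.exponential(scale=scale, size=(n, 1))
    y = (t[:, 0] < np.median(t)).astype(int)
    return t, y

datasets = {"siteA": make_dataset(1.0), "siteB": make_dataset(1.2)}

# Train on each dataset, then test the fitted model on every dataset.
results = {}
for train_name, (Xtr, ytr) in datasets.items():
    model = LogisticRegression().fit(Xtr, ytr)
    for test_name, (Xte, yte) in datasets.items():
        results[(train_name, test_name)] = model.score(Xte, yte)

for (tr, te), acc in results.items():
    print(f"train={tr} test={te} acc={acc:.2f}")
```

The diagonal entries of `results` correspond to within-site accuracy and the off-diagonal entries to cross-domain accuracy, matching how Table 3 is laid out.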
Temporal features generalize
Table 3 shows the accuracy of models trained only on temporal features from one dataset and tested on all four. Reading across the rows shows that regardless of which social network a model was trained on, its accuracy on test data from another network remains within 5% of the accuracy on test data from the same network.

Such consistent prediction accuracy is impressive, especially because the median time to reach 5 adoptions varies, ranging from 1 day in Flickr to 15 days for Goodreads. This suggests that there are general temporal patterns that are associated with future popularity, at least across these particular networks.
Other features have inconsistent effects
The story is less rosy for non-temporal features. Table 3 shows the cross-domain prediction accuracy for models trained on all non-temporal features (in light of their low accuracy when taken individually, we combine all non-temporal features). Accuracies on the same dataset correspond to the "all-temporal" line in Figure 3; they are generally low and drop further when tested on a different dataset. In particular, models trained on other websites do poorly when tested on Twitter, with the Last.fm and Flickr models performing worse than a random guesser on Twitter data. Meanwhile, a model trained on Twitter is almost 10 percentage points worse than the Last.fm-trained model for predicting popularity on Last.fm.

Not only does prediction accuracy drop across websites, but fitting single-feature logistic regression models for each feature shows that for 12 of the 25 features, the coefficient flips between positive and negative across models fit on different datasets. Similar to the contrasting results found in prior work (Lerman and Hogg 2010; Romero, Tan, and Ugander 2013), we find that all measures of subgraph structure of the early adopters, namely indegree_sub, density_sub, cc_sub, dist_sub, and the sub_in_i features (with two exceptions among the sub_in_i), can predict either higher or lower popularity depending on the dataset. For example, a higher density_sub (the number of edges in the subgraph of early adopters) is associated with higher popularity on Flickr (β coefficient = 0.04), whereas on Last.fm, a higher density is associated with lower popularity (β coefficient = -0.09). Features from the root, resharer and similarity categories show a similarly dichotomous association with final item popularity.

Gaps between prediction and understanding
These results show that not only are non-temporal features weak predictors, but the direction of their effect on popularity is inconsistent across domains. Combining this with our observation that a single temporal heuristic is almost as good a predictor as the full model raises questions about what it is that popularity prediction models are predicting and how they contribute to our understanding of popularity.
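The sign-flip check itself is simple to state: fit a one-feature logistic regression per dataset and compare coefficient signs. A sketch with synthetic data (density_sub stands in for the early-adopter subgraph's edge count; the effect sizes are invented for illustration):

```python
# Sketch of the coefficient sign-flip check: the same feature can carry
# a positive effect on one site and a negative one on another.  Data is
# synthetic; density_sub stands in for the early-adopter subgraph's
# edge count.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)

def fitted_sign(true_effect, n=1000):
    """Simulate one dataset where density_sub has the given true effect
    on popularity, then return the sign of the fitted coefficient."""
    density_sub = rng.poisson(lam=4, size=n).astype(float)
    logits = true_effect * (density_sub - density_sub.mean())
    y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)
    beta = LogisticRegression().fit(density_sub.reshape(-1, 1), y).coef_[0, 0]
    return int(np.sign(beta))

# A "Flickr-like" positive effect and a "Last.fm-like" negative one.
print(fitted_sign(+0.4), fitted_sign(-0.4))  # → 1 -1
```

A feature whose fitted sign depends on the dataset cannot support a general causal story, which is the point the paragraph above makes.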
Temporal features drive predictability
While our work may seem contrary to recent work claiming that early adopters and properties of their social network matter for prediction, many of those findings are consistent with our own. Most prior work that uses peeking finds that temporal features are a key predictor (Tsur and Rappoport 2012; Szabo and Huberman 2010; Pinto, Almeida, and Gonçalves 2013; Yu et al. 2015). Further, even though Cheng et al. conclude that temporal and structural features are major predictors of cascade size, they report that for predicting photos' popularity on Facebook, accuracy for temporal features alone (78%) is nearly as good as the full model (79.5%) (Cheng et al. 2014).

By holding modeling, feature selection and problem formulation consistent, we contribute to this literature by demonstrating the magnitude and generality of the predictive power of temporal features across a range of social networks. Having multiple networks also lets us show that, unlike temporal features, non-temporal features do not generalize well to new contexts. These features might be useful for understanding the particulars of a given website, but it seems likely that they are capturing idiosyncrasies of that site rather than telling us something general about how items become popular in social networks.

Is cumulative advantage the whole story?
If non-temporal features are weakly predictive and not generalizable, and all that matters is the rate of initial adoption, then how do predictive exercises with peeking advance scientific understanding of what drives popularity? In other words, what does it mean to claim that popularity is predictable once we know about initial adopters?

One answer is that early, rapid adoption is a signal of intrinsic features of an item that help to determine its popularity. Items with better content command a higher initial popularity, and thus the predictive power of early temporal features is simply a reflection of content quality or interestingness to the social network in question. Given increasing evidence from multiple domains that content features are at best weakly connected to an item's popularity (Salganik, Dodds, and Watts 2006; Pachet and Sony 2012; Martin et al. 2016), this seems unlikely to be the whole explanation.

Another explanation is that items that get attention early are more likely to be featured in the interface, via feeds, recommendations or ads; they might also be spread through external channels, which could drive up the rate of early adoption. These would be interesting questions to explore. Still, whatever the driving reasons, these models are telling us that once items achieve initial popularity, they are much more likely to become more popular in the future. This is simply a restatement of cumulative advantage, or the rich-get-richer phenomenon (Borghol et al. 2012).

Overall, though, we find that neither our results nor other work say much about why or how items become popular, except that items that share temporal patterns of popular items early on tend to be the ones that are more popular in the future, and that making popularity salient and ordering items by popularity can increase this effect (Salganik, Dodds, and Watts 2006).
While such predictions are practically useful for promoting content, they are not so useful for informing the creation of new content or assessing its value, nor for understanding the mechanisms by which items become popular.
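Cumulative advantage itself is easy to demonstrate in isolation. In this toy Pólya-urn-style simulation (our own illustration, not taken from the cited work), all items are identical in quality, yet popularity ends up skewed and early counts strongly predict final counts:

```python
# Toy rich-get-richer simulation: each adoption goes to an item with
# probability proportional to its current count plus a smoothing
# constant.  Items have no intrinsic quality differences, yet final
# popularity is highly unequal and early counts predict final counts.
import numpy as np

rng = np.random.default_rng(3)
n_items, n_adoptions, alpha = 200, 20000, 1.0

counts = np.zeros(n_items)
early = None
for step in range(n_adoptions):
    probs = (counts + alpha) / (counts + alpha).sum()
    counts[rng.choice(n_items, p=probs)] += 1
    if step == 999:
        early = counts.copy()  # popularity after the first 1000 adoptions

top10_share = np.sort(counts)[-10:].sum() / counts.sum()
corr = np.corrcoef(early, counts)[0, 1]
print(f"share of adoptions held by the top 10 of {n_items} items: {top10_share:.2f}")
print(f"correlation of early and final counts: {corr:.2f}")
```

The high early-to-final correlation is exactly the signal a temporal peeking model exploits, without saying anything about why some items pulled ahead early.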
Temporally matched balanced classification
In this section, we give a problem formulation that lessens the importance of temporal features by conditioning on the average rate of adoption. That is, instead of considering all items with k adoptions, we consider items with k adoptions within about the same amount of time. Given the dominance of cumulative advantage, such a formulation would be better suited for future research on understanding how items become popular, as gains in accuracy will likely shed light on attributes of early adopters, items, and networks that affect their final popularity.

k-t problem formulation

We call this formulation Temporally Matched Balanced Classification, or a k-t formulation of the problem:
P2:
Among items with exactly k adoptions at the end of a fixed time period t, which ones will be above the median popularity at a later time T?
Figure 4: Percent accuracy for fixed t and k using all features and non-temporal features, and for fixed k with all features and non-temporal features. k = 5, T = 28 days for all; t = 15 days for Goodreads, t = 1 day for Flickr, and t = 7 days for Last.fm. Fixing t reduces accuracy substantially compared to when t is not fixed. As expected when controlling for time, non-temporal features now provide most of the explanatory power.

To do this, for each dataset we filtered items to those that had exactly k adoptions in t days. We extracted features of these items as previously described, adding a new temporal feature for each day in t:

• adoptions_i: Number of adoptions on day i of the early adopter period (Szabo and Huberman 2010; Tsur and Rappoport 2012; Pinto, Almeida, and Gonçalves 2013).

As before, we choose k = 5 and T = 28 days. For each dataset, we set t to be the median time it took an item to reach five adoptions: t = 15 for Goodreads, t = 7 for Last.fm, and t = 1 for Flickr. We exclude Twitter due to a lack of data when we filter for both k and t. We again do 5-fold cross-validation, predicting whether each item will be above or below the final median popularity after T days.

Figure 4 shows the results. As we hoped, non-temporal features now provide most of the explanatory power in the full model. Further, comparing the all-temporal series with fixed k and t to the one with only fixed k shows that the absolute accuracy of non-temporal features increases in this formulation. This suggests that de-emphasizing temporal features in prediction might in fact improve our understanding of other features that drive popularity.

Our understanding, however, is limited: even conditioning on a single temporal feature makes for a much harder problem, with overall prediction accuracy below 65% for all datasets even when using all features. There is clearly much room for improvement.
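The filtering step and the per-day adoptions_i features can be sketched as follows. The adoption_times data is a made-up stand-in for the per-item adoption logs, and the helper name kt_features is ours.

```python
# Sketch of the k-t filter: keep items with exactly k adoptions within
# t days, and turn those adoptions into per-day adoptions_i counts.
# adoption_times (item -> adoption days) is hypothetical example data.
import numpy as np

k, t = 5, 7  # e.g. t = 7 days, the Last.fm median time to 5 adoptions

adoption_times = {
    "item_a": [0, 1, 1, 3, 6, 12, 20],  # 5 adoptions by day 7 -> kept
    "item_b": [0, 2, 9, 11, 15, 16],    # only 2 by day 7 -> dropped
    "item_c": [1, 2, 2, 4, 6, 9, 13],   # 5 by day 7 -> kept
}

def kt_features(days, k, t):
    """Per-day adoption counts (adoptions_0 .. adoptions_t) if the item
    has exactly k adoptions by day t, else None (excluded from sample)."""
    first = [d for d in days if d <= t]
    if len(first) != k:
        return None
    return np.bincount(first, minlength=t + 1)

sample = {item: feats for item in adoption_times
          if (feats := kt_features(adoption_times[item], k, t)) is not None}
print(sorted(sample))  # → ['item_a', 'item_c']
```

Because every surviving item reached k adoptions in the same window, a classifier over these features can no longer lean on raw adoption speed, which is the point of the formulation.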
Discussion and Conclusion
Using multiple problem formulations, we show that temporal features matter the most in predicting the popularity of items, given data about initial adopters and our current ability to build explanatory features of those adopters and their networks. Using datasets from a variety of social networks, we show that temporal features are not only better at predicting popularity than all other features combined, but that they readily generalize to new contexts. When we discount temporal phenomena by removing temporal features or adjusting the problem formulation, accuracy decreases substantially.

From a practical point of view, these results provide empirical support for a promising approach where only temporal features are used to predict future popularity (Szabo and Huberman 2010; Zhao et al. 2015), because the drop in accuracy from casting aside non-temporal features is generally small. Maybe creative feature engineering is not worth the effort for the Balanced Classification task. This way of looking at the problem resonates a bit with the Netflix prize, where most of the learners that were folded into the winning model were never implemented in Netflix's actual algorithm, in part because the cost of computing and managing those learners was not worth the incremental gains (Amatriain and Basilico 2012).

Although less valuable than temporal features, the non-temporal features examined so far do have some predictive power on their own. This might be useful when temporal information is unavailable: for example, for very new items (Borghol et al. 2012), or for external observers or datasets where timestamps are unavailable (Cosley et al. 2010).
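A temporal-only predictor of this kind can be as simple as the log-linear extrapolation in the spirit of Szabo and Huberman (2010): final popularity is roughly log-linear in early popularity, so fit that line and extrapolate. The data and constants below are synthetic stand-ins.

```python
# Sketch of a temporal-only predictor in the spirit of Szabo and
# Huberman (2010): fit log(final popularity) as a linear function of
# log(early popularity), then extrapolate.  Data is synthetic.
import numpy as np

rng = np.random.default_rng(4)
n = 500
early = rng.lognormal(mean=1.0, sigma=1.0, size=n)           # adoptions by day t
final = early * rng.lognormal(mean=1.5, sigma=0.3, size=n)   # adoptions by day T

# Fit log N(T) = a * log N(t) + b on the first 400 items, test on the rest.
a, b = np.polyfit(np.log(early[:400]), np.log(final[:400]), deg=1)
pred = np.exp(a * np.log(early[400:]) + b)

rel_err = np.median(np.abs(pred - final[400:]) / final[400:])
print(f"fitted slope: {a:.2f}, median relative error: {rel_err:.2f}")
```

Two fitted constants and an early count already give serviceable forecasts here, which is why casting aside non-temporal features costs so little in practice.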
Encouragingly, non-temporal features increase in accuracy a little in the k-t formulation compared to the fixed-k balanced classification problem, suggesting that making time less salient might allow other factors to become more visible and modelable.

Using k-t models could also bend time to our advantage. Comparing the overall performance and predictive features in models with smaller versus larger t might highlight item, adopter, and network characteristics that predict faster adoption (and eventual popularity). Another way to frame this intuition is that instead of predicting eventual popularity, we should try to predict initial adoption speed.

Deeper thinking about the context of sharing might also be useful. Algorithmic and interface factors, for instance, have been shown to create cumulative advantage effects; it would be interesting to look more deeply into how system features might influence adoption behaviors. Likewise, diffusion models tend to focus attention on sharers rather than receivers of information, but those receivers' preferences, goals and attention budgets likely shape their adoption behaviors (Sharma and Cosley 2015). Thus, consideration of audience-based features might be a way forward.

Most generally, we encourage research in this area to go beyond the low-hanging fruit of time. For building better theories of diffusion, maximizing accuracy with temporal information may act both as a crutch that makes the problem too easy, and as a blindfold that makes it hard to examine what drives those rapid adoptions that predict eventual popularity.

Acknowledgments
This work was supported by the National Science Foundation under grants IIS 0910664 and IIS 1422484, and by a grant from Google for computational resources.
References
Amatriain, X., and Basilico, J. 2012. Netflix recommendations: Beyond the 5 stars. http://techblog.netflix.com/2012/04/netflix-recommendations-beyond-5-stars.html.

Asur, S.; Huberman, B. A.; et al. 2010. Predicting the future with social media. In Web Intelligence and Intelligent Agent Technology (WI-IAT), 2010 IEEE/WIC/ACM International Conference on, volume 1, 492–499.

Bakshy, E.; Hofman, J. M.; Mason, W. A.; and Watts, D. J. 2011. Everyone's an influencer: Quantifying influence on Twitter. In Proceedings of the fourth ACM international conference on Web search and data mining.

Bandari, R.; Asur, S.; and Huberman, B. A. 2012. The pulse of news in social media: Forecasting popularity. In Sixth International AAAI Conference on Weblogs and Social Media, 26–33.

Berger, J., and Milkman, K. L. 2012. What makes online content viral? Journal of Marketing Research.

Berger, J. 2013. Contagious: Why Things Catch On. Simon and Schuster.

Borghol, Y.; Ardon, S.; Carlsson, N.; Eager, D.; and Mahanti, A. 2012. The untold story of the clones: Content-agnostic factors that impact YouTube video popularity. In Proceedings of the 18th ACM SIGKDD international conference on Knowledge discovery and data mining, 1186–1194.

Cha, M.; Mislove, A.; and Gummadi, K. P. 2009. A measurement-driven analysis of information propagation in the Flickr social network. In Proceedings of the 18th international conference on World Wide Web, 721–730.

Cheng, J.; Adamic, L.; Dow, P. A.; Kleinberg, J. M.; and Leskovec, J. 2014. Can cascades be predicted? In Proceedings of the 23rd international conference on World Wide Web, 925–936.

Cosley, D.; Huttenlocher, D. P.; Kleinberg, J. M.; Lan, X.; and Suri, S. 2010. Sequential influence models in social networks. In Fourth International AAAI Conference on Weblogs and Social Media.

Frank, R. H., and Cook, P. J. 1995. The Winner-Take-All Society: Why the Few at the Top Get So Much More Than the Rest of Us. Random House.

Gladwell, M. 2006a. The formula. The New Yorker.

Gladwell, M. 2006b. The Tipping Point: How Little Things Can Make a Big Difference. Little, Brown.

Goel, S.; Hofman, J. M.; Lahaie, S.; Pennock, D. M.; and Watts, D. J. 2010. Predicting consumer behavior with web search. Proceedings of the National Academy of Sciences.

Jenders, M.; Kasneci, G.; and Naumann, F. 2013. Analyzing and predicting viral tweets. In Proceedings of the 22nd international conference on World Wide Web companion, 657–664.

Kupavskii, A.; Umnov, A.; Gusev, G.; and Serdyukov, P. 2013. Predicting the audience size of a tweet. In Seventh International AAAI Conference on Weblogs and Social Media.

Lerman, K., and Galstyan, A. 2008. Analysis of social voting patterns on Digg. In Proceedings of the first workshop on Online social networks, 7–12.

Lerman, K., and Hogg, T. 2010. Using a model of social dynamics to predict popularity of news. In Proceedings of the 19th international conference on World Wide Web, 621–630.

Maity, S. K.; Gupta, A.; Goyal, P.; and Mukherjee, A. 2015. A stratified learning approach for predicting the popularity of Twitter idioms. In Ninth International AAAI Conference on Web and Social Media.

Martin, T.; Hofman, J. M.; Sharma, A.; Anderson, A.; and Watts, D. J. 2016. Limits to prediction: Predicting success in complex social systems. In Proceedings of the 25th international conference on World Wide Web.

Pachet, F., and Sony, C. 2012. Hit song science. Music Data Mining.

Pinto, H.; Almeida, J. M.; and Gonçalves, M. A. 2013. Using early view patterns to predict the popularity of YouTube videos. In Proceedings of the sixth ACM international conference on Web search and data mining.

Romero, D. M.; Tan, C.; and Ugander, J. 2013. On the interplay between social and topical structure. In Seventh International AAAI Conference on Weblogs and Social Media.

Salganik, M. J.; Dodds, P. S.; and Watts, D. J. 2006. Experimental study of inequality and unpredictability in an artificial cultural market. Science.

Sharma, A., and Cosley, D. 2015. Studying and modeling the connection between people's preferences and content sharing. In Proceedings of the 18th ACM Conference on Computer Supported Cooperative Work & Social Computing, 1246–1257.

Sharma, A., and Cosley, D. 2016. Distinguishing between personal preferences and social influence in online activity feeds. In Proceedings of the 19th ACM Conference on Computer-Supported Cooperative Work & Social Computing, 1091–1103.

Simonoff, J. S., and Sparrow, I. R. 2000. Predicting movie grosses: Winners and losers, blockbusters and sleepers. Chance.

Szabo, G., and Huberman, B. A. 2010. Predicting the popularity of online content. Communications of the ACM.

Tsur, O., and Rappoport, A. 2012. What's in a hashtag? Content based prediction of the spread of ideas in microblogging communities. In Proceedings of the fifth ACM international conference on Web search and data mining, 643–652.

Watts, D. J. 2011. Everything Is Obvious: *Once You Know the Answer. Crown Business.

Weng, L.; Menczer, F.; and Ahn, Y.-Y. 2013. Virality prediction and community structure in social networks. Scientific Reports.

Yu, L.; Cui, P.; Wang, F.; Song, C.; and Yang, S. 2015. From micro to macro: Uncovering and predicting information cascading process with behavioral dynamics. In IEEE International Conference on Data Mining.

Zhao, Q.; Erdogdu, M. A.; He, H. Y.; Rajaraman, A.; and Leskovec, J. 2015. SEISMIC: A self-exciting point process model for predicting tweet popularity. In Proceedings of the 21st ACM SIGKDD international conference on Knowledge discovery and data mining.