Uncovering the Dynamics of Crowdlearning and the Value of Knowledge
Utkarsh Upadhyay, Isabel Valera, and Manuel Gomez-Rodriguez
Max Planck Institute for Software Systems, {utkarshu, ivalera, manuelgr}@mpi-sws.org

Abstract
Learning from the crowd has become increasingly popular on the Web and in social media. There is a wide variety of crowdlearning sites in which, on the one hand, users learn from the knowledge that other users contribute to the site and, on the other hand, knowledge is reviewed and curated by the same users using assessment measures such as upvotes or likes. In this paper, we present a probabilistic modeling framework of crowdlearning, which uncovers the evolution of a user's expertise over time by leveraging other users' assessments of her contributions. The model allows for both off-site and on-site learning and captures forgetting of knowledge. We then develop a scalable estimation method to fit the model parameters from millions of recorded learning and contributing events. We show the effectiveness of our model by tracing the activity of ∼25 thousand users in Stack Overflow over a 4.5 year period. We find that answers with high knowledge value are rare. Newbies and experts tend to acquire less knowledge than users in the middle range. Prolific learners also tend to be proficient contributors who post answers with high knowledge value.

Keywords:
User modelling; Education; Markets and crowds; Social and information networks.

"By learning you will teach; by teaching you will learn." – Latin proverb

Question answering (Q&A) sites, online communities, wikis and microblogs offer unprecedented opportunities for people to learn about a wide variety of topics, acquire specialized knowledge, or stay up to date with the latest breaking news. A common feature shared by most of these platforms is that knowledge is contributed by the crowd – it is crowdsourced – and it is also the crowd who reviews and curates the contributed knowledge. For example, in Q&A sites, users can learn by reading answers others post to their own or similar questions; in wikis, a set of editors write and review the content of pages in a collaborative fashion, and this content is then publicly accessible to others; in microblogs, users post small pieces of information, which can then be assessed by other users by means of likes, shares or replies. All of the above are examples of crowdlearning, in which users can play the role of a learner, a contributor, or switch between both. There have been many recent works on identifying experts (or potential experts) in Q&A websites [15, 36] and microblogs [13, 25], as well as on modeling learning in controlled settings [6, 12]. However, general models of crowdlearning are largely nonexistent to date. Such models are of outstanding interest since they would allow us:
(i) to better understand how people learn over time and become experts;
(ii) to identify questions with high knowledge value, which systematically help users increase their expertise;
(iii) to investigate the interplay between learners and contributors.
In this paper, we propose a probabilistic generative model of crowdlearning, especially designed to fit fine-grained crowdlearning event data [1].
The key idea behind our modeling framework is simple: every time a user learns from a knowledge item contributed by other users, she may increase her expertise and, as a consequence, her subsequent contributions may be more knowledgeable and assessed more highly by others in terms of, e.g., upvotes, likes or shares. Thus, by jointly modeling learning events, in which users acquire effective knowledge, and contributing events (in short, contributions), in which users contribute their expertise to a knowledge item, our framework reaches the above mentioned goals. In this work, we aim to measure those aspects of the learning process for which we have evidence in the observed data, i.e., a measure of effective knowledge that leads to a measurable increase in users' effective learning. A general model of abstract knowledge and learning remains a challenging endeavor.

In more detail, we model each user's expertise as a latent stochastic process that evolves over time, and think of the other users' assessments of her contributions as noisy samples from this stochastic process, localized in time. Moreover, this stochastic process is driven by two types of learning: off-site learning and on-site learning. The proposed formulation also captures characteristic properties of the learning process, previously studied in the literature, such as forgetting [21] and initial expertise [30]. We then develop an efficient parameter estimation method that finds the model parameters that maximize the likelihood of an observed set of learning and contributing events via convex optimization. Finally, we show the effectiveness of our model by tracing learning and contributing events in data gathered from Stack Overflow over a 4.5 year period. Our experiments reveal several interesting insights:
I. The knowledge value of items follows a log-normal distribution.
II. Users with very low or very high initial expertise, i.e., newbies and experts, tend to increase their knowledge the least while, in contrast, users in the middle range tend to increase it the most. This suggests that the learning curve may be sigmoidal, in agreement with existing literature [20].
III. Although there are fewer contributors than learners in absolute numbers, the distribution of knowledge in the contributions is fat tailed, while the distribution of knowledge learned is heavy tailed.
IV. Users who learn from high knowledge items are also more proficient at providing answers with high knowledge value.

Related work.
Our work lies at the intersection of expert identification, learning, and knowledge tracing and student modeling.

Identifying topical authorities or experts, i.e., users who provide high quality contributions, on Q&A [15, 17, 26, 36] and microblogging sites [13, 25] has received a lot of attention recently. The problem of expert finding in Q&A sites was first studied by Zhang et al. [36], who formulated it as a ranking problem and developed several PageRank-based methods. Shortly after, Jurczyk and Agichtein [17] tackled the problem using link analysis techniques. More recently, Pal and Konstan [26] approached the problem from the perspective of supervised learning and developed Gaussian classification models to distinguish between ordinary and (potential) expert users, and Hanrahan et al. [15] described a method to find experts given a specific target question. In the context of microblogging, the problem of expert finding was first studied by Pal and Counts [25], who proposed a set of features for characterizing contributors and then formulated the problem using unsupervised learning in this feature space. Since then, Ghosh et al. [13] mined Twitter users' lists to find topical authorities, and Kao et al. [19] and Paulina et al. [27] leveraged temporal statistics of users' activity to identify experts. Finally, in the context of web search, White et al. [34] studied how expertise influences search, and Eickhoff et al. [9] investigated how a user can increase her expertise as she looks for procedural and declarative knowledge using a search engine. However, in contrast to our work, previous work on expert identification did not capture the evolution of users' expertise over extended periods of time, nor did it account for the knowledge value of their contributions.

Interest in modeling and measuring learning is very old, and several paradigms have been developed over the last century in the experimental psychology literature [10, 22, 24].
Most of the research has, however, either happened in strictly controlled environments (i.e., schools or study groups) or used centralized assessments (e.g., SAT or local testing). The work most closely related to ours is knowledge tracing and student modeling, which has been carried out by researchers from the learning analytics, educational data mining and intelligent tutoring communities. In this line of work, several probabilistic models have been proposed: Bayesian knowledge tracing [8, 14, 31, 35], performance factor analysis [28] and ensembles [5]. However, previous work typically relies on controlled assessment and manually annotated knowledge items, even when allowing for different knowledge values per item [6]. Only very recently, Piech et al. [29] addressed this limitation by resorting to recurrent neural networks to model the learning of students; unfortunately, they use metrics that are not suitable for crowdlearning.

In summary, our goal here is a general understanding of crowdlearning dynamics, from uncovering the evolution of users' expertise over time and understanding the interplay between learning and contributing, to identifying questions with a high knowledge value, which systematically help users to increase their expertise. In contrast, previous work has focused either on identifying experts, not their expertise evolution, or on modeling the learning of students in controlled environments with manually annotated knowledge items.
In a crowdlearning site, users often play two different functional roles: contributors and learners. In the former role, they share their knowledge on a topic (or topics) with other users within the site; in the latter role, they acquire knowledge by reading what other users contributed to the site. We can then think of users' expertise as latent stochastic processes that evolve over time, and of the assessments of their contributions to the site as noisy samples from these stochastic processes, localized in time. Here, we propose a modeling framework that uncovers the evolution of these processes by modeling two types of learning:
I. Off-site learning, which accounts for the knowledge that the user accumulates outside the site; and,
II. On-site learning, which accounts for the knowledge that the user gains by reading other users' contributions within the site.
Next, we formulate our generative model, starting from the data it is designed for.
Crowdlearning data.
Given a crowdlearning site with a set of users $\mathcal{U}$ and a set of learning areas (or topics) $\mathcal{A}$, we first define a knowledge item $q$ as the smallest quantum of knowledge a user can learn from within the site. For example, in a Q&A site, a knowledge item corresponds to a question and its answer(s); in Twitter, it corresponds to a tweet; and in a wiki site, it corresponds to a wiki page. Intuitively, each knowledge item $q$ provides a certain (latent) knowledge value, $k_q \in \mathbb{R}^+$, and contains knowledge about a subset of topics $\mathcal{A}_q \subseteq \mathcal{A}$. Here, we assume that knowledge is additive, i.e.,
$$k_q = \sum_{a \in \mathcal{A}_q} k_{qa} = \mathbf{w}_q^T \mathbf{k}_q,$$
where $k_{qa} \in \mathbb{R}^+$ is the knowledge value contained in item $q$ about topic $a$, $\mathbf{k}_q = [k_{qa}]_{a \in \mathcal{A}}$, and $w_{qa} = 1$ if $a \in \mathcal{A}_q$ and $w_{qa} = 0$ otherwise. The model can be extended to non-binary weights to represent fractional presence of topics in a knowledge item [7].

Then, we define two types of events: learning events, in which users acquire knowledge by reading contributions by other users, and contributing events (or contributions), in which users contribute to the crowd by sharing their knowledge. Formally, we represent each learning event as a triplet
$$l := (\underbrace{u}_{\text{user}},\ \underbrace{t}_{\text{time}},\ \underbrace{q}_{\text{knowledge item}}), \quad (1)$$
which means that a user $u \in \mathcal{U}$ learned from knowledge item $q$ at time $t$. Here, a knowledge item $q$ may contain one or more contributions by other users. For example, in a Q&A site, a knowledge item corresponds to a question and its answers, typically contributed by different users. In a learning event, we do not distinguish the knowledge provided by individual contributions, but instead consider the knowledge of the item as a whole. Moreover, note that, if the knowledge value of an item is zero, the learning event will not increase the expertise of the learner.
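As a quick numerical illustration of the additivity assumption, the snippet below computes the total knowledge value of one item from per-topic values (the topic set and all values here are hypothetical, chosen only for illustration):

```python
import numpy as np

# Toy topic set A = {java, python, sql}; item q covers {java, sql} only.
k_q_vec = np.array([0.5, 0.0, 0.25])  # per-topic knowledge values k_{qa}; 0 for absent topics
w_q = np.array([1.0, 0.0, 1.0])       # binary membership weights: w_{qa} = 1 iff a is in A_q

# Total knowledge value of the item: k_q = sum_{a in A_q} k_{qa} = w_q^T k_q
k_q = w_q @ k_q_vec
print(k_q)  # → 0.75
```

The binary weight vector makes the sum over the item's topics a plain inner product, which is what lets the later likelihood computations stay linear in the per-topic knowledge variables.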
Then, we denote the history of learning events associated with user $u$ up to time $t$ by $\mathcal{H}^l_u(t) = \bigcup_{i:\, t_i < t} \{l_i\}$. Similarly, we represent each contributing event as a quadruple $c := (u, t, q, s)$, where $s$ denotes the score (assessment) the contribution receives from other users.

We represent each user's expertise as a multidimensional (latent) stochastic process $\mathbf{e}^*_u(t)$, in which the $a$-th entry, $e^*_{ua}(t) \in \mathbb{R}^+$, represents user $u$'s expertise on topic $a$ at time $t$. Here, the sign $*$ means that the expertise $e^*_{ua}(t)$ depends on her learning history $\mathcal{H}^l_u(t)$. Then, every time a user $u$ contributes to a knowledge item $q$ at time $t$, we draw the contribution's score from a distribution $p(s \,|\, \mathcal{A}_q, \mathbf{e}^*_u(t))$. Further, we represent the times of the learning and contributing events within the site by two sets of counting processes, denoted by two vectors $\mathbf{N}^l(t)$ and $\mathbf{N}^c(t)$, in which the $u$-th entries, $N^l_u(t)$ and $N^c_u(t)$, count the number of times user $u$ learned from and contributed to the crowdlearning site up to time $t$. Then, we can characterize these counting processes using their corresponding intensities as
$$\mathbb{E}[d\mathbf{N}^l(t) \,|\, \mathcal{H}(t)] = \boldsymbol{\lambda}^l(t)\,dt \quad \text{and} \quad \mathbb{E}[d\mathbf{N}^c(t) \,|\, \mathcal{H}(t)] = \boldsymbol{\lambda}^c(t)\,dt,$$
where $d\mathbf{N}^l(t) := [dN^l_u(t)]_{u \in \mathcal{U}}$ and $d\mathbf{N}^c(t) := [dN^c_u(t)]_{u \in \mathcal{U}}$ denote the number of learning and contributing events in the window $[t, t+dt)$, and $\boldsymbol{\lambda}^l(t) := [\lambda^l_u(t)]_{u \in \mathcal{U}}$ and $\boldsymbol{\lambda}^c(t) := [\lambda^c_u(t)]_{u \in \mathcal{U}}$ denote the vectors of intensities associated with all the users. Here, there is a wide variety of intensity functions one can choose from [1]. However, modeling the times of learning and contributing events is not the main focus of this work – we refer the reader to the growing literature on social activity modeling using point processes [11, 33, 37]. Next, we specify the functional form of each user's expertise $\mathbf{e}^*_u(t)$ and the score distribution $p(s \,|\, \mathcal{A}_q, \mathbf{e}^*_u(t))$.

Stochastic process for expertise.
The expertise $e^*_{ua}(t)$ of a user $u$ on a topic $a$ at time $t$ takes the following form:
$$e^*_{ua}(t) := \underbrace{\alpha_{ua}}_{\text{initial expertise}} + \underbrace{\mu_{ua} \cdot t}_{\text{off-site learning}} + \underbrace{\sum_{i:\, q_i \in \mathcal{H}^l_u(t)} k_{q_i a} \cdot \kappa_\omega(t - t_i)}_{\text{on-site learning}}, \quad (2)$$
where the first term, $\alpha_{ua} \in \mathbb{R}^+$, models the initial expertise of user $u$ on topic $a$ when she joined the crowdlearning site; the second term, $\mu_{ua} \in \mathbb{R}^+$, assumes a linear trend for the off-site learning process as a first order approximation (several other shapes for the learning curve have been proposed by Heathcote et al. [16]; however, we chose the linear form for its simplicity and ease of model estimation, as suggested by Skinner [32]); and the third term models the knowledge a user acquires by means of learning events within the crowdlearning site. Here, $\kappa_\omega(t)$ is a nonnegative kernel function that models the rate at which users forget the knowledge they learn from knowledge items. Following previous work in the psychology literature [4, 21], which argues that people forget at an exponential rate, we opt for an exponential kernel $\kappa_\omega(t) := \exp(-\omega t)\,\mathbb{I}(t \geq 0)$. However, our model estimation method does not depend on this particular choice.

[Figure 1: Statistics of learning events (LE), tags of learning events (LET) and tags of contributing events (CET) in the Stack Overflow dataset: (a) LE per question, (b) LE per user, (c) LET per user, (d) CET per user. In Panels (c) and (d), the x-axis denotes the tag index in order of popularity for each user.]

For compactness, we write each user's expertise as a row vector of length $|\mathcal{A}|$, i.e.,
$$\mathbf{e}^*_u(t) = \boldsymbol{\alpha}_u + \boldsymbol{\mu}_u \cdot t + \sum_{i:\, q_i \in \mathcal{H}^l_u(t)} \mathbf{k}_{q_i} \cdot \kappa_\omega(t - t_i), \quad (3)$$
where $\boldsymbol{\alpha}_u = [\alpha_{ua}]_{a \in \mathcal{A}}$, $\boldsymbol{\mu}_u = [\mu_{ua}]_{a \in \mathcal{A}}$ and $\mathbf{k}_{q_i} = [k_{q_i a}]_{a \in \mathcal{A}}$.
Here, by definition, $k_{q_i a} = 0$ if $a \notin \mathcal{A}_{q_i}$. Then, we can gather the model parameters for all users in three matrices $\boldsymbol{\alpha}$, $\boldsymbol{\mu}$ and $\mathbf{k}$ with sizes $|\mathcal{U}| \times |\mathcal{A}|$, $|\mathcal{U}| \times |\mathcal{A}|$ and $|\mathcal{Q}| \times |\mathcal{A}|$, respectively.

Score distribution. Given a contribution $c = (u, t, q, s)$, the particular choice of score distribution $p(s \,|\, \mathcal{A}_q, \mathbf{e}^*_u(t))$ depends on the observed data. In this work, we consider discrete non-negative scores, $s \in \mathbb{N}$, which fit well several scenarios of interest. For example, in Stack Overflow, scores may correspond to the number of upvotes that answers receive; in Twitter, to the number of likes or retweets that tweets receive; and, in Pinterest, to the repins that a pin receives. A natural choice in such cases is the Poisson distribution:
$$s \sim \text{Poisson}\!\left(\frac{\mathbf{w}_q^T \mathbf{e}^*_u(t)}{\mathbf{w}_q^T \mathbf{1}}\right). \quad (4)$$
Here, $\mathbf{1}$ is a column vector of ones with length $|\mathcal{A}|$. With this choice, the mean of the score distribution is simply the average expertise of user $u$ at time $t$ across the topics $\mathcal{A}_q$ the knowledge item $q$ is about. Moreover, the greater the expertise of a user, the greater the scores given by other users to her contributions, as one may expect in real-world data.

Note that, depending on the recorded data, we could choose a different score distribution; e.g., for continuous assessments, like the time elapsed between the question and the answer, one may choose a continuous distribution. Our model estimation method in Section 2 can be easily adapted to any distribution that is jointly log-concave with respect to the model parameters $\boldsymbol{\alpha}$, $\boldsymbol{\mu}$ and $\mathbf{k}$.

Efficient Parameter Estimation. Given a collection of learning and contributing events, $\mathcal{H}^l(T)$ and $\mathcal{H}^c(T)$, recorded during a time period $[0, T)$, we find the optimal model parameters $\boldsymbol{\alpha}$, $\boldsymbol{\mu}$ and $\mathbf{k}$ by solving the following maximum likelihood estimation problem:
$$\underset{\boldsymbol{\alpha} \geq 0,\ \boldsymbol{\mu} \geq 0,\ \mathbf{k} \geq 0}{\text{maximize}} \quad \mathcal{L}(\boldsymbol{\alpha}, \boldsymbol{\mu}, \mathbf{k}), \quad (5)$$
where we compute the log-likelihood $\mathcal{L}(\boldsymbol{\alpha}, \boldsymbol{\mu}, \mathbf{k})$ using Eq. 3 and Eq. 4, i.e.,
$$\mathcal{L}(\boldsymbol{\alpha}, \boldsymbol{\mu}, \mathbf{k}) = \sum_{(u,t,q,s) \in \mathcal{H}^c(T)} \left[ s \cdot \log\!\left(\frac{\mathbf{w}_q^T \mathbf{e}^*_u(t)}{\mathbf{w}_q^T \mathbf{1}}\right) - \frac{\mathbf{w}_q^T \mathbf{e}^*_u(t)}{\mathbf{w}_q^T \mathbf{1}} \right]. \quad (6)$$

[Figure 2: In (a), estimated and true expertise evolution, together with learning events and scores, for a user picked at random in a synthetic dataset. In (b) & (c), estimated (y-axis) against true (x-axis) knowledge item values; each point corresponds to a knowledge item variable, and the line x = y corresponds to zero estimation error. Our estimation method achieves high Spearman's rank correlation between estimated and true values.]

[Figure 3: Estimated (y-axis) against true (x-axis) model parameters for the synthetic datasets. Each point corresponds to a user's (a) trend µ_u or (b) baseline α_u variable, and the line x = y corresponds to zero estimation error.]

Since $\mathbf{e}^*_u(t)$ is linear in the model parameters $\boldsymbol{\alpha}$, $\boldsymbol{\mu}$ and $\mathbf{k}$, the function $s \log x - x$ is concave in $x$, and the composition of a concave function with a linear function is concave, the optimization problem above is jointly convex in $\boldsymbol{\alpha}$, $\boldsymbol{\mu}$, and $\mathbf{k}$. As a consequence, the global optimum can be efficiently found by many algorithms. In practice, the limited-memory BFGS with bounded variables (L-BFGS-B) algorithm [38] worked best for our problem.

Remarks.
In this work, we are measuring effective learning, which accounts for the ability of a user to get better assessments of her posts, and effective knowledge, which accounts for the gain in this ability that learning from a knowledge item causes. Making these quantities correspond to real-life expertise and knowledge value on a crowdlearning website requires a careful mapping from the features of that website to learning events and scores.

Moreover, using our model, one can only measure learning and knowledge if there is overlap between the topics of a user's learning and contributing events. Therefore, there is a trade-off between the granularity of the topics chosen and the amount of data available for inference: increasing the granularity ensures accurate mappings between learning and contributing events, but reduces the amount of data available to learn the model parameters. We discuss this further in Section 5.

[Figure 4: Estimated against true model parameters for a synthetic dataset. Panels (a) and (b) show the correlation (CR) between the estimated and true model parameters against the number of learning events and the median number of contributions per knowledge item, respectively. Panel (c) shows the RMSE for the estimated trends, µ, against the number of contributed events per user. In Panel (a), the red dotted line shows the threshold (10) we chose for the learning events per knowledge item in our dataset (see Section 4). In Panels (b) and (c), the red dotted lines show the median number of contributions per knowledge item and the minimum number of contributions per user in the experiments on our dataset, respectively (see Section 4).]
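To make Eqs. (2)-(6) concrete, here is a minimal single-user, single-pair-of-items sketch of the model and its estimation. All event times, scores and hyper-parameters below are hypothetical, and a simple projected-gradient loop with a numerical gradient stands in for the L-BFGS-B solver used in the paper:

```python
import numpy as np

# Minimal one-topic sketch of Eqs. (2)-(6). All data below is hypothetical.
omega = np.log(2) / 7.0                     # forgetting rate: half-life of 7 days (assumed)
learn = [(2.0, 0), (5.0, 1), (9.0, 0)]      # learning events (time t_i, item q_i) of one user
contrib = [(4.0, 3), (8.0, 5), (12.0, 6)]   # contributions (time t, observed score s)

def expertise(t, alpha, mu, k):
    """e*(t) = alpha + mu*t + sum_{i: t_i < t} k[q_i] * exp(-omega*(t - t_i))  (Eq. 2)."""
    on_site = sum(k[q] * np.exp(-omega * (t - ti)) for ti, q in learn if ti < t)
    return alpha + mu * t + on_site

def neg_log_lik(theta):
    """Negative Poisson log-likelihood of the scores (Eq. 6, constant terms dropped)."""
    alpha, mu, k0, k1 = theta
    return -sum(s * np.log(expertise(t, alpha, mu, [k0, k1]))
                - expertise(t, alpha, mu, [k0, k1]) for t, s in contrib)

# Projected gradient descent with a numerical gradient, standing in for L-BFGS-B;
# the objective is jointly convex, so a first-order method suffices for this sketch.
theta = np.ones(4)
for _ in range(2000):
    grad = np.zeros(4)
    for j in range(4):
        d = np.zeros(4); d[j] = 1e-6
        grad[j] = (neg_log_lik(theta + d) - neg_log_lik(theta - d)) / 2e-6
    theta = np.maximum(theta - 0.01 * grad, 1e-8)  # enforce alpha, mu, k >= 0

alpha_hat, mu_hat, k0_hat, k1_hat = theta
```

Because the expertise is linear in the parameters, any such method converges toward the global optimum; a full-scale implementation would instead precompute the kernel coefficients κ_ω(t − t_i) once per contribution and hand the analytic gradient to L-BFGS-B.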
In this section, we first show that our model estimation method can accurately recover the true model parameters from learning and contributing events generated under realistic conditions. We then show that, as long as there is a sufficient number of contributions per learning event, the estimation becomes more accurate as we feed more events into the estimation procedure. Finally, we show that our estimation method can easily scale up to millions of users, knowledge items, and learning and contributing events.

Experimental setup. We carefully craft an experimental setup to closely mimic some of the empirical patterns observed in real crowdlearning data, as given by Figure 1. Here, for simplicity, we assume the topics associated with each knowledge item are specified by means of tags.

Given a set of users and knowledge items, we draw the users' off-site learning rates {µ_ua} and initial expertise {α_ua} from uniform distributions, and the knowledge value of the items from a rescaled log-normal distribution. These choices ensure that the distribution of scores which users receive resembles the distribution in real data. We set the users' forgetting decay rate ω such that 50% of the knowledge is forgotten roughly after the first week, and assume that the intensities of both users' learning and contributing events are (homogeneous) Poisson processes. We denote the total simulation time by T. We set each user's learning event rate to T/n, where n is drawn from a log-normal distribution, so that the number of events per user fits the empirical distribution well (see Figure 1b), and each user's contributing rate to T/m, where m is drawn from a uniform distribution for easy control.
Finally, for each user, we shuffle the tag labels and set her tag learning propensity, defined as the probability that she upvotes a knowledge item with a given tag, and her tag contributing propensity, defined as the probability that she contributes to a knowledge item with a given tag, using the empirical distributions (see Figures 1c and 1d).

Then, we generate learning and contributing events as follows. First, we generate the timings of each user's learning events by drawing samples from the corresponding Poisson process, and assign each learning event to a knowledge item such that the user's tag learning propensity is satisfied. Then, we generate the timings of each user's contributions by drawing samples from the corresponding Poisson process, assign each contributing event to a knowledge item such that the user's tag contributing propensity is satisfied, and draw the quality score from a Poisson distribution that depends on the user's expertise on the item tags at the time of the event, as given by Eq. (4). Unless explicitly stated otherwise, we only consider knowledge items with at least 10 associated learning events.

[Figure 5: Running time (RT) of our model estimation method: (a) RT against the number of learning events, with the number of contributing events held fixed; (b) RT against the number of contributions per learner, with the number of learning events held fixed. For pre-processing, we used ten multi-core machines and, for the optimization itself, a single multi-core machine; the memory requirements remained moderate at all points of the pre-processing and optimization.]
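The event-generation procedure above can be sketched as follows. This is a single-user, single-tag simplification with toy parameter values (the full setup draws per-user rates and tag propensities from the empirical distributions):

```python
import numpy as np

# Sketch of the synthetic-data generator (assumed simplification: one user, one tag).
rng = np.random.default_rng(0)
T = 365.0                                  # simulation horizon in days
omega = np.log(2) / 7.0                    # forgetting rate (50% forgotten after ~a week)
alpha, mu = 1.0, 0.005                     # initial expertise and off-site trend (toy values)
n_items = 50
k = rng.lognormal(mean=0.0, sigma=1.0, size=n_items) * 0.1  # item knowledge values

def hpp_times(rate, T):
    """Event times of a homogeneous Poisson process on [0, T),
    via i.i.d. exponential inter-arrival times."""
    t, out = 0.0, []
    while (t := t + rng.exponential(1.0 / rate)) < T:
        out.append(t)
    return np.array(out)

# Learning events: (time, item); here items are picked uniformly instead of by propensity.
learn_t = hpp_times(rate=0.5, T=T)
learn_q = rng.integers(0, n_items, size=learn_t.size)

def expertise(t):
    """Eq. (2) for the single user: accumulate forgotten-discounted item knowledge."""
    mask = learn_t < t
    return alpha + mu * t + np.sum(k[learn_q[mask]] * np.exp(-omega * (t - learn_t[mask])))

# Contributions: (time, score), with scores drawn from Poisson(expertise) as in Eq. (4).
contrib_t = hpp_times(rate=0.1, T=T)
scores = np.array([rng.poisson(expertise(t)) for t in contrib_t])
```

Feeding the resulting `(learn_t, learn_q)` and `(contrib_t, scores)` pairs into the maximum likelihood problem of Eq. (5) is then exactly the recovery task evaluated in this section.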
Given this data, our goal is to find the knowledge value of the items users learned from, as well as the users' off-site learning rates and initial expertise, by solving the maximum likelihood estimation problem defined in Eq. (5).

Parameter estimation accuracy. We evaluate the accuracy of our model estimation procedure across all users and knowledge items on synthetic datasets with a single tag and with multiple tags. Figure 2 summarizes the results for the estimation of the knowledge item values by means of scatter plots. In all cases, we find that points lie close to the line x = y, i.e., their estimation error is close to zero. We also observe that the estimation of knowledge items in the multi-tag dataset is more challenging than in the single-tag dataset. Additionally, Figure 3 summarizes the results for the estimation of the users' expertise baseline and trend variables using scatter plots; again, the multi-tag dataset is qualitatively similar but more challenging. In particular, our estimation method achieves high Spearman's rank correlations ρ_k, ρ_µ and ρ_α between estimated and true knowledge values, trends and baselines in both datasets, with somewhat lower correlations in the multi-tag case. This is most likely due to the mixing of tag knowledge variables within the same knowledge item, i.e., the linear combination of knowledge variables in Eq. (6).

Parameter estimation accuracy vs. number of learning events. In our model, we can think of learning events as measurements of the amount of knowledge in a knowledge item, which are accumulated over time in the users' expertise, and of contributing events as noisy samples of the users' expertise at particular points in time.
Therefore, intuitively, the more users learn from a knowledge item, the easier it should become to accurately estimate its associated knowledge value, as long as these users also contribute to other knowledge items with overlapping topics. Figure 4a confirms this intuition by showing the Spearman's correlation against the minimum number of learning events per knowledge item.

Parameter estimation accuracy vs. number of contributing events. As pointed out above, we can think of the scores of contributing events as noisy samples of users' expertise at particular points in time. Therefore, one may expect the accuracy of our model parameter estimation to improve as the number of contributions increases, due to a more fine-grained sampling of each user's expertise. Figure 4b gives empirical evidence that this indeed happens, by showing the Spearman's correlation against the average number of answers per learning event. Figure 4c shows how the RMSE of the estimation of µ decreases as the number of contributions made by the user increases.

Scalability of parameter estimation. Crowdlearning sites such as Stack Overflow or AskReddit are rapidly increasing their numbers of active users, questions and answers. For example, Stack Overflow recently crossed the 10 million questions mark. The pre-computation of all coefficients in Eq. (6), which is the running time bottleneck, can be readily parallelized. Figure 5 shows that our model estimation method easily handles up to millions of learning and contributing events, and scales almost linearly with the number of learning events and contributions. Thus, it should be possible to scale up our estimation method even further.

In this section, we apply our model estimation method to a large-scale crowdlearning dataset from Stack Overflow.
First, we evaluate our model quantitatively by means of a prediction task: given two different answers to a question, predict which one will receive a higher score. Then, we discuss the distribution of the knowledge values and the effect of the kernel parameter on the estimation, identify different types of learners and derive insights into their main characteristics. Finally, we study the interplay between learners and contributors in crowdlearning sites and investigate to what extent users switch between learning and contributing over time.

Data description. Our Stack Overflow dataset comprises millions of questions, answers and upvotes, posted by millions of users during a six year period, from the site's inception on July 31, 2008 to September 14, 2014. Importantly, for each upvote, our dataset contains its associated user identity, question or answer identity, and timestamp. We discard the events which happened before 2010-01-01 (before the site had fully matured) and after 2014-06-01 (the extent of the data dump we had access to). Whenever in the data a user upvotes (writes) an answer, we record it as a learning (contributing) event involving the user and the knowledge item containing the answer. Moreover, we take the number of upvotes a user's answer received in the first week after posting as the score of the contribution; downvotes were discarded because they constitute less than 3% of the total votes cast. Here, we consider only the first week of voting to prevent old contributions from gaining an unfair advantage, as they have had more time to accumulate upvotes. Figure 1 provides general statistics on learning and contributing events and tag usage. We find that the learning events per user (per question) follow a log-normal (power-law) distribution. As shown, tag usage is highly skewed towards a few tags; most users contribute to and learn from only their favorite tags.

Data preprocessing.
In Section 3, we have shown that the accuracy of our estimation method depends dramatically on the number of learning and contributing events per question and user (refer to Figure 4). As a consequence, we can only expect our model estimation method to provide reliable and accurate results on real data if the data we start with contains enough learning and contributing events per question and user. To this aim, we carefully pre-process our large-scale dataset of learning and contributing events. We only consider:
(i) Knowledge items with more than 10 associated learning events, which corresponds to a high correlation between true and estimated knowledge parameters in synthetic data, as shown in Figure 4a;
(ii) Users that contribute (answer) more than a minimum number of times in at least 10 unique months, which corresponds to a low RMSE for the estimated users' baseline and trend parameters in synthetic data, as shown in Figure 4c; and,
(iii) The top 10 tags in terms of number of learning events in the recorded data (i.e., java, c, javascript, php, android, jquery, python, html, c++, and mysql).
After these preprocessing steps, our dataset consists of ∼25 thousand users who learn from ∼66 thousand knowledge items by means of millions of learning and contributing events.

Footnote: http://meta.stackoverflow.com/questions/303045/10-million-questions
Footnote: Stack Overflow generously gave us access to these additional metadata, which allows us to readily fit our model.

[Table 1: Performance of our model against a linear baseline model at predicting which one of two answers to a question will receive a higher score, for increasing score differences between the answers. As the difference in score between the answers (and, hence, the users' expertise) increases, the competitive advantage of our model becomes more pronounced.]
Then, we correct for the overall decreasing trend in the number of upvotes per answer over time and, since for each knowledge item most learning events occur after all the contributions (answers) to the knowledge item took place, we assume its knowledge value to be constant. We use the first event of each user in our dataset as an estimate of her joining time.

Finally, we would like to highlight that the preprocessing steps above do not aim to reduce the size of the original dataset but to increase the accuracy of our estimated model parameters and the reliability of our derived qualitative insights: our model estimation method easily scales to millions of learning and contributing events. In this case, preprocessing the raw data using five multi-core machines took ∼ minutes, and our estimation method, implemented using the Intel MKL libraries, took ∼ hours on a single multi-core machine. The memory requirements remained modest at all points of the preprocessing and optimization.

Quantitative evaluation. We evaluate our model quantitatively by means of the following prediction task: given two different answers to a question, predict which one will receive a higher score, i.e., more upvotes in the week after posting.

— Experimental setup: We train our model using the first 80% of the answers provided by each learner, as well as the learning events which occurred before them. Then, we match pairs of answers to the same questions from the remaining 20% and predict which one will receive a higher score. Here, we only consider questions with pairs of answers provided by users from our dataset such that their scores differ by at least one upvote. There are ∼ 32 thousand such pairs in our dataset. Finally, we compare our model against a baseline linear model which only accounts for off-site learning, to show the benefits of including the knowledge item variables.
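The evaluation above reduces to pairwise ranking accuracy. A minimal sketch, with `predict_score` standing in for any trained scorer (the paper's model or the linear baseline):

```python
def pairwise_accuracy(pairs, predict_score):
    """Fraction of answer pairs whose relative ordering by true score
    the model reproduces. Each element of `pairs` is
    (answer_a, answer_b, score_a, score_b) with score_a != score_b.
    """
    correct = sum(
        1
        for a, b, sa, sb in pairs
        if (predict_score(a) - predict_score(b)) * (sa - sb) > 0
    )
    return correct / len(pairs)

# Toy check: a "model" that predicts each answer's length in characters.
pairs = [("long answer", "hi", 5, 1), ("ok", "better one", 2, 4), ("x", "yy", 3, 1)]
print(pairwise_accuracy(pairs, len))  # 2 of 3 pairs ordered correctly
```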
— Results: Table 1 compares the performance of our model against the baseline model as the difference in score between the answers (and, hence, the users' expertise) increases. Our model consistently outperforms the baseline for any score difference, and its competitive advantage becomes more pronounced as the score difference increases, reaching its highest accuracy for the largest score differences.

[Footnote: The number of upvotes per answer decreases over time because the number of answers grows at a faster rate than the number of learners.]

[Figure 6: Estimated knowledge values for knowledge items and fraction of useful upvotes for two different kernel parameters. Panel (a): overall knowledge values per item. Panel (b): fraction of upvotes leading to learning per learner; a higher half-life leads to higher sparsity, i.e., a smaller fraction of upvotes cause effective learning.]

[Figure 7: Negative log-likelihood for different values of the kernel parameter (expressed as half-life in days). The y-axis shows the relative difference with respect to the minimum value. The likelihood nearly plateaus over a range of half-lives; the results we present are robust to parameter changes within this range, and we chose 7 days as a representative value.]

Knowledge value and forgetting rate. In this section, we leverage our model to give insights into the knowledge values across items in Stack Overflow for different forgetting rates, i.e., different values of the kernel decay parameter ω. We express ω in units of half-life in days, i.e., the time it takes to forget 50% of the knowledge in an item. Figure 6a shows the distribution of the estimated knowledge value across knowledge items for two kernel parameters, one with a short and one with a long half-life. We find several interesting patterns.
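As an aside on the parameterization: assuming an exponential forgetting kernel, a common choice (the paper's exact kernel form is defined in its model section), a half-life of h days corresponds to a decay rate ω = ln 2 / h:

```python
import math

def decay_rate(half_life_days):
    """Decay rate ω of an exponential forgetting kernel
    κ_ω(t) = exp(-ω t), given the half-life in days."""
    return math.log(2) / half_life_days

def kernel(t_days, half_life_days):
    """Fraction of acquired knowledge retained t_days after learning."""
    return math.exp(-decay_rate(half_life_days) * t_days)

# With a 7-day half-life, half the knowledge remains after one week
# and a quarter after two.
print(round(kernel(7, 7), 3), round(kernel(14, 7), 3))  # 0.5 0.25
```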
Knowledge values in both settings follow a log-normal distribution, in which a small fraction of the items accounts for most of the overall knowledge. However, the fraction of knowledge items that do not contribute any knowledge is larger for the longer half-life. A potential explanation for this difference is that, as the half-life increases, a knowledge item must show evidence of effective learning over longer stretches of time to contribute knowledge, and this happens more rarely. As a consequence, a smaller fraction of upvotes lead to effective learning (i.e., are useful) when the half-life is high, as shown in Figure 6b.

In the remaining sections, for ease of exposition, we set the kernel parameter such that the half-life of knowledge is 7 days (refer to Figure 7); however, the insights obtained in the following sections are robust to changes in the kernel parameter.

Types of learners. Here, our goal is to better understand the types of learners that use crowdlearning sites, as well as their characteristic properties. To this aim, we start by visualizing, in Figure 8, the estimated learning trajectories of four different users: an average learner, an on-site learner, an off-site learner, and an expert. Each of the users exhibits different characteristic properties. For example, the average learner contributes answers with much less knowledge value (0.005) than the expert (0.034), and the on-site learner acquires 55% of her knowledge by learning from items in Stack Overflow, in contrast with the off-site learner, who learns only 0.4% of her knowledge by those means.

[Figure 8: Estimated learning trajectories (learning events, scores, and estimated expertise over time) for four characteristic Stack Overflow users: (a) an average learner (avg. knowledge per contribution: 0.005); (b) an expert (avg. knowledge per contribution: 0.034); (c) an on-site learner (on-site learning: 55%); and (d) an off-site learner (on-site learning: 0.4%). Day 0 in the plots corresponds to 2010-01-01.]

Next, we investigate the interplay between on-site and off-site learning across all users. Given a user $u$, we define her on-site learning as the total expertise gathered by reading knowledge items, $\sum_{a \in \mathcal{A}} \sum_{q \in \mathcal{H}^l_u(T)} \int k_{qa}\, \kappa_\omega(t)\, dt$, her off-site learning as the expertise gathered outside Stack Overflow, $\sum_{a \in \mathcal{A}} \int \mu_{ua}(t)\, dt$, and her overall learning as the sum of both. One can think of these quantities as the aggregate number of upvotes (i.e., score) users would have received on the site through either their on-site or off-site learning if they were posting answers at the same rate. Note that, unlike the reputation on Stack Overflow, which measures how much a user has affected others on the site, on-site and off-site learning reflect how much a user has learned. Figure 9a compares users' on-site and off-site learning by means of a box plot. For users with low off-site learning, those achieving higher on-site learning also achieve higher off-site learning; beyond that, off-site learning becomes dominant.
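Assuming the exponential kernel, the on-site learning aggregate admits a closed form per learning event: an event at time t_q on an item of knowledge value k_q contributes k_q (1 − e^{−ω(T − t_q)})/ω up to time T. A sketch under that assumption (the event layout is illustrative):

```python
import math

def on_site_learning(events, T, omega):
    """Total on-site learning up to time T.

    `events` is a list of (t_q, k_q) pairs: the time a user learned
    from a knowledge item and that item's knowledge value. Each event
    contributes k_q * ∫_0^{T - t_q} exp(-omega * s) ds, which has the
    closed form k_q * (1 - exp(-omega * (T - t_q))) / omega.
    """
    total = 0.0
    for t_q, k_q in events:
        if t_q <= T:
            total += k_q * (1.0 - math.exp(-omega * (T - t_q))) / omega
    return total

# A single unit-value event far in the past saturates at 1/omega.
print(round(on_site_learning([(0.0, 1.0)], T=1000.0, omega=0.1), 3))  # 10.0
```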
Our results seem to indicate that quick learners rely less on on-site learning, in relative terms.

Finally, we investigate, by means of the box plot shown in Figure 9b, the role that a user's starting expertise plays in her overall learning over time. Here, the x-axis corresponds to a user's starting expertise, $\alpha_u$, and the y-axis to her overall learning. Interestingly, we find that users with very low or very high initial expertise, i.e., newbies and experts, tend to increase their knowledge the least; in contrast, users in the middle of the range tend to increase it the most. This is in agreement with previous research, which indicated that, in the presence of only positive reinforcement, the gain in expertise has a sigmoidal shape, i.e., newbies and experts increase their expertise at lower rates than learners with medium levels of expertise [20].

[Figure 9: Behavior of learners for the tag c. Panel (a) compares users' on-site and off-site learning in a box plot: for users with low off-site learning, higher on-site learning goes together with higher off-site learning, but beyond that, off-site learning becomes dominant. Panel (b) shows users' overall learning against their starting expertise in a box plot: users with very low or very high initial expertise, i.e., newbies and experts, tend to increase their knowledge the least, while users in the middle of the range tend to increase it the most. In both panels, the limits of the boxes are the 25%–75% percentiles and the red dashed lines show the median values.]

Learners vs contributors. A crowdlearning site is only useful if it has both learners and contributors. Here, we investigate two natural questions that emerge in this context:

I. Are learners and contributors equally common?
II. Are more prolific learners better contributors?

To answer the first question, we compute the distribution of learned and contributed knowledge per user. Here, we estimate the knowledge value of each contribution (i.e., answer) to a knowledge item by dividing the total knowledge item value across contributions proportionally to their quality scores (upvotes). Figures 10a and 10b summarize the results. Although, in absolute numbers, there are more learners than contributors in our dataset, the amount of knowledge fed into the site by the contributors shows higher variability than the knowledge learned by users: the distribution of contributed knowledge is fat-tailed.

Next, we investigate the second question and assess whether more prolific learners are better contributors. To do so, we calculate the average knowledge value per contribution across users that have learned a similar amount of knowledge over time, i.e., the sum of the knowledge values of all the knowledge items a user learned from, $\sum_{a \in \mathcal{A}} \sum_{q \in \mathcal{H}^l_u(T)} k_{qa}$. Figure 10c shows that the users that learn more knowledge are also more proficient at producing high-knowledge contributions. In other words, our results suggest that "by learning you will teach; by teaching you will learn."

[Figure 10: Learners vs contributors in Stack Overflow. Panels (a) and (b) show the distributions of overall learned and contributed knowledge per user. The former follows a log-normal distribution, while the latter follows a power law: though the contributors are fewer in number than the learners in absolute terms, they show much higher variability. Panel (c) shows a user's average knowledge value per contribution against her overall learned knowledge in a box plot; the red dotted lines show the median values and the box limits are the 25%–75% percentiles. Interestingly, the users that learn more knowledge are also more proficient at producing high-knowledge contributions.]
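The proportional attribution described above (dividing an item's total knowledge value across its answers by upvote share) can be sketched as follows; the even-split fallback for unvoted items is our own assumption:

```python
def split_knowledge(item_value, answer_scores):
    """Divide a knowledge item's total value across its answers in
    proportion to their upvote scores."""
    total = sum(answer_scores)
    if total == 0:
        # Fallback (assumption): split evenly when no answer was upvoted.
        return [item_value / len(answer_scores)] * len(answer_scores)
    return [item_value * s / total for s in answer_scores]

# An item worth 1.0 with answers scoring 6, 3, and 1 upvotes:
print(split_knowledge(1.0, [6, 3, 1]))  # [0.6, 0.3, 0.1]
```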
In this section, we take a step back and discuss the limitations of our model. First, we remark that, due to the large number of parameters in the model, our estimation method needs access to a large amount of data to be accurate. However, this limitation can be overcome, to some extent, by linking the expertise of a user across different platforms or sites (e.g., MOOCs), i.e., our model can easily assimilate traces available for the same user from those sites.

In our model, it is also crucial that the score reflects a true assessment of the knowledge content of an item, and not of, say, the popularity of its contributor. In the case of Stack Overflow, which is a strict and self-regulated community, upvotes are seldom granted to answers which do not address the question; cases of serial upvoting are caught and remedied quickly, which (mostly) prevents users from voting as a thank-you gesture. As a consequence, on Stack Overflow, upvotes on answers are a good assessment of the quality of the posts. However, finding a sensible choice of scores on platforms or sites with milder self-regulation may be challenging.

Finally, the learning events also need to be chosen such that they are not conflated with other objectives the user may have on the website. On Stack Overflow, if a user upvotes only a question, it indicates that she relates to the problem but none of the answers (if any) provide a solution. However, upvoting an answer is evidence that the Q&A pair taught the user something. These unique features and mechanisms afforded by Stack Overflow allow us to easily identify learning events and assessments. Finding similar features in a different social network would require careful reasoning and justification.

In this paper, we proposed a probabilistic model of crowdlearning, naturally designed to fit fine-grained learning and contributing event data. The key innovation of our model is modeling the evolution of users' expertise over time as a latent stochastic process, driven by both off-site and on-site learning. We then developed a scalable estimation method to fit the model parameters from millions of recorded learning events and contributions. Finally, we applied our model to a large set of learning and contributing events from Stack Overflow and found several interesting insights. For example, items with high knowledge value are rare. Newbies and experts acquire less knowledge than users in the middle range. Prolific learners tend to also be proficient contributors who share contributions with high knowledge value.

Our work also opens many interesting avenues for future work. For example, natural follow-ups to potentially improve the expressiveness of our modeling framework include:

1. Consider more complex off-site learning trends, e.g., isotonic regression [18] or exponential/power-law laws of practice [16].

2. Allow a knowledge item to have different knowledge values per user by considering a knowledge distribution per item, and use Bayesian inference [23] to learn the model parameters.

3. Perform a non-parametric estimation of the kernels that model the users' forgetting process. We expect this to allow clustering of knowledge items into those which provide short-lived and those which provide long-lasting knowledge.
4. Incorporate incentive mechanisms such as badges, which are often used in crowdlearning sites [2] and MOOCs [3].

One of the key modeling ideas behind our framework is the realization that users' contributions can be viewed as noisy discrete samples of the users' expertise at points localized (non-uniformly) in time. We could generalize this idea to any type of event data and derive sampling theorems and conditions under which an underlying continuous signal of interest (be it a user's expertise, opinion, or wealth) can be recovered from event data with provable guarantees. Finally, we experimented with data gathered exclusively from Stack Overflow. It would be interesting to apply our model to Stack Exchange at large, to other question answering websites (e.g., AskReddit), microblogging platforms (e.g., Twitter), social networking sites (e.g., Pinterest), or even offline crowdlearning networks (e.g., citation networks).

Acknowledgements. The authors would like to thank Sam Brand from Stack Overflow for providing the data that made this work possible.

References

[1] O. Aalen, O. Borgan, and H. Gjessing. Survival and Event History Analysis: A Process Point of View. Springer Science & Business Media, 2008.
[2] A. Anderson, D. Huttenlocher, J. Kleinberg, and J. Leskovec. Steering user behavior with badges. 2013.
[3] A. Anderson, D. Huttenlocher, J. Kleinberg, and J. Leskovec. Engaging with massive online courses. 2014.
[4] L. Averell and A. Heathcote. The form of the forgetting curve and the fate of memories. Journal of Mathematical Psychology, 55(1), 2011.
[5] R. S. Baker, Z. A. Pardos, S. M. Gowda, B. B. Nooraei, and N. T. Heffernan. Ensembling predictions of student knowledge within intelligent tutoring systems. In User Modeling, Adaption and Personalization. Springer, 2011.
[6] J. E. Beck and J. Mostow. How who should practice: Using learning decomposition to evaluate the efficacy of different types of practice for different types of students. In Intelligent Tutoring Systems, 2008.
[7] D. M. Blei, A. Y. Ng, and M. I. Jordan. Latent Dirichlet allocation. The Journal of Machine Learning Research, 2003.
[8] A. T. Corbett and J. R. Anderson. Knowledge tracing: Modeling the acquisition of procedural knowledge. User Modeling and User-Adapted Interaction, 1994.
[9] C. Eickhoff, J. Teevan, R. White, and S. Dumais. Lessons from the journey: A query log analysis of within-session learning. In Proceedings of the 7th ACM International Conference on Web Search and Data Mining. ACM, 2014.
[10] S. E. Embretson and S. P. Reise. Item Response Theory. Psychology Press, 2013.
[11] M. Farajtabar, N. Du, M. Gomez-Rodriguez, I. Valera, L. Song, and H. Zha. Shaping social activity by incentivizing users. In Advances in Neural Information Processing Systems, 2014.
[12] M. Feng, N. T. Heffernan, and K. R. Koedinger. Looking for sources of error in predicting student's knowledge. In Educational Data Mining: AAAI Workshop, 2005.
[13] S. Ghosh, N. Sharma, F. Benevenuto, N. Ganguly, and K. Gummadi. Cognos: Crowdsourcing search for topic experts in microblogs. ACM, 2012.
[14] J. Gonzalez-Brenes and J. Mostow. What and when do students learn? Fully data-driven joint estimation of cognitive and student models. 2013.
[15] B. V. Hanrahan, G. Convertino, and L. Nelson. Modeling problem difficulty and expertise in Stack Overflow. In ACM 2012 Conference on Computer Supported Cooperative Work Companion. ACM, 2012.
[16] A. Heathcote, S. Brown, and D. Mewhort. The power law repealed: The case for an exponential law of practice. Psychonomic Bulletin & Review, 7(2), 2000.
[17] P. Jurczyk and E. Agichtein. Discovering authorities in question answer communities by using link analysis. ACM, 2007.
[18] S. M. Kakade, V. Kanade, O. Shamir, and A. Kalai. Efficient learning of generalized linear and single index models with isotonic regression. In Advances in Neural Information Processing Systems, 2011.
[19] W.-C. Kao, D.-R. Liu, and S.-W. Wang. Expert finding in question-answering websites. In ACM Symposium on Applied Computing. ACM, 2010.
[20] N. Leibowitz, B. Baum, G. Enden, and A. Karniel. The exponential learning equation as a function of successful trials results in sigmoid performance. Journal of Mathematical Psychology, 54(3), 2010.
[21] G. R. Loftus. Evaluating forgetting curves. Journal of Experimental Psychology: Learning, Memory, and Cognition, 11(2), 1985.
[22] B. Means, Y. Toyama, R. Murphy, M. Bakia, and K. Jones. Evaluation of evidence-based practices in online learning: A meta-analysis and review of online learning studies. US Department of Education, 2009.
[23] K. P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, 2012.
[24] G. R. Norman and H. G. Schmidt. The psychological basis of problem-based learning: A review of the evidence. Academic Medicine, 67(9), 1992.
[25] A. Pal and S. Counts. Identifying topical authorities in microblogs. In Fourth ACM International Conference on Web Search and Data Mining. ACM, 2011.
[26] A. Pal and J. A. Konstan. Expert identification in community question answering: Exploring question selection bias. 2010.
[27] A. Paulina and J. Marta. Study of the temporal-statistics-based reputation models for Q&A systems. Computer Science, 16(3), 2015.
[28] P. I. Pavlik, H. Cen, and K. R. Koedinger. Performance factors analysis: A new alternative to knowledge tracing. IOS Press, 2009.
[29] C. Piech, J. Bassen, J. Huang, S. Ganguli, M. Sahami, L. J. Guibas, and J. Sohl-Dickstein. Deep knowledge tracing. In Advances in Neural Information Processing Systems, 2015.
[30] D. Posnett, E. Warburg, P. T. Devanbu, and V. Filkov. Mining Stack Exchange: Expertise is evident from initial contributions. In International Conference on Social Informatics, 2012.
[31] J. Qiu, J. Tang, T. X. Liu, J. Gong, C. Zhang, Q. Zhang, and Y. Xue. Modeling and predicting learning behavior in MOOCs. In Ninth ACM International Conference on Web Search and Data Mining. ACM, 2016.
[32] C. H. Skinner. Applied comparative effectiveness researchers must measure learning rates: A commentary on efficiency articles. Psychology in the Schools, 47(2), 2010.
[33] I. Valera and M. Gomez-Rodriguez. Modeling adoption and usage of competing products. 2015.
[34] R. W. White, S. T. Dumais, and J. Teevan. Characterizing the influence of domain expertise on web search behavior. In Proceedings of the Second ACM International Conference on Web Search and Data Mining. ACM, 2009.
[35] M. V. Yudelson, K. R. Koedinger, and G. J. Gordon. Individualized Bayesian knowledge tracing models. In Artificial Intelligence in Education. Springer, 2013.
[36] J. Zhang, M. S. Ackerman, and L. Adamic. Expertise networks in online communities: Structure and algorithms. 2007.
[37] K. Zhou, H. Zha, and L. Song. Learning triggering kernels for multi-dimensional Hawkes processes. 2013.
[38] C. Zhu, R. H. Byrd, P. Lu, and J. Nocedal. Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. ACM Transactions on Mathematical Software, 23(4), 1997.