Scoring Popularity in GitHub
Abduljaleel Al-Rubaye
Department of Computer Science, University of Central Florida, Orlando, FL [email protected]
Gita Sukthankar
Department of Computer Science, University of Central Florida, Orlando, FL [email protected]
Abstract: Popularity and engagement are the currencies of social media platforms, serving as powerful reinforcement mechanisms to keep users online. Social coding platforms such as GitHub serve a dual purpose: they are practical tools that facilitate asynchronous, distributed collaborations between software developers while also supporting passive social media style interactions. There are several mechanisms for "liking" content on GitHub: 1) forking repositories to copy their content, 2) watching repositories to be notified of updates, and 3) starring to express approval. This paper presents a study of popularity in GitHub and examines the relationship between these three quantitative measures of popularity. We introduce a weight-based popularity score (WTPS) that is derived from the event history of the other popularity indicators.
Keywords: Full/Regular Research Paper; ISNA (social network analysis, media, and mining); social coding platforms
I. INTRODUCTION
With more than 31 million users and 2.1 million organizations, GitHub is one of the largest social coding platforms [1]. Users store their files in repositories, which can be shared with others; GitHub currently hosts more than 96 million repositories. User interactions generate events that help the repositories perform version control. For instance, the Create event is used to create a repository, the Fork event is triggered when a repository is copied by a user, and the Push event occurs when a user pushes files to a repository [2]. Beyond simple version control, GitHub also supports social engagement between users. Users can express approval of content by starring it or can opt to be notified of content changes by watching repositories.

These measures implicitly serve as a recommendation of the ideas, methods, innovation, and reusability of the content. Being an author of a repository that is frequently forked is similar to being highly cited. GitHub has its own gh-impact score that is comparable to the h-index. Three different measurements of popularity are provided: forks, stars, and watchers. The fork count shows the number of people who were sufficiently interested to clone or copy a repository to their own resources, while watchers are users who have agreed to receive notifications about all repository activities. Stars, similarly, represent the number of users who bookmarked a repository to express approval.

This paper reports on patterns and trends of popularity and social engagement in GitHub. Popularity measures make it easier for users to rapidly find highly recommended content. In the quest to achieve higher ratings, repository owners may be motivated to improve the quality of the hosted work to attract more people. The dark side of popularity ratings is that they may suppress users' creativity, leading them to remove content that is unpopular. For this reason, Instagram obscures the popularity of content to reduce users' anxiety. The next section discusses related work on popularity in social coding platforms.

II. RELATED WORK
Most studies conducted by mining data from social coding platforms have relied on a single popularity metric, such as stars, to quantify repository popularity. Xavier et al. [7] examined the number of developer followers and noted that commit activity is a factor contributing to higher developer popularity. Popularity in other domains, such as authoring blogs or activity on other social networks, can also contribute to GitHub developer popularity.

Borges et al. have conducted several studies on GitHub popularity, analyzing the main factors that can impact starring activity [6]. For instance, they looked at the relationship between starring activity and repository age, number of commits, contributors, and forks. They also studied patterns of popularity growth; based on this work, they proposed a multiple linear regression technique for predicting the number of future stars a repository will receive [9]. Al Rubaye and Sukthankar [10] proposed a model to predict the diffusion of innovation in GitHub. During the model creation process, they used forks as an indicator of repository popularity.

In addition to these studies, there are several online tools that rank repositories by different popularity measurements. For instance, GitHub itself ranks the rock-star repositories in descending order by their daily stars only [3]. Git Awards [4] is a web-based application that gives a repository's ranking on GitHub by language or by geolocation based on the number of stars. Git Most Wanted [5] is another web-based application that ranks repositories based on star or fork counts.

III. METHOD
A. Data Set
To prepare our data set, we utilized the online archive of public repositories provided by the GHTorrent project [12], which extracts all the data using GitHub's REST API. For this work, 36,000 public repositories were randomly captured on October 30th, 2018.

Each record contains a repository object including all the endpoints of the related events, commits, contributors, etc. The data objects also provide repository features and information such as the creation date, the primary programming language, size of the repository, fork count, watcher count, star count, and repository owner information. Figure 1 presents the distribution of the key properties of our data set. To extract the timeline of occurrences for popularity-related events, we used the timestamp property provided by the GitHub API to retrieve the exact time points at which a fork or star event occurred.
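The timeline extraction described above can be sketched as a small bucketing step. This is only an illustration under stated assumptions: the function name `monthly_counts`, the sample timestamps, and the ISO 8601 string format ending in `Z` are hypothetical choices, not details given in the paper.

```python
from collections import Counter
from datetime import datetime

def monthly_counts(timestamps):
    """Count events per (year, month) from ISO 8601 timestamp strings.

    Assumes strings shaped like "2016-10-03T12:00:00Z" (an assumption,
    not a format confirmed by the paper).
    """
    counts = Counter()
    for ts in timestamps:
        moment = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ")
        counts[(moment.year, moment.month)] += 1
    return dict(counts)

# Hypothetical star-event timestamps for a single repository.
star_times = [
    "2016-10-03T12:00:00Z",
    "2016-10-21T08:30:00Z",
    "2017-02-14T19:45:10Z",
]
star_counts = monthly_counts(star_times)
print(star_counts)
```

Keying the buckets by (year, month) keeps events from the same calendar month in different years apart, which matters when repositories in the sample were created years apart.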
Figure 1. Distribution of repository features: forks, stars, watchers, age of repositories, size, and number of repository owner followers.
B. Popularity Growth
The concept of popularity cannot be separated from time. Figure 2 depicts the correlation between repository age (in days) and popularity indicators; as expected, the total number of people who fork, star, or watch a repository increases over time.

Repositories often experience a spike of popularity growth. For instance, repositories created by big tech companies and well-known organizations may attract developers in a single burst. Figure 3 shows an example of such a repository, which was created in October 2016 and received more than 9% of its total stargazer count in February 2017.

Repositories can be grouped into three categories based on their popularity growth over time. The first group includes repositories that have only attracted a small number of developers. They fail to raise their popularity to a noticeable point and remain at the same level for a long time. Consequently, this type of repository experiences little growth. The second group consists of repositories that gain popularity via various means but then lose it when some GitHub users unwatch or unstar them. The last group consists of repositories that keep gaining popularity, with an overall growth rate that is positive over the majority of their lifetime. These repositories may be considered resources of innovation for developers on GitHub [6].

C. Calculating Popularity Score
Popularity in GitHub can be viewed as a measurement of how attracted users are to the repository content. This aura of attractiveness may not be completely captured by the number of watchers, stargazers, or forkers. We propose the use of a weighted popularity score to more accurately express different aspects of popularity.

We illustrate the benefits of other scoring techniques with an example. Imagine there is a repository (A) that is forked by 1000 users and starred by 20, and a repository (B) that has not been forked at all yet but has been starred 1020 times. Considering star count as the popularity indicator, repository (B) is more popular, but if only fork counts are measured, the opposite is true: repository (A) is more popular. We believe that it is useful to model repository growth patterns, examining the timeline of increases in forks and stars. Some repositories may gain popularity steadily while others fluctuate over time, gaining thousands of stars or forks in one month and only a few in the next. The benefit of a weighted score is its ability to model these factors.
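The two-repository example can be checked with a few lines of code. The sketch below uses the fork and star counts from the example; the helper name `rank_by` and the dictionary layout are illustrative assumptions.

```python
# Counts taken from the example: A has 1000 forks and 20 stars,
# B has no forks and 1020 stars.
repos = {
    "A": {"forks": 1000, "stars": 20},
    "B": {"forks": 0, "stars": 1020},
}

def rank_by(metric, data):
    """Return repository names ordered from most to least popular."""
    return sorted(data, key=lambda name: data[name][metric], reverse=True)

rank_forks = rank_by("forks", repos)
rank_stars = rank_by("stars", repos)
print(rank_forks, rank_stars)
```

The two orderings are reversed, which is the disagreement between single indicators that motivates a combined score.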
Popularity Weight: Popularity quantifies interactions between a community's components; it is almost a meaningless concept when considering an element in isolation. A set of repositories can be considered a community, in which the popularity of a repository should be considered relative to other repositories. In other words, if a repository receives more stars than other repositories during a time period, this may be more significant than achieving a higher absolute count of stars.

The idea of time-based popularity weights emerges from the fact that during some time periods we see thousands of developers forking, starring, or watching different repositories, whereas during other periods the total number of popularity-related events is much lower. Hence we treat these periods as having different popularity weights. Consequently, gaining forks, stars, or watchers at certain times may be more valuable than at others. We introduce the Weighted Total Popularity Score (WTPS) to quantify the popularity of a repository that is part of a community; WTPS weights popularity with respect to time. Initially the time interval t is defined to be one month. Therefore, to find the popularity indicators' weights, we take all the forks and stars captured in each month and calculate the weights against the total forks and stars gained by the repositories in our data set. The weights of forks ($WF_t$) and stars ($WS_t$) for each time interval t are calculated as follows:

$$WF_t = \frac{\sum_{i=0}^{n} Forks[R_i]_t}{\sum_{i=0}^{n} Forks[R_i]} \tag{1}$$

$$WS_t = \frac{\sum_{i=0}^{n} Stars[R_i]_t}{\sum_{i=0}^{n} Stars[R_i]} \tag{2}$$

such that $\sum_{t} WF_t = 1$ and $\sum_{t} WS_t = 1$ over all time intervals, where n is the total number of repositories. Note that computing the popularity score of an individual repository without considering that it is part of a larger community of repositories on GitHub is equivalent to setting $WF_t$ and $WS_t$ to 1. Table I shows a sample data set of repositories and their captured forks and stars at different time intervals, along with their calculated $WF_t$ and $WS_t$ values. By calculating the fork and star weights of each time interval, we are able to calculate the weighted total popularity score WTPS for each repository R.

Figure 2. Correlation of age and popularity measurements. (a) shows the age-star correlation (coefficient=0.3215). (b) shows the age-fork correlation (coefficient=0.2923) and (c) shows the age-watcher correlation (coefficient=0.3162). [13]

Figure 3. An example of repository star growth patterns.

Weighted Total Popularity Score (WTPS): The equation below shows how to calculate the WTPS of a repository R at time t:

$$WTPS[R_i]_t = (WF_t \cdot Forks[R_i]_t) + (WS_t \cdot Stars[R_i]_t) \tag{3}$$

Hence, by summing the weighted popularity scores of a repository over all the time intervals, we obtain the overall WTPS as follows:

$$WTPS[R_i]_{overall} = \sum_{t=0}^{m} WTPS[R_i]_t \tag{4}$$

where m is the number of time intervals. Table II shows the WTPS of each of the repositories of Table I at each time point, as well as the overall WTPS, where both forks and stars are utilized to find the overall popularity score. The key intuition is that WTPS considers popularity growth relative to the total community.

IV. RESULTS
This section presents a comparison of the use of WTPS as an indicator of repository popularity vs. the standard measurements.
A. Ranking Popularity
The previous section described an example set of repositories with different fork/star ratios across several time points. Using our proposed approach, we calculated their popularity score, WTPS (Table I and Table II).
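The calculation of the interval weights and of WTPS, as defined by the fork weight, star weight, per-interval score, and overall score equations, can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the dictionary layout, and the two toy repositories `R_a` and `R_b` with two time intervals are all hypothetical, and the numbers are not taken from Table I.

```python
def interval_weights(counts):
    """Weight of each interval: events in the interval across all
    repositories divided by all events across all intervals."""
    n_intervals = len(next(iter(counts.values())))
    per_interval = [
        sum(series[t] for series in counts.values())
        for t in range(n_intervals)
    ]
    total = sum(per_interval)
    if total == 0:
        # No events at all: every weight is zero.
        return [0.0] * n_intervals
    return [value / total for value in per_interval]

def wtps(forks, stars):
    """Per-interval and overall weighted total popularity score."""
    wf = interval_weights(forks)
    ws = interval_weights(stars)
    scores = {}
    for repo in forks:
        per_t = [
            wf[t] * forks[repo][t] + ws[t] * stars[repo][t]
            for t in range(len(wf))
        ]
        scores[repo] = {"per_interval": per_t, "overall": sum(per_t)}
    return scores

# Hypothetical toy data: two repositories observed over two intervals.
forks = {"R_a": [1, 3], "R_b": [3, 1]}
stars = {"R_a": [2, 0], "R_b": [0, 6]}
scores = wtps(forks, stars)
print(scores)
```

In this toy data both repositories have four forks, so fork count alone cannot separate them; `R_b` scores higher because its stars arrive in the interval with the larger star weight.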
Figure 4. The rank of repositories of Table I based on forks, stars, and WTPS.

By conducting a simple comparison between a repository's forks and stars against its WTPS, we see that its popularity level changes. When the number of accumulated forks or stars of a repository is used as the only popularity indicator, repositories may be ranked either lower or higher than when their popularity score is based on other measurements. A repository that is popular compared to other repositories based on the number of gained forks may have a different rank when star count is used as the popularity indicator.

Figure 5 depicts repository rank changes based on the popularity indicator. For instance, even though R has the highest ranking among the repositories in fork count as well as star count, it does not have the highest WTPS. Since WTPS reweights forks and stars based on their relative standing at different time periods, R receives a lower popularity score: at some points R gained forks and/or stars at a time when the popularity weights were relatively low. Below we utilize a linear regression approach [13] to analyze the correlation between WTPS and other repository properties in more detail.

Table I. A sample data set of four repositories and their collected forks and stars at time points [t ... t]. Weights WF and WS were calculated using Equations 1 and 2.

Repository | F t | F t | F t | F t | F t | F total | S t | S t | S t | S t | S t | S total
R | 15 | 20 | 3 | 9 | 7 | 54 | 6 | 12 | 2 | 15 | 10 | 45
R | 10 | 10 | 5 | 3 | 30 | 58 | 6 | 5 | 6 | 5 | 8 | 30
R | | | | | | | | | | | |
R | 14 | 12 | 10 | 5 | 5 | 46 | 12 | 14 | 10 | 13 | 6 | 55
Weight | | | | | | | | | | | |

Figure 5. A sample of 20 repositories' popularity ranking using different popularity measures: forks, stars and WTPS (t=1 month).

Table II. Calculating the Weighted Total Popularity Score (WTPS) for the repositories detailed in Table I.

B. WTPS vs. other popularity indicators
The proposed popularity score, WTPS, is closely related to forks and stars; however, these popularity measurements are weighted differently over time. For instance, a repository that gains a large number of stars relative to other repositories over a month will receive a greater total popularity score.

To see the overall effect of fork count vs. star count on the value of WTPS, we compare their correlation to the total WTPS score (Figure 6). In general, an increase in forks or stars leads to a greater WTPS value. Even though both indicators are strongly related to the value of WTPS, star counts are more highly correlated with the popularity score values. For our data set the star count correlation to WTPS is 0.925, which explains why repositories with more stars are likely to have a high WTPS value. This is aligned with the majority of research that relies on star count alone as an indicator of popularity.

Figure 6. WTPS-Forks and WTPS-Stars correlation.
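A correlation of this kind can be computed with a few lines of code. The sketch below assumes the reported coefficient is a Pearson correlation, which the paper does not state explicitly; the function name `pearson` and the sample values are hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical per-repository values, not data from the paper.
star_counts = [1, 2, 3, 4]
wtps_scores = [2.0, 4.0, 6.0, 8.0]
fork_counts = [4, 3, 2, 1]

r_stars = pearson(star_counts, wtps_scores)
r_forks = pearson(fork_counts, wtps_scores)
print(r_stars, r_forks)
```

The toy values are perfectly linear so the result is easy to verify by hand; real repository data would give coefficients strictly between -1 and 1.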
Table III. Correlations of WTPS and repository properties.

Repository Metrics | WTPS
Age | 0.236089
Owner Followers | 0.136390
Size | 0.000755
Table III shows the correlations; there is not a strong relationship between the popularity score WTPS and age, size, or the owner's follower count. However, the age correlation to WTPS is slightly higher than that of the other metrics, which suggests that time has some effect on the popularity score.

In order to find the best time interval length, we considered three more time intervals: 1 week, 2 weeks, and 3 weeks. To do this, we computed the new WTPS correlation to the other popularity measurements using the different time intervals; Table IV presents these results. WTPS's correlations to forks, stars, and watchers are similar across intervals, while shorter time interval lengths have a steeper positive slope. Overall, WTPS is reasonably robust to interval changes.
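One possible reading of why the slope changes with the interval length while the correlation does not can be sketched with an ordinary least-squares fit. The sketch rests on two assumptions that the paper does not confirm: that each raw count (for example stars) is regressed on WTPS, and that shortening the interval mainly scales WTPS values down because each interval weight becomes smaller. The function name `fit_line` and all numbers are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

# Hypothetical values: WTPS with monthly intervals and star counts.
wtps_month = [1.0, 2.0, 3.0, 4.0]
stars = [3.0, 5.0, 7.0, 9.0]

# Assumed simplification: weekly intervals shrink every WTPS value
# by the same factor (here 4), leaving the star counts unchanged.
wtps_week = [value / 4 for value in wtps_month]

slope_month, intercept_month = fit_line(wtps_month, stars)
slope_week, intercept_week = fit_line(wtps_week, stars)
print(slope_month, slope_week)
```

Under these assumptions the slope grows by the same factor by which WTPS shrinks, while a rescaling of one variable leaves a correlation coefficient unchanged.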
C. Popularity in Collaboration Networks
Repositories can be ranked as being more or less popular when different indicators of popularity are considered. Hence, if the GitHub repositories are mapped to a network, the network will exhibit different properties based on how popularity is scored. To understand the effect of different popularity indicators, we consider different versions of the same network.

The model that is defined here has two types of nodes: repositories and the followers of repository owners, where each repository is simply linked to its owner's followers. Figure 7 presents a sample subset of the model; as shown, the three repositories are linked to sets of three, three, and two followers, respectively. Using this model, we construct a graph of repositories and their followers; Figure 8 shows the graph, where the larger nodes are repositories with higher WTPS.

Figure 7. Subset of an example network: repository, owners' followers, and the repository-follower connections.

Figure 8. A sample graph of 5000 nodes and 5195 edges, where the red nodes represent the repositories and the blue nodes are the followers of the repository owners. The size of the repository nodes is proportional to WTPS popularity.
Table IV. The calculation of a regression line for popularity indicators forks, stars and watchers and WTPS based on different time intervals.

WTPS | Forks Slope | Forks Coefficient | Stars Slope | Stars Coefficient | Watchers Slope | Watchers Coefficient
t = 1 month | 7.259 | 0.726 | 32.267 | 0.925 | 1.728 | 0.691
t = 3 weeks | 10.595 | 0.727 | 47.311 | 0.929 | 2.518 | 0.688
t = 2 weeks | 15.722 | 0.72 | 70.129 | 0.928 | 3.744 | 0.689
t = 1 week | 31.142 | 0.726 | 138.961 | 0.928 | 7.411 | 0.689

In this section the aim is to determine whether selecting different popularity indicators affects the graph's structure. For this purpose we evaluate the impact of deleting nodes according to different popularity measures on the total clustering coefficient of the graph. Initially the total clustering coefficient of the graph is calculated; then at each step we remove the node with the highest popularity. The node slated for removal is selected according to one of four popularity measures: highest WTPS, highest stars, highest forks, or highest watchers, and after each removal the clustering coefficient of the graph is recalculated.

As shown in Figure 9, the clustering coefficient becomes smaller after every deletion. This behavior is similar for all four types of popularity-based node deletion. However, the trends are most similar when the node selection is based on either the highest WTPS score or star count. Deletion based on the number of watchers leads to a slightly different result. Thus we believe that research studies based on extracted GitHub networks should use either WTPS or star counts when modeling information diffusion.
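The deletion procedure can be sketched on a small graph. This is an illustration under stated assumptions: the paper does not say which clustering definition was used, so the sketch takes the average local clustering coefficient; the adjacency-dictionary layout, the function names, the toy graph, and the popularity values are all hypothetical.

```python
def average_clustering(adj):
    """Mean local clustering coefficient of an undirected graph
    given as {node: set_of_neighbours}."""
    if not adj:
        return 0.0
    total = 0.0
    for neighbours in adj.values():
        k = len(neighbours)
        if k < 2:
            continue  # nodes with fewer than two neighbours add 0
        # Count each linked pair of neighbours once.
        links = sum(
            1 for u in neighbours for v in neighbours
            if u < v and v in adj[u]
        )
        total += 2.0 * links / (k * (k - 1))
    return total / len(adj)

def remove_node(adj, node):
    """Delete a node and every edge attached to it."""
    for neighbour in adj.pop(node):
        adj[neighbour].discard(node)

def deletion_curve(adj, popularity, steps):
    """Clustering coefficient before and after each removal of the
    currently most popular node."""
    graph = {node: set(nbrs) for node, nbrs in adj.items()}  # copy
    curve = [average_clustering(graph)]
    ranked = sorted(popularity, key=popularity.get, reverse=True)
    for node in ranked[:steps]:
        remove_node(graph, node)
        curve.append(average_clustering(graph))
    return curve

# Hypothetical toy graph: a triangle a-b-c plus a pendant node d.
toy = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
# Hypothetical popularity scores (could be WTPS, stars, forks, ...).
popularity = {"a": 10.0, "b": 5.0}
curve = deletion_curve(toy, popularity, steps=1)
print(curve)
```

Running `deletion_curve` once per popularity measure and comparing the resulting curves mirrors the comparison across the four measures; swapping in a different clustering definition only requires replacing `average_clustering`.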
Figure 9. Repository deletion impact on one of the principal network characteristics: clustering coefficient.

CONCLUSION
This paper examines several mechanisms for measuring the popularity of GitHub repositories. Our proposed popularity score (WTPS) weights a repository's forks and stars based on a comparison to the popularity gains made by other repositories during the same time period. The key advantage of WTPS is that it is more robust to changes in social media platforms. Since both the content and user population of social coding platforms increase over time, the best way to create a consistent measurement is to normalize increases against growth across the community. Hence, our method requires gathering data from a large sample of GitHub repositories. For this purpose the GitHub API was utilized to collect the repositories.
WTPS was compared against two other popularity indicators, forks and stars, in order to examine their correlation. The results showed that WTPS has a relatively high correlation coefficient with both indicators. However, WTPS tends to be more closely related to stars than to forks. In other words, gaining more stars is more likely to increase WTPS than gaining forks. Thus the use of WTPS is likely to produce results that are consistent with research studies that have relied on stars as a shorthand for popularity.

Furthermore, we examined the correlation of WTPS to several other repository measurements in order to find factors that may influence WTPS. To do so we performed a correlation analysis of WTPS with age, the total number of followers of a repository's owner, and the repository's size. The results returned low correlation coefficients. This is a useful property because it means that WTPS can be used to fairly score a diverse group of repositories. Among these three factors, age had the highest correlation (∼+0.27). Even though the WTPS-Age correlation is not strong, it suggests that the older a repository is, the higher the likelihood of it having a greater WTPS.

We also defined a graph that represents GitHub repositories and their owners' followers, in which a follower of a repository owner is linked to the repository by an edge. The aim was to study the impact of removing popular nodes on the graph's total clustering coefficient. Our popularity-based deletion process was conducted with WTPS, stars, forks, and watchers, in order to compare the impact on the graphs. All four deletion processes led to similar results. However, the results of node deletions for WTPS and stars were the most similar.

To facilitate more research in this area, we have made our dataset publicly available at https://drive.google.com/file/d/1T lNPrIjmjnZr1ihw1hWan2N-mbNhwiR/view?usp=sharing.

V. ACKNOWLEDGMENTS
This work was partially supported by grant FA8650-18-C-7823 from the Defense Advanced Research Projects Agency (DARPA). The views and opinions contained in this article are the authors' and should not be construed as official or as reflecting the views of UCF, DARPA, or the U.S. DoD.