A Bayesian Hurdle Quantile Regression Model for Citation Analysis with Mass Points at Lower Values
=== D R A F T February 10, 2021 ==
A Bayesian Two-part Hurdle Quantile Regression Model for Citation Analysis
Marzieh Shahmandi, Paul Wilson, and Mike Thelwall
Statistical Cybermetrics Research Group, School of Mathematics and Computer Science, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1LY, UK
Keywords:
Quantile regression, Bayesian method, Hurdle model, Markov chain Monte Carlo, Citation analysis, Excess zeros
Abstract
Quantile regression is a technique to analyse the effects of a set of independent variables on the entire distribution of a continuous response variable. Quantile regression presents a complete picture of the effects on the location, scale, and shape of the dependent variable at all points, not just at the mean. This research focuses on two challenges for the analysis of citation counts by quantile regression: discontinuity and substantial mass points at lower counts, such as zero, one, two, and three. A Bayesian two-part hurdle quantile regression model was proposed by King and Song (2019) as a suitable candidate for modeling count data with a substantial mass point at zero. Their model allows the zeros and non-zeros to be modeled independently but simultaneously. It uses quantile regression for modeling the non-zero data and logistic regression for modeling the probability of zeros versus non-zeros. Nevertheless, the current paper shows that substantial mass points at one, two, and three for citation counts will almost certainly affect the estimation of parameters in the quantile regression part of the model in a similar manner to the mass point at zero. We update the King and Song model by shifting the hurdle point from zero to three, past the main mass points. The new model delivers more accurate quantile regression for moderately to highly cited articles, and enables estimates of the extent to which factors influence the chances that an article will be low cited. To illustrate the advantage and potential of this method, it is applied separately to both simulated citation counts and seven Scopus fields, with collaboration, title length, and journal internationality as independent variables.
Corresponding author: Marzieh Shahmandi, [email protected]
Copyright ©2021 The Author(s). Published by MIT Press. All Rights Reserved.
arXiv: . [cs.DL] Feb

INTRODUCTION
Citation analysis can help to estimate the relative importance or impact of articles by counting the number of times that they have been cited by other works. Non-specialists in governments and funding bodies, or even researchers in different scientific disciplines, sometimes use citation counts to help judge the importance of a piece of scientific research (Meho, 2007). Citation analysis has statistical challenges due to the characteristics of citation counts (a substantial mass point at zero, high right skewness, and heteroskedasticity). Various statistical models have been proposed for citation counts (e.g., Brzezinski, 2015; Eom and Fortunato, 2011; Garanina and Romanovsky, 2016; Low et al., 2016; Redner, 1998; Seglen, 1992; Shahmandi et al., 2020; Thelwall, 2016; Thelwall and Wilson, 2014), but most have sought to model the conditional mean of citation counts from independent variables. In other words, they generate a formula for the expected citation count for given values of research-related parameters, such as article age, topic, and the number of authors.

Quantile regression (QR) is a statistical method (Koenker and Bassett, 1978) that complements classical linear regression analysis (e.g., Coad and Rao, 2008; Koenker and Hallock, 2001). Unlike linear regression, where the conditional mean of a dependent variable is modeled, in QR the different conditional quantiles of the dependent variable, such as the median, are modeled based on a set of independent variables. In QR the entire distribution of the dependent variable is related to the set of independent variables. In scientometrics, Danell (2011) used QR to investigate whether the future citation rate of an article can be predicted from the author's publication count and previous citation rate. In a study of nanotechnology publications, QR was used to investigate whether funding acknowledgements influence journal impact factors and citation counts, as two dependent variables in two separate models (Wang and Shapira, 2015). Stegehuis et al. (2015) proposed a QR-based model to estimate a probability distribution for the future number of citations of a publication in relation to variables such as the publishing journal's impact factor. Anauati et al. (2016) assessed the life cycle of articles across fields of economic research through QR. Ahlgren et al. (2017) used QR to show how some factors, such as the number of cited references, affect the field normalised citation rate across all disciplines.
In another study, Xing (2018) used QR models to explore the relationship between SCI (Science Citation Index) editorial board representation and the research output of universities (measured by the number of articles, total number of citations, . . . ) in the field of computer science. Mäntylä and Garousi (2019) applied QR at the . quantile to indicate how factors such as publication venue and author team past citations influence the number of citations of software engineering papers. Galiani and Gálvez (2019) proposed identifying citation ageing by combining QR with a non-parametric specification to capture citation inflation. Despite this extensive use of QR for citation analysis, the problem of the influence of point masses (low citation counts having high frequencies in a set of articles) has not been fully resolved, undermining the value of the results.

The continuity of the dependent variable is important for minimisation of the objective function in QR. A discrete dependent variable leads to non-differentiability of the objective function, resulting in problems deriving the asymptotic distribution of the conditional quantiles. A substantial mass point at zero in the data results in all conditional quantiles below the percentage of zeros being equal to zero. In some of the articles cited above, the discontinuity of citation counts was ignored, leading to biased and misspecified estimates for parameters in the model. In other articles, citation counts were normalised by different methods, or a random positive value was added to each citation count to account for the discontinuity. In general, when there is a substantial mass point at zero, jittering (Machado and Silva, 2005) is used: random noise in the interval (0, 1) is added to each data point to make the data continuous.
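As a minimal sketch of the jittering step just described, the following Python snippet adds uniform noise on the unit interval to a handful of counts. The function name `jitter` and the toy counts are illustrative assumptions and do not come from the paper or from Machado and Silva (2005).

```python
import math
import random


def jitter(counts, rng):
    """Add independent Uniform[0, 1) noise to each integer count."""
    return [y + rng.random() for y in counts]


rng = random.Random(0)            # fixed seed so the sketch is reproducible
counts = [0, 0, 0, 1, 2, 5, 12]   # hypothetical citation counts with ties at zero
jittered = jitter(counts, rng)

# Jittering breaks the ties but loses no information:
# taking the floor recovers the original counts.
recovered = [math.floor(z) for z in jittered]
```

Because the noise is smaller than one, the ordering between distinct counts is preserved, while variation within a tied group (for example, among the zeros) is purely random. This is why the lower quantiles of jittered data are hard to interpret.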
In this situation, researchers frequently focus on the interpretation of the upper quantiles of the dependent variable because the apparent variation in the lower tail might be a consequence of random noise produced by the jittering process. In practice, some important parts of the analysis can be lost. For instance, with citation counts as the dependent variable, we can lose the information about the effects of factors (the independent variables in the model) on uncited or very low cited articles. Therefore, a new methodology related to QR should be considered to tackle these challenges.

The approach proposed in this article is an extension of the Bayesian two-part hurdle QR model of King and Song (2019). The two-part structure is a fundamental aspect of this model: it allows zero and non-zero citations to be modeled separately. The QR part of the model is for modeling the non-zeros, and logistic regression is used for modeling the probability of zeros versus non-zeros. The Bayesian structure of the model assists the estimation of model parameters. In citation count data there are frequently substantial mass points at one, two, and three (and possibly also at greater values), which influence the estimates of parameters in the QR part of the model in a similar manner to the substantial mass point at zero. An update of the model is therefore proposed to reduce the effect of the substantial mass points on the estimation of the model. Based on simulations of log-normal continuous data with substantial mass points at zero, one, two, and three (approximating a common distribution of citation counts), this paper assesses whether the QR part of the two-part model with a hurdle at three results in more accurate estimates than the other models, using the mean square error of the estimates of the coefficients of the independent variables. We also assess prediction errors and credible intervals for the estimates.

DEFINITIONS AND CONCEPTS
Quantile regression
Gilchrist (2000) describes a quantile as “the value that corresponds to a specified proportion of an (ordered) sample of a population”. The quantiles are the values which divide the distribution such that there is a given proportion of observations below the quantile. Thus the $\tau$th quantile splits the area under the density curve into two parts: one with area $\tau$ below the $\tau$th quantile and the other with area $1 - \tau$ above it. The best-known quantile is the median, which is the 0.5 quantile. The median is a measure of the central tendency of the distribution: half the data are less than or equal to it and half are greater than or equal to it. In general, for any $\tau$ in the interval $(0, 1)$ and any continuous random variable $Y$ with cumulative distribution function $F_Y$, the $\tau$th quantile $y_\tau$ of $Y$ satisfies

$$F_Y(y_\tau) = P(Y \le y_\tau) = \tau,$$

and the quantile function can be defined as

$$y_\tau = F_Y^{-1}(\tau) = \inf\{\, y \mid F_Y(y) \ge \tau \,\}.$$

The regression model for the conditional quantile level $\tau$ of $Y$ is (Koenker and Bassett, 1978)

$$Q_{Y_i}(\tau \mid x_i) = x_i^T \beta_\tau \tag{1}$$

where $x_i$ is the $i$th vector of $p$ independent variables, and $\beta_\tau$ is estimated by minimisation of the sample objective function, a weighted sum of absolute residuals:

$$\min_{\beta_\tau} \sum_{i=1}^{n} \rho_\tau\left(y_i - x_i^T \beta_\tau\right) \tag{2}$$

where $n$ is the number of observations and $\rho_\tau(r) = \tau \max(r, 0) + (1 - \tau) \max(-r, 0)$ is the check loss function.

Quantiles are preserved under monotone transformations. Suppose that $\eta(\cdot)$ is a non-decreasing (monotone) function on $\mathbb{R}$; then

$$Q_{\eta(Y)}(\tau \mid x) = \eta\left(Q_Y(\tau \mid x)\right).$$

This is important because, in the following, transformations need to be used for citation analysis.
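The check loss of Equation 2 and the equivariance property can be illustrated with an intercept-only model, where the minimiser is simply a sample quantile. The sketch below is a toy illustration under that assumption; the function names and data are hypothetical, and the brute-force search over data points is a stand-in for the linear programming methods normally used for QR.

```python
import math


def check_loss(r, tau):
    """rho_tau(r) = tau * max(r, 0) + (1 - tau) * max(-r, 0)."""
    return tau * max(r, 0.0) + (1.0 - tau) * max(-r, 0.0)


def objective(q, ys, tau):
    """Sample objective of Equation 2 for an intercept-only model."""
    return sum(check_loss(y - q, tau) for y in ys)


def sample_quantile(ys, tau):
    """Minimise the objective over the data points.

    The objective is piecewise linear in q, so a minimiser is always
    attained at one of the observations.
    """
    return min(sorted(ys), key=lambda q: objective(q, ys, tau))


ys = [1.0, 2.0, 4.0, 8.0, 16.0]          # toy, strictly positive data
median = sample_quantile(ys, 0.5)

# Equivariance: the median of log(Y) equals the log of the median of Y.
log_median = sample_quantile([math.log(y) for y in ys], 0.5)
```

The same property does not hold for the mean, which is one reason why a log transformation of jittered counts is convenient in quantile regression.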
The Asymmetric Laplace distribution in Bayesian QR
In the following, the (three-parameter) Asymmetric Laplace Distribution (ALD) is defined. Because of the distribution-free characteristic of QR, the minimisation of Equation 2 can be considered as a non-parametric problem. This can cause a challenge for defining the Bayesian version of QR because the Bayesian framework needs the likelihood function of the model. Different approaches have been suggested for this issue, but the ALD method proposed by Yu and Moyeed (2001) is the simplest and most understandable.

The ALD has probability density function

$$f(y_i \mid \mu, \sigma, \tau) = \frac{\tau(1 - \tau)}{\sigma} \exp\left\{ -\rho_\tau\left( \frac{y_i - \mu}{\sigma} \right) \right\} \tag{3}$$

where $\mu \in \mathbb{R}$, $\sigma > 0$, and $\tau \in (0, 1)$ are respectively the location, scale, and skewness parameters.

A random variable $W \sim ALD(\mu, \sigma, \tau)$ has a location-scale mixture representation in terms of a normal distribution with specific parameters (e.g., Kozumi and Kobayashi, 2011; Lee and Neocleous, 2010). Writing $W_i = \mu + \theta v_i + \psi \sqrt{\sigma v_i}\, u_i$, we have

$$W_i \mid v_i \sim N\left(\mu + \theta v_i,\ \psi^2 \sigma v_i\right) \tag{4}$$

where

$$\theta = \frac{1 - 2\tau}{\tau(1 - \tau)}, \qquad \psi^2 = \frac{2}{\tau(1 - \tau)},$$

and where $u_i$ and $v_i$ are independent random variables, $u_i$ follows a standard normal distribution, and $v_i$ is exponentially distributed with mean $\sigma$. This valuable feature of the ALD enables the use of QR in the Bayesian framework. By considering the location parameter in the ALD as a linear function of the independent variables, $\mu_i = x_i^T \beta_\tau$, the maximum likelihood estimate of $\beta_\tau$ under Equation 3 is equivalent to the estimate obtained from the minimisation of Equation 2, for every fixed $\tau$. QR may therefore be regarded as linear regression in which the error term follows the ALD. The ALD provides a likelihood for the data in the Bayesian framework, which holds the data fixed and treats the parameters as random variables that are described probabilistically by prior knowledge.
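The mixture representation in Equation 4 can be checked numerically. The following sketch draws from the normal-exponential mixture and verifies that roughly a share $\tau$ of the draws falls below the location $\mu$, as expected if $\mu$ is the $\tau$th quantile of the ALD. The function name and the parameter values are illustrative assumptions, not taken from the paper.

```python
import math
import random


def ald_mixture_draw(mu, sigma, tau, rng):
    """One draw of W = mu + theta*v + psi*sqrt(sigma*v)*u (Equation 4)."""
    theta = (1.0 - 2.0 * tau) / (tau * (1.0 - tau))
    psi = math.sqrt(2.0 / (tau * (1.0 - tau)))
    v = rng.expovariate(1.0 / sigma)   # exponential with mean sigma
    u = rng.gauss(0.0, 1.0)            # standard normal
    return mu + theta * v + psi * math.sqrt(sigma * v) * u


rng = random.Random(1)
mu, sigma, tau = 2.0, 1.5, 0.25        # hypothetical parameter values
draws = [ald_mixture_draw(mu, sigma, tau, rng) for _ in range(20000)]

# For an ALD(mu, sigma, tau) variable, P(W <= mu) = tau.
share_below_mu = sum(w <= mu for w in draws) / len(draws)
```

Conditional on $v_i$, each draw is normal, which is what makes Gibbs sampling convenient for this model.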
The combination of the evidence extracted from the data (likelihood) and the prior beliefs is a posterior distribution for the parameters. A Gibbs sampler, a Markov chain Monte Carlo (MCMC) method, is used for the approximation of the posterior distribution.

BAYESIAN TWO-PART HURDLE QR
Santos and Bolfarine (2015) proposed a Bayesian two-part QR methodology for a continuous response variable with a substantial mass point at zero or one. King and Song (2019) introduced a Bayesian two-part QR model with a hurdle at zero for the case of count data with a substantial mass point at zero. In the following, a new version of this model for the case of a hurdle at a specific value $c$ is introduced. To fit this model, the count data are first transformed by

$$y_i^* = \begin{cases} 0 & y_i \le c \\ \ln(y_i - c - u_i) & y_i > c \end{cases}$$

to provide a semi-continuous variable, where $u_i$ is the jittering noise in the interval $(0, 1)$. By this transformation, all substantial mass points less than or equal to $c$ are mapped to zero and the rest of the data are converted to real numbers in the domain of the ALD.

The two-part probability function has the form

$$f(y_i^* \mid \gamma, \beta_\tau, \sigma, v_i, \tau) = \omega_i\, I(y_i^* = 0) + (1 - \omega_i)\, N\left(x_i^T \beta_\tau + \theta v_i,\ \psi^2 \sigma v_i\right) I(y_i^* \ne 0) \tag{5}$$

where $\omega_i = P(y_i^* = 0)$ and $I$ is the indicator function. A logit link is usually applied to model $\omega_i$ based on a linear combination of the independent variables:

$$\mathrm{logit}(\omega_i) = z_i^T \gamma \tag{6}$$

where $z_i$ is a vector of independent variables. The variables used to model $\omega_i$ may or may not be the same as those used to model the non-zero data.

The two-part model is a mixture of a continuous normal distribution (corresponding to QR for modelling the jittered non-zero citation counts) and a point distribution at zero. $\omega_i$ and $1 - \omega_i$ are respectively the contributions of the point distribution and the continuous distribution to this mixture. This is a hurdle model because the zeros and non-zeros are modeled separately.

From Equation 6, Equation 5 can be rewritten as

$$
\begin{aligned}
f(y_i^* \mid \cdot) &= \omega_i^{I(y_i^* = 0)} \left[ (1 - \omega_i)\, N\left(x_i^T \beta_\tau + \theta v_i,\ \psi^2 \sigma v_i\right) \right]^{I(y_i^* \ne 0)} \\
&= \omega_i^{I(y_i^* = 0)} (1 - \omega_i)^{I(y_i^* \ne 0)} \left[ N\left(x_i^T \beta_\tau + \theta v_i,\ \psi^2 \sigma v_i\right) \right]^{I(y_i^* \ne 0)} \\
&= \left[ \frac{1}{1 + \exp(-z_i^T \gamma)} \right]^{I(y_i^* = 0)} \left[ \frac{1}{1 + \exp(z_i^T \gamma)} \right]^{I(y_i^* \ne 0)} \times \left[ N\left(x_i^T \beta_\tau + \theta v_i,\ \psi^2 \sigma v_i\right) \right]^{I(y_i^* \ne 0)}
\end{aligned}
\tag{7}
$$

By considering non-informative priors:

$$\pi(\beta_\tau) \sim N(\tilde{b}, \tilde{B}), \qquad \pi(v_i) \sim E(\sigma), \qquad \pi(\sigma) \sim IG(\tilde{n}, \tilde{s}), \qquad \pi(\gamma) \sim N(\tilde{g}, \tilde{G})$$

where $E$ denotes the exponential distribution with mean $\sigma$ and $IG$ denotes an inverse gamma distribution with the hyperparameters $\tilde{n}$ and $\tilde{s}$, the posterior distribution of the model is

$$\pi(\beta_\tau, \gamma, \sigma, v_i \mid y^*) \propto L(y_i^* \mid \gamma, \beta_\tau, \sigma, v_i, \tau)\, \pi(\beta_\tau)\, \pi(\gamma)\, \pi(\sigma)\, \pi(v_i \mid \sigma) \tag{8}$$

where $L(y_i^* \mid \cdot)$ is the likelihood function of $f(y_i^* \mid \cdot)$. To approximate the posterior distribution, the Gibbs sampler of the MCMC method will be used.

SIMULATION STUDY
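As a concrete illustration of the hurdle transformation and of the two-part density in Equations 5 to 7 of the model section, the following Python sketch evaluates both for a few toy values. It is a hedged sketch rather than the authors' implementation: the function names, the scalar linear predictors `x_beta` and `z_gamma`, and all numerical values are assumptions made for illustration.

```python
import math
import random


def hurdle_transform(y, c, u):
    """y* = 0 if y <= c, otherwise ln(y - c - u) with jitter u in [0, 1)."""
    return 0.0 if y <= c else math.log(y - c - u)


def two_part_log_density(y_star, x_beta, z_gamma, v, sigma, tau):
    """Log of Equation 7 for one observation, given the latent v > 0."""
    omega = 1.0 / (1.0 + math.exp(-z_gamma))      # logit link, Equation 6
    if y_star == 0.0:
        return math.log(omega)                    # point mass at zero
    theta = (1.0 - 2.0 * tau) / (tau * (1.0 - tau))
    psi_sq = 2.0 / (tau * (1.0 - tau))
    mean = x_beta + theta * v
    var = psi_sq * sigma * v
    log_normal = -0.5 * math.log(2.0 * math.pi * var) - (y_star - mean) ** 2 / (2.0 * var)
    return math.log(1.0 - omega) + log_normal     # continuous part


rng = random.Random(2)
counts = [0, 1, 3, 4, 10]                         # hypothetical citation counts
c = 3                                             # hurdle past the mass points 0..3
transformed = [hurdle_transform(y, c, rng.random()) for y in counts]

# A count at or below the hurdle contributes only through omega.
log_f_zero = two_part_log_density(0.0, x_beta=1.0, z_gamma=0.0, v=1.0, sigma=1.0, tau=0.5)
log_f_pos = two_part_log_density(1.0, x_beta=1.0, z_gamma=0.0, v=1.0, sigma=1.0, tau=0.5)
```

With `c = 0` the transformation reduces to the hurdle-at-zero case of King and Song (2019); raising `c` to 3 moves the counts 1, 2, and 3 from the quantile regression part into the logistic part.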
In this section, samples with sizes of 500, 1000 and 3000 are simulated from a continuous log-normal distribution (LN) with mean (1. − . ∗ x_1 + 0 ∗ x_2 + ε) and standard deviation . , where x_1 ∼ LN(0, ), x_2 ∼ N(0. , . ), and ε ∼ N(0, ). The log-normal distribution was chosen because it approximates the typical distribution of citation counts. The floor function was applied to the simulated values less than 4 to simulate substantial mass points at 0, 1, 2, and 3. The intercept value of . and the coefficient − . of x_1 were chosen so that approximately 40% of the data are zeros and 80% of the data are less than 4, similar to much citation count data. The coefficient of x_2 was chosen as zero to enable comparison of the proposed models when one of the variables is non-significant.

Bayesian QR and Bayesian two-part QR models with hurdle at 0 and with hurdle at 4 are used. The objective is to compare Bayesian QR with the QR parts of the two-part models with hurdles at 0 and 4. For each sample size and each quantile level, the Bayesian QR model is first fitted to the whole data. Then the quantile level of the corresponding quantile value is found in the data from which the zeros are excluded, and the Bayesian QR model is fitted at that level (i.e., the QR part of the two-part model with hurdle at 0). Next, the quantile level of the same quantile value is found in the data from which all substantial mass points (0, 1, 2, and 3) are removed, and the Bayesian QR model is fitted at that level (this model is the QR part of the two-part model with hurdle at 4). To clarify, suppose the specific quantile level is . , and the Bayesian QR model is fitted to the whole data at this quantile. Say the value corresponding to this quantile is . . Now the quantile level corresponding to . for the data with zeros removed is computed; say it is . . Then the Bayesian QR model is fitted to this data (> 0) for the . quantile.
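The quantile-level mapping used in this procedure can be sketched with a small toy data set. The counts below are hypothetical and chosen only so that the arithmetic is easy to follow; they are not the simulated data of the paper, and the helper names are assumptions.

```python
import math


def empirical_quantile(values, tau):
    """Smallest observed y with empirical CDF F(y) >= tau."""
    ordered = sorted(values)
    k = max(math.ceil(tau * len(ordered)) - 1, 0)
    return ordered[k]


def quantile_level(values, q):
    """Empirical CDF of `values` evaluated at q."""
    return sum(v <= q for v in values) / len(values)


# Toy counts: mass points at 0, 1, 2, 3 plus a few larger values (n = 20).
counts = [0] * 8 + [1] * 3 + [2] * 2 + [3] * 2 + [5, 7, 9, 12, 20]

tau = 0.9                                   # quantile level in the whole data
q = empirical_quantile(counts, tau)         # corresponding quantile value

no_zeros = [y for y in counts if y > 0]     # data for the hurdle-at-0 QR part
above_hurdle = [y for y in counts if y >= 4]  # data for the hurdle-at-4 QR part

level_no_zeros = quantile_level(no_zeros, q)
level_above_hurdle = quantile_level(above_hurdle, q)
```

In this toy example the same citation value corresponds to a lower quantile level each time more mass points are removed, which is why each QR part is fitted at its own re-mapped level rather than at the original one.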
This model is the QR part of the two-part model with hurdle at 0. Next, the quantile level corresponding to . is found in the data with the substantial mass points at 0, 1, 2, and 3 removed. Say this is . . The Bayesian QR model is fitted to this data (≥ 4) at this quantile. The estimates of parameters based on this model are the QR part of the two-part model with hurdle at 4. This process is repeated 100 times for each sample size, and for 4 specific quantiles of . , . , . , and . (of the whole data) separately. For each combination, the prediction error of each model, the mean square error of the parameters' estimates (intercept excluded), and the width of the credible intervals for both independent variables x_1 and x_2 are computed. A sequence of quantiles from . to is considered. The function bayesQR from the R library bayesQR was used for fitting the Bayesian QR. 5000 iterations with a burn-in of 50 were used for the MCMC computation. R code for the simulations is available online (https://doi.org/10.6084/m9.figshare.13726198.v1). Finally, for each model, the boxplots of the prediction errors, the mean square errors of the parameters' estimates, and the widths of the credible intervals are compared. The results are reported in Figure 1, Figure 2, Figure 3, and Figure 4.

Figure 1 shows the prediction errors ($y - \hat{y}$), where $\hat{y}$ is calculated for each fitted model for each quantile and sample size. It shows that the prediction error for the QR part of the two-part model with hurdle at 4 is smallest, followed by the model with hurdle at 0, and then by the Bayesian QR.

Figure 1. Prediction errors based on different models and sample sizes

Figure 2 displays the mean square errors of the parameters' estimates, computed with the formula

$$MSE = \frac{1}{p} \sum_{i=1}^{p} \left(\beta_i - \hat{\beta}_i\right)^2$$

where $p$ is the number of independent variables, not including the intercept, in the model. All calculations were based on a sample size of 1000 and quantile levels in { , , , , , , , , } (of the whole data). The first member of the quantile set was chosen as . because its corresponding value is greater than four. Smaller values (near zero) of the mean square errors are desirable. The results show that, in general, the QR part of the model with hurdle at 4 has the most precise estimates for most of the quantiles, followed by Bayesian QR, followed by the QR part of the two-part model with hurdle at 0. However, as the quantile level increases, the mean square errors become larger for all three models and the differences between the models grow. Overall, the model with hurdle at 0 has the poorest estimates.

Figure 2. Mean square errors for the parameter estimates (excluding intercept) for different models with a sample size of 1000

Figure 3 illustrates the width of credible intervals for the estimates of the coefficient of x_1 based on the different models and sample sizes. The credible intervals are based on the percentiles of the posterior probability distribution. Credible intervals are the Bayesian counterparts of confidence intervals in classical statistics. Increasing the sample size from 500 to 3000 decreases the width of the credible interval considerably. Moreover, the QR part of the model with hurdle at 4 has the largest width for all the quantiles, followed by the QR part of the model with hurdle at 0, followed by the Bayesian QR.

Figure 3. Width of credible intervals for the estimates of x_1

Figure 4 shows the widths of the credible intervals for the estimates of the coefficient of x_2 based on the different models and sample sizes. The coefficient of x_2 in the model from which the data were simulated was 0; that is, x_2 is not significant. The figure shows that all three models have approximately the same width at each quantile.

Figure 4. Width of credible intervals for the estimates of x_2

The results of the simulation show that the QR part of the two-part model with hurdle at 4 in general returns more accurate estimates, based on the mean square errors of the estimates of parameters (excluding the intercept), and that all three models have larger mean square errors as the quantile level increases. In addition, the QR part of the two-part model with hurdle at 4 results in smaller prediction errors at the cost of slightly wider credible intervals. It is followed by the QR part of the model with hurdle at 0, and then by the Bayesian QR model. Moreover, a larger sample size decreases the differences between the models in the width of their credible intervals.

CITATION COUNT EXAMPLE
The data used in this article consist of citation counts for standard journal articles (excluding reviews) published in the following seven Scopus fields: Arts and Humanities (all), Literature and Literary Theory, Religious Studies, Visual Arts and Performing Arts, Media Technology, Architecture, and Emergency Nursing. The articles were published in the year 2010 and their information was extracted at the end of the year , giving the citation counts time to mature. These seven fields were selected because, after discarding the records with missing cells for the computation of the dependent and possible independent variables, they have the highest proportions of zeros and a maximum of 3000 records. The effect of sample size on computation time is a limitation of the method because MCMC is time-consuming. The number of citations to each article is the dependent variable. The independent variables available were the number of keywords, the number of pages, title length, abstract length, collaboration (the number of authors of an article), international collaboration, abstract readability, and journal internationality. Collaboration, title length, and journal internationality were selected as independent variables because with this selection fewer records with missing data had to be discarded and the percentage of zeros in the data remained high. The selected variables have a reasonably strong correlation with the corresponding citation counts for most of the seven fields.
Table 1. Details of the citation count data for the seven fields from 2010 analysed

| Field | Number of articles | Percentage of zeros | Percentage of ones | Percentage of twos | Percentage of threes | Percentage of substantial mass points in 1 ≤ y ≤ 3 | Total percentage of substantial mass points |
|---|---|---|---|---|---|---|---|
| Literature and Literary Theory | 3126 | 51 | 18 | 11 | 5 | 34 | 85 |
| Arts and Humanities | 1460 | 41 | 16 | 8 | 7 | 31 | 72 |
| Visual Arts and Performing Arts | 1799 | 39 | 16 | 10 | 6 | 32 | 71 |
| Architecture | 2215 | 36 | 16 | 10 | 7 | 33 | 69 |
| Religious Studies | 2176 | 32 | 18 | 11 | 8 | 37 | 69 |
| Emergency Nursing | 1299 | 39 | 10 | 6 | 5 | 21 | 60 |
| Media Technology | 1889 | 27 | 8 | 6 | 6 | 20 | 47 |
The highest fractions of zeros are found in Literature and Literary Theory and Arts and Humanities respectively (Table 1). The proportions of ones, twos, and threes are smaller than the proportion of zeros but are still noticeable. There are no substantial mass points greater than 3 for the fields. The percentage of substantial mass points greater than zero in the fields varies from approximately 20% for Media Technology up to 37% for Religious Studies. The total percentages of substantial mass points for the fields of Literature and Literary Theory and Media Technology are the largest (85%) and smallest (47%) respectively. A sequence of quantiles from . to . is considered. Ordinary Bayesian QR, Bayesian two-part QR with hurdle at 0, and Bayesian two-part QR with hurdle at 3 were fitted to the datasets. MCMC sampling was carried out with the JAGS function of the R2jags package in R. 100000 iterations with a burn-in of 50000 and a thinning size of 160 were used for the MCMC computation. Collaboration, title length, and journal internationality were included in the models as independent variables. Collaboration and title length are discrete variables. The log of collaboration was used in the models to provide a closer to linear relationship with the citation counts. Journal internationality is a continuous variable on the interval [0, 1], computed with the Gini coefficient. A value of 0 shows the highest level of internationality of the journal related to the article, and a value of 1 shows the least internationality.

Comparison of the QR part of the two-part models with the Bayesian QR model
In the following, the results related to the QR parts of the Bayesian two-part QR models with hurdle at 0 and 3 are compared to the results of the Bayesian QR.

In Figure 5 and Figure 6, the linear effect of collaboration and its credible intervals over all the quantiles of the citation counts in different fields are shown. The credible intervals, based on the percentiles of the posterior probability distribution, are the Bayesian counterparts of confidence intervals in classical statistics. The upper and lower boundaries of the credible intervals are represented by dashed lines. A narrower band illustrates a smaller variance for the estimated parameter. When a band includes zero, it indicates a non-significant effect for the variable.

The effects of collaboration on citation counts in Bayesian QR are significantly positive at all quantiles for all fields. In comparison with the two-part models with hurdle at 0 and 3, the effect of collaboration in the Bayesian QR model fluctuates more across the quantiles and is larger in size for most fields. This demonstrates that the substantial mass points in the data can influence the pattern of the factor in this model over the quantiles. For the fields of Literature and Literary Theory, Arts and Humanities, and Visual Arts and Performing Arts, the total percentages of the substantial mass points are the largest, and the difference in the pattern of the effect between the Bayesian QR model and the two-part models is clearer. It can be deduced that discarding the substantial mass points of the citation counts influences the size and shape of the effect, making it flatter, and also affects the significance of the effect for some fields across all quantiles. For example, collaboration shows a non-significant impact on citation counts for the field of Media Technology based on the model with hurdle at zero, and for the fields of Media Technology and Architecture based on the model with hurdle at 3. Moreover, when the size of the substantial mass points at 1, 2, and 3 is small, for example in the fields of Emergency Nursing and Media Technology, the estimates of the effect based on the two-part models with hurdle at 0 and 3 are close. This further indicates that substantial mass points bias quantile regression, and emphasises the importance of placing the hurdle after such substantial mass points. For most of the fields, based on the two-part models, the effect is stable over the quantiles, indicating that collaboration equally influences the moderately cited and highly cited articles, while based on the Bayesian QR, for more fields the effect gradually increases and mostly influences the highly cited articles.

Figure 5. Comparison of the parameter estimates for collaboration (β) over the quantiles of the citation count distribution

Figure 6. The parameter estimates for collaboration (β) over the quantiles of the citation count distribution separated by types of the models

In previous studies, collaboration has sometimes (but not always) been shown to be related to citation counts. Previous research has used different data sets and statistical methods to assess the relationship between citation counts and collaboration, with differing results. For example, Bornmann et al. (2012) used a negative binomial regression model (with a log link) for approximately 2000 manuscripts that were submitted to the journal Angewandte Chemie International Edition (AC-IE). The estimated coefficient of collaboration was . (i.e., for each unit increase in collaboration, the log of the citation count increases by . on average) with a p-value greater than . . However, Borsuk et al. (2009) used ordinary least squares (OLS) regression to analyse six journals (Animal Behaviour, Behavioral Ecology, Behavioral Ecology and Sociobiology, Biological Conservation, Journal of Biogeography, and Landscape Ecology) from 1997 to 2004 and estimated that the effect size and p-value were . and . respectively. By applying the negative binomial hurdle model, Didegah (2014) also showed that collaboration has a significant positive impact on citation counts for all subjects of the Web of Science except Physics, where the effect is non-significant.

Figure 7 and Figure 8 display the linear impact of the length of title on the citation count distribution across the quantiles in the various fields.
Based on the Bayesian QR model, the effect is in general mostly positive but very small, and statistically significant for only some quantiles in some fields. According to this model, the effect fluctuates around particular values over the quantiles, illustrating that low, moderately, and highly cited articles are equally influenced by it. The effect size based on the Bayesian QR is the largest for most of the fields and quantile levels in comparison to the two-part QR models with hurdle at 0 and hurdle at 3. Based on the two-part models with hurdle points of 0 and 3, the effect is significant for only a few fields and quantiles. By skipping the zero mass point and fitting the Bayesian two-part QR model with hurdle at 0, the effect shows a smoother pattern in comparison to the Bayesian QR. Based on these two models, it seems that moderately and highly cited articles benefit equally from a longer title. The effect based on the two-part model with hurdle at 3 shows a slightly different pattern, but is still small for some quantiles in some fields. For example, it shows the smallest effect size for the lower quantiles of the citation counts for the fields of Visual Arts and Performing Arts and Religious Studies, which have high percentages of mass points in ≤ y ≤ . According to the previous study of Haslam et al. (2008), which used a correlation test and regression, longer titles had a negative impact on citation counts in psychology. In addition, by applying the negative binomial hurdle model to different subjects of the Web of Science, Didegah (2014) showed that mean title length is negatively associated with nonzero citation counts in some fields, such as Economics & Business, Computer Science, and Chemistry, but non-significantly in the fields of Clinical Medicine, Multidisciplinary, and Physics.

Figure 9 and Figure 10 illustrate how journal internationality influences the citation counts at all quantiles in the different fields. As mentioned, a smaller value of the Gini coefficient corresponds to greater journal internationality, indicating that the journals in the field published articles from a broad range of countries. Based on the Bayesian QR, the Gini coefficient has a significant negative effect on the citation counts over all the quantiles for all the fields except
Visual Arts and Performing Arts, where its impact is not significant for the majority of the quantiles. The negative sign of the effect reflects the direct relationship between journal internationality and citation counts. Fitting the Bayesian two-part QR models with hurdle at 0 and hurdle at 3 results in the trend of the effect becoming smoother and of noticeably smaller magnitude, especially for the model with hurdle at 3, for the quantiles in all fields except Visual Arts and Performing Arts and Arts and Humanities, where the estimates based on the various models intersect. This may be due to the high proportions of mass points in these two fields, which influenced the estimates of the effect in the Bayesian QR model. In fact, for Visual Arts and Performing Arts the Bayesian QR model shows that journal internationality is not significant at most quantiles, whereas the model with hurdle at 3 indicates that it is, with the model with hurdle at 0 somewhere in between. This is a good example of the importance of using the appropriate model. Based on the Bayesian QR model, the effect mostly follows a decreasing trend as the quantile increases for most of the fields, indicating a larger impact on highly cited articles, whereas the two-part models mostly show a stable trend, indicating that the effect is equally important for moderately and highly cited articles. Previous literature has also found a significant positive association between journal internationality and citation impact, using Structural Equation Modelling (Yue, 2004) and a simple correlation coefficient (Kim, 2010). Didegah (2014) applied the negative binomial hurdle model and found both negative and positive relationships between journal internationality and non-zero citation counts in different fields: for example, a positive association in Psychiatry/Psychology but a negative one in Social Sciences.

Figure 7. The parameter estimates for title length (β) over the quantiles of the citation count distribution.

Figure 8. The parameter estimates for title length (β) over the quantiles of the citation count distribution, separated by types of the models.

Figure 9. The parameter estimates for journal internationality (β) over the quantiles of the citation count distribution.

Figure 10. The parameter estimates for journal internationality (β) over the quantiles of the citation count distribution, separated by types of the models.

Analysing the logistic parts of the two-part models with hurdle at zero and three
The estimates in the logistic part, with their credible intervals and standard errors, are reported in Table 2 and illustrated in Figure 11. It can be seen that shifting the hurdle from 0 to 3 changes the significance status, and in most cases the size, of the effects of the independent variables for most of the fields. For example, the absolute effect size of collaboration in three fields (Literature and Literary Theory, Religious Studies, and Visual Arts and Performing Arts) becomes larger, and in these fields title length also ceases to be significant. In general, collaboration has a smaller absolute impact on zero citation than on low citation (e.g., 0-3 citations). For title length the trend is reversed: the absolute impact of a longer title on zero citation is slightly larger than on low citation. In addition, for some of the fields the influence of journal internationality on zero citation is a bit larger than on low citation, but for other fields it is a bit smaller.

Collaboration has a significant negative impact on both zero and low cited articles. For example, for Emergency Nursing, based on the model with hurdle at 0, greater collaboration decreases the odds of zero citations on average; the percentage change in the odds is (exp(γ) − 1) × 100, which is negative, where γ denotes the collaboration coefficient reported for this field in Table 2. In the same field, but for the model with hurdle at 3, the odds of low citation also decrease on average, with the percentage change computed in the same way from the hurdle-at-3 coefficient. Collaboration has its largest absolute impact in the field of Arts and Humanities based on the model with hurdle at 0, and in the field of Literature and Literary Theory for the model with hurdle at 3. Among previous studies, Didegah (2014) used the negative binomial hurdle model and showed that collaboration has a negative impact on zero citation in most Web of Science subjects, except for Microbiology, Multidisciplinary, Pharmacology & Toxicology, and Physics, where the effect was non-significant.

Title length has the smallest absolute impact in all the fields in comparison to collaboration and journal internationality. This effect also shows wider credible intervals for the models with hurdle at 0 and 3 in comparison to the effects of collaboration and journal internationality. It is also non-significant in more cases than the other factors for the models with hurdle at 3. For these models, title length has a positive impact on zero citation and also on low citation for the field of
Architecture, while for the other fields the effect is negative and small. The negative sign means that a longer title decreases the odds of zero citations based on the model with hurdle at 0, and decreases the odds of low citation in the model with hurdle at 3. The largest and smallest impact sizes for title length are in Media Technology and Architecture respectively for the model with hurdle at 0, while for the model with hurdle at 3 they are in Emergency Nursing and Architecture. Didegah (2014) showed that title length is a non-significant factor for zero citation for most Web of Science subjects, but for Agricultural Sciences, Geosciences, Materials Science, Mathematics, and Physics title length has a significant positive impact on the odds of zero citation.

Journal internationality has the largest absolute impact on both zero citation and low citation, and also has shorter credible intervals in comparison with collaboration and title length, for the models with hurdle at 0 and 3 in most fields. A greater Gini coefficient (smaller journal internationality) has a significant positive effect on the odds of zero citation and low citation for most of the fields, indicating an inverse relationship between journal internationality and zero or low citation. Didegah (2014) showed that greater journal internationality increases the odds of zero citation for most of the subjects in the Web of Science, except Space Science, where the effect has a decreasing pattern.
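The percentage changes in odds discussed above for the logistic part can be reproduced with a short calculation. The following is a minimal sketch, not the authors' code: it takes the posterior means of the collaboration coefficient for Emergency Nursing and the credible interval bounds of the title length coefficient for Literature and Literary Theory as they appear in Table 2, and it assumes that an effect is called significant when its credible interval excludes zero. The function names are hypothetical.

```python
import math


def odds_change_percent(coef):
    """Percentage change in the odds for a one-unit increase in a covariate."""
    return (math.exp(coef) - 1.0) * 100.0


def interval_excludes_zero(lower, upper):
    """True when a credible interval lies entirely on one side of zero.

    Assumption: this is the significance criterion; the text does not state it.
    """
    return lower > 0.0 or upper < 0.0


# Posterior means of the collaboration coefficient, Emergency Nursing (Table 2).
collab_hurdle0 = -0.345  # model with hurdle at 0
collab_hurdle3 = -0.674  # model with hurdle at 3

print(round(odds_change_percent(collab_hurdle0), 1))
print(round(odds_change_percent(collab_hurdle3), 1))

# Credible interval bounds of the title length coefficient,
# Literature and Literary Theory (Table 2).
print(interval_excludes_zero(-0.045, -0.004))  # hurdle at 0
print(interval_excludes_zero(-0.028, 0.015))   # hurdle at 3
```

With these inputs the odds fall by roughly 29% for the model with hurdle at 0 and roughly 49% for the model with hurdle at 3, and the title length interval excludes zero only for the model with hurdle at 0, in line with the statement that title length ceases to be significant in this field when the hurdle is shifted.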
Table 2. Estimates of the vector γ, with credible intervals and standard deviations, from the logistic part of the Bayesian two-part QR (BTPQR) model with hurdle at 0 and 3. The first four numeric columns refer to the BTPQR with hurdle at 0 and the last four to the BTPQR with hurdle at 3.

| Field | Parameter | Lower bound (hurdle 0) | Mean (hurdle 0) | Upper bound (hurdle 0) | Standard deviation (hurdle 0) | Lower bound (hurdle 3) | Mean (hurdle 3) | Upper bound (hurdle 3) | Standard deviation (hurdle 3) |
|---|---|---|---|---|---|---|---|---|---|
| Literature and Literary Theory | γ | -6.813 | -4.112 | -1.500 | 1.366 | -3.664 | -0.746 | 2.269 | 1.497 |
| | γ | -0.963 | -0.711 | -0.457 | 0.127 | -1.410 | -1.210 | -0.984 | 0.111 |
| | γ | -0.045 | -0.027 | -0.004 | 0.011 | -0.028 | -0.006 | 0.015 | 0.011 |
| | γ | | | | | | | | |
| Arts and Humanities | γ | -2.574 | -0.519 | 1.703 | 1.100 | -7.635 | -5.628 | -3.821 | 0.965 |
| | γ | -1.412 | -1.160 | -0.884 | 0.133 | -1.422 | -1.187 | -0.940 | 0.123 |
| | γ | -0.051 | -0.027 | -0.003 | 0.012 | -0.048 | -0.019 | 0.009 | 0.015 |
| | γ | -1.484 | 0.835 | 2.931 | 1.137 | 6.167 | 8.030 | 10.108 | 1.011 |
| Emergency Nursing | γ | -16.804 | -14.064 | -11.437 | 1.399 | -14.206 | -12.087 | -10.035 | 1.081 |
| | γ | -0.533 | -0.345 | -0.158 | 0.095 | -0.902 | -0.674 | -0.446 | 0.115 |
| | γ | -0.078 | -0.048 | -0.020 | 0.014 | -0.087 | -0.057 | -0.026 | 0.015 |
| | γ | | | | | | | | |
| Visual Arts and Performing Arts | γ | -0.833 | 1.421 | 3.712 | 1.177 | -2.056 | 0.113 | 2.207 | 1.133 |
| | γ | -0.854 | -0.622 | -0.399 | 0.115 | -1.287 | -1.056 | -0.837 | 0.111 |
| | γ | -0.050 | -0.029 | -0.010 | 0.010 | -0.028 | -0.005 | 0.018 | 0.01 |
| | γ | -3.900 | -1.490 | 0.881 | 1.245 | -0.655 | 1.506 | 3.822 | 1.198 |
| Architecture | γ | -14.943 | -12.687 | -10.485 | 1.149 | -15.209 | -13.424 | -11.566 | 0.947 |
| | γ | -0.774 | -0.595 | -0.433 | 0.088 | -0.664 | -0.472 | -0.309 | 0.092 |
| | γ | -0.010 | 0.009 | 0.027 | 0.010 | -0.012 | 0.012 | 0.034 | 0.011 |
| | γ | | | | | | | | |
| Religious Studies | γ | -6.813 | -4.112 | -1.500 | 1.366 | -3.664 | -0.746 | 2.269 | 1.497 |
| | γ | -0.963 | -0.711 | -0.457 | 0.127 | -1.410 | -1.210 | -0.984 | 0.111 |
| | γ | -0.045 | -0.027 | -0.004 | 0.011 | -0.028 | -0.006 | 0.015 | 0.011 |
| | γ | | | | | | | | |
| Media Technology | γ | -23.415 | -21.461 | -19.359 | 1.033 | -17.779 | -16.389 | -14.719 | 0.791 |
| | γ | -0.867 | -0.685 | -0.499 | 0.098 | -0.488 | -0.297 | -0.099 | 0.096 |
| | γ | -0.097 | -0.062 | -0.029 | 0.017 | -0.074 | -0.043 | -0.010 | 0.016 |
| | γ | | | | | | | | |

Figure 11. Parameter estimates from the logistic part of the two-part QR models.

DISCUSSION AND CONCLUSION
Quantile regression enables a detailed description of the relationship between independent variables and a dependent variable. It is a useful technique for analysing the entire citation count distribution, corresponding to low, moderately, and highly cited articles. Discontinuity and the presence of substantial mass points at lower counts are characteristics of citation counts that make the application of the “usual” QR inappropriate. In this research, an update of a Bayesian two-part hurdle QR model was introduced to scientometrics to address these problems. The original Bayesian two-part hurdle QR model was introduced for count data with a substantial mass point at zero. It allows the zero and nonzero data to be modeled separately but simultaneously. For citation count data, in addition to a substantial mass point at zero, in some fields there can be substantial mass points at other low counts, such as one, two, and three, that influence the estimation of the model. Therefore, we introduce a method to shift the hurdle forward to remove the effect of the substantial mass points on the estimation of the model for fields with many low cited articles. Articles with no more citations than the hurdle are regarded as “low cited articles”. With this update, the model enables analyses of the citation counts of low cited articles simultaneously with, but separately from, those of the moderately and highly cited articles. It uses jittering for the citation counts greater than the hurdle to render such data continuous. The model benefits from the power of its QR portion for modeling the different quantiles of the jittered citation counts, and its logistic portion for analysing the influence of factors such as collaboration, title length, and journal internationality on the chances of an article receiving few citations. The usefulness and applicability of the method were illustrated based on both simulated and real citation count data.
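The data preparation implied by this description (a binary indicator for the logistic part and jittered counts for the QR part) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the jitter is taken to be uniform noise on [0, 1) added to each count above the hurdle, in the spirit of Machado and Silva (2005), and the exact transformation used in the model may differ. The function name `split_at_hurdle` and the example counts are hypothetical.

```python
import random


def split_at_hurdle(counts, hurdle=3, seed=0):
    """Prepare citation counts for a two-part hurdle model.

    Returns
    -------
    low_cited : list of int
        1 if the count is at or below the hurdle (a "low cited" article), else 0.
        This is the response for the logistic part.
    jittered : list of float
        Counts above the hurdle with Uniform[0, 1) noise added (an assumed
        jittering scheme), giving a continuous response for the QR part.
    """
    rng = random.Random(seed)
    low_cited = [1 if y <= hurdle else 0 for y in counts]
    jittered = [y + rng.random() for y in counts if y > hurdle]
    return low_cited, jittered


# Hypothetical citation counts for seven articles.
counts = [0, 1, 2, 3, 4, 10, 25]
low_cited, jittered = split_at_hurdle(counts, hurdle=3)
print(low_cited)      # [1, 1, 1, 1, 0, 0, 0]
print(len(jittered))  # 3
```

Setting `hurdle=0` in this sketch gives the split between zero and nonzero counts used by the hurdle-at-zero model, so the same preparation covers both two-part models compared in the paper.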
The simulation showed that the QR part of the two-part model with a hurdle point past the substantial mass points in the data gives mostly more accurate estimates, as measured by the mean square error of the estimates of the coefficients of the independent variables in the model. Moreover, the QR part of the two-part QR models provides smaller prediction errors, at the cost of slightly wider credible intervals for the parameter estimates, in comparison to the Bayesian QR. Shifting the hurdle point forward past the substantial mass points in the data leads to smaller prediction errors for the estimates. Citation data from seven Scopus fields were also considered, and three models (Bayesian QR, Bayesian two-part QR with hurdle at 0, and Bayesian two-part QR with hurdle at 3) were fitted to the data. The results of the Bayesian QR model based on the whole data show a pattern with more ups and downs for the independent variables over the quantiles. However, the two-part models with hurdle at 0 and 3 generally show a smoother trend of the estimates over the quantiles for most of the fields. Shifting the hurdle from 0 to a larger point, past the substantial mass points in the data, influences the impact size, the significance status, and the width of the credible intervals, illustrating the importance of choosing the hurdle appropriately.

In summary, we have shown that the proposed hurdle-at-three model has many advantages over the hurdle-at-zero model of King and Song (2019) for the modelling of citation count data for fields with large percentages of articles with few citations.
ACKNOWLEDGMENTS
The authors would like to thank Clay King for his helpful comments.
REFERENCES
Ahlgren, P., Colliander, C., and Sjögårde, P. (2017). Exploring the relation between referencing practices and citation impact: A large-scale study based on Web of Science data. Journal of the Association for Information Science and Technology, 69.
Anauati, V., Galiani, S., and Gálvez, R. H. (2016). Quantifying the life cycle of scholarly articles across fields of economic research. Economic Inquiry, 54(2):1339–1355.
Bornmann, L., Schier, H., Marx, W., and Daniel, H.-D. (2012). What factors determine citation counts of publications in chemistry besides their quality? J. Informetr., 6(1):11–18.
Borsuk, R., Budden, A., Leimu, R., Aarssen, L., and Lortie, C. (2009). The influence of author gender, national language and number of authors on citation rate in ecology. Open Ecol. J., 2:25–28.
Brzezinski, M. (2015). Power laws in citation distributions: evidence from Scopus. Scientometrics, 103(1):213–228.
Coad, A. and Rao, R. (2008). Innovation and firm growth in high-tech sectors: A quantile regression approach. Research Policy, 37(4):633–648.
Danell, R. (2011). Can the quality of scientific work be predicted using information on the author’s track record? Journal of the American Society for Information Science and Technology, 62(1):50–60.
Didegah, F. (2014). Factors associating with the future citation impact of published articles: A statistical modelling approach.
Eom, Y.-H. and Fortunato, S. (2011). Characterizing and modeling citation dynamics. PLOS ONE, 6(9):1–7.
Galiani, S. and Gálvez, R. H. (2019). An empirical approach based on quantile regression for estimating citation ageing. Journal of Informetrics, 13(2):738–750.
Garanina, O. and Romanovsky, M. (2016). Citation distribution of individual scientist: Approximations of stretch exponential distribution with power law tails. Proceedings of ISSI 2015. Istanbul, Turkey: Bogaziçi University Printhouse, pages 272–277.
Gilchrist, W. (2000). Statistical Modeling with Quantile Functions.
Haslam, N., Ban, L., Kaufmann, L., Loughnan, S., Peters, K., Whelan, J., and Wilson, S. (2008). What makes an article influential? Predicting impact in social and personality psychology. Scientometrics, 76:169–185.
Kim, M.-J. (2010). Visibility of Korean science journals: An analysis between citation measures among international composition of editorial board and foreign authorship. Scientometrics, 84:505–522.
King, C. and Song, J. J. (2019). A Bayesian two-part quantile regression model for count data with excess zeros. Statistical Modelling, 19(6):653–673.
Koenker, R. and Hallock, K. F. (2001). Quantile regression. J. Econ. Perspect., 15(4):143–156.
Koenker, R. W. and Bassett, G. (1978). Regression quantiles. Econometrica, 46(1):33–50.
Kozumi, H. and Kobayashi, G. (2011). Gibbs sampling methods for Bayesian quantile regression. Journal of Statistical Computation and Simulation, 81(11):1565–1578.
Lee, D. and Neocleous, T. (2010). Bayesian quantile regression for count data with application to environmental epidemiology. Journal of the Royal Statistical Society: Series C (Applied Statistics), 59(5):905–920.
Low, W. J., Wilson, P., and Thelwall, M. (2016). Stopped sum models and proposed variants for citation data. Scientometrics, 107(2):369–384.
Machado, J. A. F. and Silva, J. M. C. S. (2005). Quantiles for counts. Journal of the American Statistical Association, 100(472):1226–1237.
Mäntylä, M. and Garousi, V. (2019). Citations in software engineering – paper-related, journal-related, and author-related factors.
Meho, L. (2007). The rise and rise of citation analysis. Phys World, 20:32–36.
Redner, S. (1998). How popular is your paper? An empirical study of the citation distribution. Eur. Phys. J. B-Condensed Matter and Complex Systems, pages 131–134.
Santos, B. and Bolfarine, H. (2015). Bayesian analysis for zero-or-one inflated proportion data using quantile regression. Journal of Statistical Computation and Simulation, 85(17):3579–3593.
Seglen, P. O. (1992). The skewness of science. J Am Soc Inf Sci, 43(9):628–638.
Shahmandi, M., Wilson, P., and Thelwall, M. (2020). A new algorithm for zero-modified models applied to citation counts. Scientometrics.
Stegehuis, C., Litvak, N., and Waltman, L. (2015). Predicting the long-term citation impact of recent publications. Journal of Informetrics, 9(3):642–657.
Thelwall, M. (2016). Are there too many uncited articles? Zero inflated variants of the discretised lognormal and hooked power law distributions. J Informetr, 10(2):622–633.
Thelwall, M. and Wilson, P. (2014). Regression for citation data: An evaluation of different methods. J. Informetr., 8(4):963–971.
Wang, J. and Shapira, P. (2015). Is there a relationship between research sponsorship and publication impact? An analysis of funding acknowledgments in nanotechnology papers. PLOS ONE, 10(2):1–19.
Xing, W. (2018). The relationship between SCI editorial board representation and university research output in the field of computer science: A quantile regression approach. Malaysian Journal of Library & Information Science, 23(1):67–84.
Yu, K. and Moyeed, R. A. (2001). Bayesian quantile regression. Statistics & Probability Letters, 54(4):437–447.
Yue, W. (2004). Predicting the citation impact of clinical neurology journals using structural equation modeling with partial least squares. PhD thesis.