Fraud detection with statistics: A comment on "Evidential Value in ANOVA-Regression Results in Scientific Integrity Studies" (Klaassen, 2015)
Hannes Matuschek
Focus Area for Dynamics of Complex Systems and Department of Psychology, University of Potsdam, Karl-Liebknecht-Str. 24, D-14476 Potsdam, Germany, [email protected]
Abstract
Klaassen in [11] proposed a method for the detection of data manipulation given the means and standard deviations for the cells of a one-way ANOVA design. This comment critically reviews this method. In addition, inspired by this analysis, an alternative approach to test sample correlations over several experiments is derived. The results are in close agreement with the initial analysis reported by an anonymous whistleblower [1]. Importantly, the statistic requires several similar experiments; a test for correlations between 3 sample means based on a single experiment must be considered unreliable.
arXiv [stat.ME], July

1 Introduction

An analysis of means and standard deviations [17], culled from a series of scientific publications, led to a request for retraction of a subset of the papers [16]. The analysis was based on a method reported in Klaassen [11] aimed at detecting a type of data manipulation that causes correlations between condition means of samples that are assumed to be independent. Specifically, given a one-way balanced ANOVA design with 3 conditions, the condition means $X_i$, $i = 1, \dots, 3$, obtained by averaging over the scores of $n$ different subjects in each condition, are samples of a 3-dimensional normal distribution

\[
\begin{pmatrix} X_1 \\ X_2 \\ X_3 \end{pmatrix} \sim \mathcal{N}\left(
\begin{pmatrix} \mu_1 \\ \mu_2 \\ \mu_3 \end{pmatrix},\;
n^{-1}\begin{pmatrix}
\sigma_1^2 & \sigma_1\sigma_2\rho_{12} & \sigma_1\sigma_3\rho_{13} \\
\sigma_1\sigma_2\rho_{12} & \sigma_2^2 & \sigma_2\sigma_3\rho_{23} \\
\sigma_1\sigma_3\rho_{13} & \sigma_2\sigma_3\rho_{23} & \sigma_3^2
\end{pmatrix}\right), \tag{1}
\]

where $\mu_i$ are the unknown true expected values, $\sigma_i$ the unknown standard deviations of the scores under the respective conditions and $\rho_{ij}$ the correlations between conditions $i$ and $j$, collected in the vector $\vec\rho = (\rho_{12}, \rho_{13}, \rho_{23})$. The ANOVA assumes independence between the samples of the conditions, such that $\rho_{ij} = 0$. Indeed, given only samples of $X_i$ and estimates of $\sigma_i$, the sample correlations $\rho_{ij}$ are not directly accessible.

An anonymous whistleblower pointed out [1] that the results in the studies under suspicion (e.g. [6], compare Figure 1) show a super linear pattern which appears too good to be true. Importantly, the authors of the original publications did not necessarily expect such patterns of equidistant means; they expected an ordinal, not a linear relation between the three condition means. Nevertheless, the reanalyses were carried out under the assumption of an expected strict linear relation between means. The reason was that this strict assumption is conservative with respect to an inference of data manipulation.

Footnote 1: It is also not clear how a suitable test could be constructed for the assumption that the means are expected only in a monotonic, not necessarily equidistant order.

[Figure: 12 panels, one per experiment, each showing the three condition means x1, x2, x3 with standard deviations; title JF.D12.SPPS]
Fig. 1:
Condition means ($x_1$, $x_2$ and $x_3$) and standard deviations for the 12 experiments reported in [6]. The condition means $x_1$ and $x_3$ have been connected by a line to visualize the deviance from a perfect linear behavior of the condition means.

Under the assumption of a strictly linear relationship between the group means, $\mu_i = \alpha + \beta \cdot i$, the scores can be described as $X_i = \alpha + \beta \cdot i + \epsilon_i$, which implies that $0 = E[Z] = E[X_1 - 2X_2 + X_3] = \mu_1 - 2\mu_2 + \mu_3$. This linear combination of sample means $X_i$ yields a new random variable $Z$ with the (univariate) normal distribution $Z \sim \mathcal{N}(0, n^{-1}\sigma_Z^2(\vec\sigma, \vec\rho))$, where

\[
\sigma_Z^2(\vec\sigma, \vec\rho) = \sigma_1^2 + 4\sigma_2^2 + \sigma_3^2 - 4\sigma_1\sigma_2\rho_{12} - 4\sigma_2\sigma_3\rho_{23} + 2\sigma_1\sigma_3\rho_{13}.
\]

Note that the random variable $Z$ can be seen as the deviance from the strictly linear behavior $\alpha + \beta i$.

Introducing correlations between the samples increases or decreases the variance of $Z$. Klaassen [11] assumes that a plausible data manipulation (e.g., adjusting the mean of the middle sample towards the mean of the means of the lower and upper samples to achieve significant differences between the groups) leads to a decrease of the variance of $Z$, $\sigma_Z^2(\cdot, \cdot)$. Such a variance reduction may have gone unnoticed as humans tend to underestimate variance in data. As mentioned above, the results under suspicion show a super linear behavior and hence a small variance in $Z$ which may not be expected given the group variances $\sigma_i^2$ under the assumption of independence.

Consequently, Klaassen [11] used a simple likelihood-ratio test to decide whether there is evidence for data manipulation, expressed as an evidential value

\[
V = \max_{\vec\rho \in \mathcal{F}} \frac{f(z \mid \sigma_Z(\vec\sigma, \vec\rho))}{f(z \mid \sigma_Z(\vec\sigma, \vec 0))},
\]

comparing the maximum likelihood over all feasible vectors of correlations $\vec\rho$ with the likelihood of $z$ under the assumption of $\vec\rho = \vec 0$, where

\[
\mathcal{F} = \left\{ \vec\rho : \rho_{ij} \in (-1, 1),\; \rho_{12}^2 + \rho_{13}^2 + \rho_{23}^2 - 2\rho_{12}\rho_{13}\rho_{23} < 1,\; \sigma_Z(\vec\sigma, \vec\rho) \le \sigma_Z(\vec\sigma, \vec 0) \right\}
\]

is the set of feasible correlation vectors. It maintains that the covariance matrix (in eq. 1) remains positive definite and ensures that $\sigma_Z(\vec\sigma, \vec\rho) \le \sigma_Z(\vec\sigma, \vec 0)\ \forall \vec\rho \in \mathcal{F}$. As the true standard deviations $\vec\sigma$ are unknown, they might be replaced by the reported ones $\vec s$, since $\vec s \to \vec\sigma$ as $n \to \infty$.

2 An asymptotic test statistic

Without any knowledge of the distribution of the test statistic $V$ under the null hypothesis $H_0$ (independent group means), it is not possible to interpret the value $V$ and hence to decide whether a certain value of $V$ does provide evidence for the presence of sample correlations. The estimates of the sample variances ($\vec s^{\,2}$) are themselves random variables with some unknown distribution. It is therefore rather unlikely to obtain a closed form expression for the test statistic, even under restrictive assumptions about the distribution of $\vec s$.

Nevertheless, as proposed by Klaassen [11], one may assume that asymptotically $\vec s \to \vec\sigma$ as $n \to \infty$. Then one can assume that the sample variances $\vec\sigma^{2}$ are fixed and known, allowing for the construction of an upper-bound asymptotic test statistic.

The likelihood to obtain a specific value $z$, given the sample variances $\vec\sigma^{2}$ and correlations $\vec\rho$, is

\[
f(Z = z \mid \sigma_Z(\vec\sigma, \vec\rho)) = \frac{\sqrt{n}}{\sqrt{2\pi}\,\sigma_Z(\vec\sigma, \vec\rho)} \exp\left\{ -\frac{n z^2}{2\sigma_Z^2(\vec\sigma, \vec\rho)} \right\}
\]

and therefore

\[
V = \max_{\vec\rho \in \mathcal{F}} \frac{\sigma_Z(\vec\sigma, \vec 0)}{\sigma_Z(\vec\sigma, \vec\rho)} \exp\left\{ -\frac{n z^2}{2\sigma_Z^2(\vec\sigma, \vec\rho)} + \frac{n z^2}{2\sigma_Z^2(\vec\sigma, \vec 0)} \right\}. \tag{2}
\]

Now, let $a = \sigma_Z(\vec\sigma, \vec\rho) / \sigma_Z(\vec\sigma, \vec 0)$ be the relative standard deviation and $\sigma_0 = \sigma_Z(\vec\sigma, \vec 0)$, then

\[
V = \max_{a \in \mathcal{A}} a^{-1} \exp\left\{ -\frac{n z^2}{2 a^2 \sigma_0^2} + \frac{n z^2}{2\sigma_0^2} \right\}.
\]

The feasible set $\mathcal{A}$ of all values of $a$ is implicitly defined by the feasible set of correlations as

\[
\mathcal{A} = \left\{ \frac{\sigma_Z(\vec\sigma, \vec\rho)}{\sigma_Z(\vec\sigma, \vec 0)} : \vec\rho \in \mathcal{F} \right\}.
\]

From this it follows immediately that
$\mathcal{A} \subseteq (0, 1]$ as $\sigma_Z(\vec\sigma, \vec\rho) \le \sigma_Z(\vec\sigma, \vec 0)\ \forall \vec\rho \in \mathcal{F}$.

Under a worst-case scenario, one may assume $\mathcal{A} = (0, 1]$, i.e. that for every $a \in (0, 1]$ there exists a $\vec\rho \in \mathcal{F}$ such that $\sigma_Z(\vec\sigma, \vec\rho) = a\,\sigma_0$. Please note that this is not ensured in general. The worst-case assumption, however, allows one to obtain upper bounds for the distribution of $V$ under $H_0$ analytically by relaxing the constraints on $a$ implied by the feasibility constraints on $\vec\rho$.

Within this setting one gets

\[
V \le \hat V = \max_{a \in (0, 1]} a^{-1} \exp\left\{ -\frac{n z^2}{2 a^2 \sigma_0^2} + \frac{n z^2}{2\sigma_0^2} \right\}.
\]

With $\tilde z = \sqrt{n}\, z / \sigma_0$, the normalized $z$ with respect to the expected standard deviation under $H_0$,

\[
\hat V = \max_{a \in (0, 1]} a^{-1} \exp\left\{ -\frac{\tilde z^2}{2 a^2} + \frac{\tilde z^2}{2} \right\}.
\]

Straightforward computation reveals

\[
0 = \partial_a \left( \log\left[ a^{-1} \exp\left\{ -\frac{\tilde z^2}{2 a^2} + \frac{\tilde z^2}{2} \right\} \right] \right) = -\frac{1}{a} + \frac{\tilde z^2}{a^3} \;\Rightarrow\; a^2 = \tilde z^2,
\]

and therefore

\[
\hat V = \begin{cases} 1 & : |\tilde z| > 1 \\ |\tilde z|^{-1} \exp\left\{ \frac{\tilde z^2 - 1}{2} \right\} & : \text{else}. \end{cases} \tag{3}
\]

Under the worst-case scenario, an upper-bound evidential value $\hat V \ge V$ can be computed directly without maximizing the likelihood ratio numerically. This result was also found by Klaassen (compare eq. 18 in [11]).

Knowing that the maximum $\hat V$ is achieved at $\tilde z^2 = a^2$, and therefore $n z^2 / \sigma_0^2 = \sigma_Z^2(\vec\sigma, \vec\rho) / \sigma_0^2$, one may conclude that the likelihood-ratio test compares the expected variance $\sigma_0^2$ under $H_0$ with a variance estimated from a single sample. Such a variance estimate is known to be unreliable and therefore the evidential value for a single experiment must be unreliable, too. This issue is discussed in detail in the next section.

3 Testing multiple experiments

Klaassen [11] (see also [17]) suggested obtaining the evidential value $V$ for an article consisting of more than one experiment as the product of the evidential values $V_j$ of the single experiments in the article. The evidential value $V$ of a publication given $N$ experiments is then

\[
V = \prod_{j=1}^{N} V_j = \prod_{j=1}^{N} \max_{\vec\rho \in \mathcal{F}_j} \frac{f(z_j \mid \sigma_Z(\vec\sigma_j, \vec\rho))}{f(z_j \mid \sigma_Z(\vec\sigma_j, \vec 0))}. \tag{4}
\]

Given that $V_j \ge 1$, this immediately implies that the product grows exponentially with the number of experiments even if $H_0$ is true. Instead of obtaining the evidential value for every single experiment in an article, which (in a worst-case scenario) is based on a variance estimator from a single sample ($\hat\sigma_{Z,j}^2 = n_j z_j^2$), one may try to base that variance estimation on the $N$ samples provided by the $N$ experiments in an article, i.e.

\[
V = \max_{\vec\rho \in \mathcal{F}} \prod_{j=1}^{N} \frac{f(z_j \mid \sigma_Z(\vec\sigma_j, \vec\rho))}{f(z_j \mid \sigma_Z(\vec\sigma_j, \vec 0))}, \tag{5}
\]

where the feasible set $\mathcal{F} = \bigcap_{j=1}^{N} \mathcal{F}_j$ is just the intersection of the feasible sets $\mathcal{F}_j$ of every experiment.

The idea of this alternative approach is simple: We cannot make a reliable statement about the probability of observing a single suspiciously small $\tilde z_j^2$, particularly as $0 = E[Z]$ under $H_0$. However, observing a suspiciously small $\tilde z^2$ repeatedly is unlikely and may indicate sample correlations between groups.

Following the worst-case scenario above, the joint evidential value for $N$ experiments is asymptotically

\[
\hat V = \max_{a \in (0, 1]} a^{-N} \exp\left\{ -\sum_{j=1}^{N} \frac{n_j z_j^2}{2 a^2 \sigma_{0,j}^2} + \sum_{j=1}^{N} \frac{n_j z_j^2}{2\sigma_{0,j}^2} \right\}
= \max_{a \in (0, 1]} a^{-N} \exp\left\{ -\sum_{j=1}^{N} \frac{\tilde z_j^2}{2 a^2} + \sum_{j=1}^{N} \frac{\tilde z_j^2}{2} \right\},
\]

where again $\tilde z_j = \sqrt{n_j}\, z_j / \sigma_{0,j}$ and $\sigma_{0,j} = \sigma_Z(\vec\sigma_j, \vec 0)$. The maximum is attained at

\[
a^2 = \frac{1}{N} \sum_{j=1}^{N} \tilde z_j^2.
\]

This implies that, in a worst-case scenario, the joint likelihood ratio compares a variance estimate based on $N$ samples with the expected one. And finally

\[
\hat V = \begin{cases} 1 & : N \le \sum_{j=1}^{N} \tilde z_j^2 \\[4pt] \dfrac{\exp\left\{ -\frac{N}{2} + \frac{1}{2}\sum_{j=1}^{N} \tilde z_j^2 \right\}}{\left( \frac{1}{N} \sum_{j=1}^{N} \tilde z_j^2 \right)^{N/2}} & : \text{else}. \end{cases}
\]

Note that the joint evidential value for $N$ experiments depends on the data only through $\sum_{j=1}^{N} \tilde z_j^2$. Since $\tilde Z_j \sim \mathcal{N}(0, 1)$ under $H_0$, it follows that $\sum_{j=1}^{N} \tilde Z_j^2 \sim \chi^2_N$. Hence the test statistic for sample correlations between groups can be expressed as a simple chi-squared statistic and one does not need to make the detour of obtaining an approximate distribution of $V$ under $H_0$.

4 Relation to the ∆F test

The $\chi^2$-test derived in the last section is closely related to the ∆F-test suggested by the whistleblower [1].
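The $\chi^2$-test and the joint worst-case evidential value can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the code used for the analyses reported here: the function names are hypothetical, the reported standard deviations are plugged in for $\vec\sigma$, the design is balanced with $n$ subjects per condition, and the p-value is taken to be the lower-tail probability of $\chi^2_N$ (small sums of $\tilde z_j^2$ count as suspicious). The chi-squared distribution function is computed with a standard series expansion so that only the standard library is needed.

```python
import math


def z_tilde(means, sds, n):
    """Normalized deviation from linearity for one experiment.

    means, sds: the three reported condition means and standard deviations;
    n: number of subjects per condition (balanced design assumed).
    """
    m1, m2, m3 = means
    s1, s2, s3 = sds
    z = m1 - 2.0 * m2 + m3                            # deviation from a linear trend
    sigma0 = math.sqrt(s1**2 + 4.0 * s2**2 + s3**2)   # sd of Z under independence
    return math.sqrt(n) * z / sigma0


def chi2_cdf(x, dof):
    """P(X <= x) for X ~ chi^2 with dof degrees of freedom (series expansion)."""
    if x <= 0.0:
        return 0.0
    a, y = dof / 2.0, x / 2.0
    term = 1.0 / a
    total = term
    for k in range(1, 10000):
        term *= y / (a + k)
        total += term
        if term < 1e-16 * total:
            break
    return total * math.exp(a * math.log(y) - y - math.lgamma(a))


def chi2_lower_tail_p(z_tildes):
    """Lower-tail p-value of sum(z_tilde_j^2) against chi^2 with N dof."""
    s = sum(z * z for z in z_tildes)
    return chi2_cdf(s, len(z_tildes))


def joint_v_hat(z_tildes):
    """Worst-case joint evidential value for N experiments."""
    n_exp = len(z_tildes)
    s = sum(z * z for z in z_tildes)
    if s >= n_exp:
        return 1.0
    if s == 0.0:
        return math.inf
    return (s / n_exp) ** (-n_exp / 2.0) * math.exp(-n_exp / 2.0 + s / 2.0)


if __name__ == "__main__":
    # Hypothetical summary statistics, not taken from any of the cited articles.
    experiments = [
        ((2.0, 3.1, 4.0), (1.0, 1.1, 0.9), 20),
        ((1.5, 2.4, 3.5), (0.8, 0.9, 1.0), 20),
        ((3.0, 3.9, 5.0), (1.2, 1.0, 1.1), 20),
    ]
    zs = [z_tilde(m, s, n) for m, s, n in experiments]
    print("p (chi^2, lower tail):", chi2_lower_tail_p(zs))
    print("joint V-hat:", joint_v_hat(zs))
```

For a single experiment (`len(z_tildes) == 1`), `joint_v_hat` reduces to the single-experiment bound of eq. (3).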
The ∆F-test was also included in the report for the University of Amsterdam [17]. Under $H_0$ and the assumption of a linear trend, the p-values of the ∆F-test for a single experiment within an article are distributed uniformly in [0, 1]. The ∆F-test first determines a p-value for every experiment and tests whether the resulting p-values $p_j$ are too good to be true, while the chi-square test introduced here assesses this directly by inspecting whether the relative deviations from perfect linearity $\tilde z_j$ are too good to be true. Therefore, unsurprisingly, the two methods yield very similar results (see Table 1).

Article                                        χ²-test    ∆F-test    Classification
JF09.JEPG [3]                                  8.06e-07   2.30e-07   strong
JF11.JEPG [5]                                  8.73e-07   3.53e-07   strong
JF.D12.SPPS [6]                                7.14e-09   1.82e-08   strong
L.JF09.JPSP [13]                               6.44e-04   8.46e-05   strong
L.JF09.JPSP*                                   0.03       0.02       –
JF.LS09.JEPG [4]                               0.25       0.11       strong
JF.LK08.JPSP [7]                               0.81       0.66       inconclusive
D.JF.L09.JESP [2]                              0.93       0.52       inconclusive
Reference [8, 9, 10, 12, 15, 18, 20, 22, 21]   0.11       0.14       –

Tab. 1:
Comparison of p-values obtained with the direct χ² and ∆F tests for studies classified as providing strong or inconclusive statistical evidence for low veracity by Peeters et al. [17]. The first three studies listed in the table were reported by the whistleblower [1]. Note the divergence for JF.LS09.JEPG between the present analysis and [17]. Only those studies from [17] were considered here which provide at least 8 experiments.

[Figure: nine panels, horizontal axis from −4 to 4, with panel titles JF09.JEPG: p=0.00000081; JF11.JEPG: p=0.00000087; JF.D12.SPPS: p=0.00000001; L.JF09.JPSP: p=0.00064358; L.JF09.JPSP*: p=0.02940077; JF.LS09.JEPG: p=0.24869233; JF.LK08.JPSP: p=0.81094855; D.JF.L09.JESP: p=0.89804776; Reference: p=0.11149939]
Fig. 2:
The distribution of $\tilde z_j$ (short dashes at the bottom of each panel) for each experiment from the articles listed in Table 1. The solid line shows the expected distribution of $\tilde Z_j$ under $H_0$ while the dashed line shows the normal distribution with zero mean and the variance estimated from the samples $\tilde z_j$.

Both methods, the χ² and ∆F tests, are conservative compared to the V-value approach by Klaassen [11]. For example, the article JF.LS09.JEPG in Table 1 was classified with strong statistical evidence for low veracity [17] (compare also Figure 3). In contrast, the χ² and ∆F methods yield p-values of ≈ 0.25 and ≈ 0.11, respectively, suggesting that there is no evidence of sample correlations between groups. The three methods agree for the studies JF.LK08.JPSP and D.JF.L09.JESP, which were classified with inconclusive statistical evidence for low veracity. The three methods also agree on classifying the three articles reported by the whistleblower [1] with strong statistical evidence for low veracity.

Depending on the chosen level of significance, the article L.JF09.JPSP could be classified as strong or inconclusive. This article contains conditions for which the authors did not expect a specific rank ordering of the condition means. Peeters et al. [17] included these control conditions but reordered them according to increasing group means, yielding a p-value for the χ²-test of about 0.0006 (see Table 1). Although the assumption of a linear trend, $0 = \mu_1 - 2\mu_2 + \mu_3$, contains the assumption of equal group means, i.e. $\mu_1 = \mu_2 = \mu_3$, as a special case, the actual test result depends on the ordering of the conditions. Keeping the order of conditions as reported in [13] yields a p-value of about 0.015 and excluding them results in a p-value of about 0.03, shown as L.JF09.JPSP* in Table 1.

The discrepancy between the χ² or ∆F methods and the V-value method for the JF.LS09.JEPG article [4] is due to the tendency of the V-value method to indicate strong evidence if a single experiment out of a series of experiments has a very small $\tilde z$-value. In contrast to the V-value method, the χ² and the augmented V-method (see Section 3) take all experiments of an article into account by assuming the same correlation structure for all experiments.

For the particular article [4], the V-value approach reported strong evidence for low veracity because the last two experiments (compare Figure 3) exhibit the super linear pattern associated with sample correlations. The χ² and ∆F methods, however, do not indicate significant sample correlations as the deviances of the remaining experiments fit well into the expected distribution under $H_0$, especially the results in panels 5 & 6 in Figure 3.

[Figure: 9 panels, one per experiment, each showing the three condition means x1, x2, x3 with standard deviations; title JF.LS09.JEPG]
Fig. 3:
Condition means and standard deviations for 9 experiments from [4].

Klaassen [11] intended the V-value to be sensitive for single experiments. The argument is that bad science cannot be compensated by very good science [11]. Finding a small value for $\tilde z_j^2$ in a series of experiments, however, is quite probable even under $H_0$. Hence one could argue that a single suspiciously small $\tilde z_j^2$ cannot be interpreted as strong evidence for sample correlations.

Footnote 2: I.e. for 10 experiments ($N = 10$) p ≈ … (α = 0.05) and p ≈ … (α = 0.…).

5 Discussion

There is no doubt that, in principle, statistics can be used to detect sample correlations that are due to data manipulation. The approach proposed in [11], however, is not without problems.

A first problem is the missing test statistic for the evidential value $V$. Although an upper-bound asymptotic test statistic for the V-value of a single experiment can be obtained (see Section 2 above and [11]), the reliability of the V-value for a small $n$ remains unknown (as well as how large $n$ must be to be considered large).

A second problem is the critical value of $V^* = 6$ chosen by the authors, which implies (asymptotically) p ≈ 0.08. Arguably, this is a rather high probability of falsely accusing a colleague of data manipulation.

A third problem is the assumption that the product of the evidence provided by every single experiment in an article can serve as a metric of evidence for data manipulation in this article. As mentioned above as well as in the comments to the article at pubpeer.com [19] and in a response by Denzler and Liberman [14], this assumption implies that the evidence for data manipulation grows exponentially with the number of experiments even under $H_0$. The probability of $V \ge 2$ is p ≈ 0.25. Thus, about every 4th good experiment will double the evidence for data manipulation.

The fourth problem, finally, is a general concern. The analysis assumes a specific type of data manipulation. If this is true, the manipulation will induce correlations between condition means. Moreover, under the second assumption that $0 = E[X_1 - 2X_2 + X_3]$ this correlation can be detected. Importantly, however, the reverse is not true: The detection of such correlations in the data does not necessarily imply that data were manipulated. For that reason, Peeters et al. carefully avoided claiming in [17] that their findings prove that data were manipulated. Instead the results are interpreted as evidence for low data veracity, which is justified. In [11], however, Klaassen claims that his method provides evidence for manipulation. Although the origin of sample correlations cannot be determined with statistics, their presence certainly violates an ANOVA assumption. This may result in an increased type-I error rate. Therefore, the effects reported in the articles providing strong or possibly even inconclusive evidence for sample correlations (e.g. [3, 5, 6, 13]) may be less significant than suggested by their ANOVAs.

In this comment, specifically in Section 3, the concept of the single-experiment evidential value was extended to multiple experiments. Moreover, a much simpler chi-squared test was provided to test the presence of correlations in the data; it is similar to the test proposed in [1] and yielded very similar p-values. Thus, the V-value approach can serve as a test for sample correlations if it is applied across several identical or at least similar experiments. In this case one is also able to decide whether the variability in the results is suspiciously small or not. However, estimating $\sigma_Z^2$ on the basis of a single experiment will certainly not yield a reliable result.

References
[1] Anonymous. Suspicion of Scientific misconduct by Dr. Jens Förster, 2012.
[2] Markus Denzler, Jens Förster, and Nira Liberman. How goal-fulfillment decreases aggression.
Journal of Experimental Social Psychology, 45(1):90–100, 2009.
[3] J. Förster. Relations between perceptual and conceptual scope: How global versus local processing fits a focus on similarity versus dissimilarity. Journal of Experimental Psychology: General, 138:88–111, 2009.
[4] J. Förster, N. Liberman, and O. Shapira. Preparing for novel versus familiar events: Shifts in global and local processing. Journal of Experimental Psychology: General, 138:383–399, 2009.
[5] Jens Förster. Local and global cross-modal influences between vision and hearing, tasting, smelling, or touching. Journal of Experimental Psychology: General, 140:364–389, 2011.
[6] Jens Förster and Markus Denzler. Retracted: Sense Creative! The Impact of Global and Local Vision, Hearing, Touching, Tasting and Smelling on Creative and Analytic Thought. Social Psychological and Personality Science, 3(1):118–118, 2014.
[7] Jens Förster, Nira Liberman, and Stefanie Kuschel. The effect of global versus local processing styles on assimilation versus contrast in social judgment. Journal of Personality and Social Psychology, 94(4):579–599, 2008.
[8] H. Hagtvedt and V. M. Patrick. Turning Art Into Mere Illustration: Concretizing Art Renders Its Influence Context Dependent. Personality and Social Psychology Bulletin, 37(12):1624–1632, 2011.
[9] Catherine Hunt and Marie Carroll. Verbal overshadowing effect: How temporal perspective may exacerbate or alleviate the processing shift. Applied Cognitive Psychology, 22(1):85–93, 2008.
[10] A. B. Kanten. The effect of construal level on predictions of task duration. Journal of Experimental Social Psychology, 47(6):1037–1047, 2011.
[11] Chris A. J. Klaassen. Evidential Value in ANOVA-Regression Results in Scientific Integrity Studies. pages 1–12, 2015. arXiv:1405.4540 [stat.me].
[12] Davy Lerouge. Evaluating the Benefits of Distraction on Product Evaluations: The Mind-Set Effect. Journal of Consumer Research, 36:367–379, 2009.
[13] N. Liberman and J. Förster. Distancing from experienced self: How global-versus-local perception affects estimation of psychological distance. Journal of Personality and Social Psychology, 97:203–216, 2009.
[14] Nira Liberman and Markus Denzler. Response to a Report Published by the University of Amsterdam, 2015.
[15] Selin A. Malkoc, Gal Zauberman, and James R. Bettman. Unstuck from the concrete: Carryover effects of abstract mindsets in intertemporal preferences. Organizational Behavior and Human Decision Processes, 113(2):112–126, 2010.
[16] University of Amsterdam. Articles Jens Förster investigated, 2015.
[17] Carel F. W. Peeters, Chris A. J. Klaassen, and Mark A. van de Wiel. Evaluating the Scientific Veracity of Publications by Dr. Jens Förster, 2015.
[18] Evan Polman and Kyle J. Emich. Decisions for others are more creative than decisions for the self. Personality and Social Psychology Bulletin, 37(4):492–501, 2011.
[19] Pubpeer.com. Evidential Value in ANOVA-Regression Results in Scientific Integrity Studies, 2015.
[20] Laurens Rook and Daan van Knippenberg. Creativity and Imitation: Effects of Regulatory Focus and Creative Exemplar Quality. Creativity Research Journal, 23:346–356, 2011.
[21] Pamela K. Smith and Yaacov Trope. You focus on the forest when you’re in charge of the trees: power priming and abstract information processing. Journal of Personality and Social Psychology, 90(4):578–596, 2006.
[22] Pamela K. Smith, D. H. J. Wigboldus, and Ap Dijksterhuis. Abstract thinking increases one’s sense of power.