Causes of Misleading Statistics and Research Results Irreproducibility: A Concise Review
Farzan Shenavarmasouleh and Hamid R. Arabnia
Department of Computer Science, University of Georgia, Athens, Georgia, United States, 30602
{fs04199, hra}@uga.edu

Abstract — Bad statistics make research papers unreproducible and misleading. For the most part, the reasons for such misuse of numerical data were identified and addressed years ago by experts, and proper practical solutions have been presented. Yet we still see numerous instances of statistical fallacies in modern research, which without a doubt play a significant role in the research reproducibility crisis. In this paper, we review different bad practices that impact the research process from its beginning to its very end. Additionally, we briefly propose open science as a universal methodology that can facilitate the entire research life cycle.

Keywords: Statistical Fallacy, Misleading Statistics, Paper Reproducibility, Replication Crisis, Open Science

1. INTRODUCTION
For the past two centuries, the annual numbers of articles and journals have both kept growing at steady rates of about 3% and 3.5% respectively. Unsurprisingly, the increase in the number of researchers has accelerated this growth over the last few years. About 2.5 million articles are published each year worldwide [1], and yet only a tiny proportion of their results are reproducible. In fact, more than 70% of researchers claimed that they had failed to reproduce another person's work and, even more frighteningly, over 50% could not reproduce their own work again, altogether leading to a reproducibility crisis [2].

One may assume that unreproducible papers could only be found in poor-quality journals, but unfortunately that is not the case. Many of these papers are written by very successful researchers and are published in the top journals of their fields. Given that no good researcher or journal intends to present false or erroneous information to its audience, the question arises as to why this happens. Knowing that data do not speak for themselves and have to be interpreted, it can safely be concluded that the main cause is statistics, which can affect research outcomes at multiple levels.

Statistics is an essential part of every research project; however, not much effort is put into its proper education. Very few university majors require their students to take statistics courses, and therefore many end up learning the subject on their own from whatever resources are available. This can lead to a tremendous number of problems, such as not using the correct statistical methods or graphs for a certain problem, or using ones that are not robust to the noise and outliers present in the dataset.

Sadly, there is a bigger issue. Even if the paper itself is a hundred percent accurate, the words used to describe it can be fallacious.
This could simply be unintentional and, as before, purely due to bad education, but numerical data can also be intentionally misused. Truth is not everyone's first priority. Thousands of news stories are published each day, many reporting a concern that seeks resolution by politicians. Hence, they must compete to get noticed by the public and to catch the attention of policymakers; whichever succeeds consequently gets the biggest share of their time and budget while the others are left disregarded. Since alarming news brings a bigger audience to the corresponding media outlet, and a bigger audience means more revenue, the people in charge do not hesitate to exploit this, even if it means fabricating stories, and statistics is the key component in this process. But every move has consequences, and propaganda such as fake news can alter or even derail government programs and policies, leading to unforeseen circumstances.

This paper aims to address and classify the causes that make research papers unreproducible and to explain how paper results can be reported misleadingly. We intend to look at these challenges as an opportunity to inform our audience. With this goal in mind, we hope that this manuscript offers readers insight and a better understanding of the causes of these problems, thus helping them not to become too reliant on papers that suffer from fallacies. We conclude the paper by proposing open science as an important addition to every research project.

2. RESOURCES AND TOOLS
Since statistics is not a mandated part of many university majors, researchers in those fields mostly choose to learn it by themselves from books; after learning the fundamentals, they move on to software tools that do the complex and time-consuming calculations for them or that generate eye-catching visuals to aid in presenting the results of their work.
Textbooks
Bland and Altman [3] explain that most introductory statistics books are written by authors who are experts in their own fields but are rarely qualified to write about anything else. Those books tend to be more attractive to a large proportion of researchers in that field than a book authored by an experienced statistician, solely because the authors share a common educational background with their readers and hence their words are more comprehensible.

From time to time, such books originate incorrect ideas. Bland argues that there were cases in which the authors had made an error and then used a false argument to justify it. The horrifying part is that the people who read these books are unlikely to have enough knowledge to detect the flaws, and consequently the errors keep propagating.
Software Tools
With the extensive use of computers, numerous software packages have become accessible for research. Although they are massively useful and can save much time, using them without full understanding can be dangerous. Many statistical operations have multiple versions, intended for different applications. For example, the standard deviation can be calculated with n instead of n − 1 in its denominator, and it is incorrect to use the unpooled variance instead of the pooled variance in t-tests. These little differences can lead to significant results where none exist [3].
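As a concrete illustration of the first pitfall, Python's standard library exposes both denominators side by side; the data values below are made up for illustration:

```python
# Sketch of the two standard-deviation formulas a software package may
# silently default to. The sample values here are invented.
import statistics

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

pop_sd = statistics.pstdev(data)   # divides by n (population formula)
samp_sd = statistics.stdev(data)   # divides by n - 1 (sample formula)

print(pop_sd)              # 2.0
print(round(samp_sd, 4))   # 2.1381
```

For small samples the two can differ enough to flip a borderline significance test, which is exactly the danger described above.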
Graphs
Graphs are considered great tools for visualizing massive, complex data, and their simplicity helps the audience understand the contents better; but when it comes to misleading statistics, graphs are the root of all evil. There are numerous types of graphs to choose from, and a poor choice can result in a misleading visual. Much worse, graphs can easily be manipulated to give the impression of a much better result. Every misleading graph violates at least one of the following rules:
• A graph should always be equally spaced on all of its axes.
• If an axis shows a quantity, it needs to start from zero. Otherwise, it fails to picture the correct relative increase or decrease in the value.
• In a pictograph, all symbols should be equally sized, and the value represented per symbol should be explicitly written.
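The second rule can be made quantitative. The sketch below (all numbers invented) computes the ratio of drawn bar heights when a chart's y-axis starts above zero:

```python
# Illustrative only: how truncating the y-axis inflates the apparent
# difference between two bars. Values and axis starts are made up.
old, new = 100.0, 105.0   # a real 5% increase

def apparent_ratio(a, b, axis_start):
    """Ratio of drawn bar heights when the axis starts at axis_start."""
    return (b - axis_start) / (a - axis_start)

print(apparent_ratio(old, new, 0))    # honest axis: 1.05
print(apparent_ratio(old, new, 95))   # truncated axis: 2.0, looks doubled
```

A 5% change drawn on an axis that starts at 95 appears as a 100% change, which is why the zero-baseline rule matters for quantities.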
Maps
Maps are used to illustrate spatial distributions of data, such as cancer rates by county. However, if the sample size differs from area to area, the result can be highly deceptive, especially in poorly-sampled areas. In addition, any adjustment, such as using posterior means, can cause further problems [4], and therefore spurious patterns may be found that do not exist in reality. Gelman and Price [5] state that although employing multiple imputed maps could help, such maps are still not appropriate for presenting results, as they may confuse the audience even more, and hence they cannot be used in general.

3. DATA COLLECTION
Broadened Problem Definition
The definition of a problem should always be specific and on point, and this inevitably shrinks the size of the domain and decreases the resulting incident rates. It is common sense that bigger numbers attract more attention, and since these two goals conflict, it is not rare to see definitions intentionally broadened to catch a bigger share of attention. For instance, in conducting a questionnaire for stranger-abduction research, two different strategies could be used. One questionnaire could include short-term disappearances, even if only for a few hours, and attempted offenses that may not have ended in actual abduction, to name a few, while the other includes only children who went missing and were later found dead. The former results in around 15,000 annual cases in the United States, while the latter shows about 550 [6].

Back in the 1990s, an exaggerated image of Australia's tourism industry was presented, portraying it as a million-job industry, while in reality the number of real jobs was around 200,000. In consequence, investors, business managers, politicians, and the government were misled, and through the resulting wrong decisions, national efforts and resources were wasted and the job market was over-supplied. Leiper [7] claims that this was partially because of the broad definition of the word tourist. Its definition from the WTTC, the WTO, and many other institutions includes all kinds of visitors who intend to stay somewhere for one or more nights for varying purposes, such as enjoying a holiday, making a pilgrimage, visiting family and friends, doing business, going to school, staying at a hospital, etc. However, if the definition were constrained to a fairer one that only factored in visitors who were there for leisure, the numbers would have been closer to reality.
Over-Discretized Sampling
Frequent sampling is a must when it comes to collecting helpful data. When the time frames between samples are longer than a reasonable threshold, the resulting dataset will fail to reveal valuable insights. As an example, hospital surge capacity, which is often measured by the number of empty beds that can be made immediately ready for use in an emergency, is usually reported annually, and this fails to take into account the daily changes in the number of patients and the within-year changes in bed supply. When measured daily, the result shows far less availability [8].
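A toy computation (all bed counts invented) makes the gap between the annual and daily views concrete:

```python
# Illustrative only: an annual average can hide the daily variation
# that matters for surge capacity. All numbers are made up.
daily_empty_beds = [40, 35, 5, 2, 38, 41, 3, 36, 39, 1]  # ten sampled days

annual_view = sum(daily_empty_beds) / len(daily_empty_beds)
daily_view = min(daily_empty_beds)

print(annual_view)  # 24.0 beds "available" on average
print(daily_view)   # but on the worst day only 1 bed was free
```

The annually reported figure suggests ample capacity, while the day-level data show that on some days the hospital was effectively full.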
Biased Sampling
When you want to use a sample, you have to ensure that it represents the larger population well. The job of a sample design is to guarantee such behavior. If a design tends to favor a specific outcome, it is considered a biased, bad design. Biased samples can be created when individuals themselves choose to be involved in the research, because the opinions of people who were not interested, or who simply did not care enough, are never collected. Also, interviewing people only in specific places, such as streets, bars, libraries, and gyms, or even the same places in different cities, can produce completely different results. To avoid biases, some good sampling methods have been developed, including but not limited to Simple Random Sampling (SRS), Stratified Random Sampling, Cluster Sampling, Systematic Sampling with Random Start, and Multistage Sampling [9].

4. DATA ANALYSIS AND STATISTICAL METHODS
Gross domestic product (GDP) is one of the most popular indicators used to measure a country's economic health [10]. It depends on factors such as consumer spending, government spending, businesses' capital spending, and the nation's total net exports, while each of these relies on many other elements related to the goods and services that the nation provides. Because of this innate complexity, calculating GDP is an extremely hard and time-consuming process. It is, in fact, so complex that even though each nation's expert statisticians have the job of doing the calculations, the results are from time to time revised multiple times afterward [11]. As an example, the U.S. Bureau of Economic Analysis released three different GDP rates for the second quarter of 2015 [12], [13], [14].

The good news is that not all statistical problems have that much complexity in terms of the number of factors involved, and many can be calculated and analyzed quite simply. But using the right methods is a must; otherwise, the output can lead to misleading deductions.
Common Scales
It is necessary to have a common scale in order to report and compare new results against baselines, and since there are usually multiple scales and methods to choose from, we should first fully understand the advantages and disadvantages of every one of them and choose the best for our work. Results of clinical trials can be compared using several approaches. The most popular are Relative Risk (RR), Absolute Risk Reduction (ARR), Odds Ratio (OR), log OR, and Number Needed to Treat (NNT). It is common to use NNT as the default measure because it is easier to comprehend, but Hutton [15] argues that this measure is biased. NNT lacks precision, as it rounds the result to the nearest integer, which can blur the differences among trials; it loses the time dimension; and it has some fundamental issues, such as the absence of a value that corresponds to no difference [16], [17], [18], [19]. Hutton [15] further explains that ARR is far more reliable and should be used instead.
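The rounding problem can be seen in a small sketch. The event rates below are invented, and `risk_control` and `risk_treated` denote the event rates in the control and treatment arms:

```python
# Hedged sketch of two of the trial-comparison scales named above.
# All event rates are hypothetical.
def risk_measures(risk_control, risk_treated):
    rr = risk_treated / risk_control    # Relative Risk
    arr = risk_control - risk_treated   # Absolute Risk Reduction
    nnt = 1.0 / arr                     # Number Needed to Treat
    return rr, arr, nnt

# Two hypothetical trials whose effects differ slightly:
_, arr1, nnt1 = risk_measures(0.0400, 0.0250)   # ARR = 1.50%, NNT ~ 66.7
_, arr2, nnt2 = risk_measures(0.0400, 0.0251)   # ARR = 1.49%, NNT ~ 67.1

# Rounded to the nearest integer, as NNT is usually reported, both trials
# collapse to the same value even though their ARRs still distinguish them.
print(round(nnt1), round(nnt2))   # 67 67
```

This is the loss of precision Hutton describes: the integer NNT hides a real, if small, difference that ARR preserves.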
P-Value and Power
Technically speaking, the P-value is the probability of getting a result at least as extreme as the observed one, assuming the null hypothesis is true [20]. Put simply, the null hypothesis is the theory that we want to reject, namely that our experiment has no effect, and the alternative hypothesis is the opposite. After calculating P, it is compared with a threshold, usually 0.05; if it is above the cut line, the null hypothesis is retained. But if the P-value falls below the cut line, it means that either the null hypothesis is false or, although it is true, a very rare event has happened, both indicating that our experimental result is significant. There are some well-known downsides to using P-values. First, even though we know that the smaller the P-value, the more significant the result, it cannot be used to show the scale of the difference between the null and alternative hypotheses. Second, if we fail to reject the null hypothesis, this does not indicate that our experiment has no effect whatsoever; it just tells us that there is not enough evidence that there is one. Thus, we can never be entirely sure. Third, the probability of the null hypothesis being true remains unknown, as we have already assumed that it is true. Besides these innate disadvantages, there is a bigger problem. P-hacking [21] is the act of intentionally manipulating the P-value or the experiment to reach a significant result. The simplest form is to choose a bigger threshold than the resulting P and somehow justify it. Another is to divide the experiment into several smaller ones (n) and check all of them individually in the hope of finding a desirable P-value in at least one. This causes family-wise error, and the problem is that, as the P threshold (α) remains untouched and the same as in the original experiment for all of them, the probability of getting at least one significant result just by chance becomes 1 − (1 − α)^n.
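The family-wise error rate above can be sketched numerically; α and n are illustrative values, and the α/n adjustment shown is the Bonferroni correction:

```python
# Sketch of the family-wise error rate: with n independent sub-experiments
# each tested at threshold alpha, the chance of at least one false positive
# is 1 - (1 - alpha)**n. The values of alpha and n are illustrative.
alpha, n = 0.05, 10

fwer = 1 - (1 - alpha) ** n
print(round(fwer, 3))            # 0.401: ~40% chance of a spurious "hit"

# Bonferroni correction: test each sub-experiment at alpha / n instead.
fwer_corrected = 1 - (1 - alpha / n) ** n
print(round(fwer_corrected, 3))  # 0.049: back under the original alpha
```

Even ten sub-experiments at the conventional 0.05 threshold give a roughly 40% chance of at least one false positive, which is why a corrected per-test threshold is needed.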
Then, usually, only the significant experiment gets published, without any context of the bigger picture, which is clearly misleading; there is a huge chance that the significance is only the result of a Type I error, and hence it most probably cannot be reproduced. The Bonferroni correction has to be used in order to avoid family-wise error; it suggests using α/n as the new threshold for each of the smaller experiments [22].

In this context, a Type II error means failing to reject the null hypothesis when it is false. While α corresponds to getting a false positive, i.e., being a victim of a Type I error, β is the probability of a Type II error, i.e., getting a false negative. Power is the probability of detecting an effect when one exists, and its value is 1 − β. In a given problem, α + β equals a fixed number, so there is a trade-off between them, and depending on the problem, the researcher has to decide which error should be weighted more. Increasing the sample size benefits both, as it makes the distributions more accurate and narrower and thereby reduces the possibility of both errors, making that fixed number smaller and resulting in greater power and a more precise P-value.

Robust Statistics
Classical statistics, also called parametric statistics, requires data to be normally distributed. Even though there is no guarantee that the data in small samples are normally distributed, it is more often than not assumed that they are, and the calculations are done on that basis. One alternative is to use non-parametric statistics, which do not have this requirement, but they have their own set of rules as well. Using parametric statistics on data that are not normally distributed, and using non-parametric statistics on normally distributed data, both result in less power [23]. As mentioned before, one way to increase power, which happens to be the hardest way as well, is to increase the sample size. Another is to value the Type II error more and, by raising α, make β smaller and thereby gain power. A third alternative is to use more powerful statistics, namely robust statistics [24], which propose principled ways to overcome parametric and non-parametric issues. For instance, robust statistics suggests using effect sizes and confidence intervals in addition to the P-value, since, unlike the latter, their results stay the same if α changes [25], [20].

Outliers can be found in almost every population. If ignored, they can violate the assumption of normal distribution and result in less power; if removed manually, they can cause non-independence in the remaining data; and in some cases they can be hard to find in the first place. One solution that robust statistics proposes is to trim both ends and readjust the normal equations to compensate for the effect of non-independence. It also recommends ways to do t-tests, correlations, one-way ANOVA, etc. in a robust manner [23]. It is worth mentioning that robust statistics might lead to a different outcome than conventional statistics, in terms of the significance or non-significance of the research, but it is most certainly more accurate.

5. FALLACIOUS DEDUCTIONS
Statistics is a subset of mathematics, and it should not come as a surprise that understanding, mapping, and modeling its concepts may not be a straightforward process. Even experts can reach results that contradict each other, and at times, such as in the case of Sally Clark [26], the deductions made from those results can be a matter of life or death.

Misinterpretations can affect future research, resources, and funding as well. For instance, while analyzing the spread of bullet holes on American planes that returned from World War II missions, it was noticed that most hits were around the fuselage and fewest around the engines. The initial reasoning was that, in order to increase the survival rate, more armor should be placed on the fuselage. Wald [27] refuted that conclusion and explained that the reason most planes had more bullet hits on their fuselage was that the ones hit in their engines usually could not make it home, so the observed distribution was uneven. Thus, to increase the chance of survival, more armor was needed on the engine surfaces.
Confusing Correlation with Causality
Scatter plots and covariance matrices are used to find relationships between pairs of features in the data. If their relationship is linear, Pearson's correlation (−1 ≤ r ≤ 1) [28] can be useful: just by looking at its sign and value, we can find out whether the correlation is positive or inverse and how strong it is. If a strong correlation is found, it could mean two things. First, X causes Y or Y causes X. Second, either there is another factor Z that causes both X and Y, or they are related by complete coincidence. The latter is called spurious correlation [29] and means that even though the two features seem to correlate with each other, there is no way one can cause the other; this confusion between correlation and causality can lead to erroneous conclusions. Besides, not all correlations are linear, and therefore scatter plots should always be analyzed.

Ecological Fallacy
The ecological fallacy occurs when a deduction about an individual's characteristics is made based on results found for a group to which that person belongs [30], [31]. Data are collected at different levels, such as a continent, a country, or a state. As previously discussed, features can be analyzed in terms of correlation and then lead to discoveries. However, these findings are only valid at the level at which they were analyzed; any deduction made from them about lower-level groups or individuals cannot be trusted.
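A hand-built toy dataset (all values invented) shows how sharply the group level and the individual level can disagree: pooling two groups can even reverse the sign of the correlation found within each group.

```python
# Toy illustration of the ecological fallacy with invented data: the
# pooled correlation is strongly positive even though the relationship
# *within* each group is perfectly negative.
def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

group_a = ([1, 2, 3], [4, 3, 2])   # within-group r = -1.0
group_b = ([6, 7, 8], [9, 8, 7])   # within-group r = -1.0
pooled = (group_a[0] + group_b[0], group_a[1] + group_b[1])

print(pearson_r(*group_a), pearson_r(*group_b))  # -1.0 -1.0
print(round(pearson_r(*pooled), 2))              # 0.81: the sign flips
```

A conclusion about individuals drawn from the pooled (group-level) r of 0.81 here would be exactly backwards, which is the fallacy in miniature.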
Will Rogers Phenomenon
Assume that there are two groups and all the people in group A have lower IQs than all the people in group B. If the least intelligent person in group B moves to group A, the average IQ of both groups rises, since B is rid of the member who was dragging its average down, while A profits from gaining its new smartest member. This effect is called the Will Rogers phenomenon and has caused erroneous conclusions, especially in the area of cancer screening [32], [33], [34]. Feinstein [32] elaborates on the very first instance of this phenomenon by showing that reported improvements in the survival rates of lung cancer patients were flawed. He explains that patients were originally divided into two groups: those whose cancer was localized, and those with higher stages of cancer and metastasis. With advances in technology, micro-metastases could be found in people from the localized group, moving those patients to the other group. This alone improved the survival rates in both groups, since with this migration the first group lost its patients with the lowest life expectancy and the second group gained a few patients with better overall health than its usual patients.

6. INDECENT REPORTING
When reporting results, it is conventional to use percentages, but they can be highly misleading, as they give no information about the initial and final values, or at least the difference that has been made. A bigger picture must always be clearly drawn so the audience can understand how significant the new result is. Also, results should never, under any circumstances, be rounded up to a larger number just to attract more attention. Another bad practice is hiding context, or not providing enough of it, in order to mislead the audience into believing that the results were significant; an example, as mentioned before, is reporting a sub-experiment without using the Bonferroni correction and without describing the original experiment. Sowell [35] also discusses several such cases in household income reports, such as not mentioning the change that has occurred in the number of people per family when comparing against old statistics. Assume that in the past each household had 6 people and in a later year this number decreased to 4. Even if income per person increased by 25 percent, the total household income shows an economic decline, since it now equals the amount that 5 people could make in the past.

7. OPEN SCIENCE AS A POSSIBLE SOLUTION
Open science is a set of ideas and principles promoting openness and transparency of research at every stage of its life cycle while preserving the integrity of the researcher's work. It starts with sharing the main idea with the intention of inducing collaboration with other researchers, funders, publishers, industries, and institutions, and then quickly advancing the research to its final stage. Every note, dataset, piece of code, software artifact, and end result, including the paper itself, should be publicly available to everyone and ready to be reused. The long-term goal of open science is to encourage good, high-quality research regardless of its outcome, by valuing the research process more than its result and judging researchers based on their work and contribution to the community instead of relying entirely on formal publications. Open science allows and even inspires researchers to replicate other studies and also to publish negative results, which it believes can be very informative. This acts against the publish-or-perish culture [36], which motivates researchers to use biases, P-hacking, and all the other bad practices mentioned in this paper to advance their careers, and instead enables researchers to do research driven by their own joy and curiosity [37], [38], [39].

8. CONCLUSION AND DISCUSSION
A tremendous amount of data resides in every research project. It is almost impossible to analyze that data and find something meaningful in it without using statistics. However, if used incorrectly, whether intentionally or unintentionally, statistics can produce erroneous and misleading conclusions and thus make the research unreproducible. It is important to know the common causes of this problem in order to avoid it. In this review, we attempted to summarize the most common flaws that can occur when using statistics, and although it is unfeasible to describe each bad practice comprehensively, we believe that, all in all, this paper gives a rough overview of them and of how they can be avoided.

9. CONFLICT OF INTEREST
The authors declare that there is no conflict of interest regarding the publication of this article.

REFERENCES
[1] M. Ware and M. Mabe, "The STM report: An overview of scientific and scholarly journal publishing," 2015.
[2] M. Baker, "1,500 scientists lift the lid on reproducibility," Nature News, vol. 533, no. 7604, p. 452, 2016.
[3] J. M. Bland and D. G. Altman, "Misleading statistics: errors in textbooks, software and manuals," International Journal of Epidemiology, vol. 17, no. 2, pp. 245–247, 1988.
[4] L. H. Moulton, B. Foxman, R. A. Wolfe, and F. K. Port, "Potential pitfalls in interpreting maps of stabilized rates," Epidemiology, pp. 297–301, 1994.
[5] A. Gelman and P. N. Price, "All maps of parameter estimates are misleading," Statistics in Medicine, vol. 18, no. 23, pp. 3221–3234, 1999.
[6] J. Best, "Missing children, misleading statistics," The Public Interest, vol. 92, p. 84, 1988.
[7] N. Leiper, "A conceptual analysis of tourism-supported employment which reduces the incidence of exaggerated, misleading statistics about jobs," Tourism Management, vol. 20, no. 5, pp. 605–613, 1999.
[8] D. DeLia, "Annual bed statistics give a misleading picture of hospital surge capacity," Annals of Emergency Medicine, vol. 48, no. 4, pp. 384–388, 2006.
[9] W. G. Madow, "Elementary sampling theory," 1968.
[10] "Gross domestic product, second quarter 2019 (third estimate)," U.S. Bureau of Economic Analysis news release, September 26, 2019.
[11] R. Greenaway-McGrevy, B. Grimm, and D. Fixler, "The revisions to GDP, GDI, and their major components," 2014.
[12] "Gross domestic product, 2nd quarter 2015 (advance estimate)," U.S. Bureau of Economic Analysis news release, July 30, 2015.
[13] "Gross domestic product, 2nd quarter 2015 (second estimate)," U.S. Bureau of Economic Analysis news release, August 27, 2015.
[14] "Gross domestic product, 2nd quarter 2015 (third estimate)," U.S. Bureau of Economic Analysis news release, September 25, 2015.
[15] J. L. Hutton, "Misleading statistics," Pharmaceutical Medicine, vol. 24, no. 3, pp. 145–149, 2010.
[16] J. Hutton, "Number needed to treat: properties and problems," Journal of the Royal Statistical Society: Series A (Statistics in Society), vol. 163, no. 3, pp. 381–402, 2000.
[17] E. Lesaffre and G. Pledger, "A note on the number needed to treat," Controlled Clinical Trials, vol. 20, no. 5, pp. 439–447, 1999.
[18] P. M. Christensen and I. S. Kristiansen, "Number-needed-to-treat (NNT)–needs treatment with care," Basic & Clinical Pharmacology & Toxicology, vol. 99, no. 1, pp. 12–16, 2006.
[19] A. Stang, C. Poole, and R. Bender, "Common problems related to the use of number needed to treat," Journal of Clinical Epidemiology, vol. 63, no. 8, pp. 820–825, 2010.
[20] C. Mellis, "Lies, damned lies and statistics: clinical importance versus statistical significance in research," Paediatric Respiratory Reviews, vol. 25, pp. 88–93, 2018.
[21] M. L. Head, L. Holman, R. Lanfear, A. T. Kahn, and M. D. Jennions, "The extent and consequences of p-hacking in science," PLoS Biology, vol. 13, no. 3, p. e1002106, 2015.
[22] E. W. Weisstein, "Bonferroni correction," 2004.
[23] J. Larson-Hall, "Our statistical intuitions may be misleading us: Why we need robust statistics," Language Teaching, vol. 45, no. 4, pp. 460–474, 2012.
[24] R. R. Wilcox, Introduction to Robust Estimation and Hypothesis Testing. Academic Press, 2011.
[25] G. M. Sullivan and R. Feinn, "Using effect size—or why the p value is not enough," Journal of Graduate Medical Education, vol. 4, no. 3, pp. 279–282, 2012.
[26] R. Nobles and D. Schiff, "Misleading statistics within criminal trials: the Sally Clark case," Significance, vol. 2, no. 1, pp. 17–19, 2005.
[27] A. Wald, "A reprint of 'A method of estimating plane vulnerability based on damage of survivors'," tech. rep., Center for Naval Analyses, Alexandria, VA, Operations Evaluation Group, 1980.
[28] P. Sedgwick, "Pearson's correlation coefficient," BMJ, vol. 345, p. e4483, 2012.
[29] H. A. Simon, "Spurious correlation: A causal interpretation," Journal of the American Statistical Association, vol. 49, no. 267, pp. 467–479, 1954.
[30] S. Piantadosi, D. P. Byar, and S. B. Green, "The ecological fallacy," American Journal of Epidemiology, vol. 127, no. 5, pp. 893–904, 1988.
[31] D. A. Freedman, "Ecological inference and the ecological fallacy," International Encyclopedia of the Social & Behavioral Sciences, vol. 6, no. 4027-4030, pp. 1–7, 1999.
[32] A. R. Feinstein, D. M. Sosin, and C. K. Wells, "The Will Rogers phenomenon: stage migration and new diagnostic techniques as a source of misleading statistics for survival in cancer," New England Journal of Medicine, vol. 312, no. 25, pp. 1604–1608, 1985.
[33] O. N. Gofrit, K. C. Zorn, G. D. Steinberg, G. P. Zagaja, and A. L. Shalhav, "The Will Rogers phenomenon in urological oncology," The Journal of Urology, vol. 179, no. 1, pp. 28–33, 2008.
[34] P. C. Albertsen, J. A. Hanley, G. H. Barrows, D. F. Penson, P. D. Kowalczyk, M. M. Sanders, and J. Fine, "Prostate cancer and the Will Rogers phenomenon," Journal of the National Cancer Institute, vol. 97, no. 17, pp. 1248–1253, 2005.
[35] T. Sowell, Economic Facts and Fallacies. Basic Books, 2011.
[36] A. G. Bedeian, S. G. Taylor, and A. N. Miller, "Management science on the credibility bubble: Cardinal sins and various misdemeanors," Academy of Management Learning & Education, vol. 9, no. 4, pp. 715–725, 2010.
[37] J. C. Molloy, "The open knowledge foundation: open data means better science," PLoS Biology, vol. 9, no. 12, p. e1001195, 2011.
[38] B. A. Nosek, G. Alter, G. C. Banks, D. Borsboom, S. D. Bowman, S. J. Breckler, S. Buck, C. D. Chambers, G. Chin, G. Christensen, et al., "Promoting an open research culture," Science, vol. 348, no. 6242, pp. 1422–1425, 2015.
[39] M. Woelfle, P. Olliaro, and M. H. Todd, "Open science is a research accelerator,"