Can visualization alleviate dichotomous thinking? Effects of visual representations on the cliff effect

Abstract

Common reporting styles for statistical results in scientific articles, such as p-values and confidence intervals (CI), have been reported to be prone to dichotomous interpretations, especially with respect to the null hypothesis significance testing framework. For example, when the p-value is small enough or the CIs of the mean effects of a studied drug and a placebo are not overlapping, scientists tend to claim significant differences while often disregarding the magnitudes and absolute differences in the effect sizes. This type of reasoning has been shown to be potentially harmful to science. Techniques relying on the visual estimation of the strength of evidence have been recommended to reduce such dichotomous interpretations, but their effectiveness has also been challenged. We ran two experiments on researchers with expertise in statistical analysis to compare several alternative representations of confidence intervals and used Bayesian multilevel models to estimate the effects of the representation styles on differences in researchers' subjective confidence in the results. We also asked for the respondents' opinions and preferences regarding representation styles. Our results suggest that adding visual information to the classic CI representation can decrease the tendency towards dichotomous interpretations - measured as the 'cliff effect': the sudden drop in confidence around p-value 0.05 - compared with classic CI visualization and textual representation of the CI with p-values. All data and analyses are publicly available at https://github.com/helske/statvis.


Are You Sure You’re Sure? - Effects of Visual Representation on the Cliff Effect in Statistical Inference

Jouni Helske, Satu Helske, Matthew Cooper, Anders Ynnerman, Member, IEEE, and Lonni Besançon

Abstract: Common reporting styles for statistical results, such as p-values and confidence intervals (CI), have been reported to be prone to dichotomous interpretations, especially with respect to null hypothesis testing frameworks. For example, when the p-value is small enough or the CIs of the mean effects of a studied drug and a placebo are not overlapping, scientists tend to claim significant differences while often disregarding the magnitudes and absolute differences in the effect sizes. Techniques relying on the visual estimation of the strength of evidence have been recommended to reduce such dichotomous interpretations but their effectiveness has also been challenged. We ran two experiments to compare several alternative representations of confidence intervals and used Bayesian multilevel models to estimate the effects of the representation styles on differences in subjective confidence in the results. We also asked the respondents’ opinions and preferences in representation styles. Our results suggest that adding visual information to classic CI representation can decrease the tendency towards dichotomous interpretations (measured as the ‘cliff effect’: the sudden drop in confidence around p-value 0.05) compared with classic CI visualization and textual representation of the CI with p-values. As a contribution to open science, our data and all analyses are publicly available at https://github.com/helske/statvis.

Index Terms: Statistical inference; visualization; cliff effect; confidence intervals; hypothesis testing; Bayesian inference.

1 Introduction

One of the most common research questions in many scientific fields is “Does X have an effect on Y?”, where, for example, X is a new drug and Y is some disease. Often the question is reduced to “Does the average effect of X differ from zero?”, or “Does X significantly differ from Z?”. There are various statistical approaches available for answering this question, and many ways to report the findings from such analyses. In many fields, null hypothesis significance testing (NHST) has long been the de facto standard approach. NHST is based on the idea of postulating a “no-effect” null hypothesis (H0) which the experimenter aims to reject. An appropriate test statistic, based on assumptions about the data and model, is then calculated together with the corresponding p-value, the probability of observing a result at least as extreme as the one observed under the assumption that H0 is true. Small p-values indicate incompatibility of the data with the null model, again assuming that the assumptions used in calculating the p-value hold.

The ongoing “replication crisis” [1], especially in social and life sciences, has produced many critical comments against arbitrary p-value thresholds and significance testing in general (e.g., [2], [3], [4]). As a solution to avoid so-called dichotomous thinking (a strong tendency to divide results into significant or non-significant), some are even arguing for a complete ban on NHST and p-values. Such a policy has also been adopted by some journals: e.g., in 2015, the Journal of Basic and Applied Social Psychology banned both p-values and confidence intervals (CIs) [5], and more recently the Journal of Political Analysis banned the use of p-values [6].

Despite the critique, significance testing, in one form or another, is likely to remain a part of a scientist’s toolbox. Because many of the problems with NHST are due to misunderstandings among those who conduct statistical analysis as well as among those who interpret results, work can also be conducted on making it easier to avoid common pitfalls of NHST, either by altering the way analyses are conducted [7], [8], [9] or how the results are presented [10], [11], [12], [13], [14]. Instead of arguing for better methodological solutions, such as Bayesian approaches, here we study whether different styles of visual representation of common statistical problems could help to alleviate this dichotomous thinking, which can be approximated by studying the so-called cliff effect. The cliff effect is a term used for the large difference in how the results are interpreted despite only small numerical differences in the estimate and p-value [15] (e.g., an estimated effect of 0.1 with a corresponding p-value of 0.055 may be deemed not significant while an effect of 0.11 with a p-value of 0.045 may be claimed to be significant).

To study the potential cliff effects of various representation styles for statistical results, we conducted two experiments on researchers who are experienced in using and interpreting statistical analyses. We showed participants results from artificial experiments using different representation styles and asked the respondents how confident they were that the results showed a positive effect (experiment 1) or a difference between two groups (experiment 2). We also asked the respondents to give comments on the different styles and to rank them according to their personal preference.

We analysed the answers from the experiments using Bayesian multilevel models. These results are easy to interpret and at the same time allow us to avoid the problems we aimed at studying (i.e., dichotomous thinking and the cliff effect). Our results suggest that despite the increased debate around NHST and related concepts, the problem of dichotomous thinking persists in the scientific community, but that certain visualization styles can help to reduce the cliff effect. Even though this paper is particularly aimed at the field of human computer interaction and visualization research (HCI/VIS), the results naturally apply to the whole scientific community using statistical reasoning. As a contribution to open science, our data and all analyses are publicly available and fully reproducible.

• This work has been submitted to the IEEE for possible publication. Copyright may be transferred without notice, after which this version may no longer be accessible.
• J. Helske was with Department of Science and Technology, Linköping University, Campus Norrköping, SE-602 74 Norrköping, Sweden, and now with Department of Mathematics and Statistics, University of Jyväskylä, FI-40014 Jyväskylä, Finland. E-mail: jouni.helske@iki.fi
• S. Helske is with Department of Social Research, University of Turku, FI-20014 Turku, Finland.
• M. Cooper, A. Ynnerman, and L. Besançon are with Department of Science and Technology, Linköping University, Campus Norrköping, SE-602 74 Norrköping, Sweden.

2 Background and Related Work

In this paper, our main focus lies in whether and how different visualizations can help in reducing the cliff effect when making interpretations on inferential statistics. We first briefly present the basic definition and interpretation of the confidence interval (CI), which is a common choice for assessing the uncertainty of a point estimate (such as the sample mean) and has sometimes been suggested to reduce dichotomous interpretations. We then discuss the problem of dichotomous thinking before moving on to presenting related literature and the visual representations used in our experiments.

Given a sample of values $x_1, \ldots, x_n$ from a normal distribution with unknown mean $\mu$ and variance $\sigma^2$, the 95% confidence interval for the mean is computed using the sample mean $\bar{x}$, sample standard deviation $s$, sample size $n$, and the t-distribution:

$$\bar{x} \pm t_{\alpha/2}(n-1)\,\frac{s}{\sqrt{n}}, \tag{1}$$

where $t_{\alpha/2}(n-1)$ is the critical value from the t-distribution with $n-1$ degrees of freedom and significance level $\alpha$ (typically 0.05). In repeated sampling, 95% of the intervals constructed this way contain the true mean $\mu$. It is important to note that, given a single sample and the corresponding CI, we cannot infer whether the true population mean, $\mu$, is contained within the CI or not [16], although the CI has a direct connection to NHST in that the 95% CI represents the range of values of $\mu$ for which the difference between $\mu$ and $\bar{x}$ is not statistically significant at the 5% level.

Let us suppose that through an experiment we obtain a p-value of p = [...] p-value of 0.058, many researchers, despite the small difference, would, following the recommendations of colleagues and textbooks, consider this as not enough evidence against H0 [17]. This type of reasoning, often called dichotomous thinking or dichotomous inference, has been shown to be potentially harmful to science [2], [18], [19], [20], [21]. It has been said to be one of the reasons for the replication crisis [18], [20], [22] or to lead to “absurd replication failures [with] compatible results” [2]. While dichotomous thinking has been heavily criticized by scholars (e.g., [10], [18], [19], [23], [24]), it seems to be persistent in many fields including HCI [25] and empirical computer science [20].

In 2016, the confusion, misuse and critique around p-values led the American Statistical Association (ASA) to issue a statement on p-values and statistical significance. ASA stated that proper inference must be based on full and transparent reporting and computing, and that a single number (p-value) is not equal to scientific reasoning.
Many other authors have criticized the whole NHST approach due to increased dichotomous thinking based on arbitrary thresholds [2], [17], [18], [24], [26], [27], common misinterpretations of p-values (e.g., the fallacy of accepting H0 [28], reading p-values as the probability that H0 is true), as well as the several questionable research practices that often come with the use of NHST including p-hacking (testing a number of hypotheses until a low p-value is found), HARKing (presenting a post-hoc hypothesis as an a priori hypothesis), selective outcome reporting, and the file-drawer effect (limiting publication to only statistically significant results) [20], [29], [30], [31], [32], [33], [34]. Additionally, sometimes p-values are reported without effect sizes, although a p-value itself does not help readers determine the practical importance of the presented results. It should be noted that it is likely that many of these issues relating to the data-led analysis (see the “garden of forking paths” [35]) are typically not intentional, and can occur in a broader scope than just NHST.

Due to all of the issues around p-values, some researchers have recommended either to replace or complement them with CIs [10], [24], [36], [37]. The argument is that CIs could reduce dichotomous interpretations as they represent both the effect size and the sampling variation around this value. CIs, however, are also prone to misinterpretation, simply because their interpretation is not very intuitive [38], [39]. CIs have also been reported to lead to dichotomous thinking [25], [40], [41].

The cliff effect is a term coined by Rosenthal and Gaito to describe the sudden drop of confidence that a real effect exists just above p = .05 [15]. It can be used as a proxy to measure dichotomous inferences [40]. Rosenthal and Gaito’s initial study demonstrated the cliff effect on 19 researchers in psychology, and their findings were later replicated by Nelson et al. [42] on a much larger sample (85 psychologists). However, a more recent work by Poitevineau and Lecoutre [43] states that the claims on the cliff effect might be overstated and that only a small fraction of their participants adopted an all-or-none strategy. Nonetheless, a later study highlighted that even statisticians were not immune to misinterpretation of p-values and dichotomous thinking [44]. However, due to the previous focus on restricted populations (mainly psychologists) and also because some of the details of the experiments have not been fully presented (such as the exact question asked of the participants), it is difficult to assess whether these findings would hold in a more general population of researchers.

Previous studies on interpretation of p-values and CIs have suggested that there are two to four confidence interpretation profiles [13], [40], [43]. For example, Lai [40] manually categorized respondents’ confidence profiles into four different categories, although discarding a large proportion of respondents whose answers did not clearly fit into any category. While some individual variation and hybrid interpretation styles are likely to exist, due to historical reasons it is likely that the main profiles are the all-or-none category (related to Neyman-Pearson significance testing), and the gradually decreasing confidence category (related to Fisher’s significance testing approach). See, for example, [45] for descriptions of the original approaches to significance testing by Fisher, and Neyman and Pearson, as well as their connection to current NHST practice.

The Bayesian paradigm and replacing CIs with credible intervals have been suggested as a solution to the problems with CIs and p-values [7], [8], [46], [47]. Compared to the CI, the credible interval has a more intuitive interpretation: given the model and the prior distribution of the parameter (e.g., mean), the 95% credible interval contains the unknown parameter with 95% probability. Or perhaps even better, one can present the whole posterior distribution of the parameter of interest. Despite the benefits of the Bayesian approach, p-values and CIs, for all their flaws, are likely to remain in use in many scientific fields. Hence it is of general interest to study whether the problems relating to dichotomous thinking can be alleviated by changing their typical representation styles.

Several visualization techniques have been designed to show the uncertainty of the estimation, with several advantages over the communication of a sole point estimate [48], [49]. Showing the theoretical or empirical probability distribution of the variable of interest is a commonly used technique. For example, probability density plots are often used for describing known distributions such as the Gaussian distribution, or estimated density functions based on samples of interest (e.g., observed data or samples from posterior distributions in a Bayesian setting).

Violin plots [50] (also called eyeball plots in [51]) are rotated and mirrored kernel density plots, so that the uncertainty is encoded as the width of the ‘violin’ shape. Raindrop plots [52] are similar to violin plots but are based on log-density. The gradient plot uses opacity instead of shape to convey the uncertainty (e.g., [12]), while quantile dot plots [53], [54] are discrete analogs of the probability density plot. Various alternative representation styles specifically for CIs are commonly used (see, for example, [55]). In order to remedy the misunderstanding and misinterpretation of CIs, Kalinowski et al. [13] designed the cat’s eye confidence interval, which uses normal distributions to depict the relative likelihood of values within the CI (based on the Fisherian interpretation of the CI). Violin plots with additional credible interval ranges are also used to depict arbitrarily shaped (univariate) posterior distributions based on posterior samples, for example in the tidybayes R package (coined as the eyeplot) [56]. Kale et al. [57], [58] studied animated hypothetical outcome plots for interactive dissemination of statistical results. Going even further, Dragicevic et al. [14] propose the use of interactive explorable statistical analyses in research documents to increase their transparency. For a systematic review of uncertainty visualization practices, see Hullman et al. [59].

Some past studies have focused on comparing different visual representations of statistical results. Tak et al. [60] examined seven different visual representations of uncertainty on 140 non-experts. Correll and Gleicher [12] studied four different visualization styles for mean and error in several settings. Kalinowski et al. [13] compared students’ intuitions when interpreting classic CI plots and cat’s eye plots. Finally, the recent study by Hofman et al. [58] focused on the impact of presenting inferential uncertainty in comparison to presenting outcome uncertainty, and investigated the effect of different visual representations of effect sizes. With the exception of [13], these studies have focused on testing lay-people, a population which can be expected to differ from researchers who have been trained to use and interpret p-values and CIs in their work. It is also noteworthy that in [12] participants were given information on the uncertainty corresponding to the sampling distribution of the mean, but were then asked about the likelihood of a future observation (which, in contrast, relates to the sampling distribution of the observations).

3 Research Questions

Taking inspiration from past research and some of the approaches listed in Section 2, our work focuses on evaluating the presence and magnitude of the cliff effect in textual and visual representation styles among researchers trained in statistical analysis.
Our main goals were to investigate
• whether the cliff effect can be reduced by using different visual representation styles instead of textual information and
• how researchers’ opinions on, and preferences between, different representation styles differ.

More specifically, we were interested in whether the previously documented cliff effect in scientific reporting is reduced when the textual representation with explicit p-value is replaced with a traditional visualization of CI, and whether more complex visualization styles for the CI reduce the cliff effect in comparison with classic CI visualization. Regarding the former question, in line with previous research [40], [41], we expected to find that CIs would not reduce the cliff effect, whereas regarding the latter question our hypothesis was that more complex visualization styles could reduce the cliff effect.

As our interest was in scientific reporting, we limited our sample to researchers with an understanding and use of statistics unlikely to be present with lay-people, and focused on static visualizations applicable in traditional scientific publications.

Our study is related to the work by Correll and Gleicher [12] in that we study similar visualization techniques but with a focus on different target populations and research questions. Like us, Belia et al. [39] focused on the understanding of statistical results by researchers (in their case restricted to medicine and psychology). While we focus on the cliff effect across different visualization styles, their focus was on defining the threshold of statistical significance from classic representations of error bars in comparisons between multiple samples.

4 One-sample Experiment

In the first experiment we are interested in potential differences in the interpretation of results of an artificial experiment when participants are presented with textual information of the experiment in the form of a p-value and a CI, a classic CI plot, a gradient CI plot, or a violin CI plot (see Fig. 1 and the descriptions in subsection 4.1). The setting is simple yet common: we have a sample of independent observations from some underlying population, and we wish to infer whether the unknown population mean differs from zero.

p-value

Our first representation is text consisting of the exact p-value of a two-sided t-test, the sample mean estimate, and the lower and upper limits of the 95% CI (see the leftmost box of Fig. 1 for the participant’s view). This style is concise and contains information about the effect size and the corresponding variation (width of the CI), while the p-value provides evidence in the hypothesis testing style. While this format provides information on the effect size and uncertainty together with the p-value, it can be argued that, due to the strong tradition in NHST, the inclusion of a p-value can cause dichotomous thinking even when accompanying CI information is provided. While the sample size is not stated in this format, that information was provided separately in our experiment for each condition as a part of the explanatory text.

[Figure 1, textual panel: “Mean weight increase 0.817kg, 95% CI: [−0.036kg, 1.669kg], p = 0.06 (2-sided t-test)”]

Fig. 1. Representation styles used in the experiments: Textual version with p-value, classic 95% confidence interval (CI), gradient CI plot, violin CI plot, and discrete violin CI plot.

Confidence intervals and sample means are commonly visualized as line segments with end points augmented with horizontal lines (see Fig. 1). Compared with textual information, visual representation could be better at conveying the uncertainty. While the width of the horizontal lines of the CI does not have semantic meaning, it is sometimes argued (although we have found no studies to suggest this) that their width emphasises the limits of the CI and increases dichotomous inference, and that intervals without the horizontal lines should be preferred. We chose the more traditional design (see the second box from left in Fig. 1) as it is still commonly used and is also the default option in many statistical analysis packages such as SPSS.

In order to reduce the dichotomous nature of the classic CI visualization, we test the effect of using multiple overlaid confidence intervals with varying coverage levels and opacity. This is fairly common when presenting prediction intervals for future observations [61], but less so in the case of CIs. While using only a few overlaid CIs (e.g., 80%, 90% and 95%) is a more common practice, we decided to replicate the gradient plot format used in previous approaches [12], which provides more emphasis on the 95% interval and thus is more comparable with the classic CI approach. Our gradient CI plot contains a colored area of the 95% CI complemented with gradually colored areas corresponding to 95.1% to 99.9% CIs (with 0.1 percentage point increments), overlaid with a horizontal line corresponding to the sample mean (see the middle box in Fig. 1). The colors were taken from ColorBrewer’s 3-class BuGn palette [62], ranging approximately from eucalyptus to light cyan. This format provides additional information, but gradual color changes can be difficult to interpret accurately, and from a technical point of view this format is also harder to create than classic CIs.

t-violin Plot (Violin CI Plot)

While the gradient CI plot gives information about the uncertainty beyond the 95% CI, we claim that the use of the rectangular regions with constant widths can be misleading. Therefore, as our fourth representation format (inspired by [12], [13]) we combine the gradient CI plot and the density of the t-distribution used in constructing the CIs (see the second box from right in Fig. 1). More specifically, in the violin CI plot the shape corresponds to the case of computing a sequence of confidence intervals with very fine increments, with the width of each CI computed using the underlying t-distribution. The width of the violin at point $y$ is

$$p\!\left(\frac{\sqrt{n}\,(y - \bar{x})}{s}\right)\frac{\sqrt{n}}{s}, \tag{2}$$

where $p$ is the probability density function of the t-distribution with $n-1$ degrees of freedom, $\bar{x}$ is the sample mean, and $s$ is the standard deviation.

In the second experiment we also consider a more discretized version of the violin CI plot with gradually colored areas corresponding to the 80%, 85%, 90%, 95% and 99.9% CIs (see the rightmost box in Fig. 1).

Violin CI plots are more challenging to create, and the probability density function style can lead to erroneous probability interpretations for which CIs cannot provide answers. On the other hand, the additional visual clues due to the shape can help overcome the difficulty of interpreting gradient colors.

The experiment was run as an online survey. Its preregistration is available on OSF (see footnote 1). As the preregistration states, the number of participants was not decided in advance but, instead, we aimed for the maximum number of participants in a given time frame. The end date of the experiment was fixed to 11 March 2019, so the survey was open for 21 days before we started to analyse the data. As stated in section 3, our goal, contrary to most of the previous work, was to understand how researchers interpret statistical results, and therefore we aimed at recruiting academics familiar with statistical analysis.
To recruit participants across various scientific disciplines, we initially contacted potential participants via email in several fields (namely Human Computer Interaction, Visualization, Statistics, Psychology, and Analytical Sociology, using personal networks), and the survey was also shared openly using the authors’ academic profiles on Twitter and suitable interest groups on Reddit, LinkedIn, and Facebook.

The eligibility criteria were 1) You understand English; 2) You are at least 18 years old; 3) You have at least a basic understanding of hypothesis testing and confidence intervals; 4) You use statistical tools in your research projects; 5) You are not using a handheld device such as tablet or phone to fill out the survey. To evaluate the validity of our sample, we asked for background information including participants’ age, scientific field, highest academic degree, length of research experience, and data analysis tools commonly used. The code for the experiment is available in the supplementary materials on GitHub (see footnote 2).

[Figure 2: weight increase (kg) plotted against p-value (0.001, 0.01, 0.04, 0.05, 0.06, 0.1, 0.5, 0.8) for sample sizes 50 and 200]

Fig. 2. Configuration used in the one-sample experiment. See text for details.

There are multiple potential factors which could (although not necessarily should) have an effect in interpreting results of this simple experiment: p-value, total length of the confidence interval, effect size, sample size, and representation style. Since our focus was on the representation styles, and because we wanted to keep the survey short in order to increase the number of responses, we used a fixed set of p-values (0.001, 0.01, 0.04, 0.05, 0.06, 0.1, 0.5, 0.8), and a fixed standard deviation of 3. Once the sample size was also defined, the sample mean was fully determined by these values. We used two sets of questions, one with a sample size of n = 50 and another with n = 200 [...] p-values except in the textual representation style.

During the experiment we displayed each trial to each participant (one at a time), and asked the following question: “A random sample of 200 adults from Sweden were prescribed a new medication for one week. Based on the information on the screen, how confident are you that the medication has a positive effect on body weight (increase in body weight)?”. Participants answered on a continuous scale (100 points between 0 and 1, the numerical value was not shown) using a slider with labelled ends (“Zero confidence”, “Full confidence”), which was explained to the participants as “Leftmost position of the slider corresponds to the case ‘I have zero confidence in claiming a positive effect,’ whereas the rightmost position of the slider corresponds to the case ‘I am fully confident that there is a positive effect.’” The slider’s ‘thumb’ was hidden at first, in order to avoid any possible bias due to its initial position. It only became visible when the participant clicked on the slider. Finally, participants could not proceed to the next question until the slider position was set.

Our small pilot study suggested that it was hard to understand the violin CI plot due to its non-standard meaning (participants were prone to misread the figure as a typical violin plot of empirical density of the data). Therefore, in order to explain the interpretation of the violin plot in this context, we had to also explain the basics of CI computations. To keep the complexity of all representations at the same level, we added explanatory texts to all conditions. We detail the impact of this decision in our discussions in section 6.

1. https://osf.io/v75ea/?view_only=e481a9ad345e4e689799d65d988c1c5f
2. Supplementary materials: https://github.com/helske/statvis
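The statement that the sample mean is fully determined by the p-value, standard deviation, and sample size can be illustrated with a short Python sketch. It is not the authors' code (their scripts are in R, in the supplementary materials); the function names, the Simpson-rule CDF, and the bisection search are choices made here only to avoid third-party dependencies. The sketch inverts the two-sided one-sample t-test to obtain the positive sample mean and then applies the t-based confidence interval of Equation (1).

```python
import math


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df):
    """CDF via composite Simpson's rule on [0, |x|], using symmetry about 0."""
    a = abs(x)
    if a == 0.0:
        return 0.5
    steps = 2 * max(100, math.ceil(100 * a))  # even count, step size <= 0.005
    h = a / steps
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    half = total * h / 3
    return 0.5 + half if x > 0 else 0.5 - half


def t_quantile(q, df):
    """Upper-half quantile (0.5 <= q < 1) by bisection on the CDF."""
    if not 0.5 <= q < 1.0:
        raise ValueError("this sketch only handles 0.5 <= q < 1")
    lo, hi = 0.0, 100.0  # assumes the quantile lies below 100
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < q:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def one_sample_summary(p_value, n, sd, level=0.95):
    """Positive sample mean implied by a two-sided p-value, plus its CI."""
    df = n - 1
    se = sd / math.sqrt(n)
    mean = t_quantile(1 - p_value / 2, df) * se      # invert the t-test
    half = t_quantile(1 - (1 - level) / 2, df) * se  # half-width, Equation (1)
    return mean, (mean - half, mean + half)
```

For example, `one_sample_summary(0.06, 50, 3.0)` should give values close to the textual example in Fig. 1 (mean about 0.817 kg, CI about [−0.036, 1.669]), and with `p_value=0.05` the lower CI limit falls at zero, which is the link between the 95% CI and the 5% significance level.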

In order to balance learning effects, the order of the four conditions (representation styles) was counterbalanced using Latin squares, and within each condition the ordering of trials was randomly permuted for each participant. At the end of the survey, participants had to give feedback on the representation formats and rank them from 1 (best) to 4 (worst). We gave participants the possibility to give equal rankings. They could also leave additional comments about the survey in general.

We gathered answers from 114 participants, of whom one was excluded because of nonsensical answers to the background questions. One of the background variables was an open-ended question about field of expertise. The answers included a range of disciplines that we categorized into four groups: “Statistics and machine learning” (21 participants), “VIS/HCI” (34), “Social sciences and humanities” (32), and “Physical and life sciences” (26) (see supplementary material for more information).
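The counterbalancing scheme described above can be sketched in Python as follows. This is an illustration only: the text does not say which Latin square was used, so the cyclic square, the condition labels, the seeding scheme, and the function names are all assumptions made here, and the trial list simply reuses the eight p-values of the one-sample experiment.

```python
import random

# Hypothetical labels for the four representation styles.
CONDITIONS = ["textual", "classic CI", "gradient CI", "violin CI"]
P_VALUES = [0.001, 0.01, 0.04, 0.05, 0.06, 0.1, 0.5, 0.8]


def cyclic_latin_square(items):
    """Return a k x k Latin square: row r is `items` rotated left by r."""
    k = len(items)
    return [[items[(row + col) % k] for col in range(k)] for row in range(k)]


def presentation_plan(participant_index, seed=0):
    """Condition order from the Latin square, trials shuffled per condition."""
    # Seed derived from the participant index so each plan is reproducible.
    rng = random.Random(seed * 100003 + participant_index)
    square = cyclic_latin_square(CONDITIONS)
    plan = []
    for condition in square[participant_index % len(square)]:
        trials = list(P_VALUES)
        rng.shuffle(trials)  # independent random permutation per condition
        plan.append((condition, trials))
    return plan
```

Each condition appears exactly once in every row and every column of the square, so across each block of four participants every style is shown in every ordinal position once. A cyclic square does not balance which style immediately precedes which; a different square would be needed if that mattered.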

All statistical analyses were done in the R environment [63] using the brms package [64]. The visualizations of the results were created with the ggplot2 package [65]. The collected data, scripts used for data analysis, additional analysis, and figures are available in supplementary material. We also created an accompanying R package ggstudent for drawing modified violin and gradient CI plots used in the study.

To analyse the results we built a Bayesian multilevel model with participants’ confidence as the response variable (values ranging from 0 to 1), and the underlying p-value and representation style as the main explanatory variables of interest.

While we often perceive the probabilities and strength of evidence as having a linear relationship after logit-transformations of both variables [66], in the case of significance testing with potential for dichotomous thinking this relationship is likely not true due to the potential cliff effect as well as the excess occurrence of low and high p-values indicating complete lack of evidence (0) or full confidence (1). Values 0 and 1 (15% of all answers) are also problematic in the logit-transformation due to their mapping to ±∞. Therefore, a simple linear model with logit-transformations of p-values and the confidence scores would not be suitable in this case.

A typical choice for modelling proportions with disproportionately large numbers of zeros and ones is the zero-one-inflated beta regression. However, as we wanted to incorporate the prior knowledge of the potential linear relationship of confidence and probabilities in the logit-logit-scale, instead of the zero-one-inflated beta distribution we created a piecewise logit-normal model with the probability density function (pdf) defined as

$$
p(x) =
\begin{cases}
\alpha(1 - \gamma), & \text{if } x = 0, \\
\alpha\gamma, & \text{if } x = 1, \\
(1 - \alpha)\,\phi(\operatorname{logit}(x), \mu, \sigma), & \text{otherwise}.
\end{cases}
\tag{3}
$$

Here α = P(x ∈ {0, 1}) is the probability of answering one of the extreme values (not at all confident or fully confident), whereas γ = P(x = 1 | x ∈ {0, 1}) is the conditional probability of full confidence given that the answer is one of the extremes.

3. https://cran.r-project.org/package=ggstudent
4. The distribution was changed from the preregistration as suggested by a reviewer and [66].
5. While generating data from this distribution is straightforward, the expected value of this distribution is analytically intractable. However, this can be easily computed via Monte Carlo simulation.
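The following Python sketch illustrates footnote 5: drawing from the piecewise logit-normal distribution and approximating its expected value by Monte Carlo simulation. It assumes the parameterization given above (α for the probability of an extreme answer, γ for the conditional probability of full confidence, and a normal distribution with mean µ and standard deviation σ on the logit scale). The parameter values are arbitrary placeholders rather than estimates from the study, and the function names are hypothetical.

```python
import math
import random

def sample_confidence(alpha, gamma, mu, sigma, rng):
    """Draw one confidence value from the piecewise logit-normal distribution."""
    if rng.random() < alpha:                 # extreme answer with probability alpha
        return 1.0 if rng.random() < gamma else 0.0
    z = rng.gauss(mu, sigma)                 # logit(x) ~ Normal(mu, sigma)
    return 1.0 / (1.0 + math.exp(-z))        # inverse logit maps back to (0, 1)

def monte_carlo_mean(alpha, gamma, mu, sigma, n_draws=100_000, seed=1):
    """Approximate the expected value by averaging simulated draws."""
    rng = random.Random(seed)
    total = sum(sample_confidence(alpha, gamma, mu, sigma, rng) for _ in range(n_draws))
    return total / n_draws

# Placeholder parameters chosen for illustration only (not fitted values).
estimate = monte_carlo_mean(alpha=0.15, gamma=0.5, mu=0.0, sigma=1.0)
```

With µ = 0 and γ = 0.5 the distribution is symmetric around 0.5, so the estimate should land close to 0.5; this gives a simple sanity check before using other parameter values.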

These two parameters model the probability of extreme answers, and when the answer is between the extremes, we model it with the logit-normal distribution (φ(x) is the pdf of the normal distribution parameterized with mean µ and standard deviation σ). Explanatory variables can be added to the model to predict α, γ, µ, and σ, using the log-link for σ, the logit-link for α and γ, and the identity-link for µ. In comparison to the frequentist approach, such as standard generalized linear (mixed) models or analysing only simple descriptive statistics, our Bayesian model allows us to take into account the uncertainty of the parameter estimation and more flexible model structures. We can also make various simple probabilistic statements based on the posterior distributions of this model such as the probability that the cliff effect is higher with p-values than with classic CI. For further information about Bayesian modelling in general see, for example, [67].

As the first step, we checked some descriptive statistics of the potential cliff effect, defined as δ = E[confidence(p = 0.04) − confidence(p = 0.06)], i.e., the average difference in confidence between cases p = 0.04 and p = 0.06. Table 1 shows how gradient and violin CI plots have a somewhat smaller drop in confidence when moving from p = 0.04 to p = 0.06 compared to the textual representation and the classic CI visualization.

TABLE 1
The sample mean, standard deviation, standard error of the mean, and the 2.5th and 97.5th percentiles of the difference in confidence when p = 0.04 and p = 0.06 in the first experiment.

              Mean   SD     SE     2.5%    97.5%
Textual       0.19   0.27   0.03   -0.19   0.72
Classic CI    0.23   0.25   0.02   -0.05   0.84
Gradient CI   0.10   0.24   0.02   -0.37   0.74
Violin CI     0.13   0.20   0.02   -0.16   0.62

To analyse the data and the potential cliff effect in more detail, we used the Bayesian multilevel model described in subsection 4.3. Due to the setup of the experiment, participants’ answers were influenced by the information on the screen, which in turn depended on the underlying p-value, visualization style, and sample size. Sample size itself should not have an effect on the answers, which was indeed confirmed by preliminary analysis (see supplementary material), so we dropped that variable from further analysis. Due to the potential cliff effect we wanted to allow different slopes of the confidence curve for the cases when p < 0.05 and p > 0.05. For p = 0.05 we allowed an extra drop in confidence via an indicator variable I(p = 0.05), as it was not clear whether this boundary case should be on the “significant” or “not significant” side (i.e., whether the cliff effect was due to the drop just before or after 0.05). Regarding the probability of an extreme answer, the relationship with respect to the p-value was assumed to be non-linear so we treated the p-values as a categorical variable. For the conditional probability of full confidence γ we used the p-value as a categorical variable with a monotonic effect (using the simplex parameterization suggested in [68]), but grouped p > . µ, α and σ. We also allowed the effects of visualization and the underlying p-value to vary between participants by including corresponding random coefficients in the model. We ran various posterior predictive checks [69] to assess that the model fits the data reasonably well (see the supplementary material). The final model structure, written using the extended Wilkinson-Rogers syntax [70], [71], was chosen as follows:

$$
\begin{aligned}
\mu &\sim \mathrm{viz} \cdot I(p < 0.05) \cdot \operatorname{logit}(p) + \mathrm{viz} \cdot I(p = 0.05) \\
&\quad + (\mathrm{viz} + I(p < 0.05) \cdot \operatorname{logit}(p) + I(p = 0.05) \mid \mathrm{id}), \\
\alpha &\sim p \cdot \mathrm{viz} + (1 \mid \mathrm{id}), \\
\gamma &\sim \operatorname{mo}(p), \\
\sigma &\sim \mathrm{viz} + (1 \mid \mathrm{id}),
\end{aligned}
\tag{4}
$$

where p is a categorical variable defining the true p-value, logit(p) is a continuous variable of the logit-transformed p-value, mo(p) denotes a monotonic effect of the p-value, the dot corresponds to interaction (i.e., I(p = 0.05) · viz implies both the main and two-way interaction terms) and (z | id) denotes participant-level random effect for variable z.

Given this model, in the presence of a cliff effect we should observe a discontinuity in an otherwise linear relationship between the true p-value and reported confidence (when examined in the logit-logit scale).

We used the relatively uninformative priors: N( , ) for the regression coefficients, N( , ) for the intercept terms, and half-N( , ) for all standard deviation parameters, LKJ(1) prior [72] for the correlation matrices of random effects, and symmetric Dirichlet(1) prior for the coefficients of the monotonic effect.

Consistent with the Bayesian paradigm, we chose this model over simpler submodels (where some of the interactions or random effects are omitted) [73]. This model integrates over the uncertainty regarding the model parameters, with coefficient zero corresponding to a simpler model where the term is omitted from the model. However, as a sensitivity check we also estimated several submodels of this model. These gave very similar results, so in practice the reported results were insensitive to specific model choice.

Fig. 3 shows the posterior mean curves of confidence (vertical lines corresponding to the 95% credible intervals) with respect to the underlying true p-values that were used to generate the data. These are based on the population level effects, i.e., the expected confidence of an average participant (an individual whose random effects are 0).

We observe at least some kind of a cliff effect – a sudden drop in confidence – with all representation styles. Within the “statistically significant region” (i.e., when p < 0.05) the slope of the confidence level in relation to the underlying p-value is the least steep for the classic CI visualization, but there is a large drop in confidence when moving to p > 0.05, even larger than with the textual information. The textual representation with p-value, on the other hand, behaves similarly to the violin CI plot until p = . p-value representation drops below all other techniques. The gradient CI plot and the violin CI plot both have a smaller – although visible – drop in confidence and also otherwise show a similar pattern, except that the confidence level of the gradient CI plot is constantly below that of the violin

CI plot. There were no clear differences in the probabilities of an extreme answer (“zero confidence” or “full confidence”) between the visualization styles (see the supplementary material).

6. For readers new to the credible interval, we refer to section 2.2.

Fig. 3. Posterior means of confidence and corresponding 95% credible intervals for different visualization styles in the first experiment, on the logit-logit-scale. Here, a discontinuity in an otherwise linear relationship indicates a cliff effect. The zoom-in plot shows the uncertainty of the estimates when . ≤ p ≤ . . [Plot: x-axis p-value, y-axis confidence; series Textual, Classic CI, Gradient CI, Violin CI.]

Fig. 4 shows the posterior distributions of the drop in confidence, δ, for different visualizations. These show that the drop is the largest with classic CI and the smallest (and nearly identical) with gradient and violin CI visualizations. Textual representations with p-values position between these (somewhat closer to the classic CI). While there is some overlap between these distributions, when comparing the pairwise posterior probabilities that the δ of one visualization style is greater than that of an alternative style for an average participant (Table 2), we see clear differences between the styles: Classic CI leads to larger drop than textual p-values, and both of these lead to larger drops than Gradient CI and Violin CI (all these comparisons have probabilities close to 1). Note that, unlike the interpretation of p-values, the numbers in Table 2 are actual probabilities that the average drop in confidence around p = 0.05 is larger with one style than the other.

Fig. 4. Posterior distributions of δ, the drop in confidence around p = 0.05, for different representation styles in the first experiment. Note that the distributions of the gradient CI and the violin CI on the left-hand side are almost completely overlapping. [Plot: x-axis E[confidence(p=0.04) − confidence(p=0.06)], y-axis posterior density; series Textual, Classic CI, Gradient CI, Violin CI.]

TABLE 2
Posterior probability that δ, the drop in confidence around p = 0.05, is larger for representation style on the row than the representation style on the column.

              Textual   Classic CI   Gradient CI   Violin CI
Textual       -         0.01         1.00          1.00
Classic CI    0.99      -            1.00          1.00
Gradient CI   0.00      0.00         -             0.49
Violin CI     0.00      0.00         0.51          -

As a secondary analysis, we also estimated a model with categorized expertise value as a predictor (with interactions with visualization and p-value). When averaging (i.e. marginalizing) over the expertise, the results were similar to the main model. The expertise-specific examinations, however, revealed some differences between the groups. Most notably we observed the largest cliff effects in the Stats/ML group (for all representation styles), while in the Phys/Life group there were only small differences in the confidence profiles by representation style. When comparing the magnitudes of δ, the ordering of the representation styles was the same across all expertise groups (as seen in the main results). Due to space restrictions see the supplementary material for more detailed results.

For analysing the subjective rankings of the representation styles, we estimated a Bayesian ordinal regression model where we used visualization style to predict the observed rankings (with participant-level random intercept). Fig. 5 shows the results from this model as a probability that the visualization style obtains a certain rank. From this figure, we see that p-value typically obtains the worst rank (4), while violin CI and classic CI are the most preferred options with approximately equal probabilities for ranks 1 and 2. Gradient CI seems to divide opinions the most, with close to equal probabilities for each rank.

Fig. 5. Subjective ranking probabilities and the corresponding 95% credible intervals for visualization styles of the first one-sample experiment. A higher value for rank 1 indicates preference for the method while a higher value for rank 4 indicates distaste. [Plot: x-axis representation, y-axis ranking probability.]

At the end of the experiment, participants were invited to comment on the limitations and benefits of each technique. The fully categorized and raw data is available in supplementary material, but we summarize the main points here. The following summaries were created by one of the authors before seeing any of the other results. Concerning p-values, participants reported them to be easy to read and accurate ( ×

40 participants). However, participants also stated that they could hinder the readability of a paper if many of them had to be reported ( × × × p-values exclusively ( × × ×19) that allows quick analysis with clear figures (×42) and that scales very well to multiple comparison ( ×

Fig. 6. Configuration used in the second experiment. [Plot: x-axis p-value, y-axis weight decrease (kg); groups Control and Treatment.]

reported that this visual representation was missing information (the likelihood of the tails, for instance) and that it should be augmented with more statistical information ( × × × × × × × × × × × × ×9) which could also help reduce dichotomized interpretations. Participants also explained that the gradient could be hard to distinguish ( × ×11) and that the width was unnecessary visual information because it does not encode anything ( ×

Two-Sample Experiment

After conducting the first experiment, we deployed a second survey with a similar framing, but this time instead of comparing against a base value of zero, the task was to compare means of independent “treatment” and “control” groups, as in [39]. While it is often recommended that instead of comparing intervals of two (potentially dependent) samples it is better to compare intervals of the difference [74], these types of multiple interval visualizations are commonly seen in scientific publications. Similar to our first controlled experiment, this study was also preregistered, with supplementary material available at GitHub. Fig. 6 shows the configuration used in this second experiment.

The conditions and the overall design of the study were the same as in the one-sample experiment, except for the fact that we removed the textual p-value representation and replaced it with a more discrete version of the violin plot (see the rightmost figure in Fig. 1).

7. https://osf.io/brjzx/?view_only=e481a9ad345e4e689799d65d988c1c5f
8. https://github.com/helske/statvis

TABLE 3
The sample mean, standard deviation, standard error of the mean, and the 2.5th and 97.5th percentiles of the difference in confidence when p = 0.04 and p = 0.06 in the second experiment.

                       Mean   SD     SE     2.5%    97.5%
Classic CI             0.07   0.12   0.02   -0.22   0.28
Gradient CI            0.01   0.12   0.02   -0.21   0.25
Continuous violin CI   0.01   0.09   0.01   -0.15   0.17
Discrete violin        0.06   0.18   0.03   -0.17   0.50
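The columns of the table above can be computed from per-respondent differences in confidence, as in the Python sketch below. The ratings are hypothetical, the function name is illustrative, and the percentile interpolation rule (`method="inclusive"`) is an assumption, since the source does not state which rule was used. The final line checks one table entry under the further assumption that the standard error is SD/√n over the 37 retained respondents of this experiment.

```python
import math
import statistics

def summarize_drop(conf_p004, conf_p006):
    """Summarize per-respondent drops in confidence between p = 0.04 and p = 0.06."""
    drops = [a - b for a, b in zip(conf_p004, conf_p006)]
    sd = statistics.stdev(drops)  # sample standard deviation
    # n=40 yields cut points every 2.5%; the first and last are the 2.5th and 97.5th percentiles.
    cuts = statistics.quantiles(drops, n=40, method="inclusive")
    return {
        "mean": statistics.fmean(drops),
        "sd": sd,
        "se": sd / math.sqrt(len(drops)),  # standard error of the mean
        "p2.5": cuts[0],
        "p97.5": cuts[-1],
    }

# Hypothetical confidence ratings (0 to 1) from four respondents; not study data.
conf_p004 = [0.8, 0.6, 0.7, 0.9]
conf_p006 = [0.5, 0.6, 0.4, 0.7]
summary = summarize_drop(conf_p004, conf_p006)

# Consistency check: SD = 0.12 with n = 37 gives an SE that rounds to 0.02.
se_check = round(0.12 / math.sqrt(37), 2)
```

A negative 2.5th percentile, as in every row of the table, means that some respondents reported higher confidence at p = 0.06 than at p = 0.04.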

The exact question was framed as “A random sample of 50 adults from Sweden were prescribed a new medication for one week. Another random sample of 50 adults from Sweden were assigned to a control group and given a placebo. Based on the information on the screen how confident are you that the medication decreases the body weight? Note the y-axis, higher values correspond to larger weight loss.” The slider endpoints were defined as “I have zero confidence in claiming an effect”, and “I am fully confident that there is an effect.”

For this second experiment we used the same channels for sharing the link as in the first study and obtained 39 answers, of which two were discarded as they had not answered the background questions. Nine participants had expertise in “Statistics and machine learning”, eight in “VIS/HCI”, 14 in “Social sciences and humanities” and six in “Physical and life sciences”.

Table 3 shows the differences between subjective confidence when the underlying p-value was 0.06 versus 0.04. The drop in confidence is again the largest with the classic CI with discrete violin CI having a similar drop. The relatively large standard error in the case of the discrete violin CI is explained by a small number of respondents that demonstrated a very large drop in confidence with the discrete violin CI. Overall the cliff effect seems to be much smaller than in the one-sample case (where the average drop was between 0.15 and 0.30, depending on the technique).

For analysing the results, we used the same multilevel model as for the first experiment. Fig. 7 and Fig. 8 show the posterior mean curves of confidence and the posterior distributions of δ (the drop in confidence around 0.05). Compared with the first experiment, the overall confidence levels are smaller, for example with p = . p-value is 0.05 and 0.06 (although the credible intervals are wide) but, overall, the differences between visualization styles are relatively small. Also, in contrast with the one-sample experiment, here we do not see clear signs of cliff effect or dichotomous thinking as the posterior mean curves are approximately linear (except, perhaps, for the classic CI where the posterior mean of δ is 0.1). As in the first experiment we saw no clear differences in the probability of an extreme answer between visualization styles.

As in the first experiment, we analysed the subjective rankings of the representation styles with a Bayesian ordinal regression model where we explained the rank with visualization style and individual variance. Fig. 9 presents the ranking probabilities which indicate preferences towards the discrete violin CI plot (estimated to be the most preferred style by 42% of the respondents). No clear differences emerge between other styles, and especially the classic CI and the gradient CI yield very similar results.

Fig. 7. Posterior means of confidence and corresponding 95% credible intervals for different visualization styles in the second experiment, on logit-logit-scale, with a zoom-in plot of the cases with . ≤ p ≤ . . A discontinuity in otherwise linear relationship between the true p-value and reported confidence indicates a cliff effect. [Plot: x-axis p-value, y-axis confidence; series Classic CI, Gradient CI, Cont. violin CI, Disc. violin CI.]

Fig. 8. Posterior distributions of δ, the drop in confidence around p = 0.05, for different visualization styles in the second experiment. [Plot: x-axis E[confidence(p=0.04) − confidence(p=0.06)], y-axis posterior density; series Classic CI, Gradient CI, Cont. violin CI, Disc. violin CI.]

Fig. 9. Subjective ranking probabilities and the corresponding 95% credible intervals for visualization styles of the second experiment. A higher value for rank 1 indicates preference for the method while a higher value for rank 4 indicates distaste. [Plot: x-axis representation, y-axis ranking probability.]
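Statements about whether the drop δ is larger for one style than for another reduce to counting posterior draws. The Python sketch below shows the calculation on simulated placeholder draws; the means and spreads are invented for illustration and are not the study's posterior, and `prob_greater` is a hypothetical helper name.

```python
import random

def prob_greater(draws_a, draws_b):
    """Fraction of paired posterior draws in which draws_a exceeds draws_b."""
    if len(draws_a) != len(draws_b):
        raise ValueError("draws must be paired, one value per posterior draw")
    return sum(a > b for a, b in zip(draws_a, draws_b)) / len(draws_a)

rng = random.Random(7)  # fixed seed so the sketch is reproducible
n_draws = 4000
# Placeholder draws standing in for posterior samples of delta; not the study's posterior.
delta_classic = [rng.gauss(0.10, 0.04) for _ in range(n_draws)]
delta_gradient = [rng.gauss(0.03, 0.04) for _ in range(n_draws)]

p_classic_larger = prob_greater(delta_classic, delta_gradient)
p_gradient_larger = prob_greater(delta_gradient, delta_classic)
```

With draws from a fitted model, both vectors would come from the same joint posterior sample, so that pairing by draw index preserves any dependence between the two quantities. The two probabilities are complementary whenever there are no ties.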

For this second controlled experiment, participants were also asked to comment on the limitations and benefits of each visualization. The fully categorized and raw data is, again, available in the supplementary material and we present the most frequent comments here. Classic CIs were reported as easy to read and analyse ( × × × × × × ×6) and the gradient that could be hard to see ( × × ×10) and that seeing the discrete steps was very helpful, in comparison with the continuous violin plot ( × ×3) and that these plots could provide too much information in a single figure ( × ×8) but participants highlighted that the width was unnecessary ( × × ×

Discussion

In line with previous findings [40], [41], our results confirm that the classic CI visualization does not fix the cliff effect problem documented to be present in numerical and textual information. In fact, it appears that it may even increase the cliff effect. At the same time, many participants preferred the graphical presentation of CIs over text, stating reasons such as the CI plot being clear and quick to grasp as well as scaling very well to multiple comparisons.

We found that more complex visualization styles reduced the cliff effect in the first one-sample experiment, and the violin CI plot, in particular, was also well received by the participants. While we expected that these more novel visualization styles (violin and gradient CI plots) would introduce additional problems with the interpretation due to unfamiliarity, their benefits seem to outweigh these potential negative effects. Some of the problems with the violin CI plots could be partly explained by the confusion with the typical uses of a violin plot (as suggested by our qualitative feedback), namely as a method of visualizing observed data. This highlights the importance of properly labeling figures in research papers to avoid such misunderstandings.

The results from the second two-sample experiment suggest that the cliff effect might be a more common problem when comparing an estimate with a constant than when comparing estimates with each other.
However, further studies are needed in order to determine whether this is a general rule or, for example, an artefact of our experimental setting or small sample size, especially as the lack of a clear cliff effect in the two-sample experiment is in contradiction with the findings in [39] that showed major problems in the interpretation of two-sample experiments (in a very different setting, however).

Even though our convenience samples included researchers across a wide range of disciplines, they are unlikely to be fully representative of the general population of researchers using statistical analysis. Based on social media behaviour, survey feedback, and post-experiment discussions with some of the participants, our convenience sample likely contains disproportionate numbers of researchers with high knowledge and strong opinions on the topic of dichotomous thinking and the replication crisis. In particular, the links to the experiments were shared on the “Transparent Statistics” Slack channel which gathers HCI and VIS researchers who have argued for non-dichotomous interpretations of statistical results in their own work. We thus expect that our results likely downplay the average cliff effect compared with the much broader and heterogeneous scientific community.

Another factor presumably also affected the answers of our participants and downplayed the observed cliff effect. In our experiments we added explanatory texts to all the conditions to describe how they were computed. This likely affected the answers of some participants, and it could be argued that the true variation between the participants’ answers and the size of the cliff effect could have been greater without these explanations. As a third limitation, we observed a significant number of answers where the confidence increased as the underlying p-value increased.
This phenomenon was seen especially in the VIS/HCI group, with gradient and violin CI plots, and in general in the second experiment where the comparisons were more difficult. While these could explain some of the estimated differences between representation styles, our sensitivity analyses, with samples where most of these counter-intuitive curves were removed, suggested only slight increases in the estimates of δ and identical general conclusions (see the supplementary material).

Despite the limitations, we expect that our results provide a good lower estimate of the cliff effect in the broader scientific community and can be generalized to statistics other than just the sample mean.

In contrast with most of the earlier studies on the cliff effect which have focused on psychologists or lay-people, we aimed to study the effect in a general population of researchers familiar with statistical methods. We used Bayesian modelling to take into account the individual-level variability in the answers and the uncertainty due to the parameter estimation, leading to more realistic uncertainty assessments of our results than the traditional maximum likelihood estimation methods. We also provide a reproducible experiment with results available online and properly describe the questions we asked the participants.

Conclusions and Future Work

We provided analysis on the experiments on the cliff effect to study the effects of visual representation on interpreting statistical results. We found evidence that the problems with dichotomous thinking and the cliff effect are still common problems among researchers despite the amount of research and communications on this issue. In addition to educating researchers about this issue, we found that carefully chosen visualization styles can play an important role in reducing these phenomena.

Our Bayesian multilevel model provides an illustration of how the data from relatively simple experiments can be analysed in a coherent modelling framework.
It can give us more complex insights than simple descriptive statistics and avoids relying on the significance testing framework. The Bayesian approach also provides results that are easy to interpret, as everything is stated in terms of conditional probabilities which represent the state of knowledge. We hope this study encourages more model-based analysis in the VIS community in the future.

All of our representation styles included a clear threshold for p-value 0.05 for comparative purposes. It would be interesting to study how visualization styles without this clear threshold would perform in a similar setting. Also, quantile dot plots [53], [54] (being discretized density plots) are similar to violin plots in terms of their information value but, as they lack some of the potential historical burden of more common violin plots, it would be interesting to compare the performance of these two visualization styles in this setting.

The consideration of space-efficient visual representations highlighted by some of our participants provides interesting avenues for future research. In line with recent work on interactive analyses and statistical visualization [14], [57], [75], [76], we also anticipate that novel statistical representations free of the limitations of traditional printing constraints could have a positive impact both on general scientific communication and on reducing dichotomous thinking. Indeed, our violin CIs could be made more space-efficient in order to better scale to multiple comparisons, for example by using interactive scaling. We therefore plan to study such solutions and their impact on statistical interpretations in the future. As suggested by the discrepancy between the results of the first and second experiments, another avenue for further research is to study whether the cliff effect is stronger or more commonly occurring in settings where comparisons are made with respect to a constant reference point compared with multiple random variables.
Acknowledgments

We thank Pierre Dragicevic, Geoff Cumming, and our reviewers for their helpful comments. J. Helske was supported by Academy of Finland (decision numbers 311877 and 331817). S. Helske was supported by the Academy of Finland (331816, 320162) and the Swedish Research Council (445-2013-7681, 340-2013-5460).

References

[1] H. Pashler and E. Wagenmakers, “Editors’ introduction to the special section on replicability in psychological science: A crisis of confidence?” Perspectives on Psychological Science, vol. 7, no. 6, pp. 528–530, 2012, PMID: 26168108. [Online]. Available: https://doi.org/10.1177/1745691612465253
[2] V. Amrhein, S. Greenland, and B. McShane, “Scientists rise up against statistical significance,” Nature, vol. 567, no. 7748, pp. 305–307, 2019. [Online]. Available: https://doi.org/10.1038/d41586-019-00857-9
[3] R. L. Wasserstein, A. L. Schirm, and N. A. Lazar, “Moving to a world beyond "p<0.05",” The American Statistician, vol. 73, no. sup1, pp. 1–19, 2019. [Online]. Available: https://doi.org/10.1080/00031305.2019.1583913
[4] D. McCloskey and S. Ziliak, The Cult of Statistical Significance. University of Michigan Press, 2008. [Online]. Available: https://doi.org/10.3998/mpub.186351
[5] D. Trafimow and M. Marks, “Editorial,” Basic and Applied Social Psychology, vol. 37, no. 1, pp. 1–2, 2015. [Online]. Available: https://doi.org/10.1080/01973533.2015.1012991
[6] J. Gill, “Comments from the new editor,” Political Analysis, vol. 26, no. 1, pp. 1–2, 2018.
[7] M. Kay, G. L. Nelson, and E. B. Hekler, “Researcher-centered design of statistics: Why Bayesian statistics better fit the culture and incentives of HCI,” in Proc. CHI. ACM, 2016, pp. 4521–4532.
[8] J. K. Kruschke and T. M. Liddell, “The Bayesian new statistics: Hypothesis testing, estimation, meta-analysis, and power analysis from a Bayesian perspective,” Psychonomic Bulletin & Review, vol. 25, no. 1, pp. 178–206, 2018.
[9] R. McElreath, Statistical Rethinking: A Bayesian Course with Examples in R and Stan. CRC Press, 2016.
[10] G. Cumming, Understanding the new statistics: effect sizes, confidence intervals and meta-analysis. Routledge Taylor & Francis Group, 2012.
[11] R. Calin-Jageman and G. Cumming, “The new statistics for better science: Ask how much, how uncertain, and what else is known,” 2018. [Online]. Available: psyarxiv.com/3mztg
[12] M. Correll and M. Gleicher, “Error bars considered harmful: Exploring alternate encodings for mean and error,” IEEE Transactions on Visualization and Computer Graphics, vol. 20, no. 12, pp. 2142–2151, 2014.
[13] P. Kalinowski, J. Lai, and G. Cumming, “A cross-sectional analysis of students’ intuitions when interpreting CIs,” Frontiers in Psychology
[14] P. Dragicevic, Y. Jansen, A. Sarma, M. Kay, and F. Chevalier, “Increasing the Transparency of Research Papers with Explorable Multiverse Analyses,” in Proc. CHI, Glasgow, United Kingdom, 2019. [Online]. Available: https://hal.inria.fr/hal-01976951
[15] R. Rosenthal and J. Gaito, “The interpretation of levels of significance by psychological researchers,” The Journal of Psychology, vol. 55, no. 1, pp. 33–38, 1963. [Online]. Available: https://doi.org/10.1080/00223980.1963.9916596
[16] J. Neyman, “Outline of a Theory of Statistical Estimation Based on the Classical Theory of Probability,” Philosophical Transactions of the Royal Society of London Series A, vol. 236, pp. 333–380, 1937.
[17] G. Gigerenzer, “Mindless statistics,” The Journal of Socio-Economics, vol. 33, no. 5, pp. 587–606, 2004.
[18] V. Amrhein, D. Trafimow, and S. Greenland, “Inferential statistics as descriptive statistics: There is no replication crisis if we don’t expect replication,” The American Statistician, 2018. [Online]. Available: https://peerj.com/preprints/26857/
[19] B. B. McShane and D. Gal, “Statistical significance and the dichotomization of evidence,” Journal of the American Statistical Association, vol. 112, no. 519, pp. 885–895, 2017. [Online]. Available: https://doi.org/10.1080/01621459.2017.1289846
[20] A. Cockburn, P. Dragicevic, L. Besançon, and C. Gutwin, “Threats of a replication crisis in empirical computer science,” Commun. ACM, vol. 63, no. 8, pp. 70–79, Jul. 2020. [Online]. Available: https://doi.org/10.1145/3360311
[21] Z. Rafi and S. Greenland, “Semantic and cognitive tools to aid statistical science: replace confidence and significance by compatibility and surprise,” BMC Medical Research Methodology, vol. 20, no. 1, Sep. 2020. [Online]. Available: https://doi.org/10.1186/s12874-020-01105-9
[22] W. Świątkowski and B. Dompnier, “Replicability crisis in social psychology: Looking at the past to find new pathways for the future,” International Review of Social Psychology, vol. 30, no. 1, pp. 111–124, 2017.
[23] P. Dragicevic, F. Chevalier, and S. Huot, “Running an HCI experiment in multiple parallel universes,” in Proc. CHI Extended Abstracts. New York: ACM, 2014, pp. 607–618.
[24] P. Dragicevic, “Fair statistical communication in HCI,” in Modern Statistical Methods for HCI, J. Robertson and M. Kaptein, Eds. Cham, Switzerland: Springer International Publishing, 2016, ch. 13, pp. 291–330. [Online]. Available: http://software.mauritskaptein.com/StatisticsForHCI/content-overview/
[25] L. Besançon and P. Dragicevic, “The continued prevalence of dichotomous inferences at CHI,” in Proc. CHI Extended Abstracts, Glasgow, United Kingdom, 2019. [Online]. Available: https://hal.inria.fr/hal-01980268
[26] A. Gelman, “No to inferential thresholds,” Online. Last visited 04 January 2019, 2017. [Online]. Available: https://andrewgelman.com/2017/11/19/no-inferential-thresholds/
[27] G. Gigerenzer, “Statistical rituals: The replication delusion and how we got there,” Advances in Methods and Practices in Psychological Science, p. 2515245918771329, 2018.
[28] D. G. Altman and J. M. Bland, “Statistics notes: Absence of evidence is not evidence of absence,” BMJ
Personality and Social Psychology Review, vol. 2, no. 3, pp. 196–217, 1998.
[30] L. K. John, G. Loewenstein, and D. Prelec, “Measuring the prevalence of questionable research practices with incentives for truth telling,” Psychological Science, vol. 23, no. 5, pp. 524–532, 2012, PMID: 22508865. [Online]. Available: https://doi.org/10.1177/0956797611430953
[31] J. P. A. Ioannidis, “Why most published research findings are false,” PLOS Medicine, vol. 2, no. 8, 2005. [Online]. Available: https://doi.org/10.1371/journal.pmed.0020124
[32] J. P. Simmons, L. D. Nelson, and U. Simonsohn, “False-positive psychology: Undisclosed flexibility in data collection and analysis allows presenting anything as significant,” Psychological Science, vol. 22, no. 11, pp. 1359–1366, 2011, PMID: 22006061. [Online]. Available: https://doi.org/10.1177/0956797611417632
[33] R. Ulrich and J. Miller, “Some properties of p-curves, with an application to gradual publication bias,” Psychological Methods, vol. 23, no. 3, p. 546, 2018.
[34] J. M. Wicherts, C. L. S. Veldkamp, H. E. M. Augusteijn, M. Bakker, R. C. M. van Aert, and M. A. L. M. van Assen, “Degrees of freedom in planning, running, analyzing, and reporting psychological studies: A checklist to avoid p-hacking,” Frontiers in Psychology
Proc. IHM. Poitiers, France: AFIHM, 2017, p. 10. [Online]. Available: https://hal.inria.fr/hal-01562281
[37] G. Cumming, “The new statistics: Why and how,” Psychological Science, vol. 25, no. 1, pp. 7–29, 2014. [Online]. Available: http://pss.sagepub.com/content/25/1/7.abstract
[38] R. Hoekstra, R. D. Morey, J. N. Rouder, and E.-J. Wagenmakers, “Robust misinterpretation of confidence intervals,” Psychonomic Bulletin & Review, vol. 21, no. 5, pp. 1157–1164, 2014. [Online]. Available: https://doi.org/10.3758/s13423-013-0572-3
[39] S. Belia, F. Fidler, J. Williams, and G. Cumming, “Researchers misunderstand confidence intervals and standard error bars.” Psychological Methods, vol. 10, no. 4, p. 389, 2005.
[40] J. Lai, “Dichotomous thinking: A problem beyond NHST,” Data and context in statistics education: Towards an evidence based society, 2010. [Online]. Available: https://iase-web.org/documents/papers/icots8/ICOTS8_C101_LAI.pdf
[41] R. Hoekstra, A. Johnson, and H. A. L. Kiers, “Confidence intervals make a difference: Effects of showing confidence intervals on inferential reasoning,” Educational and Psychological Measurement, vol. 72, no. 6, pp. 1039–1052, 2012. [Online]. Available: https://doi.org/10.1177/0013164412450297
[42] N. Nelson, R. Rosenthal, and R. L. Rosnow, “Interpretation of significance levels and effect sizes by psychological researchers.” American Psychologist, vol. 41, no. 11, p. 1299, 1986.
[43] J. Poitevineau and B. Lecoutre, “Interpretation of significance levels by psychological researchers: The .05 cliff effect may be overstated,” Psychonomic Bulletin & Review, vol. 8, no. 4, pp. 847–850, 2001. [Online]. Available: https://doi.org/10.3758/BF03196227
[44] M.-P. Lecoutre, J. Poitevineau, and B. Lecoutre, “Even statisticians are not immune to misinterpretations of null hypothesis significance tests,”

International Journal of Psychology , vol. 38, no. 1, pp. 37–45, 2003.[Online]. Available: https://doi.org/10.1080/00207590244000250[45] J. D. Perezgonzalez, “Fisher, Neyman-Pearson or NHST? a tutorial forteaching data testing,”

Frontiers in Psychology , vol. 6, 2015. [Online].Available: https://doi.org/10.3389/fpsyg.2015.00223[46] B. Scheibehenne, T. Jamil, and E.-J. Wagenmakers, “Bayesianevidence synthesis can reconcile seemingly inconsistent results:The case of hotel towel reuse,”

Psychological Science , vol. 27,no. 7, pp. 1043–1046, 2016, pMID: 27280727. [Online]. Available:https://doi.org/10.1177/0956797616644081[47] E.-J. Wagenmakers, “A practical solution to the pervasive problems of pvalues,”

Psychonomic Bulletin & Review , vol. 14, no. 5, pp. 779–804,2007. [Online]. Available: https://doi.org/10.3758/BF03194105[48] M. F. Jung, D. Sirkin, T. M. Gür, and M. Steinert, “Displayeduncertainty improves driving experience and behavior: The case ofrange anxiety in an electric car,” in

Proc. CHI , ser. CHI ’15. NewYork, NY, USA: ACM, 2015, pp. 2201–2210. [Online]. Available:http://doi.acm.org/10.1145/2702123.2702479[49] M. Wunderlich, K. Ballweg, G. Fuchs, and T. von Landesberger,“Visualization of delay uncertainty and its impact on traintrip planning: A design study,”

Computer Graphics Forum ,vol. 36, no. 3, pp. 317–328, 2017. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/cgf.13190[50] J. L. Hintze and R. D. Nelson, “Violin plots: A box plot-density tracesynergism,”

The American Statistician

Journal of the Royal Statistical Society: Series A (Statisticsin Society) , vol. 162, no. 1, pp. 45–58, 1999. [Online]. Available:https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/1467-985X.00120[52] N. J. Barrowman and R. A. Myers, “Raindrop plots,”

The AmericanStatistician , vol. 57, no. 4, pp. 268–274, 2003. [Online]. Available:https://doi.org/10.1198/0003130032369[53] M. Fernandes, L. Walls, S. Munson, J. Hullman, and M. Kay,“Uncertainty displays using quantile dotplots or cdfs improvetransit decision-making,” in

Proc. CHI , ser. CHI ’18. New York,NY, USA: ACM, 2018, pp. 144:1–144:12. [Online]. Available:http://doi.acm.org/10.1145/3173574.3173718[54] M. Kay, T. Kola, J. R. Hullman, and S. A. Munson, “When (ish)is my bus?: User-centered visualizations of uncertainty in everyday, mobile predictive systems,” in Proc. CHI , ser. CHI ’16. NewYork, NY, USA: ACM, 2016, pp. 5092–5103. [Online]. Available:http://doi.acm.org/10.1145/2858036.2858558[55] G. Cumming, “Inference by eye: Pictures of conﬁdence intervalsand thinking about levels of conﬁdence,”

Teaching Statistics ,vol. 29, no. 3, pp. 89–93, 2007. [Online]. Available: https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-9639.2007.00267.x[56] M. Kay, tidybayes: Tidy Data and Geoms for Bayesian Models , 2018,R package version 1.0.3. [Online]. Available: http://mjskay.github.io/tidybayes/[57] A. Kale, F. Nguyen, M. Kay, and J. Hullman, “Hypothetical outcomeplots help untrained observers judge trends in ambiguous data,”

IEEETransactions on Visualization and Computer Graphics , vol. 25, no. 1, pp.892–902, 2019.[58] J. M. Hofman, D. G. Goldstein, and J. Hullman, “How visualizinginferential uncertainty can mislead readers about treatment effects inscientiﬁc results,” in

Proc. CHI , ser. CHI ’20. New York, NY, USA:Association for Computing Machinery, 2020, p. 1–12. [Online]. Available:https://doi.org/10.1145/3313831.3376454[59] J. Hullman, X. Qiao, M. Correll, A. Kale, and M. Kay, “In pursuit of error:A survey of uncertainty visualization evaluation,”

IEEE Transactions onVisualization and Computer Graphics , vol. 25, no. 1, pp. 903–913, 2019.[60] S. Tak, A. Toet, and J. van Erp, “The perception of visual uncertain-tyrepresentation by non-experts,”

IEEE Transactions on Visualization andComputer Graphics , vol. 20, no. 6, pp. 935–943, 2014.[61] R. Hyndman and G. Athanasopoulos.,

Forecasting: principlesand practice

R: A Language and Environment for Statistical Computing

Journal of Statistical Software , vol. 80, no. 1, pp. 1–28, 2017.[65] H. Wickham, ggplot2: Elegant Graphics for Data Analysis . Springer-Verlag New York, 2016. [Online]. Available: http://ggplot2.org[66] H. Zhang and L. Maloney, “Ubiquitous log odds: A commonrepresentation of probability and frequency distortion in perception, action,and cognition,”

Frontiers in Neuroscience

Bayesian data analysis . CRC press, 2013.[68] P. Bürkner and E. Charpentier, “Monotonic effects: A principled approachfor including ordinal predictors in regression models,” 2018. [Online].Available: psyarxiv.com/9qkhj[69] J. Gabry, D. Simpson, A. Vehtari, M. Betancourt, and A. Gelman,“Visualization in bayesian workﬂow,”

Journal of the Royal StatisticalSociety: Series A (Statistics in Society) , vol. 182, no. 2, pp. 389–402,2019. [Online]. Available: https://rss.onlinelibrary.wiley.com/doi/abs/10.1111/rssa.12378[70] G. N. Wilkinson and C. E. Rogers, “Symbolic description of factorialmodels for analysis of variance,”

Journal of the Royal Statistical Society.Series C (Applied Statistics) nlme:Linear and Nonlinear Mixed Effects Models , 2018, R package version3.1-137. [Online]. Available: https://CRAN.R-project.org/package=nlme[72] D. Lewandowski, D. Kurowicka, and H. Joe, “Generating randomcorrelation matrices based on vines and extended onion method,”

Journal of Multivariate Analysis

Doing Bayesian data analysis: A tutorial with R, JAGS, andStan . Academic Press, 2014.[74] G. Cumming and S. Finch, “Inference by eye: conﬁdence intervals andhow to read pictures of data.”

American psychologist , vol. 60, no. 2, p.170, 2005.[75] T. L. Pedersen and D. Robinson, gganimate: A Grammar ofAnimated Graphics , 2019, R package version 1.0.2. [Online]. Available:https://CRAN.R-project.org/package=gganimate[76] J. Hullman, P. Resnick, and E. Adar, “Hypothetical outcome plotsoutperform error bars and violin plots for inferences about reliabilityof variable ordering,”

PLOS ONE , vol. 10, no. 11, pp. 1–25, 11 2015.[Online]. Available: https://doi.org/10.1371/journal.pone.0142444

Jouni Helske is a senior researcher at the University of Jyväskylä, Finland, from where he received the Ph.D. degree in statistics in 2015. He was previously a postdoctoral researcher at Linköping University, Sweden. His research has focused on state space models, Markov chain Monte Carlo and sequential Monte Carlo methods, and information visualization; his current research focuses on Bayesian causal inference.

Satu Helske is a senior researcher in sociology at the University of Turku, Finland. She received the Ph.D. degree in statistics from the University of Jyväskylä, Finland, after which she worked as a postdoctoral researcher at the University of Oxford, UK, and at Linköping University, Sweden. She works at the crossroads of sociology and statistics, with her main focus being on longitudinal and life course analysis.

Matthew Cooper is a senior lecturer in information visualization with the University of Linköping. He was awarded a PhD in Chemistry by the University of Manchester, UK, in 1990 and moved into the area of visualization in 1996 when he joined the Manchester Visualization Centre. He joined the University of Linköping in 2001. His current interests lie in the areas of visual representations and analytical methods for multivariate and temporal data as well as in the scientific evaluation of user interaction with visualization systems and tools.

Anders Ynnerman is a Professor in scientific visualization at Linköping University and is the director of the Norrköping Visualization Center C. His research has focused on interactive techniques for large scientific data in a range of application areas. He received the Ph.D. degree in physics from Gothenburg University, Sweden, in 1992. Ynnerman is a member of the Swedish Royal Academy of Engineering Sciences and the Royal Swedish Academy of Sciences, and in 2018 he received the IEEE VGTC technical achievement award.