Bringing Visual Inference to the Classroom

Abstract

In the classroom, we traditionally visualize inferential concepts using static graphics or interactive apps. For example, there is a long history of using apps to visualize sampling distributions. Recent developments in statistical graphics have created an opportunity to bring additional visualizations into the classroom to hone student understanding. Specifically, the lineup protocol for visual inference provides a framework for students to see the difference between signal and noise by embedding a plot of observed data in a field of null (noise) plots. Lineups have proven valuable in visualizing randomization/permutation tests, diagnosing models, and even conducting valid inference when distributional assumptions break down. This paper provides an overview of how the lineup protocol for visual inference can be used to hone understanding of key statistical topics throughout the statistics curricula.



Adam Loy
Department of Mathematics and Statistics, Carleton College
June 23, 2020


Keywords: Statistical graphics, Simulation-based inference, Visualizing uncertainty, Lineup protocol, Introductory statistics, Model diagnostics

Introduction

Recent years have seen a great deal of innovation in how we teach statistics as we strive to overcome what Cobb (2007) termed “the tyranny of the computable.” Most notably, simulation-based pedagogies for the first course have been proposed and validated (Cobb 2007, Tintle et al. 2011, 2012, Maurer & Lock 2014, Tintle et al. 2014, Hildreth et al. 2018). These simulation-based pedagogies have also been used in mathematical statistics (Chihara & Hesterberg 2011, Cobb 2011), and Tintle, Chance, Cobb, Roy, Swanson & VanderStoep (2015) argue that they should be used throughout the entire curriculum.

In addition to changes to how we introduce inference, there have also been changes to the computational toolkit we use throughout the statistics curricula. At the introductory level, numerous toolkits are commonplace depending on the objectives and audience of the course. Web apps are commonly used when students may not have access to their own computers, or simply to lower the technical barriers to entry. Examples include StatKey (Lock et al. 2017), the Introduction to Statistical Investigations applets (Tintle, Chance, Cobb, Rossman, Roy, Swanson & VanderStoep 2015), and the Shiny apps from Agresti et al. (2017). These apps allow students to explore course concepts without getting into the computational weeds. For courses exploring both the concepts and implementation in a realistic data-analytic workflow, R (R Core Team 2019) is a common open-source choice, and multiple R packages have been developed to lower the barriers to entry for students, notably mosaic (Pruim et al. 2017), ggformula (Kaplan & Pruim 2019), and infer (Bray et al. 2019).

The above developments enabled the statistics education community to address key recommendations made in the GAISE report (GAISE College Report ASA Revision Committee 2016). The simulation-based curriculum has focused attention on teaching statistical thinking and fostering conceptual understanding before delving into the mathematical details. An improved computational toolkit has enabled students to use technology to explore concepts, such as sampling and permutation distributions, and to analyze data.

While the use of the simulation-based curriculum has helped students focus on the underlying ideas of statistical inference, little has changed about the way we help students visualize inference. Specifically, we still have students grapple with null/reference distributions in hypothesis testing and sampling distributions for estimation while they are trying to hone their intuition. These distributions are very abstract ideas, and while the web apps we use to demonstrate their construction can help make sense of a single “dot” on the distribution, students commonly lose the forest for the trees. Wild et al. (2017) proposed the use of scaffolded animations to help students hone their intuition about sampling/randomization variation and to discover the utility of the bootstrap and permutation distributions. While the animations discussed by Wild et al. (2017) to visualize randomization variation appear to be quite useful in communicating this complex idea to students, a “formal” distribution is not necessary to introduce the core ideas behind hypothesis testing. Instead, we can ask students to try to identify the data plot among a small set of decoy plots generated by permutation resampling and link this simple perceptual task with fundamental ideas of statistical inference.

In this article, we discuss how to use the lineup protocol from visual inference to help students differentiate between different forms of signal and noise and to better understand the nuances of statistical significance. Section 2 presents an overview of the lineup protocol. Section 3 presents examples of how the lineup protocol can be used in the first course, and Section 4 presents additional examples of its use throughout the curriculum. We conclude with a brief summary and discussion in Section 6.

As outlined by Cobb (2007), most introductory statistics books teach that classical hypothesis tests consist of (i) formulating null and alternative hypotheses, (ii) calculating a test statistic from the observed data, (iii) comparing the test statistic to a reference (null) distribution, and (iv) deriving a p-value on which a conclusion is based. This is still true for the first course after adapting it based on the new GAISE guidelines, regardless of whether a simulation-based approach is used (cf. Lock et al. 2017, Tintle, Chance, Cobb, Rossman, Roy, Swanson & VanderStoep 2015, De Veaux et al. 2018). In visual inference, the lineup protocol provides a direct analog for each step of a hypothesis test (Buja et al. 2009).

1. Competing claims: Similar to a traditional hypothesis test, a visual test begins by clearly stating the competing claims about the model/population parameters.

2. Test statistic: A plot displaying the raw data or fitted model (call this the observed plot) serves as the test statistic. This plot must be chosen to highlight features of the data that are relevant to the hypotheses. For example, a scatterplot is a natural choice to examine whether or not two quantitative variables are correlated.

3. Reference (null) distribution: Null plots are generated consistently with the null hypothesis, and the set of all null plots constitutes the reference (or null) distribution. To facilitate comparison of the observed plot to the null plots, the observed plot is randomly situated in the field of null plots, just as a suspect is randomly situated among decoys in a police lineup. This arrangement of plots is called a lineup.

4. Assessing evidence: If the null hypothesis is true, then we expect the observed plot to be indistinguishable from the null plots. Consequently, if the observer is able to identify the observed plot in a lineup, then this provides evidence against the null hypothesis. If one wishes to calculate a visual p-value, then lineups need to be presented to a number of independent observers for evaluation. While this is possible, it is not a productive discussion in most introductory courses that don’t explore probability theory.
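The four steps above can be sketched in code. The following is a minimal illustration in Python using only the standard library (the lineups in this paper were rendered in R, so this is not the author's implementation). The function names `make_lineup` and `visual_p_value` and the toy scores are hypothetical. The null panels are produced by permuting group labels, and the visual p-value uses a simple binomial model that assumes each observer evaluates an independent lineup and, under the null hypothesis, picks the data plot with probability 1/m for a lineup of m panels.

```python
import math
import random


def make_lineup(scores, groups, n_panels=20, seed=None):
    """Return (panels, observed_position) for a permutation lineup.

    Each panel is a list of (group, score) pairs. Null panels shuffle the
    group labels, which is consistent with "no difference between groups".
    """
    rng = random.Random(seed)
    observed_position = rng.randrange(n_panels)  # random slot for the real data
    panels = []
    for position in range(n_panels):
        if position == observed_position:
            labels = list(groups)  # the actual treatment assignment
        else:
            labels = rng.sample(list(groups), k=len(groups))  # permuted labels
        panels.append(list(zip(labels, scores)))
    return panels, observed_position


def visual_p_value(n_correct, n_observers, n_panels=20):
    """P(X >= n_correct) for X ~ Binomial(n_observers, 1 / n_panels).

    Assumption: observers are independent and guess at random under the null.
    """
    p = 1.0 / n_panels
    return sum(
        math.comb(n_observers, k) * p**k * (1 - p) ** (n_observers - k)
        for k in range(n_correct, n_observers + 1)
    )


# Toy data (made up for illustration; not the creative writing scores).
scores = [12.0, 17.5, 19.1, 22.3, 24.0, 15.2, 18.7, 20.4, 26.7, 29.1]
groups = ["extrinsic"] * 5 + ["intrinsic"] * 5
panels, answer = make_lineup(scores, groups, seed=1)
```

Each element of `panels` can be handed to any plotting routine to draw one panel of the lineup, and `answer` records where the observed data were placed so that it can be revealed during the debrief.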

As a first example of visual inference via the lineup protocol, consider the creative writing experiment discussed by Ramsey & Schafer (2013). The experiment was designed to explore whether motivation type (intrinsic or extrinsic) impacted creativity scores. To evaluate this, creative writers were randomly assigned to a questionnaire where they ranked reasons they write: one questionnaire listed intrinsic motivations and the other listed extrinsic motivations. After completing the questionnaire, all subjects wrote a haiku about laughter, which was graded for creativity by a panel of poets. Ramsey & Schafer (2013) discuss how to conduct a permutation test for the difference in mean creativity scores between the two treatment groups. Below, we illustrate the steps of a visual test.

Figure 1: Boxplots of the original creative writing scores by treatment group (x-axis: treatment; y-axis: score). The dot represents the mean of each group.

1. A visual test begins identically to a traditional hypothesis test by clearly stating the competing claims about the model/population parameters. In a first course, this could be written as: $H_0: \mu_{\text{intrinsic}} - \mu_{\text{extrinsic}} = 0$ vs. $H_a: \mu_{\text{intrinsic}} - \mu_{\text{extrinsic}} \neq 0$.

2. In a visual test, plots take the role of test statistics (Buja et al. 2009). In this situation, we must choose a plot (test statistic) that can highlight the difference in average creativity scores between the intrinsic and extrinsic treatment groups. Figure 1 displays boxplots of creative writing scores by treatment group where a dot is used to represent the sample mean for each group, though other graphics could be used. There is an apparent difference in the distribution of the scores (the average score for the intrinsic group appears to be larger), but is it an important (i.e. significant) difference?

3. To understand whether the observed (data) plot provides evidence of a significant difference, we must understand the behavior of our test statistic under the null hypothesis. To do this, we generate null plots consistent with the null hypothesis, and the set of all null plots constitutes the reference distribution. To facilitate comparison of the data plot to the null plots, the data plot is randomly situated in the field of null plots. This arrangement of plots is called a lineup. Figure 2 shows one possible lineup for the creative writing experiment. The 19 null plots were generated via […]

Figure 2: A lineup consisting of 19 null plots generated via permutation resampling and the original data plot for the creative writing study (x-axis: treatment; y-axis: score). The data plot was randomly placed in panel […]

[…] p-value, as the pedagogical value of the lineup protocol is in visualizing signal and noise.

Using visual inference in introductory statistics

In this section, we discuss how to use the lineup protocol in the introductory setting to introduce students to the logic of hypothesis testing and to help students interpret new statistical graphics. The goal is to provide examples of how this can be done, not to provide an exhaustive list of possibilities.

The strong parallels between visual inference and classical hypothesis testing make it a natural way to introduce the idea of statistical significance without getting bogged down in the minutiae/controversy of p-values, or the technical issues of describing a simulation procedure before students understand why that is important. All students understand the question “which one of these plots is not like the others,” and this common understanding generates fruitful discussion about the underlying inferential thought process without the need for a slew of definitions. Below is an outline of a class activity discussing the creative writing experiment to introduce the inferential thought process.

This activity is designed to be completed in groups of three or four students. We have found that this group size allows all students to contribute to the discussion, and is also conducive to assigned roles (Roseth et al. 2008), if you find that helps foster discussion in your classroom.

Competing claims.

To begin, we have students discuss what competing claims are being investigated. We encourage them to write these in words before linking them with the mathematical notation they saw in the reading prior to class. The most common answer is: “there is no difference in the average creative writing scores for the two groups vs. there is a difference in the average creative writing scores for the two groups.” During the debrief, I make sure to link this to the appropriate notation.

EDA review.

Next, we have students discuss what plot types would be most useful to investigate this claim. It’s important to ask students why they selected a specific plot type, as this reinforces links to key ideas from exploratory data analysis.

Lineup evaluation.

Most students recognize that side-by-side boxplots, faceted histograms, or overlaid density plots are reasonable choices to display the relevant aspects of the distribution of creative writing scores for each group. We then provide a lineup of side-by-side boxplots to evaluate (we do place a dot at the sample mean for each group), such as the one shown in Figure 2. At this point, we do not give the details behind the creation of null plots; we simply tell students that one plot is the observed data while the other nineteen agree with the null hypothesis. We ask students to (i) choose which plot is the most different from the others and (ii) explain why they chose that plot. Once each student has had time to make this assessment (usually about one minute), we ask the groups to discuss and defend their choices.

Lineup discussion.

Once all of the groups have evaluated their lineups and discussed their reasoning, we regroup for a class discussion. During this discussion, we reveal which panel contains the observed data (panel […]

You can also utilize the lineup protocol in the first course to introduce new and unfamiliar plot types. For example, we have found many introductory students struggle to interpret residual plots. In this situation, the lineup protocol helps students tune their understanding of what constitutes an “interesting” pattern (i.e. signal).

Interpreting residual plots is fraught with common errors, and we have found that, regardless of our valiant attempts to explain what “random noise” or “random deviations from a model” might look like, there is no substitute for first-hand experience. In this section, we outline a class activity/discussion that we use to help train students to interpret residual plots. This activity takes place after a brief introduction to residual plots is given in class (or video if in a flipped classroom). Again, we suggest that students complete such an activity in small groups.

Figure 3: A residual plot for a simple linear regression model (x-axis: fitted values; y-axis: residuals). Is there evidence that the model is insufficient?

Model ﬁtting.

To begin, we have students fit a simple linear regression model, write down what a residual is (in both words and using notation), and then create a first residual plot, such as Figure 3.

Interpreting residual plots.

Next, we pose the question: “Does this residual plot provide evidence of a model deficiency?” This provides students time to formalize their decision, and link it to specific features of the residual plot upon which they based their decision.

Lineup evaluation.

Once students have carefully interpreted the observed residual plot, we have them generate via a Shiny app (or present them with) a lineup where their data plot has been randomly situated in a field of null plots, such as the plot shown in Figure 4. Here, the null plots have been generated using the parametric bootstrap, but the residual or non-parametric bootstraps are other viable choices. We avoid the details of how the null plots were generated, but this depends on the goals for your class. Once the lineup has been generated, we ask students to (i) identify which panel contains the observed residual plot, (ii) describe patterns they observed in three null plots, and (iii) […]

Figure 4: A lineup of residual plots (x-axis: fitted values; y-axis: residuals). The null plots are generated via a parametric bootstrap from the fitted model. The observed data are shown in panel […]
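For instructors who want to see what the parametric bootstrap involves, here is a minimal sketch in Python (standard library only; the paper's own lineups were rendered in R). It assumes a simple linear regression with normal errors: null responses are simulated from the fitted line plus normal noise with the estimated residual standard error, and the model is refit to each simulated response before residuals are extracted. The function names `fit_line`, `residual_panel`, and `residual_lineup` are hypothetical.

```python
import math
import random


def fit_line(x, y):
    """Least-squares intercept, slope, and residual standard error."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sigma = math.sqrt(rss / (n - 2))
    return intercept, slope, sigma


def residual_panel(x, y):
    """(fitted value, residual) pairs from a simple linear regression."""
    intercept, slope, _ = fit_line(x, y)
    fitted = [intercept + slope * xi for xi in x]
    return [(f, yi - f) for f, yi in zip(fitted, y)]


def residual_lineup(x, y, n_panels=20, seed=None):
    """Place the observed residual plot among parametric-bootstrap nulls."""
    rng = random.Random(seed)
    intercept, slope, sigma = fit_line(x, y)
    observed_position = rng.randrange(n_panels)
    panels = []
    for position in range(n_panels):
        if position == observed_position:
            panels.append(residual_panel(x, y))
        else:
            # Simulate a response from the fitted model, then refit.
            y_sim = [intercept + slope * xi + rng.gauss(0.0, sigma) for xi in x]
            panels.append(residual_panel(x, y_sim))
    return panels, observed_position
```

Because every null panel comes from data that truly follow the fitted linear model, any structure students see in those panels is noise by construction.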

Debrief.

Once all of the groups have evaluated their lineups and discussed their reasoning, it is important to regroup for a class discussion. This allows you to reveal the observed residual plot and revisit key points about residual plots and their interpretation.

Teaching tips

• In Figure 4, the observed residual plot in panel […]
• Depending on your course goals, follow-up discussions about the design of residual plots could be added to the end of this activity. For example, you could provide students with a second version of the lineup where LOESS smoothers have been added to each panel and ask students what features of the residual plot the smoother highlights.
• An alternative activity first has students use the Rorschach protocol (Buja et al. 2009) to look through a series of null plots, describing what they see, and then look at a single residual plot.

Similar activities can be designed to introduce other statistical graphics. Specifically, we have also found that lineups help students learn to read normal quantile-quantile and mosaic plots (or stacked bar charts).
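As one illustration of how such a lineup might be assembled for normal quantile-quantile plots, the sketch below (Python, standard library only) draws each null sample from a normal distribution with the observed sample's mean and standard deviation. That choice of null-generating mechanism is an assumption made here for illustration, and the names `qq_points` and `qq_lineup` are hypothetical.

```python
import random
from statistics import NormalDist, mean, stdev


def qq_points(sample):
    """(theoretical quantile, ordered value) pairs for a normal Q-Q plot."""
    n = len(sample)
    theoretical = [NormalDist().inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, sorted(sample)))


def qq_lineup(sample, n_panels=20, seed=None):
    """Place the observed Q-Q plot among Q-Q plots of simulated normal samples."""
    rng = random.Random(seed)
    mu, sigma = mean(sample), stdev(sample)
    observed_position = rng.randrange(n_panels)
    panels = []
    for position in range(n_panels):
        if position == observed_position:
            panels.append(qq_points(sample))
        else:
            # Null sample: same size as the data, drawn from a fitted normal.
            null_sample = [rng.gauss(mu, sigma) for _ in sample]
            panels.append(qq_points(null_sample))
    return panels, observed_position
```

The null panels show students how much wobble around a straight line to expect from truly normal data of the same size.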

The utility of visual inference is not limited to introductory courses. Whenever a new model is encountered, intuition about diagnostic plots must be rebuilt, and the lineup protocol helps students build this intuition. As an example, consider diagnostics for binary logistic regression models, a common topic in a second course.

Interpreting residual plots from binary logistic regression is difficult, as plots of the residuals against the fitted values or predictors often look similar for adequate and inadequate models. The lineup protocol provides a framework for this discussion. For example, you can simulate data from a model where a quadratic effect is needed, but fit a model with only a linear effect to the data and extract the Pearson residuals. Then, you can simulate the null plots from the model with only the linear effect and extract the Pearson residuals. Figure 5 shows a lineup created in this way. Having a discussion surrounding this lineup in class will help pinpoint the difficulty of using conventional residual plots for model diagnosis.

Figure 5: A lineup for a deficient logistic regression model (x-axis: linear predictor; y-axis: Pearson residuals). The data plot is simulated from a model with a quadratic effect, while the null plots are simulated from a model with only a linear effect. Can you identify the deficient plot?
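The simulation recipe behind a lineup like Figure 5 can be sketched as follows (Python, standard library only; not the code used to render the figure). The "observed" responses come from a logistic model with a quadratic term, a linear-only model is fit by Newton-Raphson, and the null responses are simulated from that fitted linear model. The coefficients -1.0, 0.5, and 1.0, the sample size, the predictor range, and all function names are hypothetical choices for illustration.

```python
import math
import random


def inv_logit(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def fit_logistic(x, y, n_iter=25):
    """Fit logit P(y = 1) = b0 + b1 * x by Newton-Raphson; return (b0, b1)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        p = [inv_logit(b0 + b1 * xi) for xi in x]
        w = [pi * (1 - pi) for pi in p]
        # Score vector.
        g0 = sum(yi - pi for yi, pi in zip(y, p))
        g1 = sum((yi - pi) * xi for xi, yi, pi in zip(x, y, p))
        # 2 x 2 information matrix, inverted by hand.
        h00 = sum(w)
        h01 = sum(wi * xi for wi, xi in zip(w, x))
        h11 = sum(wi * xi * xi for wi, xi in zip(w, x))
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


def pearson_panel(x, y):
    """(linear predictor, Pearson residual) pairs from the linear-only fit."""
    b0, b1 = fit_logistic(x, y)
    panel = []
    for xi, yi in zip(x, y):
        eta = b0 + b1 * xi
        p = inv_logit(eta)
        panel.append((eta, (yi - p) / math.sqrt(p * (1 - p))))
    return panel


def deficient_model_lineup(n=200, n_panels=20, seed=None):
    rng = random.Random(seed)
    x = [rng.uniform(-2.0, 2.0) for _ in range(n)]
    # "Observed" data: the true model needs a quadratic effect.
    y_obs = [int(rng.random() < inv_logit(-1.0 + 0.5 * xi + 1.0 * xi * xi)) for xi in x]
    b0, b1 = fit_logistic(x, y_obs)  # deficient, linear-only fit
    observed_position = rng.randrange(n_panels)
    panels = []
    for position in range(n_panels):
        if position == observed_position:
            panels.append(pearson_panel(x, y_obs))
        else:
            # Null data: simulated from the fitted linear-only model.
            y_null = [int(rng.random() < inv_logit(b0 + b1 * xi)) for xi in x]
            panels.append(pearson_panel(x, y_null))
    return panels, observed_position
```

The hand-rolled Newton-Raphson fit keeps the sketch dependency-free; it has no safeguards against separated data, so in practice a standard logistic regression routine would be the safer choice.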

Figure 6: A binned residual plot from a simple binary logistic regression model (x-axis: predictor; y-axis: average deviance residual). The average deviance residual is plotted on the y-axis for each of 54 bins on the x-axis.

After establishing the pitfalls of “conventional” residual plots for binary logistic regression, you can introduce alternative strategies (i.e., new diagnostic plots) and again use the lineup protocol to calibrate student intuition. Below are two such examples.

Gelman & Hill (2007) recommend using binned residual plots to explore possible violations of linearity for binary logistic regression. A binned residual plot is created by calculating the average residual value within bins that partition the x-axis. Figure 6 shows a binned residual plot from a simple binary logistic regression model. The average deviance residual is plotted on the y-axis for each of 54 bins on the x-axis. The number of bins is set to $\lfloor \sqrt{n} \rfloor$, but can be adjusted as with a histogram. Gelman & Hill (2007) claim that these plots should behave much like the familiar standardized residual plots from regression. If this is the case, then Figure 6 is indicative of nonlinearity. However, rather than simply citing Gelman & Hill (2007) to students, a lineup empowers them to investigate the behavior of this new plot type. A lineup for these residuals is given in Figure 7. As suspected, the data plot (panel […]

Figure 7: A lineup of binned residual plots from a simple binary logistic regression model (x-axis: linear predictor; y-axis: binned deviance residuals). The observed residuals are shown in panel […].

Empirical logit plots

A more common alternative to the binned residual plot is the empirical logit plot (cf. Cannon et al. 2018, Ramsey & Schafer 2013). An empirical logit plot can be constructed for each explanatory variable by calculating the adjusted proportion of “successes” within each “group” as

$$\hat{p}_{\text{adj}} = \frac{\text{number of successes} + 0.5}{\text{number of cases} + 1},$$

and plotting $\log\left(\hat{p}_{\text{adj}} / (1 - \hat{p}_{\text{adj}})\right)$ against the average value of a quantitative explanatory variable, or the level of a categorical explanatory variable. For quantitative variables, it is common to form groups by forming bins of roughly equal size.

While an empirical logit plot is quite straightforward to create, it can be hard to interpret for smaller data sets where few groups are formed. For example, Cannon et al. (2018) use empirical logit plots to explore a binary logistic regression model for medical school admission decisions based on an applicant’s grade point average and render empirical logit plots based on both 5 and 11 bins. Figure 8 shows recreations of these plots. Experimenting with the bin width reveals the difficulty students may encounter determining whether linearity is reasonable: the plot can change substantially based on the bin width. In our experience, students often see some indication of non-linearity in the plot with 5 bins (Figure 8 (a)), whereas they think the plot with 11 bins (Figure 8 (b)) is reasonably linear.

Figure 8: Two empirical logit plots rendered for the same data set with n = 55 observations (x-axis: average GPA; y-axis: empirical logit). Panel (a) is rendered using 5 groups while panel (b) is rendered using 11 groups. The appearance of the plots changes substantially, often leading to confusion in interpretation.

To help students interpret whether observed patterns on empirical logit plots are problematic, we again appeal to the lineup protocol to enforce a comparison between the observed plot and what is expected under the model. Figure 9 displays a lineup of the empirical logit plot created using 5 groups. The observed plot (panel […]

In this section, we focused on using the lineup protocol to help diagnose logistic regression models, but the approach is generally applicable. If you have a plot highlighting some feature(s) of the fitted model, then after simulating data from a “correct” model (i.e. one without model deficiencies), you can create a lineup to interrogate the model. For example, Loy et al. (2017) discuss how visual inference can be used to diagnose multilevel models.

All of the lineups presented in this paper were rendered in R (R Core Team 2019). A tutorial outlining this process using the ggplot2 (Wickham 2016) and nullabor (Wickham et al. 2014) R packages is provided in the supplementary materials. These tools allow you to customize lineups for class use, but we do not recommend having introductory students grapple with this code. For introductory students, we recommend providing handouts or slides with pre-rendered lineups during class activities. Alternatively, we have created a suite of Shiny apps (Chang et al. 2019) where students can upload data sets and render lineups. The current suite includes apps to generate lineups to explore associations between groups, normal Q-Q plots, and residual plots for simple linear regression models. Links to the Shiny apps can be found at https://aloy.rbind.io/project/classroom-viz-inf/ . While we have found in-class activities where students use the lineup protocol to explore new plot types and the inferential thought process to be useful, these could also be assigned […]

Figure 9: A lineup of empirical logit plots from a simple binary logistic regression model (x-axis: average GPA; y-axis: empirical logit). The observed plot is shown in panel […]
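Both grouped diagnostics discussed in this section, binned residuals and empirical logits, reduce to simple per-bin summaries. The sketch below (Python, standard library only) is one way to compute them, not the code behind the figures. It assumes bins of near-equal size formed after sorting by the predictor, a default of floor(sqrt(n)) bins for binned residuals, and the common adjustment of adding 0.5 to the success count and 1 to the group size for the empirical logit. The function names are hypothetical, and the number of bins must not exceed the number of observations.

```python
import math


def bin_indices(x, n_bins):
    """Split the indices of x, ordered by value, into n_bins near-equal groups."""
    order = sorted(range(len(x)), key=lambda i: x[i])
    n = len(order)
    return [order[(b * n) // n_bins:((b + 1) * n) // n_bins] for b in range(n_bins)]


def binned_residuals(x, residuals, n_bins=None):
    """(mean x, mean residual) per bin; defaults to floor(sqrt(n)) bins."""
    if n_bins is None:
        n_bins = math.isqrt(len(x))
    return [
        (sum(x[i] for i in idx) / len(idx), sum(residuals[i] for i in idx) / len(idx))
        for idx in bin_indices(x, n_bins)
    ]


def empirical_logits(x, y, n_bins):
    """(mean x, empirical logit) per bin, using the adjusted proportion."""
    points = []
    for idx in bin_indices(x, n_bins):
        successes = sum(y[i] for i in idx)
        # Adjustment keeps the logit finite when a bin is all 0s or all 1s.
        p_adj = (successes + 0.5) / (len(idx) + 1)
        points.append((sum(x[i] for i in idx) / len(idx), math.log(p_adj / (1 - p_adj))))
    return points
```

Calling `empirical_logits` with different values of `n_bins` on the same data makes it easy to reproduce the kind of sensitivity to the number of groups that Figure 8 illustrates.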

The lineup protocol provides a framework to help students learn to interpret new statistical graphics and hone their intuition about what constitutes an interesting feature/pattern. This is achieved by randomly embedding the observed data plot into a field of decoy (null) plots. Lineups provide a natural way to introduce new statistical graphics throughout the statistics curriculum. At the introductory level, lineups can help students learn to detect association in side-by-side boxplots and mosaic plots, and detect problematic patterns in residual plots for regression models. In more advanced courses, lineups can be used to frame conversations about why conventional residual plots are problematic for certain models and can improve a student’s diagnostic ability as they investigate new models.

The shift to permutation tests in introductory courses has lowered the initial technical barriers to hypothesis testing; however, it still requires an explanation of why we need to resample and how we resample. Exploring lineups that you provide and making the analogy to the police lineup (or alternatively the Sesame Street question: “which one of these is not like the others”) introduces students to the logic behind testing without the need for these technical discussions. This allows initial focus to be on the core concepts of hypothesis testing rather than simultaneous focus on the core concepts and the technical details. We have found that a wide range of students understand why an inferential process is needed and what the findings imply at a more intuitive level after grappling with questions such as “which one of these plots is not like the others?”, “how do you know?”, and “what does this mean about your initial claim?” In addition, permutation tests logically follow the lineup protocol, providing students with the details behind the generation of the null/decoy plots and ways to formalize the strength of evidence against an initial claim.

Finally, the lineup protocol equips students with a rigorous tool for visual investigation that is applicable outside of the classroom. This not only prepares students to explore unfamiliar models or graphics in their own statistical analyses, but can facilitate “teaser” conversations about advanced models for majors. For example, if you introduce your students to the lineup protocol in a modeling course, then you can show a lineup of choropleth maps and discuss spatial statistics as a potential area of future study.

Supplementary materials

• A tutorial on using nullabor and ggplot2 to create lineups for common topics in introductory statistics can be found at https://aloy.github.io/classroom-vizinf/
• A suite of Shiny apps that creates lineups for common topics in introductory statistics can be found at https://aloy.rbind.io/project/classroom-viz-inf/

Acknowledgements

The author wishes to thank the editorial board of the StatTLC blog for their thoughts on an early version of this manuscript.

References

Agresti, A., Franklin, C. A. & Klingenberg, B. (2017), Statistics: The Art and Science of Learning from Data, 4th edn, Prentice Hall.

Bray, A., Ismay, C., Chasnovski, E., Baumer, B. & Cetinkaya-Rundel, M. (2019), infer: Tidy Statistical Inference. R package version 0.5.1. URL: https://CRAN.R-project.org/package=infer

Buja, A., Cook, D., Hofmann, H., Lawrence, M., Lee, E. K., Swayne, D. F. & Wickham, H. (2009), ‘Statistical inference for exploratory data analysis and model diagnostics’, Philosophical Transactions of the Royal Society A (1906), 4361–4383.

Cannon, A., Cobb, G. W., Hartlaub, B. A., Legler, J. M., Lock, R. H., Moore, T. L., Rossman, A. J. & Witmer, J. A. (2018), STAT2: Modeling with Regression and ANOVA, 2nd edn, MacMillan.

Chang, W., Cheng, J., Allaire, J., Xie, Y. & McPherson, J. (2019), shiny: Web Application Framework for R. R package version 1.4.0. URL: https://CRAN.R-project.org/package=shiny

Chihara, L. & Hesterberg, T. (2011), Mathematical Statistics with Resampling and R, Wiley.

Cobb, G. W. (2007), ‘The introductory statistics course: a ptolemaic curriculum?’, Technology Innovations in Statistics Education.

Cobb, G. W. (2011), ‘Teaching statistics: Some important tensions’, Chil. J. Stat. (1), 31–62.

De Veaux, R., Velleman, P. & Bock, D. (2018), Intro Stats, 5th edn, Pearson, Boston, MA.

GAISE College Report ASA Revision Committee (2016), ‘Guidelines for assessment and instruction in statistics education college report 2016’.

Gelman, A. & Hill, J. (2007), Data analysis using regression and multilevel/hierarchical models, Cambridge University Press, New York.

Hildreth, L. A., Robison-Cox, J. & Schmidt, J. (2018), ‘Comparing student success and understanding in introductory statistics under consensus and Simulation-Based curricula’, Statistics Education Research Journal (1), 103–120.

Kaplan, D. & Pruim, R. (2019), ggformula: Formula Interface to the Grammar of Graphics. R package version 0.9.1. URL: https://CRAN.R-project.org/package=ggformula

Lock, R., Frazer Lock, P., Lock Morgan, K., Lock, E. & Lock, D. (2017), Statistics: Unlocking the Power of Data, 2nd edn, John Wiley & Sons, Hoboken.

Loy, A., Hofmann, H. & Cook, D. (2017), ‘Model choice and diagnostics for linear Mixed-Effects models using statistics on street corners’, Journal of Computational and Graphical Statistics (3), 478–492. URL: http://dx.doi.org/10.1080/10618600.2017.1330207

Maurer, K. & Lock, D. (2014), ‘Comparison of learning outcomes for randomization-based and traditional inference curricula in a designed educational experiment’, pp. 1–18.

Pruim, R., Kaplan, D. T. & Horton, N. J. (2017), ‘The mosaic package: Helping students to ’think with data’ using R’, R Journal (1), 77–102. URL: https://journal.r-project.org/archive/2017/RJ-2017-024/RJ-2017-024.pdf

R Core Team (2019), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.

Ramsey, F. & Schafer, D. (2013), The Statistical Sleuth: A course in Methods of Data Analysis, 3rd edn, Cengage Learning, Boston, MA.

Roseth, C. J., Garfield, J. B. & Ben-Zvi, D. (2008), ‘Collaboration in learning and teaching statistics’, Journal of statistics education: an international journal on the teaching and learning of statistics (1). URL: https://doi.org/10.1080/10691898.2008.11889557

Tintle, N., Chance, B., Cobb, G., Roy, S., Swanson, T. & VanderStoep, J. (2015), ‘Combating Anti-Statistical thinking using Simulation-Based methods throughout the undergraduate curriculum’, The American Statistician (4), 362–370.

Tintle, N., Chance, B. L., Cobb, G. W., Rossman, A. J., Roy, S., Swanson, T. & VanderStoep, J. (2015), Introduction to statistical investigations, John Wiley & Sons, Danvers, MA.

Tintle, N. L., Topliff, K. & VanderStoep, J. (2012), ‘Retention of statistical concepts in a preliminary randomization-based introductory statistics curriculum’, Statistics Education Research Journal, 21–40.

Tintle, N., Rogers, A., Chance, B., Cobb, G., Rossman, A., Roy, S., Swanson, T. & VanderStoep, J. (2014), Quantitative evidence for the use of simulation and randomization in the introductory statistics course, in K. Makar, B. de Sousa & R. Gould, eds, ‘Sustainability in statistics education. Proceedings of the Ninth International Conference on Teaching Statistics (ICOTS9, July, 2014), Flagstaff, Arizona, USA.’.

Tintle, N., VanderStoep, J. & Holmes, V. L. (2011), ‘Development and assessment of a preliminary randomization-based introductory statistics curriculum’, Journal of Statistics Education.

Wickham, H. (2016), ggplot2: Elegant Graphics for Data Analysis, Springer-Verlag New York. URL: https://ggplot2.tidyverse.org

Wickham, H., Chowdhury, N. R. & Cook, D. (2014), nullabor: Tools for Graphical Inference. R package version 0.3.1. URL: http://CRAN.R-project.org/package=nullabor

Wild, C. J., Pfannkuch, M., Regan, M. & Parsonage, R. (2017), ‘Accessible conceptions of statistical inference: Pulling ourselves up by the bootstraps’, International Statistical Review (1), 84–107. URL: https://onlinelibrary.wiley.com/doi/full/10.1111/insr.12117