# A fresh look at introductory data science


Mine Çetinkaya-Rundel (School of Mathematics, University of Edinburgh; Department of Statistical Science, Duke University; and RStudio) and Victoria Ellison (Department of Statistical Science, Duke University)

August 4, 2020

Abstract

The proliferation of vast quantities of available datasets that are large and complex in nature has challenged universities to keep up with the demand for graduates trained in both the statistical and the computational set of skills required to effectively plan, acquire, manage, analyze, and communicate the findings of such data. To keep up with this demand, attracting students early on to data science as well as providing them a solid foray into the field becomes increasingly important. We present a case study of an introductory undergraduate course in data science that is designed to address these needs. Offered at Duke University, this course has no pre-requisites and serves a wide audience of aspiring statistics and data science majors as well as humanities, social sciences, and natural sciences students. We discuss the unique set of challenges posed by offering such a course and, in light of these challenges, we present a detailed discussion of the pedagogical design elements, content, structure, computational infrastructure, and the assessment methodology of the course. We also offer a repository containing all teaching materials that are open-source, along with supplemental materials and the R code for reproducing the figures found in the paper.

Keywords: data science curriculum, exploratory data analysis, data visualization, modeling, reproducibility, R

## Introduction

How can we effectively and efficiently teach data science to students with little to no background in computing and statistical thinking? How can we equip them with the skills and tools for reasoning with various types of data and leave them wanting to learn more? This paper describes an introductory data science course that is our (working) answer to these questions.

At its core, the course focuses on data acquisition and wrangling, exploratory data analysis, data visualization, inference, modeling, and effective communication of results. Time permitting, the course also provides very brief forays into additional tools and concepts such as interactive visualizations, text analysis, and Bayesian inference. A heavy emphasis is placed on a consistent syntax (with tools from the tidyverse), reproducibility (with R Markdown), and version control and collaboration (with Git and GitHub). The course design builds on the three key recommendations from Nolan & Temple Lang (2010): (1) broaden statistical computing to include emerging areas, (2) deepen computational reasoning skills, and (3) combine computational topics with data analysis. The goal of the course is to bring students from zero experience to being able to complete a fully reproducible data science project on a dataset of their choice and answer questions that they care about within the span of a semester.

In Section 2 of this paper, we start with a review of the most recent curriculum guidelines for undergraduate programs in data science, statistics, and computer science. In this section we also present a synopsis of the course content and structure of introductory data science courses at four other institutions with the goal of providing a snapshot of the current state of affairs in undergraduate introductory data science curricula.
In Section 3 we outline the overall design goals of the Duke University introductory data science course that is the focus of this article and discuss how this course addresses current undergraduate curriculum guidelines in statistics and data science. In Section 4 we expand on the course content, flow, and pacing, and present examples of case studies from the course. In Section 5 we detail the pedagogical methods employed by this course, specifically addressing how these methods can support a large class with students with a diverse range of previous experiences in statistics and programming. Section 6 presents the computing infrastructure of the course, Section 7 presents the methods of assessment, and finally in Section 8 we provide a synthesis of where this course sits in the landscape of introductory data science curriculum guidelines, future design plans for the course, and opportunities and challenges for faculty wanting to adopt this course.

An exact characterization of what the field of data science is meant to encompass is still debated. However, in this paper we define data science as the “science of planning for, acquisition, management, analysis of, and inference from data” (NSF 2014). We reviewed four of the most recent curriculum guidelines for undergraduate programs in data science, statistics, and computer science to assess how the case study course measures up against them.

While the 2013 Computer Science Curricula of the Association for Computing Machinery (ACM) (Sahami et al. 2013) do not mention suggestions for integrating data science into a computer science major, the 2019 report by the ACM Task Force on Data Science Education (Danyluk et al. 2019) gives suggestions of core competencies a graduating data science student should leave with. Each competency corresponds to one of nine data science knowledge areas: computing fundamentals; data acquirement and governance; data management, storage, and retrieval; data privacy, security, and integrity; machine learning; big data; analysis and presentation; and professionalism. The report also suggests that a full data science curriculum should integrate courses in “calculus, discrete structures, probability theory, elementary statistics, advanced topics in statistics, and linear algebra.” We note, however, that this document was released as a draft at the time of writing this manuscript.

Their recommendation for the first course is to introduce the statistical analysis process starting with formulating good questions and considering whether available data are appropriate for addressing the problem, then conducting a reproducible data analysis, assessing the analytic methods, drawing appropriate conclusions, and communicating results.
They also recommend that data science skills, such as managing and wrangling data, algorithmic problem solving, working with statistical analysis software, as well as high-level computing languages and database management systems, be well integrated into the statistics curriculum.

The 2016 Guidelines for Assessment and Instruction in Statistics Education (GAISE), endorsed by the American Statistical Association, also do not make specific recommendations for introductory data science courses; however, the guidelines place emphasis on teaching statistics as an “investigative process of problem-solving and decision making” as well as giving students experience with “multivariable thinking” (Carver et al. 2016). The guidelines also recommend that students use technology to explore concepts and analyze data, and suggest examples of doing so using the R statistical programming language (R Core Team 2020).

The Curriculum Guidelines for Undergraduate Programs in Data Science suggest that the first introductory course for students majoring in data science should introduce students to a high-level computing language (they recommend R) to “explore, visualize, and pose questions about the data” (De Veaux et al. 2017). Introduction to a high-level computing language, data exploration and wrangling, basic programming and writing functions, introduction to deterministic and stochastic modeling, concepts of projects and code management, databases, and introduction to data collection and statistical inference are among the suggested list of topics for the first two courses in a data science major. Furthermore, the guidelines propose that the introductory data science courses be taught in a way that follows the full iterative data science life cycle, “from initial investigation and data acquisition to the communication of final results.” Finally, this report recommends ending the course with a version-controlled, fully-reproducible, team-based project, complete with a written and oral presentation.
While the Duke University course we describe in Sections 3 through 8 was originally designed prior to the publication of De Veaux et al. (2017), the guidelines outlined in this report served as inspiration for many of the updates to the course over the five years that it has been taught.

In addition to curriculum guidelines, there exists a body of literature on suggestions and case studies for integrating data science computational skills into the general statistics curriculum. Nolan & Temple Lang (2010) suggest including and discussing in detail fundamentals in scientific computing with data, information technologies, computational statistics (e.g., numerical algorithms) for implementing statistical methods, advanced statistical computing, data visualization, and integrated development environments into the undergraduate statistics curriculum. Hardin et al. (2015) and Baumer (2015) provide case studies of data science courses that use R as a computing language and have been implemented at various levels within a statistics undergraduate major. Dichev & Dicheva (2017) and Brunner & Kim (2016) discuss single Python-based introductory data science case studies for courses without prerequisites. Dichev et al. (2016) describe an introductory data science course that teaches Python and R and that does not have any prerequisites. Finally, while technically written for data science graduate courses, Hicks & Irizarry (2018) promote teaching data science by using numerous case studies and emulating the process that data scientists would use to answer research questions.

In their report titled “Data Science for Undergraduates, Opportunities and Options”, the National Academies of Sciences Engineering and Medicine (NASEM) provide a wider survey of institutions that have implemented stand-alone introductory data science courses designed to serve as a general education requirement or garner general interest in the field of data science (NASEM 2018).
Three major challenges identified in the report that are associated with teaching an introductory data science course without any prerequisites are (1) increasing student interest that is reflected in higher enrollment numbers and the need to reconcile this with instructor availability, (2) specific curriculum of the course varying from semester to semester based on instructor expertise and interests, and (3) students with diverse computing backgrounds thriving in a course with a one-size-fits-all curriculum.

As part of our efforts to understand the landscape of undergraduate introductory data science courses, we surveyed four courses that do not require any student background in statistics or programming. These courses are as follows:

1. Foundations of Data Science (DATA 8) at University of California Berkeley
2. Foundations of Data Science at University of Cambridge
3. Introduction to Data Science (SDS 192) at Smith College
4. Data Science 101 (STATS 101) at Stanford University

Table 1: Summary of programming languages used in each course and the estimated breakdown of percent of class time spent on various course components.

|  | Duke | Berkeley | Cambridge | Smith | Stanford |
|---|---|---|---|---|---|
| Programming language | R | Python | Pseudocode | R, SQL | R |
| Data visualization | 15% | 5% | 0% | 32% | 10% |
| Data wrangling | 10% | 15% | 0% | 36% | 0% |
| Other EDA | 10% | 5% | 0% | 12% | 10% |
| Inference | 20% | 30% | 25% | 0% | 50% |
| Modeling | 25% | 20% | 35% | 0% | 20% |
| Programming principles | 10% | 10% | 0% | 5% | 0% |
| Mathematical foundations / theory | 5% | 5% | 35% | 0% | 0% |
| Communication | 5% | 5% | 0% | 10% | 10% |
| Ethics | 0% | 5% | 5% | 5% | 0% |

These courses were selected based on the ranking of the programs they are taught in as well as the type of institution: we wanted to capture courses from a variety of institutions in terms of public/private, US/non-US, research/liberal arts (U.S. News & World Report 2018, QS World University Rankings 2017).
These were courses we were somewhat familiar with prior to data collection and hence knew that they fit our requirements.

Table 1 gives a summary of the programming languages used as well as a rough course content breakdown for these four courses as well as the Duke University course that we discuss in further detail in the remainder of this manuscript.

For each course, we surveyed the online course syllabus from a recent semester and noted the lecture topic for each day of the course, the programming language(s) used, and the assessed components. Then for each course, we classified each day’s lecture topic into one of nine content categories given in Table 1. Using these classifications we calculated an approximate distribution of the amount of lecture time spent on each of the nine content categories. Finally, we contacted the instructors of these four courses and, based on their feedback, adjusted our original content distribution estimates.

We first note that programming plays a central role in each of these courses. The courses at Duke University, Smith College, and Stanford University teach R, and the course at UC Berkeley teaches Python. The course at University of Cambridge is unique as it teaches only pseudocode, although students are encouraged to learn Python on their own time. In line with the greater focus that the Smith College course places on data wrangling, SQL is also used in this course.

We allocated content in our rubric for “Communication” if the course has a student project in which the students had to present their findings. We note that the Duke University, Smith College, Stanford University, and UC Berkeley courses all have some project presentation element.
No project component was mentioned for the University of Cambridge course.

In addition, Duke University, Smith College, UC Berkeley, and University of Cambridge courses all have some discussion on data ethics built into the class.

We next note the differences in the extent to which each of these courses makes use of group assignments and assessments. At Duke University students complete homework assignments and take-home exams individually, and lab assignments and projects in groups. At Smith College students work individually on homework assignments as well as on exams, they are strongly encouraged to work in pairs on the lab assignments, and they work in groups for the projects. At Stanford University students work individually on exams and homework assignments. At UC Berkeley, the labs, homework assignments, and exams are completed individually by the student, while the students are allowed to work with one other student during the project. Finally, at University of Cambridge, students take one exam that they complete individually.

We note the wide variation in course content across these classes. For instance, Smith College emphasizes the initial phases of the data science life cycle, such as data visualization and data wrangling, whereas Duke University, UC Berkeley, Stanford University, and University of Cambridge place more attention on the middle phase of the data science life cycle, such as inference and modeling. The University of Cambridge course places a heavier emphasis on the mathematical foundations of data science than the other four courses.
Finally, while the Duke University, UC Berkeley, and University of Cambridge courses place roughly equal focus on inference and modeling, the Stanford University course places a much larger emphasis on inference than on modeling.

Part of the reason for different levels of emphasis placed on different phases of the data science life cycle that we observe among these classes may be attributed to the differences in the primary audience the course is designed for. For instance, the Duke University course is designed to provide a common (gateway) experience to students interested in the Statistical Science major and minor or the interdisciplinary major in Data Science. The Smith College course is a required course for statistics majors while the UC Berkeley course is aimed at entry-level students from all majors and the University of Cambridge course is designed as a prerequisite for more advanced statistical and computer science topics.
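The content breakdown reported in Table 1 rests on a simple tally: each lecture day is assigned to one content category, and the share of days per category gives the approximate distribution. As a minimal sketch of that arithmetic, written in Python with only the standard library, the snippet below tallies a made-up list of lecture classifications. The list `lecture_categories` and the function `content_distribution` are hypothetical names introduced for illustration; they are not the authors' data or code, and the later adjustment based on instructor feedback is not modeled.

```python
from collections import Counter

# Hypothetical day-by-day classification for one course. The real syllabi
# and category assignments are not reproduced in the paper.
lecture_categories = [
    "Data visualization", "Data visualization", "Data wrangling",
    "Inference", "Inference", "Modeling", "Modeling", "Modeling",
    "Communication", "Ethics",
]


def content_distribution(categories):
    """Return the percent of lecture days assigned to each category."""
    counts = Counter(categories)
    total = sum(counts.values())
    return {category: 100 * n / total for category, n in counts.items()}


distribution = content_distribution(lecture_categories)
for category, percent in sorted(distribution.items()):
    print(f"{category}: {percent:.0f}%")
```

Because every lecture day falls into exactly one category, the percentages for a course sum to 100, which is also true of each column in Table 1.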

In this paper we describe an introductory data science course that is designed to provide a common (gateway) experience to students interested in the Statistical Science major and minor or the interdisciplinary major in Data Science offered at Duke University called *Introduction to Data Science and Statistical Thinking*. A version of this course has been offered as a seminar to first year undergraduates each fall semester since the fall of 2014, with an enrollment of 18 students at each offering under the title *Better Living with Data Science*. The course, with some modifications for scale, was opened up to an audience of 80 students in the Spring semester of 2018.

The main design goals were to create a course that is modern, that places data front and center, that is quantitative without mathematical prerequisites, that is different from high school statistics, and that is challenging without being intimidating. The course emphasizes modern and multivariate exploratory data analysis, and specifically data visualization; starts at the beginning of the data analysis cycle with data collection and cleaning; encourages and enforces thinking, coding, writing, and presenting collaboratively; explicitly teaches best practices and tools for reproducible computing; approaches statistics from a model-based perspective, lessening the emphasis on statistical significance testing; and underscores effective communication of findings.

In addition, use of real data is a pillar of this course. Not only is this strongly recommended in Carver et al. (2016), but it also equips students with the tools to answer questions of their own choosing as part of their end-of-semester project.

Figure 1 summarizes the flow of the three learning units in STA 199: exploring data, making rigorous conclusions, and looking forward. The arrows represent a continuous review and reuse of previous material as new topics are introduced. The course ultimately covers all steps of the full data science cycle presented in Wickham & Grolemund (2016), which includes data import, tidying, exploration (visualise, model, transform), and communication. In Section 4 we describe in detail the topics covered in each of these units.

Figure 1: Flow of topics in *Introduction to Data Science and Statistical Thinking* at Duke University.

The course comprises three learning units. The first two are of roughly equal length, and the last covers two weeks of a fifteen-week semester.

This unit has three main foci: data visualization, data wrangling, and data import. The learning goals of the unit are as follows:

1. Introduce the R statistical programming language via building simple data visualisations.
2. Build graphs displaying the relationship between multiple variables using data visualisation best practices.
3. Perform data wrangling, tidying, and visualisation using packages from the tidyverse.
4. Import data from various sources (e.g., CSV, Excel), including by scraping data off the web.
5. Create reproducible reports with R Markdown, version tracked with Git and hosted on GitHub.
6. Collaborate on assignments with team mates and resolve any merge conflicts that arise.

On the first day of the course students log in to a web-based R session and create a multivariate visualisation exploring how countries have voted in the United Nations General Assembly on various issues such as human rights, nuclear weapons, and the Palestinian conflict using data from the unvotes package in R (Robinson 2017). This is used as an ice breaker activity to get students talking to each other about what countries they are interested in exploring. The activity also gets them creating and interpreting a data visualisation. Getting students to create a data visualisation in R so quickly is made possible using cloud-based computing infrastructure (which we describe in more detail in Section 6) and a fully functional R Markdown document. We call this the “let them eat cake first” approach, where students first see an example of a complex data visualisation, which they will be able to build by the end of this unit, and then slowly work their way through the building blocks (Çetinkaya-Rundel 2018). This approach is also presented in Wang et al. (2017), which advocates for “bringing big ideas into intro stats early and often”.

There are two main reasons for starting data science instruction with data visualisation. The first reason is that most students come in with intuition for being able to interpret data visualizations without needing much instruction. This means we can focus the majority of class time initially on R syntax, and leave it up to the students to do the interpretation. Later in the course, as students are getting more comfortable with R and more advanced statistical techniques are introduced, the balance tips and we spend more class time on concepts and model interpretation and less on syntax. Second, it can be easier for students to detect if they are making a mistake when building a visualization, compared to data wrangling or statistical modeling.

In addition to the process of creating data visualisations, this unit focuses on critiquing and improving data visualisations. After a brief lecture on data visualisation best practices, which was designed in collaboration with data visualisation experts at Duke University, we present guidance for implementing these best practices in ggplot2 graphics. Each team is given a flawed data visualisation as well as the raw data it is based on. First, they critique the data visualisation and brainstorm ways of improving it. Then, they (attempt to) implement their suggestions for improvements. Finally, they present why and how they improved their visualisations to the rest of the class. Since this exercise happens early on in the semester, some teams fail to implement all of their suggestions, but this ends up being a motivator for learning.
Additionally, multiple teams work on the same visualisation and data, which makes the presentations valuable opportunities for learning from each other. This exercise is described in further detail, along with specific data sources and sample visualisations, in Çetinkaya-Rundel & Tackett (2020).

In the data wrangling and tidying part of Unit 1, we make heavy use of the dplyr and tidyr packages for transforming and summarising data frames, joining data from multiple data frames, and reshaping data from wide to long / long to wide format. One example of a data join is an exercise where country level data is joined with a continent lookup table. This simple exercise presents an opportunity to discuss data science ethics as some of the countries in the original dataset do not appear in the continent lookup table (e.g., Hong Kong and Myanmar) due to political reasons. The technical solution to this problem is straightforward: we can manually assign these countries to a continent based on their geographic location. However, we also discuss that country-level datasets are inherently political as different nations have different definitions of what constitutes a country, an example of how data processing workflow might be affected by data issues (NASEM 2018).

This data wrangling task is tied to a visualisation exercise as well. By joining shapefile data to the country data we have, we create choropleth maps. To simplify the exercise, we use the maps package, along with ggplot2, for built-in shapefiles instead of downloading these files from the web (Becker et al. 2018).

Finally in Unit 1 we touch on data import. We start by introducing commonly used data import options for reading rectangular data into R (e.g., using `read_csv()` or `read_excel()` functions from the readr and readxl packages). We then present web scraping as a technique for harvesting data off the web using the rvest package (Wickham 2019).
We scrape data from OpenSecrets (opensecrets.org), a non-profit research group that tracks money in politics in the United States. While the specific dataset we scrape changes from year to year, the structure of the web scraping activity stays relatively constant: first scrape data from a single page (containing data on a single voting district, or single election year), convert the code developed for scraping data from this single page into a function that takes a URL and returns a structured data frame, and finally iterate over many similar web pages (other voting districts, or other election years) using mapping functions from the purrr package (Henry & Wickham 2020). We usually end this exercise with a data visualisation created using the scraped data that allows students to gain insights that would have been impossible to uncover without getting the data off the web and into R.

In summary, this unit starts off with data visualisation on a dataset that is already clean and tidy (and usually contained in an R package). Then, we take one step back and learn about data wrangling and tidying. Finally, we take one more step back and introduce both statistical and computational aspects of data collection and reading data into R from various sources.

In Unit 1 students develop their skills for describing relationships between variables, and the transition to Unit 2 is done via the desire to quantify these relationships and to make predictions. This unit is designed to achieve the following learning goals:

1. Quantify and interpret relationships between multiple variables.
2. Predict numerical outcomes and evaluate model fit using graphical diagnostics.
3. Predict binary outcomes, identify decision errors and build basic intuition around loss functions.
4. Perform model building and feature evaluation, including stepwise model selection.
5. Evaluate the performance of models using cross-validation techniques.
6. Quantify uncertainty around estimates using bootstrapping techniques.

We start off by introducing simple linear regression, but then quickly move on to multiple linear regression with interaction effects since students are already familiar with the idea that we need to examine relationships between multiple variables at once to get a realistic depiction of real world processes. We also introduce logistic regression, albeit briefly. Prediction, model selection, and model validation are introduced to pave the pathway for machine learning concepts that students can dive further into in subsequent higher level classes.

Finally in this unit we introduce the concept of quantifying uncertainty, starting with uncertainty in slope estimates and model predictions. We also touch on slightly more traditional introductory statistics topics such as statistical inference for comparing means and proportions. However, unlike many traditional introductory statistics courses, inference focuses on confidence intervals, constructed using bootstrapping only.

In designing this unit we had three goals in mind: (1) introduce models with multiple predictors early, (2) touch on elementary machine learning methods, and (3) de-emphasize the use of p-values for decision making. The first goal addresses the 2016 GAISE recommendation for giving students experience with multivariable thinking (Carver et al. 2016). Additionally, introducing this topic early helps students frame their project proposals (often due in the middle of this unit) by signalling that this is a technique they might use in their projects.
Teaching logistic regression also proves to be invaluable in a course where students later choose their own datasets and research questions for their final projects. Each semester there are a considerable number of teams who, as part of their project, want to tackle a task involving predicting categorical outcomes, and familiarity with logistic regression allows them to do so as long as they can dichotomize their outcome. The second goal (touching on machine learning methods) presents two opportunities. First, it enables a discussion on modeling binary outcomes as both “logistic regression” (where we interpret model output to evaluate relationships between variables) and “binary classification” (where we care more about prediction than explanation). Second, exposing students to foundational techniques like classification, predictive modeling, cross-validation, etc. enables them to start developing basic familiarity with machine learning approaches. The third goal (de-emphasize the use of p-values for decision making) is achieved by not covering null hypothesis significance testing in any meaningful way. Traditional statistical inference topics are limited to confidence intervals and decision errors that are presented in the context of a logistic regression / classification. Students learn how to construct confidence intervals using bootstrapping; emphasis is placed on interpreting these intervals in the context of the data and the research question, and we discuss decision making based on these intervals. We also present decision making in the context of a classification problem (a spam filter), where we explore the cost of Type 1 and Type 2 errors to start building intuition around loss functions.

One of the datasets featured in this unit comes from 18th century auctions for paintings in Paris. In the case study of these paintings, we explore relationships between metadata on paintings that were encoded based on descriptions of paintings from over 3,000 printed auction catalogues.
These data include attributes like dimensions, material, orientation, and shape of canvas, number of figures in the painting, school of the painter, as well as whether the painting was auctioned as part of a lot or on its own. The goal is to build a model predicting price of paintings. However, the data require a fair amount of cleaning before they can be used for building meaningful models. For example, some of the categorical variables (e.g., material and shape of canvas) have levels that are either misspelled or occur at low frequency. This offers an opportunity for students to review data wrangling skills from the previous unit while also learning about modeling. Additionally, the response variable, price, is right skewed, which provides a nice opportunity to introduce transformations. Finally, the dataset has over 60 variables, which means considering all interaction effects is not trivial. Instead we explore interaction effects that the data experts (art historians who created the dataset) have suggested. This provides an opportunity for discussion around automated model selection methods vs. model building based on expert opinion.

Other datasets include professor evaluations and their “beauty” scores (numerical, continuous outcome: evaluation score) and metadata on emails (categorical, binary outcome: spam/not spam).

On the computational side, we use the broom package (Robinson & Hayes 2020) for tidy presentation of model output. Two features of this package are especially well suited for the learning goals of this course.
First, regression output is returned as a data frame that makes it easier to extract values from the output to include in reproducible reports. This allows students to easily use inline R code chunks to extract statistics like coefficient estimates or R-squared values from model outputs and include them in their interpretations rather than typing them out manually, a practice that is recommended for reproducibility of reports. Second, model summaries printed using the `tidy()` function from the broom package do not contain the significance stars that draw attention to p-values. Note that it is possible to turn these off in base R model summaries as well, but it is preferable to not have them in the first place.

Like broom, other R packages introduced in this unit are part of the tidymodels suite of packages, which is “a collection of packages for modeling and machine learning using tidyverse principles” (Kuhn & Wickham 2020). These include infer for simulation-based statistical inference and modelr for quantifying predictive performance.
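Since inference in this unit rests on bootstrap confidence intervals, it may help to see the mechanics spelled out: resample the data with replacement many times, recompute the statistic on each resample, and take percentiles of the resulting estimates. The course does this in R; the sketch below uses Python's standard library instead, purely to show the steps. The function `bootstrap_ci`, its default arguments, and the `scores` values are assumptions made for illustration, not course materials.

```python
import random
import statistics


def bootstrap_ci(sample, statistic=statistics.mean, n_resamples=2000,
                 level=0.95, seed=1):
    """Percentile bootstrap confidence interval for `statistic`."""
    rng = random.Random(seed)  # fixed seed so the result is reproducible
    n = len(sample)
    # Resample with replacement, recompute the statistic, and sort.
    estimates = sorted(
        statistic(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    alpha = (1 - level) / 2
    lower = estimates[round(alpha * n_resamples)]
    upper = estimates[round((1 - alpha) * n_resamples) - 1]
    return lower, upper


# Hypothetical evaluation scores, invented for illustration only.
scores = [3.2, 4.1, 4.5, 3.8, 4.9, 4.0, 3.5, 4.4, 4.7, 3.9]
lower, upper = bootstrap_ci(scores)
print(f"95% bootstrap CI for the mean: ({lower:.2f}, {upper:.2f})")
```

Passing `statistic=statistics.median` gives an interval for the median with no other changes, which is part of the appeal of the resampling approach for an introductory audience.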

This unit is designed to shrink or expand as needed depending on time left in the semester. Each module is designed to cover one class period and aims to provide a brief introduction to a topic students might explore in higher level courses. One exception to this is an ethics module, which kicks off the unit and is the only required component. In this module we introduce ethical considerations around misrepresentation in data visualizations and reporting of analysis results, p-hacking, privacy, and algorithmic bias.

The remaining topics in the unit vary from semester to semester depending on interests of the students and the instructor. In each class period students are exposed to a few R packages that they use to engage with specialised tasks (e.g., flexdashboard for building dashboards (Iannone et al. 2020), genius for accessing song lyrics (Parry & Barr 2020), gutenbergr for retrieving text from books (Robinson 2019), shiny for creating web apps (Chang et al. 2020), tidytext for text analysis (Silge & Robinson 2016)). Table 2 lists topics covered in this unit in the past, along with a brief synopsis.

Table 2: Topics previously covered in Unit 3 of the course.

| Topic | Synopsis | Duration |
|---|---|---|
| Data science ethics | Misrepresentation of results in data visualisations and reporting, data privacy and data breaches, gender bias in machine translated text, algorithmic bias and race in sentencing and parole length decisions. | 1-2 class periods |
| Interactive reporting and visualisation with Shiny | Introduce the basics of the shiny package for building interactive web applications and build a simple application for browsing data on movies. | 1 class period |
| Building static dashboards | Build static dashboards using the flexdashboard package. | 1 class period |
| Building interactive dashboards | Build interactive dashboards using the shiny and flexdashboard packages. | 2 class periods |
| Text mining | Perform basic text mining techniques (e.g., sentiment analysis, term frequency-inverse document frequency) using the tidytext package and data on song lyrics (retrieved with the genius package) or on books (retrieved with the gutenbergr package). | 1 class period |
| Bayesian inference | Introduction to Bayesian inference as a way of decision making using data on sensitivity and specificity of breast cancer screening tests. | 1 class period |
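The Bayesian inference module listed in the table lends itself to a short worked calculation. The sketch below applies Bayes' rule to a screening test; the prevalence, sensitivity, and specificity values are made-up placeholders chosen for illustration, not the figures used in the course.

```r
# Hypothetical inputs (assumptions for illustration only)
prevalence  <- 0.01  # P(disease)
sensitivity <- 0.90  # P(test positive | disease)
specificity <- 0.92  # P(test negative | no disease)

# Bayes' rule:
#   P(disease | positive) = P(positive | disease) * P(disease) / P(positive)
p_positive <- sensitivity * prevalence +
  (1 - specificity) * (1 - prevalence)
posterior <- sensitivity * prevalence / p_positive

# A second positive result can use the first posterior as its prior,
# assuming the two test results are conditionally independent
p_positive_2 <- sensitivity * posterior +
  (1 - specificity) * (1 - posterior)
posterior_2 <- sensitivity * posterior / p_positive_2
```

With these placeholder inputs the first posterior is roughly 0.10, which illustrates why a single positive result from a fairly accurate test can still leave the probability of disease low when the condition is rare.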

Pedagogy

In this section we discuss the various pedagogical choices (teamwork, lectures sprinkled with hands-on exercises, computational labs, etc.) as well as assessment components and feedback loops in the course. We anticipate that instructors designing a similar course would be especially interested in how we evaluate whether students in the course achieve the outlined learning goals as well as a commentary on assessment scalability for larger courses.

The pedagogical methods employed are tailored to several specific aspects of the course. First, the course is relatively large in size with about 80-90 students. Second, while the course has no statistical or computing pre-requisites, students come into the course with very diverse backgrounds – some have no prior exposure to statistics or computing while others may have already had a few classes in either of the subjects, or both. As suggested by the literature (Michaelsen & Sweet 2011), we employ several team-based learning techniques to address the challenges of keeping a large lecture hall of students with varying degrees of background knowledge both challenged and engaged.

Within each lab section we aim to disperse students who have previously learned some computing and/or statistics and those without any background in these areas evenly amongst groups of four. In order to gauge a student’s prior background in statistics we have each student complete a pretest before the course begins. We use the Comprehensive Assessment of Outcomes in a First Statistics course (CAOS) test, an online test developed by the Assessment Resource Tools for Improving Statistical Thinking (ARTIST) project (app.gen.umn.edu/artist), intended to assess students on the key concepts that any student coming out of an introductory statistics course would be expected to know.
We use a combination of scores from this test as well as information on computing experience to roughly classify students into three categories of “has background”, “doesn’t have any background”, and “somewhere in between”. We then assign one student who is identified as “has background”, one who is identified as “doesn’t have any background”, and two students from the “somewhere in between” category to each team. In choosing which students to pick from these categories to place into each team, we take into account self-reported information collected via a “Getting to Know You” survey, such as interests, (planned) major, personal pronouns, etc. We aim to create demographically diverse teams where each student shares some attributes with at least one other student in the team. The team assignment process is carried out manually, which presents challenges as the class size grows. However, since students stay in these teams throughout the entire semester, taking extra care during the team formation process is a worthwhile investment for reducing team dynamic issues that might arise later in the semester.

The method of content delivery is mostly lecture, and student feedback on whether they desire more or less content to be delivered during the actual lecture has been mixed. Future iterations of this course may seek to decrease the amount of new content delivered to the students during the lecture and shift the students’ first exposure to the material to pre-class assignments or videos. This shift is informed by the body of literature which suggests better learning and better student satisfaction in introductory statistics courses taught using a flipped classroom approach, where students completed relatively simple reading and answered reading quiz questions prior to class and completed hands-on exercises in class (Wilson 2013, Winquist & Carlson 2014).
In place of new content delivered in lecture, future iterations of the course may incorporate more extensive group application exercises into the class time, allowing students to get individual feedback on their current understanding from their peers, the TAs, and the instructor.
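The team composition rule described earlier in this section (one student with background, one without, and two in between per team of four) can be sketched as a stratified random draw. This is only a hypothetical starting point: the course forms teams manually and also weighs survey information, which the sketch ignores. The roster, the category labels, and the deal() helper are all invented for illustration.

```r
set.seed(2020)  # arbitrary seed so the draw is repeatable

# Hypothetical roster: enough students for 8 teams of four
roster <- data.frame(
  student    = sprintf("student_%02d", 1:32),
  background = rep(c("has", "none", "between"), times = c(8, 8, 16)),
  stringsAsFactors = FALSE
)

n_teams <- 8

# Shuffle the students in one category, then deal them out to teams
deal <- function(ids, per_team) {
  data.frame(
    student = sample(ids),
    team    = rep(seq_len(n_teams), each = per_team),
    stringsAsFactors = FALSE
  )
}

# Each team receives 1 "has", 1 "none", and 2 "between"
teams <- rbind(
  deal(roster$student[roster$background == "has"], 1),
  deal(roster$student[roster$background == "none"], 1),
  deal(roster$student[roster$background == "between"], 2)
)
teams <- merge(teams, roster, by = "student")

# Cross-tabulate to check the composition of every team
composition <- table(teams$team, teams$background)
```

A draw like this could serve as a first pass that an instructor then adjusts by hand, for example to make sure each student shares some attributes with at least one teammate.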

In this section we discuss the computing choices made in the course, including infrastructure, syntax, and tools. We detail the computing infrastructure used in the course (access to RStudio in the cloud) and provide pedagogical justifications for the decisions made in setting up this infrastructure. Additionally, we provide a road map of the computational toolkit, outlining when and why students get introduced to each new package or software.

6.1 Seamless onboarding with RStudio Cloud

This course follows the recommendations outlined in Çetinkaya-Rundel & Rundel (2018) for setting up a computational infrastructure to allow for pedagogical innovations while keeping student frustration to a minimum.

The most common hurdle for getting students started with computation is the very first step: installation and configuration. Regardless of how well detailed and documented instructions may be, there will always be some difficulty at this stage due to differences in operating system, software version(s), and configurations among students’ computers. It is entirely possible that an entire class period can be lost to troubleshooting individual students’ laptops. An important goal of this class is to get students to create a data visualization in R within the first ten minutes of the first class. Local installation can be difficult to manage, both for the student and the instructor, and can shift the focus away from data science learning at the beginning of the course.

Access to R is provided via RStudio, an integrated development environment (IDE) that includes a viewable environment, a file browser, data viewer, and a plotting pane, which makes it less intimidating than the bare R shell. Additionally, since it is a fully fledged IDE, it also features integrated help, syntax highlighting, and context-aware tab completion, which are all powerful tools that help flatten the learning curve.

Rather than locally installing R and RStudio, students in this course access RStudio in the cloud via RStudio Cloud (rstudio.cloud), a managed cloud instance of the RStudio IDE. The main reason for this choice is reducing the friction at first exposure to R that we described above.

When you create an account on RStudio Cloud you get a workspace of your own, and the projects you create here are public to RStudio Cloud members. You can also add a new workspace and control its permissions, and the projects you create here can be public or private.
A natural way to set up a course in RStudio Cloud is using a private workspace. In this structure, a classroom maps to a workspace. Once a workspace is set up, instructors can invite students to the workspace via an invite link. Workspaces allow for various permission levels which can be assigned to students, teaching assistants, and instructors. Then, each assignment/project in the course maps to an RStudio Cloud project.

Another major advantage of this setup over local installation of R and RStudio is that workspaces can be configured to always use particular versions of R and RStudio as well as a set of packages (and particular versions of those packages). This means the computing environment for the students can easily be configured by the instructor, and always matches that of the instructor, further reducing frustration that can be caused by instances of the student running the exact same code as the professor but getting errors or different results.

Building on literate programming (Knuth 1984), R Markdown provides an easy-to-use authoring framework for combining statistical computing and written analysis in one computational document that includes the narrative, code, and the output of an analysis (Xie et al. 2018). On the first day of the course, upon accessing the computing infrastructure via RStudio Cloud as described in Section 6.1, students are presented with a fully functional R Markdown document including a brief but not-so-simple data analysis that they can knit to produce an in-depth data visualization. Then, by updating just one parameter in the R Markdown document, they can produce a new report with a new data visualization. This early win is made possible with R Markdown in a way that would be much harder to accomplish by typing code in the console or even with the use of a reproducible R script. We are able to introduce students to R Markdown before any formal R instruction thanks to the very lightweight syntax of the markdown language, and by providing a fully functional document that is guaranteed to knit and display results for each student regardless of their personal computing setup.

Throughout the course students use a single R Markdown document to write, execute, and save code, as well as to generate data analysis reports that can be shared with their peers (for teamwork) or instructors (for assessment). Early on in the course we facilitate this experience by providing them templates that they can use as starting points for their assignments. Throughout the semester this scaffolding is phased out, and the final project assignment comes with a bare-bones template with just some suggested section headings.

The primary benefit of using R Markdown in statistics and data science instruction is outlined in Baumer et al. (2014) as restoring the logical connection between statistical computing and statistical analysis that was broken by the copy-and-paste paradigm.
Use of this tool keeps code, output, and narrative all in one document, and in fact, makes them inseparable.
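One way to get the "change one parameter, get a new report" experience described above is R Markdown's `params` field, sketched below. The course's actual first-day document is not reproduced here, so the file name, the `cyl` parameter, and the plot are hypothetical; the sketch writes a tiny document to a temporary folder and renders it only if rmarkdown and pandoc are available.

```r
# Write a small parameterised R Markdown file to a temporary location.
# The file name, parameter name, and analysis are hypothetical.
rmd_path <- file.path(tempdir(), "first-report.Rmd")
writeLines(c(
  "---",
  "title: \"A first report\"",
  "output: html_document",
  "params:",
  "  cyl: 4",
  "---",
  "",
  "```{r}",
  "cars <- subset(mtcars, cyl == params$cyl)",
  "plot(cars$wt, cars$mpg, xlab = \"Weight\", ylab = \"Miles per gallon\")",
  "```",
  "",
  "There are `r nrow(cars)` cars with `r params$cyl` cylinders."
), rmd_path)

# Knitting with a different parameter value yields a new report without
# touching the analysis code. Rendering needs rmarkdown and pandoc.
if (requireNamespace("rmarkdown", quietly = TRUE) &&
    rmarkdown::pandoc_available()) {
  rmarkdown::render(rmd_path, params = list(cyl = 6),
                    output_file = "report-6-cylinders.html", quiet = TRUE)
}
```

The narrative sentence at the end of the document uses inline code, so both the plot and the prose change together when the parameter does.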

The curriculum makes opinionated choices when it comes to specific programming paradigms introduced to students. Students learn R with the tidyverse, an opinionated collection of R packages designed for data science that share an “underlying design philosophy, grammar, and data structures” (Wickham et al. 2019). The most important reason for this choice is the cohesiveness of the tidyverse packages. The expectation is that learning one package makes it easier to use the others due to these shared principles. Tidyverse code is not necessarily concise, but the course aims to teach students to maximize readability and extensibility of their code instead of minimizing the number of lines to accomplish a task.
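The trade-off between readability and concision can be seen in a small comparison. The sketch below uses the built-in mtcars data; the threshold of 100 horsepower and the object names are arbitrary choices made for illustration.

```r
library(dplyr)

# Tidyverse version: one verb per line, read top to bottom
tidy_summary <- mtcars %>%
  filter(hp > 100) %>%
  group_by(cyl) %>%
  summarise(mean_mpg = mean(mpg), n_cars = n())

# A terser base R equivalent for the group means
powerful <- mtcars[mtcars$hp > 100, ]
base_summary <- tapply(powerful$mpg, powerful$cyl, mean)
```

Both versions compute the same group means. The pipeline is longer, but each step names what it does, and adding another summary column only requires one more argument to summarise().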

One of the learning goals of this course is that how you got to a data analysis result is just as important as the result itself. Another goal is to give students exposure to and experience using software tools for modern data science. Use of literate programming with R Markdown gets us part of the way there, but implicit in the idea of reproducibility is collaboration. The code you produce is documentation of the process and it is critical to share it (even if only with yourself in the future). This is best accomplished with a distributed version control system like Git (Bryan 2018). In addition, Git is a widely used tool in industry for code sharing. According to an industry-wide survey of data scientists conducted by Kaggle, 58.4% of over 6,000 respondents said Git was the main tool used for sharing code in their workplace (Kaggle 2017).

In this class we have adopted a top down approach to teaching Git – students are required to use it for all assignments. Additionally, GitHub is used as the learning management system for distributing and collecting assignments as repositories. Based on best practices outlined in Çetinkaya-Rundel & Rundel (2018), we structure the class as a GitHub organization, and a starter private repository is created per student/team per assignment, and we use the ghclass package for instructor management of student repositories (Rundel et al. 2019).

Students interact with Git via RStudio’s project based Git GUI. We teach a simple centralized Git workflow which only requires the student to know how to perform simple actions like push, pull, add, rm, commit, status, and clone. Focusing on this core functionality helps flatten the learning curve associated with a sophisticated version control tool like Git for students who are new to programming (Fiksel et al. 2019, Beckman et al. 2020).
Early on in the course, we also engineer situations in which students encounter problems while they are in the classroom so that the professor and teaching assistants are present to troubleshoot and walk them through the process in person.

We note that GitHub can also be used as an early diagnostic tool to identify students who may struggle in the course later on. We pulled the data on all commits made by students in the Spring 2018 cohort of the course. The usage of these data was given an exemption from IRB review by the Duke University Campus Institutional Review Board.

Figure 2 displays three plots created with these data. The plot on the left shows the relationship between the number of commits made by each student throughout the entire semester and their final course grade (out of 100 points). The plots in the middle and on the right also display the final course grade on the y-axis, but the number of commits made by each student is calculated at earlier time points in the semester (before the first midterm for the plot in the middle, and before the second midterm for the plot on the right). We can see a positive relationship in each of the plots, levelling off at 100 points (since it is not possible to score higher than 100 points in the course).
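The early-diagnostic idea can be sketched with a few lines of base R. The commit log below is entirely made up, as are the midterm date and the threshold; the course's actual data and code are in the supplemental materials and are not reproduced here.

```r
# Toy data (entirely made up): one row per commit
commits <- data.frame(
  student = c("a", "a", "a", "b", "c", "c", "c", "c", "d", "d"),
  date    = as.Date(c("2018-01-15", "2018-02-01", "2018-03-20",
                      "2018-02-10", "2018-01-20", "2018-01-28",
                      "2018-02-05", "2018-04-02", "2018-03-01",
                      "2018-03-15")),
  stringsAsFactors = FALSE
)

# Hypothetical date of the first midterm
midterm_1 <- as.Date("2018-02-15")

# Count each student's commits before the cutoff; students with no early
# commits still appear, with a count of zero
early  <- commits[commits$date < midterm_1, ]
counts <- table(factor(early$student, levels = unique(commits$student)))

# Flag students below an arbitrary threshold for an early check-in
threshold  <- 2
to_contact <- names(counts)[counts < threshold]
```

Keeping zero-commit students in the table matters here, since they are the ones most likely to need a check-in.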
While number of commits alone should not be considered an indication of course performance, these plots suggest that one can identify students with low numbers of commits as those who will potentially not perform well in the course, reach out to them early on, and offer support and help.

Incorporation of version control and collaboration with Git and GitHub into the introductory data science classroom not only benefits students by teaching them skills desired by potential employers, but it also cuts down on the administrative work required to distribute, grade, and return assignments. The time saved can be spent providing in-depth feedback, working with students, and updating course material.

Figure 2: Relationship between number of commits and final course grade for each student at three time points in the semester.

This course uses five methods of assessment, each designed with the incoming student with no background in statistics or computing in mind. First, we have weekly computing labs which are completed in groups. With these labs, students without any coding background can benefit from the prior coding experience of other students in the group. However, in an effort to make sure that each student, including those with no computing experience, has weekly practice in coding, we also assign individual homework assignments. Finally, because programming plays a central role in the course, we incorporate coding exercises into the midterm exams. In order to accommodate first-time programmers, for whom a timed coding exam may prove to be infeasible, the midterms are set as take-home exams and the students are allowed to use books, notes, and the internet to complete them.

Participation also factors into the final grade of students in the course. Voluntary participation such as answering a question or being called on to answer a question has been shown to cause higher anxiety in large introductory courses than working in groups on in-class exercises (England et al. 2017).
Therefore, instead of relying solely on a potentially subjective measure of voluntary participation, participation scores of students in this class are made up of a check / no check type grade on their team-based in-class application exercises (they get a check if they were in class for the day) as well as their engagement on the online course discussion.

Many of the assignments and assessments in the course are designed to prepare students for the final project, which, in a nutshell, asks students to “Pick a dataset, any dataset, and do something with it.” The actual assignment, of course, goes into a lot more detail than this, but ultimately students are asked to work in teams to pick a new (to them) dataset and an accompanying research question and answer the question using methods and tools they learned in the course. We specifically ask them to not feel pressured to apply everything they learned, but to be selective about which method(s) they use. They are also encouraged to try methods, models, and approaches that go beyond what they learned in the course, and additional support for implementing these is provided during office hours.

There are three main reasons for assigning this team-based final project. First, in a class where students start off with no prerequisite knowledge, it is hugely rewarding for them to see that they can go from zero to full fledged collaborative and reproducible data analysis within the span of a semester, and hopefully this leaves them wanting to learn more. Second, for the most part, teamwork results in a better final product than students would accomplish individually. And lastly, teams are more adventurous than individual students, and are more likely to venture outside of what they learned in the class and learn new tools and methods to complete their projects.

Teams turn in a project proposal, with their data and proposed analysis, roughly one month before the final project is due.
These proposals are reviewed carefully and feedback is provided to the students. Teams can choose to revise their proposals based on the feedback, and thereby increase their score on the proposal stage of the project. The final deliverables of the project are a 10-minute presentation during the scheduled final exam time and a write-up that goes into further depth than the presentation can in the allotted time. The final write-up is an R Markdown file, but unlike the earlier assignments, code chunks are turned off so that only the prose and the output/plots are visible to the reader. This encourages students to pay attention to wording, grammar, and most importantly flow since their narrative isn’t interrupted with large chunks of code.
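Hiding code in a knitted write-up is typically done with a global chunk option. The line below is a minimal sketch of one common way to do it with knitr; whether the course template uses exactly these options is an assumption.

```r
# Placed in the setup chunk of the write-up: hide code and console noise,
# while output and plots still appear in the knitted report
knitr::opts_chunk$set(echo = FALSE, message = FALSE, warning = FALSE)
```

Because the option is set once for the whole document, the code remains in the R Markdown file for graders and teammates while the knitted report reads as continuous prose.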

Discussion

The impact of this course at Duke University has been profound. The increasing number of students who continue their studies in statistics after this course helped provide impetus to update and modernize the computational aspects of the second statistics course in regression. For example, the regression course now also uses the tidyverse syntax, and students complete assignments using R Markdown, use version control with Git, and collaborate and submit assignments on GitHub. Additionally, the course has served as a way to start building bridges between the introductory statistical science and computer science curricula, accelerating the formation of an interdepartmental major in data science, where students are provided an option to build a full undergraduate curriculum in data science by mixing and matching from a list of prescribed courses from the two departments.

In addition to students wanting to pursue a degree in statistics and/or data science, this course also serves a large number of students from the social and natural sciences as well as the humanities. The course now satisfies the introductory statistics requirement of many majors (e.g., political science, public policy, economics), and hence we expect to see trickle down effects of starting with data science within the statistical and computational learning goals of these disciplines as well.

As Baumer (2015) put it so well, “[i]f data science represents the new reality for data analysis, then there is a real risk to the field of statistics if we fail to embrace it.” Statistics departments are at a huge advantage for offering courses that can prepare students to embrace and extract meaning from modern data: we have faculty proficient in statistical inference, modeling, and computing. Traditionally these three pillars of statistics came together in higher level courses, but we believe that it’s time to flip things around.
Offering an introductory course like the one described in this article can introduce students to data science early on, as early as their first semester in college since the course has no prerequisites. This will not only help drum up interest in the topic (and hence in statistics) but also provide a pathway for students to start interacting meaningfully with data and developing their computational skills while concurrently taking mathematical prerequisites needed for a statistics major, such as calculus, linear algebra, etc.

It has been ten years since Nolan & Temple Lang (2010) suggested that “[i]t is our responsibility, as statistics educators, to ensure our students have the computational understanding, skills, and confidence needed to actively and wholeheartedly participate in the computational arena.”

Introduction to Data Science and Statistical Thinking is designed to address this goal early on, and to introduce students to statistical thinking through computing with data. While this course alone is not sufficient to equip students with all of the computing skills Nolan & Temple Lang (2010) outline, it serves as a solid foundation to build on.

One of the biggest challenges in designing this course has been deciding which topics to include, especially in the second unit on making rigorous conclusions. Some topics that are commonly covered in introductory statistics courses are intentionally left out in order to make room for increased emphasis on computing and computational workflows. For example, this course places less emphasis on null hypothesis significance testing and the Central Limit Theorem compared to a traditional introductory statistics course. While we touch on p-values as one way of making decisions based on statistical information, we don’t demonstrate how to calculate them in various settings. Similarly, the Central Limit Theorem is only referenced in relation to some of the common characteristics of bootstrap distributions. So far, we only have anecdotal evidence, from students who took a course on regression after completing the introductory data science course, about their experience in the regression course. The evidence suggests that they have sufficient statistical background to succeed in the regression course and do not appear to be less prepared than their peers who completed a traditional introductory statistics course. Future research could help inform the downstream effects of introduction to the discipline of statistics via this course and how student learning outcomes in the statistics major compare to other starting points.

In designing the course we had one more ambition: to make all course materials openly licensed and freely available to the statistics and data science instructor community.
All course content (lecture slides, homework assignments, computing labs, application exercises, and sample exams) as well as materials on pedagogy and infrastructure setup to help instructors who want to teach this curriculum can be found at datasciencebox.org.

Beyond the challenges that come with designing any new course, there are a few aspects of this course that we believe might present challenges for instructors who want to adopt this course. First, while the foundational skills in data science are well established, the technical and implementation details, such as which R package to use, can be a moving target. Staying current with these active developments is rewarding, but can be time consuming.

Second, teaching this curriculum involves engaging with technical logistics that may be outside of the comfort zone of many instructors. Much of this is addressed by professionally managed, web-based services (e.g., RStudio Cloud) as well as tooling developed specifically to help manage course logistics (e.g., the ghclass package). A willingness to tackle unexpected technical difficulties (e.g., a student getting stuck on an undecipherable Git error) using a combination of Googling and copying and pasting from Stack Overflow will help. One can view this as an opportunity as well – live debugging sessions where an instructor models how they search for answers on the web can be valuable learning experiences for students.

Finally, the topics presented in this course are substantially different than those in a traditional introductory statistics or introductory probability course. This course provides less exposure to mathematical statistics topics (e.g., the Central Limit Theorem, distributions, probability) in favour of computational data analysis skills. As such, it is important that the second course in a program is updated to accommodate students coming in with different backgrounds, which will require buy in from departmental faculty.
We strongly believe that statistics and data science programs that leverage and reinforce these skills throughout the rest of the curriculum will ultimately produce stronger graduates.

Supplemental materials for the article, including details on the data collection process and the R code for reproducing the figures found in the paper, can be found on GitHub at github.com/mine-cetinkaya-rundel/fresh-ds.

Acknowledgements

We thank the editors, the associate editors, and anonymous reviewers for their helpful comments and suggestions. We would also like to thank Ben Baumer from Smith College, John Christopher Duchi from Stanford University, David Wagner from University of California Berkeley, and Damon Wischik from University of Cambridge for providing information on their introductory data science courses.

References

Baumer, B. (2015), ‘A data science course for undergraduates: Thinking with data’, The American Statistician (4), 334–342. URL: https://doi.org/10.1080/00031305.2015.1081105

Baumer, B., Çetinkaya-Rundel, M., Bray, A., Loi, L. & Horton, N. J. (2014), ‘R Markdown: Integrating a reproducible analysis tool into introductory statistics’, Technology Innovations in Statistics Education.

Becker, R. A., R, A. R. W., Brownrigg, R., Minka, T. P. & Deckmyn, A. (2018), maps: Draw Geographical Maps. R package version 3.3.0. URL: https://CRAN.R-project.org/package=maps

Beckman, M. D., Çetinkaya-Rundel, M., Horton, N. J., Rundel, C. W., Sullivan, A. J. & Tackett, M. (2020), ‘Implementing version control with Git as a learning objective in statistics courses’, arXiv preprint arXiv:2001.01988.

Brunner, R. J. & Kim, E. J. (2016), ‘Teaching data science’, Procedia Computer Science, 1947–1956.

Bryan, J. (2018), ‘Excuse me, do you have a moment to talk about version control?’, The American Statistician (1), 20–27.

Carver, R., Everson, M., Gabrosek, J., Horton, N., Lock, R., Mocko, M., Rossman, A., Roswell, G. H., Velleman, P., Witmer, J. et al. (2016), ‘Guidelines for assessment and instruction in statistics education (GAISE) college report 2016’. URL:

Çetinkaya-Rundel, M. (2018), ‘Let them eat cake first!’. URL:

Çetinkaya-Rundel, M. & Rundel, C. (2018), ‘Infrastructure and tools for teaching computing throughout the statistical curriculum’, The American Statistician (1), 58–65. URL: https://doi.org/10.1080/00031305.2017.1397549

Çetinkaya-Rundel, M. & Tackett, M. (2020), ‘From drab to fab: Teaching visualization via incremental improvements’, CHANCE (2), 31–41.

Chang, W., Cheng, J., Allaire, J., Xie, Y. & McPherson, J. (2020), shiny: Web Application Framework for R. R package version 1.5.0. URL: https://CRAN.R-project.org/package=shiny

Danyluk, A., Leidig, P., Cassel, L. & Servin, C. (2019), ACM task force on data science education: Draft report and opportunity for feedback, in ‘Proceedings of the 50th ACM Technical Symposium on Computer Science Education’. URL: http://dstf.acm.org/DSReportInitialFull.pdf

De Veaux, R. D., Agarwal, M., Averett, M., Baumer, B. S., Bray, A., Bressoud, T. C., Bryant, L., Cheng, L. Z., Francis, A., Gould, R. et al. (2017), ‘Curriculum guidelines for undergraduate programs in data science’, Annual Review of Statistics and Its Application, 15–30.

Dichev, C. & Dicheva, D. (2017), Towards data science literacy, in ‘ICCS’, pp. 2151–2160.

Dichev, C., Dicheva, D., Cassel, L., Goelman, D. & Posner, M. (2016), Preparing all students for the data-driven world, in ‘Proceedings of the Symposium on Computing at Minority Institutions, ADMI’, Vol. 346.

England, B. J., Brigati, J. R. & Schussler, E. E. (2017), ‘Student anxiety in introductory biology classrooms: Perceptions about active learning and persistence in the major’, PloS one (8), e0182506.

Fiksel, J., Jager, L. R., Hardin, J. S. & Taub, M. A. (2019), ‘Using GitHub classroom to teach statistics’, Journal of Statistics Education (2), 110–119.

Hardin, J., Hoerl, R., Horton, N. J., Nolan, D., Baumer, B., Hall-Holt, O., Murrell, P., Peng, R., Roback, P., Temple Lang, D. & Ward, M. D. (2015), ‘Data science in statistics curricula: Preparing students to “think with data”’, The American Statistician (4), 343–353. URL: https://doi.org/10.1080/00031305.2015.1077729

Henry, L. & Wickham, H. (2020), purrr: Functional Programming Tools. R package version 0.3.4. URL: https://CRAN.R-project.org/package=purrr

Hicks, S. C. & Irizarry, R. A. (2018), ‘A guide to teaching data science’, The American Statistician (4), 382–391.

Iannone, R., Allaire, J. & Borges, B. (2020), flexdashboard: R Markdown Format for Flexible Dashboards. R package version 0.5.2. URL: https://CRAN.R-project.org/package=flexdashboard

Kaggle (2017), ‘Kaggle machine learning & data science survey 2017’. URL:

Knuth, D. E. (1984), ‘Literate programming’, The Computer Journal (2), 97–111.

Kuhn, M. & Wickham, H. (2020), tidymodels: Easily Install and Load the ’Tidymodels’ Packages. R package version 0.1.0. URL: https://CRAN.R-project.org/package=tidymodels

Michaelsen, L. K. & Sweet, M. (2011), ‘Team-based learning’, New directions for teaching and learning (128), 41–51.

NASEM (2018), Data Science for Undergraduates, Opportunities and Options, The National Academies Press.

Nolan, D. & Temple Lang, D. (2010), ‘Computing in the statistics curricula’, The American Statistician (2), 97–107. URL: https://doi.org/10.1198/tast.2010.09132

NSF (2014), ‘Data science at NSF draft report of StatSNSF committee: Revisions since January MPSAC meeting’.

URL:

Parry, J. & Barr, N. (2020), genius: Easily Access Song Lyrics from Genius.com. R package version 2.2.2.

URL: https://CRAN.R-project.org/package=genius

QS World University Rankings (2017), ‘QS world university rankings for statistics and operational research 2017’.

URL:

R Core Team (2020), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria.

URL:

Robinson, D. (2017), unvotes: United Nations General Assembly Voting Data. R package version 0.2.0.

URL: https://CRAN.R-project.org/package=unvotes

Robinson, D. (2019), gutenbergr: Download and Process Public Domain Works from Project Gutenberg. R package version 0.1.5.

URL: https://CRAN.R-project.org/package=gutenbergr

Robinson, D. & Hayes, A. (2020), broom: Convert Statistical Analysis Objects into Tidy Tibbles. R package version 0.5.6.

URL: https://CRAN.R-project.org/package=broom

Rundel, C., Çetinkaya-Rundel, M. & Anders, T. (2019), ghclass: Tools for managing classes with GitHub. R package version 0.1.0.

URL: https://rundel.github.io/ghclass

Sahami, M., Danyluk, A., Fincher, S., Fisher, K., Grossman, D., Hawthorne, E., Katz, R., LeBlanc, R., Reed, D., Roach, S. et al. (2013), ‘Computer science curricula 2013: Curriculum guidelines for undergraduate degree programs in computer science’, Association for Computing Machinery (ACM)-IEEE Computer Society.

Silge, J. & Robinson, D. (2016), ‘tidytext: Text mining and analysis using tidy data principles in R’, JOSS (3).

URL: http://dx.doi.org/10.21105/joss.00037

U.S. News & World Report (2018), ‘Best statistics graduate programs - top science schools’.

URL:

Wang, X., Rush, C. & Horton, N. J. (2017), ‘Data visualization on day one: Bringing big ideas into intro stats early and often’, Technology Innovations in Statistics Education (1).

Wickham, H. (2019), rvest: Easily Harvest (Scrape) Web Pages. R package version 0.3.5.

URL: https://CRAN.R-project.org/package=rvest

Wickham, H., Averick, M., Bryan, J., Chang, W., McGowan, L. D., François, R., Grolemund, G., Hayes, A., Henry, L., Hester, J. et al. (2019), ‘Welcome to the tidyverse’, Journal of Open Source Software (43), 1686.

Wickham, H. & Grolemund, G. (2016), R for data science: import, tidy, transform, visualize, and model data, O’Reilly Media, Inc.

URL: https://r4ds.had.co.nz

Wilson, S. G. (2013), ‘The flipped class: A method to address the challenges of an undergraduate statistics course’,