A Framework for Infusing Authentic Data Experiences Within Statistics Courses
Scott D. Grimshaw∗
Department of Statistics, Brigham Young University
August 3, 2015
Abstract
Working with complex data is one of the important updates in the 2014 ASA Curriculum Guidelines for Undergraduate Programs in Statistical Science. Infusing 'authentic data experiences' within courses allows students opportunities to learn and practice data skills as they prepare a dataset for analysis. While more modest in scope than a senior-level culminating experience, authentic data experiences provide an opportunity to demonstrate connections between data skills and statistical skills. The result is more practice of data skills for undergraduate statisticians.
Keywords: assessment, undergraduate statistics education, curriculum guidelines

∗The author gratefully acknowledges Nicholas J. Horton for his leadership of the working group that developed the 2014 Guidelines, Natalie J. Blades, Chris Dixon, students in the Fall 2014 BYU Stat 330 course, Emily Juchau, and the Editor, Guest Editor, and Referees whose comments have greatly improved this paper.

Introduction
The 2014 ASA Curriculum Guidelines for Undergraduate Programs in Statistical Science (American Statistical Association (2014), hereafter referred to as "2014 Guidelines") define the skills needed for statistics majors. Commonly, statistics programs are designed to include courses in each skill area, followed by a senior-level capstone, internship, or research experience. The 'culminating experience' is important for students because statistics contains many connections between statistical application, statistical theory, data manipulation, computation, mathematics, and communication.

Excellent capstone courses for statistics majors have been shared in the literature (see Lazar et al. (2011) and Zhu et al. (2013)), but the growth in undergraduate statistics programs indicates that capstones need to be scaled up from small courses with one-on-one interactions with a professional statistician to courses with enrollments of 50 to 100 students each year, without losing the characteristics that make a capstone experience valuable. Even in programs with strong capstone experiences, the 2014 Guidelines point out that students should have a scaffolded exposure to topics and connections throughout the academic program, rather than relying on a single senior-level course to tie everything together.

Connections between methods, theory, data, and mathematics are easily lost since undergraduate programs offer multi-course sequences in these topics over as many as four years. Integration also suffers when faculty teach only one course in the curriculum, which is often in the area of their greatest interest. Some students fail to develop connections between topics. An important aspect of program assessment is evaluating the presence and frequency of information silos.

The integration of some topics in the undergraduate program is common. For example, most programs integrate mathematics foundations with statistical theory and integrate real-world data with statistical methods.
Other topics are harder to integrate in the curriculum but have been successfully demonstrated in the statistical education literature. For example, Horton (2013) tackled integrating statistical theory and computation by pointing out the pedagogical benefits and providing example projects.

An authentic data experience provides data sources and data instructions as part of an analysis. The goal is for students to increase their opportunities to manipulate and restructure data that are provided in different formats. Creating real data applications is a natural part of course preparation, but teachers often perform the data collection and management tasks and then provide students with a curated data file. For example, one of the strengths of Sheather (2009) as a regression textbook is the many applications which required significant data acquisition and manipulation work by the author, but all the datasets are presented to students as clean, well-organized files, easily read from the book website. Using the vocabulary of Wickham (2014), teachers hide the 'messy data' aspects and provide 'tidy data,' even when students possess the data skills required to work with the messy data. It is valuable for students not only to have many authentic data experiences but also to have the professor model the correct application of statistics by showing work with messy data in lectures.

While authentic data experiences in capstone courses and DataFest (see Çetinkaya-Rundel and Stangl (2013) and Gould and Çetinkaya-Rundel (2014)) provide opportunities to work with messy data, more can and should be done in other courses to allow students to practice problem-solving using data. Authentic data experiences should supplement, not substitute for, creating and offering courses to satisfy the data manipulation and computation skills in the 2014 Guidelines and the incorporation of data science courses into the curriculum.

What is good for statistics majors can also be applied to introductory courses.
There may be no data skills on the learning outcomes for introductory courses, but Horton et al. (2015) advocate for complex and interesting data in such courses. Some examples and homework in an introductory course may be modified and/or updated to use the original source data instead of a curated dataset. The data skills required would certainly be modest and need to fit student backgrounds. The objective would be for students to see that data skills are required in an analysis. Students may rely on code provided to them that results in their own copy of the dataset.

The paper proceeds with proposed metrics to evaluate authentic data experiences for lectures, homework, and exams in Section 2. Section 3 describes three examples of authentic data experiences used in teaching a regression course for statistics majors. The paper closes with an assessment of the infusion of authentic data experiences in a BYU Fall 2014 class followed by a summary and discussion.
Metrics to Evaluate Authentic Data Experiences
Statistics education is rich with applications that satisfy the definition of 'real data' in the GAISE college report (GAISE College Group (2005)) and are available in textbooks, journals (e.g., JSE Data Sets and Stories), and internet repositories (e.g., StatLib, Data and Story Library, R libraries). A common expectation with data sets is the 'story,' shared with students, that motivates the research questions that led to data collection. Student learning is enhanced if the research story also has a data story.

Authentic data experiences specifically include details and instructions for any tasks requiring two skills mentioned in the 2014 Guidelines: (1) the "ability to manage and restructure data" and (2) "data manipulation using software in a well-documented and reproducible way, data processing in different formats, and methods for addressing missing data."

In order to distinguish between the breadth and depth of skills required for an authentic data experience, two dimensions for breadth are proposed: "Data from Different Sources and Formats" and "Data Manipulation." Within these dimensions a 'good/better/best' classification is suggested as a rubric to distinguish depth of skills, and Table 1 defines tasks required in the data step of the analysis. Another way to think of the classification is the time expected for students to create the dataset.

To demonstrate the classification, consider the climate science example of Witt (2013) to teach simple linear regression. One could use the curated data provided as supplemental material to that article, and doing so would be 'good' on both dimensions since that is a clean, well-organized, easily read file with a header row of interpretable variable names. As an alternative, students could be directed to the different webpages containing the original data, which are space-delimited files with interpretable variable names, and instructed to merge the datasets by year to form the dataset for analysis.
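A sketch of that alternative data step in R follows, assuming the two source files have been saved locally; the file names, the number of description lines to skip, the missing-value code, and the column names are all assumptions for illustration, not part of the original example.

```r
# Read the two space-delimited source files (names and layouts assumed)
temp <- read.table("temperature.txt", skip = 4, header = FALSE,
                   na.strings = "*",
                   col.names = c("year", "temp", "temp.smooth"))
co2 <- read.table("co2.txt", comment.char = "#", header = FALSE,
                  col.names = c("year", "co2", "unc"))

# Merge the datasets by year to form the dataset for analysis
climate <- merge(temp, co2, by = "year")

# Simple linear regression of temperature on CO2
fit <- lm(temp ~ co2, data = climate)
```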
A small change to include data skill tasks enriches the teaching experience to 'best' on both dimensions because data are read from multiple sources and merged.

In order to classify or assess the data skills required for a particular authentic data experience, two important topics must be shared: Data Details and Teaching Notes.

Table 1: Breadth and Depth of Data Skills
Data from Different Sources and Formats
Good: Read a space-, tab-, or comma-delimited file with or without a first header row of variable names
Better: Read data from an HTML table or other complex format
Best: Read data from multiple sources or multiple HTML tables

Data Manipulation
Good: Conventions for interpretable variable names
Better: Compute additional variables and/or subset the data
Best: Merge or combine data from multiple sources to create dataset for analysis and/or identify missing and unreliable observations and/or variables

The Data Details section should include the data source, characteristics of the data, and the instructions to construct the dataset for analysis. The Teaching Notes should address how classes with different data skill backgrounds or prerequisites may need instructions or code elements in order to perform the required tasks. To demonstrate on the climate science example of Witt (2013), the Data Details would include that the source of the Annual Temperature Data is http://data.giss.nasa.gov/gistemp/graphs_v3/Fig.A2.txt and point out the following features when accessed 31 March 2015:

• space-delimited text file
• column header on line 3, data begins on line 5
• the current year is the last row but is coded as missing with *
• an attempt to read the file directly from the source in R with read.table results in an error:

Error in file(file, "rt") : cannot open the connection
In addition: Warning message:
In file(file, "rt") : cannot open: HTTP status was '403 Forbidden'
The source for the Annual CO2 Data is ftp://aftp.cmdl.noaa.gov/products/trends/co2/co2_annmean_ with the following features when accessed 31 March 2015:

• space-delimited text file
• can be read directly from the source
• 50 lines of text describing the data before the data begin (each of these lines starts with the same comment character)
• the column header line also begins with that character

One example of the Teaching Notes is using the climate science example in lecture of a regression course for statistics majors as an active learning activity. After providing the Data Story from Witt (2013) and the Data Details section, I ask students to graph and estimate a model for the relationship between CO2 and temperature. I have them discover the data complications instead of pointing them out, but one could point them out to be more time efficient. Students seem to quickly understand what they want to do, but common questions involve syntax for reading from a webpage, creating headers, and skipping lines.

Some of the students perform the analysis in SAS; most of these students use datalines, where they copy and paste from the two source webpages. The data step merge is basic, and students who haven't learned that topic follow the syntax when it is provided. Some of the students perform the analysis in R, where the CO2 data can be read from the source with read.table, read_table from the readr library, or fread from the data.table library, but the same is not true for the temperature data because of the error message. The syntax for creating descriptive variable names should be discussed, and some students are not familiar with the syntax for skipping lines in the file. Because the last row of the temperature data is missing and identified as *, the R default is to make the column a factor. There are no complications to using the merge function. Some students will combine and merge the data using Excel. A comparison of student approaches in SAS, R, and Excel provides an opportunity to talk about reproducibility and documenting code.

While not part of the metrics to evaluate authentic data experiences, one feature of articulating the required data skills is that research questions can be addressed using currently available data.
For example, Albert (2010) provides curated seasonal batting data from 1871-2009. With the passage of time, the examples of Derek Jeter and Alex Rodriguez as current players and batting trends up to 2009 are less timely to students interested in sports, and what was once a current research question has become stagnant. However, Albert (2010) provides sufficient detail about the collection of files in the Lahman Baseball Database and the tasks required to filter and merge the master and batting files that the research questions and teaching notes can be updated to reflect current players and trends. The JSE
Data Contributors Guidelines consider non-static datasets when the teaching notes include annotations and commentary in the form of reproducible code, such as Albert (2010), to make the dataset applicable to a large audience.

Table 2: Authentic Data Experiences in BYU Stat 330 Fall 2014

                          Different Sources & Formats
                          Good    Better    Best
Manipulation   Good         38       1        0
               Better        2       9        0
               Best          0       2        3
Fall 2014 was a new preparation for a regression course for statistics majors and provided the opportunity to incorporate authentic data experiences into lectures, homework, and exams. The course is a traditional undergraduate treatment of linear regression, logistic regression, and time series that also satisfies the Society of Actuaries Applied Statistics VEE (Validation by Educational Experience). The course prerequisite is a 'second course in statistics' based on designed experiments, and the required textbook for Fall 2014 was Sheather (2009), but Weisberg (2013) is also at the appropriate level.

Table 2 summarizes the 55 different real data applications used in lecture examples, homework assignments, and exams, assessed by the metrics proposed in Section 2. One stand-out table entry shows that most (38/55) of the applications are in files that are comma- and space-delimited, either with or without a header row. Further investigation reveals that 15 of these 38 applications were in lectures, where using applications from the textbook reinforced reading assignments, and 13 were in exams or practice exams, where the exam questions were on the regression material. Most of the applications corresponding to 'better' and 'best' assessments are homework assignments, but it is important to teach and demonstrate the expected data acquisition and manipulation skills during lectures.

One of the challenges of writing authentic data experiences is identifying applications with some, but not significant, data acquisition and manipulation tasks and providing clear instructions for students in the Data Details section regarding what tasks they need to complete to prepare the dataset for the analysis questions on the assignment. It is important to remember that practicing data-related skills is secondary in most courses, and a rule of thumb is that students should spend 15 to 20 minutes preparing the dataset for analysis.
While this restriction would not be needed if the applications were used in a data course, the Teaching Notes often include code chunks for tasks the students may not know how to perform or which would take them longer than 15 to 20 minutes to learn, write, and debug. Providing code chunks is one response to the challenge in Cobb (2015) to flatten the prerequisites. It may appear that the three applications rated 'best' in both dimensions would be ideal, but in fact two resulted in students complaining about excessive time to create the dataset: one example required downloading 12 files from queries to the US NOAA website, with students reporting it took 40 minutes to create the dataset, and the other example required deeper knowledge of NBA rosters than most students found valuable to research. Only the climate science application from Witt (2013) fell within the 15 to 20 minutes to prepare the dataset.

Three examples are provided to demonstrate authentic data experiences for homework assignments and provide a template for creating future authentic data experiences. The Data Story and Data Details sections and Teaching Notes are provided for each example, as well as the classification of the breadth and depth of data skills. While all three examples are from one course for statistics majors, the intention is to demonstrate authentic data experiences throughout the statistics curriculum.
Difference Between MLB Leagues and Divisions
Different Sources and Formats: Best (read data from multiple HTML tables, create additional variables from table layout)
Manipulation: Better (filtering rows and columns from the dataset)
Data Story:
MLB organizes the 30 teams into two leagues (American League, National League) with three divisions based on geography (East, Central, West) for each league. Unlike other professional leagues, MLB has different rules for each league, which has led to long and passionate arguments between fans of the two leagues (Google 'Designated Hitter' for more information). Each year there are also arguments that particular divisions are stronger or weaker in terms of what it takes to win. Consider a model where the response variable is the number of wins in a season (wins), and the explanatory variables are:

• Factor league with levels AL for American League and NL for National League
• Factor division with levels East, Central, West
• run.diff, the run differential for the season (difference between runs scored and runs allowed)

Data Details:
The number of wins and the run differential for each team are reported on MLB Standings pages on many sports websites, usually as HTML tables. The league and division for each team are created from the structure of the MLB Standings webpage. What follows applies when the data source is ESPN.com, accessed 31 March 2015:

• The webpage http://espn.go.com/mlb/standings provides the current standings. During the season the standings change daily. During spring training the standings are organized by Cactus and Grapefruit League (teams having spring training in Arizona and Florida, respectively) and not the factor levels described above. The web address for a previous season's standings is found by changing the "Season:" variable in the table. For example, the 2014 MLB Regular Season is at http://espn.go.com/mlb/standings/_/season/2014
• The standings are in two HTML tables on the same webpage
• The standings have both numeric and character variable types
• The standings contain rows and columns that should be ignored
• Create league and division variables from the structure of the standings

Summary of Data Skills:

• Reading data from an HTML table
• Reading numeric and character variables
• Filtering rows
• Filtering columns
• Creating additional variables from table layout
Teaching Notes: Multiple Regression with Qualitative and Quantitative Explanatory Variables in a Course for Statistics Majors
My objective is for students to write reproducible code in R to create a dataset from two HTML tables. The example is assigned in homework. The following notes are provided to help the students write the code:

• Use the function readHTMLTable from the XML R library.
• Use the which=1 and which=2 declarations to read the 1st table (American League) and 2nd table (National League) from the webpage.
• Use the colClasses argument to specify the 'type' of each column as "numeric" for numerical data columns and "character" for character data columns.
• Use rbind to combine into a single dataset.
• Use header=FALSE and specify your own column names.
• Use skip.rows to only read the 30 teams (ignoring all other rows).
• Use the organization of the webpage table to create league and division.

Solution (not provided to the students): each league's table is read with a call of the form

readHTMLTable(url, which=1, header=FALSE,
    colClasses=c("character",rep("numeric",11)))

The first question in the homework is "write a webscraper." The full homework assignment is at http://grimshawville.byu.edu/hwMLBfromHTML.pdf. The estimated time for students to create the dataset is approximately 20 minutes.
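The remaining assembly can be sketched as follows, assuming the two league tables have been read into data frames al and nl and that each table lists its five-team divisions in the order East, Central, West; the names and row layout here are assumptions for illustration, not the original solution.

```r
# Attach league labels (assumed data frame names)
al$league <- "AL"
nl$league <- "NL"

# Five teams per division, divisions assumed stacked East, Central, West
al$division <- rep(c("East", "Central", "West"), each = 5)
nl$division <- rep(c("East", "Central", "West"), each = 5)

# Combine into a single standings dataset
standings <- rbind(al, nl)
```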
Forecasting Movie Box Office Revenue
Different Sources and Formats: Better (read data from an HTML table)
Manipulation: Good (change variable name to something less confusing)
Data Story:
In the early days of the movie industry, customers could only see a movie during a finite time period in a theater. How that has changed! Today movies and other entertainment content are still available in theaters but are also available to customers for longer and more flexible time periods in digital and disk formats. From a business perspective, each of the different channels (theaters, Netflix, Amazon, YouTube, iTunes, TV broadcasters, Redbox) seeks to 'monetize' the content they own or license. The traditional entertainment revenue is 'theater box office,' defined as the amount paid by customers to watch a movie shown in a theater. While movies have multiple revenue streams (for example, international box office, DVD sales, digital rights or downloads), the theater box office usually drives the later revenue streams, and for some independent movies (like Napoleon Dynamite) the appearance in theaters is a mark of success. Focusing on the economic outlook of theater box office, consider forecasting the next five years. The response variable is the annual gross box office, defined as the total revenue from all movies seen by customers in theaters in a given calendar year.
Data Details:
Box Office Mojo is a website reporting on many of the business aspects of the movie industry. One of its webpages tables the gross box office for all movies in theaters in a calendar year (in $ million).

• The tabs on the webpage are actually the first HTML table, so the data of interest is the second HTML table
• The table contains 10 columns with different formats: numeric, accounting (leading $, commas at thousands), percentage (trailing %), character
• As is common on webpages, the most current data is at the top of the page. Time series data usually begins at t = 1 (oldest data) and runs to t = T (newest data).
• The first row is the current year-to-date gross box office. The other values are for 365 calendar days.

Summary of Data Skills:

• Reading data from an HTML table
• Reading variables with multiple formats
• Change the order from "most recent data first" to time series order
• Remove the current year since the value is YTD
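The last two manipulation steps can be sketched in R, assuming the table has already been read into a data frame with the newest year in the first row; the name boxoffice is hypothetical.

```r
# Drop the first row, the current year's year-to-date value
boxoffice <- boxoffice[-1, ]

# Reverse the rows so the oldest year comes first,
# matching the time series convention
boxoffice <- boxoffice[nrow(boxoffice):1, ]
```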
Teaching Notes: Time Series ARIMA Model and Forecasts in a Course for Statistics Majors
My objective is for students to write reproducible code in R to create a dataset from the HTML table. The example is assigned in homework. The following notes are provided to help the students write the code:

• Use the function readHTMLTable from the XML R library.
• Use the which=2 declaration to read the second HTML table from the webpage. The first HTML table corresponds to the tabs on the webpage.
• Use the colClasses argument to specify the 'type' of each column as "numeric" for numerical data columns, "character" for character data columns, and "FormattedNumber" for numerical data columns with ',' separating thousands. There is no predefined R format for $ dd,ddd.dd, but a custom "AccountingNumber" 'type' can be defined (via setClass and setAs) before the readHTMLTable code and then used in the colClasses specification.
• Use header=TRUE but then change the variable name for gross box office ($ million) from TotalGross* to Gross since the '*' is confusing.
• Remove the row for the current year since it is YTD.
• Reorder the data to comply with the time series convention of oldest data (first row) to current data (last row).
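One way to define such a type, sketched here rather than taken from the original assignment, uses an S4 class with a coercion method that strips the '$' and ',' characters before converting to numeric:

```r
# Define an AccountingNumber 'type' for columns formatted as $ dd,ddd.dd
setClass("AccountingNumber", contains = "numeric")
setAs("character", "AccountingNumber",
      function(from) as.numeric(gsub("[$,]", "", from)))
```

After these definitions are run, "AccountingNumber" can appear in the colClasses vector passed to readHTMLTable.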
Solution (not provided to the students): the table is read with a call of the form

readHTMLTable(url, which=2, header=TRUE,
    colClasses=c("numeric","AccountingNumber","Percent","FormattedNumber",
        "Percent","numeric","FormattedNumber","AccountingNumber",
        "AccountingNumber","character"))

The first question is "write a webscraper." The full homework assignment uses the astsa R library to estimate and forecast an ARIMA(1,1,1) model and is at http://grimshawville.byu.edu/hwTimeS. The estimated time for students to create the dataset is approximately 10 minutes.
Effect of Age and Race on Having Health Insurance
Different Sources and Formats: Good or Better, depending on students' background with SAS Transport files
Manipulation: Better (filtering rows, creating interpretable variable names), but could be Best if the SAS code to merge the files is not provided
Data Story:
The US CDC performs a large survey of interviews and physical examinations that assess the health and nutritional status of adults and children in the US. The most recently completed, publicly available data is the NHANES 2011-2012 Survey. You have been asked to investigate the relationship between age and race on whether or not individuals have health insurance. Consider a model where the response variable is whether or not an individual has insurance (insured) and the explanatory variables are:

• age, the age of an individual (in years)
• Factor race with levels Mexican American, Other Hispanic, Non-Hispanic White, Non-Hispanic Black, Non-Hispanic Asian, and Other Race - Including Multi-Racial

Data Details:
As is typical for the analysis of NHANES data, merging two SAS datasets is required. For NHANES 2011-12 the downloaded datasets are SAS Transport files. The respondent's age is in the Demographics Data file and the insurance coverage is in one of the Questionnaire Data files; both must be downloaded locally.

• SAS Transport files must be downloaded and saved locally
• The merge should only keep a subset of the variables and only keep observations in the HIQ_G.XPT dataset
• Filter out observations of those who responded REFUSED or DON'T KNOW to the health insurance question
• Change the NHANES variable names to something more descriptive
• The NHANES Tutorials provide examples of the SAS code

Summary of Data Skills:

• Download publicly available SAS Transport files
• Merge two SAS Transport files
• Create interpretable variable names
• Filter rows
Teaching Notes: Logistic Regression with Qualitative and Quantitative Explanatory Variables in a Course for Statistics Majors
My objective is for students to download data from NHANES and write SAS code to create a dataset. The downloaded files are SAS Transport files, and so it is easiest to use SAS to create the dataset. The example is assigned in homework. The following notes and code are provided to help students who haven't taken a SAS class or don't know about SAS Transport files:

• Code to merge the two SAS Transport files and subset to only those in NHANES that participated in the insurance questionnaire
• Filter out observations of those who responded REFUSED or DON'T KNOW to the health insurance question
• Change the NHANES variable names to something more descriptive
Solution appended to the code provided above (not given to the students):

data nhanes;
  set nhanes;
  * subset to those that responded and define insured=(yes/no);
  if hiq011 in (1,2);
  * 'rename' variables so they are easier to remember;
  if hiq011=1 then insured="yes";
  else insured="no";
  age=ridageyr;
  race=ridreth3;
  drop ridageyr ridreth3 hiq011;
run;
The first question is "download the appropriate files from NHANES and create the SAS dataset nhanes." The full homework assignment is at http://grimshawville.byu.edu/hwNHANESinSAS.pdf.

The estimated time for students to create the dataset is approximately 10 minutes with the code provided, but it would be more if students were asked to write that SAS code.
One of the advantages of using authentic data experiences in a course is an improvement in the quality of course projects. Students feel empowered to ask a question and then find the data, instead of identifying a curated dataset from a repository or another book and performing an analysis. In Fall 2014 the 45 students in Stat 330 were required to complete a course project, and most students worked in pairs (two groups had four members). Of the 19 projects, 14 used data from multiple sources (three sources was most common) that required merging, and 16 included additional data manipulation tasks of subsetting and computing additional variables.

Another assessment comes from the Fall 2014 student ratings for the course. The mean response to 'Materials and Activities Effective' was 1.0 higher than the department mean (eight-point scale). Since the authentic data experiences were part of every assignment, the difference indicates that students found them effective. While not specifically addressing data skills, the mean response to 'Intellectual Skills Developed' was 0.6 higher than the department mean (eight-point scale) and indicates that, at the very least, the integration of data skills didn't distract from the course learning outcomes of regression modeling and time series. Four student comments appreciated the 'real world situations,' and two appreciated how they began to see how the different skills in statistics 'all fit together.'

As a follow-up assessment, a survey was sent to all 45 students a semester later and asked about any change they'd experienced with regard to the six data skills organized under Different Sources and Formats and Manipulation and defined as good, better, and best. While there is likely a response bias (those who didn't enjoy the course wouldn't respond), 15 of the 16 responses reported 'Much Stronger' skills in at least one of the data skill sets, with 12 reporting 'Much Stronger' skills in three or more of the six skills defined.

There are a few drawbacks to authentic data experiences. The most important is what is omitted from the course when authentic data experiences are added. From the instructor perspective, new homework assignments had to be written for 'better' and 'best' applications since the textbook's applications are all at the 'good' level. Course preparation time shifted away from some regression topics, and assessment at semester's end reflected less depth on about 15% of the course material compared to another instructor teaching the same course. From the student perspective, using fewer datasets in the weekly homework because of the time required to assemble the data for analysis resulted in less practice on regression skills. From the Fall 2014 Stat 330 course ratings, the mean response to 'Valuable Time Out of Class' was 89%, which is higher than the department mean of 81%. Of the 24 student comments, five were about homework load, with two reporting 'the right amount,' three reporting it was 'too much,' and two suggesting more time in class with data sets similar to what will be assigned on the week's homework.
Conclusion
Statistics is a field of study requiring an integrated set of skills. The 2014 Guidelines define the expected skills of a statistics major and encourage programs to provide opportunities for students to connect these skills. 'Authentic data experiences' are proposed that extend the definition of real data to include the application of data skills.

Presently, instructors can choose from an exploding set of real data applications for their courses. Unfortunately, the end result from the student's perspective is often a clean, well-organized, easily read file. A small paradigm change is to provide source data locations with sufficiently detailed instructions for the students to prepare the dataset for analysis. An added benefit of finding and writing authentic data experiences is that it forces instructors out of the comfort zones described by Horton (2015) and keeps instructors current as preferred technologies change over time. Applications that are time sensitive benefit from providing the instructions to obtain current data. In time series in particular, a student analyzing data that was current at book publication may feel the application is artificial. In some cases obtaining current data for the same problem requires basic data-related skills.

Open questions include the number of real data applications that should be included in a course and program, the number that should be at the 'better' and 'best' levels, and the value of including 'better' or 'best' applications on exams in non-data courses. Further extensions include creating an environment to share applications where the teaching notes articulate how different expectations for the data skills modify the example. For example, the same application could be presented to a statistics major course and an introductory course with different student expectations with regard to data skills. Since the Internet is a dynamic environment for data, there needs to be a way for people to share when data becomes subscription-based or a website has changed.

The 2014 Guidelines challenge undergraduate statistics programs to emphasize working with complex data. Students will take courses in data manipulation and computation, as well as having a culminating experience through a capstone course, an internship, and/or a mentored research experience. The paper suggests that more can be done to develop students' skills and confidence by providing authentic data experiences in other courses.
SUPPLEMENTARY MATERIAL

The three examples are available as a Wiki at https://grimshaw-wiki.byu.edu. Also available is the climate science example from Witt (2013) with Teaching Notes for both a regression class and an introductory class. The Wiki welcomes contributions, comments, and updates.
References
Albert, J. (2010). Baseball data at season, play-by-play, and pitch-by-pitch levels. Journal of Statistics Education 18(3).

American Statistical Association (2014). Curriculum guidelines for undergraduate programs in statistical science.

Çetinkaya-Rundel, M. and D. Stangl (2013). A celebration of data. CHANCE 26(3). http://chance.amstat.org/2013/09/classroom_26-3/

Cobb, G. W. (2015). Mere renovation is too little too late: We need to rethink our undergraduate curriculum from the ground up. The American Statistician (in press).

GAISE College Group (2005). Guidelines for assessment and instruction in statistics education. Technical Report, American Statistical Association.

Gould, R. and M. Çetinkaya-Rundel (2014). Teaching statistical thinking in the data deluge. In T. Wassong, D. Frischemeier, P. R. Fischer, R. Hochmuth, and P. Bender (Eds.), Mit Werkzeugen Mathematik und Stochastik lernen — Using Tools for Learning Mathematics and Statistics, pp. 377-391. Springer Fachmedien Wiesbaden.

Horton, N. J. (2013). I hear, I forget. I do, I understand: A modified Moore-method mathematical statistics course. The American Statistician 67(4), 219-228.

Horton, N. J. (2015). Challenges and opportunities for statistics and statistical education: Looking back, looking forward. The American Statistician 69(2), 138-145.

Horton, N. J., B. S. Baumer, and H. Wickham (2015). Setting the stage for data science: Integration of data management skills in introductory and second courses in statistics. CHANCE 28, 40-50.

Lazar, N. A., J. Reeves, and C. Franklin (2011). A capstone course for undergraduate statistics majors. The American Statistician 65(3), 183-189.

Sheather, S. J. (2009). A Modern Approach to Regression with R. Springer.

Weisberg, S. (2013). Applied Linear Regression (4th ed.). Wiley.

Wickham, H. (2014). Tidy data. Journal of Statistical Software 59(10), 1-23.

Witt, G. (2013). Using data from climate science to teach introductory statistics. Journal of Statistics Education 21(1).

Zhu, Y., L. M. Hernandez, P. Mueller, Y. Dong, and M. R. Forman (2013). Data acquisition and preprocessing in studies on humans: What is not taught in statistics classes?