Dynamic Data in the Statistics Classroom
Johanna Hardin
1. Introduction
There has been a recent push to change the way we - as statisticians - engage pedagogically with complex real-world data analysis problems. Two parallel forces have directed us toward embracing more complex-data real-world problems in the classroom.

The first driver is a call from educators to make statistics more relevant to students' experiences. In order to address the perspective of students who previously have been reported to "exhibit remarkably little curiosity about the material they are analyzing", Brown and Kass (2009) suggest the importance of "real-world problem solving" to get students engaged in their analyses. Gould (2010) wants introductory statistics students to leave the course with "a set of ... attitudes about data that are immediately applicable to their lives." Gould and Çetinkaya-Rundel (2013) suggest putting "data at the center of the curriculum." Horton, Baumer, and Wickham (2015) assert that "statistics students need to develop the capacity to make sense of the staggering amount of information collected in our increasingly data-centered world." And Zhu, Hernandez, Mueller, Dong, and Forman (2013) remind us that "data pre-processing bridges the gap from data acquisition to statistical analysis but has not been championed as a relevant component in statistics curricula."

The second driver arises from students and other stakeholders and is harder to document with references. Certainly my own experience (and that of my colleagues) is that students engage at the deepest level when they (a) care about the problem, and (b) understand the data collection, study design, or motivation of the problem. Kuiper and Sturdivant (2015) report, "Our student evaluations of these materials support Gould's (2010) comments suggesting that previously collected and cleaned data were considered abstract to the student ...
We have found that unless students 1) have collected the data themselves or 2) clearly see where and how the data were collected, they often fail to appreciate it."

Indeed, even if the work is done for them, as Grimshaw (2015) argues, there is much benefit in the students seeing how the data were procured:

Using the vocabulary of Wickham (2014), teachers hide the 'messy data' aspects and provide 'tidy data' - even when students possess the data skills required to work with the messy data. It is valuable for students to not only have many authentic data experiences but also to have the professor model the correct application of statistics by showing work with messy data in lectures... What is good for statistics majors can also be applied to introductory courses. There may be no data skills on the learning outcomes for these courses, but some examples and homework in an introductory course may be modified and/or updated to use the original source data instead of a curated dataset. The data skills required would certainly be modest and need to fit student backgrounds. The objective would be for students to see that data skills are required in an analysis. Students may rely on code provided to them that results in their own copy of the dataset.

There have been repeated calls to use real data in the classroom (Cobb 1991, 1992, 2007, 2011; Workgroup on Undergraduate Statistics 2000; American Statistical Association Undergraduate Guidelines Workgroup 2014; GAISE College Group 2005; Carver, Everson, Gabrosek, Rowell, Horton, Lock, Mocko, Rossman, Velleman, Witmer, and Wood 2016). There are benefits to using real data sets (as opposed to made up numbers) in textbooks and in the classroom. Indeed, there are virtually no textbooks of any kind (AP Statistics, Introductory Statistics, second course in statistics) being written today without the vast majority of examples taken from actual studies or databases.
Additionally, there has been a push to infuse R with datasets that are relevant, recent, and sophisticated (e.g., nycflights13 and other mosaic datasets (Pruim, Kaplan, and Horton 2014)).

Unfortunately, however, all of the data given in a textbook or an R package are by nature static. Currently, there do not exist mechanisms to continually update any dataset provided by the course materials. The most comprehensive baseball dataset compiled and provided to the students will be out of date (and less interesting) by next fall. The good news, however, is that, outside of the statistics classroom, data of all kinds are being updated in real time publicly and accessibly. And even better news is that R developers are continually improving interfaces to the vast amounts of public data. For example, the tidyverse package imports a handful of packages that make downloading data straightforward (e.g., see Wickham (2014) and more recently the packages: rvest, readr, readxl, haven, httr, and xml2; expect more to come).

Grimshaw (2015) delineates data along two axes: the first axis describes the source and format for the data; the second axis describes the amount of wrangling required on the data. Both axes are scored as good/better/best. Indeed, Grimshaw gives examples, sources, and R code for several examples. However, he included only one dataset from the "best/best" category that worked well in his classroom. Out of the three "best/best" datasets he used, two were not well received because of the excessive time, computing expertise, or domain knowledge needed to wrangle the data into a usable format and extract meaning from the dataset.

In this manuscript, we seek to address the difficulty of using dynamic data in the classroom by curating additional best/best datasets (including full R Markdown files to facilitate reproducible analysis). The available resources aim to open the world of dynamic data to those who have not previously worked directly with downloaded data.
We hope our work can act as a starting point for those interested in building their data scraping skills. Additionally, the R Markdown files can be used as scaffolding for assignments designed to help students engage with online resources.

In the next section we work through an entire example, showing the steps and R code for downloading the data, providing example code for using it in a typical introductory statistics classroom, and suggesting ideas for expanded use beyond introductory statistics. Section 3 gives summaries of additional dynamic datasets that are fully curated as R Markdown files in the supplementary resources on GitHub ( https://github.com/hardin47/DynamicData ) as well as a list of other places to look for dynamic data, and we provide closing thoughts and ideas for future work in the Conclusion. The Appendix describes a few technical details that are helpful for making dynamic data accessible to students at all levels.

2. Complete Example: College Scorecard Data

Supplementary materials provide full R Markdown files for complete analyses of five different dynamic datasets. We start with a complete description of the College Scorecard materials as a way of illustrating the available resources.

Data on characteristics of US institutions of higher education were collected in an effort to make more transparent issues of cost, debt, completion rates, and post-graduation earning potential. An undertaking of the U.S. Department of Education, the College Scorecard data represent a compilation of institutional reporting, federal financial aid reports, and tax information. The process of gathering and compiling the data is well documented on the College Scorecard website https://collegescorecard.ed.gov/data/documentation/ . One caveat is that some of the variables have only been collected on students receiving federal financial aid. Biases inherent to analyses done on data collected from a subgroup should be considered.

The College Scorecard dataset is incredibly rich.
The individual institutions are broken down by total share of enrollment of various races, family income levels, first generation status, age of students, etc. Additionally, the data give matriculation information like SAT scores as well as graduation information like completion rate and income level. The dataset allows a student to investigate political or personal hypotheses about college education and the costs and benefits within. The variables are described in a data dictionary given at https://collegescorecard.ed.gov/assets/CollegeScorecardDataDictionary.xlsx .

For each of the five fully curated dynamic datasets, there is an R Markdown file (available at https://github.com/hardin47/DynamicData ) which scrapes the data from an outside web source (presumably kept current and public by some other organization). The downloading step is shown here in the manuscript only for the College Scorecard data. Note, as discussed above, the data are downloaded directly from the website managed by the US Department of Education (and not stored locally as a csv file). It is worth pointing out that the code given here is more complicated than what is standard in an introductory statistics course (especially at the beginning of the semester!). However, both the GapMinder and Wikipedia examples can be adjusted in very simple ways, allowing each student to work with their own dataset. That is, you can scaffold an assignment by providing the majority of the downloading code and having the student fill in the URL (Wikipedia example) or variable names (GapMinder example). The R code for downloading the data for all of the examples is given in the supplementary R Markdown files, and each uses slightly different functions and syntax.

First, load the data into R.
college_url <- "https://s3.amazonaws.com/ed-college-choice-public/Most+Recent+Cohorts+(All+Data+Elements).csv"
college_data <- readr::read_csv(college_url)
dim(college_data)

Next, use data wrangling methods to clean and organize the variables.

college_debt <- college_data %>%
  dplyr::select(region, HBCU, DEBT_MDN, md_earn_wne_p10) %>%
  mutate(DEBT_MDN = readr::parse_number(DEBT_MDN),
         md_earn_wne_p10 = readr::parse_number(md_earn_wne_p10)) %>%
  mutate(HBCU = ifelse(HBCU == "NULL", NA, HBCU)) %>%
  mutate(region2 = ifelse(region == "0", "Military",
                   ifelse(region == "1", "New England",
                   ifelse(region == "2", "Mid East",
                   ifelse(region == "3", "Great Lakes",
                   ifelse(region == "4", "Plains",
                   ifelse(region == "5", "Southeast",
                   ifelse(region == "6", "Southwest",
                   ifelse(region == "7", "Rocky Mnts",
                   ifelse(region == "8", "Far West", "Outlying"))))))))))

summary(college_debt)

Using the downloaded data, we start by applying a technique from the introductory curriculum to a research question of interest based on the College Scorecard data. College debt is of particular interest to many college students, but debt can be mediated by post-graduation income. To fully investigate the relationship between the variables, we provide both confidence and prediction intervals for both variables.

After calculating a few individual intervals, we show all intervals represented graphically and broken down by geographic region. Note that the visual representations do not represent a simple summary plot of the data, and we leave it open to the instructor to have the students engage more deeply with the many available variables.

Using the two variables measuring the amount of debt of a typical (i.e., median) college graduate and median earnings 10 years after matriculation, we create both confidence intervals and prediction intervals - keeping in mind that the observational unit is an academic institution. Note that the calculations below are for both confidence and prediction intervals.
The confidence interval agglomerates institutions over the entire dataset; however, the prediction value is for a single institution (which is the observational unit). The analysis lends itself nicely to a conversation about confidence vs. prediction intervals as well as the observational unit as institution vs. individual student. It is worth pointing out to the students that the prediction intervals likely hold more information related to their individual experiences than the confidence intervals. However, the unit of prediction is an institution, and so individual student debt and income are likely even more variable than shown here. Additionally, Figure 1 demonstrates the effect of sample size: consider the comparison of the Military intervals (one school) to the intervals for all of the US institutions (about 6000 schools). (Note: the intervals given in Figure 1 were created using an ANOVA model where the within variance is calculated across all regions, which is how the interval for military schools can be calculated. You may or may not want to bring that up with your students.)

The following R code uses the mosaic package to directly calculate both prediction and confidence intervals. Note the formula interface given by the tilde is described in detail here: http://rpruim.github.io/eCOTS2014/Workshop/Modeling.html .

debt_mod <- lm(DEBT_MDN ~ 1, data = college_debt)
debt_fun <- mosaic::makeFun(debt_mod)
debt_fun()

The intervals are interesting, but they might be even more interesting if broken down by region and shown visually. Note how much smaller the confidence intervals are than the prediction intervals! The difference indicates lots of variability across institutions and large sample sizes.
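For instructors who prefer base R to mosaic::makeFun, the same two intervals come from predict() on an intercept-only model. The sketch below runs on simulated debt values; the numbers are invented for illustration, not College Scorecard output.

```r
set.seed(47)
# Simulated median-debt values standing in for the DEBT_MDN column (invented)
debt <- rnorm(500, mean = 15000, sd = 4000)
debt_mod <- lm(debt ~ 1)   # intercept-only model, as in the text

new_obs <- data.frame(.row = 1)   # one "new" institution
ci <- predict(debt_mod, newdata = new_obs, interval = "confidence")
pi <- predict(debt_mod, newdata = new_obs, interval = "prediction")

ci   # narrow: interval for the mean over all institutions
pi   # wide: interval for a single institution
```

Comparing the widths of the two intervals makes concrete the observational-unit discussion above: the prediction interval for one institution is far wider than the confidence interval for the mean.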
Figure 1.
The x-axis represents the region of the institution. The y-axis represents either the amount of debt 10 years after matriculation (orange) or the amount of income 10 years after matriculation (blue). Confidence intervals for the average values (within region) are given by the solid lines. Prediction intervals for individual institutions are given by the dashed lines. The solid dot represents the center of both types of intervals (broken down by debt and income).
Additionally, for each of the supplementary dynamic data R Markdown files, we add a section for each analysis which is based on topics that are not traditionally taught in introductory statistics classes. The additional analysis is done not only to expand the tool box of the students but also to teach the students that they can often think about the problem in sophisticated ways even if all their tools come only from the introductory course.

The College Scorecard dataset is incredibly rich and can be used for many different types of model building: linear, logistic, machine learning. Indeed, thinking about interaction terms could be particularly insightful. Here, we give an example of regressing earnings on debt with the interaction term being whether or not the institution is one of the Historically Black Colleges and Universities (HBCU). Figure 2 displays the separate regression lines for the two distinct types of institutions.
Earn = 18869.26 + 1.21*Debt, r^2 = 0.28, p = 0.00, N = 4880 (non-HBCU)
Earn = 20501.48 + 0.58*Debt, r^2 = 0.24, p = 0.00, N = 87 (HBCU)
Figure 2.
Median income regressed on debt. For the analysis, HBCU is interacted with debt to provide two distinct (and not parallel) regression lines. HBCU institutions are given in blue, and non-HBCU institutions are given in orange.
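The interaction model behind Figure 2 can be sketched with lm(); the data below are simulated stand-ins for the Scorecard variables (the variable names, group sizes, and slopes are invented for illustration).

```r
set.seed(47)
# Simulated stand-ins for the College Scorecard variables (invented values)
n <- 300
hbcu <- factor(c(rep("no", 250), rep("yes", 50)))
debt <- runif(n, 5000, 30000)
earn <- ifelse(hbcu == "yes", 20000 + 0.6 * debt, 19000 + 1.2 * debt) +
  rnorm(n, sd = 3000)

# Interacting HBCU with debt fits two distinct (non-parallel) lines
mod <- lm(earn ~ debt * hbcu)
coef(mod)   # the debt:hbcuyes coefficient is the difference in slopes
```

The interaction coefficient measures how much the debt slope differs between the two groups, which is exactly the "not parallel" feature visible in Figure 2.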
Many interesting conversations can ensue based on the regression of income on debt. Reminding the students that each observation is an institution is an important starting point. Additionally, students should be able to volunteer the dangers of using a model like this to suggest causality. Last, there might be room to discuss an inferential analysis of whether HBCUs are statistically different from non-HBCUs (noting the substantial differences in sample sizes).

It is not hard to come up with additional questions to investigate with the College Scorecard data. Indeed, because the data relate directly to college students, they should be able to find ways to engage with the data. We recommend continued conversations about how the data are valuable to the larger community, but that the information is not always complete (e.g., many variables are collected only on students who fill out financial aid forms) and not causative.

3. Dynamic Data Projects
For each of four additional dynamic datasets, we describe the source of the data and relevant variables & research questions, some standard and graphical techniques ("using dynamic data within a typical introductory statistics classroom"), and a statistical analysis appropriate for a course after introductory statistics ("thinking outside the box"). We also provide source information for an additional nine dynamic datasets.
Wikipedia stores most of its tabular data in HTML tables. To scrape HTML tables from any website (or HTML file), use the R function XML::readHTMLTable. As an initial foray into downloading data directly from the internet into R, Wikipedia tables provide a nice introduction. In the supplementary R Markdown file associated with the Wikipedia data analysis, we also walk through some of the useful aspects of using dplyr to wrangle the data.

Figure 3.
The log of the retail value of music sales broken down by whether or not the sales have increased or decreased (presumably over the previous year, although the Wikipedia documentation does not specify the time period over which the change is measured).
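A hedged sketch of the scraping step, using the rvest package (an alternative to the XML::readHTMLTable call named in the text) on an inline HTML snippet rather than a live Wikipedia page, so the example is self-contained; the table contents here are invented.

```r
library(rvest)

# A miniature HTML table standing in for a Wikipedia page (invented values)
html <- '<html><body><table>
  <tr><th>Country</th><th>RetailValue</th></tr>
  <tr><td>A</td><td>100</td></tr>
  <tr><td>B</td><td>250</td></tr>
</table></body></html>'

page   <- read_html(html)
tables <- html_table(page)   # a list with one data frame per <table>
music  <- tables[[1]]
music
```

Pointing read_html() at a Wikipedia URL instead of the inline string is the only change needed for a live-page version, which is what makes this step easy to scaffold for students.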
Using dynamic data within a typical classroom
The Wikipedia analysis given in the fully curated files explores an HTML table on sales of music (physical and digital) in 2014, https://en.wikipedia.org/wiki/Music_industry . (The original idea for this example was provided by Nick Horton, Amherst College.) One variable gives an indication of whether the retail value of the music sales has increased or decreased. Using the country-level music data, we perform a t-test, a Wilcoxon rank sum test, data transformations, and boxplots to investigate music retail sales (analysis given in supplementary materials, not shown here). By grouping the data into two categories we can investigate whether there is any statistical difference in average total retail sales (in US$) between those countries for whom retail sales increased versus those that decreased. The p-value for the initial t-test is reasonably large, but the boxplot shows that the difference in variability across the two groups is also large, with a sample that either has large outliers or a long, skewed right tail. Because the technical assumptions do not appear to be met, a log transformation of the data or a non-parametric test might be a better assessment of the data (see Figure 3). The analysis leads to conversations about the source of the data and the reasons why p-values are non-significant. The example extends easily to each student choosing their own Wikipedia page and data table, graphical representations, and statistical analyses.
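The sequence of tests described above can be sketched on simulated right-skewed data (invented values, not the Wikipedia table):

```r
set.seed(47)
# Simulated right-skewed "retail sales" for two groups of countries (invented)
increased <- rlnorm(20, meanlog = 5, sdlog = 1)
decreased <- rlnorm(25, meanlog = 4.5, sdlog = 1)

p_t    <- t.test(increased, decreased)$p.value             # raw, skewed data
p_logt <- t.test(log(increased), log(decreased))$p.value   # after log transform
p_w    <- wilcox.test(increased, decreased)$p.value        # non-parametric

c(p_t, p_logt, p_w)
```

Running all three side by side mirrors the classroom discussion: when the skewness violates the t-test's assumptions, the log-scale t-test and the rank-based test are the more defensible analyses.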
Thinking outside the box
Among the variables in the Wikipedia music dataset are the breakdown (percentages) of how the retail sales are distributed across physical, digital, performance rights, and synchronization. We might want to see whether there is a dependency of total retail sales on the breakdown of types of products. The problem is not well suited to introductory statistics as there is not an obvious statistic we can use within a sampling distribution (to create a p-value, etc.). Because there does not seem to be an obvious mechanism for evaluating the breakdown of products (and how "different" they are), we consider an ad-hoc measure and perform a permutation test to assess significance. The average breakdown of retail sales is given in the R Markdown file in the supplementary materials. One way to measure a discrepancy between the retail sales and the consistency of product breakdown is to correlate the retail sales with the sum of squared distances from the average breakdown of product types. We see that the metric we created to find a relationship between retail sales and breakdown of types of product does not show significance. A student project could be to think about different ways to measure how the breakdowns can be considered to be different.
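A minimal sketch of the permutation test, with simulated values standing in for the retail sales and for the sum of squared distances from the average breakdown (dist2avg is a hypothetical name, and the data are invented):

```r
set.seed(47)
# Simulated stand-ins: retail sales, and each country's squared distance
# from the average breakdown of product types (both invented)
sales    <- rlnorm(30, meanlog = 5)
dist2avg <- runif(30)

obs_cor <- cor(sales, dist2avg)

# Permutation test: shuffle one variable to break any true association,
# then compare the observed correlation to the permutation distribution
perm_cors <- replicate(5000, cor(sales, sample(dist2avg)))
p_value <- mean(abs(perm_cors) >= abs(obs_cor))
p_value
```

Because the null distribution is built by shuffling rather than by a formula, the same recipe works for any ad-hoc statistic a student invents, which is the point of the exercise.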
The NHANES data come from the National Health and Nutrition Examination Survey, surveys given nationwide by the Centers for Disease Control and Prevention (CDC). The CDC adopted the following sampling procedure:

1. Selection of primary sampling units (PSUs), which are counties or small groups of contiguous counties.
2. Selection of segments within PSUs that constitute a block or group of blocks containing a cluster of households.
3. Selection of specific households within segments.
4. Selection of individuals within a household.

About 12,000 persons per 2-year cycle were asked to participate in NHANES. Response rates varied by year, but an average of 10,500 persons out of the initial 12,000 agreed to complete a household interview. Of these, about 10,000 then participated in data collection at the mobile exam center. The persons (observational units) are located in counties across the country. About 30 selected counties were visited during a 2-year survey cycle out of approximately 3,000 counties in the United States. Each of the four regions of the United States and metropolitan and non-metropolitan areas are represented each year. As such, the data collection is ongoing, and the data are updated on the NHANES website periodically. This manuscript uses the 2011-2012 NHANES data, but we expect the data to be updated regularly, and the URL should simply change to 2013-2014 when the new data become available.

Note that the NHANES data are available in the mosaic package (Pruim et al. 2014). The mosaic version of the NHANES data is static (from 2011-2012), and the data have been cleaned with pre-selected variables. Additionally, the variables can be downloaded directly using the nhanesA package in R: https://cran.r-project.org/web/packages/nhanesA/vignettes/Introducing_nhanesA.html .
By accessing the data directly from the CDC's website, students become more involved in the data analysis process, understanding what they can and cannot get from the data.

The variables in the CDC's online NHANES dataset are virtually limitless. We use a few different datasets, merging them based on an individual identifier in the dataset. The variable information is all given online, but each dataset is documented at a different webpage (the demographic data, for example, have their own documentation page).

A further important aspect of the example is that (as described above) the data do not constitute a simple random sample from a population (it is a weighted sample). The sampling scheme can be part of the statistical inquiry into the data analysis, or it can be set by the instructor in the template used to download the data. The data which are directly downloaded from the CDC website include variables on the weighting scheme. In the R Markdown file provided with this manuscript, we demonstrate how to create a dataset which can act as a simple random sample from the population. The figure and analysis below are done with a proxy simple random sample.
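One way to build a proxy simple random sample is to resample rows with probability proportional to the survey weights; the sketch below uses an invented data frame and an invented weight column (wt) standing in for NHANES's weighting variables, not the approach's exact implementation in the supplementary files.

```r
set.seed(47)
# Toy weighted sample: each row is a respondent with a survey weight (invented)
nhanes_sub <- data.frame(id  = 1:1000,
                         bmi = rnorm(1000, 26, 4),
                         wt  = runif(1000, 1, 10))

# Resample with probability proportional to weight so that the resampled
# rows mimic a simple random sample from the represented population
srs_rows  <- sample(nrow(nhanes_sub), size = 500, replace = TRUE,
                    prob = nhanes_sub$wt)
proxy_srs <- nhanes_sub[srs_rows, ]
nrow(proxy_srs)
```

Analyses run on proxy_srs can then use the ordinary (unweighted) formulas from the introductory course, which is exactly the simplification the text describes.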
Using dynamic data within a typical classroom
In the supplementary materials (not shown here), we start with a comparison of body mass index (BMI) for those in committed relationships and those not in committed relationships. The graphs of the two BMI distributions look quite similar, and the t-test shows a non-significant difference in means. The results prompt a discussion about averages versus individual results, causation, and sample size.
Thinking outside the box
Because of the large sample size and the ability to determine the functional form of a non-linear relationship, smoothing techniques can be used to model quantitative variables. (The weighting analysis was motivated by work done by Shonda Kuiper, http://web.grinnell.edu/individuals/kuipers/stat2labs/weights.html , as well as the Project Mosaic Team, https://cran.r-project.org/web/packages/NHANES/NHANES.pdf .) Adding a smooth curve to a standard scatterplot leads (see Figure 4) to discussions about how smooth curves are estimated, the SE of the smooth curve, and the extra variability and instability due to extremes and fewer data points at the ends. However, extrapolation (note that the two curves have different ranges?) and the slopes of the two curves not seeming to be different (no interaction?) might warrant further study.
Figure 4.
A scatterplot of weight and height broken down by gender. A smooth regression is fit to the points, and the confidence interval for the smooth curve as well as the increase in variability for small and large values of height are part of an important discussion in a second level regression course.
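The smooth curve and its standard errors can be computed directly with base R's loess(); the sketch below uses simulated heights and weights (invented values, not NHANES measurements).

```r
set.seed(47)
# Simulated height (cm) and weight (kg) with a mildly non-linear trend
height <- runif(300, 150, 195)
weight <- 0.005 * (height - 140)^2 + 45 + rnorm(300, sd = 6)

fit  <- loess(weight ~ height)
pred <- predict(fit, newdata = data.frame(height = c(155, 175, 190)),
                se = TRUE)
pred$fit      # smoothed estimates at three heights
pred$se.fit   # standard errors, typically larger where data are sparse
```

Plotting pred$fit with a band of plus or minus two standard errors reproduces the confidence ribbon in Figure 4 and motivates the discussion of instability at the ends of the range.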
The next dynamic dataset comes from GapMinder. GapMinder also has a plethora of variables from which students can choose (according to their own interests), but here we work with literacy rates measured at the country level. The analysis (and more importantly, the data scraping from GapMinder) could easily be extended for students interested in all sorts of political, social, environmental, or demographic data. Understanding political and demographic trends across both time and location can provide very interesting insight into economic or political science questions. Alternatively, the GapMinder data can be perused in a descriptive or graphical manner.
Using dynamic data within a typical classroom
The introductory analysis considers gender differences in literacy rates and uses a linear model on the difference between female and male literacy rates across time (the analysis is available in the supplementary materials and not shown here). We show a graphical representation and discuss model assumptions including sampling and independence of residuals. The model indicates that the difference between male and female literacy rates is shrinking over time. However, we worry about the effects of other variables and encourage a more complete analysis. Indeed, there may be large biases in our model if important explanatory variables have been left out.

The data provided are ideal for a fantastic classroom conversation about causation, causal mechanisms, and confounding variables.
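The introductory model can be sketched as a simple regression of the literacy gap on year; the data below are simulated (invented slope and noise), not GapMinder values.

```r
set.seed(47)
# Simulated country-level literacy gap (male minus female, in percentage
# points) that shrinks over time -- invented values for illustration
year <- rep(1980:2010, times = 5)
gap  <- 12 - 0.25 * (year - 1980) + rnorm(length(year), sd = 2)

gap_mod <- lm(gap ~ year)
coef(gap_mod)[["year"]]   # a negative slope: the gap shrinks over time
```

The sign of the year coefficient carries the substantive conclusion; the residual diagnostics (independence across years within a country, in particular) carry the classroom discussion.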
Thinking outside the box
In the second analysis, we work with the additional variable continent. The trends observed in the first analysis hold up in the second analysis (i.e., the difference declines over time in each of the continents). However, there are additional considerations to be made, for example, the differences between the slopes across the continents (the analysis is available in the supplementary materials and not shown here). We suggest additional explorations into the independence of the residuals and more advanced spatio-temporal patterns of literacy.
The National Oceanic and Atmospheric Administration (NOAA) is the American federal agency in charge of collecting information and making decisions related to the oceans and the atmosphere. Throughout North America, they supply weather stations which are located both along the coast as well as in the middle of the ocean (on buoys). Among other variables, the weather stations collect information on wind, humidity, temperature, visibility, and atmospheric pressure. The data are all publicly available on NOAA's website.

Using dynamic data within a typical classroom
Although the data do not constitute a random sample, they are very likely to be quite representative with respect to the difference in wind and air temperature at that location over the year. In the supplementary files (not shown here), we use a paired analysis (i.e., subtract the two variables and treat them as a single variable) to find a confidence interval for the true difference in temperature between wind and air. We also find a prediction interval for the difference in temperatures across individual measurements.
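The paired analysis reduces to one-sample procedures on the differences; the sketch below uses two simulated, paired temperature series (invented values, not NOAA measurements).

```r
set.seed(47)
# Two paired (simulated) temperature series measured at the same times
temp_a <- rnorm(200, mean = 18, sd = 3)
temp_b <- temp_a - 2 + rnorm(200, sd = 1)

diffs <- temp_a - temp_b
ci <- t.test(diffs)$conf.int   # confidence interval for the mean difference

# Prediction interval for a single new difference (one-sample formula)
m <- mean(diffs); s <- sd(diffs); n <- length(diffs)
pi <- m + c(-1, 1) * qt(0.975, n - 1) * s * sqrt(1 + 1/n)

ci
pi
```

As with the College Scorecard intervals, the prediction interval for one measurement is much wider than the confidence interval for the mean difference, and the contrast is worth drawing out in class.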
Thinking outside the box
Although a full analysis of the data would warrant multiple years of data (so as to understand yearly trends), we can estimate the spectral density of the time series using a smoothed periodogram (the data below represent measurements every hour for all of 2014). In the smoothed periodogram (see Figure 5) the x-axis is the frequency (one over the period) and the y-axis represents the (normalized) correlation between the cosine wave at that frequency and the time series. We can see that wind speed has strong correlation at a period of 12 hours and a period of 24 hours. A more sophisticated analysis or longer project could include collecting data from multiple buoys, extended years, and/or additional information on storms.
Figure 5.
A smoothed periodogram of Wind Speed for buoy (bandwidth = 0.00167).
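The smoothed periodogram can be computed with base R's spec.pgram(); the sketch below uses a simulated hourly series with a built-in 24-hour cycle rather than buoy data (the amplitude, noise level, and smoothing spans are invented choices).

```r
set.seed(47)
# Simulated hourly wind speed for one year with a daily (24-hour) cycle
hours <- 1:(24 * 365)
wind  <- 5 + 2 * cos(2 * pi * hours / 24) + rnorm(length(hours))

# Smoothed periodogram; spans controls the modified-Daniell smoothing
sp <- spec.pgram(wind, spans = c(9, 9), taper = 0.1, plot = FALSE)

# The dominant frequency should sit near 1/24 (a 24-hour period)
peak_freq <- sp$freq[which.max(sp$spec)]
1 / peak_freq   # approximate period, in hours
```

Recovering the planted 24-hour period from the peak of the spectrum is a satisfying classroom check before turning students loose on the real buoy series, where the 12- and 24-hour peaks emerge from the data rather than from the simulation.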
There are myriad sources of dynamic data which are public and accessible. We list a few additional resources here.
Baseball Data
Many students are interested in sports statistics. In particular, statisticians and sabermetricians have worked over the years to compile datasets and to think carefully about statistical methods applied to baseball data. Today's datasets can answer questions like: what is the probability of a particular event happening given certain real-time situations?

Albert (2010) provides curated seasonal batting data from 1871 to 2009. With the passage of time, the examples of Derek Jeter and Alex Rodriguez as current players and batting trends up to 2009 are less relevant to students interested in sports, and what was once a current research question has become stagnant. However, Albert (2010) provides sufficient detail about the collection of files on the Lahman Baseball Database (with data available as a .zip file which can be read directly into R, http://seanlahman.com/files/database/lahman-csv_2015-01-24.zip ) and the tasks required to filter and merge the master and batting files that the research questions and teaching notes can be updated to reflect current players and trends. Also see Jim Albert's blog
Exploring Baseball Data with R, https://baseballwithr.wordpress.com/ .

Cherry Blossom Ten Mile Run Data
Each year in April, the Cherry Blossom Ten Mile Run happens in Washington, DC. The race results (including finishing time, name, age, hometown, and pace) are available from 1999 to the present (although admittedly, the results after 2012 are more difficult to scrape off the web due to a change in the format of how the results are now posted). Kaplan and Nolan (2015) provide step-by-step instructions ( http://rdatasciencecases.org/ ) for scraping the data and wrangling it into a format which can be used to investigate race results over time, by age, or by gender.
Fatal Accidents and US Census Data
The National Highway Traffic Safety Administration collects data on all fatalities suffered in motor vehicle traffic crashes. The data are posted yearly and publicly as part of the Fatal Accident Reporting System. Each year is posted as a separate set of files, available in different formats. For example, download the SAS data (readable into R via the read.sas7bdat function in the sas7bdat package, the sasxport.get function in the Hmisc package, or the read_sas function in the haven package) at ftp://ftp.nhtsa.dot.gov/fars/2014/SAS/ .

Another dataset which is easily accessible is the census data ( http://factfinder2.census.gov/faces/nav/jsf/pages/index.xhtml ). By looking at the Decennial Census data and the Profile of General Population and Housing Characteristics, students can study the ages and other demographic characteristics of the general population as compared to those individuals involved in fatal accidents.

Climate Data
Witt (2013) describes using data in class to reveal important insights on climate for statistics students. The data and analyses describe both the decline of Arctic sea ice and the global temperature increase. The (static) data used in the analyses are available through The Journal of Statistics Education, but the authors also provide links and information about the original data sources, which are dynamic. Notably, realclimate.org provides a catalogue of many different types of climate data and relevant source information. Witt (2013) also reports:

NASA's Global Change Master Directory is available at http://gcmd.gsfc.nasa.gov/ . NOAA's National Climatic Data Center maintains an extensive data directory. Yet another good climate data source is the Data Guide maintained by the National Center for Atmospheric Research at https://climatedataguide.ucar.edu/ .

(Original idea for this example from Laura Kapitula, Grand Valley State University.)

Iowa Liquor Sales Data
Iowa recently released an 800MB+ dataset containing all the weekly liquor sales from January 1, 2014 to the present ( https://data.iowa.gov/Economy/Iowa-Liquor-Sales/m3tr-qhgy ). The data seems to be updated monthly. Dan Nguyen has provided some scripting code and SQL analysis for downloading and wrangling the data ( https://gist.github.com/dannguyen/18ed71d3451d147af414 ). Because the data are available in a csv format, they can be read into R directly using read.csv from the data.iowa.gov URL. The SQL code from Nguyen's GitHub site can be translated directly into R using the dplyr package. Note that the rectangular structure and comma-delimited format of the data make it tempting to work with. The challenge for this particular dynamic dataset comes with its size, which can make it unwieldy on many computers.
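A sketch of the direct-download approach follows. The csv export URL below follows the standard data.iowa.gov pattern but should be verified before use, and the column names are assumptions to check against names() after the download; given the file size, readr::read_csv (or data.table::fread) will be far faster than read.csv:

```r
# Read the Iowa liquor sales data straight from the web.
# The export URL follows the usual data.iowa.gov csv pattern;
# verify it before assigning as homework.
library(dplyr)
library(readr)

liquor <- read_csv(
  "https://data.iowa.gov/api/views/m3tr-qhgy/rows.csv?accessType=DOWNLOAD"
)

# A dplyr translation of the kind of SQL summary in Nguyen's
# scripts: total sales by county, largest first. If the dollar
# column is read as text with a "$" prefix, wrap it in
# readr::parse_number() first.
liquor %>%
  group_by(County) %>%
  summarize(total_sales = sum(`Sale (Dollars)`, na.rm = TRUE)) %>%
  arrange(desc(total_sales))
```

The group_by/summarize pair is the dplyr counterpart of SQL's GROUP BY with an aggregate, which is why translations like this one are nearly line-for-line.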
Medicare Inpatient Charges Data
Medicare inpatient charge data is available at , which links directly to a zipped csv file. An analysis of the costs for the Medicare Severity Diagnosis Related Group (MS-DRG) (i.e., the procedure) can be done over multiple different covariates.
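A sketch of the download/unzip/read workflow appears below. Both file names are hypothetical placeholders for the actual links on the CMS site, and the column names are assumptions to check against the downloaded file:

```r
# Sketch: fetch the zipped csv, unzip it, and read it into R.
# URL and file names below are hypothetical placeholders.
temp <- tempfile(fileext = ".zip")
download.file("https://www.example.gov/medicare_inpatient_charges.zip", temp)
charges <- read.csv(unz(temp, "medicare_inpatient_charges.csv"))

# Average covered charges by MS-DRG (column names are assumptions):
aggregate(Average.Covered.Charges ~ DRG.Definition,
          data = charges, FUN = mean)
```

Base R's unz() reads a single file out of a zip archive without leaving an unzipped copy on disk, which keeps the workflow reproducible from the URL down.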
World Bank Data
The R package wbstats connects directly to the World Bank API, providing a trove of global economic data. The package documentation and vignette provide an easy start for students to download basic data of their choosing ( https://cran.r-project.org/web/packages/wbstats/vignettes/Using_the_wbstats_package.html ).

Energy Consumption Data
The US Department of Energy collates energy usage at . For example, at they provide complete data on “coal production, consumption, exports, imports, stocks, mining, and prices.”
Government Survey Data
Anthony Damico has compiled and documented a repository of data from many dozens of different government surveys ( ), including NHANES. Additionally, Damico espouses the merits of reproducible analysis done in R with GitHub. As one example, the General Social Survey (GSS) contains information on what Americans think of policies, issues, and priorities in the US ( http://gss.norc.org/ ).

We try to communicate to the students that there is information in most data sources. We want to be wary and attentive to issues of experimental design and systematic biases. However, we do not want to leave our students feeling stuck every time they encounter a dataset which has not been gathered from a large randomized trial. Instead, we try to think of the pieces of the analysis that can elicit information which is interesting or possibly hypothesis generating.

Students at every level are ready to examine complicated relationships using tools which are accessible before taking multiple advanced statistics courses. “Unfortunately, many instructors teach the sections on data analysis as descriptive statistics, perhaps because this is what they experienced in their first course. They emphasize the process of calculating numerical summaries and making graphical displays, rather than using these as tools to explore what the data are saying.” (Notz 2015)

To keep our students from getting stuck, “we must ask questions such as whether the data allow generalization to a larger population, whether their structure can be meaningfully described with the models we wish to fit, and whether important subgroups or individuals were excluded from the data. Exceptions, anomalies, outliers, and subgroups are best recognized and understood in the context of the question being addressed.” (De Veaux and Velleman 2015)
4. Conclusion
The few examples provided here give a glimpse into how to incorporate real and dynamic data into introductory statistics and courses beyond. The Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report implores instructors to make one small change in our courses to keep up with changing technology and data (Carver et al. 2016).
5. Acknowledgements
This work was supported by Project MOSAIC (NSF grant 0920350, Phase II: Building a Community around Modeling, Statistics, Computation, and Calculus). Additionally, the data sources were found with help from many individuals. For example, see Wes Stevenson's great blog on importing data into R ( http://statistical-research.com/importing-data-into-r-from-different-sources/ ). An important step forward for accessing data is Jenny Bryan's R package googlesheets ( http://blog.revolutionanalytics.com/2015/09/using-the-googlesheets-package-to-work-with-google-sheets.html ). Many of the data sets were inspired by work by or conversations with Nick Horton, Ben Baumer, Gabe Chandler, Scott Grimshaw, Laura Kapitula, Danny Kaplan, Randy Pruim, Maddi Cowen, Ciaran Evans, Samantha Morrison, and Janie Neal.

Appendix

The Appendix contains technical details of interest to students who want to learn more about directly downloading data. RStudio allows the teacher to provide template R Markdown files which download data directly from the Internet into the RStudio environment. Additionally, reproducible R Markdown files teach good science and analysis. Horton et al. (2015) discuss the merits of reproducible research as well as the ease of using R Markdown in the classroom. As part of the supplementary materials ( https://github.com/hardin47/DynamicData ), we have provided R Markdown files giving both R code and related pedagogical commentary associated with different dynamic data examples for use in a statistics classroom. The data have all been collected directly from outside sources - websites that are updated periodically (and with information the students can access on their own).
An Application Programming Interface (API) is a set of programming instructions (written in code, giving the appropriate algorithm) for downloading data from a website that contains - typically - vast amounts of data. For example, Twitter has an API ( https://dev.twitter.com/overview/api ) which tells programmers how to access tweets (and related information) directly from the Twitter website. At any level, but particularly at an introductory level, it is recommended to use an R (or other software) package or function that allows R to speak to the API in order to download the data. For example, the R package wbstats connects directly from R to the World Bank API, and twitteR provides an R interface to the Twitter API. Though the examples provided do not generally rely on APIs or a related R interface, it is good to be aware of APIs (and to tell your students!) in order to greatly increase the sources of data available to you and your students.
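To make the API route concrete, the following sketch pulls a World Bank indicator through wbstats. The indicator code SP.POP.TOTL is the World Bank's total-population series; note that older versions of the package name the download function wb() rather than wb_data(), so check the version of the vignette that matches your installation:

```r
# Sketch of an API download via wbstats: total population for a
# few countries over a range of years.
library(wbstats)

pop <- wb_data(indicator = "SP.POP.TOTL",
               country = c("US", "MX", "CA"),
               start_date = 2000, end_date = 2015)

head(pop)
```

Because the request goes through the API each time, the data students see is as current as the World Bank's own database - the defining feature of a dynamic data source.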
The R Markdown files provided as supplementary materials make it easy for instructors and students to perform reproducible analyses from data collection through analysis and synthesis of the results. The students get the important practice of tracking each and every step of their work. Reproducible analysis has recently gotten a lot of press (Johnson 2014; Stodden, Leisch, and Peng 2014), and RStudio (RStudio Team 2015) has produced a user-friendly format for combining R code with html (or LaTeX) word processing that has been used with introductory statistics (Baumer, Çetinkaya-Rundel, Bray, Loi, and Horton 2014).

R Markdown has a short learning curve and will be straightforward for your students to implement. One of the key considerations is that when running an R Markdown file, the file does not pay attention to anything running locally. R Markdown essentially restarts R, so the current state of your R session is totally irrelevant. Recall that the purpose of R Markdown is to create reproducible files, so the Markdown file should run on any computer anywhere (regardless of the local environment).
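A minimal template of the sort described above might look like the following (the data URL is a placeholder to be pointed at an actual dynamic source). Because knitting starts from a fresh R session, every library() call and every data download must appear in the file itself:

````markdown
---
title: "Dynamic Data Example"
output: html_document
---

```{r setup, message = FALSE}
# Everything the analysis needs is loaded here, because knitting
# starts from a clean R session.
library(dplyr)
library(ggplot2)
```

```{r load-data}
# Placeholder URL -- point this at the dynamic data source.
dat <- read.csv("https://www.example.org/current-data.csv")
```

```{r first-look}
glimpse(dat)
```
````

Handing students a template like this lets them re-knit weeks later and pick up whatever data the source is currently serving.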
Important note:
Any package used in R Markdown needs to be installed prior to compiling the .Rmd file. That is, for a line of code such as library(dplyr), before running the Markdown file, install.packages("dplyr") must be run from within the console one time (ever).

Useful R functions
Some important R functions (in italics) and packages (in typeface) that will help you and your students navigate importing and using data in R include:

• dplyr package for data wrangling in general; cheat sheet at
• The glimpse function in dplyr
• tidyr for converting between wide and long formats and for the very useful extract_numeric() function (or readr::parse_numeric())
• ggplot2 for faceted graphing (also, ggvis) (Wickham and Sievert 2016); cheat sheet at
• openintro (or packages that come with the textbooks you use) which are great for pulling up any dataset from the text and building on it in class (Diez, Barr, and Çetinkaya-Rundel 2012)
• mosaic for consistent syntax and helpful functions used in introductory statistics (Pruim et al. 2014); https://cran.r-project.org/web/packages/mosaic/vignettes/MinimalR.pdf
• googlesheets for loading data directly from Google spreadsheets
• lubridate if you ever need to work with any date fields
• stringr for text parsing and manipulation
• rvest for scraping data off the web; readxl for reading excel data
• readr (and fread with data.table) for loading large datasets with default stringsAsFactors = FALSE
• tables for nice-looking summary tables

References
Albert J (2010). “Baseball Data at Season, Play-by-Play, and Pitch-by-Pitch Levels.” Journal of Statistics Education, (3).

American Statistical Association Undergraduate Guidelines Workgroup (2014). “2014 curriculum guidelines for undergraduate programs in statistical science.” Technical report, American Statistical Association, Alexandria, VA. URL .

Baumer B, Çetinkaya-Rundel M, Bray A, Loi L, Horton N (2014). “R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics.” Technology Innovations in Statistics Education, (1). URL http://escholarship.org/uc/item/90b2f5xh .

Brown EN, Kass RE (2009). “What is Statistics?” The American Statistician, (2), 105–110.

Carver R, Everson M, Gabrosek J, Rowell GH, Horton N, Lock R, Mocko M, Rossman A, Velleman P, Witmer J, Wood B (2016). “Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report.” Technical report, American Statistical Association. URL .

Cobb GW (1991). “Teaching Statistics: More Data, Less Lecturing.” UME Trends, pp. 3–7.

Cobb GW (1992). “Teaching Statistics.” In Lynn A. Steen (ed.), Heeding the Call for Change: Suggestions for Curricular Action (MAA Notes No. 22), pp. 3–43.

Cobb GW (2007). “The Introductory Statistics Course: A Ptolemaic Curriculum?” Technology Innovations in Statistics Education, (1). URL https://escholarship.org/uc/item/6hb3k0nz .

Cobb GW (2011). “Teaching statistics: some important tensions.” Chilean Journal of Statistics, (1), 31–62.

De Veaux R, Velleman P (2015). “Teaching Statistics Algorithmically or Stochastically Misses the Point: Why not Teach Holistically? (Online discussion of “Mere Renovation is Too Little Too Late: We Need to Rethink Our Undergraduate Curriculum From the Ground Up,” by George W Cobb, The American Statistician).” The American Statistician, (4).

Diez DM, Barr CD, Çetinkaya-Rundel M (2012). openintro: OpenIntro data sets and supplemental functions. R package version 1.4, URL http://CRAN.R-project.org/package=openintro .

GAISE College Group (2005). “Guidelines for Assessment and Instruction in Statistics Education.” Technical report, American Statistical Association. URL .

Gould R (2010). “Statistics and the Modern Student.” International Statistical Review, (2), 297–315.

Gould R, Çetinkaya-Rundel M (2013). “Teaching Statistical Thinking in the Data Deluge.” In T Wassong, D Frischemeier, PR Fischer, R Hochmuth, P Bender (eds.), Using Tools for Learning Statistics and Mathematics, pp. 377–391. Springer.

Grimshaw S (2015). “A Framework for Infusing Authentic Data Experiences Within Statistics Courses.” The American Statistician, (4), 307–314. URL http://arxiv.org/abs/1507.08934 .

Horton NJ, Baumer B, Wickham H (2015). “Setting the stage for data science: integration of data management skills in introductory and second courses in statistics.” CHANCE, (2), 40–50. URL http://arxiv.org/abs/1502.00318 .

Johnson G (2014). “New Truths That Only One Can See.” URL .

Kaplan D, Nolan D (2015). “Modeling Runners' Times in the Cherry Blossom Race.” In Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving, pp. 45–104. CRC Press.

Kuiper S, Sturdivant RX (2015). “Using Online Game-Based Simulations to Strengthen Students' Understanding of Practical Statistical Issues in Real-World Data Analysis.” The American Statistician, (4), 354–361.

Notz W (2015). “Vision or Bad Dream? (Online discussion of “Mere Renovation is Too Little Too Late: We Need to Rethink Our Undergraduate Curriculum From the Ground Up,” by George W Cobb, The American Statistician).” The American Statistician, (4).

Pruim R, Kaplan D, Horton N (2014). mosaic: Project MOSAIC (mosaic-web.org) Statistics and Mathematics Teaching Utilities. R package version 0.9-1-3, URL http://CRAN.R-project.org/package=mosaic .

RStudio Team (2015). RStudio: Integrated Development Environment for R. RStudio, Inc., Boston, MA. URL .

Stodden V, Leisch F, Peng RD (eds.) (2014). Implementing Reproducible Research. Chapman and Hall / CRC Press.

Wickham H (2014). “Tidy Data.” Journal of Statistical Software, (10). URL .

Wickham H, Sievert C (2016). ggplot2: Elegant Graphics for Data Analysis. Springer New York. URL http://had.co.nz/ggplot2/book .

Witt G (2013). “Using Data from Climate Science to Teach Introductory Statistics.” Journal of Statistics Education, (1).

Workgroup on Undergraduate Statistics (2000). “Guidelines for Undergraduate Statistics Programs, , accessed August 18, 2013.” Technical report, American Statistical Association.

Zhu Y, Hernandez LM, Mueller P, Dong Y, Forman MR (2013). “Data Acquisition and Preprocessing in Studies on Humans: What is Not Taught in Statistics Classes?” The American Statistician, 67