R Markdown: Integrating A Reproducible Analysis Tool into Introductory Statistics
Ben Baumer, Mine Cetinkaya-Rundel, Andrew Bray, Linda Loi, Nicholas J. Horton
Ben Baumer ∗, Mine Çetinkaya-Rundel †, Andrew Bray ∗, Linda Loi ∗ and Nicholas J. Horton ‡

Abstract
Nolan and Temple Lang argue that “the ability to express statistical computations is an essential skill.” A key related capacity is the ability to conduct and present data analysis in a way that another person can understand and replicate. The copy-and-paste workflow that is an artifact of antiquated user-interface design makes reproducibility of statistical analysis more difficult, especially as data become increasingly complex and statistical methods become increasingly sophisticated. R Markdown is a new technology that makes creating fully-reproducible statistical analysis simple and painless. It provides a solution suitable not only for cutting edge research, but also for use in an introductory statistics course. We present evidence that R Markdown can be used effectively in introductory statistics courses, and discuss its role in the rapidly-changing world of statistical computation.
1. Introduction
Statistical analysis of data is both increasingly common and increasingly sophisticated. While the imperative to convey findings with clarity remains, the modern statistical analyst faces a variety of challenges that may make analyses more difficult to understand. First, as the field of statistics deepens, applications of statistics are increasingly complex. Second, collaboration among researchers is now the norm, rather than the exception. Third, much of that collaboration is conducted remotely, with written analyses, data files, and computing scripts shared via electronic means. Fourth, the underlying data being analyzed is larger and more complex, making it impossible to fully describe on paper, and thus necessitating transmission via an electronic file. Each of these complications makes it harder to completely follow someone else’s work. Yet this task (in a word, reproducibility) remains the lifeblood of scientific collaboration.

∗ Department of Mathematics & Statistics, Smith College, Northampton, MA 01063
† Department of Statistical Science, Duke University, Durham, NC 27708
‡ Department of Mathematics, Amherst College, Amherst, MA 01002

In the past few years, the startling realization that many modern scientific findings cannot be replicated has been highlighted in the popular press (The Economist Editorial 2013; Johnson 2014) as well as the scientific literature (Ioannidis 2013). Many factors have been identified, including publication bias, reporting bias, conflicts of interest, and insufficient statistical power. This last factor can be remedied by encouraging the replication of studies and then conducting subsequent meta-analyses. In order for a scientific study to be replicated, however, the method of statistical analysis must be entirely reproducible.
Teaching reproducible analysis in an introductory statistics course not only makes students aware of these issues, but also paves the way toward making them valuable contributors to modern data analysis. These future contributions could be made as part of academic research or for a data-centric enterprise that needs to conduct daily analysis on new data.

The journal Nature addressed these issues head-on in an editorial outlining the efforts that the journal would take to reduce their irreproducibility (Nature Editorial 2013). A key provision is their pledge to “examine statistics more closely and encourage authors to be transparent, for example by including their raw data.” In searching for the source of this irreproducibility, they note that “mentoring of young scientists on matters of rigour and transparency is inconsistent at best” (p. 398). A natural environment to provide this mentoring is the first time that most young scientists will encounter the formal principles of scientific inquiry: in introductory statistics.

The introductory statistics course has changed greatly in recent decades, with more focus on active learning, use of technology for conceptual understanding and analysis of data, along with “naked, realistic and real” data (GAISE College Group 2005). In this paper, we discuss how
R Markdown (Allaire, Horner, Marti, and Porte 2013), a simple, easy-to-learn, open source markup language, can be integrated into an introductory statistics course in an effort to achieve the GAISE guidelines, specifically, to enable students to develop the basic capacity to undertake modern data analysis and communicate their results.

Our world is increasingly awash in data. What we plan to do with that information (how we plan to store it, how we will analyze it, and what insight exactly we hope to extract from it) remains a central unanswered question facing data scientists, network scientists, statisticians, and computer scientists alike, both inside of academia and out. As statistics instructors, we face the difficult task of preparing our students to make their way in a sea of data that is foreign to many of us. Meanwhile, intrepid graduates hop on massive vessels like Google, Facebook, and Amazon each year, perhaps not knowing that the two-sample t-test they spent so much time studying in college is unlikely to be sufficient for their future work (Cobb 2007). But our aim here is not to discuss the topics in introductory statistics courses, but rather, the workflow.

Hopefully the days in which students perform statistical analyses by hand or with a calculator are, if not over, bounded above by a function tending to zero. Thus, we take it as a given that our students’ future work will be done on a computer. As such, providing students with the tools to “think with data” and “compute with data” is essential to their prosperity as data analysts. At the same time, the ability to communicate one’s findings to other people is imperative. At the end of every data analysis task, there is a person who wants to understand the analyst’s findings. That person may be a boss, a journalist, a doctor, or a policy-maker, who more often than not will have a weaker technical background than the analyst.
This is the way it should be, since otherwise, the person might as well perform the analysis themselves. But the end result is that the analyst’s value is ultimately tied up in how understandable she can make her work.

Because both computation and presentation are essential, a typical workflow comprises at least two major components: a statistical software package for performing the data analysis, and a layout package for presenting the results. For the former, we have had very positive experiences using R (R Core Team 2013), even in an introductory course, but other options (e.g. SAS or Stata) may be feasible. For presentation, written reports tend to be composed in a word-processing application (e.g. Microsoft Word, LibreOffice Writer, or Google Docs) while oral presentations tend to use slides prepared in a presentation application (e.g. Apple Keynote, Microsoft PowerPoint, LaTeX with beamer, or Prezi). A pairing of a statistical package and a layout package constitutes a workflow. That is, the analyst’s work will typically begin in their statistical package of choice, wherein the data analysis will be performed. Once completed, translated summaries of work, be they tables, charts, images, or other output, need to be integrated into the layout application. Once in this environment, additional material can be layered onto the statistical results, so that a (usually less technical) human being can understand the findings.

This workflow is ubiquitous, and in most undergraduate courses where students are expected to compute with data, student homework assignments are produced in this manner. That is, statistical computations are performed in a statistics package, say R, and then a written summary is produced in, say, Word. Tables and plots are simply copied-and-pasted from R to Word.

So if this workflow is so common, what is wrong with it? In truth, there are several important undesirable aspects. First, it is not reproducible.
Since the commands used to generate the statistical output are not present in the final presentation, then either: a) the reader must assume that the student has calculated exactly what they say they have calculated, since there is no way of verifying the computation; or b) the grader must rely on the student to also copy-and-paste the commands used to generate the analysis. In either case, the grader will frequently be unable to completely follow the student’s work. Moreover, the issue of reproducibility is relevant not only for a second party (i.e. a grader), but also for the student. Being able to retrace steps while studying for a final, for example, is a desirable outcome. More concretely, the student may be reminded years later of the analysis, and seek to reapply the same methods in a different setting. Having the commands separated from the results inhibits this process.

(Footnote: Here we must distinguish between command-driven software packages (each of the aforementioned) and menu-driven software packages (e.g. StatCrunch or Microsoft Excel). It is increasingly the case that the complexity of data analysis tasks requires the additional functionality and programmability of command-driven applications. The iteration required of students doing inquiry-based projects often breaks down in a menu-driven workflow.)

Second, the separation of computation from analysis is not logical. The commands in an R script proceed chronologically, such that the analyst will most likely run the entire script all at once. A written report will be read in the same order, and there is no reason why the commands and analysis should not be interwoven. Rather, the impetus to separate the commands from the analysis is that the statistical package is not good at presentation, and the word-processing application is not good at computing. But this is not the ideal setup for the data analyst; it is simply an artifact of software design.
R Markdown helps to bridge this gap in the data analysis workflow.

Third, the separation of computing from presentation is not necessarily honest. At Smith College, a strict honor code (to which all students are bound) discourages cheating. But it is all too easy for a student copying-and-pasting output from one program to another to fudge a few numbers. Again, the divorce of the computation from the presentation enables the student to edit the content along the way. The possibility of getting “lost in translation” is disastrous for the data analyst.

More subtly and less perniciously, the copy-and-paste paradigm enables, and in many cases even encourages, selective reporting. That is, the tabular output from R is admittedly not of presentation quality. Thus the student may be tempted or even encouraged to prettify tabular output before submitting. But while one is fiddling with margins and headers, it is all too tempting to remove rows or columns that do not suit the student’s purpose. Since the commands used to generate the table are not present, the reader is none the wiser.

Lastly, the copy-and-paste paradigm is error prone. When jumping between multiple windows (R and a word processor), students, often working on laptops with small cluttered screens, inadvertently copy-and-paste partial output or forget to update the output or plots included in the written report as they revise their analysis. This not only complicates grading, but it also results in increased frustration levels in students who devote time to improving their analysis but lose points for turning in a report that does not contain the desired results.
2. Related Work
While the notion that scientific results should be reproducible is fundamental, the recent interest in reproducible statistical analysis is a modern outgrowth fueled by developments in computing and networking. In particular, computational statistical methods have become more popular as computational power has become cheaper. Similarly, the Internet has eliminated many barriers to information dissemination. Thus, it is now possible to transmit the entirety of a statistical research project to nearly anyone in the world in almost no time, and at almost no cost. With the inclusion of both data and code the possibility exists that another person can entirely duplicate an analyst’s findings with little effort.

Knuth was an early advocate of literate programming, which emphasized the use of detailed comments embedded in code to explain exactly what the code was doing (Knuth 1984). The goal was to tie explanations to instructions so that work could be recreated, better understood, and verified. This idea was a predecessor to the notion of reproducible research. According to Xie (2014), the use of the term reproducible research first appeared in Claerbout (1994). Buckheit and Donoho were early disciples of Claerbout’s ideas, incorporating them into their work with Matlab libraries (Buckheit and Donoho 1995). They proposed that in a scientific publication that relies on computation, the scholarship is not merely the presentation of the figures, etc. that further the author’s case. Rather, “the actual scholarship is the complete software development environment and complete set of instructions which generated the figures” (Buckheit and Donoho 1995). From here, it is clear that the burden of reproducibility rests on the original author, and that publication of computer code is considered a necessary but not sufficient condition for achieving reproducibility.

Particular advocacy of reproducibility has come from the community surrounding R.
Sweave (Leisch 2002) provided a method for integrating executed R code into LaTeX documents. The knitr package by Xie (2014) provides equivalent functionality, but also partners with R Markdown to bring reproducibility and dynamic document generation to those who are not familiar with LaTeX (Gandrud 2013). In many ways knitr can be seen as the realization of the vision for reproducible statistical analysis described by Gentleman and Temple Lang (2004).

The emphasis on reproducibility can be seen as a necessary but not sufficient part of ensuring that students have capacity to “think with data.” Along these lines, recent efforts in statistics education have advocated for an increased use of computing in the statistics curriculum, both at the undergraduate and graduate levels (Nolan and Temple Lang 2010). Yet while they argue strongly for the need for students to learn programming (and presumably, literate programming) they provide no mechanism for allowing students to express their statistical computations. R Markdown provides exactly such a mechanism, and fits squarely into the statistical computing workflow.
3. R Markdown
R Markdown is an easy-to-use system that enables students to combine statistical computing in an environment of their choosing and written analysis in one document. At a high level, it renders a well-annotated R script into a self-contained HTML file, replete with graphics, commands, and stylized text.

Like LaTeX or HTML, R Markdown relies on a source file and output file paradigm. Text, with simple rules for creating styles, is typed into an R Markdown source file, which has the .Rmd extension. R commands are typed directly into this file, set off in “chunks”. The knitr rendering engine then parses the .Rmd file. It first executes each of the R commands in the chunks and processes the output from those commands. This generates an intermediate Markdown file (with a .md extension) which is of no immediate interest. Next, it renders this Markdown file into a single HTML file with embedded graphics. For those familiar with LaTeX, Sweave, or PHP, it is very similar to the way that each of these process one source file into another output file. A comparison of the workflow in rendering applications is given in Table 1.

Source Language   Source file format   Rendering Engine   Intermediate file format   Output file format
LaTeX             .tex                 pdflatex           .log, .aux                 .pdf, .ps
Sweave            .Rnw                 Sweave, knitr      .tex                       .pdf
PHP               .php                 PHP                                           .html
R Markdown        .Rmd                 knitr              .md                        .html

Table 1: Comparison of similar rendering applications

The primary benefit of
R Markdown is that it restores the logical connection between the statistical computing and the statistical analysis that was broken by the copy-and-paste paradigm. Each chunk of R code is rendered into two parts: first, a box that contains the syntax-highlighted, tidied R code; followed by the output from those commands. In this manner, it is perfectly clear exactly what command has been run, and there is no way to fudge or edit the output from those commands. Additional content in the form of text, lists, headers, tables, external images, web links, etc. can surround the R chunks in a standard way.

(Footnote: OK, a hacker-student could edit the HTML file manually, but good luck trying to edit the figures, which are embedded as encoded text to allow them to be saved as embedded images! That is, instead of the typical configuration wherein images on a web page are stored in separate files, R Markdown converts all images to an equivalent machine-readable string (a base64 data URI) inside the HTML. This allows each rendered R Markdown document to include images without requiring external files.)

One of the major advantages of R Markdown over existing technologies, such as Sweave (Leisch 2002), is that the Markdown syntax is very simple. For example, to make a word show up in boldface, it is surrounded with double asterisks. Compare this to HTML, in which you’d have to put “<b>” before the word and “</b>” after it. Or consider LaTeX, in which you would have to encase the word: \textbf{word}. A side-by-side comparison of the alternatives is shown in Table 2.

HTML          LaTeX            R Markdown
<b>word</b>   \textbf{word}    **word**

Table 2: Comparison of syntax for typesetting “word” in bold face. The syntax employed by R Markdown does not require learning a separate set of complex rules, as does HTML or LaTeX.

To make a bulleted list in R Markdown, a series of lines are prefaced with an asterisk in exactly the manner as in a plain-text email (see Figure 1). Thus, students can learn to use R Markdown without the burden of learning a wholly new technology, such as LaTeX or HTML. The R Markdown syntax is so simple that the majority of it is presented on a short web page (RStudio 2013).

Figure 1: Bulleted list in R Markdown, input (left) and output (right).

R commands and output are distinguished from plain text with the use of chunks. Chunks begin with a series of three backticks, and conclude with three more. Figure 2 illustrates a simple chunk of R code and its rendered output. Note that the chunk in Figure 2 is named (exPlot), and sets two options to non-default values (fig.width, fig.height).

We should note that the knitr rendering engine is not specific to R or RStudio, a popular open source integrated development environment for R. The following R commands are equivalent to clicking on the “Knit HTML” button in RStudio (note the intermediate generation of a Markdown file):

    library(knitr)      # provides knit()
    library(markdown)   # provides markdownToHTML()
    knit("filename.Rmd")                            # writes the intermediate filename.md
    markdownToHTML("filename.md", "filename.html")  # renders the final HTML file
RStudio is available as either a client application or a server (cloud-based) version. The latter setup, implemented at our institutions, allows students to access and run R Markdown and RStudio through a browser, and minimizes startup time.

Moreover, while in this paper we focus on the use of R Markdown in the introductory statistics class, we should also note that, just like R, R Markdown also extends beyond the introductory classroom. Students who are introduced to the concept of reproducibility at this level carry the skills they acquire with them throughout their undergraduate career (and beyond). At the point where the simple formatting of R Markdown becomes limiting to producing high quality customizable reports, students who are familiar with LaTeX can easily transition to Sweave/knitr. In fact, at Duke University, students taking the Statistical Consulting course (STA 470) as one of the last courses in the major curriculum use Sweave/knitr to complete their data analysis assignments, as do students in Mathematical Statistics at Smith College.

Figure 2: An example of an R Markdown chunk (left) and its rendered output (right).
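
The source-file, chunk, and rendering steps described in this section can be sketched end to end in R. This is only an illustration under stated assumptions, not the course materials: the file names, the chunk label exSummary, and the use of the built-in cars data are hypothetical, and the rendering step runs only if the knitr and markdown packages happen to be installed.

```r
# Sketch only: file names and the chunk label "exSummary" are hypothetical.
fence <- strrep("`", 3)  # three backticks open and close a chunk

# Lines of a small R Markdown source file: styled text, a list, one chunk.
rmd_lines <- c(
  "Some *emphasized* and **bold** text, then a bulleted list:",
  "",
  "* first item",
  "* second item",
  "",
  paste0(fence, "{r exSummary}"),  # chunk header with a label
  "summary(cars$speed)",           # R code executed when the file is knit
  fence                            # end of the chunk
)

rmd_file <- file.path(tempdir(), "example.Rmd")
writeLines(rmd_lines, rmd_file)

# Render only if both packages are available.
if (requireNamespace("knitr", quietly = TRUE) &&
    requireNamespace("markdown", quietly = TRUE)) {
  md_file   <- file.path(tempdir(), "example.md")
  html_file <- file.path(tempdir(), "example.html")
  knitr::knit(rmd_file, output = md_file, quiet = TRUE)         # .Rmd -> .md
  markdown::markdownToHTML(file = md_file, output = html_file)  # .md -> .html
}
```

Writing to a temporary directory keeps the sketch from touching the working directory; in a course setting the .Rmd file would normally live alongside the data it analyzes.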
4. Using R Markdown in Introductory Statistics
4.1. Duke University
At Duke University in Durham, North Carolina, 272 statistics students used R Markdown during the 2012-2013 academic year (221 enrolled in STA 101 during the Fall and Spring semesters, and 51 enrolled in STA 102 during Spring). Both of these are non-calculus based introductory statistics courses usually taken by first and second year students majoring in either the social sciences or the life sciences, respectively. Only a very small number of the students each semester have any meaningful computational background. Both courses have lecture and lab components, and students used R Markdown to complete their lab assignments as well as data analysis project(s). In STA 101, students complete a simpler project on statistical inference evaluating univariate distributions or bivariate relationships (completed individually) and a more advanced project on multiple regression (completed in teams). In STA 102, the students complete an open-ended data analysis project using the appropriate methods covered in the course (completed individually).

The STA 101 course employs the flipped classroom model as well as team-based learning. Students are assigned to teams by the instructor at the beginning of the semester based on their performance on the ARTIST CAOS (Comprehensive Assessment of Outcomes in a First Statistics course) pre-test (delMas, Garfield, Ooms, and Chance 2007) and their responses to a survey on their statistics, mathematics, and computer science background as well as their interests and reasons for taking the course. The teams are created to be heterogeneous with respect to statistics experience and homogeneous with respect to student interests. Students work in teams in many components of the course, including the weekly R labs. The final product of the weekly labs is a team lab report, created using R Markdown. The labs are designed such that students complete the majority of the exercises during the lab sessions led by the teaching assistants. However, finalizing the analysis and the write-up requires spending time outside of class. Reports produced in R Markdown facilitate easy and organized sharing of the code and the write-up among the team members. Prior to integrating R Markdown into the course curriculum, students struggled with sharing their work among team members and with version control. Using R Markdown for the weekly labs allows students to work collaboratively on data analysis throughout the semester, and they reap the benefits of having developed a workflow that has reproducibility at its heart when working on their larger scale individual and team projects.

In addition, reports produced using
R Markdown present the code and the output in one place (as input and output), making it easier for students to learn R and locate the cause of an error. Likewise, uniformity of the output and the enforced structure of the reports significantly aid the instructors in debugging issues as they arise as well as simplifying the task of grading (see Appendix C for a sample lab assignment and student solution).

In previous versions of the course, prior to adopting R Markdown, labs and projects still required analyses performed in R. As the students were learning R concurrently with new statistical concepts, they would often struggle to organize their analyses. They took a trial-and-error approach to coding, and made ad-hoc changes as they went through the analysis. Despite our trying to instill best practices, most students never really developed a habit of separately saving their code. This often resulted in cluttered workspaces and R consoles, difficult-to-diagnose errors due to overwriting data, and hence student frustration. We believe that the root of the problem was that the desired final product (the lab report, the project write-up, etc.) was just a presentation of results (typed up in a word processor like Microsoft Word or Google Docs) that did not include the underlying code. On the other hand, comments from students enrolled in recent versions of the course, after adopting R Markdown, suggest that they appreciate the ease of organization of their code:

• “I think the labs have been great. Using R Markdown has been so great because we do not spend as much time solving the format/design of the paper and instead focus on actual problem solving. R is super easy to use and useful.”
• “The labs have been enjoyable, and R Markdown makes the process very easy.”
• “Labs can sometimes be troublesome and confusing, however, the TAs are very helpful. The R Markdown used to generate lab reports/proposals are very helpful for organizing our information.”

Students also commented on the usefulness of templates provided with the labs (see Appendix C). Another notable point was a general sense of excitement and interest around the labs.

• “The labs are fun. There is something satisfying about hitting ‘knit’ and having the text turn into figures and tables.”
• “I like it a lot actually. It has sparked an interest in coding for me.”

4.2. Smith College
At Smith College in Northampton, Massachusetts, 145 statistics students used
R Markdown during the 2012-2013 academic year. In the fall semester, 42 students completed MTH 245, an advanced first course in statistics for students with a calculus background. The course is worth five credits and has both lecture and lab components. These students completed most of their lab assignments in R Markdown. Furthermore, after conducting a statistical investigation involving multiple regression as part of their final project, they submitted a “technical appendix” composed in R Markdown. A total of 33 other students took a second course in statistics, MTH 247, which focused on regression analysis. These students completed all of their homework assignments in R Markdown and for their final project, submitted both a technical appendix written in R Markdown and a write-up composed in a word-processing application.

Anecdotal success with this pilot program at Smith led to the integration of R Markdown into three sections of the spring semester introductory statistics course. MTH 241 is the four-credit equivalent of MTH 245, which similarly requires calculus but does not have a lab component. As in MTH 245, the 70 students completed almost all homework assignments in R Markdown, as well as a technical appendix for their final project. These students were given surveys at the beginning and end of the semester in order to gauge their attitudes toward R and R Markdown. (This project was approved by the Smith College Institutional Review Board.) The results, which we present in detail below, suggest that:

1. Students grew to appreciate R Markdown’s ability to streamline their homework workflow. In particular, students did not prefer to copy-and-paste their work from R into Microsoft Word.
2. While students experienced frustration with both R and R Markdown, this frustration waned over the course of the semester.
3. There was little to no correlation between a student’s attitude towards R Markdown and that student’s performance in the course.
4. Lack of prior exposure to markup languages similar to R Markdown was not an impediment to learning R Markdown.

From the point of view of the instructor, while there is some overhead and growing pain required alongside the introduction of R Markdown, these hurdles will be overcome, and the benefits are well worth it. Specifically, the lesson of reproducibility is emphasized throughout the semester, homework is easier to grade, and students receive more comprehensive and specific feedback on their statistical computing than they would using the typical copy-and-paste paradigm.
Survey data
Of the aforementioned 70 students, 56 completed the Likert-scale survey shown in Appendix A at both the beginning of the semester (after some initial exposure to R and R Markdown), and at the end of the semester. A summary of their responses is shown in Table 3 and Figure 3 (Bryer and Speerschneider 2013). The responses in Table 3 are scored on a scale from −2 to 2, where −2 represents strong disagreement with a statement about R or R Markdown, and 2 represents strong agreement with that same statement. Note that only about half of the statements on the survey were worded favorably towards R Markdown, so for questions 3, 4, 6, 7, 10, and 11 the scores have been flipped (i.e. −2 represents strong agreement with a statement unfavorable to R or R Markdown).

Questions 5, 6, 7, 9, and 10 address R Markdown’s role in the data analysis workflow. For all five questions, the students’ responses were favorable at the end of the semester, and grew more favorable over the course of the semester. Most notably, while students were largely indifferent to
R Markdown’s ability to make their homework easier to read and understand at the beginning of the semester (mean initial response to R5 of 0.35), by the end of the semester most students realized this benefit (mean final response to R5 of 0.84). The mean improvement of 0.51 was among the largest changes for any of the eleven questions. Note that this question forces the students to consider the perspective of someone reading their work; it does not solely address a question in the student’s immediate self-interest.

                               Before             After              Change
Question   Idea             N   Mean (SD)      N   Mean (SD)      N   Mean (SD)
B1         prior R          56   1.30 (0.60)   56   1.34 (0.58)   56   0.04 (0.50)
B2.CSS     prior CSS        56   0.14 (0.35)   56   0.12 (0.33)   56  -0.02 (0.23)
B2.HTML    prior HTML       56   0.46 (0.50)   56   0.48 (0.50)   56   0.02 (0.45)
B2.LaTeX   prior LaTeX      56   0.07 (0.26)   56   0.07 (0.26)   56   0.00 (0.19)
B2.Wiki    prior Wiki       56   0.14 (0.40)   56   0.12 (0.33)   56  -0.02 (0.40)
B2.XML     prior XML        56   0.00 (0.19)   56   0.04 (0.19)   56   0.04 (0.19)
R1         simplicity       55  -0.30 (0.93)   56   0.24 (1.05)   55   0.53 (1.12)
R2         compilation      55  -0.53 (1.07)   56  -0.04 (1.05)   55   0.48 (1.42)
R3         RM frustration   56  -0.50 (1.04)   56  -0.10 (1.07)   56   0.40 (1.19)
R4         R frustration    55  -0.68 (0.90)   56  -0.21 (1.17)   55   0.45 (1.26)
R5         readability      55   0.35 (0.95)   56   0.84 (0.80)   55   0.51 (0.79)
R6         copy-and-paste   51   0.73 (0.94)   55   0.87 (1.06)   50   0.10 (0.99)
R7         coercion         53   0.35 (0.99)   55   0.55 (1.02)   52   0.24 (1.03)
R8         improvement      56   0.22 (0.83)   56   0.83 (0.75)   56   0.61 (0.94)
R9         ease             55   0.08 (0.89)   56   0.33 (0.93)   55   0.25 (1.03)
R10        difficulty       55  -0.05 (1.00)   55   0.30 (1.00)   55   0.35 (0.99)
R11        training         56  -1.46 (0.79)   56  -1.51 (0.79)   56  -0.04 (0.95)

Table 3: Summary of before and after responses to questionnaire. Responses were scored according to the scale: no opinion = N/A, strongly disagree = −2, disagree = −1, indifferent = 0, agree = 1, strongly agree = 2. The responses to questions 3, 4, 6, 7, 10, and 11 have been flipped. Thus, higher scores are more favorable to R and R Markdown, and lower scores are less favorable. Note that what is being shown in the third group of columns is the mean change in response, not the change in mean response.

Moreover, while the initial responses to questions 9 and 10 were indistinguishable from zero, by the end of the semester there was mild agreement that
R Markdown makes it easier for students to complete their homework. Thus, students acknowledged that R Markdown, in addition to being a benefit to their audience (as demonstrated by question 5), was of a mild benefit to them.

Questions 6 and 7 address the possibility of alternative workflows. In question 7, students expressed a mild lack of resentment at being forced to use R Markdown. However, residual resentment waned over the course of the semester. More interestingly, students were quite opposed to the typical workflow which would require them to copy-and-paste their results from R into Word. Moreover, there was little change in these responses over the course of the semester. Thus, the results suggest that not only do students prefer R Markdown to Word after having used it all semester long, but that they never preferred to use Word in the first place. This should help to encourage those instructors who are most comfortable in Word to consider making a change. It does not appear that these students were beholden to word processing applications.

[Figure 3 plot: paired Before/After stacked bars for each of the statements below; x-axis: Percentage; legend (Response): Strongly Disagree, Disagree, Indifferent, Agree, Strongly Agree.]

R1: I find the R Markdown syntax to be simple and understandable.
R2: When my Markdown document does not compile, I know how to go about fixing it.
R3: I am frequently frustrated by R Markdown when doing my homework.
R4: I am frequently frustrated by R when doing my homework.
R5: R Markdown makes my homework easier to read and understand.
R6: I would rather copy and paste my results (plots, tables, and numbers) into a word processing program (e.g. Word).
R7: I resent being forced to use R Markdown. It should be my choice how I prepare my homework.
R8: I found R Markdown to be frustrating at first, but now I've got the hang of it.
R9: R Markdown makes it easier for me to complete my homework.
R10: R Markdown makes it more difficult for me to complete my homework.
R11: I wish I had received a more thorough introduction to the logic and features of R Markdown.

Figure 3: Results from Likert scale R Markdown survey administered at Smith College, also summarized in Table 3. Responses from students who circled more than one answer were rounded to the more extreme value.

Questions 3, 4, and 8 address the issue of frustration with R and R Markdown. Here, it is expected that many students will express frustration with R, which is an admittedly expert-friendly software package. The data suggest that while initial frustration with both R and R Markdown was reasonably high, by the end of the semester it had largely dissipated. In particular, frustration with
R Markdown was negligible by the end of the semester, and frustration with R was considerablydiminished. This notion was addressed more directly by question 8, which offered the largest changeover the course of the semester (0.61). Here, most students agreed that they were frustrated by RMarkdown at first, but had gotten the hang of it by the end of the semester.Questions 1, 2, and 11 address the students’ experience working with
R Markdown . On Question11, students were almost unanimous is their desire to have received a more thorough introductionto the logic and features of
R Markdown . Unlike the previous questions, this initial reaction wasnot moderated over the course of semester. While it is expected that many students will requestadditional help in working with new technologies, future versions of the course will include somekind of “workshop” during the first month that eases the adoption of R and R Markdown . On theother hand, questions 1 and 2 show evidence of student growth. While many students did notfind
R Markdown to be particularly simple and understandable upon initial exposure, by the endof the semester they mildly supported the claim that
R Markdown was simple and understandable.erhaps more importantly, students showed a marked improvement in their ability to debug
RMarkdown . At the beginning of the semester, many students did not feel as though they knew howto fix compilation errors in
R Markdown , but by the end of the semester, they did not disagree (to astatistically significant extent) with the notion that they could debug their own
R Markdown errors.
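The scoring scheme in the caption of Table 3, and the distinction it draws between the mean change and the change in means, can be illustrated with a small sketch. The responses below are hypothetical and are not the survey data; the names `scale_map` and `score_item` are introduced here only for illustration.

```r
# Scoring scale from the Table 3 caption; "no opinion" is treated as missing.
scale_map <- c("strongly disagree" = -2, "disagree" = -1, "indifferent" = 0,
               "agree" = 1, "strongly agree" = 2)

# Convert text responses to scores; flip = TRUE reverses negatively worded items.
score_item <- function(responses, flip = FALSE) {
  scores <- unname(scale_map[responses])  # labels not in scale_map become NA
  if (flip) -scores else scores
}

# Hypothetical responses from four students to one item (illustration only).
before <- score_item(c("disagree", "no opinion", "agree", "indifferent"))
after  <- score_item(c("agree", "agree", "strongly agree", "no opinion"))

# Each mean uses every student who answered at that time point.
change_in_means <- mean(after, na.rm = TRUE) - mean(before, na.rm = TRUE)

# The mean change uses only students who answered at both time points.
mean_change <- mean(after - before, na.rm = TRUE)

c(change_in_means = change_in_means, mean_change = mean_change)
```

In this toy example the two quantities differ because the paired differences drop any student with a missing response at either time point. That may also be why the sample sizes in the third group of columns of Table 3 are sometimes smaller than those in the first two.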
Anticipated Problems
Two fears that we had did not seem to be supported by the data. First, we feared that since the use of R Markdown was so thoroughly integrated into the course, and so vital for completing the homework (which constituted 20% of the total grade for the course), students who viewed R Markdown more favorably would be advantaged with respect to their overall grade in the course. Second, we feared that students who had stronger prior exposure to technologies similar to R Markdown would have an easier time completing their assignments. More specifically, we feared that students who had not been exposed to technologies similar to R Markdown would suffer since they might have to spend more time on their homework. Neither of these fears was borne out in the data.

To test these hypotheses, we examined the relationships between survey responses at the beginning and end of the semester, and two measures of performance in the course: the student’s final course grade, and her score on the Comprehensive Assessment of Outcomes in a First Statistics Course (CAOS; delMas et al. (2007)) post-test. None of the correlations between the scores on each of the 11 questions and the student’s final course grade were statistically significant at the 5% level. Only two (R5 and R6) were significant at the 10% level, with both indicating weak positive association with final course grade (0.23 and 0.26, respectively). Correlations between the responses and CAOS scores revealed a similar lack of association (R6 showed a correlation of 0.27, but a 95% confidence interval [ − . , . R Markdown not an important indicator of their performance in the course or their absorption of statistical material, but neither was their change in attitude towards R Markdown over the course of the semester.

We acknowledge that none of the measures of statistical significance reported were corrected for multiple comparisons. However, since the purpose of this analysis is to show that there is little statistical evidence of correlation between attitudes towards R Markdown and performance in the course, and a multiple comparisons correction would only weaken any claims of statistical significance, we do not feel that this omission detracts from our findings.

Similarly, prior exposure to R Markdown-like technologies did not appear to be associated with student performance. First of all, only one quarter (14 of the 56) of the students had ever heard of R prior to taking the course, and only two had used it. Only four students had prior exposure to LaTeX, and only nine reported having edited a Wiki. While eight students had seen Cascading Style Sheets (CSS), all eight had prior exposure to Hypertext Markup Language (HTML), along with 18 students who had used HTML but not CSS. Thus, HTML was the only prior technology with which students had reasonably varied experience. While there was no association between prior exposure to HTML and score on the CAOS exam, there was a borderline statistically significant negative correlation between prior exposure to HTML and final course grade ( p = 0 .

Ancillary Outcomes
Finally, the end-of-semester responses to questions 5 and 6 deserve a moment’s reflection in their own right. For the most part, students agreed (0.84) that R Markdown made their homework easier to read and understand. [What they were comparing it to, perhaps handwritten work or output pasted into a Word document, is left open.] Moreover, they would not rather (0.87) copy and paste their homework into Microsoft Word. We interpret the responses to question 5 as an affirmation of R Markdown’s usefulness for students, and note that this perception grew over the course of the semester. The responses to question 6 confirm that working with R Markdown for a semester, and the occasional frustration that goes along with it, did not make students yearn for a return to Word. While this attitude did not change much over the course of the semester, it reveals the perhaps surprising discovery that even students who had never heard of R a few weeks before would not rather copy and paste their statistical results into Word as part of their homework preparation. We interpret these findings as further evidence that open-source tools are perfectly suitable for use even in introductory statistics courses at the undergraduate level.
5. Discussion
Having presented a motivation for using R Markdown in introductory statistics, described the technology, and reviewed our experience using it, we close with a discussion of some additional benefits, challenges, and limitations.
One of the benefits of using R Markdown in both the introductory and intermediate statistics courses is the development of knowledge within the institution. At Smith, one of five statistics teaching assistants is available for two hours each night from Sunday to Thursday. All of these students are now familiar with R Markdown and capable of helping introductory students with common problems. In good faith, we present some of those issues below.

• Workspace confusion: Many errors result from a failure to understand that each R Markdown file, when compiled, runs in a fresh workspace that does not have access to any of the objects in the existing workspace active in RStudio.
  – Failure to load packages: Students will often forget to load additional packages in their R Markdown scripts (e.g. require(mosaic)).
  – Reading external data files: Students often forget to add the read.csv() call to their R Markdown file after loading the data into their workspace by typing the command in the console.
• Improper use of chunks: Students often forget to put their R code into a valid chunk. A useful solution is to tell them to select “Insert Chunk” from the green Chunks menu whenever they want to enter commands.
• Forgetting to close quotes or parentheses or chunks: Syntax highlighting in RStudio mitigates this issue, but it still arises.
• Issues specific to R as opposed to R Markdown: Invalid syntax for commands.
• Debugging: The compilation errors that occur when R Markdown is rendered are not always straightforward to interpret. Thus, students occasionally have a hard time identifying the particular command that is causing the problem. This can be mitigated by encouraging students to name their chunks and to pursue common process-of-elimination debugging techniques.
• Package versioning: In some cases the packages on a student’s machine may become out-of-date or out-of-sync. Encouraging them to keep all of their packages up-to-date (especially knitr) with the update.packages() command usually provides a solution. Alternatively, encouraging students to use a server version of RStudio (administered by your institution) can be an effective solution.
• Formatting: While R Markdown is capable of implementing basic formatting operations, many more advanced features are not available. Some of the more useful and accessible options are:
  – Gratuitous output: Without the message=FALSE option in an R Markdown chunk, unwanted messages are rendered in the output.
  – Plot size: The size of a rendered plot can be changed by using the fig.width and fig.height chunk options.
  – Chunk naming: Assigning a name to each R chunk is helpful with debugging.
• When all else fails: Restarting RStudio can solve many problems. Any package can be safely removed and reinstalled. Occasionally doing this will solve less obvious problems.

5.2. Limitations
While R Markdown is suitable for many purposes, it has a few limitations that may prove problematic. Specifically:

• While objects defined in previous chunks become part of the workspace and are thus available for later use, plots defined in previous chunks cannot be modified by later chunks. The most common work-around for this issue is to create a plot in a single chunk or assign the output of a plot to an object that can be printed in a subsequent chunk.
• There is no easy way to count words or pages in the rendered R Markdown output. This makes it difficult to check to see if a submitted homework assignment meets any such guidelines.

Use of the default formatting options in R Markdown can result in very long documents. If the rendered HTML file that a student wishes to submit is very long, then it can quickly become cumbersome and even expensive to print it out and submit a hard copy. On the other hand, if the document is to be submitted and evaluated electronically, the length of the document may be of no real concern. Thus, while use of even basic non-default formatting options can dramatically reduce the length of rendered R Markdown documents, there is a sense in which moving to electronic submission and grading will mesh well with R Markdown adoption. Indeed, if the grader knows HTML, it is even possible to give inline feedback on a student’s submission. (This process has been implemented at two of our institutions.)

Given the interest in having students collaborate on projects at the introductory level (Halvorsen and Moore 2001), streamlining a collaborative workflow is worthwhile. R Markdown provides such a mechanism in part due to its inherent emphasis on reproducibility. Students working together are able to follow, and even extend, each other’s work with minimal effort. Nevertheless, a fool-proof solution for having multiple students edit the same R Markdown file simultaneously does not yet exist. The use of an RStudio server, or a third-party file synchronization solution (e.g. Dropbox), can provide a functional workaround. Future versions of RStudio may also include additional features designed to facilitate real-time collaboration.

Another component of reproducibility relates to the version of R and its associated packages, which are often updated. While somewhat beyond the scope of this manuscript, further efforts to facilitate the reproduction of analyses that require specific (older) versions of packages will be needed.

It is worth noting that knitr provides functionality for condensing an R Markdown file into a conventional R script, and vice versa. More generally, those who are comfortable working with R scripts will find it easy to augment those scripts into R Markdown files, which will retain the ability to send successive R commands to the current console.

It would be interesting to assess the extent to which students absorb the importance of reproducibility in this course. Adding an assessment that specifically addresses reproducibility and is presented to students with a set of concrete learning objectives is something that is under consideration and a topic of future work. However, it is not trivial to add material to an already busy introductory statistics curriculum, and doing so requires careful consideration of the existing material and assessments.

On a cautionary note, we remind the reader that due to the multiple uncorrected tests we ran, the claims of statistical significance made in Section 4.2.2 should not be overstated.
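Several of the common problems listed earlier in this section (the fresh workspace, forgotten packages and data, unnamed chunks, unwanted messages, plot size) can be seen together in one minimal document. The sketch below writes such a file from R. The file name `hw1.Rmd`, the chunk names, and the chunk contents are hypothetical and are not taken from the course materials.

```r
# Build a minimal R Markdown file as text (all names here are illustrative).
fence <- strrep("`", 3)  # three backticks, built here to avoid nesting fences

rmd <- c(
  "Homework 1",
  "==========",
  "",
  # Named chunk; message=FALSE keeps package start-up messages out of the output
  paste0(fence, "{r setup, message=FALSE}"),
  "# The document compiles in a fresh workspace, so load everything here",
  "library(datasets)   # stand-in for a course package such as mosaic",
  "# dat <- read.csv('mydata.csv')   # hypothetical file; read data inside the document",
  fence,
  "",
  # Named chunk with an explicit plot size
  paste0(fence, "{r scatterplot, fig.width=5, fig.height=4}"),
  "plot(mpg ~ wt, data = mtcars)",
  fence
)

path <- file.path(tempdir(), "hw1.Rmd")
writeLines(rmd, path)

# To compile, open the file in RStudio and knit it, or (if the rmarkdown
# package is installed) run: rmarkdown::render(path)
```

Because every package and data set the document needs is loaded inside the file itself, it should compile the same way regardless of what happens to be in the interactive workspace. The chunk names may also make it easier to trace a compilation error back to the chunk that caused it.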
6. Conclusion
The aforementioned Nature editorial (Nature Editorial 2013) concludes with a call to action: “We urge others to take note . . . and do whatever they can to improve research reproducibility” (p. 398). As statistics educators, we are the members of the scientific community who are best suited to, and responsible for, addressing this challenge. R Markdown is a new technology that integrates seamlessly into existing computational work done with R within RStudio. With appropriate support mechanisms, introductory statistics students are receptive to its adoption. In our experience at two very different institutions with very different student bodies, R Markdown was a welcome improvement over the traditional copy-and-paste workflow. Students left the course equipped with functional skills that will help them in any future quantitative endeavor.
7. Acknowledgements
This work was partially supported by Project MOSAIC, US National Science Foundation (DUE-0920350).
References
Allaire, J., Horner, J., Marti, V., and Porte, N. (2013), markdown: Markdown rendering for R, R package version 0.6.3, http://CRAN.R-project.org/package=markdown.

Bryer, J. and Speerschneider, K. (2013), likert: Functions to analyze and visualize likert type items, R package version 1.1, http://CRAN.R-project.org/package=likert.

Buckheit, J. B. and Donoho, D. L. (1995), “Wavelab and reproducible research,” Tech. Rep. 474, Stanford University, http://statweb.stanford.edu/~wavelab/Wavelab_850/wavelab.pdf.

Claerbout, J. (1994), “Hypertext documents about reproducible research,” Tech. rep., Stanford University.

Cobb, G. W. (2007), “The Introductory Statistics Course: A Ptolemaic Curriculum?” Technology Innovations in Statistics Education (TISE), 1, http://escholarship.org/uc/item/6hb3k0nz.

delMas, R., Garfield, J., Ooms, A., and Chance, B. (2007), “Assessing Students’ Conceptual Understanding after a First Course in Statistics,” Statistics Education Research Journal, 6, 28–58, https://apps3.cehd.umn.edu/artist/caos.html.

Fomel, S. and Claerbout, J. F. (2009), “Guest Editors’ Introduction: Reproducible Research,” Computing in Science & Engineering, 11, 5–7.

GAISE College Group (2005), “Guidelines for Assessment and Instruction in Statistics Education,” Tech. rep., American Statistical Association, accessed August 15, 2013.

Gandrud, C. (2013), Reproducible Research With R and RStudio, Chapman & Hall/CRC.

Gentleman, R. and Temple Lang, D. (2004), “Statistical analyses and reproducible research,” Bioconductor Project Working Papers, Working Paper 2, http://biostats.bepress.com/bioconductor/paper2.

Hall, M. R. and Rowell, G. H. (2008), “Introductory statistics education and the National Science Foundation,” Journal of Statistics Education, 16.

Halvorsen, K. T. and Moore, T. L. (2001), “Motivating, monitoring, and evaluating student projects,” MAA Notes, 27–32.

Ioannidis, J. P. (2013), “This I believe in genetics: discovery can be a nuisance, replication is science, implementation matters,” Frontiers in Genetics, 4.

Johnson, G. (2014), “New truths that only one can see,” The New York Times.

Knuth, D. E. (1984), “Literate programming,” The Computer Journal, 27, 97–111.

Leisch, F. (2002), “Sweave: Dynamic generation of statistical reports using literate data analysis,” in Compstat, Springer, pp. 575–580.

Nature Editorial (2013), “Announcement: Reducing our irreproducibility,” Nature, 496.

Nolan, D. and Temple Lang, D. (2010), “Computing in the statistics curricula,” The American Statistician, 64, 97–107.

R Core Team (2013), R: A Language and Environment for Statistical Computing, R Foundation for Statistical Computing, Vienna, Austria, ISBN 3-900051-07-0.

RStudio (2013), Using R Markdown with RStudio.

The Economist Editorial (2013), “Trouble at the lab. (Cover story).”

Xie, Y. (2014), Dynamic Documents with R and knitr, Chapman & Hall/CRC.

A. R Markdown Survey
This survey is part of an ongoing research study to help improve the use of technology in introductory statistics courses. Responses will be merged with other assessment data from the class to create a de-identified research dataset accessible only to the instructor, and the original forms will be destroyed. Only aggregate data will be included in any reports.

The decision to participate in this study is entirely up to you. You may refuse to take part in the study at any time without affecting your relationship with the investigators of this study, your grade in the course or Smith College. You have the right not to answer any single question. You are under no obligation to complete the survey. Your submission of the completed survey constitutes your consent to use of the data within these constraints.

If you have any further questions about the study, at any time feel free to contact Nicholas Horton at [email protected] or by telephone at 413-585-3688. If you like, a summary of the results of the study will be sent to you. If you have any other concerns about your rights as a research participant that have not been answered by the investigators, you may contact Phil Peake, Co-chair of the Smith College Institutional Review Board at (413) 585-3914.

Your Name:

Background

1. How often had you used R prior to taking this course (circle one)?
   had never heard of it / never / infrequently / a few times / frequently
2. To which of the following markup languages had you been exposed prior to taking this course (circle all that apply)?
   HTML / CSS / XML / LaTeX / Wikipedia (editing)

R Markdown

Please indicate the response that most closely matches your attitude towards each of the following statements.

1. I find the R Markdown syntax to be simple and understandable.
   no opinion / strongly disagree / disagree / indifferent / agree / strongly agree
2. When my Markdown document does not compile, I know how to go about fixing it.
   no opinion / strongly disagree / disagree / indifferent / agree / strongly agree
3. I am frequently frustrated by R Markdown when doing my homework.
   no opinion / strongly disagree / disagree / indifferent / agree / strongly agree
4. I am frequently frustrated by R when doing my homework.
   no opinion / strongly disagree / disagree / indifferent / agree / strongly agree
5. R Markdown makes my homework easier to read and understand.
   no opinion / strongly disagree / disagree / indifferent / agree / strongly agree
6. I would rather copy and paste my results (plots, tables, and numbers) into a word processing program (e.g. Word).
   no opinion / strongly disagree / disagree / indifferent / agree / strongly agree
7. I resent being forced to use R Markdown. It should be my choice how I prepare my homework.
   no opinion / strongly disagree / disagree / indifferent / agree / strongly agree
8. I found R Markdown to be frustrating at first, but now I’ve got the hang of it.
   no opinion / strongly disagree / disagree / indifferent / agree / strongly agree
9. R Markdown makes it easier for me to complete my homework.
   no opinion / strongly disagree / disagree / indifferent / agree / strongly agree
10. R Markdown makes it more difficult for me to complete my homework.
   no opinion / strongly disagree / disagree / indifferent / agree / strongly agree
11. I wish I had received a more thorough introduction to the logic and features of R Markdown.
   no opinion / strongly disagree / disagree / indifferent / agree / strongly agree

B. Introducing R Markdown in class
A Prezi introducing the features of R Markdown and its use in lab reports can be found at: http://prezi.com/dvmgx17e_was/reproducible/?utm_campaign=share&utm_medium=copy . Figure 4 provides two sample slides, diagramming the difference between the two workflows.

Figure 4: The traditional workflow, characterized by a separation between the data analysis and the interpretation that are then fused together by copy-and-paste. By contrast, the R Markdown workflow integrates these two components into a single document.
C. Sample assignment and solution
A sample lab assignment and student solution are included below.

Lab 1: Introduction to data

Some define Statistics as the field that focuses on turning information into knowledge. The first step in that process is to summarize and describe the raw information - the data. In this lab, you will gain insight into public health by generating simple graphical and numerical summaries of a data set collected by the Centers for Disease Control and Prevention (CDC). As this is a large data set, along the way you’ll also learn the indispensable skills of data processing and subsetting.
Template for lab report
Before you begin the lab, download the lab report template. This template makes it very simple to include code and output in your write up from within RStudio as well as ensuring reproducibility of your results.

    download.file("http://stat.duke.edu/courses/Summer13/sta104.01-1/labs/lab1.Rmd", destfile = "lab1.Rmd")

Click on the file called lab1.Rmd under the Files tab on the bottom right pane of your RStudio window. Insert your team name, name of the “author” for the week, and the names of the “discussants” (other team members present in lab today). Use the allotted spaces to enter your responses. For questions that require R code or a plot, space has been provided for you to enter the relevant code.
Getting started
The Behavioral Risk Factor Surveillance System (BRFSS) is an annual telephone survey of 350,000 people in the United States. As its name implies, the BRFSS is designed to identify risk factors in the adult population and report emerging health trends. For example, respondents are asked about their diet and weekly physical activity, their HIV/AIDS status, possible tobacco use, and even their level of healthcare coverage. The BRFSS Web site contains a complete description of the survey, including the research questions that motivate the study and many interesting results derived from the data.

We will focus on a random sample of 20,000 people from the BRFSS survey conducted in 2000. While there are over 200 variables in this data set, we will work with a small subset.

We begin by loading the data set of 20,000 observations into the R workspace. After launching RStudio, enter the following command.

The data set cdc that shows up in your workspace is a data matrix, with each row representing a case and each column representing a variable. R calls this data format a data frame, which is a term that will be used throughout the labs.

To view the names of the variables, type the command

    names(cdc)

This returns the names genhlth, exerany, hlthplan, smoke100, height, weight, wtdesire, age, and gender. Each one of these variables corresponds to a question that was asked in the survey. For example, for genhlth, respondents were asked to evaluate their general health, responding either excellent, very good, good, fair or poor. The exerany variable indicates whether the respondent exercised in the past month (1) or did not (0). Likewise, hlthplan indicates whether the respondent had some form of health coverage (1) or did not (0). The smoke100 variable indicates whether the respondent had smoked at least 100 cigarettes in her lifetime. The other variables record the respondent’s height in inches, weight in pounds as well as their desired weight, wtdesire, age in years, and gender.

This is a product of OpenIntro that is released under a Creative Commons Attribution-ShareAlike 3.0 Unported license (http://creativecommons.org/licenses/by-sa/3.0). This lab was adapted for OpenIntro by Andrew Bray and Mine Çetinkaya-Rundel from a lab written by Mark Hansen of UCLA Statistics.

Exercise 1
How many cases are there in this data set? How many variables? For each variable, identify its data type (e.g. categorical, discrete).

We can have a look at the first few entries (rows) of our data with the command

    head(cdc)

and similarly we can look at the last few by typing

    tail(cdc)

You could also look at all of the data frame at once by typing its name into the console, but that might be unwise here. We know cdc has 20,000 rows, so viewing the entire data set would mean flooding your screen. It’s better to take small peeks at the data with head, tail or the subsetting techniques that you’ll learn in a moment.

Summaries and tables
The BRFSS questionnaire is a massive trove of information. A good first step in any analysis is to distill all of that information into a few summary statistics and graphics. As a simple example, the function summary returns a numerical summary: minimum, first quartile, median, mean, third quartile, and maximum. For weight this is

    summary(cdc$weight)

R also functions like a very fancy calculator. If you wanted to compute the interquartile range for the respondents’ weight, you would look at the output from the summary command above and then enter

    190 - 140

R also has built-in functions to compute summary statistics one by one. For instance, to calculate the mean, median, and variance of weight, type

    mean(cdc$weight)
    var(cdc$weight)
    median(cdc$weight)
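As a small sketch of the same idea, the interquartile range can also be taken from the stored summary or computed directly with the base R function IQR. The vector `w` below is a hypothetical stand-in for cdc$weight, used so that the example runs without the lab data.

```r
# Hypothetical weights in pounds (illustration only; not the BRFSS data)
w <- c(120, 140, 150, 165, 180, 190, 210)

s <- summary(w)   # Min., 1st Qu., Median, Mean, 3rd Qu., Max.
s

# Interquartile range "by hand": third quartile minus first quartile
iqr_by_hand <- as.numeric(s["3rd Qu."] - s["1st Qu."])
iqr_by_hand

# The same quantity computed directly
IQR(w)
```

Storing the summary in an object avoids retyping numbers read off the screen, which is in keeping with the reproducible workflow the lab template is meant to support.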
While it makes sense to describe a quantitative variable like weight in terms of these statistics, what about categorical data? We would instead consider the sample frequency or relative frequency distribution. The function table does this for you by counting the number of times each kind of response was given. For example, to see the number of people who have smoked at least 100 cigarettes in their lifetime, type

    table(cdc$smoke100)

or instead look at the relative frequency distribution by typing

    table(cdc$smoke100)/20000

Notice how R automatically divides all entries in the table by 20,000 in the command above. This is similar to something we observed in the last lab; when we multiplied or divided a vector with a number, R applied that action across entries in the vectors. As we see above, this also works for tables. Next, we make a bar plot of the entries in the table by putting the table inside the barplot command.

    barplot(table(cdc$smoke100))

Notice what we’ve done here! We’ve computed the table of cdc$smoke100 and then immediately applied the graphical function, barplot. This is an important idea: R commands can be nested. You could also break this into two steps by typing the following:

    smoke <- table(cdc$smoke100)
    barplot(smoke)

Here, we’ve made a new object, a table, called smoke (the contents of which we can see by typing smoke into the console) and then used it as the input for barplot. The special symbol <- performs an assignment, taking the output of one line of code and saving it into an object in your workspace. This is another important idea that we’ll return to later.

Exercise 2
Create a numerical summary for height and age, and compute the interquartile range for each.

Exercise 3

Compute the relative frequency distribution for gender and genhlth. How many males are in the sample? What proportion of the sample reports being in excellent health?

The table command can be used to tabulate any number of variables that you provide. For example, to examine which participants have smoked across each gender, we could use the following.

    table(cdc$gender, cdc$smoke100)

Here, we see column labels of 0 and 1. Recall that 1 indicates a respondent has smoked at least 100 cigarettes. The rows refer to gender. To create a mosaic plot of this table, we would enter the following command.

    mosaicplot(table(cdc$gender, cdc$smoke100))

We could have accomplished this in two steps by saving the table in one line and applying mosaicplot in the next (see the table/barplot example above).
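A two-way table can also be turned into proportions, which may make the pattern easier to read than raw counts. The sketch below uses a hypothetical six-row data frame, `toy`, that borrows the lab’s variable names. It is not the BRFSS data, and prop.table is a base R function that the lab itself does not use.

```r
# Hypothetical mini data set using the lab's variable names (not the BRFSS data)
toy <- data.frame(
  gender   = c("m", "f", "f", "m", "f", "m"),
  smoke100 = c(1, 0, 1, 1, 0, 0)
)

# Rows are gender, columns are smoke100 (0 = no, 1 = yes)
counts <- table(toy$gender, toy$smoke100)
counts

# Each cell as a share of all respondents
overall <- prop.table(counts)
overall

# Shares within each gender, so every row sums to 1
within_gender <- prop.table(counts, margin = 1)
within_gender
```

The row-wise version answers the question a mosaic plot poses visually: within each gender, what share has smoked at least 100 cigarettes?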
Exercise 4
What does the mosaic plot reveal about smoking habits and gender?
Interlude: How R thinks about data
We mentioned that R stores data in data frames, which you might think of as a type of spreadsheet. Each row is a different observation (a different respondent) and each column is a different variable (the first is genhlth, the second exerany and so on). We can see the size of the data frame next to the object name in the workspace or we can type

    dim(cdc)

which will return the number of rows and columns. Now, if we want to access a subset of the full data frame, we can use row-and-column notation. For example, to see the sixth variable of the 567th respondent, use the format

    cdc[567, 6]

which means we want the element of our data set that is in the 567th row (meaning the 567th person or observation) and the 6th column (in this case, weight). We know that weight is the 6th variable because it is the 6th entry in the list of variable names

    names(cdc)

To see the weights for the first 10 respondents we can type

    cdc[1:10, 6]

In this expression, we have asked just for rows in the range 1 through 10. R uses the “:” to create a range of values, so 1:10 expands to 1, 2, 3, 4, 5, 6, 7, 8, 9, 10. You can see this by entering

    1:10

Finally, if we want all of the data for the first 10 respondents, type

    cdc[1:10, ]

By leaving out an index or a range (we didn’t type anything between the comma and the square bracket), we get all the columns. When starting out in R, this is a bit counterintuitive. As a rule, we omit the column number to see all columns in a data frame. Similarly, if we leave out an index or range for the rows, we would access all the observations, not just the 567th, or rows 1 through 10. Try the following to see the weights for all 20,000 respondents fly by on your screen

    cdc[, 6]

Recall that column 6 represents respondents’ weight, so the command above reported all of the weights in the data set. An alternative method to access the weight data is by referring to the name. Previously, we typed names(cdc) to see all the variables contained in the cdc data set. We can use any of the variable names to select items in our data set.

    cdc$weight

The dollar-sign tells R to look in data frame cdc for the column called weight. Since that’s a single vector, we can subset it with just a single index inside square brackets. We see the weight for the 567th respondent by typing

    cdc$weight[567]

Similarly, for just the first 10 respondents

    cdc$weight[1:10]

The command above returns the same result as the cdc[1:10, 6] command. Both row-and-column notation and dollar-sign notation are widely used; which one you choose to use depends on your personal preference.
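The indexing rules above can be tried on a data frame small enough to check by eye. The four-row data frame `toy` below is hypothetical and stands in for cdc.

```r
# Hypothetical four-row data frame (illustration only; not the BRFSS data)
toy <- data.frame(
  age    = c(23, 35, 41, 29),
  weight = c(150, 180, 165, 130)
)

toy[2, 2]            # row 2, column 2: a single value
toy[1:3, 2]          # rows 1 to 3 of column 2: a vector
toy[1:3, ]           # rows 1 to 3, all columns: still a data frame
toy[, 2]             # all rows of column 2: a vector
toy$weight[1:3]      # dollar-sign notation, same vector as toy[1:3, 2]
toy[1:3, "weight"]   # a column can also be selected by name inside brackets
```

Selecting a column by name rather than by position may be the safer habit, since the code keeps working if the columns are later reordered.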
A little more on subsetting
It’s often useful to extract all individuals (cases) in a data set that have specific characteristics. We accomplish this through conditioning commands. First, consider expressions like

    cdc$gender == "m"

or

    cdc$age > 30

These commands produce a series of TRUE and FALSE values. There is one value for each respondent, where TRUE indicates that the person was male (via the first command) or older than 30 (second command).

Suppose we want to extract just the data for the men in the sample, or just for those over 30. We can use the R function subset to do that for us. For example, the command

    mdata <- subset(cdc, cdc$gender == "m")

will create a new data set called mdata that contains only the men from the cdc data set. In addition to finding it in your workspace alongside its dimensions, you can take a peek at the first several rows as usual

    head(mdata)
This new data set contains all the same variables but just under half the rows. It is also possible to tell R to keep only specific variables, which is a topic we’ll discuss in a future lab. For now, the important thing is that we can carve up the data based on values of one or more variables.

As an aside, you can use several of these conditions together with & and |. The & is read “and” so that

    m_and_over30 <- subset(cdc, cdc$gender == "m" & cdc$age > 30)

will give you the data for men over the age of 30. The | character is read “or” so that

    m_or_over30 <- subset(cdc, cdc$gender == "m" | cdc$age > 30)

will take people who are men or over the age of 30 (why that’s an interesting group is hard to say, but right now the mechanics of this are the important thing). In principle, you may use as many “and” and “or” clauses as you like when forming a subset.

Exercise 5
Create a new object called under23_and_smoke that contains all observations of respondents under the age of 23 that have smoked 100 cigarettes in their lifetime. Write the command you used to create the new object as the answer to this exercise.
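To see how the TRUE and FALSE values drive subset, it can help to store the conditions as objects and inspect them before subsetting. The following is a minimal sketch on invented data: the data frame toy and the names is_male, over_30, both, and either are hypothetical and do not appear in the lab.

```r
# Hypothetical stand-in for cdc; the values are invented for illustration.
toy <- data.frame(
  age    = c(25, 41, 33, 58, 19, 30),
  gender = c("m", "f", "m", "m", "f", "m")
)

is_male <- toy$gender == "m"   # one TRUE/FALSE value per row
over_30 <- toy$age > 30        # strictly greater, so age 30 gives FALSE

sum(is_male)                   # TRUE counts as 1, so this counts the men

both   <- subset(toy, is_male & over_30)   # rows where both conditions hold
either <- subset(toy, is_male | over_30)   # rows where at least one holds

nrow(both)
nrow(either)
```

Because "or" keeps a row when either condition is TRUE, the "or" subset can never have fewer rows than the "and" subset built from the same two conditions.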
Quantitative data
With our subsetting tools in hand, we’ll now return to the task of the day: making basic summaries of the BRFSS questionnaire. We’ve already looked at categorical data such as smoke and gender, so now let’s turn our attention to quantitative data. Two common ways to visualize quantitative data are with box plots and histograms. We can construct a box plot for a single variable with the following command.

    boxplot(cdc$height)
You can compare the locations of the components of the box by examining the summary statistics.

    summary(cdc$height)
Confirm that the median and upper and lower quartiles reported in the numerical summary match those in the graph. The purpose of a box plot is to provide a thumbnail sketch of a variable for the purpose of comparing across several categories. So we can, for example, compare the heights of men and women with

    boxplot(cdc$height ~ cdc$gender)
The notation here is new. The ~ character can be read “versus” or “as a function of”. So we’re asking R to give us box plots of heights where the groups are defined by gender.

Next let’s consider a new variable that doesn’t show up directly in this data set: Body Mass Index (BMI). BMI is a weight to height ratio and can be calculated as
$$\mathrm{BMI} = \frac{\text{weight (lb)}}{\text{height (in)}^2} \times 703^{\dagger}$$

† 703 is the approximate conversion factor to change units from metric (meters and kilograms) to imperial (inches and pounds).

The following two lines first make a new object called bmi and then create box plots of these values, defining groups by the variable cdc$genhlth.

    bmi <- (cdc$weight/cdc$height^2) * 703
    boxplot(bmi ~ cdc$genhlth)

Notice that the first line above is just some arithmetic, but it’s applied to all 20,000 numbers in the cdc data set. That is, for each of the 20,000 participants, we take their weight, divide by their height squared, and then multiply by 703. The result is 20,000 BMI values, one for each respondent. This is one reason why we like R: it lets us perform computations like this using very simple expressions.
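The element-by-element arithmetic described above can be illustrated on a few invented values. In this sketch the weights and heights are hypothetical, the explicit loop is included only for comparison, and the unit conversions (pounds per kilogram, inches per meter) are standard values that are assumed here rather than taken from the lab.

```r
# Where the factor 703 comes from: metric BMI is kg / m^2, so converting
# pounds and inches needs (inches per meter)^2 / (pounds per kilogram).
lb_per_kg  <- 2.20462
in_per_m   <- 39.3701
conversion <- in_per_m^2 / lb_per_kg   # close to 703

# Hypothetical weights (lb) and heights (in) for three respondents.
weight <- c(150, 182, 135)
height <- c(65, 70, 62)

# One expression, applied to every element at once.
bmi_vectorized <- (weight / height^2) * 703

# The same computation written as an explicit loop, for comparison.
bmi_loop <- numeric(length(weight))
for (i in seq_along(weight)) {
  bmi_loop[i] <- (weight[i] / height[i]^2) * 703
}

all.equal(bmi_vectorized, bmi_loop)   # do the two approaches agree?
```

The vectorized line does the same work as the loop in a single expression, which is the convenience the lab points to.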
Exercise 6
What does this box plot show? Pick another categorical variable from the data set and see how it relates to BMI. List the variable you chose, why you might think it would have a relationship to BMI, and indicate what the figure seems to suggest.

Finally, let’s make some histograms. We can look at the histogram for the age of our respondents with the command

    hist(cdc$age)
Histograms are generally a very good way to see the shape of a single distribution, but that shape can change depending on how the data is split between the different bins. You can control the number of bins by adding an argument to the command. In the next two lines, we first make a default histogram of bmi and then one with 50 breaks.

    hist(bmi)
    hist(bmi, breaks = 50)
Note that you can flip between plots that you’ve created by clicking the forward and backward arrows in the lower right region of RStudio, just above the plots. How do these two histograms compare?
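One way to see what the breaks argument changes is to ask hist for its computed bins without drawing anything. The sketch below uses simulated values rather than the real bmi object; the seed, the sample size, and the mean and standard deviation are arbitrary assumptions. Note that R treats a single number passed to breaks as a suggestion and rounds to convenient breakpoints, so the resulting number of bins may not be exactly 50.

```r
set.seed(1)                          # arbitrary seed so the sketch is repeatable
x <- rnorm(1000, mean = 26, sd = 5)  # simulated values, not the real bmi

# plot = FALSE returns the histogram object instead of drawing it.
h_default <- hist(x, plot = FALSE)
h_fine    <- hist(x, breaks = 50, plot = FALSE)

length(h_default$counts)   # number of bins R chose by default
length(h_fine$counts)      # number of bins when asking for about 50

# Every observation lands in exactly one bin in both versions.
sum(h_default$counts) == length(x)
sum(h_fine$counts) == length(x)
```

The data are identical in both calls; only the bin boundaries differ, which is why the apparent shape of the distribution can change between the two plots.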
Exercise 7
In the last lab, when exploring how the percentage of boys born varies in time (two numerical variables), we used a scatterplot. Using the same tool, the plot function, make a scatterplot of weight versus desired weight. Describe the relationship between these variables.

At this point, we’ve done a good first pass at analyzing the information in the BRFSS questionnaire. We’ve found an interesting association between smoking and gender, and we can say something about the relationship between people’s assessment of their general health and their own BMI. We’ve also picked up essential computing tools (summary statistics, subsetting, and plots) that will serve us well throughout this course.
Class survey
In the rest of this lab you will use the data from the Sta 101 classes to investigate relationships between certain types of variables of interest. You can find a list of the variables and corresponding survey questions here.

    download.file("http://stat.duke.edu/courses/Spring13/sta101.001/data/surveyS13.csv", destfile = "survey.csv")
    survey = read.csv("survey.csv")

Exercise 8
Pick a numerical variable and make an appropriate plot to visualize its distribution. Briefly describe the distribution of the variable using appropriate statistics. Hint: Use R to calculate summary statistics you might want to mention in your description.
Exercise 9
Pick a categorical variable and make an appropriate plot to visualize its distribution. Briefly describe the distribution of the variable using appropriate statistics.
Exercise 10
Pick one numerical and one categorical variable, make an appropriate plot to visualize the relationship between these variables, and briefly describe the apparent relationship.
Exercise 11
Pick two categorical variables, make an appropriate plot to visualize the relationship between these variables, and briefly describe the apparent relationship.
Exercise 12
Pick two numerical variables, make an appropriate plot to visualize the relationship between these variables, and briefly describe the apparent relationship.
Exercise 13
What concepts from the textbook are covered in this lab? What concepts, if any, are not covered in the textbook? Have you seen these concepts elsewhere, e.g. lecture, textbook, previous labs, etc.? Be specific in your answer.
List of R functions
For your convenience, a list of R functions you will commonly use in this class has been posted on the course website under the resources tab (also linked here). If you aren’t sure how to do something in R, the first thing to do is always to search the web. But some of the resources you come across might be overwhelming if they’re designed for more experienced users. Please don’t hesitate to ask your teammates, TAs, and the professor for help.

Lab 1
Name:
Exercises
Load CDC data:
Exercise 1:
There are 20,000 cases and nine variables. Genhlth: ordinal/categorical, Exerany: categorical, Hlthplan: categorical, Smoke100: categorical, Height: discrete numerical, Weight: discrete numerical, Wtdesire: discrete numerical, Age: discrete numerical, Gender: categorical.
Exercise 2:
IQR of height: 70 - 64 = 6
IQR of age: 57 - 31 = 26

    summary(cdc$height)

Exercise 3:

    table(cdc$gender)/20000
Exercise 4:
The mosaic plot reveals that slightly more females than males completed the survey, which we already knew. More importantly, it reveals that a larger percentage of surveyed males have smoked more than 100 cigarettes in their lives than have surveyed females. Thus, males likely have slightly worse smoking habits than females.
Exercise 5:
Using nrow, I find that there are 620 rows in this object, so there are 620 respondents that meet these criteria.

    under23_and_smoke <- subset(cdc, cdc$age < 23 & cdc$smoke100 == 1)
Exercise 6:
With these box plots, we can see that the worse that participants declared their health to be, the higher their median BMIs. I would think that the amount that respondents exercise also relates to BMI: one would assume that the more someone exercises, the lower their BMI. The box plots I created indeed show that those who had not exercised in the previous month had a slightly higher median BMI than those who had.

    bmi <- (cdc$weight/cdc$height^2) * 703
    boxplot(bmi ~ cdc$exerany, main = "BMI vs. recent exercise")
Exercise 7:
It appears as though the majority of respondents' desired weights are slightly below their current weights. As weight increases, desired weight generally increases as well, so the relationship is relatively strong (positive association, linear). There are a couple of outliers that seem to be inaccurate reportings (i.e., desired weights of 600 and 700 lbs.), and it would be wise to disregard these data.

    plot(cdc$wtdesire ~ cdc$weight, main = "Weight vs. desired weight")

Load survey data:

    download.file("http://stat.duke.edu/courses/Spring13/sta101.001/data/surveyS13.csv", destfile = "survey.csv")
    survey = read.csv("survey.csv")
Exercise 8:
The histogram of GPAs seems to be left skewed and unimodal. Since the highest possible GPA at Duke University is 4.0, the outliers are probably reporting mistakes and should probably not be considered with the actual data. According to the summary, the median GPA is 3.6, and using 1Q and 3Q, we can calculate that the IQR is 3.78 - 3.36 = 0.42.

    hist(survey$gpa, breaks = 20, main = "GPAs of surveyed students")
Exercise 9:
For this categorical variable, the most popular area of residence of surveyed students was the Southern United States (72/208 = 34.6%), followed by the Northeast US (28.3%), Western US (20.1%), Midwest US (9.6%), and finally international (7.2%).

    table(survey$where_from)
Exercise 10:
With these two variables we can see that a student whose first choice was Duke goes out a median of about 2 nights per week, while the median number of nights out for students whose first choice was not Duke is one night less per week than the “yes” group.

    boxplot(survey$go_out_times ~ survey$duke_first_choice, main = "First choice vs. Nights out per week")
Exercise 11:
In general, surveyed students from regions farther away from home seem to be more homesick than students who live closer to Duke. This is evident because a larger percentage of international students, as well as students from the northeastern and western US, reported that they were homesick than did students from the midwest and southern US.

    mosaicplot(table(survey$where_from, survey$homesick), main = "Home location vs. Homesickness")

Exercise 12:
This scatterplot shows that there is a positive, linear association between the number of drinks it takes for a student to get drunk and the average number of drinks he/she consumes on a given night. Students who need more drinks to get drunk will generally drink more on average in a given night than students who do not need as much to get drunk, and the relationship is evident but not particularly strong. There are a few data points that have higher numerical responses than the majority of the other respondents', but they follow the same general trend.

    plot(survey$drink_amount ~ survey$drinks_to_drunk, main = "Average drinks consumed vs. Drinks it takes to get drunk")