Data Science in Statistics Curricula: Preparing Students to "Think with Data"
Johanna Hardin, Roger Hoerl, Nicholas J. Horton, and Deborah Nolan
with: B. Baumer, O. Hall-Holt, P. Murrell, R. Peng, P. Roback, D. Temple Lang, and M. D. Ward
July 24, 2015
ABSTRACT
A growing number of students are completing undergraduate degrees in statistics and entering the workforce as data analysts. In these positions, they are expected to understand how to utilize databases and other data warehouses, scrape data from Internet sources, program solutions to complex problems in multiple languages, and think algorithmically as well as statistically. These data science topics have not traditionally been a major component of undergraduate programs in statistics. Consequently, a curricular shift is needed to address additional learning outcomes. The goal of this paper is to motivate the importance of data science proficiency and to provide examples and resources for instructors to implement data science in their own statistics curricula. We provide case studies from seven institutions. These varied approaches to teaching data science demonstrate curricular innovations to address new needs. Also included here are examples of assignments designed for courses that foster engagement of undergraduates with data and data science.

INTRODUCTION
The number of bachelor's degrees awarded in statistics has more than doubled in the five-year period 2008-2013 (Pierson, 2014) and continues to grow. This increase in the number of undergraduates may help address the impending shortage of quantitatively trained workers (National Academy of Sciences, 2010; Manyika et al., 2011; Zorn et al., 2014). Statistics graduates at the bachelor's level often work as analysts, and as a result need training in statistical methods, statistical thinking, and statistical practice; a foundation in theoretical statistics; increased skills in computing and data-related technologies; and the ability to communicate (ASA, 2014). Computing skills that enable the processing of large data sets are particularly relevant, as noted in the recent London Report on the Future of Statistics: "Undoubtedly the greatest challenge and opportunity that confronts today's statisticians is the rise of Big Data" (Madigan, 2014).

To help illustrate possible paths forward, we detail several existing approaches (many of them recently developed) that prepare students to work in industry as statisticians or analysts, to continue their training in statistics graduate programs, or to become scientists in fields allied to statistics. Our goal is to provide the statistics community with resources and concrete ideas for incorporating computational and authentic data experiences into undergraduate coursework. To this end, we invited faculty from seven institutions (Johns Hopkins University, Purdue University, St. Olaf College, Smith College, the University of Auckland, and the Universities of California at Berkeley and Davis) to describe their efforts to incorporate data science into the undergraduate curriculum in innovative ways.

WHAT IS DATA SCIENCE?
The term data science was suggested as a discipline by Cleveland (2001), who argued that the statistics profession should change its name to "data science," as that was, in fact, what statisticians did. Since then, the term has come to describe a discipline typically involving some mixture of statistics and large-scale computing (Greenhouse, 2013). Multiple definitions of a data scientist exist, including an inquisitive data explorer who communicates informed conclusions; someone who can use data from multiple sources to spot trends; or a "peculiar blend of developer and statistician that is capable of turning data into awesome" (Wills, 2015). We find particularly relevant the definition proposed by the NSF's advisory committee, StatSNSF: data science comprises the computational aspects of carrying out a complete data analysis, including acquisition, management, and analysis of data (Johnstone & Roberts, 2014).

Given the vast increase in the volume and complexity of data, and the new technologies that have been developed to process and analyze this information, we argue there is an increased need for statistical thinking in the context of working with data. Key statistical reasoning topics that are critical for data scientists to know at a deep level include:

- Understanding the randomness, variability, and uncertainty inherent in the problem.
- Developing clear statements of the problem or scientific research question, and understanding the purpose of the answer.
- Ensuring acquisition of high-quality data, and not just a lot of numbers.
- Understanding the process that produced the data, to provide proper context for analysis.
- Allowing domain (subject-matter) knowledge of the problem to guide both data collection and analysis.
- Approaching modeling as a process that requires an overall strategy, not simply a collection of special techniques or methods.
Why does data science belong in the undergraduate curriculum?
Integrating Data Science into the Curriculum: Seven Prototypes
To illustrate how to develop novel data science curricula, we surveyed a number of faculty about their approaches to integrating data science into the statistics curriculum. The descriptions presented here represent a number of innovative approaches. They are similar in that they all share the goal of having students become proficient in data technologies and programming tools for problem solving with data. However, they vary in mode of delivery, topics, and learning outcomes. Our hope is that these "existence proofs" will be useful to those working to integrate data science approaches into their own statistics curricula. The instructors have also shared their syllabi and some course materials (lecture notes, class projects, homework assignments, etc.) to provide more concrete guidance about the types of modules, units, and assignments that others might adapt and adopt. These resources are noted at the end of each prototype description, and the syllabi have all been collected at http://hardin47.github.io/DataSciStatsMaterials/.
The data science exemplars are presented alphabetically by author.
Data Science, Ben Baumer, Smith College
The Data Science course at Smith College, first offered in 2013, is an elective in the new program in Statistical & Data Sciences and in the Applied Statistics minor. The course provides a practical foundation for students to compute with data by participating in the entire data analysis cycle, from forming a statistical question through data acquisition, cleaning, transformation, modeling, and interpretation. It introduces students to tools for data management, storage, and manipulation that are common in data science, and students apply those tools to real scenarios. Students undertake practical analyses of real, large, messy datasets using modern computing tools (e.g., R, SQL) and learn to think statistically in approaching all aspects of data analysis (see Baumer (2015) for a complete discussion of the course). While some of the topics covered in the course come from existing offerings in applied statistics and computer science, Data Science presents a unified blend of this material, so that students recognize that both fields contribute to answering questions with data.

The course itself can be thought of as having five components: data visualization (e.g., data graphics, elements of visual perception), data manipulation (e.g., SQL, merging, aggregating, and iterating), computational statistics (e.g., confidence intervals via the bootstrap, simulation, regression, variable selection), data mining/machine learning (e.g., classification, cross-validation), and additional topics (e.g., text mining, mapping, regular expressions, network science, MapReduce).

In its first offering, the staff of Smith's GIS (geographic information systems) laboratory regularly attended the class, which facilitated the incorporation of lessons on spatial data and mapping techniques into the curriculum. This topic was popular with students, since the ability to generate data maps was perceived to be useful for visualization and communication.
A key learning outcome stressed in the course is the ability to think structurally about data and how to manipulate it. Wickham's "key idioms" (2015) were used to illustrate the similarities and differences between merging and aggregating operations in R and SQL. For example, what is the R equivalent of the GROUP BY operation in SQL? Because SQL syntax is close to the English language, it can help to demystify the R code. A key aspect of the course is helping students recognize that while each language has its own syntax, the underlying operation being performed on the data is the same.

Among the biggest challenges is keeping students with varied computational abilities and backgrounds on the same page. For example, some students come in with extensive knowledge of and practice with R, while others are seeing it for the first time, and it is difficult to find assignments that keep both types of students motivated. Another challenge is maintaining a consistent level of difficulty and workload when cobbling together material from a variety of sources.

End-of-course evaluations indicated that students felt they were learning useful things, and they in turn have generated more interest among younger students in taking the course. Finally, several students have indicated that skills they learned in the course corresponded directly to questions they were asked by employers during job interviews.
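One way to see the point about a shared underlying operation is to write the same group-and-aggregate step twice. The sketch below uses Python with the standard sqlite3 module for concreteness (the course itself works in R, where functions such as dplyr's group_by() and summarise() express the same idiom); the flight-delay records are made up.

```python
import sqlite3

# Toy flight-delay records: (carrier, delay in minutes). Hypothetical data.
rows = [("AA", 10), ("AA", 30), ("UA", 5), ("UA", 25), ("UA", 0)]

# The SQL version: GROUP BY collapses rows that share a key.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE flights (carrier TEXT, delay REAL)")
con.executemany("INSERT INTO flights VALUES (?, ?)", rows)
sql_means = dict(
    con.execute("SELECT carrier, AVG(delay) FROM flights GROUP BY carrier")
)

# The same operation in ordinary code: accumulate values per key, then reduce.
groups = {}
for carrier, delay in rows:
    groups.setdefault(carrier, []).append(delay)
py_means = {k: sum(v) / len(v) for k, v in groups.items()}

print(sql_means)  # {'AA': 20.0, 'UA': 10.0}
print(sql_means == py_means)  # True
```

Whatever the syntax, the computation is the same split-apply-combine pattern, which is the recognition the course is after.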
Data Technologies, Paul Murrell, University of Auckland (Murrell, 2009)

Concepts in Computing with Data, Deborah Nolan, University of California, Berkeley, and Duncan Temple Lang, University of California, Davis
This course was co-developed at UC Berkeley and UC Davis in 2004. It is a required upper-division course for the statistics major in both departments and is typically taken by sophomores and juniors. However, the majority of students enrolled in the course are not statistics majors. In 2014-15, over 600 students enrolled in four offerings of the course at Berkeley, and over 200 students enrolled in one offering at Davis.

Both courses focus on the computational aspects of the data analysis cycle, from data acquisition and cleaning to data organization, analysis, and reporting. Students are exposed to many different forms of data, including structured data such as XML and JSON, free-form text data, and dates, times, and geo-locations. To handle the data, students learn various tools and technologies, including shell commands, regular expressions, structured query language (SQL) for relational databases, JavaScript for developing interactive Web pages, and R. Programming concepts are taught with R, including control flow, recursion, data structures, and trees. Although the main focus is on the computational aspects of working with data, the course also covers many statistical topics: concepts of variability, patterns, and comparisons; exploratory and presentation graphics; statistical methods that are often not covered until late in the major, such as classification trees, multi-dimensional scaling, and nearest-neighbor methods; model selection and validation; and simulation tools, e.g., Monte Carlo, the bootstrap, and cross-validation.

One of the biggest challenges has been developing the resources for projects and assignments. This is more difficult than finding good data sets for teaching statistical methods, because data published for a statistical analysis often come already processed and cleaned, and these courses call for sources that are at least one step earlier in the analysis process.
Another common difficulty results from heterogeneity in student backgrounds. Many of the students are new to programming, but a substantial fraction have taken one or more CS courses, and the first group is sometimes intimidated by the course. For this reason, at Berkeley, graphics is taught first, since students get very excited about the sophisticated plots that they can make with R. At that point they are more open to learning programming concepts and handling more complex data (e.g., formats other than CSV). On the other hand, the more computationally advanced students often write code that works but ignores the paradigm of the language (e.g., loops instead of vectorized computations), so their programs are frustratingly slow. The challenge is to have them re-learn how to program with a different computational model.

Despite the challenges, Concepts in Computing with Data has been very rewarding to teach. Several faculty have taught the course at Berkeley, and all report how much they enjoy teaching it. One of the rewards is seeing the enthusiasm that students have for the material, and the confidence they gain is noticeable as they tackle increasingly challenging computational problems. They report how their projects helped them get jobs, gave them the confidence to learn new technologies in their new careers, and enabled them to participate in research projects in their majors. Many faculty routinely require this course for undergraduate students who wish to join their research teams. Additionally, Computing with Data enables teaching traditional statistical topics from a different approach. For example, generating random numbers and carrying out simulation studies help students understand the concept of a random variable and its properties.
Also, students use exploratory data analysis (EDA) to debug code; cross-validation and bootstrapping to assess models and variability; and presentation graphics to summarize findings from advanced statistical analyses. A dozen case studies from these and similar courses are now available (Nolan & Temple Lang, 2015a).
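A simulation exercise of the kind described above can be quite short. The sketch below (written with Python's standard library purely for illustration; the courses themselves use R) bootstraps the mean of a small made-up sample, showing students that a statistic is itself a random variable with its own variability.

```python
import random
import statistics

random.seed(1)

# An observed sample; in class this would be real data. Values are made up.
sample = [2.1, 3.4, 1.9, 4.2, 2.8, 3.1, 5.0, 2.5, 3.7, 2.2]

# Bootstrap: resample with replacement many times and watch how the
# sample mean varies from resample to resample.
boot_means = []
for _ in range(5000):
    resample = random.choices(sample, k=len(sample))
    boot_means.append(statistics.mean(resample))

# The middle 95% of the bootstrap distribution gives a rough interval.
boot_means.sort()
lo, hi = boot_means[int(0.025 * 5000)], boot_means[int(0.975 * 5000)]
print(f"observed mean = {statistics.mean(sample):.2f}")
print(f"95% bootstrap interval: ({lo:.2f}, {hi:.2f})")
```

Plotting the 5,000 bootstrap means as a histogram turns the abstract idea of a sampling distribution into something students can see.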
Data Science Specialization, Roger Peng, Johns Hopkins Bloomberg School of Public Health / Coursera
A Statistics-infused Introduction to Computer Science, Paul Roback and Olaf Hall-Holt, St. Olaf College
An Introduction to Big Data Analysis, Mark Daniel Ward, Purdue University
Purdue University is one of the largest producers of undergraduate statistics majors. The Introduction to Big Data Analysis course is part of a new initiative in Purdue's NSF-funded Statistics Living-Learning Community (STAT-LLC), which began in fall 2014. The STAT-LLC is a unique, immersive experience for approximately 20 students per year that unites many elements of the undergraduate experience. The program is aimed at sophomores, with the goal of creating a bridge from the first-year general curriculum into sophomore-year statistics major courses and into a student's first research experience in data analysis, especially with big data. Through this experience, the expectation is that students will be more likely to stay in their chosen major (improved retention rates), more confident in their coursework and research, more successful in their sophomore-year academic courses, and well positioned for graduate school and post-graduate experiences and careers.

The program includes academic courses, residential life, professional development, and mentored research projects that last a full calendar year (as opposed to a summer research experience). The students take three core courses as a cohort: probability theory, statistical theory, and this new course in big data analysis. They live together in a common residence hall with a dining court. They also enroll in a year-long professional development seminar that touches on all aspects of university life and on their future careers and training. Their research experiences are supported by faculty mentors from statistics and applied disciplines.

The Introduction to Big Data Analysis course focuses on computational tools for representing, extracting, manipulating, interpreting, transforming, and visualizing data. It is a statistics elective and has no prerequisites.
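A small example suggests the flavor of this kind of extraction work: pulling structured fields out of semi-structured text with a regular expression. The log lines and pattern below are hypothetical, and Python is used here purely for illustration (the course itself builds on R and Unix tools).

```python
import re

# Hypothetical web-server log lines: a typical "messy text" data source.
log = """\
128.210.1.5 - - [12/Sep/2014:10:03:21] "GET /index.html" 200
128.210.9.2 - - [12/Sep/2014:10:03:44] "GET /data.csv" 404
10.0.0.7 - - [12/Sep/2014:10:04:02] "GET /data.csv" 200
"""

# One pattern pulls out the client address, requested path, and status code.
pattern = re.compile(r'^(\S+) .*"GET (\S+)" (\d{3})$', re.MULTILINE)
records = pattern.findall(log)

for ip, path, status in records:
    print(ip, path, status)

# Tally requests per status code: a first step toward an actual analysis.
counts = {}
for _, _, status in records:
    counts[status] = counts.get(status, 0) + 1
print(counts)  # {'200': 2, '404': 1}
```

The same pattern language carries over to grep, awk, and R's regular-expression functions, which is what makes regular expressions such a portable course topic.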
The first half of the course introduces students to the R platform, basic data structures, exploratory data analysis, data visualization, random number generation and simulation, and an introduction to linear models. The second half includes topics such as the bash shell and shell scripting; awk, regular expressions, and pattern matching; relational databases using SQL; and XML parsing and Web scraping. In the future, the course will likely be expanded to include parallelism and distributed data with Hadoop and MapReduce.

The course is taught in a flipped environment; the course webpage contains videos, computer code, and notes about the topics. The entire course is project-based, using data sets chosen from various areas of application. Students work in teams in a computer laboratory and perform all of their computations remotely, on a server. Through the projects and assignments, the students gain practical experience in effectively communicating insights about data. All course materials are available at http://llc.stat.purdue.edu/2014/29000/index.html.

CURRICULAR TOPICS
The prototypes described in the previous section differ in level and audience, ranging from introductory courses in data technology, to a core course for the undergraduate major, to a 36-week data science specialization. Yet many topics arose repeatedly, and for faculty who are just starting to consider how to introduce data science into their curricula, Table 1 may provide ideas for key topics. Decisions about which topics to include, how much time to dedicate to them, and in what sequence to cover them will depend in part on curricular constraints and decisions at the local institution. Nonetheless, we offer here some ideas to consider when developing and updating a curriculum to include data science topics.
Programming.
Programming is an essential skill for data science. As seen in Table 1, we consider programming to include concepts of structured programming, higher-order notions of efficiency, and, in some cases, high performance computing. It is no longer adequate training for statistics students to analyze data using graphical user interfaces or to write simple scripts that do not use modular approaches (including writing functions and code using control flow) to process data.

At some institutions, it may be possible to have programming experience as a prerequisite for a data science course. However, most of the example courses here do not have such a prerequisite, predominantly because of three constraints: the extra prerequisite limits enrollment, because students in some fields, e.g., the social sciences, might not have taken the prerequisite course; requiring such a course may be perceived by these same students as a barrier to enrollment; and the data science course offers an important and unique integration of programming concepts with data handling and analysis. This integration yields a course with a very different focus from typical introductory programming courses, one that can ameliorate the difficulties of learning programming concepts because they are couched in a framework that emphasizes learning from data.
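The contrast between a one-off script and a modular approach is easy to show. The example below is a hypothetical sketch (in Python for illustration): a data-cleaning step wrapped in a function with explicit control flow, so it can be reused and tested rather than retyped for each new dataset.

```python
# A hypothetical cleaning task, written as a reusable function rather than
# a one-off script: control flow handles the messy cases explicitly.
def clean_measurements(raw, missing_codes=("NA", "-999", "")):
    """Convert raw string measurements to floats, dropping missing codes."""
    cleaned = []
    for value in raw:
        value = value.strip()
        if value in missing_codes:
            continue  # skip sentinel values used for missing data
        try:
            cleaned.append(float(value))
        except ValueError:
            continue  # skip anything unparseable
    return cleaned

# The same function now serves any dataset with the same conventions.
print(clean_measurements(["3.2", "NA", " 4.1", "-999", "oops", "5"]))
# [3.2, 4.1, 5.0]
```

This is the modularity the guidelines call for: the missing-data policy lives in one named, parameterized place instead of being scattered through a script.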
Data technologies and formats:
These two areas are essential to any data science course. The division of these topics into technologies and formats in Table 1 is somewhat arbitrary. For example, a relational database has a specific data format, but it is included under technologies because SQL is a language for accessing data in a database. Similarly, text data appears under the format heading, but it typically requires some familiarity with regular expressions to extract information for analysis. Among the courses presented, the University of Auckland's is the exception in not addressing text and regular expressions; that course focuses on Web technologies, and the material is covered in one of the other two computing electives in their program.

The topics of XML and shell commands are not universally included in these courses, and there are arguments both for and against including them. For example, knowledge of the shell is very useful for programmatic handling of files, such as thousands of Twitter messages. XML has the additional advantage of offering an example of trees and hierarchical data structures, an important CS concept that can help students understand how information is organized.
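The tree structure of XML can be made concrete in a few lines. The fragment below is hypothetical, and Python's standard xml.etree module is used for illustration (R has analogous parsing tools): each element is a node whose children can be traversed and queried.

```python
import xml.etree.ElementTree as ET

# A small, hypothetical XML fragment: data organized as a tree.
doc = """
<library>
  <book year="2015">
    <title>Data Science in R</title>
    <author>Nolan</author>
    <author>Temple Lang</author>
  </book>
  <book year="2009">
    <title>Introduction to Data Technologies</title>
    <author>Murrell</author>
  </book>
</library>
"""

root = ET.fromstring(doc)

# Navigating the tree: each <book> node has children of its own.
for book in root.findall("book"):
    title = book.find("title").text
    authors = [a.text for a in book.findall("author")]
    print(book.get("year"), title, authors)

# XPath-style queries express searches over the whole hierarchy.
print(len(root.findall(".//author")))  # 3
```

Seeing that a record can legitimately contain a varying number of nested children (two authors, then one) is exactly the hierarchical-data insight the paragraph above describes.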
Statistical Topics:
We believe it is crucial to include statistical topics in the course and not limit topics to data wrangling, because an understanding of how we might analyze the data affects how we process the data. Additionally, as many of the course creators have indicated, data science offers an opportunity to expose students to the entire data analysis process and to teach statistical thinking in alternative, more realistic contexts.

Some of the prototypes introduce visualization early in the course, for two reasons: it provides a platform for introducing computational notions that lead to rewarding outcomes, e.g., a beautiful plot, and visualization can be reinforced throughout the rest of the course. Including modern methods in the curriculum provides engaging ways to apply more complex and advanced methods. In such a course, the focus tends to be more on the statistical ideas and less on formal properties of the methods. Many advanced methods would be appropriate to include, with the selection focusing on those that are computationally intensive and intuitive. Examples include recursive partitioning, support vector machines, nearest-neighbor methods, the little bag of bootstraps, and the LASSO. The topic of simulation can be approached from a resampling perspective and/or as a simulation study or experiment.

The collection of these topics seems more an artifact of the presence or absence of other computational courses in the curriculum. For example, at Auckland there are three computing courses and so less pressure to include "everything." Additionally, since the writing of this article, UC Davis has converted its one-quarter computing course into three quarter-long courses: the first focuses on programming and includes workflow and simulation topics; the second focuses on data technologies; and the last covers new topics in advanced computing methods, e.g., high performance computing.

[Table 1 is not reproduced here. Its columns are Area, Topic, and the courses at Smith, Auckland, UC Berkeley/Davis, JHSPH, St. Olaf, and Purdue.]
Table 1. Curricular topics offered in each of the example courses. See the definitions section for more detailed explanations of the topics.
Definitions of computation-oriented terms.
Structured Programming (Structured): a programming paradigm that uses conditional statements, iteration (e.g., for and while loops), block structures, and subroutines.
Efficiency: the speed of runtime execution of code.
High Performance Computing (HPC): techniques of parallel processing and grid computing that use many computing resources simultaneously.
Relational Databases (RDBMS (SQL)): information stored in multiple tables (i.e., relations). Each table represents an entity, with rows (records) and columns (variables), and allows linking between tables.
Regular Expressions (RegEx): a language for describing patterns to search for in text.
eXtensible Markup Language (XML): a text-based format for exchanging information. XML obeys a set of rules for encoding documents that is human readable and that can be read and generated by machine.
Shell commands: a command-line interface to the operating system's file and process management.
Web scraping: automated procedures for retrieving content from the Web.
Ragged arrays : non-rectangular data where records have differing numbers of values.
Reproducibility : the notion that a final product includes the computations required to produce the results, such as code, data, computing environment, etc.
Revision Control: software to manage collaborative development, editing, and sharing of code, documents, web sites, etc.

It is difficult to quantify the time spent in a course on the topics found in Table 1, because topics are often taught simultaneously and early topics are reinforced when covering later topics. Also, we did not include the basics of learning a computer language, e.g., expressions and data types, under the programming topic in Table 1. However, we attempt to give the reader a better sense of the balance of topics for two courses (St. Olaf and Berkeley) by examining their assignments. At St. Olaf, the student work consists of 29 homework assignments due at each class meeting and 5 larger projects. At Berkeley, the student work consists of 11 weekly lab assignments, 8 homework assignments, and 2 projects (plus a midterm and final). For each course, we reviewed the various assignments and attempted to categorize them according to the topics in Table 1. When assignments covered multiple topics, we distributed the work evenly across the relevant topics. We hope this crude estimate gives a sense of the distribution of time and effort across the computational topics for these two courses. For each type of assignment, we list the topics in order from greatest to least and provide percentages for those that make up more than 10% of the total. We note that the nature and philosophy of the assignments is not captured by this metric; see below for examples.

At St. Olaf, the daily assignments fell into the following categories: nearly 75% of the work focused on programming concepts in R and Python (including the basics mentioned above that are not listed in the table). The remaining topics were, in decreasing order, regular expressions, Web scraping, visualization, web publishing, and SQL. The balance of topics for the five projects was quite different.
Programming made up about 50% of the project focus, followed by modern methods (20%), visualization (15%), simulation, regular expressions, and web publishing.

At Berkeley, about 35% of the lab work focused on programming, and 10% each was on regular expressions, SQL, XML, shell, simulation, and visualization. For the homework, again about 35% was on programming. After that, the general focus was on statistical topics, with about 25% on visualization, 15% on modern methods, and, in decreasing order, simulation, regular expressions, SQL, and XML. For the projects, the effort is divided as follows: programming 30%, visualization 30%, simulation 15%, modern methods 15%, then web publishing and regular expressions. We also quantified the lecture time as follows: programming 35%, visualization 15%, modern methods 15%, then simulation, regular expressions, Web scraping, SQL, shell, XML, and Web publishing.
Examples of Assignments and Projects

SUMMARY DISCUSSION
As faculty who have worked to integrate aspects of data science into our own curricula, we have a number of reflections on the implementations presented here. The most striking highlight is that all faculty report how popular their courses are and how rewarding they are to teach. Not surprisingly, a course that incorporates data science can be a way to excite students about further study in statistics. We conjecture that one reason for the high enrollments in the data science classes at these institutions is that students perceive these courses as relevant and exciting. We also contend that incorporating data science into the statistics curriculum gives us an opportunity to teach statistical methods, thinking, practice, and computation in a modern venue that is meaningful to students and aligned with the goals of a statistics education.

van der Laan (2015) defined statistics as the science of learning from data. We assert that to analyze the data of today and of the future, these techniques need to include more data-related skills. In the data science courses described above, students have the opportunity to repeatedly participate in the entire data analysis cycle: forming a research question, obtaining data, formatting and cleaning data, analyzing, and communicating results. Whereas the first few of these steps have typically been skipped in too many traditional courses and curricula (Zhu et al., 2013), a data-centric approach requires students to follow the entire trajectory in order to understand the source and inherent variability of the data. Moreover, with a data science course it is possible to integrate statistical thinking and computing with experience with data; the goal is to focus on the computational problem-solving aspects of carrying out a data analysis. The recommendation for repeated exposure to this cycle flows directly from guidance on the structure of the statistics curriculum (ASA, 2014).
Class projects and assignments in data science courses differ from traditional computer science assignments and from traditional data analysis assignments because they typically require both a demonstration of computing aptitude and insight into the data analysis. Grading such assignments can be cumbersome. Systems such as Murrell's relieve the burden of grading code, while assignments like these also often require students to write up their analysis and findings, an aspect that Peng's peer grading system tries to address. Both of these grading systems are useful in reducing the burden of grading when there are a large number of students in the course. However, additional considerations need to be addressed. Statistics faculty have typically not been trained to evaluate their students' technical writing or code, which in turn means the students (i.e., peer graders) do not receive such training either. Moreover, assignments that require data analysis are often very open-ended, and developing tests for code can be difficult without making the assignment overly prescriptive.

For faculty who are interested in developing their pedagogical abilities in data science, we recommend participation in relevant workshops and seminars that focus on such topics. Recent workshops include the 3rd [...] teaching an entire course and changing it into a student-run/led investigation solving a real data problem. Additionally, students are able to display their data science skills publicly. Indeed, at the 2014 and 2015 Five College DataFest competitions, a team of students from Baumer's class won Best in Show for their analyses.

In summary, the main objective of this paper is to make concrete recommendations about ways that new capacities in data science can be implemented in the undergraduate curriculum.
Nolan & Temple Lang (2010), Brown & Kass (2009), ASA (2014), Cobb (2015), and others have called for a comprehensive restructuring of how students are prepared to deal with the myriad of data they will see in their careers. This paper takes these recommendations a step further by offering a variety of example implementations, along with syllabi and course materials, to compare and contrast, to adopt and adapt, and to assist faculty who want to modernize their statistics programs.

ACKNOWLEDGEMENTS

REFERENCES
American Statistical Association. (2014, July 28). Statistics Bachelor's Degrees; Report of the ASA Workgroup on Master's Degrees. Alexandria, VA: American Statistical Association. http://magazine.amstat.org/wp-content/uploads/2013an/masterworkgroup.pdf

Baumer, B. (2015). A Data Science Course for Undergraduates: Thinking with Data. The American Statistician. http://arxiv.org/abs/1503.05570

Brown, E., & Kass, R. (2009). What is statistics? The American Statistician, 63, 105-110.

Cleveland, W. (2001). Data Science: An Action Plan for Expanding the Technical Areas of the Field of Statistics. International Statistics Review, 69, 21-26.

Cobb, G. (2015). Mere Renovation is Too Little Too Late: We Need to Rethink the Undergraduate Curriculum from the Ground Up. The American Statistician.

Cuny, J., Snyder, L., & Wing, J. M. (2010). Demystifying Computational Thinking for Non-Computer Scientists.

Gould, R. (2010). Statistics and the Modern Student. International Statistical Review, 78(2), 297-315.

Gould, R., Baumer, B., Mine, C.-R., & Bray, A. (2014). Big Data Goes to College. AMSTAT News. http://magazine.amstat.org/blog/2014/06/01/datafest/

Greenhouse, J. B. (2013, July 26). Statistical Thinking: the bedrock of data science. The Huffington Post.

Chance, in press.

Johnson, J., Reitzel, J. D., Norwood, B., McCoy, D., Cumming, B., & Tate, R. (2013, March). Social Network Analysis: A Systematic Approach for Investigating. FBI Law Enforcement Bulletin. https://leb.fbi.gov/2013/march/social-network-analysis-a-systematic-approach-for-investigating

Johnstone, I., & Roberts, F. (2014). Data Science at NSF.

Madigan, D. (2014). Statistics and Science: A Report of the London Workshop on the Future of the Statistical Sciences.

Manyika, J., et al. (2011). Big data: The next frontier for innovation, competition, and productivity.

Murrell, P. (2009). Introduction to Data Technologies.

National Academy of Sciences. (2010). Rising Above the Gathering Storm, Revisited: Rapidly Approaching Category 5. Washington, DC: The National Academies Press.

Nolan, D., & Temple Lang, D. (2010). Computing in the Statistics Curricula. The American Statistician, 64, 97-107.

Nolan, D., & Temple Lang, D. (2015a). Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving. CRC Press.

Nolan, D., & Temple Lang, D. (2015b). Explorations in Statistics Research: A model for undergraduate co-curricular exposure to modern research problems.

Pierson, S. (2014, September 1). Bachelor's Degrees in Statistics Surge Another 20%. AMSTAT News. http://magazine.amstat.org/blog/2014/09/01/degrees/

van der Laan, M. (2015, February 1). Statistics as a Science, Not an Art: The Way to Survive in Data Science. AMSTAT News. http://magazine.amstat.org/blog/2015/02/01/statscience_feb2015/

Wickham, H. (2009). ggplot2: elegant graphics for data analysis. New York: Springer. http://had.co.nz/ggplot2/book

Wickham, H. (2015). Tidy Data. Journal of Statistical Software, submitted.

Wills, J. (2015). New to Data Science.

Zhu, et al. (2013). The American Statistician, 67, 235-241.

Zorn, P., Bailer, J., Braddy, L., Carpenter, J., Jaco, W., & Turner, P. (2014).