A First Course in Data Science
Donghui Yan and Gary E. Davis
Department of Mathematics and Program in Data Science
University of Massachusetts Dartmouth
May 9, 2019
Abstract
Data science is a discipline that provides principles, methodology, and guidelines for the analysis of data for tools, values, or insights. Driven by a huge workforce demand, many academic institutions have started to offer degrees in data science, with many at the graduate, and a few at the undergraduate, level. Curricula may differ at different institutions because of varying levels of faculty expertise and the different disciplines (such as mathematics, computer science, and business) involved in developing the curriculum. The University of Massachusetts Dartmouth started offering degree programs in data science from Fall 2015, at both the undergraduate and the graduate level. Quite a few articles have been published that deal with graduate data science courses, but much less has been written about undergraduate ones. Our discussion will focus on undergraduate course structure and function, and specifically on a first course in data science. Our design of this course centers around a concept called the data science life cycle. That is, we view tasks or steps in the practice of data science as forming a process, consisting of states that indicate how a project comes into life and how different tasks in data science depend on or interact with others, until the birth of a data product or the reach of a conclusion. Naturally, different pieces of the data science life cycle then form individual parts of the course. The details of each piece are filled in with concepts, techniques, and skills that are popular in industry. Consequently, the design of our course is both “principled” and practical. A significant feature of our course philosophy is that, in line with activity theory, the course is based on the use of tools to transform real data in order to answer strongly motivated questions related to the data.
We discuss our implementation of a first-year undergraduate course in data science as part of a 4-year university-level BS in data science, and we also elaborate what we see as important principles for any beginning undergraduate course in data science. Our principal aim is to stimulate discussion on relevant principles and criteria for a productive introduction to data science.

1 Background on data science
The term “data science” was coined by Jeff C. Wu in his Carver Professorship lecture at the University of Michigan in 1997 (Wu 1997). In this and a subsequent 1998 Mahalanobis Memorial Lecture (Wu 1998), Wu advocated the use of data science as a modern name for statistics. This was the first time the term “data science” was used in the statistical community. Cleveland (Cleveland 2001) outlined a plan for a “new” discipline, broader than statistics, that he called “data science”, but did not reference Wu’s use of the term. The International Council for Science: Committee on Data for Science and Technology began publication of the Data Science Journal in 2002, and Columbia University began publication of The Journal of Data Science in 2003.

Data science became popular during the last decade with the booming of many major Internet corporations, such as Yahoo, Google, LinkedIn, Facebook, and Amazon, and many start-ups built from data, such as Palantir, Everstring, the Climate Corporation, and Stitch Fix. Nowadays, “data science”, along with “big data”, has become one of the most frequently used phrases in venues such as business, news, media, social networks, and academia, with “data scientist” becoming one of the most popular job titles (Davenport and Patil 2012, Columbus 2018).

Despite the fact that data science has become so popular, and we use products enabled by data science on almost a daily basis, there is currently no consensus on the definition of data science. While Wu’s proposal of the name “data science” adds a modern flavor to traditional statistics, we, along with a majority of working data scientists, consider data science a broader concept than statistics. We view data science as the science of learning from data: a discipline that provides theory, methodology, principles, and guidelines for the analysis of data for tools, values, or insights. Here tools may include those that help the user carry out better analysis, such as tools for visualization, data collection, or exploration, and value refers mainly to commercial or scientific value.

Our view of data science has ingredients from several sources, including traditional statistics, Leo Breiman’s “two cultures” argument of modeling (Breiman 2001), and, in terms of coverage of topics, David Donoho’s “50 years of data science” lecture at Princeton University, 2015 (Donoho 2015; see also Donoho 2017). In particular, our view of data science includes both the generative and the predictive “culture”. Effectively, this includes machine learning, mostly predictive in nature, as part of data science, thus putting these two subjects of learning from data, namely, statistics and machine learning, under a common umbrella.
This allows a unified treatment of a wide range of problems, including estimation, regression, classification, ranking, as well as unsupervised (or semi-supervised) learning, under the broad term “modeling” (or analysis). The benefit is immediate: developments and expertise in these two historically separate subjects can inform each other, and many redundant course offerings due to administrative barriers can be removed. Another crucial element in our view is that one could start with a large amount of data without any particular questions in mind, and relevant questions would be figured out while exploring the data. This is what drives the recent surge of interest in data science, given the prevalence of data-generating sources such as the Internet, mobile and portable devices, and the increasing feasibility of collecting large amounts of data. A third point is that data science should also include an interface layer that interacts with domain knowledge or the business aspect, and also algorithms or techniques that deal with the implementation, that is, the computer science aspect. So, in our view, data science is an interdisciplinary subject that encompasses the traditional regimes of statistics and machine learning, business or domain sciences, and computer science.
Driven by a huge demand in data science (Manyika et al., 2011; PwC; Columbus, 2017), many academic institutes have started offering degrees in data science, with many at the graduate and a few at the undergraduate level (see, for example, National Academies 2018). The curriculum may differ at every institute, due possibly to the fact that there is still no consensus on the definition of data science. At the University of Massachusetts, Dartmouth, we started offering a BS and MS in data science from Fall 2015. Quite a few articles have been published that discuss data science courses, e.g., Tishkovskaya and Lancaster (2012), Baumer (2015), Escobedo-Land and Kim (2015), Hardin et al. (2015), and Horton et al. (2015). Our discussion here will be about undergraduate data science and, more specifically, a first course in data science (labelled “DSC101” at the University of Massachusetts Dartmouth). Such a course gives an overview and brief introduction to the concepts and practices of data science, and serves three goals.

• It introduces to students the notion that data entails value, thus helping motivate students to the study of data science.
• It provides students with a big picture and basic concepts of data science, as well as the main ingredients of data science.
• Students will learn some practical techniques and tools that they can apply later in more advanced courses or when they start work after their degree program.

Our curriculum design centers around the data science life cycle and is not simply a loose collection of various topics in data science. It is based on a process model. The idea is that we view individual steps or tasks in data science as forming a process where some may depend on, or interact with, others, or may repeat as more insights are gained along the way, until the reach of a conclusion or the birth of a data product.
A brief introduction to each piece in the process then forms the individual parts of DSC101, with details to be covered in more specialized or advanced courses. (A data product is any product built from the data. It can be a piece of software, such as a recommendation system on an e-commerce web site; a collection of data that some vendors use to make a profit, for example, personal data crawled from many different sources, processed, and arranged in tabular format; or a software tool that one can use to carry out the analysis for a specific application.)

The design of a data science course could also be based on case studies. There are courses in statistics designed with this approach, for example, Nolan and Speed (2000). However, we have not seen many data science courses designed this way; the exceptions are Hardin et al. (2015) and Nolan and Temple Lang (2015). A case study based approach would require a careful selection of study cases, with each emphasizing a different aspect of data science so as to ensure coverage of data science topics, which is far from easy and requires regular updating. Other alternative course structures include the Berkeley Data 8 “Foundations of Data Science” course (see data8.org).

Another feature that distinguishes our DSC101 from similar courses is its practical flavor. Apart from its traditional statistical rigor, DSC101 also has a strong industry flavor: it emphasizes practical aspects, and many examples are taken from applications in industry; the idea is to provide students with authentic data experiences (Grimshaw 2015). The first author has previous data science experience in industry, and in designing this course we use examples from data science in industry and carry out some reverse engineering to decide what topics, projects, and other components are to be included so that students can gain experience with the practical demands of industry. For example, we choose R as the programming language for this course, due to the increasing popularity of R in industry. Similarly, given that a data scientist typically spends about 60-70% of their daily work pre-processing data, including collecting, cleaning, and transforming the data, we have a project that requires students to collect and process unstructured auto sales data from the web, and students are encouraged to use Python for this purpose.

The remainder of this paper is structured as follows. First we present two examples of data science applications to motivate the concept of the data science life cycle in Section 4. This is followed by a discussion of the theoretical basis for student activity in Section 5. Then we discuss philosophies of the course design in Section 6. This is followed by an introduction in Section 7 of individual pieces in the data science life cycle, namely, the generation of questions, data collection, various topics in exploratory data analysis, and then linear regression and hypothesis testing. Finally, we conclude with remarks.
4 The data science life cycle

As stated in Section 1, our design of DSC101 centers around the data science life cycle. In this section, we will explain the data science life cycle in detail, through two examples. One is about a large-scale study untangling the relationship among smoking, low birthweight, and infant mortality. The second is on how an e-commerce web site may use historical transaction records to build an item recommendation engine. As will become clear shortly, these represent two different modes of how a data product could be built, and correspondingly, two different paths in the data science life cycle.

The first example is from a noted study, the Child Health and Development Studies, carried out by Yerushalmy (1964, 1971) in the 1960's on how a mother's smoking, low birthweight of infants, and infant mortality are related.

Figure 1: Smoking, low birthweight, and infant mortality. The link between nodes indicates association instead of causation.

Several prior studies, e.g., Simpson (1957), suggested a much greater proportion of lower birthweights (i.e., less than 2,500 grams for newborns in the US) among smoking mothers than nonsmokers. Meanwhile, low birthweight was a strong predictor of infant mortality. Is smoking related to infant mortality? Data were collected for all pregnancies (about 10,000 cases before 1964, later increased to about 15,000) between 1960 and 1967 among women in the Kaiser Foundation Health Plan in Oakland, California (Nolan and Speed 2000). The data include the baby's length, weight, and head circumference, the length of pregnancy, whether the baby is first-born, the age, height, weight, education, and smoking status of the mother, as well as similar information about the father. Yerushalmy's 1964 study confirmed prior claims of a greater proportion of low-weight births but found no higher mortality rate for smoking mothers. Yerushalmy collected more data, for about 13,000 pregnancies, and refined his research focus to low birthweight infants. This led to the unexpected finding that, among the low birthweight infants, those from a smoking mother actually survived considerably better than otherwise. A later study (Wilcox 2001), directed by Allen Wilcox on a much larger data set of about 260,000 births in the state of Missouri (1980-1984), resolved the low birthweight paradox and found that infant mortality was primarily caused by other factors, such as preterm birth. Wilcox writes:

"the mortality difference must be due either to a difference in small pre-term births or to differences in weight-specific mortality that are independent of birthweight. This demonstrates the central importance of pre-term delivery in infant mortality, and the unimportance of birthweight" (Wilcox 2001, p. 1239).

The second example is about item recommendation on an e-commerce web site.
An e-commerce vendor typically collects traces of every 'mouse click' when a user visits its web site, including items a user clicks, views, or purchases. Such data are often called clickstream data, and contain fairly rich information about users' purchase behavior: for example, the most popular items, items a user typically buys together (called "co-bought items"), and geographical patterns in users' purchase behavior. Such user behavior profiles can be used to recommend selected items to the user, or to select appropriate content to show the user when he or she enters a new page. This is called item recommendation or personalization. For example, in Figure 2, a user has clicked a Nikon camera. The co-bought statistics from historical data, taking into account item prices, can be used to decide which items to display so as to lead to the most user clicks or the most profit for the vendor.

Figure 2: Items recommended when a camera is clicked. Courtesy walmart.com.
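The co-bought statistics behind such a recommendation can be sketched in a few lines. The transactions, item names, and function names below are all made up for illustration; a production system would also weight by price and popularity, as described above.

```python
from collections import defaultdict, Counter
from itertools import combinations

def co_bought_counts(transactions):
    """Count, for each item, how often every other item appears
    in the same transaction (the 'co-bought' statistics)."""
    counts = defaultdict(Counter)
    for basket in transactions:
        for a, b in combinations(sorted(set(basket)), 2):
            counts[a][b] += 1
            counts[b][a] += 1
    return counts

def recommend(counts, clicked_item, k=2):
    """Return the k items most frequently co-bought with clicked_item."""
    return [item for item, _ in counts[clicked_item].most_common(k)]

# Toy baskets derived from hypothetical clickstream data
transactions = [
    {"camera", "sd_card", "tripod"},
    {"camera", "sd_card", "camera_bag"},
    {"camera", "sd_card"},
    {"tripod", "camera_bag"},
]
counts = co_bought_counts(transactions)
print(recommend(counts, "camera"))
```

Here a click on "camera" surfaces the SD card first, since it co-occurs in three of the four baskets.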
The first example describes the path taken by traditional statistical analysis. That is, one starts with a question in mind, then collects data, followed by data analysis, more data, a refined question, and then a conclusion. The second example describes an alternative path, where a large amount of data has been collected (e.g., as a by-product of normal business operations) but it is not clear what to do with it, so one needs to come up with a relevant question (such as "what behaviors predict purchases?") through some preliminary analysis of the data, and then conduct data analysis until reaching a conclusion or outcome. One thing in common is that both examples consist of the same set of tasks: data collection (including data cleaning and pre-processing), questions, analysis, and outcome (a conclusion, a model, or a data product, etc). Data analysis can be either exploratory data analysis (EDA), in which one explores the data and constructs hypotheses, or confirmatory data analysis (CDA), in which one tests prespecified hypotheses via a model on variables of interest. One reaches a conclusion or outcome either by EDA or CDA. Note that some steps may be repeated multiple times. Among these tasks there is a dependency: some tasks only start upon the completion of preceding ones. Each data science application has a start, followed by a series of tasks, and finishes with an end. We use a concept called a process to describe this, in analogy with the process concept used in computer operating systems. Interdependent tasks are linked by a (directed) arrow: the task pointed to by the arrow only starts when the task at the source of the arrow completes. Putting these together, we arrive at a directed graph. This is the data science life cycle, similar to the software life cycle (Langer 2012) used in software engineering.

Figure 3: The data science life cycle.
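The dependency structure just described can be encoded directly as a directed graph. The node names below are illustrative, not a transcription of Figure 3, and the `ready_tasks` helper is a hypothetical sketch of the "a task starts only when its sources complete" rule.

```python
# Edges point from a task to tasks that may start once it completes.
life_cycle = {
    "start": ["question", "data collection"],
    "question": ["data collection"],
    "data collection": ["EDA"],
    "EDA": ["question", "CDA", "outcome"],  # EDA may suggest new questions
    "CDA": ["outcome"],
    "outcome": ["end"],
}

def ready_tasks(graph, completed):
    """Tasks all of whose prerequisite (incoming) edges are completed."""
    ready = set()
    for src, dests in graph.items():
        for d in dests:
            prereqs = {s for s, ds in graph.items() if d in ds}
            if prereqs <= set(completed) and d not in completed:
                ready.add(d)
    return ready

print(ready_tasks(life_cycle, {"start", "question"}))  # -> {'data collection'}
```

The back edge from EDA to question is what allows the repetition of steps noted above.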
Figure 3 is our proposed diagram for the data science life cycle. "Data & Q" indicates a state in the data science life cycle such that, at the current state, one has collected the data and formulated a study question.

Similar models or diagrams have been proposed for data science during the last few years, for example, Schutt and O'Neil's data science process diagram (O'Neil and Schutt 2013), Philip Guo's data science workflow (Guo 2012), the PPDAC cycle (Wild and Pfannkuch 1999), and the Wickham-Grolemund data science cycle (Wickham and Grolemund 2016). These are illustrated in Figure 4. However, there are major differences from our model. Schutt and O'Neil's diagram focuses on the data and describes intermediate stages in the building of a data product, so it is essentially a data cycle. Guo's workflow model describes the dependency of various tasks in a data science project setting; it includes many details and may not be general enough. The PPDAC cycle is the closest to ours in the sense that it consists of one possible path in our diagram. The Wickham-Grolemund data science cycle takes a data-centered approach and forms a cycle by including various operations on the data. Our model focuses on the tasks in data science, and allows interaction between tasks and their repetition, as well as the possibility of having a clearly defined question in mind at the start.
5 Activity theory

Semester-long undergraduate courses are designed with specific aims and learning outcomes in mind, and college or university administrators require these to be explicitly articulated. Additionally, instructors bring with them a theoretical stance on how a course will be executed over a semester. This theoretical aspect of course design and implementation is rarely explicitly articulated (and sometimes not even by the instructor to themselves!). In our view, activity theory (Leontiev 1978; Davydov et al. 1982; Raeithel 1991) provides a coherent and productive foundation, and we discuss how aspects of activity theory interact with the data science life cycle.

Figure 4: Alternative diagrams related to the analysis of data. a) The data science process diagram of Schutt and O'Neil; b) the data science workflow of Guo; c) the PPDAC cycle; d) the Wickham-Grolemund data science cycle.

As this is only a first course in data science, usually offered at the beginning of the first year, students typically have not acquired a strong background in calculus or statistics, so we need to begin by working with the intellectual tools they do have and then introduce them to new analytical and computational tools. We focus in the beginning mainly on exploratory data analysis and concepts related to the various parts of the data science life cycle. This includes an introduction to concepts and tools such as sampling, descriptive and summary statistics, and data visualization and graphical tools, moving on progressively to tools such as principal component analysis, clustering, linear regression, and hypothesis testing.

A major point is the following: the tools are introduced in order to enable students to effectively transform raw data into something more useful. The focus is on the raw data, the motivation to transform them (the objective), and the tools used to effect those transformations. This is the opposite of a scenario in which techniques of data analysis are taught with artificially designed and relatively simple toy data (that is, students practice tool use in the absence of appropriate or realistic data). Becoming a useful and skillful data scientist requires addressing the full complexity of data, and finding appropriate tools to effect insightful transformations on those data.
This is the central reason why activity theory drives so much of our thinking in course design for data science: from an activity theory perspective, the context of data science for these beginning students is the raw data, questions posed about those data, agreed objectives, and transformation of the data by activity, utilizing analytical tools, in a cyclic process. From this perspective, introductory data science is contextualized for the students as a meaningful, empowering process. In many academic courses students do exercises and practice on toy data sets to complete homework exercises and study for an examination in order to get a satisfactory grade. The reality of the context makes the DSC101 course quite different from this.

Specific curriculum instances of activity theory are often described in terms of an "activity triangle" (see, for example, Engestrom 1991, 1999, 2000 and Price et al. 2010). Typically, these activity triangles have a structure as illustrated in Figure 5.

Figure 5: A generic activity triangle.

An activity triangle encapsulates the interrelationship between the main constituents of a curriculum activity as conceived by activity theory. As a specific example, consider a traffic data example (see more details in Section 7.1), which consists of the starting point and destination of each trip and a time stamp at each road during the trip. The activity starts with raw material, which is real traffic data.

The community is the class of students and the instructor, but may also include an audience, other than the instructor, for whom the students are to build a data product, such as a predictive model, and write a report. For example, the traffic data may well have come from someone who wants to know certain things about the data; in this case students write reports for that person, who is also part of the community.

The subject or subjects consist of an individual student or small groups of students working together to produce an outcome, typically a written report of an analysis or a data product.

The tools are usually software tools, such as the R programming language, and conceptual tools, such as regression or clustering techniques, that students can bring to bear on the objective.

The object, or objective, is determined through discussion by students, the instructor, and any external client; in the case of the traffic data example it may involve such things as determining traffic bottlenecks at particular times of day.

The rules vary from activity to activity, and may include such general things as avoidance of plagiarism, appropriate referencing of sources, cooperation within and between student teams, sharing of findings, ethical behavior, and responsibility for meeting deadlines.
Division of labor can work in several ways, including different students within a team taking charge of different aspects of analysis, or different teams focusing on different aspects of an objective with the aim of pooling findings.

The same data set may, and usually does, generate a number of different activities and objectives as students ask further questions about the data and set out to examine their determined objectives. When this happens the objective will change; the subjects may change in that students may form new groups, spontaneously or at the instructor's direction; the division of labor may change; and the tools will most likely need to be modified, with new tools brought to bear on achieving the objective.

An activity triangle, as realized in a specific curriculum module, is coordinated with the data science life cycle. Although many variations are possible from activity to activity, as described above, it is common that certain aspects of an activity triangle stay fixed throughout a semester: typically the subjects are the students; the community is the class of students and the instructor; the rules are articulated at the beginning of the semester and stay more or less fixed; and commonly the division of labor, either within or between groups, stays much the same. The data science life cycle impacts the activity triangle, and vice versa, from question to objective, analysis to tools, and outcome to conclusion.

As students are engaged in a specific project (some of which are detailed below) and cycle through the data science life cycle, a new activity triangle emerges in which new questions inform new objectives, new analyses require new tools, and new outcomes provide new conclusions. Thus one sees a dynamic sequence of activity triangles as progress on a project involves cycling through the data science life cycle.
The activity triangles inform the data science life cycle in that they describe how the various aspects of the data science life cycle are implemented through activity.

We focus on the practice and craft of data science, part of what it means to become apprenticed as a beginning data scientist. This does not mean, however, that something akin to Lave's situated action model (Lave 1988; Lave and Wenger 1991), in which one learns by self-directed, novice participation in a communal activity, provides a better theoretical model for designing a data science course than does activity theory. The essential feature of activity theory that is helpful in this regard is that an object comes before an activity based on that object, and motivates the activity (Nardi 1996). While learning to become a data scientist by behaving as if one were an apprentice, thrown into an ongoing field of activity, can be a positive and highly educative experience, and is the motivation for many student internships, our focus in beginning data science courses is on activity motivated by student desire to transform data, using tools they have at hand or are capable of developing: this constitutes an "object" (or "objective") in activity theory. Data is transformed through activity that relates to an objective usually coming from a naturally arising question about the data.
Activity theory helps us focus on two aspects of DSC101 that are important to its success. The first is that the data is real-world data for which a question, sometimes rather vague, is naturally proposed. For example (see also Section 7.1), given a collection of traffic data for many trips, including the starting and end point as well as a timestamp at each road during the trip, what questions could students ask that have the potential of becoming a data product? This aspect of the DSC101 course is important in focusing students on an end-product of their studies in data science: a rewarding and satisfying career. From the beginning, students in DSC101 gain a lived experience of what constitutes both the practical and conceptual aspects of the working life of a data scientist.

The second aspect of DSC101 highlighted by an activity theory perspective is empowerment: the extent to which the activities and the tools used in those activities actually empower students to do something satisfying. Students in DSC101 should never complain: "When will we ever use this?" The answer is obvious from the nature of their activity in attempting to answer questions about real-world data with tools provided to them, or built by them.

6 Course design

The design of our DSC101 course centers around the data science life cycle and the activities it involves. The course starts with an introductory lecture on data science with two goals in mind. One is to give students a sense that data entails value; the other, that it is possible to make a difference, to influence outcomes, by leveraging value from the data. We introduce numerous interesting stories from a variety of fields, ranging from science, finance, meteorology, and sports to the Internet and e-commerce, on how insights can be obtained from data through models and analytical tools.
Of course, these stories also convey to students an idea of what constitutes data science, and how their activity, on raw data, with specific objectives, can transform that raw data into insightful outcomes through the use of appropriate tools. Then the data science life cycle is introduced, followed by the various parts of the cycle, including asking interesting questions from data, data collection, exploratory data analysis, modeling, and confirmatory data analysis.

To reflect the practical aspect of this course, and also due to its growing popularity in the data science community, we dedicate two weeks of lectures to R programming (Verzani 2008), which is the programming language used for instruction and student projects. There are many alternatives to R, but the free and open-source nature of R, together with a very large and diverse R user community, makes a relatively compelling case for including R as a basic programming language and data analysis tool. Through being inducted into the R ecosystem, students are exposed to a huge network of open data analytic resources and tools by learning the basics of R programming: it is not simply a useful and widely used tool they learn, but also a huge and diverse community of potential support. People who use R, write R packages, and provide instruction in, and support for, R come from a widely diverse collection of backgrounds, exposing beginning data science students to a vision of data science that cuts across numerous disciplines.
7 Topics in DSC101

As described earlier, the topics covered in our DSC101 are the individual parts of our data science life cycle. In particular, Section 7.1 corresponds to the "question" part, Section 7.3 to the "data" part, and Sections 7.4, 7.5, and 7.6 to the "analysis" part of the data science life cycle, respectively. In this section, we describe each of these topics in detail.
7.1 Asking questions

Asking informed questions, from data or given evidence, is one of the most crucial parts of the traditional sciences: it forms the start of a scientific investigation. On the other hand, it is one of the primary driving forces behind the recent explosive growth in data science applications. Imagine that an e-commerce vendor has collected huge user access data; what new business models can it generate? If a search engine has collected a large collection of searched keywords, how could such data be utilized? It is possible to use such data to optimize the selection of advertisements and their placement on a page, or even to improve the design of the search engine.

To paraphrase Brown and Keeley (2007, p. 3) in the context of a DSC101 course: questions about data require the person asking the question to act in response. By our questions, we are saying: I am curious; I want to know more. The questions exist to inform and provide direction (an objective) for all who hear them. The point of questions is that one needs help and focus in obtaining a deeper understanding and appreciation of what might be in the data. To inspire students to think about and appreciate the value of data, and to ask good questions, students are encouraged to ask questions of any data to which they may have access. As an example, in-class groups are formed among students to discuss what one could potentially do with large traffic data.
Suppose one is given traffic data of a city. The data includes about 30 million records of vehicles, each consisting of the starting point and destination of a trip, and a time stamp at each road during the trip. The same car may have multiple entries in the records. There are two cases: knowing or not knowing the auto plate. What can one do with such data?
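One objective that groups often converge on (see Section 5) is locating traffic bottlenecks. A minimal sketch of the aggregation involved, on hypothetical records of the form (plate, road, timestamp), is below; the data and function name are invented for illustration.

```python
from collections import Counter
from datetime import datetime

# Hypothetical miniature of the traffic data: one row per
# (plate, road, timestamp) observation during a trip.
records = [
    ("car1", "Main St",   "2019-05-09 08:02"),
    ("car2", "Main St",   "2019-05-09 08:15"),
    ("car3", "Main St",   "2019-05-09 08:40"),
    ("car1", "Bridge Rd", "2019-05-09 17:05"),
    ("car2", "Main St",   "2019-05-09 17:20"),
]

def traffic_by_road_hour(rows):
    """Count vehicle observations per (road, hour of day);
    heavily loaded cells are candidate bottlenecks."""
    counts = Counter()
    for _, road, ts in rows:
        hour = datetime.strptime(ts, "%Y-%m-%d %H:%M").hour
        counts[(road, hour)] += 1
    return counts

counts = traffic_by_road_hour(records)
print(counts.most_common(1))  # the busiest (road, hour) cell
```

On the full data the same grouping, done per day of week and normalized by road capacity, would start to answer the bottleneck question.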
7.2 R programming

R is chosen as the programming language for the course, recognizing the growing importance of R programming in data science as well as its great utility in modeling (modeling is offered as a senior-level undergraduate data science course at the University of Massachusetts Dartmouth). Topics covered comprise three parts.

• The first is programming language features. This includes data structures such as lists, vectors, arrays and matrices, and data frames; structured programming constructs such as loops, conditional statements, and functions; data and text manipulation tools (including regular expressions); and file I/O (including Excel spreadsheets).
• The second is the statistical aspect of R, which covers R functions to generate data of various distributions, and R functions for statistical tests.
• The third is R functions for graphics and visualization. As this is an elementary course in data science, only R functions and simple graphical tools related to basic plotting functionality are discussed.

To sharpen the programming skills of the students, very simple algorithms related to searching and text manipulation are introduced. Programming exercises are assigned as labs, and programming questions, such as analyzing program output and implementing a simple function, are included in the exams. Sample R code is provided for most of the examples, so that students can try R programming on their own and gain hands-on experience.
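The second part's workflow, generate data from a known distribution and then test a hypothesis about it, is language-neutral. The course does this with R's built-in simulation and testing functions; the sketch below reproduces the idea in Python with only the standard library, computing a Welch two-sample t statistic from first principles (the sample sizes and means are arbitrary choices).

```python
import random
import statistics

random.seed(0)

# Generate samples from two normal distributions with different means
a = [random.gauss(0.0, 1.0) for _ in range(200)]
b = [random.gauss(0.5, 1.0) for _ in range(200)]

def welch_t(x, y):
    """Welch's two-sample t statistic: difference of means divided by
    the standard error estimated from the two sample variances."""
    vx, vy = statistics.variance(x), statistics.variance(y)
    se = (vx / len(x) + vy / len(y)) ** 0.5
    return (statistics.mean(x) - statistics.mean(y)) / se

t = welch_t(a, b)
print(round(t, 2))  # a large |t| suggests the two means differ
```

In R the whole exercise collapses to `t.test(rnorm(200), rnorm(200, mean = 0.5))`, which is exactly the kind of one-liner the lectures dwell on.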
Data collection is an important aspect of data science. In DSC101, the idea of random sampling and sampling techniques such as simple random sampling and stratified sampling are introduced. To better appreciate the idea of random sampling, several types of misuse of sampling are discussed, including sampling from the wrong population, convenience sampling, judgement sampling, data cherry-picking, self-selection, and anecdotal examples. Each of these is discussed with a story, selected from the news or from the instructor's experience. Before a formal analysis of each story, time is allocated for students to think and to hold group discussions on whether there is anything potentially wrong in the story. Students are also encouraged to share their own examples. As practice, students are assigned a lab to collect auto sales data, including sales prices and ages for their favorite car model, and to judge whether their data collection suffers from any sampling bias. Such learning-by-doing practice may improve students' interest in the course.
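The contrast between simple random and stratified sampling can be made concrete in a few lines of code. The sketch below is a Python analogue of the in-class R exercises; the toy "population" is invented for illustration:

```python
import random

def srs(population, n, seed=0):
    """Simple random sample of size n, without replacement."""
    rng = random.Random(seed)
    return rng.sample(population, n)

def stratified(strata, n_per_stratum, seed=0):
    """Draw an equal-size simple random sample from each stratum."""
    rng = random.Random(seed)
    sample = []
    for stratum in strata:
        sample.extend(rng.sample(stratum, n_per_stratum))
    return sample

# A heterogeneous toy population: two strata with very different values.
young = [20 + i % 5 for i in range(100)]   # values near 22
old   = [60 + i % 5 for i in range(100)]   # values near 62
population = young + old

s1 = srs(population, 10)
s2 = stratified([young, old], 5)
# The stratified sample always contains 5 units from each stratum;
# an SRS of the same total size may happen to be unbalanced.
```

Repeating both draws many times and comparing the spread of the sample means, as in the course lab, shows the variance reduction that stratification buys on heterogeneous data.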
Exploratory data analysis (EDA) was pioneered by J. W. Tukey in the 1960s (Tukey 1977). It refers to the various things one would try out before a formal and often complicated data analysis, and is therefore often viewed as a preliminary data analysis. It is typically applied when one wishes to know more about the application domain; EDA often helps one gain a better sense of what the data look like, which may be suggestive in the choice of a model or a data transformation. Similarly, when one has data but does not have a well-defined question, exploring the data to discover patterns or regularities may inspire interesting questions. Of course, EDA may sometimes be sufficient by itself, if the question of interest is rather simple or the underlying pattern is salient enough. Common tasks in EDA include the following: descriptive and summary statistics, graphical visualization, data transformations, and clustering. We discuss each of these in the following.
Descriptive and summary statistics are very helpful in data analysis. From such statistics, one can often get a ball-park idea of the data distribution. They are also useful in presenting data or communicating results to other people, especially when graphical visualization is not possible. Three types of descriptive or summary statistics are introduced in DSC101. The first is measures of location in the distribution, including the mean, median, and mode, and the more general quantiles and percentiles. The second is measures of dispersion, including the variance and standard deviation. The third concerns the shape of the data distribution: a measure of asymmetry of the data, skewness, and a measure of the peakedness of the data, kurtosis.

7.4.2 Graphics and data visualization
Data visualization is an important part of EDA, and also a useful tool for communicating results. It is used more and more in the practice of data science; for example, one may see plots or charts in almost every issue of the New York Times and the Guardian newspaper, in its various country and international editions.

This part of the course starts with guidelines, or rules of thumb, for a useful visualization. Note that our focus is the visualization of data rather than of abstract concepts (Yan and Davis 2018); here one seeks to understand the underlying data or information by displaying aspects of the data. Then a collection of graphical tools is introduced, including basic tools such as bar, pie, and Pareto charts and their stacked or grouped versions; statistical graphing tools such as histograms, box plots, and stem-and-leaf plots; as well as tools suitable for the visualization of multivariate data. Some interesting data sets are used in introducing the graphical tools, for example the US crime and arrest data, and the US statewide mean January temperature for a given year and the mean during the last century. Students use the tools and example R code to visualize the data, then share what they observe from the graphs or other visualizations, and give interpretations. To better appreciate the effect of graphical visualization (Nolan and Perrett 2016), in-class discussions are held where students are given a data set, such as a multiway contingency table, and then tasked to design their own visualization; designs from different groups are then compared. This is a good opportunity for students to apply what they learn with creativity, and it greatly stimulates students' interest in the course. Indeed, quite a few students view this as the best part of the course.

For the visualization of multivariate data, tools such as bubble plots, Chernoff faces (Chernoff 1973), and radial plots are introduced. In particular, students find Chernoff faces interesting and intuitive, and that helps them to gain insights: for example, on US crime or political ideology by state. Principal component analysis is another tool introduced to visualize multivariate data and for dimension reduction.
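For two-dimensional data, the first principal component can even be computed in closed form, which makes the idea of projecting onto the direction of greatest variance easy to demonstrate. The sketch below is an illustrative Python analogue (the course itself works in R), with invented points that lie roughly along the line y = x:

```python
import math

def first_pc(points):
    """First principal component of 2-D data via the closed-form
    eigen-decomposition of the 2x2 covariance matrix."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Covariance matrix entries (population version).
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Largest eigenvalue of [[sxx, sxy], [sxy, syy]].
    tr, det = sxx + syy, sxx * syy - sxy ** 2
    lam = tr / 2 + math.sqrt(tr ** 2 / 4 - det)
    # Corresponding unit eigenvector (assumes sxy is nonzero).
    vx, vy = sxy, lam - sxx
    norm = math.hypot(vx, vy)
    return (vx / norm, vy / norm), lam

# Points spread mainly along the y = x direction (made up).
pts = [(0, 0), (1, 1.1), (2, 1.9), (3, 3.05), (4, 4.0)]
direction, variance = first_pc(pts)
# direction is close to (1, 1)/sqrt(2), the diagonal.
```

Projecting each point onto this direction gives the one-dimensional summary that PCA uses for dimension reduction.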
Feature engineering refers to the creation of new features from the data, or the combining or transforming of existing features into new ones that suitably represent or reveal interesting structures or patterns in the data. It is a task to which data scientists typically dedicate a major share of their time, and it is crucial to the success of many modeling tasks. Better features often lead to better results, more flexibility, and better interpretation of the results. While the entire world has been excited about the success of an emerging machine learning paradigm, deep learning (Hinton and Salakhutdinov 2006, LeCun 2015), at the automatic discovery of useful features from data, applications beyond image, speech, and natural language processing still rely heavily on feature engineering. As students in DSC101 are unlikely to have any prior data science experience, we only introduce the concept of feature engineering and focus on the easiest part: data transformation. Data transformation is needed when different features have drastically different numerical scales, or when the underlying pattern or regularity in the data becomes more salient after transformation. Topics discussed include Tukey's idea of "straightening the plot" (an idea that guides data transformation from human perception) (Tukey 1977) and the Box-Cox power transformation (Box and Cox 1964). Several transformations frequently used in practice are discussed. These include the logarithmic and square root transformations, data standardization to mean 0 and variance 1, linear scaling of the data to a range [a, b], and non-linear bucketing of the data (e.g., assign a numerical value 1 to income lower than 20,000, 2 for the range [20,000, 50,000), and so on).
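Each of these transformations is a one-liner in any language. A minimal Python sketch (the course itself uses R; the data values and bucket cut points follow the income example in the text, otherwise they are invented):

```python
import math

data = [1.0, 10.0, 100.0, 1000.0]

# Logarithmic transformation: compresses large numerical scales.
logged = [math.log10(x) for x in data]  # roughly [0, 1, 2, 3]

# Standardization to mean 0 and variance 1.
mean = sum(data) / len(data)
sd = (sum((x - mean) ** 2 for x in data) / len(data)) ** 0.5
standardized = [(x - mean) / sd for x in data]

# Linear scaling of the data to the range [a, b] = [0, 1].
lo, hi = min(data), max(data)
scaled = [(x - lo) / (hi - lo) for x in data]

# Non-linear bucketing of income, as in the example in the text.
def bucket(income):
    if income < 20000:
        return 1
    elif income < 50000:
        return 2
    return 3

buckets = [bucket(v) for v in [12000, 35000, 90000]]
```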
In practice, data are often heterogeneous. This is due possibly to spatial or temporal effects, or to differences in other characteristics (e.g., males and females often have very different lifestyles or shopping behaviors). Heterogeneity is especially common for big data. It is often desirable to divide the data so that data in the same subgroup are of a similar nature. One way to achieve this is via clustering. Three classical clustering algorithms are introduced: agglomerative, divisive, and K-means clustering (Aggarwal and Reddy 2013). The idea of each algorithm and its important properties are discussed. More advanced and modern clustering methods, such as model-based clustering (Fraley and Raftery 2002), spectral clustering (von Luxburg 2007), and cluster ensembles (Strehl and Ghosh 2003, Yan et al. 2013), are not discussed in lecture but may be used for course projects by students with adequate preparation in calculus and linear algebra.
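The idea behind K-means, alternately assigning points to the nearest center and moving each center to its cluster mean, fits in a few lines. A toy Python sketch for one-dimensional data (illustrative only; in class, students call existing R clustering functions rather than implementing the algorithm, and the data below are made up):

```python
def kmeans_1d(points, centers, iters=20):
    """Lloyd's algorithm for 1-D data: assign each point to its
    nearest center, then move each center to its cluster mean."""
    for _ in range(iters):
        clusters = [[] for _ in centers]
        for p in points:
            i = min(range(len(centers)), key=lambda j: abs(p - centers[j]))
            clusters[i].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

# Two well-separated groups: the centers should land near 2 and 10.
data = [1, 2, 3, 9, 10, 11]
centers, clusters = kmeans_1d(data, centers=[0.0, 5.0])
```

On this toy data the algorithm converges in two iterations; the heterogeneity of the data is exactly what makes the two subgroups recoverable.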
Simple linear regression is introduced both as a continuation of visualization, in the sense that the regression line is the line that is 'close' to most of the data points, and as a way to summarize data with a simple function. This leads to the concept of modeling. Example models are given that students are likely to have seen in their high school texts or in other courses. For a better appreciation of the concept, students are asked to give their own examples of models. Simple linear regression is formulated as a least-squares optimization problem, and the concept of R² as an indicator of the amount of variance explained by the model is introduced. The term regression is discussed using the classical father-son height data. Simple linear regression is naturally extended to multiple regression, using the auto miles per gallon (MPG) data from the UC Irvine Machine Learning Repository. Before discussing this example, students are asked to guess which factors are important to the gas mileage of a car; after seeing the regression analysis results, students better appreciate the value of data analysis. Relevant R functions for linear regression are introduced, along with discussion of how to read the regression output. Depending on the preparation of students, it may be possible to extend the discussion to multiple linear regression as recommended by the revised GAISE College Report (2016).

7.6 Confirmatory data analysis and hypothesis testing

In the confirmatory data analysis part of DSC101, the statistical framework of hypothesis testing is introduced. There have been many controversies over the use of p-values in recent years (see, e.g., Cumming 2013). However, p-values are still widely used in industry; for example, many vendors use A/B testing and p-values for the comparison of alternative models or strategies.
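Returning briefly to simple linear regression: the least-squares fit and R² described above have a closed form in the one-predictor case, which students can verify directly. A minimal Python sketch (the course uses R's built-in regression functions; the data below are invented, a noisy line y ≈ 1 + 2x):

```python
def simple_linreg(xs, ys):
    """Least-squares fit of y = a + b*x, plus R-squared."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx                      # slope
    a = my - b * mx                    # intercept
    # R^2: proportion of the variance in y explained by the model.
    ss_tot = sum((y - my) ** 2 for y in ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    r2 = 1 - ss_res / ss_tot
    return a, b, r2

xs = [0, 1, 2, 3, 4]
ys = [1.1, 2.9, 5.2, 6.8, 9.0]
a, b, r2 = simple_linreg(xs, ys)
# The fitted slope and intercept are close to 2 and 1, with r2 near 1.
```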
The concept of hypothesis testing is often challenging to students, as it represents a different way of reasoning compared to logical deduction, with which they are likely more familiar. To help students, two analogies are introduced and analyzed, one being a court trial and the other proof by contradiction. This greatly helps students' understanding. An example from industry is used to explain why hypothesis testing is useful, e.g., an A/B test deciding whether a new strategy or model does better than the existing one. Several students expressed the view that they liked this part of the course, as it seems surprisingly useful for many real-world problems.

As can be seen, a big part of the course overlaps with a typical statistics course at a similar level. We attribute this to the intimate relationship between data science and statistics; we would not expect a data science course to be very different from a statistics course. That said, compared to related statistics courses at institutions with which the authors are familiar (there is not a similar statistics course at our institution), there are several major differences, apart from topics apparently missing in these statistics courses (i.e., topics on biases in sampling, feature engineering, visualization of multivariate data, PCA, and clustering). Similar statistics courses would not be structured by the (data science) life cycle, whereas the main theme of our course is leveraging data for insights, conclusions, models, or data products. In a similar statistics course, there would not be any motivating lectures on leveraging value from data, nor any discussion of the data science life cycle in the form of carefully chosen examples or in-class discussions. There would not be so much discussion of visualization in a typical statistics course. Also, the data for projects would likely be given, instead of asking students to find or scrape data themselves.
Potentially, there may also be differences in execution even when the schedules look similar. For example, we use many examples from industry (including some from the authors' past work), which may not be the case for a typical statistics course.
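The A/B-testing logic discussed above can also be demonstrated without any distributional theory via a permutation test: if the new strategy makes no difference, relabeling the observations should rarely produce a group gap as large as the observed one. A Python sketch with made-up per-user scores (illustrative only; in class the t-test and R's built-in test functions are used):

```python
import random

def perm_test(a, b, n_perm=2000, seed=1):
    """One-sided permutation test for mean(b) - mean(a) > 0."""
    observed = sum(b) / len(b) - sum(a) / len(a)
    pooled = a + b
    rng = random.Random(seed)
    count = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        pa, pb = pooled[:len(a)], pooled[len(a):]
        if sum(pb) / len(pb) - sum(pa) / len(pa) >= observed:
            count += 1
    return observed, count / n_perm

# Strategy A vs strategy B: per-user scores under each strategy.
a = [0.42, 0.51, 0.48, 0.45, 0.50, 0.47]
b = [0.55, 0.60, 0.58, 0.62, 0.54, 0.59]
diff, p_value = perm_test(a, b)
# A small p-value suggests B's advantage is unlikely to be chance.
```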
Section 7 discusses topics for lectures, yet there are other components of the course not touched upon there, namely labs or course projects, and presentations. We briefly discuss these here; for more details, we refer the reader to the sample syllabus in the appendix. (An A/B test is the application of hypothesis testing to compare the effectiveness of two alternatives, one termed "A" and the other "B". It is used widely in industry to compare alternative models or strategies.)

8.1 Labs and course projects

An important part of a data science course is projects. As DSC101 is offered mostly to first-year students, and students typically do not have prior exposure to any programming language, the course project takes the form of several small labs. Each lab touches a major topic in the course, and students are typically given two weeks to work on each project. Students write a lab report describing the project: where and how the data were collected, a description of the data analysis procedure, and conclusions, if any. The R code is required to be submitted with the lab report. This is a critically important part of DSC101 because it introduces students to an essential characteristic of a data science professional: the ability to clearly communicate the results of data analysis (see, e.g., Sisto 2009, O'Neil and Schutt 2013).

The first project is mainly on data collection. Students are required to find data online or from other sources, and then conduct some simple exploratory analysis. One example is to download and extract auto sales prices for a particular car model from a popular auto sales website, cars.com, for cars of different years. The average prices are calculated for cars of the same year, and then a price-year plot is produced. The second example is from kaggle.com, which hosts historical records of airplane crashes since 1908.
Students download and process the data, then visualize airplane crashes by year, airline, and aircraft model. In terms of empowerment, some students became very excited about the notion of data analysis for insights, and started analyzing data related to their own interests. For example, one student chose to analyze data on basketball games, and observed the rise of 3-point shots in recent years; he also made interesting predictions about the strategy of future basketball games.

The second project is to read an article on data analysis. One example is an analysis of the swimming competitions in the Rio Olympics. Two interesting phenomena were observed, namely a noted difference in time between the back and forth laps, and an observed disadvantage for athletes assigned to lower-numbered lanes. Students are required to write a report on how the author uses the data and carries out the analysis to reach the conclusions, and are asked whether there are any biases in the way the author designed the study. The second part of the project is to have students find two examples of the misuse of sampling techniques in collecting data, from recent news or articles.

The third project is about descriptive statistics and sampling techniques. Several data sets are given, and students are asked to compute the skewness and kurtosis. The second part is about sampling techniques, comparing simple random sampling (SRS) and stratified sampling. Students find or generate their own data set that is 'heterogeneous', and then compare SRS and stratified sampling on the variation in the sample means if the sampling is repeated 100 times.

The fourth project is the visualization of the US population by state, for Census 2000 and 2010, respectively. In particular, students are required to produce an appropriate heatmap on the US map, and then a bubble plot of the rate of change in population on the map.

The last project is about the application of different clustering methods, including K-means, agglomerative, and divisive clustering. Students produce dendrograms and compare the results. For this project, students are required to give a short presentation of about 10 minutes, including questions and answers. As stated above, an important part of a data scientist's job is to communicate a problem of interest, or to present analyses, to other people. We make the presentation of projects and in-class discussion, in addition to written reports, an important part of the course.
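The skewness and kurtosis computations in the third project reduce to a few lines of code. A minimal Python sketch using the common moment-based definitions (the course itself uses R; the data set below is invented for illustration):

```python
import statistics as st

data = [2, 4, 4, 4, 5, 5, 7, 9]

# Location: mean and median.
mean = st.mean(data)
median = st.median(data)
# Dispersion: variance and standard deviation (population versions).
var = st.pvariance(data)
sd = st.pstdev(data)
# Shape: moment-based skewness and kurtosis.
n = len(data)
skewness = sum((x - mean) ** 3 for x in data) / n / sd ** 3
kurtosis = sum((x - mean) ** 4 for x in data) / n / sd ** 4
# Positive skewness here reflects the long right tail of the data.
```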
The students’ performance in the course is assessed in all course components,including quizzes, labs, in-class discussion, a midterm, a final exam. Also thereare two in-class practice sessions. The idea is to ensure students go throughthe relevant course materials and apply these to problem solving. The instruc-tor can observe students performance and provide help on any potential issuesstudents may have. This is allocated to two key topics of the course, R pro-gramming and hypothesis testing. The grade breakdown in a typical semester isas follows: quizzes–10%; in class discussion, practice or presentation–20%; labs–20%; midterm–20%; final–30%. Team-based learning is incorporated in in-classdiscussion or presentation, or labs (students can choose to do it individually, oras a team).
We have been teaching this course since Fall 2015 (it is offered every Fall). Typically about 40% of the students are data science majors, with the others from a very diverse list of majors, such as mathematics, computer science, biology, electrical engineering, mechanical engineering, accounting, and management information systems (MIS). This is not a service course.

We do not offer a similar introductory statistics course at the University of Massachusetts Dartmouth. At another institution, one author taught a similar course, Elementary Statistics. In DSC101, the students are more engaged. Based on our observations and feedback from students, we attribute that to the following. This course is better motivated, with many realistic applications. The course requires more hands-on work from students; for example, students need to try out simple examples using R programming during class. The in-class discussions use topics students are familiar with and to which they can apply their creativity. Finally, students have more freedom in choosing their projects using real data. Feedback from students suggests that they generally like the in-class discussions, the hands-on exercises on examples discussed in class, the exam problem on data visualization, and also the freedom in choosing problems for their projects.
9 Conclusion
We have briefly introduced a first course in data science, offered at the University of Massachusetts Dartmouth since Fall 2015. To facilitate our discussion, we clarified our viewpoint on what data science is, and introduced the notion that data entail value yet to be explored. Our design of the course is both principled and practical. The design centers around the data science life cycle: topics covered in the course correspond roughly to individual pieces of the life cycle, that is, data collection, the generation of a study question, data analysis, how to draw conclusions, and how to communicate results. As a first course in data science, our focus is on motivation and concepts, and the formal analysis part is limited to exploratory data analysis, linear regression, and hypothesis testing. The practical aspect of the course is reflected in several ways. Our design of the course incorporates many elements from current data science practice. We use the popular R programming language for instruction, students' hands-on exercises, and projects (but we also encourage the use of Python for projects). Our examples and the data used for course projects are mostly from real-world applications. In terms of empowerment, the course has been fairly successful: at the conclusion of the course, students can comfortably carry out elementary data analysis using R and the tools introduced in class, on varied realistic, and real, data sets. Some students even started analyzing data sets related to their own interests, for example, basketball and baseball game data, the Zillow.com real estate data, etc. One thing worth noting is that this course has managed to attract several students from other majors to our data science program. We hope that our DSC101 course can benefit educators who are new to the field, as well as students who are interested in data science.
10 Appendix
A sample weekly schedule of DSC101 can be seen in Table 1. The class meets twice a week, for a 75-minute session each time.

This weekly schedule was designed by statistics faculty. If a computer science faculty member were to teach such a course, they could still use the data science life cycle to structure the course. They could replace several parts of the course (e.g., those with a statistical flavor) with parts of a computer science flavor, and possibly focus more on the implementation aspects of data science. For example, they could teach Python instead of R, given that Python is used more for tasks such as the processing of text and unstructured data (both R and Python are popular programming languages in data science practice). They could structure the data visualization part around the implementation of visualization and visual analytics from a human-computer interaction (HCI) perspective. They could replace topics such as PCA with data mining topics such as association analysis or frequent itemset mining. For data collection, they might focus more on practical sampling algorithms (possibly in a big data setting), or on Python tools for data scraping, for example.
Week  Topics
1     Introduction to data science; The data science life cycle
2     R programming
3     Concept of sampling and potential bias; Simple random and stratified sampling
4     Descriptive and summary statistics; Data visualization (principles and basics)
5     Data visualization (statistics); Data visualization (bubbles, maps, etc.)
6     Data transformations and feature engineering; Midterm
7     Visualization of multivariate data; Principal component analysis
8     Agglomerative and divisive clustering; K-means clustering
9     Concept of modeling; Simple linear regression
10    Multiple regression; Introduction to hypothesis testing
11    t-test; Two-sample and A/B tests
12    In-class practice of hypothesis testing problems; Project presentation
13    Final exam

Table 1: A sample weekly schedule of topics covered in DSC101.
Acknowledgements
We thank the editors, the associate editors, and anonymous reviewers for their helpful comments and suggestions.
References

[1] Aggarwal, C. C., and Reddy, C. K. (2013), Data Clustering: Algorithms and Applications, Chapman and Hall.

[2] American Statistical Association (2016), "Guidelines for Assessment and Instruction in Statistics Education (GAISE) College Report", available at .

[3] Andersen, M. R., Simonsen, U., Uldbjerg, N., Aalkjaer, C., and Stender, S. (2009), "Smoking cessation early in pregnancy and birth weight, length, head circumference, and endothelial nitric oxide synthase activity in umbilical and chorionic vessels: an observational study of healthy singleton pregnancies", Circulation, 119, 857-864.

[4] Baumer, B. (2015), "A data science course for undergraduates: Thinking with data", American Statistician, 69, 334-342.

[5] Box, G., and Cox, D. R. (1964), "An analysis of transformations", Journal of the Royal Statistical Society, Series B, 26, 211-252.

[6] Breiman, L. (2001), "Statistical modeling: The two cultures", Statistical Science, 16, 199-231.

[7] Browne, M. N., and Keeley, S. M. (2007), Asking the Right Questions (11th Edition), Pearson/Prentice Hall.

[8] Chernoff, H. (1973), "The use of faces to represent points in k-dimensional space graphically", Journal of the American Statistical Association, 68, 361-368.

[9] Cleveland, W. S. (2001), "Data science: an action plan for expanding the technical areas of the field of statistics", International Statistical Review, 69(1), 21-26.

[10] Columbus, L. (2017), "IBM predicts demand for data scientists will soar 28% by 2020", , May 13, 2017.

[11] Columbus, L. (2018), "Data scientist is the best job in America according to Glassdoor's 2018 rankings", , Jan 29, 2018.

[12] Cumming, G. (2013), Understanding the New Statistics: Effect Sizes, Confidence Intervals, and Meta-Analysis, Routledge.

[13] Davenport, T. H., and Patil, D. J. (2012), "Data scientist: The sexiest job of the 21st century", Harvard Business Review, October issue, 2012.

[14] Davydov, V., Zinchenko, V., and Talyzina, N. (1982), "The problem of activity in the works of A. N. Leontiev", Soviet Psychology, 21, 31-42.

[15] Donoho, D. (2015), "50 years of data science", Tukey Centennial Workshop, Princeton, NJ. Available at http://courses.csail.mit.edu/18.337/2015/docs/50YearsDataScience.pdf.

[16] Donoho, D. (2017), "50 years of data science", Journal of Computational and Graphical Statistics, 26(4), 745-766.

[17] Engestrom, Y. (1991), "Activity theory and individual and social transformation", Multidisciplinary Newsletter for Activity Theory, 7/8, 6-17.

[18] Engestrom, Y. (1999), "Activity theory and individual and social transformation", in Y. Engestrom, R. Miettinen, and R.-L. Punamaki (Eds.), Perspectives on Activity Theory (pp. 19-38), Cambridge: Cambridge University Press.

[19] Engestrom, Y. (2000), "Activity theory as a framework for analyzing and redesigning work", Ergonomics, 43(7), 960-974.

[20] Escobedo-Land, A., and Kim, A. Y. (2015), "OkCupid data for introductory statistics and data science courses", Journal of Statistics Education, 23, 1-25.

[21] Fraley, C., and Raftery, A. (2002), "Model-based clustering, discriminant analysis, and density estimation", Journal of the American Statistical Association, 97, 611-631.

[22] Grimshaw, S. (2015), "A framework for infusing authentic data experiences within statistics courses", The American Statistician, 69(4), 307-314.

[23] Guo, P. J. (2012), "Software Tools to Facilitate Research Programming", Ph.D. Dissertation, Stanford University.

[24] Hardin, J., Hoerl, R., Horton, N. J., Nolan, D., Baumer, B., Hall-Holt, O., Murrell, P., Peng, R., Roback, P., Temple Lang, D., and Ward, M. D. (2015), "Data science in statistics curricula: Preparing students to 'think with data'", The American Statistician, 69, 343-353.

[25] Hinton, G., and Salakhutdinov, R. (2006), "Reducing the dimensionality of data with neural networks", Science, 313, 504-507.

[26] Horton, N. J., Baumer, B., and Wickham, H. (2015), "Setting the stage for data science: Integration of data management skills in introductory and second courses in statistics", CHANCE, 28, 40-50.

[27] Langer, A. M. (2012), Guide to Software Development: Designing and Managing the Life Cycle, Springer.

[28] Lave, J. (1988), Cognition in Practice: Mind, Mathematics, and Culture in Everyday Life, Cambridge University Press.

[29] Lave, J., and Wenger, E. (1991), Situated Learning: Legitimate Peripheral Participation, Cambridge University Press.

[30] LeCun, Y., Bengio, Y., and Hinton, G. (2015), "Deep learning", Nature, 521, 436-444.

[31] Leontiev, A. N. (1978), Activity, Consciousness, and Personality (originally published in Russian in 1975), Prentice-Hall.

[32] Manyika, J., Chui, M., Brown, B., Bughin, J., Dobbs, R., Roxburgh, C., and Byers, A. H. (2011), Big Data: The Next Frontier for Innovation, Competition, and Productivity, McKinsey Global Institute.

[33] Mathews, T. J., and MacDorman, M. (2013), "Infant mortality statistics from the 2010 period linked birth/infant death data set", National Vital Statistics Reports, 62, 1-26.

[34] Nardi, B. (1996), Context and Consciousness: Activity Theory and Human-Computer Interaction, Cambridge, MA: MIT Press.

[35] Nolan, D., and Perrett, J. (2016), "Teaching and Learning Data Visualization: Ideas and Assignments", American Statistician, 70(3), 260-269.

[36] Nolan, D., and Speed, T. (2000), Stat Labs: Mathematical Statistics Through Applications, New York: Springer-Verlag.

[37] Nolan, D., and Temple Lang, D. (2015), Data Science in R: A Case Studies Approach to Computational Reasoning and Problem Solving, Chapman and Hall/CRC.

[38] O'Neil, C., and Schutt, R. (2013), Doing Data Science: Straight Talk from the Frontline, O'Reilly Media.

[39] Price, E., De Leone, C., and Lasry, N. (2010), "Comparing educational tools using activity theory: Clickers and flashcards", in AIP Conference Proceedings (Vol. 1289, No. 1, pp. 265-268), AIP.

[40] PwC (2015), "What's next for the data science and analytics job market?", available at https://pwc.to/2FL8GEG.

[41] Raeithel, A. (1991), "Semiotic self-regularization and work: An activity theoretical foundation of design", in Floyd, R. et al., Software Development and Reality Construction, Springer Verlag.

[42] Simpson, W. J. (1957), "A preliminary report on cigarette smoking and the incidence of prematurity", American Journal of Obstetrics and Gynecology, 73, 808-815.

[43] Sisto, M. (2009), "Can you explain that in plain English? Making statistics group projects work in a multicultural setting", Journal of Statistics Education, 17, 1-11.

[44] Strehl, A., and Ghosh, J. (2003), "Cluster ensembles—a knowledge reuse framework for combining multiple partitions", The Journal of Machine Learning Research, 3, 583-617.

[45] The National Academies of Sciences, Engineering and Medicine Consensus Report (2018), "Data Science for Undergraduates: Opportunities and Options", available at https://nas.edu/envisioningds.

[46] Tishkovskaya, S., and Lancaster, G. A. (2012), "Statistical education in the 21st century: A review of challenges, teaching innovations and strategies for reform", Journal of Statistics Education, 23, 1-56.

[47] Tukey, J. W. (1977), Exploratory Data Analysis, Addison-Wesley.

[48] Verzani, J. (2008), "Using R in introductory statistics courses with the pmg graphical user interface", Journal of Statistics Education, 16, 1-17.

[49] von Luxburg, U. (2007), "A tutorial on spectral clustering", Statistics and Computing, 17, 395-416.

[50] Wickham, H., and Grolemund, G. (2016), R for Data Science: Import, Tidy, Transform, Visualize, and Model Data, O'Reilly Media.

[51] Wilcox, A. (2001), "On the importance—and the unimportance—of birthweight", International Journal of Epidemiology, 30, 1233-1241.

[52] Wild, C. J., and Pfannkuch, M. (1999), "Statistical thinking in empirical enquiry", International Statistical Review, 67(3), 223-265.

[53] Wu, C.-F. J. (1997), "Statistics = Data Science?", H. C. Carver Professorship Lecture, The University of Michigan, Ann Arbor. Available at .

[54] Wu, C.-F. J. (1998), "Statistics = Data Science?", P. C. Mahalanobis Memorial Lecture, The Indian Statistical Institute.

[55] Yan, D., Chen, A., and Jordan, M. I. (2013), "Cluster Forests", Computational Statistics and Data Analysis, 66, 178-192.

[56] Yan, D., and Davis, G. E. (2018), "The turtleback diagram for conditional probability", The Open Journal of Statistics, 8(4), 684-705.

[57] Yerushalmy, J. (1964), "Mother's cigarette smoking and survival of infant", American Journal of Obstetrics and Gynecology, 88, 505-518.

[58] Yerushalmy, J. (1971), "The relationship of parents' cigarette smoking to outcome of pregnancy—implications as to the problem of inferring causation from observed associations",