Interleaving Computational and Inferential Thinking: Data Science for Undergraduates at Berkeley
Ani Adhikari† John DeNero∗ Michael I. Jordan∗†
∗Department of Electrical Engineering and Computer Sciences
†Department of Statistics
University of California, Berkeley
Abstract
The undergraduate data science curriculum at the University of California, Berkeley is anchored in five new courses that emphasize computational thinking, inferential thinking, and working on real-world problems. We believe that interleaving these elements within our core courses is essential to preparing students to engage in data-driven inquiry at the scale that contemporary scientific and industrial applications demand. This new curriculum is already reshaping the undergraduate experience at Berkeley, where these courses have become some of the most popular on campus and have led to surging interest in a new undergraduate major and minor program in data science.
One of the most challenging—but ultimately most rewarding—ways to develop an undergraduate curriculum is to aim for the grand conceptual achievements of a field, stripping away the inessentials and conveying the core ideas in a way that reveals their beauty, their universality, and their contemporary relevance. This has been done for many fields in the sciences and humanities, and the introductory courses in these fields have stabilized and stood the test of time. Are we in an era in which this can be done for data science?

We have approached this question with cautious optimism at Berkeley. The caution stems from a belief that the appropriate scope for data science education is a very broad one. Indeed, we view data science as a form of liberal arts for the 21st century—a mingling of the computational and inferential disciplines that flowered during the 20th century and an approach to science and technology that permits empirical investigations at unprecedented scale and scope. Given this vast remit, the challenge of perceiving and distilling a curricular core seems daunting at best.

And yet we are optimistic. We view data science as a phenomenon that took shape slowly and steadily over the past century, with the flowering of its various parts arising from a common root. Indeed, during the gestation of the computational and inferential disciplines at the beginning of the 20th century, individuals such as John von Neumann, Andrey Kolmogorov, Alan Turing, Jerzy Neyman, Norbert Wiener, and Abraham Wald blended the deductive and inductive traditions of mathematics to devise rigorous formulations of concepts such as “algorithm,” “probability,” “inference,” “feedback,” “uncertainty,” “model,” and “risk.” These formulations provided new ways to think about data and its role in scientific inquiry. The scope of these developments was further broadened in the hands of individuals such as David Blackwell, Claude Shannon, and Herbert Robbins, whose work led to new perspectives on economics, communications, and psychology.

As the 20th century progressed, the unity began to be less apparent. From the original foundations, major new academic disciplines emerged—computer science, control theory, information theory, signal processing, and mathematical statistics—each focused on particular aspects of the overall set of challenges associated with information, inference, and decisions. Whereas von Neumann, Kolmogorov, et al. would likely have resisted being confined to a single one of these disciplinary labels, subsequent researchers have generally pursued their careers entirely within a single discipline.

Data science has brought the original threads back together. Data science focuses on real-world problems involving data and decisions. That these problems are not generally the province of a single discipline is something that students are prepared to accept without much debate.
They can appreciate the need to go beyond the mere processing of data and call forth the broader set of ideas: the specification of inferential goals, the development of models that aim to capture the way in which data may have arisen, the crafting of algorithms that are responsive to the models and goals, an understanding of the feedback mechanisms that affect the data and the interpretation of the results, concern about uncertainty and risk, and concern about the human implications of automated data analysis and decision-making.

At Berkeley we have coined a phrase—“computational thinking and inferential thinking”—to capture one important aspect of our vision for a data science curriculum. “Computational thinking” is the goal of a modern introductory class in computer science, where the focus is on notions of abstraction, modularity, and efficiency. “Inferential thinking” is the goal of a modern introductory class in statistics, with a focus on ideas such as populations, sampling, Bayes' theorem, causality, and robustness. Placing the two in juxtaposition brings together many of the conceptual achievements of the past century referred to earlier.

Moreover, such a juxtaposition profits from the complementary problem-solving styles associated with computer science and statistics. Computer science has a “builder” spirit associated with it. Students who write computer programs feel empowered. They are not merely learning a formalism; they are creating working artifacts. Statistics, on the other hand, embodies a “collaborator” spirit. Statisticians learn to embed themselves in teams along with domain experts and contribute to the conceptual flow of a project. Putting these two styles together is a natural, desirable feature of a data science curriculum that aims to blend computation and inference.

In the design of our curriculum, we have also included “real-world implications” as a third foundational leg, to highlight the fact that while data science may have its roots in mathematical formalism, it is crucially a real-world enterprise. Here too there are important historical precursors that have inspired us. Individuals such as John Tukey and Leo Breiman, trained as mathematicians, came to emphasize the open-ended, exploratory nature of data analysis, and the necessity of trying things out on real-world data. This perspective complemented the formal mathematical perspective that dominated academic research communities, bringing difficult-to-formalize, but essential, notions such as visualization, interpretability, and criticism to the fore. Another key historical reference is the database researcher Jim Gray, whose point of departure was the deductive underpinnings of computer science, but whose eventual contributions included systems for storing, indexing, and querying massive collections of data, making it possible to base scientific and technological inquiry directly on data. Finally, another powerful historical thread that has influenced us comes from social science, where the contextual nature of data and empirical investigation is a major theme. A particular inspiration has been the writing of Ursula Franklin, whose wide-ranging commentary on technology as a complex system is exemplified by the following quote: “Technologies are developed and used within a particular social, economic, and political context. They arise out of a social structure, they are grafted on to it, and they may reinforce it or destroy it, often in ways that are neither foreseen nor foreseeable” (Franklin, 1999).
We have accordingly aimed to strike a tone of respect for consequences, foreseen and unforeseen, in our teaching of data science.

There was indeed an interesting unforeseen consequence that arose when we first began to teach the new courses. We found that whatever the notion of “real world” might mean for us, students had their own ideas. Students come to our curriculum with their own questions and passions, providing a vivid example of the notion of “context.” We came to realize that “real world” is best viewed as a student-centric concept—it can be whatever a given student wishes for it to be. A data science education can aim at empowering students not merely to solve others' problems, but to solve their own problems. A data science curriculum can empower them to find data that are relevant to their questions and to provide convincing analyses of those data. The word “convincing” is important here—a good data scientist should be able to convince not only themselves of an analysis, but others as well.

In short, our learning goals for the new curriculum were multi-fold, and avidly cross-disciplinary: we wanted students to understand how to formulate meaningful inferential problems, collect data relevant to those problems, build data analysis pipelines that allow problems to be solved at a range of scales, carry out analyses that are convincing, and make assertions or policy recommendations that carry weight. Moreover, throughout this process we wanted students to be aware of the social, cultural, and ethical context of the problems that they were formulating and aiming to solve, and we wanted to empower students to pursue their own unique perspectives as data scientists. These goals echo many of those proposed in recent efforts to reform undergraduate statistics curricula (Cobb, 2015). They also have some elements in common with proposals that build new data science curricula by drawing carefully from existing curricula in computer science, statistics, and mathematics, while providing experience with real-world data (De Veaux et al., 2017). We particularly align with the latter authors when they state, “We believe that many of the courses traditionally found in computer science, statistics, and mathematics offerings should be redesigned for the data science major in the interests of efficiency and the potential synergy that integrated courses would offer.” A thoroughgoing redesign is precisely what we have engaged in at Berkeley.

To turn these high-level desiderata into an actual class sequence, we found it helpful to start from scratch. We focused first on first-year students, and we avoided making any strong assumptions about mathematical background or programming skills. Our thought was that if “computational thinking and inferential thinking” is a powerful new force on the academic landscape, then its power should be evident without a great deal of background or formalism. Later courses then reinforce and expand students' understanding by revisiting many of the same problems encountered in their first course, using more developed mathematical and computational tools.
Vignettes
Let us give a concrete example of what it means to teach “computational thinking and inferential thinking” in an interlaced fashion. This particular example is drawn from our freshman-level course (Data 8, which we discuss in more detail below).

In general, concrete examples helped to shape our design of data science syllabi. Rather than simply trying to assemble a syllabus as a mash-up of computer science topics and statistics topics, we found it more helpful to focus on specific problems, and bring computing and statistics to bear jointly in the solution of each problem.

Consider the problem of deciding whether the juries in a given city are representative of the population of the city. In particular, suppose that we have two columns of numbers, one listing the values of a certain demographic measurement for a set of jurors and the other listing the values of the same measurement for a census of the population. For concreteness, suppose that we have two columns of numbers with 100 elements in each column. These columns will surely differ in various ways, but are these differences systematic, and reflective of some form of bias, or would the difference be expected by chance alone?

Rather than heading immediately down the traditional path of means, standard deviations, and t-tests, we treat this problem as an opportunity to focus on “randomness.” Recalling that we are at the freshman level and that we are not assuming knowledge of things like probability distributions or independence, we introduce the histogram as a stand-in for the notion of a distribution. This requires an assumption, one that we make explicit: we assume that the data in each column are exchangeable—meaning that the ordering of the points in the column is arbitrary. From this assumption it is reasonable to reduce a column of numbers to a histogram, a visualizable data structure that records the proportion of the overall data set that falls in each bin. Using the same bins for the two columns, we next introduce the idea of comparing two histograms. To do this, we simply take the difference between the two values in each bin, compute the absolute value, and sum up these absolute values across the bins. A graduate-level researcher will recognize this sum as the variation distance, and may be surprised that we are introducing such an advanced concept in a freshman course, in lieu of means and standard deviations. But our students don't seem to find this concept difficult or unnatural.

For a particular set of data, suppose that the variation distance is 0.7. Could this number have arisen by chance alone?

We're at a key juncture in inferential thinking. The language of “could have arisen” suggests that we're asking a question that is not to be answered by mere inspection of the data. Some thinking is needed. We introduce the idea of a hypothetical—a “null hypothesis” that there is actually no difference between the jurors and the population. Entertaining this null hypothesis as a thought experiment, we can simply lump all of the data together into a single column of 200 numbers. Recalling our exchangeability assumption, we are also free to permute this column—the order doesn't matter. For each permutation, we then split the column into two new columns of 100 numbers each—for example, we take the first 100 numbers and the second 100 numbers. We then form two histograms from these columns and compute the variation distance.
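In code, the entire procedure is short. Below is a minimal sketch using NumPy; the function names, the number of repetitions, and the choice of bins are our own illustrative stand-ins, not the course's materials:

```python
import numpy as np

def variation_distance(col_a, col_b, bins):
    """Sum over bins of |proportion_a - proportion_b|, as described above.
    (Some treatments include a factor of 1/2; we follow the text here.)"""
    prop_a = np.histogram(col_a, bins=bins)[0] / len(col_a)
    prop_b = np.histogram(col_b, bins=bins)[0] / len(col_b)
    return np.sum(np.abs(prop_a - prop_b))

def null_distances(jurors, census, bins, reps=10_000, seed=0):
    """Simulate 'by chance alone': pool, permute, split, and re-measure."""
    rng = np.random.default_rng(seed)
    pooled = np.concatenate([jurors, census])
    n = len(jurors)
    distances = np.empty(reps)
    for i in range(reps):
        shuffled = rng.permutation(pooled)  # exchangeability lets us do this
        distances[i] = variation_distance(shuffled[:n], shuffled[n:], bins)
    return distances
```

The proportion of simulated distances at least as large as the observed 0.7, for example np.mean(distances >= 0.7), then plays the role of an empirical p-value.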
What we are doing here is simulating the randomness that we would expect if juries are selected in a way that accords with our thought experiment. We've operationalized the notion of “by chance alone.”

This process yields a collection of numbers, from which we form a new histogram. We give this histogram a name—it is the sampling distribution. It captures differences between histograms due to chance alone. We now reconsider that value of 0.7. If it is in the center of the sampling distribution, then it would have been expected under the null hypothesis. We have no reason to reject the null hypothesis. If, on the other hand, it is in the tail of the histogram, then it is “surprising.” Although we could decide that something surprising has happened, it is perhaps more plausible to consider that we were in fact wrong to tentatively assume that there was no difference between the juries and the population. We reject the null hypothesis.

This line of argument has introduced the idea of a permutation test, an idea that goes back several decades (Fisher, 1935; Pitman, 1937). The test embodies core inferential notions of exchangeability, a counterfactual null hypothesis, random sampling, and the logic of “surprise.” The concreteness has the virtue that these core notions are laid bare, not cluttered by ideas such as standard deviations, Gaussian distributions, or (horrors) asymptotics.

The concreteness also has the virtue that an inquiring freshman mind can immediately see various weaknesses and hidden assumptions in our setup. Caveats are needed in interpreting the results of such a procedure. Thus, with little prompting, students are able to take a critical perspective on the ideas that they have learned, particularly when those ideas are confronted with the messiness of the real world. Moreover, they can take their new machinery out into the real world—working with data sets and scientific questions that they find interesting—and consider those caveats in a setting where there are consequences.

Where is the computational thinking in this? One possible answer is that it is present throughout, in that we ask students to work with real computer code (Python in our case) as they form histograms, compute variation distances, and form sampling distributions. Understanding the mechanics of the program in which the inferential process is expressed provides a solid foundation for conceptual understanding.

There is also an opportunity to relay a deeper lesson about abstraction. We ask students whether a similar analysis could be carried out if the census were not available, and instead they had only the relative frequencies in the city population for various ranges of demographic values. Instead of comparing two columns of numbers describing individuals, they now must compare a column describing individual jurors with a table describing population frequencies. Students can affirm for themselves that the same reasoning applies by identifying that histograms are being compared in their hypothesis test, rather than columns of numbers, and that a histogram can be produced from either a column of numbers or a table of bin frequencies. A small change to the simulation, motivated by the idea that the meaning of a histogram is independent of how it was computed, allows this inferential technique to apply to a new data condition.
Similar reasoning generalizes the technique to categorical variables, where bar graphs take the place of histograms and sound judgment can be made about whether a difference exists between two groups, without ever mentioning chi-squared distributions.

There is also an opportunity to teach deeper computational concepts here. How, from a computational point of view, does one (randomly) permute a list of numbers? One possibility is to randomly swap each pair of numbers. But we note a key problem: the computational complexity of this algorithm is quadratic in the length of the list. Such an algorithm will not perform well on large-scale problems. Is there a better alternative? The answer is “yes,” and it is given by Knuth's algorithm: walk through the list in order and, at each position k, swap the kth element with an element chosen uniformly at random from among the first k (including the kth element itself). This shuffle makes a single pass, so its running time is linear in the length of the list.
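A minimal sketch of this linear-time shuffle (our illustration, not the course's code):

```python
import random

def knuth_shuffle(values):
    """Uniform random permutation of a list, in place, in one linear pass."""
    for k in range(1, len(values)):
        j = random.randint(0, k)  # uniform over positions 0..k, inclusive
        values[k], values[j] = values[j], values[k]  # swap into place
    return values
```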
We now describe the five courses that form the backbone of the Berkeley data science curriculum. They are: (1) Data 8, a flagship freshman-level course that introduces Python programming, data manipulation, and basic statistical inference; (2) Data 100, a sophomore/junior-level course that introduces data-analysis pipelines and gradient-based algorithms for fitting statistical models; (3) Data 140, a sophomore/junior-level course that begins to assemble the formal tools of probability theory, doing so from a computational perspective; (4) Data 102, a senior-level course that focuses on the decision-making side of inference, including false discovery rate control, bandit algorithms, optimal control theory, differential privacy, and matching markets; and (5) Data 104, a course focused on the human context and ethics of data collection, data analysis, and algorithmic decisions. Throughout, an integrated set of labs, homework assignments, and programming projects helps to keep the focus on real-world problems. (Course materials are publicly available at data8.org, ds100.org, prob140.org, data102.org, and data.berkeley.edu/data-c104-fall-2020-syllabus.)

Data 8: Foundations of Data Science

Starting from a blank slate, or more aptly a blank screen, we endeavored to build a course about how data-driven reasoning is properly carried out now that programmable computers are the central tool of statistical inquiry. We sought to design a course that would integrate all that had been previously discovered about how best to introduce students to both statistics and computer science, but also distill the essential lessons learned by the many fruitful cross-disciplinary collaborations between researchers in statistics, computer science, and neighboring disciplines that gave rise to the emerging discipline of data science. Indeed, our experience in graduate-level research and teaching was an essential inspiration for the design of Data 8, our freshman-year course in data science.

The focus of the course is on reasoning, visualization, and interpretation, rather than calculations or the use of software packages. This approach is inspired by the boldly innovative
Statistics by Freedman et al. (1978), a textbook that transformed the way the field of statistics was introduced to undergraduates at Berkeley and around the world. We emphasize the importance of consistency between the assumptions in a statistical model and the way in which the data are generated; we discuss what can and cannot be justifiably concluded based on our calculations; and we try to express mathematical statements as precisely as possible in plain English.
However, the presentation of these classical lessons changed sharply when we adopted computation as a central tool. Unlike existing introductory statistics courses, examples in Data 8 do not begin with data summaries in the form of one or two graphs or a few numerical values. Rather, we give students the entire data set and teach them to generate the graphs or summary statistics that will become the centerpieces of their statistical arguments. Students are better equipped to interpret a summary once they have produced it themselves from the underlying data.

The computational side of the course draws inspiration from
Structure and Interpretation of Computer Programs, by Abelson and Sussman (1985), another transformative classical textbook. We highlight how a small set of primitive operations can be composed to carry out a wide range of data manipulation and visualization tasks; how simple built-in data structures can be used to represent a wide range of data scenarios; and how names and functions can be used to create compact and extensible programs that people can follow and understand.

Data 8 provides a great deal of practice using an essential but small and highly accessible subset of the Python programming language, with the goal of giving students sufficient experience with basic programming topics to achieve mastery. Programming can be engaging even when limited to its most basic components—variables and procedure calls—when used to understand real-world data sets. For example, just four weeks into the course, Data 8 students carry out an extended analysis of world population growth over time, along with trends in fertility and child mortality rates, to form an argument based on data that the Earth's population will eventually stabilize rather than grow without bound. (The data and many of the visualizations for this assignment are from GapMinder.) This project does not require any iteration or control statements; just assigning variables, defining functions, and invoking methods.

To achieve this simplicity, Data 8 uses a Python module built for the course called datascience, designed to ensure that students can carry out a wide range of table manipulation and data visualization operations using only a core set of the most fundamental programming language concepts. (See data8.org/datascience for documentation.) Later parts of the course introduce control statements, but programming topics such as defining classes that are standard in introductory computer science courses are never covered in Data 8.
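To give a flavor of this style, here is a sketch in the spirit of the module; the file name and column labels are hypothetical stand-ins, and data8.org/datascience documents the actual API:

```python
from datascience import Table, are

# Load a table and keep only the rows after 1950 (hypothetical data file).
population = Table.read_table('world_population.csv')
recent = population.where('year', are.above_or_equal_to(1950))

# A line plot of population over time: one method call, no loops.
recent.plot('year', 'population')

# Columns are arrays, so simple summaries need only names and calls.
ratio = recent.column('population').max() / recent.column('population').min()
```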
The arc of Data 8 motivates each programming topic with a new inferential goal. In the first section of the course, students learn to program while learning about data visualization and summarization. Many questions can be answered with a carefully crafted picture. However, when two quantities differ and random sampling is involved, students naturally ask whether that difference is a real property of the world or just a quirk of the data. Making this decision motivates the second part of the course.

To make such a decision, it is necessary to understand sampling variability. In traditional classes students have to imagine how the sample might have come out differently, and then either do some mathematics to quantify sampling variability or (as is common in introductory classes) accept and memorize some variance formulas. While demonstration by simulation is used in many classes, the students' work most commonly involves the use of standard deviations that they don't understand well. In Data 8, the students' primary tool for inference is to draw new samples and see the difference. Sampling variability is therefore expected and visible instead of being mysterious or obscure.

Inference in Data 8 centers on simulation and the fundamental observation that the empirical distribution of a large random sample typically resembles the distribution of the population from which the sample is drawn. Simulation takes three main forms:
• Simulating a multinomial random variable when the probabilities of all the classes are completely specified;
• Random permutations of a pooled sample, for inference about the underlying distributions of the individual samples;
• The bootstrap, to create new samples from a single large sample.

The third and final part of the course, on prediction, guides students to understand how machine learning systems are trained, applied, and evaluated without the conceptual overhead of optimization or calculus. Using simple linear regression and k-nearest-neighbor classification, students make predictions using real-world data sets. Although the prediction techniques are simple, the students' statements about the problem they are solving can be quite sophisticated at this stage, for example including confidence intervals around the estimated accuracy of their classifiers.

Using real-world data sets throughout naturally promotes discussion of the context in which data were collected and the social implications of collecting and analyzing data. Students explore the privacy implications of license plate scanners, the science behind estimating the age of the universe, and the complicated evidence linking dietary cholesterol to coronary heart disease.

Though Data 8 is largely the same now as in its pilot offering in Fall 2015, there have been some changes based on faculty and student response. Students in the pilot were exceptionally confident and curious, willing to take a chance on a new course based only on an interesting course description. With them, we were able to go further than we now can with a much more diverse group. For example, multiple regression has been removed from the syllabus.

The treatment of probability theory too has been cut down, focusing on shapes of distributions and uniform random sampling from finite populations. With a minimal use of notation, the course covers partitioning and addition, intersections and conditioning, and Bayes' theorem, including a discussion about base rates.
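A worked base-rate calculation of the kind this discussion involves, with illustrative numbers of our own choosing, fits in a few lines:

```python
# A test with 99% sensitivity and 95% specificity, applied where only
# 1% of the population has the condition (all numbers illustrative).
prevalence = 0.01
sensitivity = 0.99   # P(positive | condition)
specificity = 0.95   # P(negative | no condition)

# Bayes' theorem: condition on a positive result.
p_positive = sensitivity * prevalence + (1 - specificity) * (1 - prevalence)
p_condition_given_positive = sensitivity * prevalence / p_positive
print(round(p_condition_given_positive, 3))  # about 0.167, despite the 99%
```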
Probability calculations based on finite outcome spaces provide interesting settings for using array operations and iteration, but in the interests of time there are fewer of these now than there used to be. There is also less time spent on symmetry in random permutations. There is no formal treatment of random variables and distribution families other than the normal. However, Data 8 gives students an understanding of the standard deviation and its relation to the normal curve, the variability of a random sample mean, and the Central Limit Theorem. This approach closely follows that of Freedman et al. (1978).

Students appreciate Data 8. Each semester's class consists of students of all years and dozens of different majors, the vast majority finding the course well worthwhile. “This was easily the most influential and enjoyable class I've taken at Cal,” said one. “This was an extremely applicable course,” said another, “because it opened up students' eyes to the widespread use of data science and how we can see it and use it in everyday life.” Others made connections to their future work, as in, “Companies look for people who are technologically aware and capable, therefore I believe it is an extremely beneficial class,” or “I feel I learned a lot and gained many skills I will take into my future work and academic career.” See also Uptake and Engagement below.
Connector Courses

Accompanying Data 8 is a suite of “connector” courses designed to give students a more immersive experience in the data science of a particular domain. These courses address the diverse interests and goals of the students in Data 8, provide students with engaging introductions to many different fields, and offer students a collaborative research-style environment in which to apply what they have learned in Data 8.

Connector courses require anywhere from 50% to 100% of the hours spent by students in a typical course, and the class sizes are typically small by Berkeley standards. Some connector courses have additional prerequisites. Data 8 is a prerequisite or corequisite for all connector courses. Students may take multiple connector courses, either while they are taking Data 8 or in any future semester.

Two of the connector courses are intended for students who want to go deeper into the computational or statistical underpinnings of Data 8. The Computer Science connector,
Computational Structures in Data Science, develops fundamental concepts of computation in the context of data analytics and, along with the programming in Data 8, provides students with the equivalent of a one-semester introductory CS course. The Statistics connector,
Probability and Mathematical Statistics in Data Science, provides students with a theoretical foundation that complements Data 8 and prepares students for higher-level classes in Statistics, Economics, and other fields. Students are eager for this preparation; each of these connectors is taken by hundreds of students every semester.

But the typical connector course is a more intimate, project-based introduction to data science in a domain of application. Class structures vary across connectors and fields. One common structure consists of a weekly two-hour session that is a combination of lecture and in-class lab time. Students are required to complete a weekly lab, along with a longer-term project or assignment.

Departments all across campus have become enthusiastic partners in the connector project. Examples of connector courses and their host departments include, among many others:
• Data Science and the Mind [Cognitive Science]
• How Does History Count? Exploring Japanese-American Internment through Digital Sources [History]
• Data and Ethics [Information]
• Immunotherapy of Cancer: Success and Failure [Molecular and Cell Biology]
• Reading and Writing in the Digital Age [English]
• Children in the Developing World [Public Health]
• Data Science Applications in Physics [Physics]
• Data Science for Smart Cities [Civil Engineering]
• Data Science and Immigration [Demography]
• Exploring Geospatial Data [Environmental Science, Policy, and Management]
• Crime and Punishment: Taking the Measure of the US Justice System [Legal Studies]
• Data Science for Social Impact [Sociology]

Through the connector courses, which have been offered ever since Data 8 was introduced in Fall 2015, data science education has become a campus-wide effort and is not restricted to a small number of STEM departments. Connector courses embody Berkeley's view of data science as a way of thinking about the world, and are a formative element in our students' perception of data science as accessible and interesting to all.

For faculty, connector courses offer many benefits. They are an unusual and effective way of attracting students to a field. They provide a low-stress opportunity for faculty to expand the role of data science in their teaching and research, and sometimes also in their departments' degree programs. They create a community of faculty who are engaged in incorporating data science into their fields, in varying degrees and in different ways to be sure, but all working towards a better understanding of the value of data science in their domain.

Data Science Undergraduate Studies (DSUS) at Berkeley provides considerable support for faculty in connector courses and has produced a publicly available Guide for Instructors that aids in the development of such courses. DSUS staff help faculty with computational resources and infrastructure, identifying qualified student assistants, and other logistical and developmental aspects. There are two annual summer faculty workshops, one for Berkeley faculty and one for faculty from other schools, on developing courses that draw upon the content and pedagogy of Data 8. There is also a day-long workshop for graduate students on data science pedagogy. During the semester, weekly meetings of connector instructors and support staff provide an opportunity for connector faculty to share their experiences and best practices, as well as get the support they need for the smooth day-to-day running of their courses.
Data 100: Principles and Techniques of Data Science
While Data 8 students master the manipulation of small, tidy data tables that describe uniform random samples, Data 100 directly confronts the reality of modern data science: unstructured text, high-dimensional observations, stratified samples, data sets that exceed the size of a program's working memory, and other twists that challenge students to generalize what they have learned to new settings. But at this stage they are far more capable, thanks to the mathematics, statistics, and computer science courses they have taken since they started in Data 8. Data 100 draws connections across programming, linear algebra, calculus, and probability in order to expand the set of visualization, inference, and prediction problems that students can address. One student summarized the course as, “The class provides real-world data analysis skills that are highly prized by industry and prepares students to handle data at large scale.”

Like Data 8, Data 100 is organized around vignettes that expose the need for coordinating computational and inferential ideas in the context of real-world problems. In one project, the students develop a system to predict the duration of a taxi ride in Manhattan based on historical records. Faced with a problem that combines geospatial and temporal information, the statistical techniques used for prediction are important, but not the primary focus. Instead, students focus their effort primarily on data representation and interpretation. Geographic coordinates describe where each taxi ride originated, but students must write a program to determine whether each location is on Manhattan given a polygon that outlines the island, and their implementation must be efficient enough to process the full data set. Traffic delays are both stochastic and nonlinear functions of the time of day. Why were there so few taxi rides on January 23, 2016? A record 27 inches of snow blanketed the city. Nothing about such a problem is a straightforward application of mathematics, computer science, or statistics. As with many real-world problems, success requires combining general theory about making predictions with particular attention to data-manipulation details.

Data 100 exposes students to a variety of new computational topics. The course transitions students from the datascience module to pandas so that they can take advantage of Python's vast data science ecosystem. The programming concepts necessary to master pandas—slicing, class hierarchies, property methods, Boolean arrays, multiple views of an object, mutation, and type dispatching—are all covered in prerequisite programming courses (although not in Data 8). Therefore, introducing pandas is quick and efficient in Data 100, and students can appreciate the library's design and its interaction with Python's visualization and prediction libraries. But Data 100 also strives to make clear that Python is neither a necessary nor a sufficient platform for data science. Interacting with a database management system is a substantial topic in the course, extending the basic SQL coverage in prerequisite courses by addressing import and export, data cleaning, types of joins, and sampling. The course projects demonstrate that leveraging the full capabilities of a database can greatly increase the feasible scale, efficiency, repeatability, and clarity of a data workflow. The intent of this approach is to ensure that students are prepared to apply what they have learned in external settings. One student wrote, “[This course] is useful for my research, and I could see myself using most of the topics in real life.” Another wrote, “I learned a lot about how data science might work in industry and going beyond the Data 8 level of inference/intuition for analyzing data sets.”
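As an illustration of the data-representation work the taxi project demands, here is a hedged sketch of a vectorized point-in-polygon check with pandas and matplotlib; the file and column names are hypothetical, and the course's actual implementation may differ:

```python
import numpy as np
import pandas as pd
from matplotlib.path import Path

rides = pd.read_csv('taxi_rides.csv')              # hypothetical data file
boundary = Path(np.load('manhattan_polygon.npy'))  # (N, 2) lon/lat vertices

# Test every pickup at once rather than looping over rows; the
# vectorized check is what makes the full data set tractable.
pickups = rides[['pickup_longitude', 'pickup_latitude']].to_numpy()
rides['in_manhattan'] = boundary.contains_points(pickups)
manhattan_rides = rides[rides['in_manhattan']]
```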
A unifying theme of the course is a data science problem-solving life cycle that builds understanding about a dataset in conjunction with goal-oriented inference. To this end, students gain proficiency in exploratory data analysis, linear and logistic regression, and tuning predictive models. Most importantly, they see that effective use of any machine learning or inference technique requires not only understanding the technical details of the method, but also understanding the data to which it is applied, the representation of those data, and the context in which the data were collected. It is in Data 100 that some students begin to appreciate the unique character of data science as a discipline. One wrote, “I absolutely loved how this course pushed what I knew about data science further. I am a Computer Science major and am now planning to do a Data Science major thanks to this course!”
Data 140: Probability for Data Science

Probability theory is the essential mathematical framework for formalizing the concepts introduced in Data 8 and its connector courses, and for providing a bridge to more advanced courses. Data 140 is an introductory probability class for students who have taken Data 8, calculus, and linear algebra.

In most statistics departments, the traditional journey towards inference and data analysis begins with a calculus-based probability class, typically with no computational component; then a mathematical statistics class that typically includes some data analysis; and then classes in linear models, time series, and so on. Data 140 is novel in that it uses inference as motivation for studying probability, and computing as a primary tool along with math. The course preserves the mathematical rigor of classical probability courses and then goes further with the theory, using the math and computation to enhance and inform each other just as they do in research.

Interleaving mathematics and computing is a natural progression for students who have taken Data 8 and want to go deeper into inference. Indeed, for them it can be frustrating to take a traditional all-calculus probability class where the computation, if any, is done on hand calculators. Some students find themselves at sea in abstract mathematics and need computation as an anchor. No matter that the skills needed in mathematics and computation are largely the same—logic, abstraction, modularity, precision, and creativity. To these students, one of the worlds feels natural and the other alien. But they still want a way to go further into the probability that they have seen daily in Data 8: sampling variability, random permutations, the bootstrap, and always distributions, distributions, distributions.

Data 140 offers them the way. The course is able to move fast: many of the concepts have already been motivated and discovered empirically in Data 8. A side effect of the pace is that students with strong math preparation are also opting for Data 140 over traditional probability courses, because they know it will take them further.

The course content includes the material covered in standard undergraduate probability classes and also some inference, both frequentist and Bayesian, so that students can move on to a class in statistical learning and decisions without having to take a semester of mathematical statistics in between. This is ambitious, and is achieved by careful coordination of all the material each week: lectures, practice, homework, and lab.

In some weeks we build directly on the ideas introduced in Data 8. For example, total variation distance is introduced in Data 8 as a reasonable and straightforward way of quantifying the difference between two distributions on the same finite set of categories. In Data 140 homework, students derive the interpretation of total variation as the biggest difference in probabilities assigned by the two distributions. In lab in the same week, they compute the distance between binomial distributions and their Poisson approximations, and thus have a sense of how good the approximations are.
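A sketch of the kind of computation in that lab, using SciPy; note that total variation here carries the conventional factor of one half, which is what yields the biggest-difference interpretation derived in the homework:

```python
import numpy as np
from scipy.stats import binom, poisson

def total_variation(p, q):
    """Half the L1 distance between two probability vectors."""
    return 0.5 * np.sum(np.abs(p - q))

n, p = 1000, 0.002               # rare events, so Poisson should fit well
ks = np.arange(40)               # essentially all the mass for mean n*p = 2
d = total_variation(binom.pmf(ks, n, p), poisson.pmf(ks, n * p))
print(d)                         # a small number: the approximation is good
```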
As an extension of the method, Data 140 students have examined fixed points and consecutive pairs of random permutations, and top-to-random shuffles of a deck.

Shuffles reinforce the point that calculations can quickly get too large even for computers to handle. This point is made in the very first lecture, in the context of the Birthday Attack, so that students start off with respect for the mathematics and pleasure in it. By the end of the term, when the lab has them construct bivariate normal random variables from independent standard normal components, the lecture derives the multiple regression estimate as a projection, and the homework has them come up with the multivariate normal distribution of the estimated coefficients, students are at ease in the world of math as well as computing.

Labs take the students beyond what is typically covered in a first course in probability. Markov chains and the Metropolis algorithm are covered in a fairly standard way, and then students use the algorithm to break a substitution code. The beta and binomial families are explored in the context of Bayesian estimation, and then students construct a Chinese Restaurant clustering process and use beta and binomial facts to study its properties. But mainly, the labs help develop and reinforce the math. In the words of a student, “[L]abs were shockingly helpful in developing certain theory without me even noticing.”

Like all our data science classes, Data 140 attempts to develop a way of thinking, and indeed the word “think” appears repeatedly in student comments. “Great mix of theory and developing a probabilistic way of thinking,” said a student. “This course can change the way you learn and think,” said another. “The pace is demanding and pushes you to improve faster than you think is possible, but in the middle of the semester you will change. Things will start to click and it is indescribable how satisfying that is.” And another said simply, “It taught me how to think in a manner that enables me to succeed.”

Data 140 has not caused a drop in enrollments in the probability courses that the campus has long offered at this level in the Statistics department and elsewhere. Instead it engages about 650 more students annually in probability theory. Data 8 and its connectors have motivated a diverse population to study probability: top Data 140 students in recent semesters have come from Sociology and Nutrition Science as well as from Applied Math, CS, Data Science, and Economics. The interest is also spurred by our students' growing realization of the importance of probability theory and the need to understand it well. “I realized I couldn't keep faking my way through probability,” said a student, explaining why they took this class after doing just fine in the CS department's demanding machine learning class. “Definitely a must-take class for anyone interested in Computer Science or Data Science,” said another, while their classmate cast a wider net: “[T]his is the one class I would recommend everyone taking before they graduate Berkeley.”
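As a small taste of the Bayesian estimation lab mentioned above, here is a hedged sketch of conjugate beta-binomial updating, with illustrative numbers of our own:

```python
from scipy.stats import beta

prior_a, prior_b = 2, 2        # Beta(2, 2) prior on an unknown proportion
successes, trials = 14, 20     # observed binomial data

# Conjugacy: the posterior is Beta(prior_a + successes,
#                                  prior_b + failures).
posterior = beta(prior_a + successes, prior_b + trials - successes)
print(posterior.mean())        # posterior point estimate, about 0.667
print(posterior.interval(0.95))  # central 95% credible interval
```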
Data 102: Data, Inference, and Decisions

Data 102 is a senior-level course whose focus is decision-making. While decision-making is present throughout our course sequence, it steps up to center stage in Data 102. We think of our Data 102 students as likely future thought leaders in industry, science, academia, and government, and we want them to understand that data science is not merely about processing data, or solely about understanding phenomena through the lens of data analysis; often it is about making real-world decisions. Decisions can have consequences. Accordingly, in Data 102 we emphasize examples from domains such as public policy, medicine, and commerce, where algorithms are increasingly being used to make decisions, and where human happiness can hang in the balance. We emphasize the social, economic, and ethical aspects of decision-making.

We begin the course in a very traditional fashion, teaching the rudiments of statistical decision theory (in a style that would be familiar to Wald, von Neumann, and Blackwell). We start with basic two-alternative hypothesis testing, and take the opportunity to distinguish between Bayesian and frequentist versions of hypothesis testing, relying on Data 140 to give students the mathematical concepts so that these distinctions can be made crisp. But, as is our wont, the focus is not the mathematics, but rather the thinking styles associated with the two perspectives. Bayesian and frequentist perspectives are rarely brought together in undergraduate statistics curricula (not to mention graduate curricula), and when they are, it is generally in the context of estimation, where Bayesian and frequentist approaches are more similar than different, and where the goal is to emphasize the similarities. A sharper understanding is obtained by considering the contrast in thinking styles in the context of hypothesis testing.

We see the situation as being akin to that of quantum mechanics, where waves and particles provide related—but different—perspectives on physical phenomena. A physics curriculum that simply picks one or the other, or merges the two in some vague way, would not be viewed as satisfactory. Similarly, a data science curriculum needs to give some indication that the underpinnings of inference and decision-making are conceptually nontrivial, and are supported by two underlying, complementary perspectives.

To have this philosophical discussion yield something concrete, we turn the class in a rather nontraditional direction. We explain that classical hypothesis testing focuses on a single decision, and we argue (via examples) that real-world decisions are rarely made in isolation. Instead, decisions are made in the context of other decisions and in the context of other decision-makers. We accordingly turn to multiple hypothesis testing, and we introduce the error-rate criterion of false-discovery proportion (FDP) as a way to measure performance when there are many hypotheses, distinguishing it from the zoo of classical criteria such as Type I and Type II errors, specificity, sensitivity, etc.—all of which are fine for single hypothesis tests, but are less useful for multiple hypothesis tests. The FDP is a very natural quantity—it is simply the number of false discoveries divided by the total number of discoveries over a set of hypothesis tests. It is natural enough that it's not hard to convince students of the practical value of such a criterion—they can well imagine a future boss asking them how many discoveries they've made today, and what fraction of those discoveries are worth investing in.
We also emphasize that the FDP is fundamentally a Bayesian quantity—it involves conditioning on the decisions, which depend on the data, instead of conditioning on the unknown truth. We then ask a frequentist question: can we develop algorithms that keep the average FDP—the false discovery rate (FDR)—below some desired value?

The cognoscenti will know that there exists such an algorithm, and that the algorithm has an appealing computational side to it, involving sorting and comparisons. But we have a different didactic goal in mind at this point—the modern problem of online FDR control. The online FDR problem involves developing algorithms that maintain FDR control not only at the end of a batch of tests, but at any time along the way. (For a recent treatment of online FDR algorithms and their theory, see Ramdas et al., 2018.) Wearing our computational hats, we find that this perspective connects better to real software deployments of multiple testing, particularly in industry. And, importantly, a beautiful fact arises: there exist algorithms that are not only simple, but which also have an elementary proof of online FDR control. Indeed, it is a one-slide proof, perfectly suited to a senior-level undergraduate course.
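The batch algorithm alluded to above, which sorts p-values and compares them to a ramp of thresholds, is the Benjamini-Hochberg procedure; here is a minimal sketch of it (ours, not the course's):

```python
import numpy as np

def benjamini_hochberg(p_values, fdr_level=0.05):
    """Return a boolean array marking which hypotheses to reject.

    Sort the p-values, find the largest k with p_(k) <= (k/m) * level,
    and reject the hypotheses with the k smallest p-values.
    """
    p = np.asarray(p_values)
    m = len(p)
    order = np.argsort(p)
    thresholds = fdr_level * np.arange(1, m + 1) / m
    passing = np.nonzero(p[order] <= thresholds)[0]

    reject = np.zeros(m, dtype=bool)
    if passing.size > 0:
        k = passing.max()            # largest index meeting its threshold
        reject[order[: k + 1]] = True
    return reject
```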
We wish to emphasize two points about this discussion. First, in developing our data science curriculum, we have often found that topics at the research frontier are easier to motivate and explain than classical concepts, precisely because they have been developed in response to contemporary real-world problems, and because they make use of modern tools, most notably the computer. Second, although we have de-emphasized mathematics in this article, we want to make clear that we are not against mathematics. Quite to the contrary! But we want the mathematics to support the concepts, rather than to be the concepts. And we want the proofs that support the class to be, in the words of Paul Erdős, “Proofs from the Book.” Students should marvel at how the mathematics expresses powerful ideas simply and sharpens the mind.

In the design of Data 102, we continued to rely on “vignettes” to drive our thinking. In particular, in the current version of the course, the COVID-19 pandemic is omnipresent in the lives of both students and teachers, providing a daily reminder of what consequential, online, societal-scale decision-making with uncertain data can look like. Accordingly, we take real-life examples from problems that we find in the newspaper, such as ascertaining the prevalence of infection, estimating the case-fatality rate, and designing clinical trials for vaccines. Concepts such as the sensitivity and specificity of a diagnostic test take on real meaning for students in this context, and we are able to emphasize that these classical diagnostics—which are independent of the prevalence of the disease—yield decision rules that are very different from the FDR-based decision rules, which take prevalence into account. Students can see that this really matters when prevalence is low, as it generally is (hopefully!) in an epidemiological context. We can therefore contrast individual decision-making with group decision-making, and discuss the practical and ethical implications of such a contrast for public policy. We can also discuss the use of Markov-chain simulations of viral spread and how to make inferences based on such computations. Overall, the COVID-19 examples allow us to drive home the point that one important mission for data science is to design and deploy societal-scale systems that help humans cope with emerging challenges.

Another vignette that helped to drive the course design involved combining an economic perspective on decision-making with a statistical perspective. In particular, we bring together the economic concept of a matching market with the statistical concept of a multi-armed bandit. The teaching of matching markets allows us to introduce the Gale-Shapley algorithm, and to reason about its properties, as one would do in a classical computer science class. To combine such algorithmic thinking with inferential thinking, we note that in real life, agents don't necessarily know their preferences a priori, but need to accumulate experience with the outcomes available in the market in order to learn their preferences. Multi-armed bandits provide an excellent example of exactly such a learning problem. They are based on the use of confidence intervals (already taught in Data 140) to guide choices of which arm to pull in each of a sequence of trials. Putting these ideas together yields a concept that has only recently appeared in the research literature—a matching market in which each agent forms confidence intervals that are used by the Gale-Shapley algorithm to help a group of agents make decisions (Liu et al., 2020).
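For concreteness, here is a compact sketch of the Gale-Shapley algorithm itself; the confidence-interval (bandit) layer that learns the preferences is omitted, and complete preference lists of equal length are assumed:

```python
def gale_shapley(proposer_prefs, reviewer_prefs):
    """Proposer-optimal stable matching.

    proposer_prefs[p] lists reviewers in p's order of preference;
    reviewer_prefs[r] lists proposers in r's order of preference.
    Returns a dict mapping each reviewer to their matched proposer.
    """
    # rank[r][p]: how reviewer r ranks proposer p (lower is better).
    rank = {r: {p: i for i, p in enumerate(prefs)}
            for r, prefs in reviewer_prefs.items()}
    next_choice = {p: 0 for p in proposer_prefs}  # next reviewer to try
    free = list(proposer_prefs)                   # currently unmatched
    match = {}                                    # reviewer -> proposer
    while free:
        p = free.pop()
        r = proposer_prefs[p][next_choice[p]]     # p's best untried option
        next_choice[p] += 1
        if r not in match:
            match[r] = p                          # r accepts tentatively
        elif rank[r][p] < rank[r][match[r]]:
            free.append(match[r])                 # r trades up, frees old p
            match[r] = p
        else:
            free.append(p)                        # r rejects p
    return match

# Example: gale_shapley({'a': ['x', 'y'], 'b': ['x', 'y']},
#                       {'x': ['b', 'a'], 'y': ['a', 'b']})
# returns {'x': 'b', 'y': 'a'}.
```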
Students find this example not only mathematically interesting but also compelling in terms of its natural real-world applications.

The concept of private data analysis provides fodder for additional vignettes. Briefly, research in the theoretical computer science and database communities has given rise to the concept of differential privacy, which provides algorithmic methods for adding noise to a database, yielding a “privatized database” that guarantees the privacy of individuals who supply data to the database while still allowing queries to be answered (nearly) correctly. It is natural to supplement this computational story with an inferential story—how might we add noise to the database so as not only to ensure privacy but also to ensure inferential accuracy? That is, will a privatized database make good predictions for individuals who come from the same population as those in the original data?

Finally, we also cover causal inference in Data 102. In doing so, we close a loop from the senior experience back to the freshman experience—one of the first lectures in Data 8 is on causal inference, where we discuss John Snow's discovery of the cause of a cholera outbreak in London. In Data 102, we discuss how to design experiments so that causal hypotheses can be evaluated, and we discuss inferring cause from observational data. The latter is precisely what John Snow did in 1854, where, in so doing, he showed vividly how data analysis can be used to make consequential decisions and change the world for the better.
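One standard construction behind this computational story is the Laplace mechanism, which releases a count after adding noise scaled to sensitivity/epsilon; a minimal sketch (ours, not course material):

```python
import numpy as np

def private_count(true_count, epsilon, sensitivity=1.0, rng=None):
    """Release a count with Laplace noise of scale sensitivity / epsilon.

    Adding or removing one individual changes a counting query by at
    most 1, so this noise level yields epsilon-differential privacy
    for the released count.
    """
    rng = np.random.default_rng() if rng is None else rng
    return true_count + rng.laplace(loc=0.0, scale=sensitivity / epsilon)
```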
Data 104: Human Contexts and Ethics of Data

Recall that one of the major learning goals for the curriculum is that students should be “aware of the social, cultural, and ethical context of the problems that they were formulating and aiming to solve.” Accordingly, in all of our courses, we aim to tie our framing and our examples to plausible, real-world contexts, and to provide explicit consideration of the implications for society and for individuals of the increasingly wide deployment of data science methodology. Such contextualization is aided by the connector model. But the lecture format of our technical courses is not adequate for achieving such a learning goal, and we felt it essential that the curriculum include an entire course, organized around readings and discussion, that serves to provide students with an in-depth experience in thinking about the implications of data collection, data analysis, and algorithmic decision-making for society and for individuals. Data 104, “Human Contexts and Ethics of Data,” provides such an experience. It is taken by students in their junior or senior year.

Just as it is essential to provide mathematical meaning to concepts such as “inference,” “feedback,” “uncertainty,” “model,” and “risk,” it is also essential to remember that these concepts generally arise in particular societal and cultural contexts, and that there are often stakeholders associated with perspectives on those concepts. Even basic notions such as “data” and “measurement” need to be discussed relative to context. Consider, for example, a data-science graduate arriving at a company and being asked to use a large internal dataset to predict the productivity of employees. If asked to measure “productivity” in a naive way—e.g., the raw number of widgets assembled over a five-year period—such a graduate should be prepared to raise concerns, noting, for example, that such a measure could disadvantage individuals who have taken family leave but are as productive as others given a more reasonable way to measure “productivity.” Our data-science graduate should not shy away from any ensuing dialog or debate, and in general should be prepared to engage in discussion of concepts that have strategic, legal, and ethical aspects to them. They should also be able to help ensure that a final design of a data-analysis system in fact respects agreed-upon desiderata; that it doesn't, for example, implicitly use a surrogate measurement in place of a proscribed measurement, or that it doesn't treat individuals, or groups, unfairly. They should help to ensure transparency, honesty, and accountability.

Broadly, we want our courses to yield data scientists who can engage in dialog with a range of stakeholders in real-world problems, including scientists, management, labor advocates, legal experts, medical workers, economists, social workers, and engineers, and to be able to work with such individuals over extended periods of time to help understand and shape the role that data collection and data analysis play in our society.

Data 104 contributes to such learning goals by asking a set of questions:
• How does data science transform how people live and how societies function?
• How do cultures and values inform how data-driven tools are developed and deployed?
• What assumptions do data-enabled algorithms and tools carry with them?
• What projections does artificial intelligence make on the future?
• How can we shape the outcomes we want to see?

It uses readings culled from a diverse set of venues that explore these questions. In the most current offering of the course, the readings are organized into themes: (1) Our Datafied World; (2) Responsible Data; (3) When Data are Personal; (4) Collective Life; (5) Data and Democracy; (6) Scientific Research; (7) Machines and Industry; and (8) The Ethos of Making. Based on these readings, and in-class discussions, students are exposed to voices from fields such as sociology, public policy, urban studies, political science, literature, and philosophy. History plays a particularly salient role. Technological change has swept through society before, and it will do so again. Historical analysis helps us to perceive general features of such changes, reminds us to focus on long-term issues, and provides us with tools for understanding causes and effects. It also allows us to consider counterfactuals—utopias and dystopias—that need to be made explicit as humans continue to ponder what data science—and technology more broadly—can and should be.

Data 104 provides a way to amplify, and give a fuller treatment of, some of the issues raised in other classes, including notions of privacy and fairness discussed in Data 102, notions of provenance discussed in Data 100, aspects of causal inference discussed in Data 8 and Data 102, and the multi-faceted notions of sampling, representativeness, bias, subpopulations, and uncertainty that are discussed in all of our courses. Moreover, there are numerous substantive topics which are only treated in Data 104; in particular, Data 104 treats questions surrounding the ownership of data, individuals' rights regarding data and decisions based on data, and the relationships between data science, journalism, and democracy.

Finally, Data 104 is an ethics course, helping students to align values with scientific and technological concepts. While classical ethics courses are often disconnected from the technical fields that they sit alongside, in the case of data science, which brings people and technology together in a tight and sometimes awkward embrace, ethics is central and vital. Given the novelty of the challenges raised by data science, ethics emerges as a vibrant form of inquiry.
Almost 3,000 UC Berkeley students took Data 8 in the 2019-2020 academic year. We expect that nearly 50% of the most recent class of UC Berkeley undergraduates will enroll in Data 8 before they graduate. Its popularity among students prompted departments across campus to consider Data 8 as a potential part of their degree programs, and it now fulfills the statistics requirement for 25 different major programs, from Civil Engineering to Legal Studies to Public Health. In a time when demand for computing and statistics education is surging, Data 8 plays a critical role in providing broadly accessible and relevant instruction to the UC Berkeley undergraduate population. Enrollment in undergraduate computer science lecture courses at UC Berkeley has increased by 453% in the last ten years, from 3,633 enrollments in the 2008-2009 academic year to 20,079 enrollments in the 2018-2019 academic year. Meeting this new demand requires more than just scaling up existing courses; we believe that Berkeley and other institutions must create new kinds of educational experiences.
A new model of large-scale course delivery was required to offer Data 8 to a large fraction of Berkeley’s undergraduate population. Not only is the absolute number of students enrolled in the course very large, but there is tremendous variety in the academic interests and intended majors of these students. When subjects reach this level of popularity, they often branch into population-specific variants. For example, Berkeley has three different calculus sequences: one for biology, one for the physical sciences and engineering, and a third for everyone else. The variety of introductory statistics courses is greater still, including discipline-specific courses in public health, linguistics, and more. By contrast, Data 8 was designed to provide a single common core that is both accessible and challenging to a broad cross-section of undergraduates on campus. Students learn about connections to their specific domains of interest through the combination of connector courses and a team of teaching assistants drawn from departments across campus. The structure of the teaching staff and the design of assignments are both intended to allow students to conduct ambitious data-analysis projects within their first semester, even students without prior experience in statistics or programming. The Data 8 model is meant to support students who will go on to specialize in data science, as well as those who will choose to focus on a data-rich domain within the natural sciences, social sciences, engineering, or data-focused humanities.

Many of the connector courses offered today were developed during the same year that Data 8 was originally piloted. Indeed, Data 8 and its connector courses have been viewed from the outset as components of a unified learning experience that is both essential in its broad applicability and specific in its alignment with students’ particular academic interests. Developing this diverse suite of courses in a way that maintained their coherence required a substantial internal recruiting effort to identify faculty who would develop connector courses, as well as an unusually high degree of coordination among course instructors to ensure that the connector courses did in fact connect with the material in Data 8. At Berkeley, this coordination was enabled in large part because the intention to develop Data 8 and its connector courses was not isolated to a particular department or small group of faculty, but instead was cultivated by a diverse faculty committee convened by the chancellor to create a campus-wide, unified approach to data science education. Before any particular course syllabus was ever created, stakeholders from multiple departments, divisions, and schools were invested in creating this new course experience. The team of instructors who created both Data 8 and its connector courses came together under the guidance of this committee, which first recognized that a common intellectual core existed across so many different disciplines. Individual courses can be created by individual faculty members, but creating a coherent educational program requires the commitment and energy of many.

One organizational component that proved critical was that someone who was not responsible for the day-to-day development and delivery of Data 8 took responsibility for ensuring that the Data 8 and connector instructors met regularly, shared their plans, and kept each other apprised of what students were learning in each of the courses.
This coordination effort led to the development of a summer seminar that proved so popular among Berkeley faculty that it has since expanded into the National Workshop on Data Science Education (https://data.berkeley.edu/education/data-science-education-opportunities), which convenes faculty from around the world to discuss course content and data science pedagogy.

Another fruitful coordination tactic has been to involve undergraduates who have taken Data 8 as tutors and teaching assistants for connector courses. Because they have recently taken Data 8, they often know the material and the course cadence well enough to help connector-course instructors identify connection points between their course and Data 8.

Undergraduates are also heavily involved in delivering the weekly lab sections of Data 8 at Berkeley, as well as in helping students complete assignments through various forms of tutoring. Large-scale delivery of an introductory course, especially a new course such as Data 8, can benefit greatly from a robust undergraduate teaching program in which students first gain experience and confidence in support roles, and then take on broader, more autonomous, and more essential roles as they mature. Teaching mentorship, either within a course staff or through a pedagogy course, can help with recruiting and developing more capable undergraduate tutors and teaching assistants. Although it is more logistically complicated, there are appealing advantages to building a course staff from a large number of undergraduates who each work a small number of hours per week. Undergraduate teaching roles with a lower time commitment are less likely to interfere with academic work, and a higher ratio of teachers to students can promote more individualized instruction. While the needs of each course are different, undergraduate tutors and teaching assistants are also heavily involved in teaching Data 100 and Data 140 alongside graduate students (Data 102 and Data 104, by contrast, are staffed primarily by graduate students).

For all of these courses, the student experience outside of lecture is a central concern. A common feature across Data 8, Data 100, and Data 140 is a weekly lab section in which students work through example problems using the same computing environment in which they will complete assignments. While these lab sections typically do not introduce new concepts or techniques, they do introduce new applications, new ways of combining ideas from the course, and new problem-solving strategies. This design is meant to ensure that students spend their time productively throughout the course: lab time offers review and reinforcement but minimizes redundancy with lecture, and the experience of solving problems in lab prepares students to solve similar problems on assignments and exams. Many students still need assistance in completing assignments, and so various forms of free one-on-one and small-group tutoring are available to students in all of these courses.

The Berkeley Data Science curriculum arose as a pedagogical leap of faith: a belief that computation and inference are natural allies and should be taught together, as a blend, in a modern undergraduate curriculum.
Having watched developments at the research frontier in both computer science and statistics over many years, we had been struck by the fact that the former has become increasingly probabilistic and inferential, that the latter has become increasingly computational, and that these developments have led, perhaps surprisingly, to ideas that are often simpler to understand and to deploy than their classical counterparts. But there really was no surprise: the classical insistence in computer science on determinism, and the classical insistence in statistics on analytical results, imposed severe limits on both fields, constraining the scope of their applications and requiring cleverness on the part of practitioners to surmount those limits in real-world applications.

Pedagogy in both fields often aimed at bringing young minds to heel within the classical frameworks, while only occasionally admitting the limitations of those frameworks. Powerful ideas that blend probability, inference, and computation were reserved for PhD-level seminars and research conferences; undergraduates were kept in the dark.

While the emergence of data science on the academic landscape at many universities has led to discussions of curricular change, these discussions have often focused on the mere juxtaposition of existing curricula in computer science and statistics. Such a juxtaposition comprises a great deal of material, and the ensuing discussion has often focused on what to omit. A problem with such a framing, in our view, is that the classical curricula were overly disjoint and the opportunities for blending were difficult to perceive.

We believe that data science provides a historic opportunity to concoct a new blend, transporting powerful ideas from the research frontier into the undergraduate curriculum and empowering students to pose and solve a broad range of emerging problems. We believe that we can and should expose a new generation of students to intellectual foundations that are simultaneously computational and inferential. We believe that students will perceive the simplicity and generality of this combination, and will view it as a foundation on which to build their own thinking and solve their own problems.

Taking our enthusiasm to the limit, we would like to end with the following provocative question: should the new data science curriculum entirely displace the classical curricula in computer science and statistics, or should it simply live side by side with those curricula?

Of course, the new curriculum comprises many classes in addition to those that we have discussed here, and those classes are drawn in part from the classical curricula, so there is necessarily an overlap. The question that we are asking is whether the main architectural elements of the curriculum, namely the stage-setting introductory class and the main backbone of the curriculum, are those that we have discussed here or their counterparts in a classical computer science or statistics curriculum. We offer a few opinions on this question, with the goal of opening a debate, not closing it.

Certainly there are elegant, deep, and powerful ideas in the classical curricula that should continue to be nurtured. We have in mind notions of abstraction in computer science, sampling in statistics, and different flavors of modeling in both disciplines.
The new curriculum certainly teaches these notions, but the range of problem domains in which they are developed is arguably narrower. For example, a data-science student will likely have less exposure to the logic-based abstractions in computer science that underpin the fields of verification and cryptography. A data-science student may have less exposure than a classical statistics student to the range of applications of analysis of variance.

But overall we would have confidence in the ability of one of our data-science students to compete with classically trained peers in either industry or academia. Our data-science students will be able to write code, but they will do so with an enhanced perception of the real-world consequences of that code; in particular, they will be aware that code often operates in a stochastic environment and that conclusions are always tentative. Moreover, their education will have in it the seeds of a broad range of mathematical ideas that can be developed further in a PhD.

One useful perspective on this debate comes from considering the idea of a “fifth year.” Education is not merely about exposure to a certain set of ideas; it is about the intellectual maturity that comes from the lived experience of working with those ideas, in the context of a range of problems and in the context of one’s own life experience. Mature understanding of deep concepts such as abstraction, sampling, and modeling takes time. One can imagine a fifth-year program that is project-based, providing time for that maturity to emerge, introducing a wider range of concepts from the classical curricula, and continuing to emphasize engagement with fields beyond computer science and statistics.
Acknowledgments
We would like to thank the many colleagues at Berkeley who have been our partners over the past few years as the Berkeley Data Science curriculum has been rolled out: the faculty who helped to design and teach the classes, the graduate students who have played key roles as teaching assistants and mentors, the staff who have helped to build the infrastructure behind the curriculum, and the departments that came together as stakeholders in a campus-wide initiative. Most of all, we thank the thousands of undergraduate students whose energy, passion, open minds, creativity, and courage have inspired and challenged us. Their excitement at doing something new and impactful, and their willingness to help each other and to help us, have confirmed our abiding faith in liberal education.