"Playing the whole game": A data collection and analysis exercise with Google Calendar
Albert Y. Kim ∗
Program in Statistical & Data Sciences, Smith College
and
Johanna Hardin
Department of Mathematics, Pomona College
June 22, 2020
Abstract
We provide a computational exercise suitable for early introduction in an undergraduate statistics or data science course that allows students to ‘play the whole game’ of data science: performing both data collection and data analysis. While many teaching resources exist for data analysis, such resources are not as abundant for data collection given the inherent difficulty of the task. Our proposed exercise centers around student use of Google Calendar to collect data with the goal of answering the question ‘How do I spend my time?’ On the one hand, the exercise involves answering a question with near universal appeal, but on the other hand, the data collection mechanism is not beyond the reach of a typical undergraduate student. A further benefit of the exercise is that it provides an opportunity for discussions on ethical questions and considerations that data providers and data analysts face in today’s age of large-scale internet-based data collection.
Keywords: data science, statistics, education, data collection, Google Calendar, ethics.
∗ Albert Y. Kim is Assistant Professor, Statistical & Data Sciences, Smith College, Northampton, MA 01063 (e-mail: [email protected]). This work was not supported by any grant. The authors thank numerous colleagues and students for their support.
arXiv [stat.OT]
Introduction
The title of our paper refers to the reality that in order to master a subject, one needs to do more than practice the individual, elemental, and necessary parts. For example, no matter how good you are at running, dribbling the ball, or shooting the ball into the upper corner of the net, you cannot excel at soccer unless you practice the whole game (Perkins 2010). While Wickham & Bryan (2019) use the phrase to describe creating an entire R package from beginning all the way to distribution on GitHub, we use the term to describe the entire process by which data science is performed.
Many statistics and data science educators are likely quite familiar with Wickham & Grolemund (2017)’s model of the tools needed in a typical data science project seen in Figure 1. While one should not interpret any simplifying diagram in an overly literal fashion, we appreciate two aspects of this diagram. First, the cyclical nature of the “Understand” portion emphasizes that in many substantive data analysis projects, original models and visualizations need updating, necessitating many iterations through this cycle until a desired outcome can be communicated. Second, it encourages a holistic view of the elements of a typical data science project.
Figure 1: Data science workflow diagram (Wickham & Grolemund 2017)
Previously many undergraduate statistics courses have not taught such a holistic approach, instead focusing only on individual components at the expense of others, thereby leaving gaps in the process. In this paper, we view these gaps in terms of what
Not So Standard Deviations podcast hosts Roger Peng and Hilary Parker call differences between “what data analysis is” and “what data analysts do” (Peng & Parker 2018b). In other words, there are many differences between the idealized view of the data analysis process and what is actually done in practice.
We argue that it is important to adjust curricula to “bridge the gaps” between “what data analysis is” and “what data analysts do” (McNamara 2015). For example, a substantial bridge of the gap has come through great pedagogical strides to expose students in statistics and data science courses to the entirety of the process in Figure 1 (Baumer 2015, Hardin et al. 2015, Loy et al. 2019, Yan & Davis 2019).
Another example of an existing gap in some statistics and data science courses is sparse treatment of data wrangling. A familiar refrain from working data scientists is that 80% of their time is spent wrangling data, leaving only 20% for actual analysis (see Figure 2). Given the prospect of the 80-20 rule, Kim et al. (2018) argue that to completely shield students in statistics courses from performing meaningful data wrangling is to do them a disservice. Horton et al. (2015) propose five key data wrangling elements that deserve greater emphasis in the undergraduate curriculum: creative & constructive thinking, facility with different types of data, statistical computing skills, experience wrangling, and an ethos of responsibility. Two of the most seminal contributions to modernizing the statistics and data science curriculum, National Academies of Sciences Engineering and Medicine (2018) and Nolan & Lang (2010), both emphasize the importance of teaching and practicing data wrangling.
Figure 2: The 80-20 rule of data wrangling. (a) Image credit: (Ruiz 2017). (b) Image credit: (Robinson 2018).
In this manuscript, we argue that another such gap in curricula is inadequate treatment of the data collection component of the cycle.
Many working data scientists do not spend a large proportion of their time working on one static data set as suggested by the single “Import” step in Figure 1 (Ruiz 2017, Lohr Aug 17, 2014). Instead, the data used to address questions are often dynamic; the data change based on differences in time, sampling strategies, questions asked, variables collected, etc. Furthermore, only after a first iteration of data has been collected and analyzed can research questions be updated and fine-tuned, often necessitating another round (or more) of data collection.
While we are not the first to argue that data collection and acquisition is an important topic for the statistics classroom (Zhu et al. 2013) or the computer science classroom (Blitzstein 2013, Protopapas et al. 2020), we point out that Wickham & Grolemund (2017)’s model of data analysis from Figure 1 is not a complete representation of a typical “data science project” as it neglects a critical phase: (repeated) data collection. We present what we term “playing the whole game” in Figure 3, which augments the earlier “data analysis” diagram in Figure 1 with an additional “data collection” block. The new block comprises key elements to consider when collecting data, including ethical considerations, the experimental design (if any), the sampling methodology, questionnaire design in the case of surveys, the data input and logging methods used, and most critically, identifying the research question. While the new block is by no means exhaustive and thus should not be interpreted in an overly literal fashion either, we again emphasize the cyclical and iterative nature of data collection and data analysis.
Figure 3: “Playing the whole game”: Data collection as well as data analysis in data science.
Note that some of the tasks in the “Data Collection” box may not be seen immediately as collecting data. However, we believe that they are important considerations at the data collection step.
For example, as it relates to “Ethical Considerations”, questions like “Should you collect someone’s location data?” or “Should you require explicit permission to collect a particular variable?” would be considered. As it relates to “Question Prompts”, studies on the psychological effect of “priming” have shown that you can steer survey responses in a particular direction depending on how and when questions are asked (Hjortskov 2017).
In the context of undergraduate statistics and data science classrooms, a large amount of data analysis conducted by students uses data that they themselves or the class as a whole had no part in collecting. For example, many students seek data for assignments and projects from repositories like Kaggle.com or data.gov. Thus in many cases, students get no exposure to the “data collection” component of Figure 3 and take the data for granted. Lack of exposure to data collection has potential pitfalls including potentially erroneous conclusions based on mistaken assumptions of the data collection method, a lack of a sense of ownership of the work, and most importantly, no experience going through the iterative process of “playing the whole game.”
Previous literature has focused on slightly different ways of incorporating data collection into the classroom: research projects (e.g., Halvorsen & Moore 2000, Halvorsen 2010, Sole & Weinberg 2017); simulating realistic data through games (e.g., Kuiper & Sturdivant 2015); and collecting classroom activity data (e.g., Scheaffer et al. 2004).
Building toward the goal of having students experience the entire data analysis process, we propose a data collection activity that allows students to mimic the real-life data collection conducted by numerous internet-age organizations in industry, media, government, and academia. At the same time, the technological background and sophistication necessary for the activity are kept at a level suitable for undergraduate classroom settings.
The classroom activity touches many components of playing the whole game, demonstrates to the students the difficulty of data science, and provides an interesting research question with which the students can engage. A visual summary of the activity is presented in Figure 4.
Figure 4: Graphical representation of “Playing the Whole Game” with Google Calendar.
Muse
The idea for our “data collection” exercise for students came out of an episode of the earlier mentioned
Not So Standard Deviations podcast titled “Compromised Shoe Situation” (Peng & Parker 2018a) (as well as described in a corresponding blog post by Roger Peng titled “How Data Scientists Think - A Mini Case Study” (Peng 2019)).
Hosts Hilary Parker and Roger Peng gave each other a data science challenge whereby they had to solve a problem using data science. They contrasted it to other common data science challenges (such as the American Statistical Association’s DataFest (Bialik 2014) or prediction and classification competitions available from Kaggle.com) where the data have already been collected and cleaned.
The challenge that Parker and Peng proposed centers on identifying what factors influence the time it takes each of them to get to work. In one of our favorite discussions of their podcast, they break down the iterative process of gathering data, learning what information it provides, and gathering more data. They also discuss the difficulty of gathering precise information and the balance of “low-touch” and “high-touch” data collection (collection methods that require few active user actions versus many, respectively). In Parker’s data collection, she implemented a low-touch system of recording when her commute started by automatically recording when her phone disconnects from her home WiFi; a higher-touch part of her analysis was when she manually logged the route she took to work.
In the rest of the manuscript we describe how we have taken the ideas from Parker and Peng’s podcast and infused them into a class assignment previewed in Figure 4 which allows students to practice “playing the whole game.” In Section 2 we provide the details of the assignment and the considerations that went into why we made some of the choices we did.
Section 3 provides a reflection on what we learned and how the assignment succeeded in accomplishing the goals for “playing the whole game.” In Subsection 3.3, we include quotes from reflection pieces written by students summarizing their experiences. As a vital aspect of our work, in Section 4 we discuss the ethical considerations of the assignment and how they can be generalized to the larger-scale data collection done by internet-age organizations in industry, media, government, and academia. As part of our ethics discussion, we emphasize the importance of bringing up data ethics throughout the semester with every data analysis, and not just in a data ethics course.
Materials & Methods
In Fall 2019 the authors were each teaching a class in a context where the whole game would be important to the learning outcomes of the course. Albert’s class was “Introduction to Data Science” at Smith College, a no-prerequisite course designed to appeal to a broad audience and act as an entry-point to the Statistical and Data Sciences major. He gave the assignment in the fifth week of the semester as the first “mini-project.” Jo’s class was “Computational Statistics” at Pomona College with primarily junior and senior math/statistics majors; she used the assignment as the first homework of the semester.
In what follows, we provide the details of the assignment. To engage the students maximally, we tried to find a question with universal appeal and landed on “How do I spend my time?” In the assignment, the student is required to collect, wrangle, visualize, and analyze data based on entries they make in their own electronic calendar/planner application, such as Google Calendar, macOS Calendar, or Microsoft Outlook.
To assess prior student use of a calendar/planner, Albert polled his class at Smith College on their preferred method of keeping track of their schedule (see Figure 5). In Albert’s case, most students were already keeping an electronic calendar (44 of 74), while some were keeping a hand-written calendar (21 of 74).
The work has been verified by the Smith College Institutional Review Board as “Exempt” according to 45CFR46.101(b)(1): (1) Educational Practices.
Both authors were surprised to see that any college student would be able to make it through the semester without any type of calendar at all (3 of 74)!
The learning outcomes for the exercise can be broken down into three categories. The first category is “Data collection” which focuses on the technical aspects of the activity. The second category is “Data ethics” which focuses on the larger context and conclusions resulting from the exercise. The last category encompasses the learning goals related to “Playing the whole game.”
• Data collection
1. Experience creating measurable data observations (e.g., how is “one day” measured, or what defines “studying”).
2. Address data collection constraints due to limits in technological capacity and human behavior.
• Data ethics
3. Practice the ethical and legal responsibilities of those collecting, storing, and analyzing data.
4. Decide limits for personal privacy.
5. Deliberate on the trade-offs between research results and privacy.
• “Playing the whole game”
6. Tie together data collection, analysis, ethics, and communication components.
7. Iterate between and within the components of the “whole game.”
Throughout the manuscript, we touch on all of the learning outcomes. Section 3.1 describes many of the benefits and learning outcomes related to “Data collection”; Section 3.2 describes many of the benefits and learning outcomes related to “Playing the whole game”.
As a specific example of learning outcome 5, consider ethics training for animal studies where there is typically a discussion on the minimum number of samples needed for “ethical research.” The idea is that an under-powered study would indicate that the animals had been needlessly sacrificed with no possibility of moving scientific research forward. As with the minimum number of samples, we ask the students what types of research connect with which types of privacy violations, and where the correct balance lies for pushing forward knowledge. A more complete examination of data ethics in the classroom is given in Section 4.
We have provided the complete assignments used by both authors at the following website .
While the two versions of the assignment vary slightly in format and reflection, both still require the students to go through the entire process of collecting, wrangling, visualizing, analyzing, and communicating about the data, with an important additional step of reflecting on the information gathered. The assignment was scaffolded so that a student with minimal computational and coding experience could still directly work with calendar data.
Before we go into the details of working with calendar data, we first point out some differences in approach between the two assignments. One unique aspect of Albert’s class is that he had the students work in pairs. A particular student would make their calendar entries and then export the data. However, instead of analyzing their own calendar, they would send their data to their partner who then wrangled, visualized, and analyzed the data (and vice versa). The motivation was to encourage students to think about what data they could or should share with their partner, what data they couldn’t or shouldn’t, and any particular responsibilities that the individual analyzing the data had.
In Jo’s class, she spent half of one class period discussing the podcast in great detail. Because it was early on in the semester, there were lots of unknowns about what data science is and what the possible ways are that one can use data to make decisions. Additionally, the assignment and podcast continued to come up in class sessions throughout the semester as a way to ground conversations with respect to data collection & analysis: what we can do, what we can’t do, what we should do, and what we shouldn’t do.
Students were instructed to track how they spent their time on their calendar application for approximately 10-14 days. More specifically, students were instructed to fill in blocks of time and mark the entry with the activity they were performing: sleeping, studying, eating, exercising, socializing, etc. The students chose their own categories to fit their own schedules. Students were able to have overlapping blocks of activities in situations where two activities were done simultaneously. Students were also informed that they should feel comfortable leaving out any details they did not want to share; indeed, if they wanted to, they were free to make up all the information in the calendar.
Note that our example centers around the use of Google Calendar. To ensure as much consistency as possible we encouraged students who did not have an electronic calendar to use Google Calendar to record their activities. However, the assignment can equally be done using macOS Calendar or Microsoft Outlook.
As an example, we filled a sample Google Calendar with entries between September 2nd and 7th, 2019, which you can view using the Google Calendar interface at http://bit.ly/dummy_calendar . We suggest that after scrolling to the week of September 1st, 2019, you click the “Week” tab on the top right for a week-based overview of the calendar entries.
Figure 6: Sample Google Calendar presented to students.
After looking at the sample calendar, students exported their own calendar data to .ics file format, a universal calendar format used by several email and calendar programs, including Google Calendar, macOS Calendar, and Microsoft Outlook. Their file was then imported into R as a data frame using the ical_parse_df() function from the ical package (Meissner 2019).
In order to help the students focus on the larger data science paradigm, the instructors scaffolded the assignment in template R Markdown files (Allaire et al.
2019) which served as the foundation of the students’ submissions; we provide these scaffolded assignments at . Continuing our earlier example involving a sample Google Calendar, we exported its contents to a file. Subsequently, we imported the .ics file into R as a tibble data frame (Müller & Wickham 2019), and then performed some data wrangling using dplyr (Wickham, François, Henry & Müller 2019) and the lubridate package for parsing and wrangling dates and times (Grolemund & Wickham 2011).
Below, we give an extract of the code representing the crux of the process previously described. (Note that to change the "America/New_York" time zone setting in the code, run the OlsonNames() function in R to output a list of all time zones.)

    library(ggplot2)
    library(dplyr)
    library(lubridate)
    library(ical)

    calendar_data <- "192.ics" %>%
      # Parse the exported calendar file into a data frame, one row per event
      ical_parse_df() %>%
      as_tibble() %>%
      mutate(
        # Express event start and end times in the local time zone
        start_datetime = with_tz(start, tzone = "America/New_York"),
        end_datetime = with_tz(end, tzone = "America/New_York"),
        length = end_datetime - start_datetime,
        date = floor_date(start_datetime, unit = "day")
      ) %>%
      # Lower-case the entry titles so "Sleep" and "sleep" are grouped together
      mutate(summary = tolower(summary)) %>%
      group_by(date, summary) %>%
      # Total time per day and activity, converted to a plain number
      summarize(
        length = sum(length),
        length = as.numeric(length)
      ) %>%
      # Treats length as minutes; the units of the time difference can vary
      mutate(hours = length / 60) %>%
      filter(date > "2019-09-01")

The resulting calendar_data data frame output is presented in Table 1. The calendar_data data frame can then be used to make plots like the time-series linegraph in Figure 7 and the side-by-side boxplot in Figure 8.
Table 1: Example calendar_data data frame.

    date        summary   length  hours
    2019-09-02  sleep        480    8.0
    2019-09-02  study         60    1.0
    2019-09-03  exercise      60    1.0
    2019-09-04  sleep        960   16.0
    2019-09-04  study        180    3.0
    2019-09-05  sleep        540    9.0
    2019-09-06  exercise      30    0.5
    2019-09-06  study         90    1.5
    2019-09-07  exercise      30    0.5
    2019-09-07  sleep        540    9.0
Figure 7: Time series plot of calendar data (x-axis: Date; y-axis: Number of hours; legend: Activity, with levels exercise, sleep, study).
Figure 8: Distribution of number of hours spent split by activity type (x-axis: Activity; y-axis: Number of hours; legend: summary, with levels exercise, sleep, study).
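As a rough illustration of how plots in the spirit of Figures 7 and 8 might be drawn, the sketch below retypes the rows of Table 1 by hand and passes them to ggplot2. It is not the authors' plotting code: the object names time_series_plot and boxplot_plot, the choice of geom_line() and geom_boxplot(), and the axis labels are assumptions made for this example.

```r
library(ggplot2)

# Rows of Table 1, retyped by hand so the sketch runs without an .ics file.
calendar_data <- data.frame(
  date = as.Date(c("2019-09-02", "2019-09-02", "2019-09-03", "2019-09-04",
                   "2019-09-04", "2019-09-05", "2019-09-06", "2019-09-06",
                   "2019-09-07", "2019-09-07")),
  summary = c("sleep", "study", "exercise", "sleep", "study",
              "sleep", "exercise", "study", "exercise", "sleep"),
  hours = c(8, 1, 1, 16, 3, 9, 0.5, 1.5, 0.5, 9)
)

# Hours per day, one line per activity (in the spirit of Figure 7).
time_series_plot <- ggplot(calendar_data,
                           aes(x = date, y = hours, color = summary)) +
  geom_line() +
  labs(x = "Date", y = "Number of hours", color = "Activity")

# Distribution of daily hours for each activity (in the spirit of Figure 8).
boxplot_plot <- ggplot(calendar_data,
                       aes(x = summary, y = hours, fill = summary)) +
  geom_boxplot() +
  labs(x = "Activity", y = "Number of hours")
```

Printing either object (for example, typing time_series_plot at the console) renders the plot. Days on which an activity has no calendar entry are simply absent from the data, so the lines connect only the recorded days.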
The assignments we used in our classes did not take a substantial amount of time to create, execute, or discuss. Albert engaged with the assignment for about half a class period at three different times: (1) to motivate and introduce the assignment including an end-to-end demonstration; (2) a check-in to make sure everyone was on track, and (3) an in-class discussion after the assignment had been submitted. Jo spent only a few minutes discussing the technical aspects of the assignment and 30 min during one class period discussing Peng & Parker (2018a). We believe that including the “play the whole game” activity in the course allowed for a much deeper understanding and reflection on the entire data science pipeline, without needing to substitute any of our previous course learning goals. That is to say, the assignment we have shared is in keeping with the current structure of many statistics and data science courses. In fact, while our manuscript was under review, Prof. Katharine F. Correia at Amherst College extended our original ideas and modified them so that her students could create “data diaries” during the social isolation period of Spring 2020. Some examples of her students’ remarkable work can be found here: https://stat231-01-s20.github.io/data-science-diaries/ .
We now discuss what we view as the successes and lessons of the activity (for student perspectives on these successes and lessons see Section 3.3).
First, the focus on “self” and an engaging question (“How do I spend my time?”) easily captured our students’ interest. Universally, they found the assignment compelling, with one student remarking informally afterwards that they planned on pursuing further analyses on their own private Google Calendar. Another student came up with a clever idea of documenting their intended time studying versus their actual time studying.
Some students seemed concerned that their question on how they used their time wasn’t interesting enough, but we emphasized that the project should be more about process than outcome. Additionally, we prompted students with follow-up queries such as: “Do the amounts of time spent on certain activities have an inverse relationship with each other?” or “Do you spend as much time on a particular activity as you expect to?” (where the latter question would necessitate recording both expected and actual time for each event).
Second, having the data collection involve time intervals puts a nice cap/standardization on the scope of the type of data collected. As can be seen in Table 1 of the example calendar_data data frame, all resulting data frames consist of three variables: date (numerical), length of time spent (numerical), and activity type (categorical). Having such a combination of variables is an excellent starting point for data science courses. Furthermore, many students encountered very interesting computational and data wrangling questions quite organically, some along the lines of:
• How do I combine two types of activities in R?
• How do I only select events from days past 2019-09-02?
• My ‘hours spent’ variable is actually in minutes rather than hours. How do I convert it to hours? (We noticed operating system level variation in the default units of the recorded length variable in the calendar_data data frame.)
• How would I make a scatterplot of daily time spent studying versus time spent sleeping?
The last question is particularly interesting as it necessitates converting data frames from tall/long format (like that of the calendar_data example in Table 1) to wide format using the pivot_wider() function from the tidyr package (Wickham & Henry 2019). In our opinion, this question serves as an ideal method to teach the subtle concept of “tidy” data (Wickham 2014).
Rather than starting with the somewhat technical definition of “tidy” data, there is a benefit to having students gently “hit the wall” whereby they cannot create a certain visualization or wrangle data in a certain way unless the data is in the correct format.
Lastly, both instructors found that the exercise served as an excellent opportunity to teach students about file paths and including the dataset (in this case the .ics file) in an appropriate directory. Students were required to compile .Rmd files which connected to a unique .ics file, and a second individual (either a partner or a student grader) also needed to be able to compile the .Rmd file. Due to the typical structure of a homework assignment (e.g., professor providing a link to a dataset), our experience is that working with file paths is both difficult for novice data science students and difficult to provide practice examples for.
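The tall-to-wide conversion discussed above can be sketched as follows, using the sleep and study rows of Table 1. This is a minimal illustration rather than the assignment's own code; the object names calendar_long, calendar_wide, and complete_days are hypothetical.

```r
library(tidyr)

# Tall/long format: one row per (date, activity) pair.
# Values are the sleep and study rows of Table 1.
calendar_long <- data.frame(
  date = as.Date(c("2019-09-02", "2019-09-02", "2019-09-04", "2019-09-04",
                   "2019-09-05", "2019-09-06", "2019-09-07")),
  summary = c("sleep", "study", "sleep", "study", "sleep", "study", "sleep"),
  hours = c(8, 1, 16, 3, 9, 1.5, 9)
)

# Wide format: one row per date, one column of hours per activity.
# A day with no entry for an activity gets NA in that column.
calendar_wide <- pivot_wider(calendar_long,
                             names_from = summary,
                             values_from = hours)

# Only days with both activities recorded can become points on a
# study-versus-sleep scatterplot.
complete_days <- calendar_wide[complete.cases(calendar_wide), ]
```

In the wide data frame each row carries both coordinates of one point, so the sleep and study columns can be mapped directly to the two axes of a scatterplot. In the long format there is no single row holding both values, which is the "wall" students run into.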
Importantly, there were benefits engendered by having students “play the whole game,” both (1) of a technical, coding, and computational nature and (2) of a scientific and research nature. Many lessons were only learned by students after having gone through an entire iteration of the “data collection” and “data analysis” cycle as seen in Figure 3.
For example, one student made calendar entries recording the start times of events, whereas the assignment assumed time intervals were being recorded. That is, “I went to sleep at time X.” versus “I slept between times X and Y, and thus slept for Z hours.” The student thus needed to revise their data collection method.
Other students found out that unless their calendar entry titles matched exactly, they were recorded as two different activity types. Consider the following extremely nuanced difference due to trailing spaces: "Sleep" versus "Sleep ". Such data collection methods were also revised in order to ensure entries were standardized. These students thus learned about the differences in how humans and computers process text data.
Another issue that surprised the instructors was that calendar entries based on a “repeat” schedule only appeared once in their resulting data in R: the entry for the first day. The instructors thus had to consider whether to update the assignment instructions for students in future classes, or to let them encounter this issue on their own.
On top of more technical, coding, and computational issues, sometimes students realized mid-way that their research question itself needed revising and thus had to retroactively fill in their calendars using their best guesses. These students thus realized the importance of iterating through the entire process sooner rather than later in order to prevent errors from accumulating.
An unforeseen issue that combined both technical and scientific queries was how to allocate sleep hours.
For example, if a student went to sleep at 10pm and woke up at 6am, then 2 hours of sleep would be logged for one day and 6 hours for the next, whereas most people would call it a single night of sleep.
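Two of the issues above, inconsistent entry titles and sleep that spans midnight, lend themselves to a small base R sketch. The helpers normalize_label() and sleep_night() are hypothetical, and crediting a sleep block to the wake-up date is only one possible convention, not one prescribed by the assignment.

```r
# Hypothetical helper: remove case and stray whitespace differences so that
# "Sleep" and "Sleep " count as the same activity.
normalize_label <- function(x) tolower(trimws(x))

# Hypothetical helper: credit a whole sleep block to the calendar date on
# which it ends (the wake-up date), so one night is never split in two.
sleep_night <- function(end_datetime, tz = "America/New_York") {
  as.Date(end_datetime, tz = tz)
}

# The example from the text: asleep at 10pm, awake at 6am the next morning.
sleep_start <- as.POSIXct("2019-09-02 22:00", tz = "America/New_York")
sleep_end   <- as.POSIXct("2019-09-03 06:00", tz = "America/New_York")

labels       <- normalize_label(c("Sleep", "Sleep "))
hours_asleep <- as.numeric(difftime(sleep_end, sleep_start, units = "hours"))
night        <- sleep_night(sleep_end)
```

Passing units = "hours" to difftime() fixes the unit explicitly instead of relying on the default, which sidesteps the minutes-versus-hours confusion students reported. Whether a night belongs to the date it starts or the date it ends is a measurement decision that students would need to make and document themselves.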
On top of the earlier mentioned student “data diaries” from Prof. Katharine F. Correia’s course available at https://stat231-01-s20.github.io/data-science-diaries/ , we now provide excerpts from the reflection pieces students wrote in Albert’s course at Smith College. Students were asked to write a joint reflection piece on their experiences, keeping the podcast in mind (Peng & Parker 2018a). In particular:
• As someone who provides data: What expectations do you have when you give your data?
• As someone who analyzes others’ data: What legal and ethical responsibilities do you have?
The student quotes have been grouped into categories which relate back (roughly) to the previously described learning outcomes (LO) in Subsection 2.2.
Per Smith College Institutional Review Board guidelines, explicit consent was obtained directly from all students who are confidentially quoted in the manuscript.
Relating to iterating between “data collection” and “data analysis” (LO1):
• “I fell ill and was unable to log in more data points for her to visualize. Student X was still able to visualize my activity, incorporating my sick days as another variable”
• “One of the technical issues we ran into, and probably the defining experience of this project, was the difficulty in creating consistent error free information to export to our partner. It’s kind of ridiculous how small manual entry errors, like whether or not I used spaces (in the calendar entries), made such a difference at the end of the line when mystery variables started showing up”
• “But that was not the biggest issue we encountered, our issue was the fact that we inputted repeated events on our calendar which affected how many observations came up onto the data set (since only the first of the multiple events would show up)”
• “I, for example, learned that data is fickle and needs consideration of how variables interact with each other within a data visualization before unnecessary data collection. Her original question was inadequate as one of her variables wasn’t a function of time; thus, she had to retroactively enter the data for her new question using her best estimate, meaning her data should be recognized as imprecise and that it could affect the outcome of any analysis.”
Relating to “low” and “high touch” data collection methods (LO2):
• “Specifically for Student X’s data, it was interesting to work with real-time data recorded by her phone. Because it was automatically collected by her mobile device, there was less room for human errors when calculating the time spent on her phone screen.”
• “The time is hard to pin down because sometimes there will be interruptions or potentially interleaving studying to do other things. We think that the study time could be off between an hour and a half an hour for regular weekdays and maybe two during weekends because sometimes it is harder to take down the exact start and stop time on the spot. Comparatively, recording sleeping time is much easier because we could always check the time when the alarm goes off and take a screenshot when setting the alarm for the other day.”
Relating to analyst ethical responsibilities (LO3):
• “For example, in our group, one member accidentally shared their entire calendar data with the other member when only a small portion of this data was needed for analysis. The other member realized the first member’s mistake, deleted the data, and showed the first member how to separate the events she wanted to be analyzed from the rest of her calendar.”
Relating to prior analyst biases (LO3) : • “It might be worthwhile as the analyzer to be completely transparent and addressany previous biases that may affect the reliability of a given data analysis.” • “Keep to the original data, don’t change the data to get a favorable result” Relating to data provider expectations (LO4) : • “Additionally, the individual should have the right to ask about the research projectthat the data is going to and how their data may be used.” Relating to analyst empathy towards data providers (LO5) : • “The question was straightforward but it also made us uneasy as we had to considerboth what we were comfortable sharing and how to handle our partner’s data in anethical manner” • “On the other hand, because of my discomfort, I was also able to handle Student X’sdata with the same caution that I would expect her to do with my dataset” • “This in turn motivated me to handle her data with care and sensitivity. For thisreason, I appreciated this project as it forced us to simultaneously consider our rolesboth as people who share our data in numerous scenarios throughout our daily livesand as data science students with a new responsibility to handle another person’sdata ethically.” • “If we oppose Facebook tracking our personal information and profiting off it withoutour consent, we owe it to our customers/providers of data to not do the same.” • “Personally I did question how comfortable I was sharing information about XXX,but decided to push my boundaries, because though it made me nervous to share20nformation about XXX I was also very curious about how it affected YYY, and theactual risk of sharing this information was little to none. I wanted to gain someinsight from this project that would help me lead a healthier, more stable life.” Relating to the value of data (LO5) : • “There would be no incentives for us to share our data if we are not getting anythingback from it.” • “A grey area of data analysis would be targeted ads. 
On the one hand, one could argue that getting targeted ads bring convenience to only getting advertised products we would enjoy. On the other hand, one could also argue it pushes capitalism on us and thus only disadvantages the public by heavily influencing us to buy products we might not need.”

Relating to the balancing act many analysts face (LO5):
• “Information that puts our security or identity at risk should be left out. However, if too little information is shared, it may affect the ability to accurately represent any phenomena occurring within the data.”
• “In the podcast we listened to, Hilary Parker specifically mentioned how in collecting employees’ commute time, she wanted to protect privacy by withholding the exact minute they left work. An alternative was releasing the general length of time employees spend commuting to work would, but this would remove the impact of confounding variables and externatlities. The exact times an employee commuted were important because part of the data collection included factoring in traffic patterns. Because of this consideration, Parker decided it would be more effective to round the times an employee arrived of left work to the nearest hour to maximize privacy and accuracy of the data.”

Relating to actionable insight (LO6):
• “Since then I have been more cautious about granting apps access to my information.”
• “Visualization 1 made me realize that I should also spend more time during the weekdays sleeping and hanging out with friends to take care of myself.”

Relating to revised research questions (LO7):
• “For example, while analyzing Student X’s data, we had to decide whether or not Friday would be considered a weekday or a weekend.”
• “As for improvements in our next projects, for consistency and accuracy, it might be important to keep the data collection dates fixed so that there are the same number of the week days considered in the analysis.”

As noted by Baumer et al.
(2020), many major professional societies, including the American Statistical Association (ASA) (Committee on Professional Ethics 2018b), the Association for Computing Machinery (ACM) (Committee on Professional Ethics 2018a), and the National Academy of Sciences (NAS) (Committee on Science, Engineering, and Public Policy 2009), have long published guidelines for conducting ethical research. However, only more recently has there been a growing literature on weaving data ethics topics early and often throughout undergraduate curricula in data science, statistics, and computer science (Baumer et al. 2020, Ott 2019, Burton et al. 2018, Elliott et al. 2018, Heggeseth 2019).

The class activity we present provides a myriad of ways to start a classroom conversation about data ethics. Albert motivated the idea of the project by discussing the pros and cons of how he logs his daily calorie consumption using the MyFitnessPal app. On the one hand, the students could appreciate the convenience of logging calories using the barcode scanner; on the other hand, they could also understand that Albert is essentially telling the Under Armour corporation everything he eats and when. In the age of Fitbit, 23andMe, and innumerable health and fitness apps, the example presents the students with an opportunity to think about what information they are sharing with large corporations.

Instructors may choose a different type of data collection, or let each student decide on their own collection method. Some possible extensions include the above-mentioned fitness apps (outputting all entries in a .csv file); every interaction on Instagram (outputting a .json file); Google Trends or Google search terms (output as a function of time); or personal Facebook information (as .html or .json).
Each data type will present a unique setting in which to discuss privacy and data ethics considerations.

In Jo’s class conversation about the podcast, one student appreciated that Hilary and Roger discussed whether they should do the data collection at all. The student suggested that maybe Hilary should just leave earlier for work so as to not make anyone wait; that would be the kinder thing to do. Indeed, as Hilary herself says, “. . . it’s probably better to just give yourself a cushion so you don’t have to worry.” The student comment led to a larger conversation about what types of questions we should answer with data science and what types of questions should not be data driven.

In the last few minutes of the podcast, Parker says (Peng & Parker 2018a):

. . . setting up, thinking through what could you collect, what makes sense to collect? What are the assumptions in the data that you’re collecting? Like going through that, I would encourage people to go through that process, even if you end up like not doing it long-term or something, but it’s still, I think that’s the skill set that’s important to develop if you want to work in data science.

Why does Hilary Parker implore people to “play the whole game”? We believe that there is a disconnect in the narrative surrounding “what is data analysis” when instead it should be more closely connected to “what do data analysts do.” The only way to inculcate lessons pertaining to “what data analysts do” is by having students “play the whole game.” Furthermore, just as in real life, it is critical that students go through the cycle several times. In other words, “iterate early and iterate often.”

Most importantly, we’d like to acknowledge the Smith College and Pomona College students with whom we work. It is from their ideas and energy that we discovered the fun and beauty of working in statistics and data science. Hilary Parker & Roger Peng provided not only the inspiration for the activity but also helpful feedback on the pre-print.
We appreciate the thoughtful suggestions of three anonymous referees and the associate editor. In addition to the packages cited in the manuscript, the authors also used the ggplot2 (Wickham, Chang, Henry, Pedersen, Takahashi, Wilke, Woo & Yutani 2019), kableExtra (Zhu 2019), knitr (Xie 2019), and viridis (Garnier 2018) packages. We also thank Hadley Wickham for providing the direct inspiration for the “playing the whole game” wording in our paper.
Nothing to declare.
References
Allaire, J., Xie, Y., McPherson, J., Luraschi, J., Ushey, K., Atkins, A., Wickham, H., Cheng, J., Chang, W. & Iannone, R. (2019), rmarkdown: Dynamic Documents for R. R package version 1.15.
URL: https://CRAN.R-project.org/package=rmarkdown
Baumer, B., Garcia, R., Kim, A., Kinnaird, K. & Ott, M. (2020), ‘Integrating data science ethics into an undergraduate major’, submitted.
URL: https://arxiv.org/abs/2001.07649
Baumer, B. S. (2015), ‘A data science course for undergraduates: Thinking with data’, The American Statistician (4), 334–342.
URL: http://dx.doi.org/10.1080/00031305.2015.1081105
Bialik, C. (2014), ‘The students most likely to take our jobs’, FiveThirtyEight.com.
URL: https://fivethirtyeight.com/features/the-students-most-likely-to-take-our-jobs/
Blitzstein, J. (2013), ‘What is it like to design a data science class? In particular, what was it like to design Harvard’s new data science class, taught by professors Joe Blitzstein and Hanspeter Pfister?’.
URL:
Communications of the ACM (8), 54–64. URL: https://dl.acm.org/citation.cfm?id=3154485
Committee on Professional Ethics (2018a), ACM Code of Ethics and Professional Conduct, Association for Computing Machinery, Inc.
URL:
Committee on Professional Ethics (2018b), Ethical guidelines for statistical practice, Technical report, American Statistical Association.
URL:
Committee on Science, Engineering, and Public Policy (2009), On being a scientist: a guide to responsible conduct in research, 3 edn, Washington, DC: National Academies Press.
URL:
Elliott, A. C., Stokes, S. L. & Cao, J. (2018), ‘Teaching ethics in a statistics curriculum with a cross-cultural emphasis’, The American Statistician (4), 359–367.
URL:
Garnier, S. (2018), viridis: Default Color Maps from ‘matplotlib’. R package version 0.5.1.
URL: https://CRAN.R-project.org/package=viridis
Grolemund, G. & Wickham, H. (2011), ‘Dates and times made easy with lubridate’, Journal of Statistical Software (3), 1–25.
URL:
Halvorsen, K. (2010), ‘Formulating statistical questions and implementing statistics projects in an introductory applied statistics course’, Proceedings of the International Conference on Teaching Statistics 8.
URL: https://iase-web.org/documents/papers/icots8/ICOTS8 4G3 HALVORSEN.pdf
Halvorsen, K. & Moore, T. (2000), Section 2: Motivating, monitoring, and evaluating student projects, in T. Moore, ed., ‘Teaching Statistics: Resources for Undergraduate Instructors’, Vol. 52, The Mathematical Association of America, Washington, DC, pp. 27–32.
Hardin, J., Hoerl, R., Horton, N. J., Nolan, D., Baumer, B. S., Hall-Holt, O., Murrell, P., Peng, R., Roback, P., Temple Lang, D. & Ward, M. D. (2015), ‘Data science in statistics curricula: Preparing students to “Think with data”’, The American Statistician (4), 343–353.
URL: https://doi.org/10.1080/00031305.2015.1077729
Heggeseth, B. (2019), ‘Intertwining data ethics in intro stats’, Symposium on Data Science and Statistics.
URL: https://drive.google.com/file/d/1GXzVMpb6GVNfWPS6bd9jggtqq1C77Wsc/view
Hjortskov, M. (2017), ‘Priming and context effects in citizen satisfaction surveys’, Public Administration (4), 912–926.
URL: https://onlinelibrary.wiley.com/doi/abs/10.1111/padm.12346
Horton, N. J., Baumer, B. S. & Wickham, H. (2015), ‘Taking a chance in the classroom: Setting the stage for data science: Integration of data management skills in introductory and second courses in statistics’, CHANCE (2), 40–50.
URL: https://doi.org/10.1080/09332480.2015.1042739
Kim, A. Y., Ismay, C. & Chunn, J. (2018), ‘The fivethirtyeight R package: ‘Tame Data’ principles for introductory statistics and data science courses’, Technology Innovations in Statistics Education.
URL: https://escholarship.org/uc/item/0rx1231m
Kuiper, S. & Sturdivant, R. X. (2015), ‘Using online game-based simulations to strengthen students’ understanding of practical statistical issues in real-world data analysis’, The American Statistician (4), 354–361.
URL: https://doi.org/10.1080/00031305.2015.1075421
Lohr, S. (Aug 17, 2014), ‘For big-data scientists, ‘janitor work’ is key hurdle to insights’, New York Times.
URL:
Loy, A., Kuiper, S. & Chihara, L. (2019), ‘Supporting data science in the statistics curriculum’, Journal of Statistics Education (1), 2–11.
McNamara, A. A. (2015), Bridging the Gap Between Tools for Learning and for Doing Statistics, PhD thesis, UCLA.
URL: https://escholarship.org/uc/item/1mm9303x
Meissner, P. (2019), ical: ‘iCalendar’ Parsing. R package version 0.1.6.
URL: https://CRAN.R-project.org/package=ical
Müller, K. & Wickham, H. (2019), tibble: Simple Data Frames. R package version 2.1.3.
URL: https://CRAN.R-project.org/package=tibble
National Academies of Sciences Engineering and Medicine (2018), Data Science for Undergraduates: Opportunities and Options, The National Academies Press, Washington, DC.
URL:
Nolan, D. & Lang, D. T. (2010), ‘Computing in the statistics curricula’, The American Statistician (2), 97–107.
URL: https://doi.org/10.1198/tast.2010.09132
Ott, M. (2019), Symposium on Data Science & Statistics, Seattle, WA.
URL:
Peng, R. (2019), ‘How data scientists think - a mini case study’, Simply Stats.
URL: https://simplystatistics.org/2019/01/09/how-data-scientists-think-a-mini-case-study/
Peng, R. & Parker, H. (2018a), ‘Compromised shoe situation’, Not So Standard Deviations.
URL: http://nssdeviations.com/71-compromised-shoe-situation
Peng, R. & Parker, H. (2018b), ‘The smoothie happens everyday’, Not So Standard Deviations.
URL: http://nssdeviations.com/70-the-smoothie-happens-everyday
Perkins, D. (2010), Making Learning Whole: How Seven Principles of Teaching Can Transform Education, Wiley.
URL: https://books.google.com/books?id=0DF9WxgGgNsC
Protopapas, P., Rader, K., Glickman, M., Tanner, C., Blitzstein, J., Pfister, H. & Kaynig-Fittkau, V. (2020), ‘CS109 Data Science’.
URL: http://cs109.github.io/2015/
Robinson, D. (2018), Twitter.
URL: https://twitter.com/drob/status/987436677026254848
Ruiz, A. (2017), ‘The 80/20 data science dilemma’, InfoWorld.
URL:
Scheaffer, R., Erickson, T., Watkins, A., Witmer, J. & Gnanadesikan, M. (2004), Activity-Based Statistics, Key College.
Sole, M. A. & Weinberg, S. L. (2017), ‘What’s brewing? A statistics education discovery project’, Journal of Statistics Education (3), 137–144.
URL: https://doi.org/10.1080/10691898.2017.1395302
Wickham, H. (2014), ‘Tidy data’, Journal of Statistical Software, Articles (10), 1–23.
URL:
Wickham, H. & Bryan, J. (2019), R packages, O’Reilly Media.
URL: https://r-pkgs.org/
Wickham, H., Chang, W., Henry, L., Pedersen, T. L., Takahashi, K., Wilke, C., Woo, K. & Yutani, H. (2019), ggplot2: Create Elegant Data Visualisations Using the Grammar of Graphics. R package version 3.2.1.
URL: https://CRAN.R-project.org/package=ggplot2
dplyr: A Grammar of Data Manipulation. R package version 0.8.3.
URL: https://CRAN.R-project.org/package=dplyr
Wickham, H. & Grolemund, G. (2017), R for Data Science, O’Reilly Media.
URL: https://r4ds.had.co.nz/
Wickham, H. & Henry, L. (2019), tidyr: Tidy Messy Data. R package version 1.0.0.
URL: https://CRAN.R-project.org/package=tidyr
Xie, Y. (2019), knitr: A General-Purpose Package for Dynamic Report Generation in R. R package version 1.26.
URL: https://CRAN.R-project.org/package=knitr
Yan, D. & Davis, G. (2019), ‘A first course in data science’, Journal of Statistics Education (2), 99–109.
Zhu, H. (2019), kableExtra: Construct Complex Table with ‘kable’ and Pipe Syntax. R package version 1.1.0.
URL: https://CRAN.R-project.org/package=kableExtra
Zhu, Y., Hernandez, L. M., Mueller, P., Dong, Y. & Forman, M. R. (2013), ‘Data acquisition and preprocessing in studies on humans: What is not taught in statistics classes?’, The American Statistician (4), 235–241. PMID: 24511148.
URL: https://doi.org/10.1080/00031305.2013.842498