Tools for analyzing R code the tidy way
Lucy D’Agostino McGowan, Sean Kross, Jeffrey Leek
Abstract
With the current emphasis on reproducibility and replicability, there is an increasing need to examine how data analyses are conducted. In order to analyze the between-researcher variability in data analysis choices as well as the aspects within the data analysis pipeline that contribute to the variability in results, we have created two R packages: matahari and tidycode. These packages build on methods created for natural language processing; rather than allowing for the processing of natural language, we focus on R code as the substrate of interest. The matahari package facilitates the logging of everything that is typed in the R console or in an R script in a tidy data frame. The tidycode package contains tools to allow for analyzing R calls in a tidy manner. We demonstrate the utility of these packages as well as walk through two examples.
Introduction
With the current emphasis on reproducibility and replicability, there is an increasing need to examine how data analyses are conducted (Goecks et al. 2010; Peng 2011; McNutt 2014; Miguel et al. 2014; Ioannidis et al. 2014; Richard 2014; Leek and Peng 2015; Nosek et al. 2015; Sidi and Harel 2018). In order to accurately replicate a result, the exact methods used for data analysis need to be recorded, including the specific analytic steps taken as well as the software utilized (Waltemath and Wolkenhauer 2016). Studies across multiple disciplines have examined the global set of possible data analyses that can be conducted on a specific data set (Silberzhan et al. 2018). While we are able to define this global set, very little is known about the actual variation that exists between researchers. For example, it is possible that the true range of data analysis choices is realistically a much more narrow set than the global sets that are presented. There is a breadth of excellent research and experiments examining how people read visual information (Majumder, Hofmann, and Cook 2013; Loy, Hofmann, and Cook 2017; Wickham, Cook, and Hofmann 2015; Buja et al. 2009; Loy, Follett, and Hofmann 2016), for example the Experiments on Visual Inference detailed at http://mamajumder.github.io/html/experiments.html, but not how they actually make analysis choices, specifically analysis coding choices. In addition to not knowing about the “data analysis choice” variability between researchers, we also do not know which portions of the data analysis pipeline result in the most variability in the ultimate research result. We seek to build tools to analyze these two aspects of data analysis:

1. The between-researcher variability in data analysis choices
2. The aspects within the data analysis pipeline that contribute to the variability in results

Specifically, we have designed a framework to conduct such analyses and created two R packages that allow for the study of data analysis code conducted in R.
In addition to answering these crucial questions for broad research fields, we see these tools having additional concrete use cases. These tools will facilitate data science and statistics pedagogy, allowing researchers and instructors to investigate how students are conducting data analyses in the classroom. Alternatively, a researcher could use these tools to examine how collaborators have conducted a data analysis. Finally, these tools could be used in a meta-manner to explore how current software and tools in R are being utilized.
Tidy principles
We specifically employ tidy principles in our proposed packages.
Tidy refers to an implementation strategy propagated by Hadley Wickham and implemented by the Tidyverse team at RStudio (Wickham and Grolemund 2016). Here, by tidy we mean our packages adhere to the following principles:

1. Our functions follow the principles outlined in R Packages (Wickham 2015) as well as the tidyverse style guide (Wickham 2019).
2. Our output data sets are tidy, as in:
• Each variable has its own column.
• Each observation has its own row.
• Each value has its own cell.

By implementing these tidy principles, and thus outputting tidy data frames, we allow for data manipulation and analysis to be conducted using a specific set of tools, such as those included in the tidyverse meta-package (Wickham 2017). Ultimately, we create a mechanism to utilize methods created for natural language processing; here the substrate is code rather than natural language. We model our tools to emulate the tidytext package (Silge and Robinson 2016, 2017); instead of analyzing tokens of text, we are analyzing tokens of code. We present two packages: matahari, a package for logging everything that is typed in the R console or in an R script, and tidycode, a package with tools to allow for analyzing R calls in a tidy manner. In this paper, we first explain how these packages work. We then demonstrate two examples, one that analyzes data collected from an online experiment, and one that analyzes “old” data via previously created R scripts.

Figure 1: A flowchart of a typical analysis that uses matahari and tidycode to analyze and classify R code.
Methods
We have created two R packages, matahari and tidycode. The former is a way to log R code; the latter allows the user to analyze R calls at the function level in a tidy manner. Figure 1 is a flowchart of the process described in more detail below. This flowchart is adapted from Figure 2.1 in Text Mining with R: A Tidy Approach (Silge and Robinson 2017). We demonstrate how to create these tidy data frames of R code and then emulate a data analysis workflow similar to that put forth in the tidy text literature.
Terminology
In this paper, we refer to R “expressions” or “calls” as well as R “functions” and “arguments”. An R call is a combination of an R function with arguments. For example, the following is an R call (Example 1).

library(tidycode)

Example 1. R call, library

Another example of an R call is the following piped chain of functions from the dplyr package (Example 2).

starwars %>%
  select(height, mass)

Example 2. Piped R call

Specifically, we know something is a call in R if is.call() is TRUE.

quote(starwars %>% select(height, mass)) %>%
  is.call()

Calls in R are made up of a function, or the name of a function, and arguments. For example, the call library(tidycode) from Example 1 is comprised of the function library() and the argument tidycode. Example 2 is a bit more complicated. The piped code can be rewritten, as seen in Example 3.

`%>%`(starwars, select(height, mass))

Example 3. Rewritten piped R call

From this example, it is easier to see that the function for this R call is %>% with two arguments, starwars and select(height, mass). Notice that one of these arguments is an R call itself, select(height, mass).

matahari

matahari is a simple package for logging R code in a tidy manner. It can be installed from CRAN using the following code.

install.packages("matahari")

There are three ways to use the matahari package:

1. Record R code as it is typed and output a tidy data frame of the contents
2. Input a character string of R code and output a tidy data frame of the contents
3. Input an R file containing R code and output a tidy data frame of the contents

In the following sections, we will split these into two categories: tidy logging from the R console (1) and tidy logging from an R script (2 and 3).
Tidy logging from the R console
In order to begin logging from the R console, the dance_start() function is used. Logging is paused using dance_stop() and the log can be viewed using dance_tbl(). For example, the following code will result in the subsequent tidy data frame.

library(matahari)
dance_start()
1 + 2
"here is some text"
sum(1:10)
dance_stop()
dance_tbl()
Logging R code from the R console using matahari
The resulting tidy data frame consists of 6 columns: expr, the R call that was run; value, the value that was output; path, the path to the file in focus if the code was run within RStudio; contents, the file contents of the RStudio editor tab in focus; selection, the text that is highlighted in the RStudio editor tab in focus; and dt, the date and time the expression was run. By default, value, path, contents, and selection will not be logged unless the corresponding argument is set to TRUE in the dance_start() function. For example, if the analyst wanted the output data frame to include the values computed, they would input dance_start(value = TRUE).

In this particular data frame, there are 6 rows. The first and final rows report the R session information at the time when dance_start() was initiated (row 1) and when dance_stop() was run (row 6). The second row holds dance_start(), the first command run in the R console; the third row holds 1 + 2, the fourth holds "here is some text", and the fifth holds sum(1:10).

dance_tbl()[["expr"]]

These functions work by saving an invisible data frame called .dance that is referenced by dance_tbl(). Each time dance_start() is subsequently run after dance_stop(), new rows of data are added to this data frame. This invisible data frame exists in a new environment created by the matahari package. We can remove this data frame by running dance_remove(). The logged data frame can be manipulated using common R techniques. Below, we rerun the same code as above, this time saving the values that are computed in the R console by using the value = TRUE parameter.

dance_start(value = TRUE)
1 + 2
"here is some text"
sum(1:10)
dance_stop()
tbl <- dance_tbl()
As an example of the type of data wrangling that this tidy format allows for, using dplyr and purrr, we can manipulate this to only examine expressions that result in numeric values.

library(dplyr)
library(purrr)

t_numeric <- tbl %>%
  mutate(numeric_output = map_lgl(value, is.numeric)) %>%
  filter(numeric_output)
t_numeric
Here, three rows are output, since we have filtered to only calls with numeric output:

1. The dance_start() call (this defaults to have a numeric value of 1)
2. The 1 + 2 call, resulting in a value of 3
3. The sum(1:10) call, resulting in a value of 55

Tidy logging from an R script
In addition to allowing for the logging of everything typed in the R console, the matahari package also allows for the logging of pre-created R scripts. This can be done using the dance_recital() function, which allows for either a .R file or a character string of R calls as the input. For example, if we have a code file called sample_code.R, we can run dance_recital("sample_code.R") to create a tidy data frame. Alternatively, we can enter code directly as a string of text, such as dance_recital("1 + 2"), to create the tidy data frame. The code below illustrates this functionality.

code_file <- system.file("test", "sample_code.R", package = "matahari")
dance_recital(code_file)
Example 5.
Logging code from a .R file using matahari

code_string <- '
4 + 4
"wow!"
mean(1:10)
stop("Error!")
warning("Warning!")
message("Hello?")
cat("Welcome!")
'
dance_recital(code_string)

Example 6.
Logging code from a character string using matahari
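As the examples above show, dance_recital() evaluates each call and records not just its value but any error, warnings, and messages it signals. The mechanics of capturing all of these from a single quoted call can be sketched in base R; capture_call() below is a hypothetical helper for illustration, not the matahari implementation.

```r
# A minimal base-R sketch: evaluate one quoted call and collect its value,
# any error, and any warnings or messages it signals along the way.
# capture_call() is a hypothetical helper, not the matahari implementation.
capture_call <- function(expr) {
  warnings <- character(0)
  messages <- character(0)
  error <- NULL
  value <- withCallingHandlers(
    tryCatch(eval(expr), error = function(e) {
      # A poorly formed call aborts evaluation; record the error message
      error <<- conditionMessage(e)
      NULL
    }),
    warning = function(w) {
      # Record and muffle warnings without stopping evaluation
      warnings <<- c(warnings, conditionMessage(w))
      invokeRestart("muffleWarning")
    },
    message = function(m) {
      # Record and muffle diagnostic messages
      messages <<- c(messages, conditionMessage(m))
      invokeRestart("muffleMessage")
    }
  )
  list(value = value, error = error, warnings = warnings, messages = messages)
}

capture_call(quote(mean(1:10)))       # value 5.5, no conditions
capture_call(quote(stop("Error!")))   # records the error message instead
```

The key design point is that withCallingHandlers() lets warnings and messages be recorded and muffled while evaluation continues, whereas tryCatch() traps errors, which necessarily abort the call.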
The resulting tidy data frame from dance_recital(), as seen in Examples 5 and 6, is different from that of dance_tbl(). This data frame has 6 columns. The first is the same as in dance_tbl(): expr, the R calls in the .R script or string of code. The subsequent columns are value, the computed result of the R call; error, which contains the resulting error object from a poorly formed call; output, the printed output from a call; warnings, the contents of any warnings that would be displayed in the console; and messages, the contents of any generated diagnostic messages. Now that we have a tidy data frame with R calls obtained either from the R console or from a .R script, we can analyze them using the tidycode package.

tidycode

The goal of tidycode is to allow users to analyze R scripts, calls, and functions in a tidy way. There are two main tasks that can be achieved with this package:

1. We can “tokenize” R calls
2. We can classify the functions run into one of nine potential data analysis categories: “Setup”, “Exploratory”, “Data Cleaning”, “Modeling”, “Evaluation”, “Visualization”, “Communication”, “Import”, or “Export”.

The tidycode package can be installed from CRAN in the following manner.

install.packages("tidycode")
library(tidycode)
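Conceptually, the classification task amounts to joining tokenized function names against a curated lexicon. A toy base-R sketch follows; the miniature lexicon below is invented for illustration and is not the curated tidycode lexicon.

```r
# A toy classification lexicon; these entries are invented for illustration
# and are NOT the curated tidycode lexicon.
lexicon <- data.frame(
  func = c("library", "glimpse", "filter", "lm", "ggplot"),
  classification = c("setup", "exploratory", "data cleaning",
                     "modeling", "visualization"),
  stringsAsFactors = FALSE
)

# Tokenized function names from a hypothetical script
calls <- data.frame(func = c("library", "filter", "lm"),
                    stringsAsFactors = FALSE)

# Base R's merge() plays the role of the inner join used by tidycode
merge(calls, lexicon, by = "func")
```

Every function observed in the script that also appears in the lexicon picks up a classification column, which is exactly the shape of data the analyses below work with.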
We can first create a tidy data frame using the matahari package. Alternatively, we can use a function in the tidycode package called read_rfiles(), which wraps the dance_recital() function. This function allows you to read in multiple .R files or links to .R files. There are a few example files included in the tidycode package. The paths to these files can be accessed via the tidycode_example() function. For example, running the following code will give the file path for the example_analysis.R file.

tidycode_example("example_analysis.R")
Running the function without any file specified will supply a vector of all available file names.

tidycode_example()
We can use these example files in the read_rfiles() function.

df <- read_rfiles(tidycode_example(c("example_analysis.R", "example_plot.R")))
df

This will give a tidy data frame with three columns: file, the path to the file; expr, the R call; and line, the line the call was made on in the original .R file. We can then use the unnest_calls() function to create a data frame of the calls, splitting each into the individual functions and arguments. We liken this to the tidytext unnest_tokens() function. This function has two parameters: .data, the data frame that contains the R calls, and input, the name of the column that contains the R calls. In this case, the data frame is df and the input column is expr.

u <- unnest_calls(df, expr)
u

This results in a tidy data frame with two additional columns: func, the name of the function called, and args, the arguments of the function called. Because this function takes a data frame as the first argument, it works nicely with the tidyverse data manipulation packages. For example, we could get the same data frame as above by using the following code.

df %>%
  unnest_calls(expr)

We can further manipulate this; for example, we could select just the func and args columns using dplyr’s select() function.

df %>%
  unnest_calls(expr) %>%
  select(func, args)

The get_classifications() function calls a classification data frame that we curated, which classifies the individual functions into one of nine categories: setup, exploratory, data cleaning, modeling, evaluation, visualization, communication, import, or export. This can also be merged into the data frame.

u %>%
  inner_join(get_classifications()) %>%
  select(func, classification, lexicon, score)

There are two lexicons for classification, crowdsource and leeklab. The former was created by volunteers who classified R code using the classify shiny application. The latter was curated by Jeff Leek’s lab. To select a particular lexicon, you can specify the lexicon parameter.
For example, the following code will merge in the crowdsource lexicon only.

u %>%
  inner_join(get_classifications("crowdsource")) %>%
  select(func, classification, score)

It is possible for a function to belong to multiple classes. This will result in multiple lines (and multiple classifications) for a given function. By default, these multiple classifications are included along with the prevalence of each, indicated by the score column. To merge in only the most prevalent classification, set the include_duplicates option to FALSE.

u %>%
  inner_join(get_classifications("crowdsource", include_duplicates = FALSE)) %>%
  select(func, classification)

In text analysis, there is the concept of “stopwords”. These are often small, common filler words you want to remove before completing an analysis, such as “a” or “the”. In a tidy code analysis, we can use a similar concept to remove some functions. For example, we may want to remove the assignment operator, <-, before completing an analysis. We have compiled a list of common stop functions in the get_stopfuncs() function to anti join from the data frame.

u %>%
  inner_join(get_classifications("crowdsource", include_duplicates = FALSE)) %>%
  anti_join(get_stopfuncs()) %>%
  select(func, classification)

Examples
Online experiment: P-hack-athon
This first example demonstrates how to use the matahari and tidycode packages to analyze data from a prospective study, using the “recording” capabilities of the matahari package to capture the code as participants run it. Recently, we launched a “p-hack-athon” where we encouraged users to analyze a dataset with the goal of producing the smallest p-value (IRB approved). Participants recorded their analysis code using the dance_start() and dance_stop() functions from the matahari package. This resulted in a tidy data frame of R calls for each participant. We use the tidycode package to analyze these matahari data frames.
Setup

library(tidyverse)
library(tidycode)
load("data/df_phackathon.Rda")
The data from the “p-hack-athon” is saved as a data frame called df. We have bound the expr column from the matahari data frame for each participant. Using the unnest_calls() function, we unnest each of these R calls into a function and its arguments.

tbl <- df %>%
  unnest_calls(expr)

We can then remove the “stop functions” by doing an anti join with the get_stopfuncs() function and merge in the crowd-sourced classifications with the get_classifications() function.

tbl <- tbl %>%
  anti_join(get_stopfuncs()) %>%
  inner_join(get_classifications("crowdsource", include_duplicates = FALSE))

Classifications
We can use common data manipulation functions from dplyr. For example, on average, “data cleaning” functions made up 39.6% of the functions run by participants (Table 1).

tbl %>%
  group_by(id, classification) %>%
  summarise(n = n()) %>%
  mutate(pct = n / sum(n)) %>%
  group_by(classification) %>%
  summarise(`Average percent` = mean(pct) * 100) %>%
  arrange(-`Average percent`)

We can also examine the most common functions in each classification.

func_counts <- tbl %>%
  count(func, classification, sort = TRUE) %>%
  ungroup()

func_counts %>%
  filter(classification %in% c("data cleaning", "exploratory", "modeling", "visualization")) %>%
  group_by(classification) %>%
  top_n(5) %>%
  ungroup() %>%
  mutate(func = reorder(func, n)) %>%
  ggplot(aes(func, n, fill = classification)) +
  theme_bw() +
  geom_col(show.legend = FALSE) +
  facet_wrap(~classification, scales = "free_y") +
  scale_x_discrete(element_blank()) +
  scale_y_continuous("Number of function calls in each classification") +
  coord_flip()

Figure 2: Functions that contribute to data cleaning, exploratory analysis, modeling, and visualization classifications in the p-hack-athon trial

We could also examine a word cloud of the functions used, colored by the classification. We can do this using the wordcloud library.

library(wordcloud)

tbl %>%
  count(func, classification) %>%
  with(wordcloud(func, n,
    colors = brewer.pal(9, "Set1")[factor(.$classification)],
    random.order = FALSE,
    ordered.colors = TRUE))

Figure 3: Word cloud of functions used in the p-hack-athon trial, colored by classification

Static Analysis
This second example demonstrates how to use the matahari and tidycode packages to analyze data from a retrospective study, or static R scripts. Here, we use the read_rfiles() function from the tidycode package. This wraps the dance_recital() matahari function and allows for multiple file paths or URLs to be read, resulting in a tidy data frame. As an example, we are going to scrape all of the .R files from two of the most widely used data manipulation packages, the data.table package (Dowle and Srinivasan 2019) and the dplyr package. We are going to use the gh package (Bryan and Wickham 2017) to scrape these files from GitHub.

Setup
We access the files via GitHub using the gh() function from the gh package. This gives a list of download URLs that can be passed to the read_rfiles() function from the tidycode package.

library(tidyverse)
library(gh)
library(tidycode)

dplyr_code <- gh("/repos/tidyverse/dplyr/contents/R") %>%
  purrr::map("download_url") %>%
  read_rfiles()

datatable_code <- gh("/repos/Rdatatable/data.table/contents/R") %>%
  purrr::map("download_url") %>%
  read_rfiles()

Data Cleaning
We can combine these two tidy data frames. We will do some small data manipulation, removing R calls that were either NULL or character. For example, in the dplyr package some .R files just reference data frames as a character string.

pkg_data <- bind_rows(
  list(dplyr = dplyr_code, datatable = datatable_code),
  .id = "pkg") %>%
  filter(!map_lgl(expr, is.null), !map_lgl(expr, is.character))

Analyze R functions
Now we can use the tidycode unnest_calls() function to create a tidy data frame of the individual functions, along with the arguments, used to create both packages. Notice here we are not performing an anti join on “stop functions”. For this analysis, we are interested in examining some key differences in the commonly used functions contained in the two packages. Common operators may actually be of interest, so we do not want to drop them from the data frame. We can count the functions by package.

func_counts <- pkg_data %>%
  unnest_calls(expr) %>%
  count(pkg, func, sort = TRUE)
func_counts
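This counting relies on treating operators such as <- and = as ordinary functions. The same idea can be sketched self-contained in base R: count_funcs() below is a hypothetical helper, not the tidycode implementation, that walks each parsed call recursively and tallies every function name, operators included.

```r
# Count how often each function (including operators such as <- and =)
# appears in a string of R code, using only base R. count_funcs() is a
# hypothetical helper, not the tidycode implementation.
count_funcs <- function(code) {
  walk <- function(x) {
    if (!is.call(x)) return(character(0))
    parts <- as.list(x)
    # Symbols deparse with backticks, so use as.character for simple names
    head_name <- if (is.name(parts[[1]])) as.character(parts[[1]]) else deparse(parts[[1]])
    # The call head is the function; recurse into the arguments,
    # which may themselves be calls
    c(head_name, unlist(lapply(parts[-1], walk)))
  }
  exprs <- as.list(parse(text = code))
  table(unlist(lapply(exprs, walk)))
}

count_funcs("x <- 1\ny = 2\nz <- x + y")
```

On this three-line snippet the tally distinguishes the two assignment styles, which is precisely the contrast Figure 4 draws between the data.table and dplyr sources.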
Using this data frame, we can visualize which functions are most commonly called in each package.

top_funcs <- func_counts %>%
  group_by(pkg) %>%
  top_n(10) %>%
  ungroup() %>%
  arrange(pkg, n) %>%
  mutate(i = row_number())

ggplot(top_funcs, aes(i, n, fill = pkg)) +
  theme_bw() +
  geom_col(show.legend = FALSE) +
  facet_wrap(~pkg, scales = "free") +
  scale_x_continuous(element_blank(),
    breaks = top_funcs$i,
    labels = top_funcs$func,
    expand = c(0, 0)) +
  coord_flip()

We can glean a few interesting details from Figure 4. First, the data.table authors sometimes use = as an assignment operator, resulting in this being the most frequent function used. The dplyr authors always use <- for assignment, therefore this is the most frequent function seen in this package (Wickham 2019). Additionally, the dplyr authors often create modular code as a combination of small functions to complete specific tasks. This may explain why function is the third most frequent R call in this package, and less prevalent in the data.table package. This just serves as a glimpse of what can be accomplished with these tools.

Discussion
We have designed a framework to analyze the data analysis pipeline and created two R packages that allow for the study of data analysis code conducted in R. We present two packages: matahari, a package for logging everything that is typed in the R console or in an R script, and tidycode, a package with tools to allow for analyzing R calls in a tidy manner. These tools can be applied both to prospective studies, where a researcher can intentionally record code typed by participants, and retrospective studies, where the researcher can analyze previously written code. We believe that these tools will help shape the next phase of reproducibility and replicability, allowing the analysis of code to inform data science pedagogy, examine how collaborators conduct data analyses, and explore how current software tools are being utilized.

Figure 4: Most frequent functions used in data.table and dplyr package development.
Acknowledgements
We would like to extend a special thank you to the members of the Leek Lab at the Johns Hopkins Bloomberg School of Public Health, as well as the volunteers who used the “classify” shiny application, for helping classify R functions.
References
Bryan, Jennifer, and Hadley Wickham. 2017. Gh: ’GitHub’ ’Api’. https://CRAN.R-project.org/package=gh.

Buja, Andreas, Dianne Cook, Heike Hofmann, Michael Lawrence, Eun-Kyung Lee, Deborah F Swayne, and Hadley Wickham. 2009. “Statistical Inference for Exploratory Data Analysis and Model Diagnostics.” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences.

Dowle, Matt, and Arun Srinivasan. 2019. Data.table: Extension of ‘Data.frame‘. https://CRAN.R-project.org/package=data.table.

Goecks, Jeremy, Anton Nekrutenko, James Taylor, and Galaxy Team. 2010. “Galaxy: A Comprehensive Approach for Supporting Accessible, Reproducible, and Transparent Computational Research in the Life Sciences.” Genome Biology 11 (8).

Ioannidis, John P A, Marcus R Munafo, Paolo Fusar-Poli, Brian A Nosek, and Sean P David. 2014. “Publication and Other Reporting Biases in Cognitive Sciences: Detection, Prevalence, and Prevention.” Trends in Cognitive Sciences 18 (5): 235–41.

Leek, Jeffrey T, and Roger D Peng. 2015. “Opinion: Reproducible Research Can Still Be Wrong: Adopting a Prevention Approach.” Proceedings of the National Academy of Sciences 112 (6): 1645–6.

Loy, Adam, Lendie Follett, and Heike Hofmann. 2016. “Variations of Q–Q Plots: The Power of Our Eyes!” The American Statistician 70 (2): 202–14.

Loy, Adam, Heike Hofmann, and Dianne Cook. 2017. “Model Choice and Diagnostics for Linear Mixed-Effects Models Using Statistics on Street Corners.” Journal of Computational and Graphical Statistics 26 (3): 478–92.

Majumder, Mahbubul, Heike Hofmann, and Dianne Cook. 2013. “Validation of Visual Statistical Inference, Applied to Linear Models.” Journal of the American Statistical Association 108 (503): 942–56.

McNutt, M. 2014. “Reproducibility.” Science 343 (6168): 229.

Miguel, E, C Camerer, K Casey, J Cohen, K M Esterling, A Gerber, R Glennerster, et al. 2014. “Promoting Transparency in Social Science Research.” Science 343 (6166): 30–31.

Nosek, B A, G Alter, G C Banks, D Borsboom, S D Bowman, S J Breckler, S Buck, et al. 2015. “Promoting an Open Research Culture.” Science 348 (6242): 1422–5.

Peng, Roger D. 2011. “Reproducible Research in Computational Science.” Science 334 (6060): 1226–7.

Richard, Blaustein. 2014. “Reproducibility Undergoes Scrutiny.” BioScience 64 (4): 368.

Sidi, Yulia, and Ofer Harel. 2018. “The Treatment of Incomplete Data: Reporting, Analysis, Reproducibility, and Replicability.” Social Science & Medicine 209 (July): 169–73.

Silberzhan, Raphael, Eric L Uhlmann, Daniel P Martin, Pasquale Anselmi, Frederick Aust, Eli Awtrey, Štěpán Bahník, et al. 2018. “Many Analysts, One Data Set: Making Transparent How Variations in Analytic Choices Affect Results.” Advances in Methods and Practices in Psychological Science.

Silge, Julia, and David Robinson. 2016. “tidytext: Text Mining and Analysis Using Tidy Data Principles in R.” The Journal of Open Source Software.

———. 2017. Text Mining with R: A Tidy Approach. O’Reilly Media, Inc.

Waltemath, Dagmar, and Olaf Wolkenhauer. 2016. “How Modeling Standards, Software, and Initiatives Support Reproducibility in Systems Biology and Systems Medicine.” IEEE Transactions on Biomedical Engineering 63 (10): 1999–2006.

Wickham, Hadley. 2015. R Packages: Organize, Test, Document, and Share Your Code. O’Reilly Media, Inc.

———. 2017. Tidyverse: Easily Install and Load the ’Tidyverse’. https://CRAN.R-project.org/package=tidyverse.

———. 2019. The Tidyverse Style Guide. https://style.tidyverse.org.

Wickham, Hadley, and Garrett Grolemund. 2016. R for Data Science. O’Reilly Media, Inc.

Wickham, Hadley, Dianne Cook, and Heike Hofmann. 2015. “Visualizing Statistical Models: Removing the Blindfold.” Statistical Analysis and Data Mining: The ASA Data Science Journal.