CONTRIBUTED RESEARCH ARTICLE

A method for deriving information from running R code

by Mark P.J. van der Loo
Abstract
It is often useful to tap information from a running R script. Obvious use cases include monitoring the consumption of resources (time, memory) and logging. Perhaps less obvious cases include tracking changes in R objects or collecting output of unit tests. In this paper we demonstrate an approach that abstracts collection and processing of such secondary information from the running R script. Our approach is based on a combination of three elements. The first element is to build a customized way to evaluate code. The second is labeled local masking and it involves temporarily masking a user-facing function so an alternative version of it is called. The third element we label local side effect. This refers to the fact that the masking function exports information to the secondary information flow without altering a global state. The result is a method for building systems in pure R that lets users create and control secondary flows of information with minimal impact on their workflow, and no global side effects.
Introduction
R provides a convenient language to read, manipulate, and write data in the form of scripts. As with any other scripted language, an R script gives a description of data manipulation activities, one after the other, when read from top to bottom. Alternatively we can think of an R script as a one-dimensional visualisation of data flowing from one processing step to the next, where intermediate variables or pipe operators carry data from one treatment to the next.

We run into limitations of this one-dimensional view when we want to produce data flows that are somehow ‘orthogonal’ to the flow of the data being treated. For example, we may wish to follow the state of a variable while a script is being executed, report on progress (logging), or keep track of resource consumption. Indeed, the sequential (one-dimensional) nature of a script forces one to introduce extra expressions between the data processing code.

As an example, consider a code fragment where the variable x is manipulated.

    x[x > threshold] <- threshold
    x[is.na(x)] <- median(x, na.rm=TRUE)

In the first statement every value above a certain threshold is replaced with a fixed value, and next, missing values are replaced with the median of the complete cases. It is interesting to know how an aggregate of interest, say the mean of x, evolves as it gets processed. The instinctive way to do this is to edit the code by adding statements to the script that collect the desired information.

    meanx <- mean(x, na.rm=TRUE)
    x[x > threshold] <- threshold
    meanx <- c(meanx, mean(x, na.rm=TRUE))
    x[is.na(x)] <- median(x, na.rm=TRUE)
    meanx <- c(meanx, mean(x, na.rm=TRUE))

This solution clutters the script by inserting expressions that are not necessary for its main purpose. Moreover, the tracking statements are repetitive, which calls for some form of abstraction.

A more general picture of what we would like to achieve is given in Figure 1. The ‘primary data flow’ is developed by a user as a script.
In the previous example this concerns processing x. When the script runs, some kind of logging information, which we label the ‘secondary data flow’, is derived implicitly by an abstraction layer.

Creating an abstraction layer means that concerns between primary and secondary data flows are separated as much as possible. In particular, we want to prevent the abstraction layer from inspecting or altering the user code that describes the primary data flow. Furthermore, we would like the user to have some control over the secondary flow from within the script, for example to start, stop or parameterize the secondary flow. This should be done with minimum editing of the original user code and it should not rely on global side effects. This means that neither the user, nor the abstraction layer for the secondary data flow should have to manipulate or read global variables, options, or other environmental settings to convey information from one flow to the other. Finally, we want to treat the availability of a secondary data flow as a normal situation. This means we wish to avoid using signaling conditions (e.g. warnings or errors) to convey information between the flows, unless there is an actual exceptional condition such as an error.

The R Journal Vol. XX/YY, AAAA 20ZZ ISSN 2073-4859
arXiv [stat.CO], Feb

[Figure 1 diagram: data, process, data’, process, data” along the primary data flow, with a secondary data flow derived from it.]

Figure 1: Primary and secondary data flows in an R script. The primary flow follows the execution of an R script, while in the background a secondary data flow (e.g. logging information) is created.
Prior art
There are several packages that generate a secondary data flow from a running script. One straightforward application concerns logging messages that report on the status of a running script. To create a logging message, users edit their code by inserting logging expressions where desired. Logging expressions are function calls that help build log messages, for example by automatically adding a time stamp. Configuration options usually include a measure of logging verbosity, and setting an output channel that controls where logging data will be sent. Changing these settings relies on communication from the main script to the functionality that controls the flow of logging data. In logger (Daróczi, 2019) this is done by manipulating a variable stored in the package namespace using special helper functions. The logging package (Frasca, 2019) also uses an environment within the namespace of the package to manage option settings, while futile.logger (Rowe, 2016) implements a custom global option settings manager that is somewhat comparable to R’s own options() function.

Packages bench (Hester, 2019) and microbenchmark (Mersmann, 2018) provide time profiling of single R expressions. The bench package also includes memory profiling. Their purpose is not to derive a secondary data flow from a running production script as in Figure 1 but to compare performance of R expressions. Both packages export a function that accepts a sequence of expressions to profile. These functions take control of expression execution and insert time and/or memory measurements where necessary. Options, such as the number of times each expression is executed, are passed directly to the respective function.

Unit testing frameworks provide another source of secondary data flows. Here, an R script is used to prepare, set up, and compare test data, while the results of comparisons are tapped and reported. Testing frameworks are provided by testthat (Wickham, 2011), RUnit (Burger et al., 2018), testit (Xie, 2018), unitizer (Gaslam, 2019), and tinytest (van der Loo, 2019). The first three packages (testthat, RUnit and testit) all export assertion functions that generate condition signals to convey information about test results. Packages RUnit and testit use sys.source() to run a file containing unit test assertions and exit on first error, while testthat uses eval() to run expressions, captures conditions and test results, and reports afterwards. The unitizer framework is different because it implements an interactive prompt to run tests and explore their results. Rather than providing explicit assertions, unitizer stores results of all expressions that return a visible result and compares their output at subsequent runs. Interestingly, unitizer allows for optional monitoring of the testing environment. This includes environment variables, options, and more. This is done by manipulating code of (base) R functions that manage these settings and masking the original functions temporarily. These masking functions then provide parts of the secondary data flow (changes in the environment). The tinytest package is based on the approach that is the topic of this paper and it will be discussed as an application below.

Finally, we note the covr package of Hester (2018). This package is used to keep track of which expressions of an R package are run (covered) by package tests or examples. In this case the primary data flow is a test script executing code (functions, methods) stored in another script, usually in the context of a package. The secondary flow consists of counts of how often each expression in the source files is executed. The package works by parsing and altering the code in the source file, inserting expressions that increase appropriate counters. These counters are stored in a variable that is part of the package’s namespace.

Summarizing, we find that in logging packages the secondary data flow is invoked explicitly by users while configuration settings are communicated by manipulating a global state that may or may not be directly accessible by the user. For benchmarking packages, the expressions are passed explicitly to an ‘expression runner’ that monitors the effect on memory and the passage of time. In most test packages the secondary flow is invoked explicitly using special assertions that throw condition signals. Test files are run using functionality that captures and administrates signals where necessary. Two of the discussed packages explicitly manipulate existing code before running it to create a secondary data flow. The covr package does this to update expression counters and the unitizer package to monitor changes in the global state.
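The ‘expression runner’ idea used by benchmarking packages can be illustrated with a few lines of base R. The sketch below is not how bench or microbenchmark are implemented; time_expressions() and its times argument are hypothetical names, and the timing relies only on base R's system.time().

```r
# Hypothetical helper (not part of bench or microbenchmark): evaluate each
# expression `times` times and report the median elapsed time in seconds.
time_expressions <- function(..., times = 5L) {
  exprs <- as.list(substitute(list(...)))[-1L]  # capture expressions unevaluated
  env <- parent.frame()                         # evaluate them in the caller's frame
  timings <- vapply(exprs, function(e) {
    elapsed <- vapply(seq_len(times), function(i) {
      system.time(eval(e, envir = env))[["elapsed"]]
    }, numeric(1))
    stats::median(elapsed)
  }, numeric(1))
  # Label each timing with the text of the expression it belongs to.
  names(timings) <- vapply(exprs, function(e) paste(deparse(e), collapse = " "),
                           character(1))
  timings
}

timings <- time_expressions(sum(1:10), sqrt(2), times = 3L)
```

Here the user hands the expressions and the options to the runner in a single call, so no global state is needed, but the expressions have to be lifted out of the script they belong to.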
Contribution of this paper
The purpose of this paper is to first provide some insight into the problem of managing multiple data flows, independent of specific applications. In the following section we discuss managing a secondary data stream from the point of view of changing the way in which expressions are combined and executed by R.

Next, we highlight two programming patterns that allow one to derive a secondary data stream, both in non-interactive (while executing a file) and in interactive circumstances. The methods discussed here do not require explicit inspection or modification of the code that describes the primary data flow. It is also not necessary to invoke signalling conditions to transport information from or to the secondary data stream.

We also demonstrate a combination of techniques that allow users to parameterize the secondary flow, without resorting to global variables, global options, or variables within a package’s namespace. We call this technique ‘local masking’ with ‘local side effects’. It is based on temporarily and locally masking a user-facing function with a function that does exactly the same except for a side effect that passes information to the secondary data flow.

As examples we discuss two applications where these techniques have been implemented. The first is the lumberjack package (van der Loo, 2018), which allows for tracking changes in R objects as they are manipulated expression by expression. The second is tinytest (van der Loo, 2019), a compact and extensible unit testing framework.

Finally, we discuss some advantages and limitations to the techniques proposed.
Concepts
In this section we give a high-level overview of the problem of adding a second data flow to an existing one, as well as a general way to think about a solution. The general approach was inspired by a discussion of Milewski (2018) and is related to what is sometimes called a bind operator in functional programming.

Consider as an example the following two expressions, labeled $e_1$ and $e_2$.

    e1: x <- 10
    e2: y <- 2*x

We would like to implement some kind of monitoring as these expressions are evaluated. For this purpose it is useful to think of $e_1$ and $e_2$ as functions that accept a set of key-value pairs, possibly alter the set’s contents, and return it. In R this set of key-value pairs is an environment, and usually it is the global environment (the user’s workspace). Starting with an empty environment $\{\}$ we get:

$$e_1(\{\}) = \{(\texttt{"x"}, 10)\}$$
$$e_2(e_1(\{\})) = \{(\texttt{"x"}, 10), (\texttt{"y"}, 20)\}$$

In this representation we can write the result of executing the above script in terms of the function composition operator $\circ$:

$$e_2(e_1(\{\})) = (e_2 \circ e_1)(\{\}).$$

And in general we can express the final state $U$ of any environment after executing a sequence of expressions $e_1, e_2, \ldots, e_k$ as

$$U = (e_k \circ e_{k-1} \circ \cdots \circ e_1)(\{\}), \tag{1}$$

where we assumed without loss of generality that we start with an empty environment. We will refer to the sequence $e_1, \ldots, e_k$ as the ‘primary expressions’, since they define a user’s main data flow.

We now wish to introduce some kind of logging. For example, we want to count the number of evaluated expressions, not counting the expressions that will perform the count. The naive way to do this is to introduce a new expression, say $n$:

    n: if (!exists("N")) N <- 1 else N <- N + 1

And we insert this into the original sequence of expressions. This amounts to the cumbersome solution

$$U \cup \{(\texttt{"N"}, k)\} = (n \circ e_k \circ n \circ e_{k-1} \circ n \circ \cdots \circ n \circ e_1)(\{\}), \tag{2}$$
where the number of executed expressions is stored in N. We shall refer to $n$ as a ‘secondary expression’ as it does not contribute to the user’s primary data flow.

The above procedure can be simplified if we define a new function composition operator $\circ_n$ as follows.

$$a \circ_n b = a \circ n \circ b.$$

One may verify the associativity property $a \circ_n (b \circ_n c) = (a \circ_n b) \circ_n c$ for expressions $a$, $b$ and $c$, so $\circ_n$ can indeed be interpreted as a new function composition operator. Using this operator we get

$$U \cup \{(\texttt{"N"}, k-1)\} = (e_k \circ_n e_{k-1} \circ_n \cdots \circ_n e_1)(\{\}), \tag{3}$$

which gives the same result as Equation 2 up to a constant.

If we are able to alter function composition, then this mechanism can be used to track all sorts of useful information during the execution of $e_1, \ldots, e_k$. For example, a simple profiler is set up by timing the expressions and adding the following expression to the function composition operator.

    s: if (!exists("S")) S <- Sys.time() else S <- c(S, Sys.time())

After running $e_k \circ_s \cdots \circ_s e_1$, diff(S) gives the timings of individual statements. A simple memory profiler is defined as follows.

    m: if (!exists("M")) M <- sum(memory.profile()) else M <- c(M, sum(memory.profile()))

After running $e_k \circ_m \cdots \circ_m e_1$, M gives the amount of memory used by R after each expression.

We can also track changes in data, but it requires that the composition operator knows the name of the R object that is being tracked. As an example, consider the following primary expressions.

    e1: x <- rnorm(10)
    e2: x[x<0] <- 0
    e3: print(x)

We can define the following expression for our modified function composition operator.

    v: {
      if (!exists("V")){
        V <- logical(0)
        x0 <- x
      }
      if (identical(x0,x)) V <- c(V, FALSE)
      else V <- c(V, TRUE)
      x0 <- x
    }
After running $e_3 \circ_v e_2 \circ_v e_1$, the expression v has been evaluated twice: once after $e_1$, where it initializes V and x0 and therefore records FALSE, and once after $e_2$, where it records whether $e_2$ changed x. With v as written, V thus equals c(FALSE,TRUE) whenever $e_1$ produced at least one negative value, indicating that $e_2$ changed x. Recording that $e_3$ leaves x unchanged requires one further evaluation of v after $e_3$.

These examples demonstrate that redefining function composition yields a powerful method for extracting logging information with (almost) no intrusion on the user’s common work flow. The simple model shown here does have some obvious drawbacks. First, the expressions inserted by the composition operator manipulate the same environment as the user expressions. The user expressions and secondary expressions can therefore interfere with each other’s results. Second, there is no direct control from the primary sequence over the secondary sequence: the user has no explicit control over starting, stopping, or parametrizing the secondary data stream. We demonstrate in the next section how these drawbacks can be avoided by evaluating secondary expressions in a separate environment, and by using techniques we call ‘local masking’ and ‘local side-effects’.

Creating a secondary data flow with R
R executes expressions one by one in a read-evaluate-print loop (REPL). In order to tap information from this running loop it is necessary to catch the user’s expressions and interweave them with our own expressions. One way to do this is to develop an alternative to R’s native source() function. Recall that source() reads an R script and executes all expressions in the global environment. Applications include non-interactive sessions or interactive sessions with repetitive tasks such as running test scripts while developing functions. A second way to intervene in a user’s code is to develop a special ‘forward pipe’ operator, akin to for example the magrittr pipe of Bache and Wickham (2014) or the ‘dot-pipe’ of Mount and Zumel (2018). Since a user inserts a pipe between expressions, it is an obvious place to insert code that generates a secondary data flow.
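As a bridge between the Concepts section and the implementations, the sketch below interleaves a secondary expression with a sequence of primary expressions, in the spirit of Equation 3. The helper name compose_with() is hypothetical. Everything is evaluated in one shared environment, so the counter N ends up next to the user's variables, which is exactly the kind of interference a separate environment avoids.

```r
# Hypothetical helper: evaluate the primary expressions one by one and
# evaluate the secondary expression between each consecutive pair (k - 1 times).
compose_with <- function(primary, secondary,
                         envir = new.env(parent = globalenv())) {
  k <- length(primary)
  for (i in seq_len(k)) {
    eval(primary[[i]], envir = envir)
    if (i < k) eval(secondary, envir = envir)
  }
  envir
}

primary <- expression(x <- 10, y <- 2 * x, z <- x + y)
# The counting expression n; inherits = FALSE keeps the lookup inside `envir`.
n <- quote(if (!exists("N", inherits = FALSE)) N <- 1 else N <- N + 1)

U <- compose_with(primary, n)
U$N      # number of times n was evaluated: k - 1 = 2
ls(U)    # N is stored alongside x, y and z
```

The design is deliberately naive: the user cannot start or stop the counter, and a primary expression that assigns to N would silently corrupt the count.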
In the following two subsections we will develop both approaches. As a running example we will implement a secondary data stream that counts expressions.
Build your own source()
The source() function reads an R script and executes all expressions in the global environment. A simple variant of source() that counts expressions as they get evaluated can be built using parse() and eval().

    run <- function(file){
      expressions <- parse(file)
      # evaluate in a fresh environment that can still see the workspace
      runtime <- new.env(parent=.GlobalEnv)
      n <- 0
      for (e in expressions){
        eval(e, envir=runtime)
        n <- n + 1
      }
      message(sprintf("Counted %d expressions",n))
      runtime
    }

Here parse() reads the R file and returns a list of expressions (technically, an object of class ‘expression’). The eval() function executes the expression while all variables created by, or needed for, execution are sought in a newly created environment called runtime. We make sure that variables and functions in the global environment are found by setting the parent of runtime equal to .GlobalEnv. Now, given a file "script.R", an interactive session would look like this.

    > e <- run("script.R")
    Counted 2 expressions
    > e$x
    [1] 10
So contrary to the default behavior of source(), variables are assigned in a new environment. This difference in behavior can be avoided by evaluating expressions in .GlobalEnv, but for the next step it is important to have a separate runtime environment.

We now wish to give the user some control over the secondary data stream. In particular, we want the user to be able to choose when run() starts counting expressions. Recall that we demand that this is done by direct communication to run(). This means that side-effects such as setting a special variable in the global environment or a global option are out of the question. Furthermore, we want to avoid code inspection: the run() function should be unaware of what expressions it is running exactly. We start by writing a function for the user that returns TRUE.

    start_counting <- function() TRUE

Our task is to capture this output from run() when start_counting() is called. We do this by masking this function with another function that does exactly the same, except that it also copies the output value to a place where run() can find it. To achieve this, we use the following helper function.

    capture <- function(fun, envir){
      function(...){
        out <- fun(...)
        envir$counting <- out   # copy the result to where run() can find it
        out
      }
    }

This function accepts a function (fun) and an environment (envir). It returns a function that first executes fun(...), copies its output value to envir and then returns the output to the user. In an interactive session, we would see the following.
    > store <- new.env()
    > f <- capture(start_counting, store)
    > f()
    [1] TRUE
    > store$counting
    [1] TRUE

Observe that our call to f() returns TRUE as expected, but also exported a copy of TRUE into store. The reason this works is that an R function ‘remembers’ where it is created. The function f() was created inside capture() and the variable envir is present there. We say that this ‘capturing’ version of start_counting has a local side-effect: it writes outside of its own scope but the place where it writes is controlled.

We now need to make sure that run() executes the captured version of start_counting(). This is done by locally masking the user-facing version of start_counting(). That is, we make sure that the captured version is found by eval() and not the original version. A new version of run() now looks as follows.

    run <- function(file){
      expressions <- parse(file)
      store <- new.env()
      runtime <- new.env(parent=.GlobalEnv)
      runtime$start_counting <- capture(start_counting, store)
      n <- 0
      for (e in expressions){
        eval(e, envir=runtime)
        if ( isTRUE(store$counting) ) n <- n + 1
      }
      message(sprintf("Counted %d expressions",n))
      runtime
    }
Now, consider the following code, stored in script1.R. In an interactive session we would see this.

    > e <- run("script1.R")
    Counted 1 expressions
    > e$x
    [1] 10
    > e$y
    [1] 20
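To try the file runner end to end without a prepared file, the sketch below writes a small script to a temporary file and runs it. The script contents and the variable names script and e are illustrative assumptions and are not the script1.R of the text; start_counting(), capture() and run() repeat the definitions given above so that the block is self-contained.

```r
start_counting <- function() TRUE

# Wrap `fun` so that its return value is also copied into `envir`.
capture <- function(fun, envir){
  function(...){
    out <- fun(...)
    envir$counting <- out
    out
  }
}

run <- function(file){
  expressions <- parse(file)
  store   <- new.env()                      # private to this call of run()
  runtime <- new.env(parent=.GlobalEnv)
  runtime$start_counting <- capture(start_counting, store)  # local masking
  n <- 0
  for (e in expressions){
    eval(e, envir=runtime)
    if ( isTRUE(store$counting) ) n <- n + 1
  }
  message(sprintf("Counted %d expressions", n))
  runtime
}

# Illustrative script (an assumption, not taken from the text).
script <- tempfile(fileext = ".R")
writeLines(c("a <- 1", "start_counting()", "b <- a + 1", "d <- b * 2"), script)

e <- run(script)
e$d   # variables live in the returned runtime environment
```

With run() as defined here, the flag is already set when the check following start_counting() is made, so for this particular script the call to start_counting() is included in the reported count of three. A runner that should exclude it can read the flag before evaluating each expression instead of after.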
Let us go through the most important parts of the new run() function. After parsing the R file a new environment is created that will store the output of calls to start_counting().

    store <- new.env()

The runtime environment is created as before, but now we add the capturing version of start_counting().

    runtime <- new.env(parent=.GlobalEnv)
    runtime$start_counting <- capture(start_counting, store)

This ensures that when the user calls start_counting(), the capturing version is executed. We call this technique local masking since the start_counting() function is only masked during the execution of run(). The captured version of start_counting() as a side effect stores its output in store. We call this a ‘local side-effect’ because store is never seen by the user: it is created inside run() and destroyed when run() is finished.

Finally, all expressions are executed in the runtime environment and counted conditional on the value of store$counting.

    for (e in expressions){
      eval(e, envir=runtime)
      if ( isTRUE(store$counting) ) n <- n + 1
    }
Summarizing, with this construction we are able to create a file runner akin to source() that can gather and communicate useful process metadata while executing a script. Moreover, the user of the script can convey information directly to the file runner, while it runs, without relying on global side-effects. This is achieved by first creating a user-facing function that returns the information to be sent to the file runner. The file runner locally masks the user-facing version with a version that copies the output to an environment local to the file runner before returning the output to the user.

The approach just described can be generalized to more realistic use cases. All examples mentioned in the ‘Concepts’ section (time or memory profiling, or logging changes in data) merely need some extra administration. Furthermore, the current example emits the secondary data flow as a ‘message’. In practical use cases it may make more sense to write the output to a file connection or database, or to make the secondary data stream the output of the file runner. Both options are discussed in the Applications sections.
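One possible generalization is sketched below: a file runner that records the elapsed time of each expression and returns these timings as part of its output instead of emitting a message. The names start_timing() and run_timed(), and the layout of the returned list, are assumptions made for this illustration.

```r
start_timing <- function() TRUE

capture <- function(fun, envir){
  function(...){
    out <- fun(...)
    envir$active <- out   # local side effect
    out
  }
}

run_timed <- function(file){
  expressions <- parse(file)
  store   <- new.env()
  runtime <- new.env(parent = globalenv())
  runtime$start_timing <- capture(start_timing, store)   # local masking
  expr <- character(0)
  secs <- numeric(0)
  for (e in expressions){
    active <- isTRUE(store$active)   # state before e is evaluated
    t0 <- Sys.time()
    eval(e, envir = runtime)
    t1 <- Sys.time()
    if (active){
      expr <- c(expr, paste(deparse(e), collapse = " "))
      secs <- c(secs, as.numeric(difftime(t1, t0, units = "secs")))
    }
  }
  # The secondary data flow is returned next to the runtime environment.
  list(envir = runtime,
       timings = data.frame(expression = expr, seconds = secs,
                            stringsAsFactors = FALSE))
}

# Illustrative script (an assumption).
script <- tempfile(fileext = ".R")
writeLines(c("x <- 1", "start_timing()", "y <- x + 1", "z <- y * 2"), script)
res <- run_timed(script)
res$timings$expression   # the two expressions evaluated after start_timing()
```

Because the flag is read before each expression is evaluated, the call to start_timing() itself is not part of the recorded timings.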
Build your own pipe operator
The magrittr forward ‘pipe’ operator of Bache and Wickham (2014) has become a popular tool for R users in recent years. This pipe operator is intended as a form of ‘syntactic sugar’ that in some cases makes code a little easier to write. A pipe operator behaves somewhat like a left-to-right ‘expression composition operator’, in the sense that a sequence of expressions that are joined by a pipe operator is interpreted by R’s parser as a single expression. Pipe operators also offer an opportunity to derive information from a running sequence of expressions.

The magrittr pipe operator has quite complex semantics, but it is possible to implement a basic pipe operator as follows.

    `%p>%` <- function(lhs, rhs) rhs(lhs)
Here, the rhs (right hand side) argument must be a single-argument function, which is applied to lhs. In an interactive session we could see this.

    > 3 %p>% sin %p>% cos
    [1] 0.9900591
To build our expression counter, we need to have a place to store the counter value, hidden from the user. In contrast to the implementation of the file runner in the previous section, each use of %p>% is disconnected from the other, and there seems to be no shared space to increase the counter at each call. The solution is to let the secondary data flow travel with the primary flow, by adding an attribute to the data. We create two user-facing functions that start or stop logging, as follows.

    start_counting <- function(data){
      attr(data, "n") <- 0
      data
    }

    end_counting <- function(data){
      message(sprintf("Counted %d expressions", attr(data,"n")-1))
      attr(data, "n") <- NULL
      data
    }
Here the first function attaches a counter to the data and initializes it to zero. The second function reports its value, decreased by one so the stop function itself is not included in the count. We also alter the pipe operator to increase the counter, if it exists.

    `%p>%` <- function(lhs, rhs){
      if ( !is.null(attr(lhs,"n")) ){
        attr(lhs,"n") <- attr(lhs,"n") + 1
      }
      rhs(lhs)
    }
In an interactive session, we could now see the following.

    > out <- 3 %p>%
    +   start_counting %p>%
    +   sin %p>%
    +   cos %p>%
    +   end_counting
    Counted 2 expressions
    > out
    [1] 0.9900591

Summarizing, for small interactive tasks a secondary data flow can be added to the primary one by using a special kind of pipe operator. Communication between the user and the secondary data flow is implemented by adding or altering attributes attached to the R object.

Generalizations of this technique come with a few caveats. First, the current pipe operator only allows right-hand side expressions that accept a single argument. Extension to a more general case involves inspection and manipulation of the right-hand side’s abstract syntax tree and is out of scope for the current work. Second, the current implementation relies on the right-hand side expressions to preserve attributes. A general implementation will have to test that the output of rhs(lhs) still has the logging attribute attached (if there was any) and re-attach it if necessary.
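The second caveat can be addressed along the following lines. This is a sketch under assumptions, not the implementation used by any package: the attribute name "control" that marks end_counting() is hypothetical, and it is needed so that the operator does not re-attach a counter that the user has just removed.

```r
start_counting <- function(data){
  attr(data, "n") <- 0
  data
}

end_counting <- function(data){
  message(sprintf("Counted %d expressions", attr(data, "n") - 1))
  attr(data, "n") <- NULL
  data
}
# Hypothetical marker: tells the pipe that this function manages the counter.
attr(end_counting, "control") <- TRUE

`%p>%` <- function(lhs, rhs){
  n <- attr(lhs, "n")
  if ( !is.null(n) ){
    n <- n + 1
    attr(lhs, "n") <- n
  }
  out <- rhs(lhs)
  # Re-attach the counter if rhs dropped it, unless rhs is a control function.
  if ( !is.null(n) && is.null(attr(out, "n")) && !isTRUE(attr(rhs, "control")) ){
    attr(out, "n") <- n
  }
  out
}

# as.vector() and sum() both drop attributes, yet the counter survives.
partial <- c(1, 2, 3) %p>% start_counting %p>% as.vector %p>% sum
attr(partial, "n")   # 2

out <- partial %p>% end_counting   # reports the count and removes the counter
```

Marking the control function is only one possible design. Comparing rhs against a known list of control functions would work as well, at the cost of making the operator less extensible.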
Application 1: tracking changes in data
The lumberjack package (van der Loo, 2018) implements a logging framework to track changes in R objects as they get processed. The package implements both a pipe operator, denoted %L>%, and a file runner called run_file(). The main communication devices for the user are two functions called start_log() and dump_log().

We will first demonstrate working with the lumberjack pipe operator. The function start_log() accepts an R object and a logger object. It attaches the logger to the R object and returns the augmented R object. A logger is a reference object that exposes at least an $add() method and a $dump() method. If a logger is present, the pipe operator stores a copy of the left-hand side. Next, it executes the expression on the right-hand side with the left-hand side as an argument and stores the output. It then calls the $add() method of the logger with the input and output, so that the logger can compute and store the difference. The dump_log() function accepts an R object, calls the $dump() method on the attached logger (if there is any), removes the logger from the object and returns the object. An interactive session could look as follows.

    > library(lumberjack)
    > out <- women %L>%
    +   start_log(simple$new()) %L>%
    +   transform(height = height * 2.54) %L>%
    +   identity() %L>%
    +   dump_log()
    Dumped a log at /home/mark/simple.csv
    > read.csv("simple.csv")
      step                time                        expression changed
    1    1 2019-08-09 11:29:06 transform(height = height * 2.54)    TRUE
    2    2 2019-08-09 11:29:06                        identity()   FALSE

Here, simple$new() creates a logger object that registers whether an R object has changed or not. There are other loggers that compute more involved differences between input and output. The $dump() method of the logger writes the logging output to a csv file.

For larger scripts, a file runner called run_file() is available in lumberjack. As an example consider the following script.
It converts columns of the built-in women data set to SI units (meters and kilogram) and then computes the body-mass index of each case.

In an interactive session we can run the script, access the logging information and retrieve the output of the script.

    > e <- run_file("script2.R")
    Dumped a log at /home/mark/women_simple.csv
    > read.csv("women_simple.csv")
      step                time                     expression changed
    1    1 2019-08-09 13:11:25 start_log(women, simple$new())   FALSE

Note on logger objects: a logger can be a native R Reference Class, an ‘R6’ object (Chang, 2019) or any other reference type object implementing the proper API.
The lumberjack file runner locally masks start_log() with a function that stores the logger and the name of the tracked R object in a local environment. A copy of the tracked object is stored locally as well. Expressions in the script are executed one by one. After each expression, the object in the runtime environment is compared with the stored object. If it has changed, the $add() method of the logger is called and a copy of the changed object is stored. After all expressions have been executed, the $dump() method is called so the user does not have to do this explicitly.

A user can add multiple loggers for each R object and track multiple objects. It is also possible to dump specific logs for specific objects during the script. All communication necessary for these operations runs via the mechanism explained in the ‘build your own source()’ section.
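The mechanism described for the file runner can be sketched in base R as follows. This illustrates the idea only and is not lumberjack's code: start_tracking(), run_tracked() and the structure of the returned log are hypothetical, the logger object is replaced by a plain data frame, and every step is logged rather than only the steps that changed the object.

```r
# User-facing function: declares the name of the object to track.
start_tracking <- function(name) invisible(name)

capture <- function(fun, envir){
  function(...){
    out <- fun(...)
    envir$name <- out     # local side effect: remember the tracked name
    out
  }
}

run_tracked <- function(file){
  expressions <- parse(file)
  store   <- new.env()
  runtime <- new.env(parent = globalenv())
  runtime$start_tracking <- capture(start_tracking, store)  # local masking
  step <- integer(0)
  expr <- character(0)
  changed <- logical(0)
  for (i in seq_along(expressions)){
    e <- expressions[[i]]
    was_tracking <- !is.null(store$name)
    eval(e, envir = runtime)
    if (!is.null(store$name)){
      current <- get(store$name, envir = runtime)
      if (was_tracking){
        step    <- c(step, i)
        expr    <- c(expr, paste(deparse(e), collapse = " "))
        changed <- c(changed, !identical(current, store$copy))
      }
      store$copy <- current   # keep a copy for the next comparison
    }
  }
  list(envir = runtime,
       log = data.frame(step = step, expression = expr, changed = changed,
                        stringsAsFactors = FALSE))
}

# Illustrative script (an assumption, not the script from the text).
script <- tempfile(fileext = ".R")
writeLines(c(
  "dat <- data.frame(height = c(58, 59), weight = c(115, 117))",
  "start_tracking(\"dat\")",
  "dat$height <- dat$height * 2.54",
  "dat <- identity(dat)"
), script)
res <- run_tracked(script)
res$log
```

The runner never inspects the user's expressions: it only learns which object to follow through the masked start_tracking(), and compares copies of that object between steps.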
Application 2: unit testing
The tinytest package (van der Loo, 2019) implements a unit testing framework. Its core function is a file runner that uses local masking and local side effects to capture the output of assertions that are inserted explicitly by the user. As an example, we create tests for the following function.
A simple tinytest test file could look like this.
The first four lines prepare some data, while the last two lines check whether the prepared data meets our expectations. In an interactive session we can run the test file, after loading the bmi() function.

    > source("bmi.R")
    > library(tinytest)
    > out <- run_test_file('test_script.R')
    Running test_script.R................. 2 tests OK
    > print(out, passes=TRUE)
    ----- PASSED : test_script.R<7--7>
    call| expect_true(all(BMI >= 10))
    ----- PASSED : test_script.R<8--8>
    call| expect_true(all(BMI <= 30))
In this application, the file runner locally masks the expect_*() functions and captures their result through a local side effect. As we are only interested in the test results, the output of all other expressions is discarded.

Compared to the basic version described in the ‘build your own source()’ section, this file runner keeps some extra administration such as the line numbers of each masked expression. These can be extracted from the output of parse(). The package comes with a number of assertions in the form of expect_*() functions. It is possible to extend tinytest by registering new assertions. These are then automatically masked by the file runner. The only requirement on the new assertions is that they return an object of the same type as the built-in assertions (an object of class ‘tinytest’).
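Stripped of its administration, the capturing of assertion results can be sketched as follows. The names check_true() and run_checks() are hypothetical stand-ins chosen to avoid confusion with tinytest's own functions, and a plain logical vector replaces the ‘tinytest’ result objects.

```r
# Hypothetical assertion: returns TRUE or FALSE and never signals a condition.
check_true <- function(x) isTRUE(x)

# Unlike the counting example, results are accumulated rather than overwritten.
capture <- function(fun, envir){
  function(...){
    out <- fun(...)
    envir$results <- c(envir$results, out)
    out
  }
}

run_checks <- function(file){
  expressions <- parse(file)
  store <- new.env()
  store$results <- logical(0)
  runtime <- new.env(parent = globalenv())
  runtime$check_true <- capture(check_true, store)   # local masking
  for (e in expressions) eval(e, envir = runtime)    # other output is discarded
  store$results
}

# Illustrative test file (an assumption).
test_file <- tempfile(fileext = ".R")
writeLines(c(
  "bmi <- c(20, 25, 35)",
  "check_true(all(bmi >= 10))",
  "check_true(all(bmi <= 30))"
), test_file)
results <- run_checks(test_file)
results   # TRUE FALSE: the second check fails because 35 > 30
```

A failing check is ordinary data in the secondary flow here, so the runner continues with the remaining expressions and the condition system stays available for genuine errors.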
Discussion
The techniques demonstrated here have two major advantages. First, they allow for a clean and side-effect free separation between the primary and secondary data flows. As a result, the secondary data flow composes well with the primary data flow. In other words: a user who wants to add a secondary data flow to an existing script does not have to edit any existing code. Instead it is only necessary to add a bit of code to specify and initialize the secondary stream, which is a big advantage for maintainability. Second, the current mechanisms avoid the use of condition signals. This also leads to code that is easier to understand and navigate because all code associated with the secondary flow can be limited to the scope of a single function (here: either a file runner or a pipe operator). Since the secondary data flow is not treated as an unusual condition (exception), the exception signaling channel is free for transmitting truly unusual conditions such as errors and warnings.

There are also some limitations inherent to these techniques. Although the code for the secondary data flow is easy to compose with code for the primary data flow, it is not as easy to compose different secondary data flows. For example: one can use only one file runner to run an R script, and only a single pipe operator to combine two expressions.

A second limitation is that this approach does not recurse into the primary expressions. For example, the expression counters we developed only count user-defined expressions: they cannot count expressions that are called by functions called by the user. This means that something like a code coverage tool such as covr is out of scope.

A third and related limitation is that the resolution of expressions may be too low for certain applications. For example in R, ‘if’ is an expression (it returns a value when evaluated) rather than a statement (like for). This means that parse() interprets a block such as

    if ( x > 0 ){
      x <- 10
      y <- 2*x
    }

as a single expression.
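The resolution limit is easy to check with parse(). The snippet below is a small illustration; the extra assignment to z and the variable names block and expressions are arbitrary additions.

```r
block <- "
if ( x > 0 ){
  x <- 10
  y <- 2*x
}
z <- 1
"
expressions <- parse(text = block)
length(expressions)        # 2: the whole if-block counts as one expression
class(expressions[[1]])    # "if"
```

A file runner built on parse() and eval() therefore sees the two assignments inside the braces as part of one step.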
If higher resolution is needed, this requires explicit manipulation of the user code.

Finally, the local masking mechanism excludes the use of the namespace resolution operator. For example, in lumberjack it is not possible to use lumberjack::start_log(), since in that case the user-facing function from the package is executed and not the masked function with the desired local side effect.

Conclusion
In this paper we demonstrated a set of techniques that allow one to add a secondary data flow to an existing user-defined R script. The core idea is that we manipulate the way expressions are combined before they are executed. In practice, we use R’s parse() and eval() to add a secondary data stream to user code, or build a special ‘pipe’ operator. Local masking and local side effects allow a user to control the secondary data flow without global side effects. The result is a clean separation of concerns between the primary and secondary data flow that does not rely on condition handling, is void of global side effects, and is implemented in pure R.
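The three elements can be combined in a few lines of base R. The sketch below is a simplified illustration, not the implementation used in lumberjack or tinytest. The names `track`, `run_code`, `store`, and `runtime` are hypothetical and chosen for this example only.

```r
# Hypothetical user-facing function. Called outside the runner it
# simply returns its argument and records nothing.
track <- function(value) invisible(value)

# Hypothetical runner implementing the customized evaluation.
run_code <- function(text) {
  store <- new.env()                 # holds the secondary data flow
  store$log <- list()

  # Evaluation environment for the user code.
  runtime <- new.env(parent = globalenv())

  # Local masking: inside `runtime`, `track` resolves to this version.
  runtime$track <- function(value) {
    # Local side effect: write to `store`, not to any global state.
    store$log[[length(store$log) + 1L]] <- value
    invisible(value)
  }

  # Evaluate the user expressions one by one.
  exprs <- parse(text = text, keep.source = FALSE)
  for (i in seq_along(exprs)) {
    eval(exprs[[i]], envir = runtime)
  }
  store$log
}

# A user script; only the calls to track() feed the secondary flow.
script <- "
x <- c(1, 4, 13)
track(mean(x))
x[x > 10] <- 10
track(mean(x))
"
out <- run_code(script)
unlist(out)   # 6 5
```

Because the masking version of `track` lives only in the evaluation environment created by `run_code`, the recorded values are returned by the runner and nothing is written to the global environment by the masking function.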
Mark P.J. van der Loo
Statistics Netherlands
PO-BOX 24500, 2490HA Den Haag
The Netherlands
Bibliography
S. M. Bache and H. Wickham. magrittr: A Forward-Pipe Operator for R, 2014. URL https://CRAN.R-project.org/package=magrittr. R package version 1.5.

M. Burger, K. Juenemann, and T. Koenig. RUnit: R Unit Test Framework, 2018. URL https://CRAN.R-project.org/package=RUnit. R package version 0.4.32.
W. Chang. R6: Encapsulated Classes with Reference Semantics, 2019. URL https://CRAN.R-project.org/package=R6. R package version 2.4.0.

G. Daróczi. logger: A Lightweight, Modern and Flexible Logging Utility, 2019. URL https://CRAN.R-project.org/package=logger. R package version 0.1.

M. Frasca. logging: R Logging Package, 2019. URL https://CRAN.R-project.org/package=logging. R package version 0.10-108.

B. Gaslam. unitizer: Interactive R Unit Tests, 2019. URL https://CRAN.R-project.org/package=unitizer. R package version 1.4.8.

J. Hester. covr: Test Coverage for Packages, 2018. URL https://CRAN.R-project.org/package=covr. R package version 3.2.1.

J. Hester. bench: High Precision Timing of R Expressions, 2019. URL https://CRAN.R-project.org/package=bench. R package version 1.0.2.

O. Mersmann. microbenchmark: Accurate Timing Functions, 2018. URL https://CRAN.R-project.org/package=microbenchmark. R package version 1.4-6.

B. Milewski. Category Theory for Programmers. Blurb, Incorporated, 2018. ISBN 9780464825081. See also the online lectures: https://youtu.be/I8LbkfSSR58.

J. Mount and N. Zumel. Dot-Pipe: an S3 Extensible Pipe for R. The R Journal, 10(2):309–316, 2018. doi:10.32614/RJ-2018-042. URL https://doi.org/10.32614/RJ-2018-042.

B. L. Y. Rowe. futile.logger: A Logging Utility for R, 2016. URL https://CRAN.R-project.org/package=futile.logger. R package version 1.4.3.

M. van der Loo. lumberjack: Track Changes in Data, 2018. URL https://CRAN.R-project.org/package=lumberjack. R package version 1.0.2.

M. van der Loo. tinytest: Lightweight but Feature Complete Unit Testing Framework, 2019. URL https://github.com/markvanderloo/tinytest. R package version 0.9.6.

H. Wickham. testthat: Get started with testing. The R Journal, 3:5–10, 2011. URL https://journal.r-project.org/archive/2011-1/RJournal_2011-1_Wickham.pdf.

Y. Xie. testit: A Simple Package for Testing R Packages, 2018. URL https://CRAN.R-project.org/package=testit. R package version 0.9.