An R Autograder for PrairieLearn
Dirk Eddelbuettel and Alton Barbehenn
Department of Statistics, University of Illinois, Urbana-Champaign, IL, USA
This version was compiled on March 17, 2020
We describe how we both use and extend the PrairieLearn framework by taking advantage of its built-in support for external auto-graders. By using a custom Docker container, we can match our course requirements perfectly. Moreover, by relying on the flexibility of the interface we can customize our Docker container. A specific extension for unit testing is described which creates context-dependent differences between student answers and the reference solution, providing a more comprehensive response at test time.
Context
We describe the motivation, design and use of an autograder for the R language within the PrairieLearn system (Zilles et al., 2018). PrairieLearn is in use at the University of Illinois at Urbana-Champaign, where it is also being developed, and at other campuses to support fully automated computer-based testing of homeworks, quizzes and exams for undergraduate and graduate students. We use it to support the topics course STAT 430 “Data Science Programming Methods” we have been teaching since 2019 in the Department of Statistics at the University of Illinois at Urbana-Champaign.

As documented, PrairieLearn supports external graders, and we are providing one such grader for the R language and system. Our implementation follows KISS principles, and is sufficiently safe and robust for deployment. Our approach uses two key insights. First, testing student submissions is close to unit testing code—and we benefit from relying on a very clever, small and nimble test framework package, tinytest (van der Loo, 2019). Second, the PrairieLearn decision to allow external graders under a ‘bring your own container’ scheme allows us to regroup all our requirements in a simple Docker container—extending a base container from the Rocker Project (Boettiger and Eddelbuettel, 2017)—which we provision and control.
PrairieLearn
PrairieLearn (Zilles et al., 2018) is an online problem-driven learning system for creating homeworks and tests that enables automated code evaluation as well as more traditional question types (like multiple choice questions) for both homework assignments and exams. It is built to be flexible, and enables grading to happen however the instructor wishes using Docker. PrairieLearn comes with many easy ways of adding randomization to questions, and a custom set of HTML tags that makes writing questions easy.
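As an illustration of those custom tags, a question.html for a small code question might look roughly like the following sketch. The pl-* element names reflect PrairieLearn's tag set, but the exact attributes are assumptions that vary across versions; this is an illustrative fragment, not a verbatim file from our course.

```html
<!-- Illustrative sketch of a PrairieLearn question.html for a code
     question; attribute details are assumptions, not verbatim. -->
<pl-question-panel>
  <p>Write a function <code>fib(n)</code> returning the first
     <code>n</code> Fibonacci numbers.</p>
</pl-question-panel>

<!-- Ace editor pre-filled from a starter file shipped with the question -->
<pl-file-editor file-name="fib.R" ace-mode="ace/mode/r"
                source-file-name="initial_code.R">
</pl-file-editor>

<pl-submission-panel>
  <!-- renders the JSON results produced by the external grader -->
  <pl-external-grader-results></pl-external-grader-results>
</pl-submission-panel>
```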
Direct PrairieLearn Integration
The integration between the different components is designed to be simple and flexible. Data is exchanged via configuration text files in the JSON format (which is discussed below). At its core, this involves only two files (which we describe next) that are made available in the top-level directory of the contributed grader as shown in the following listing:

fs::dir_tree("r_autograder")
r_autograder
├── pltest.R
└── run.sh

run.sh. The first file, run.sh, shown in Appendix 1, is more-or-less unchanged from the run.sh file in the PrairieLearn example course which invokes the file pltest.R discussed next. It sets up a number of environment variables reflecting the PrairieLearn setup. It also copies files into place, adjusts modes (more on that below when we discuss security), calls the evaluation script discussed next, and assembles the result.

pltest.R. The second file is the actual test runner for PrairieLearn under R, and is shown in Appendix 2. For a given question, essentially three things happen:
Extract Metadata from Tests.
The displayed title of each available test, and the available points per test, are extracted from the question's test files themselves. This is performed by the helper function plr::get_question_details() which is given the default test directory used in our layout: "/grade/tests/tests". Our implementation is inspired by the doxygen and roxygen2 tag use for documentation and is most appropriate: the metadata for each test is stored with the test. This allows for quick iteration during development as test files can simply be renamed to be deactivated without worrying about conflicting metadata.
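The tag-scanning idea behind plr::get_question_details() can be sketched in base R along the following lines. The @title and @score comment tags and the helper name are illustrative assumptions rather than the exact plr conventions:

```r
# Sketch of doxygen/roxygen2-style metadata extraction from test files.
# The '@title' and '@score' tag names are illustrative assumptions; see
# plr::get_question_details() for the actual implementation.
get_question_details_sketch <- function(tests_dir) {
    files <- list.files(tests_dir, pattern = "\\.R$", full.names = TRUE)
    do.call(rbind, lapply(files, function(f) {
        lines <- readLines(f)
        title <- sub(".*@title\\s+", "", grep("@title", lines, value = TRUE)[1])
        score <- as.numeric(sub(".*@score\\s+", "", grep("@score", lines, value = TRUE)[1]))
        data.frame(file = basename(f), name = title, max_points = score)
    }))
}

# Example: a test file carrying its own metadata in comments
td <- file.path(tempdir(), "tests")
dir.create(td, showWarnings = FALSE)
writeLines(c("# @title Check fib(1)",
             "# @score 10",
             "expect_equal(fib(1), 1)"),
           file.path(td, "test_fib_1.R"))
get_question_details_sketch(td)
```

Renaming a test file so it no longer matches the pattern deactivates it together with its metadata, which is the quick-iteration property mentioned above.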
Run the Tests.
The actual test predicates are run using the tinytest package and its function run_test_dir() traversing a directory (more on that below). The result is then converted into a data.frame object. We discuss the tinytest framework in more detail below.
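Because tinytest results are data, the conversion is direct; a minimal sketch, assuming the tinytest package is installed:

```r
# Run a directory of tinytest files and treat the results as data.
library(tinytest)

td <- file.path(tempdir(), "tt_demo")
dir.create(td, showWarnings = FALSE)
writeLines("expect_equal(1 + 1, 2)", file.path(td, "test_sum.R"))

res <- as.data.frame(run_test_dir(td, verbose = FALSE))
# 'file', 'call' and 'result' are among the columns used later for merging
res[, c("file", "call", "result")]
```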
Merge and Post-Process.
The two data.frame objects (metadata and test results) are merged using the names of each test file as the key. Then points are calculated and the resulting object is written as a JSON file for PrairieLearn to consume.
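In simplified form, and with made-up metadata and results, the merge-and-score step looks as follows (jsonlite assumed to be installed; the actual logic used by our grader is pltest.R in Appendix 2):

```r
# Simplified sketch of the merge-and-post-process step: metadata and
# test results are joined on the test file name, points are computed,
# and the result is serialized to JSON for PrairieLearn to consume.
question_details <- data.frame(file = c("test_1.R", "test_2.R"),
                               name = c("fib(1)", "fib(n)"),
                               max_points = c(40, 60))
test_results <- data.frame(file = c("test_1.R", "test_2.R"),
                           result = c(TRUE, FALSE))

res <- merge(test_results, question_details, by = "file", all = TRUE)
res$points <- ifelse(!is.na(res$result) & res$result, res$max_points, 0)
score <- sum(res$points) / sum(res$max_points)

out <- list(tests = res[, c("name", "max_points", "points")],
            score = score, succeeded = TRUE)
jsonlite::write_json(out, path = tempfile(fileext = ".json"), auto_unbox = TRUE)
score   # 0.4 here: only the first, 40-point, test passed
```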
Test Framework

tinytest is an appropriately light-weight test framework without further dependencies. As stated in the opening of its vignette:

    The purpose of unit testing is to check whether a function gives the output you expect, when it is provided with certain input.

This is precisely what checking student answers amounts to. Given the context of a question, students provide code, typically as a function, which we can test given inputs—and compare to a reference answer and its output. Our framework does just that. Two of the key insights of tinytest are:

a) test results are data which can be stored and manipulated, and
b) each test file is a script interspersed with commands and suitable to be programmed over.

We use another key feature of tinytest: its extensibility. Our small helper package ttdo (Eddelbuettel and Barbehenn, 2019b) extends the tinytest framework by using diffobj (Gaslam, 2019) to compute succinct diff(1)-style summaries of object comparisons. This is most useful to show students the differences between their result and the reference result. We show this below in the context of a question.

(A note on points: earlier or alternate approaches use an explicit file points.json; we find it more suitable to define this on the fly given the test files.)

Example R Question
Within the testing framework, questions are a key component. In general, each question is organized in its own directory. Questions may then be grouped by directory name for assignments, exams or quizzes comprising a set of such questions. For each question used in our autograder, the directory layout is as shown in the next figure.

fs::dir_tree("rfunction-fib")
rfunction-fib
├── info.json
├── initial_code.R
├── question.html
└── tests
    ├── ans.R
    └── tests
        └── ...

There are two mandatory top-level files. First, info.json which contains all the relevant data for this question, including of course which grader to use. As discussed above, this file controls which of several graders is used.

{
    "uuid": "32A98E04-0A4C-497A-91D2-18BC4FE98047",
    "title": "Fibonacci Sequence 2.0",
    "topic": "Functions",
    "tags": ["code", "v3", "barbehe2", "deddel",
             "balamut2", "stat430dspm", "Fa19",
             "rautograder"],
    "type": "v3",
    "singleVariant": true,
    "gradingMethod": "External",
    "externalGradingOptions": {
        "enabled": true,
        "image": "stat430/pl",
        "serverFilesCourse": ["r_autograder/"],
        "entrypoint": "/grade/server/r_grader/run.sh",
        "timeout": 5
    }
}
We note that this points specifically to

• a top-level directory (such as the one shown above),
• an entry-point script (as discussed above), and
• a container to run the evaluations in.

Second, question.html which defines the display shown to the student. PrairieLearn now allows for markdown to describe the central part, and can reference external files such as the file initial_code.R listed here too. initial_code.R provides the stanza of code shown in the Ace editor component (and the file name is specified in question.html).

Then, the tests/ directory contains the test infrastructure. By our convention, tests/ans.R is the reference answer. This file is set to mode 0600 to ensure the student code can never read it.

fib <- function(n) {
    out <- rep(1, n)
    if (n >= 3)
        for (i in 3:n)
            out[i] <- out[i-1] + out[i-2]
    return(out)
}

The subdirectory tests/tests/ then contains one or more unit tests or, in our case, question validations. The first question sources the file, evaluates F(1) and compares to the expected answer, 1. (Other test questions then check for other values as shown below; several test predicates could also be present in a single test file but we are keeping it simple here.)

file <- "/grade/student/fib.R"
v <- plr::source_and_eval_safe(file, fib(1), "ag")
expect_equal(v, 1)

Of note is our use of a function from the helper package plr (Eddelbuettel and Barbehenn, 2019a). As the same code fragment would be repeated across numerous question files, it makes sense to regroup this code in a (simple) function. At its core are the system() call, made as the autograde user ag, and the subsequent evaluation of the supplied expression. We take full advantage of the lazy evaluation that makes R so powerful: fib(1) is not evaluated by the caller but rather later, in the context of the callee—after sourcing the corresponding file. We also make the file to be sourced visible to the ag user.
All other files remain inaccessible thanks to their mode of 0600. Another key aspect is the use of eval_safe() from the unix package (Ooms, 2019b). As we are effectively running as root inside a container, we have the ability to lower the permissions to those of another user, here ag.

source_and_eval_safe <- function(file, expr, uid=NULL) {
    if (!is.null(uid) && class(uid) == "character")
        uid <- unix::user_info(uid)$uid
    if (!file.exists(file))
        return(invisible(NULL))
    oldmode <- file.mode(file)
    Sys.chmod(file, mode="0664")
    source(file)
    res <- unix::eval_safe(expr, uid=uid)
    Sys.chmod(file, mode=oldmode)
    res
}
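The lazy-evaluation mechanism this relies on can be demonstrated in plain base R, leaving out the unix::eval_safe() privilege drop; eval_after_source() below is a hypothetical stand-in for illustration only:

```r
# The caller writes fib(10), but the promise is only forced *after* the
# callee has source()-d the file defining fib into the global environment.
eval_after_source <- function(file, expr) {
    source(file)   # defines fib() globally (source() defaults to local=FALSE)
    expr           # forcing the promise now succeeds
}

f <- tempfile(fileext = ".R")
writeLines(c("fib <- function(n) {",
             "    out <- rep(1, n)",
             "    if (n >= 3)",
             "        for (i in 3:n) out[i] <- out[i-1] + out[i-2]",
             "    out",
             "}"), f)

eval_after_source(f, fib(10))
# returns 1 1 2 3 5 8 13 21 34 55
```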
We omit the second question which is largely identical to the first, but tests F(2) for the expected answer of c(1,1). The third file uses randomization to prevent students from hardcoding an answer to F(n) for a given knowable value n.

library(tinytest)
using(ttdo)

n <- sample(3:20, 1)
file <- "/grade/student/fib.R"
student <- plr::source_and_eval_safe(file, fib(n), "ag")
source("/grade/tests/ans.R")
correct <- fib(n)
expect_equivalent_with_diff(student, correct)
It also shows another key feature: our use of the diffobj package. We use a very small add-on package ttdo (an acronym for ‘tinytest-diffobj’) we created utilizing the extension mechanism of tinytest in order to provide more specific feedback in the test results. The ttdo package is separate from our plr package because of its potential use in contexts other than PrairieLearn.

Figure 1 shows a screenshot resulting from providing an answer that returns a content of 1 no matter the input. This passes F(1), fails F(2) and fails F(n) for n >= 3. The screenshot displays the effect of the colorized difference between the received answer and the expected answer for the latter two questions.
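The colorized differences shown in Figure 1 are computed by diffobj; stripped of the tinytest integration, the underlying comparison can be reproduced directly (assuming the diffobj package is installed):

```r
# A diff(1)-style comparison between a (wrong) student answer and the
# reference result, of the kind ttdo attaches to failing tests.
library(diffobj)

student <- rep(1, 5)           # a submission hardcoding every value to 1
correct <- c(1, 1, 2, 3, 5)    # the reference fib(5)

# format = "raw" requests plain-text rather than ANSI-colorized output
as.character(diffPrint(student, correct, format = "raw"))
```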
Container
PrairieLearn allows for external containers. We use this feature to deploy a custom container based on the r-ubuntu container from the Rocker Project (Boettiger and Eddelbuettel, 2017). This container is set up with access to the “Personal Package Archive” (PPA) by Michael Rutter which provides a considerable subset of the R repositories (known as “CRAN”) as pre-compiled binaries. (See the brief description at the top of https://CRAN.R-Project.org/bin/linux/ubuntu for more details.) Our Dockerfile is shown in Appendix 3.

PrairieLearn always checks for updated containers, so deployment of a new container is more or less guaranteed. This also facilitates a workflow of incremental changes as the ‘continuous deployment’ is automated and relies on trusted workflows supporting many other open source projects. Similarly, by relying on widely-used and tested components such as the Rocker Project containers as a base, along with provided Ubuntu binaries, the risk of inadvertent breakage is minimized as well (when compared to bespoke custom solutions not relying on more widely-used components).
Fig. 1. Example Output of Autograder for Fibonacci Question
Security Aspects
R is a very flexible language that is somewhat difficult to sandbox as it allows computation on the language. Some approaches do exist: the RAppArmor package (Ooms, 2019a), which wraps around one of the two prevalent approaches for Linux, is a candidate given that we match the installation requirements by being on Debian / Ubuntu systems. Here, however, we opted for a more basic approach.
All files copied in by run.sh are set to be owned by the root user with no read, write or execution rights set for groups or others. The one exception is the uploaded file containing the to-be-evaluated student code. This file is then source()-ed in a lower-privileged process owned by the autograde user ag, and the supplied function is evaluated with a given argument. We use the unix package (Ooms, 2019b) for this, taking advantage of the fact that inside a container we are running as the superuser, permitting us to lower permissions. In other words, the one execution that could expose secrets (to the untrusted code submitted by the student) is the one running with the lowest possible permissions of the ag user, with all other files being “locked away” and readable only by the root user.

Concretely, our function plr::source_and_eval_safe() shown above relies on the function unix::eval_safe() which takes care of the (system-specific) details of process permission control. In addition, we also minimize file permission changes. A sibling function plr::eval_safe_as() works similarly on an R expression rather than a file.

Summary
The PrairieLearn system (Zilles et al., 2018) permits large-scale and automated testing and grading of quizzes, exercises and tests as used in university education. It is designed as an open and extensible system.

We have created a custom autograding container for the R language to both take advantage of the excellent PrairieLearn system, and to extend its facilities by using a unit testing framework which allows for further customization. Our plr package (Eddelbuettel and Barbehenn, 2019a) for R autograding with PrairieLearn deploys the tinytest system (van der Loo, 2019) for unit testing. It also extends it via the ttdo package (Eddelbuettel and Barbehenn, 2019b) which permits the creation of highly-informative diff objects produced by the diffobj package (Gaslam, 2019), which can be deployed directly in the dataflow based on JSON objects used by PrairieLearn.

References
Boettiger C, Eddelbuettel D (2017). “An Introduction to Rocker: Docker Containers for R.” The R Journal, 9(2), 527–536. doi:10.32614/RJ-2017-065. URL https://doi.org/10.32614/RJ-2017-065.

Eddelbuettel D, Barbehenn A (2019a). plr: Utility Functions for ’PrairieLearn’ and R. R package version 0.0.2, URL https://github.com/stat430dspm/plr.

Eddelbuettel D, Barbehenn A (2019b). ttdo: Extend ’tinytest’ with ’diffobj’. R package version 0.0.4, URL https://CRAN.R-project.org/package=ttdo.

Gaslam B (2019). diffobj: Diffs for R Objects. R package version 0.2.3, URL https://CRAN.R-project.org/package=diffobj.

Ooms J (2019a). RAppArmor: Bindings to AppArmor and Security Related Linux Tools. R package version 3.2, URL https://CRAN.R-project.org/package=RAppArmor.

Ooms J (2019b). unix: POSIX System Utilities. R package version 1.5, URL https://CRAN.R-project.org/package=unix.

van der Loo M (2019). tinytest: Lightweight and Feature Complete Unit Testing Framework. R package version 1.0.0, URL https://CRAN.R-project.org/package=tinytest.

Zilles C, West M, Mussulman D, Bretl T (2018). “Making testing less trying: Lessons learned from operating a computer-based testing facility.” In Proceedings of the 2018 Frontiers in Education Conference (FIE 2018). URL http://lagrange.mechse.illinois.edu/pubs/ZiWeMuBr2018/ZiWeMuBr2018.pdf.

Appendix 1: run.sh

JOB_DIR="/grade/"
STUDENT_DIR="${JOB_DIR}student/"
AG_DIR="${JOB_DIR}serverFilesCourse/r_autograder/"
TEST_DIR="${JOB_DIR}tests/"
OUT_DIR="${JOB_DIR}results/"
MERGE_DIR="${JOB_DIR}run/"
BIN_DIR="${MERGE_DIR}bin/"

echo "[init] making directories"
mkdir ${MERGE_DIR} ${BIN_DIR} ${OUT_DIR}

chown -R root:root ${TEST_DIR}
chmod -R go-rwx ${TEST_DIR}

echo "[init] setting up tests directory for 'ag' user"
chown ag:ag ${TEST_DIR}tests

echo "[init] copying content"
cp ${STUDENT_DIR}* ${BIN_DIR}
cp ${AG_DIR}* ${MERGE_DIR}
cp -r ${TEST_DIR}* ${MERGE_DIR}
chown ag:ag ${MERGE_DIR}tests

cd ${MERGE_DIR}
echo "[run] starting autograder"
echo "[run] Rscript pltest.R"
Rscript pltest.R

if [ ! -s results.json ]; then
    echo -n '{"succeeded": false, "score": 0.0, "message": "Catastrophic failure! ' > results.json
    echo 'Contact course staff and have them check the logs for this submission."}' >> results.json
fi

echo "[run] autograder completed"

cp ${MERGE_DIR}/results.json '/grade/results/results.json'
echo "[run] copied results"
Appendix 2: pltest.R

message_to_test_result <- function(msg, mxpts=100) {
    data.frame(name = "Error",
               max_points = mxpts,
               points = 0,
               output = msg$message)
}

result <- tryCatch({
    set.seed(as.integer(Sys.Date()))
    tests_dir <- "/grade/tests/tests"
    question_details <- plr::get_question_details(tests_dir)
    cat("[pltest] about to call tests from", getwd(), "\n")
    test_results <- as.data.frame(tinytest::run_test_dir(tests_dir, verbose = FALSE))
    res <- merge(test_results, question_details, by = "file", all = TRUE)
    res$points <- ifelse(!is.na(res$result) & res$result == TRUE, res$max_points, 0)
    res$output <- ifelse(!is.na(res$result) & res$result == FALSE,
                         paste(res$call, res$diff, sep = "\n"), "")
    score <- sum(res$points) / sum(res$max_points)
    res <- res[, c("name", "max_points", "points", "output")]
    list(tests = res, score = score, succeeded = TRUE)
},
warning = function(w) list(tests = message_to_test_result(w), score = 0, succeeded = FALSE),
error = function(e) list(tests = message_to_test_result(e), score = 0, succeeded = FALSE))

jsonlite::write_json(result, path = "results.json", auto_unbox = TRUE, force = TRUE)

Appendix 3: Dockerfile for stat430/pl container
FROM rocker/r-ubuntu:18.04
ENV PYTHONIOENCODING=UTF-8
RUN apt-get update && apt-get install -y \
    git \
    r-cran-data.table \
    r-cran-devtools \
    r-cran-doparallel \
    r-cran-dygraphs \
    r-cran-foreach \
    r-cran-fs \
    r-cran-future.apply \
    r-cran-gh \
    r-cran-git2r \
    r-cran-igraph \
    r-cran-memoise \
    r-cran-microbenchmark \
    r-cran-png \
    r-cran-rcpparmadillo \
    r-cran-rex \
    r-cran-rsqlite \
    r-cran-runit \
    r-cran-shiny \
    r-cran-stringdist \
    r-cran-testthat \
    r-cran-tidyverse \
    r-cran-tinytest \
    r-cran-xts \
    sqlite3 \
    sudo

RUN install.r bench diffobj flexdashboard lintr ttdo unix
RUN installGithub.r stat430dspm/plr MangoTheCat/visualTest

RUN useradd ag \
    && mkdir /home/ag \
    && chown ag:ag /home/ag \
    && echo "[user]" > /home/ag/.gitconfig \
    && echo "    name = Autograding User" >> /home/ag/.gitconfig \
    && echo "    email = ag@nowhere" >> /home/ag/.gitconfig \
    && chown ag:ag /home/ag/.gitconfig