An R Autograder for PrairieLearn
Dirk Eddelbuettel and Alton Barbehenn
Department of Statistics, University of Illinois, Urbana-Champaign, IL, USA
This version was compiled on March 17, 2020
We describe how we both use and extend the PrairieLearn framework by taking advantage of its built-in support for external auto-graders. By using a custom Docker container, we can match our course requirements perfectly. Moreover, by relying on the flexibility of the interface we can customize our Docker container. A specific extension for unit testing is described which creates context-dependent differences between student answers and the reference solution, providing a more comprehensive response at test time.
Context
We describe the motivation, design and use of an autograder for the R language within the PrairieLearn system (Zilles et al., 2018). PrairieLearn is in use at the University of Illinois at Urbana-Champaign, where it is also being developed, and at other campuses to support fully automated computer-based testing of homeworks, quizzes and exams for undergraduate and graduate students. We use it to support the topics course STAT 430 “Data Science Programming Methods” we have been teaching since 2019 in the Department of Statistics at the University of Illinois at Urbana-Champaign.

As documented, PrairieLearn supports external graders, and we are providing one such grader for the R language and system. Our implementation follows KISS principles, and is sufficiently safe and robust for deployment. Our approach uses two key insights. First, testing student submissions is close to unit testing code—and we benefit from relying on a very clever, small and nimble test framework package, tinytest (van der Loo, 2019). Second, the PrairieLearn decision to allow external graders under a ‘bring your own container’ scheme allows us to regroup all our requirements in a simple Docker container—extending a base container from the Rocker Project (Boettiger and Eddelbuettel, 2017)—which we provision and control.
PrairieLearn
PrairieLearn (Zilles et al., 2018) is an online problem-driven learning system for creating homeworks and tests that enables automated code evaluation as well as more traditional question types (like multiple choice questions) for both homework assignments and exams. It is built to be flexible, and enables grading to happen however the instructor wishes using Docker. PrairieLearn comes with many easy ways of adding randomization to questions, and a custom set of HTML tags that makes writing questions easy.
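As an illustration of those custom tags, a question.html for a small code question might look roughly like the following sketch. The pl-* element names reflect PrairieLearn's tag set, but the exact attributes are assumptions that vary across versions; this is an illustrative fragment, not a verbatim file from our course.

```html
<!-- Illustrative sketch of a PrairieLearn question.html for a code
     question; attribute details are assumptions, not verbatim. -->
<pl-question-panel>
  <p>Write a function <code>fib(n)</code> returning the first
     <code>n</code> Fibonacci numbers.</p>
</pl-question-panel>

<!-- Ace editor pre-filled from a starter file shipped with the question -->
<pl-file-editor file-name="fib.R" ace-mode="ace/mode/r"
                source-file-name="initial_code.R">
</pl-file-editor>

<pl-submission-panel>
  <!-- renders the JSON results produced by the external grader -->
  <pl-external-grader-results></pl-external-grader-results>
</pl-submission-panel>
```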
Direct PrairieLearn Integration
The integration between the different components is designed to be simple and flexible. Data is exchanged via configuration text files in the JSON format (which is discussed below). At its core, this involves only two files (which we describe next) that are made available in the top-level directory of the contributed grader as shown in the following listing:

fs::dir_tree("r_autograder")
r_autograder
├── pltest.R
└── run.sh

run.sh. The first file, run.sh, shown in Appendix 1, is more-or-less unchanged from the run.sh file in the PrairieLearn example course which invokes the file pltest.R discussed next. It sets up a number of environment variables reflecting the PrairieLearn setup. It also copies files into place, adjusts modes (more on that below when we discuss security), calls the evaluation script discussed next, and assembles the result.

pltest.R. The second file is the actual test runner for PrairieLearn under R, and is shown in Appendix 2. For a given question, essentially three things happen:
Extract Metadata from Tests.
The displayed title of each available test, and the available points per test, are extracted from the question's test files themselves. This is performed by the helper function plr::get_question_details() which is given the default test directory used in our layout: "/grade/tests/tests". Our implementation is inspired by the doxygen and roxygen2 tag use for documentation and is most appropriate: the metadata for each test is stored with the test. This allows for quick iteration during development as test files can simply be renamed to be deactivated without worrying about conflicting metadata.
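The tag-scanning idea behind plr::get_question_details() can be sketched in base R along the following lines. The @title and @score comment tags and the helper name are illustrative assumptions rather than the exact plr conventions:

```r
# Sketch of doxygen/roxygen2-style metadata extraction from test files.
# The '@title' and '@score' tag names are illustrative assumptions; see
# plr::get_question_details() for the actual implementation.
get_question_details_sketch <- function(tests_dir) {
    files <- list.files(tests_dir, pattern = "\\.R$", full.names = TRUE)
    do.call(rbind, lapply(files, function(f) {
        lines <- readLines(f)
        title <- sub(".*@title\\s+", "", grep("@title", lines, value = TRUE)[1])
        score <- as.numeric(sub(".*@score\\s+", "", grep("@score", lines, value = TRUE)[1]))
        data.frame(file = basename(f), name = title, max_points = score)
    }))
}

# Example: a test file carrying its own metadata in comments
td <- file.path(tempdir(), "tests")
dir.create(td, showWarnings = FALSE)
writeLines(c("# @title Check fib(1)",
             "# @score 10",
             "expect_equal(fib(1), 1)"),
           file.path(td, "test_fib_1.R"))
get_question_details_sketch(td)
```

Renaming a test file so it no longer matches the pattern deactivates it together with its metadata, which is the quick-iteration property mentioned above.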
Run the Tests.
The actual test predicates are run using the tinytest package and its function run_test_dir() traversing a directory (more on that below). The result is then converted into a data.frame object. We discuss the tinytest framework in more detail below.
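Because tinytest results are data, the conversion is direct; a minimal sketch, assuming the tinytest package is installed:

```r
# Run a directory of tinytest files and treat the results as data.
library(tinytest)

td <- file.path(tempdir(), "tt_demo")
dir.create(td, showWarnings = FALSE)
writeLines("expect_equal(1 + 1, 2)", file.path(td, "test_sum.R"))

res <- as.data.frame(run_test_dir(td, verbose = FALSE))
# 'file', 'call' and 'result' are among the columns used later for merging
res[, c("file", "call", "result")]
```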
Merge and Post-Process.
The two data.frame objects (metadata and test results) are merged using the names of each test file as the key. Then points are calculated and the resulting object is written as a JSON file for PrairieLearn to consume.
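In simplified form, and with made-up metadata and results, the merge-and-score step looks as follows (jsonlite assumed to be installed; the actual logic used by our grader is pltest.R in Appendix 2):

```r
# Simplified sketch of the merge-and-post-process step: metadata and
# test results are joined on the test file name, points are computed,
# and the result is serialized to JSON for PrairieLearn to consume.
question_details <- data.frame(file = c("test_1.R", "test_2.R"),
                               name = c("fib(1)", "fib(n)"),
                               max_points = c(40, 60))
test_results <- data.frame(file = c("test_1.R", "test_2.R"),
                           result = c(TRUE, FALSE))

res <- merge(test_results, question_details, by = "file", all = TRUE)
res$points <- ifelse(!is.na(res$result) & res$result, res$max_points, 0)
score <- sum(res$points) / sum(res$max_points)

out <- list(tests = res[, c("name", "max_points", "points")],
            score = score, succeeded = TRUE)
jsonlite::write_json(out, path = tempfile(fileext = ".json"), auto_unbox = TRUE)
score   # 0.4 here: only the first, 40-point, test passed
```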
Test Framework

tinytest is an appropriately light-weight test framework without further dependencies. As stated in the opening of its vignette:

    The purpose of unit testing is to check whether a function gives the output you expect, when it is provided with certain input.

This is precisely what checking student answers amounts to. Given the context of a question, students provide code, typically as a function, which we can test given inputs—and compare to a reference answer and its output. Our framework does just that. Two of the key insights of tinytest are:

a) test results are data which can be stored and manipulated, and
b) each test file is a script interspersed with commands and suitable to be programmed over.

We use another key feature of tinytest: its extensibility. Our small helper package ttdo (Eddelbuettel and Barbehenn, 2019b) extends the tinytest framework by using diffobj (Gaslam, 2019) to compute succinct diff(1)-style summaries of object comparisons. This is most useful to show students the differences between their result and the reference result. We show this below in the context of a question.

(A note on points: earlier or alternate approaches use an explicit file points.json; we find it more suitable to define this on the fly given the test files.)

Example R Question
Within the testing framework, questions are a key component. In general, each question is organized in its own directory. Questions may then be grouped by directory name for assignments, exams or quizzes comprising a set of such questions. For each question used in our autograder, the directory layout is as shown in the next figure.

fs::dir_tree("rfunction-fib")
rfunction-fib
├── info.json
├── initial_code.R
├── question.html
└── tests
    ├── ans.R
    └── tests
        └── ...

There are two mandatory top-level files. First, info.json which contains all the relevant data for this question, including of course which grader to use. As discussed above, this file controls which of several graders is used.

{
    "uuid": "32A98E04-0A4C-497A-91D2-18BC4FE98047",
    "title": "Fibonacci Sequence 2.0",
    "topic": "Functions",
    "tags": ["code", "v3", "barbehe2", "deddel",
             "balamut2", "stat430dspm", "Fa19",
             "rautograder"],
    "type": "v3",
    "singleVariant": true,
    "gradingMethod": "External",
    "externalGradingOptions": {
        "enabled": true,
        "image": "stat430/pl",
        "serverFilesCourse": ["r_autograder/"],
        "entrypoint": "/grade/server/r_grader/run.sh",
        "timeout": 5
    }
}
We note that this points specifically to

• a top-level directory (such as the one shown above),
• an entry-point script (as discussed above), and
• a container to run the evaluations in.

Second, question.html which defines the display shown to the student. PrairieLearn now allows for markdown to describe the central part, and can reference external files such as the file initial_code.R listed here too. initial_code.R provides the stanza of code shown in the Ace editor component (and the file name is specified in question.html).

Then, the tests/ directory contains the test infrastructure. By our convention, tests/ans.R is the reference answer. This file is set to mode 0600 to ensure the student code can never read it.

fib <- function(n) {
    out <- rep(1, n)
    if (n >= 3)
        for (i in 3:n)
            out[i] <- out[i-1] + out[i-2]
    return(out)
}

The subdirectory tests/tests/ then contains one or more unit tests or, in our case, question validations. The first question sources the file, evaluates F(1) and compares to the expected answer, 1. (Other test questions then check for other values as shown below; several test predicates could also be present in a single test file but we are keeping it simple here.)

file <- "/grade/student/fib.R"
v <- plr::source_and_eval_safe(file, fib(1), "ag")
expect_equal(v, 1)

Of note is our use of a function from the helper package plr (Eddelbuettel and Barbehenn, 2019a). As the same code fragment would be repeated across numerous question files, it makes sense to regroup this code in a (simple) function. At its core are the system() call, made as the autograde user ag, and the subsequent evaluation of the supplied expression. We take full advantage of the lazy evaluation that makes R so powerful: fib(1) is not evaluated by the caller but rather later, in the context of the callee—after sourcing the corresponding file. We also make the file to be sourced visible to the ag user.
All other files remain inaccessible thanks to their mode of 0600. Another key aspect is the use of eval_safe() from the unix package (Ooms, 2019b). As we are effectively running as root inside a container, we have the ability to lower the permissions to those of another user, here ag.

source_and_eval_safe <- function(file, expr, uid=NULL) {
    if (!is.null(uid) && class(uid) == "character")
        uid <- unix::user_info(uid)$uid
    if (!file.exists(file))
        return(invisible(NULL))
    oldmode <- file.mode(file)
    Sys.chmod(file, mode="0664")
    source(file)
    res <- unix::eval_safe(expr, uid=uid)
    Sys.chmod(file, mode=oldmode)
    res
}
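The lazy-evaluation mechanism this relies on can be demonstrated in plain base R, leaving out the unix::eval_safe() privilege drop; eval_after_source() below is a hypothetical stand-in for illustration only:

```r
# The caller writes fib(10), but the promise is only forced *after* the
# callee has source()-d the file defining fib into the global environment.
eval_after_source <- function(file, expr) {
    source(file)   # defines fib() globally (source() defaults to local=FALSE)
    expr           # forcing the promise now succeeds
}

f <- tempfile(fileext = ".R")
writeLines(c("fib <- function(n) {",
             "    out <- rep(1, n)",
             "    if (n >= 3)",
             "        for (i in 3:n) out[i] <- out[i-1] + out[i-2]",
             "    out",
             "}"), f)

eval_after_source(f, fib(10))
# returns 1 1 2 3 5 8 13 21 34 55
```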
We omit the second question which is largely identical to the first, but tests F(2) for the expected answer of c(1,1). The third file uses randomization to prevent students from hardcoding an answer to F(n) for a given knowable value n.

library(tinytest)
using(ttdo)

n <- sample(3:20, 1)
file <- "/grade/student/fib.R"
student <- plr::source_and_eval_safe(file, fib(n), "ag")
source("/grade/tests/ans.R")
correct <- fib(n)
expect_equivalent_with_diff(student, correct)
It also shows another key feature: our use of the diffobj package. We use a very small add-on package ttdo (an acronym for ‘tinytest-diffobj’) we created utilizing the extension mechanism of tinytest in order to provide more specific feedback in the test results. The ttdo package is separate from our plr package because of its potential use in contexts other than PrairieLearn.

Figure 1 shows a screenshot resulting from providing an answer that returns a content of 1 no matter the input. This passes F(1), fails F(2) and fails F(n) for n >= 3. The screenshot displays the effect of the colorized difference between the received answer and the expected answer for the latter two questions.
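The colorized differences shown in Figure 1 are computed by diffobj; stripped of the tinytest integration, the underlying comparison can be reproduced directly (assuming the diffobj package is installed):

```r
# A diff(1)-style comparison between a (wrong) student answer and the
# reference result, of the kind ttdo attaches to failing tests.
library(diffobj)

student <- rep(1, 5)           # a submission hardcoding every value to 1
correct <- c(1, 1, 2, 3, 5)    # the reference fib(5)

# format = "raw" requests plain-text rather than ANSI-colorized output
as.character(diffPrint(student, correct, format = "raw"))
```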
Container
PrairieLearn allows for external containers. We use this feature to deploy a custom container based on the r-ubuntu container from the Rocker Project (Boettiger and Eddelbuettel, 2017). This container is set up with access to the “Personal Package Archive” (PPA) by Michael Rutter which provides a considerable subset of the R repositories (known as “CRAN”) as pre-compiled binaries. (See the brief description at the top of https://CRAN.R-Project.org/bin/linux/ubuntu for more details.) Our Dockerfile is shown in Appendix 3.

PrairieLearn always checks for updated containers, so deployment of a new container is more or less guaranteed. This also facilitates a workflow of incremental changes as the ‘continuous deployment’ is automated and relies on trusted workflows supporting many other open source projects. Similarly, by relying on widely-used and tested components such as the Rocker Project containers as a base, along with provided Ubuntu binaries, the risk of inadvertent breakage is minimized as well (when compared to bespoke custom solutions not relying on more widely-used components).
Fig. 1. Example Output of Autograder for Fibonacci Question
Security Aspects
R is a very flexible language that is somewhat difficult to sandbox as it allows computation on the language. Some approaches do exist: the RAppArmor package (Ooms, 2019a), which wraps around one of the two prevalent approaches for Linux, is a candidate given that we match the installation requirements by being on Debian / Ubuntu systems. Here, however, we opted for a more basic approach.
All files copied in by run.sh are set to be owned by the root user with no read, write or execution rights set for groups or others. The one exception is the uploaded file containing the to-be-evaluated student code. This file is then source()-ed in a lower-privileged process owned by the autograde user ag, and the supplied function is evaluated with a given argument. We use the unix package (Ooms, 2019b) for this, taking advantage of the fact that inside a container we are running as the superuser, permitting us to lower permissions. In other words, the one execution that could expose secrets (to the untrusted code submitted by the student) is the one running with the lowest possible permissions of the ag user, with all other files being “locked away” and readable only by the root user.

Concretely, our function plr::source_and_eval_safe() shown above relies on the function unix::eval_safe() which takes care of the (system-specific) details of process permission control. In addition, we also minimize file permission changes. A sibling function plr::eval_safe_as() works similarly on an R expression rather than a file.

Summary
The PrairieLearn system (Zilles et al., 2018) permits large-scale and automated testing and grading of quizzes, exercises and tests as used in university education. It is designed as an open and extensible system.

We have created a custom autograding container for the R language to both take advantage of the excellent PrairieLearn system, and to extend its facilities by using a unit testing framework which allows for further customization. Our plr package (Eddelbuettel and Barbehenn, 2019a) for R autograding with PrairieLearn deploys the tinytest system (van der Loo, 2019) for unit testing. It also extends it via the ttdo package (Eddelbuettel and Barbehenn, 2019b) which permits the creation of highly-informative diff objects produced by the diffobj package (Gaslam, 2019), which can be deployed directly in the dataflow based on JSON objects used by PrairieLearn.

References
Boettiger C, Eddelbuettel D (2017). “An Introduction to Rocker: Docker Containers for R.” The R Journal, 9(2), 527–536. doi:10.32614/RJ-2017-065. URL https://doi.org/10.32614/RJ-2017-065.

Eddelbuettel D, Barbehenn A (2019a). plr: Utility Functions for ’PrairieLearn’ and R. R package version 0.0.2, URL https://github.com/stat430dspm/plr.

Eddelbuettel D, Barbehenn A (2019b). ttdo: Extend ’tinytest’ with ’diffobj’. R package version 0.0.4, URL https://CRAN.R-project.org/package=ttdo.

Gaslam B (2019). diffobj: Diffs for R Objects. R package version 0.2.3, URL https://CRAN.R-project.org/package=diffobj.

Ooms J (2019a). RAppArmor: Bindings to AppArmor and Security Related Linux Tools. R package version 3.2, URL https://CRAN.R-project.org/package=RAppArmor.

Ooms J (2019b). unix: POSIX System Utilities. R package version 1.5, URL https://CRAN.R-project.org/package=unix.

van der Loo M (2019). tinytest: Lightweight and Feature Complete Unit Testing Framework. R package version 1.0.0, URL https://CRAN.R-project.org/package=tinytest.

Zilles C, West M, Mussulman D, Bretl T (2018). “Making testing less trying: Lessons learned from operating a computer-based testing facility.” In Proceedings of the 2018 Frontiers in Education Conference (FIE 2018). URL http://lagrange.mechse.illinois.edu/pubs/ZiWeMuBr2018/ZiWeMuBr2018.pdf.

Appendix 1: run.sh

JOB_DIR="/grade/"
STUDENT_DIR="${JOB_DIR}student/"
AG_DIR="${JOB_DIR}serverFilesCourse/r_autograder/"
TEST_DIR="${JOB_DIR}tests/"
OUT_DIR="${JOB_DIR}results/"
MERGE_DIR="${JOB_DIR}run/"
BIN_DIR="${MERGE_DIR}bin/"

echo "[init] making directories"
mkdir ${MERGE_DIR} ${BIN_DIR} ${OUT_DIR}

chown -R root:root ${TEST_DIR}
chmod -R go-rwx ${TEST_DIR}

echo "[init] setting up tests directory for 'ag' user"
chown ag:ag ${TEST_DIR}tests

echo "[init] copying content"
cp ${STUDENT_DIR}* ${BIN_DIR}
cp ${AG_DIR}* ${MERGE_DIR}
cp -r ${TEST_DIR}* ${MERGE_DIR}
chown ag:ag ${MERGE_DIR}tests

cd ${MERGE_DIR}
echo "[run] starting autograder"
echo "[run] Rscript pltest.R"
Rscript pltest.R

if [ ! -s results.json ]; then
    echo -n '{"succeeded": false, "score": 0.0, "message": "Catastrophic failure! ' > results.json
    echo 'Contact course staff and have them check the logs for this submission."}' >> results.json
fi

echo "[run] autograder completed"

cp ${MERGE_DIR}/results.json '/grade/results/results.json'
echo "[run] copied results"
Appendix 2: pltest.R

message_to_test_result <- function(msg, mxpts=100) {
    data.frame(name = "Error",
               max_points = mxpts,
               points = 0,
               output = msg$message)
}

result <- tryCatch({
    set.seed(as.integer(Sys.Date()))
    tests_dir <- "/grade/tests/tests"
    question_details <- plr::get_question_details(tests_dir)
    cat("[pltest] about to call tests from", getwd(), "\n")
    test_results <- as.data.frame(tinytest::run_test_dir(tests_dir, verbose = FALSE))
    res <- merge(test_results, question_details, by = "file", all = TRUE)
    res$points <- ifelse(!is.na(res$result) & res$result == TRUE, res$max_points, 0)
    res$output <- ifelse(!is.na(res$result) & res$result == FALSE,
                         paste(res$call, res$diff, sep = "\n"), "")
    score <- sum(res$points) / sum(res$max_points)
    res <- res[, c("name", "max_points", "points", "output")]
    list(tests = res, score = score, succeeded = TRUE)
},
warning = function(w) list(tests = message_to_test_result(w), score = 0, succeeded = FALSE),
error = function(e) list(tests = message_to_test_result(e), score = 0, succeeded = FALSE))

jsonlite::write_json(result, path = "results.json", auto_unbox = TRUE, force = TRUE)

Appendix 3: Dockerfile for stat430/pl container
FROM rocker/r-ubuntu:18.04
ENV PYTHONIOENCODING=UTF-8
RUN apt-get update && apt-get install -y \
    git \
    r-cran-data.table \
    r-cran-devtools \
    r-cran-doparallel \
    r-cran-dygraphs \
    r-cran-foreach \
    r-cran-fs \
    r-cran-future.apply \
    r-cran-gh \
    r-cran-git2r \
    r-cran-igraph \
    r-cran-memoise \
    r-cran-microbenchmark \
    r-cran-png \
    r-cran-rcpparmadillo \
    r-cran-rex \
    r-cran-rsqlite \
    r-cran-runit \
    r-cran-shiny \
    r-cran-stringdist \
    r-cran-testthat \
    r-cran-tidyverse \
    r-cran-tinytest \
    r-cran-xts \
    sqlite3 \
    sudo

RUN install.r bench diffobj flexdashboard lintr ttdo unix
RUN installGithub.r stat430dspm/plr MangoTheCat/visualTest

RUN useradd ag \
    && mkdir /home/ag \
    && chown ag:ag /home/ag \
    && echo "[user]" > /home/ag/.gitconfig \
    && echo "    name = Autograding User" >> /home/ag/.gitconfig \
    && echo "    email = ag@nowhere" >> /home/ag/.gitconfig \
    && chown ag:ag /home/ag/.gitconfig