A shiny update to an old experiment game
Robert B. Gramacy
Department of Statistics, Virginia Tech∗

Abstract
Games can be a powerful tool for learning about statistical methodology. Effective game design involves a fine balance between caricature and realism, to simultaneously illustrate salient concepts in a controlled setting and serve as a testament to real-world applicability. Striking that balance is particularly challenging in response surface and design domains, where real-world scenarios often play out over long time scales, during which theories are revised, models and inferential techniques are improved, and knowledge is updated. Here I present a game, borrowing liberally from one first played over forty years ago, that attempts to achieve that balance while reinforcing a cascade of topics in modern nonparametric response surfaces, sequential design and optimization. The game embeds a blackbox simulation within a shiny app whose interface is designed to simulate a realistic information-availability setting, while offering a stimulating, competitive environment wherein students can try out new methodology, and ultimately appreciate its power and limitations. Interface, rules, timing with course material, and evaluation are described, along with a “case study” involving a cohort of students at Virginia Tech.
Key words: response surface, computer experiment, experimental design, Bayesian optimization, input sensitivity, teaching game
In-class games are a common way to encourage learning—to interject some fun and build intuition in a seemingly esoteric, or tedious technical landscape. A good game could be fundamental to retaining students, say in introductory statistics. One fine example uses chocolate chip cookies to illustrate aspects of sampling distributions (Lee, 2007). The long arc of an out-of-class game played over an entire semester is attempted rather less frequently. However for some topics, like experimental design and response surface optimization, that setting is quite natural: real-life applications play out on longer temporal scales, and in an inherently dynamic landscape. In this article I present such a game, designed to simulate a real-world setting that students might encounter on internship, which was played during a graduate course I recently gave at Virginia Tech. The game is an update of one first played over forty years ago (Mead and Freeman, 1973).

∗Contact: [email protected] or Hutcheson Hall (MCO439), 250 Drillfield Drive, Blacksburg, VA 24061

Mead and Freeman’s game was ahead of its time. Today, few are aware of the contribution. Although it is cited prominently in Box and Draper (1987), which is how I found it, that is one of just eight references to it in the literature. Perhaps this is because, for many decades (70s–90s, say) the setup of the game, requiring a custom computing environment with student access, was hard to replicate. Today, with R/CRAN (R Core Team, 2017) implementation and shiny (Chang et al., 2017) web interfaces, barriers have come way down.

The original game involves blackbox evaluation of agricultural yield as a function of six nutrient levels, borrowed from Nelder (1966) and reproduced in R as follows.
yield <- function(N, P, K, Na, Ca, Mg)
{
  l1 <- 0.015 + 0.0005*N + 0.001*P + 1/((N+5)*(P+2)) + 0.001*K + 0.1/(K+2)
  l2 <- 0.001*((2 + K + 0.5*Na)/(Ca+1)) + 0.004*((Ca+1)/(2 + K + 0.5*Na))
  l3 <- 0.02/(Mg+1)
  return(1/(l1 + l2 + l3))
}

Players obtain noisy yield evaluations, due to additive block and plot-within-block (Gaussian) effects, with the ultimate goal of maximizing yield. There are five computer sessions (simulating crop years). Multiple experiments can be undertaken in a single session but strategies can only be revised between sessions. My updated version still focuses on maximizing yield through a series of runs, and its inaugural run involved a yield form very similar to the one coded above. The specifics are detailed in Section 2. But the environment no longer considers blocks and plots.

My revisions to the game are primarily motivated by modern technology/application, and a desire to teach a more sophisticated statistical toolkit. Classical response surface methods and design emphasize low degree (first- and second-order) linear modeling. The resulting steepest ascent and ridge analyses (see, e.g., Box and Draper, 1987; Myers et al., 2016) enjoy a high degree of analytical tractability. Many relevant calculations can either be performed by hand, or with a graphing calculator. Yet such conveniences offer negligible advantage in modern applications demanding facilities that, historically, would have been considered supercomputing but are now simply routine numerics. Two prominent examples are information technology/e-commerce and computer simulation experiments. Modern response surface methods in those domains borrow heavily from geostatistics and machine learning with Gaussian processes, deep neural networks, and regression and classification trees.
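To give a concrete feel for the blackbox, the yield function above can be probed through a noisy wrapper. The wrapper below is purely illustrative: the constant noise level (sd=1) and the name noisy.yield are my own stand-ins, not the game's actual block/plot or time-varying noise structure.

```r
# Nelder's yield function, repeated from above so this snippet is self-contained.
yield <- function(N, P, K, Na, Ca, Mg)
{
  l1 <- 0.015 + 0.0005*N + 0.001*P + 1/((N+5)*(P+2)) + 0.001*K + 0.1/(K+2)
  l2 <- 0.001*((2 + K + 0.5*Na)/(Ca+1)) + 0.004*((Ca+1)/(2 + K + 0.5*Na))
  l3 <- 0.02/(Mg+1)
  return(1/(l1 + l2 + l3))
}

# Hypothetical noisy evaluation: additive Gaussian noise standing in for
# the block and plot-within-block effects of the original game.
noisy.yield <- function(N, P, K, Na, Ca, Mg, sd=1)
  yield(N, P, K, Na, Ca, Mg) + rnorm(1, sd=sd)

yield(0, 0, 0, 0, 0, 0)  # deterministic baseline at the origin
```

Players only ever see noisy evaluations of this sort through the app; the deterministic yield is hidden from them.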
Sequential design strategies like expected improvement (Jones et al., 1998) promise a more modular approach to (so-called Bayesian) optimization, allowing fancy models to be swapped in and out, and a degree of human-free automation with light coding. My teaching and research philosophy favors modularity in implementation, and the game was in part conceived to reinforce those aspects in practice.

Several aspects of the new game setup emphasize friendly competition as a means of enhancing the learning arena. Exploring a diverse, state-of-the-art toolkit benefits from sandboxing aspects of game play, but the superlative nature of the goal—of optimizing yield—organically encourages students to adapt and improve over time. Relative benchmarks also enhance the sense of realism, albeit through somewhat artificial means. At the same time, it remains important that students feel compelled to engage with the game in a way that is advantageous pedagogically, which for some could come at the expense of winning. Toward that end, Section 3 covers player benchmarking, timing of methodology with lecture material, and an assessment strategy designed to encourage regular engagement and the deployment of a wealth of tools. Some results from a real run of the game at Virginia Tech are provided. Section 4 concludes with lessons learned and ideas for future variations. An online supplement includes a suite of supporting codes and other materials.
The core of game play is facilitated by an R shiny app, shown in Figure 1, which serves as both a multi-player portal and an interface to the back-end database of player records. An Rmarkdown document, yield.Rmd, provided with the supplementary material, compiles a full set of instructions on how to use the app, the rules of the game, and some suggestions about strategy. An HTML rendering of that document is provided at http://bobby.gramacy.com/teaching/rsm/yield.html. The salient details follow.

2.1 The shiny app
Since all game players use the same app, interaction begins with a “logging in” phase. In advance of opening the game, I asked each player to provide me with their initials (2–4 characters) and a four-digit secret PIN, the combination of which comprises the login token. Logging in involves providing “url?token” in the browser search bar. In Figure 1 the URL points to a local shiny server, and the token is “rbg4036”, a combination of my initials (I played the game too) and my office number, 403G.

Once logged in, the player is presented with three blocks of game content. A greeting block provides details on spent and total budget for experimental runs. How the budget works, costs of runs, etc., is discussed in detail in Section 2.2. As long as the player has not fully spent, or over-spent, their budget, new runs may be performed via the second block on the page. Here, the player enters the coordinates of the next run. Until all fields are populated with valid (positive, non-empty) values, helpful error messages appear on the right. The last input, which is a slider, indicates the number of replicates desired—again discussed in more detail shortly. Once all entries are valid, the error messages are replaced by a “Run” button and a warning that there are no do-overs.

Performing a run causes the table in the final block of the page to be updated, with the week, inputs, and outputs of the new run appearing at the top of the table. The primary purpose of the table is to provide visual confirmation that the run has been successfully stored in the player’s database file. Buttons are provided to aid in browsing, however this is not intended as the main data-access vehicle.

(Competition between independent agents may not be the best way to pool efforts in a real-world setting.)

Figure 1: A view of the yield simulation session. User rbg4036 may inspect the run budget, perform new runs (if budget remains), browse historical runs, or download a text file.
A “Download” button at the bottom creates a text file which can be saved via browser support—usually into the ~/Downloads directory. Empty output fields are converted to NAs in the downloaded file.

The appendix contains some details about the backend of the shiny app, including special considerations required for hosting in the cloud, say on shinyapps.io.

2.2 Rules, startup and twists

Players are encouraged to collaborate on strategy, and the development of relevant mathematical calculations, but they may not share code or data. They start with a database containing an identical design of seven inputs, each with five replicate responses, whose noise structure is explained momentarily. Students were introduced to the game in the fifth week (game week zero), and could perform their first runs in the sixth week of a 15-week semester. Including Thanksgiving and final exam weeks tallies thirteen weeks of game play.

In each week, including week zero, students accrue 100 playing units to spend on runs, with full roll-over from previous weeks. The cost structure for runs favors replicates. Obtaining the first replicate costs ten units. Replicates two through four cost an additional three units each; five through seven cost two, and eight through ten cost one each. As long as a player’s account is in the black (i.e., positive balance), s/he can perform a new run with as many replicates as desired, up to the limit of ten. If performing that run causes the balance to be zero, or negative, future runs will not be allowed (the run box on the app disappears) until the following week, after 100 new units are added.

Replication is important for two game “twists” designed to encourage players to think about signal-to-noise trade-offs, and to nudge them to spend units regularly, rather than save them all until the end of the semester.
The first is that players are told that the variance of the additive noise on yield simulations is changing weekly, following a smooth process in time, and that they will need to provide an estimate of that variance over time for their final report. They are further warned that the variance may be increasing, effectively devaluing unused units. In fact the variance in week w followed a simple sinusoidal structure σ(w) = 0.1(cos(2π(w − ws)/10) + 1) for starting week ws, which peaks in the first and tenth week. The second wrinkle is a seventh input, Nx, which is unrelated to the response. Students are not told that one of the inputs is useless, however the final writeup instructions ask for sensitivity analyses for the inputs, including main, partial dependence, and total effects.

The class was offered as a 6000-level seminar, which graduate students in statistics usually take after they’ve completed the bulk of their coursework requirements for a terminal degree (Masters or Ph.D.). Most students were, therefore, not novices in the core pillars of the course: linear and nonlinear regression, design, etc. But few had previously worked in an environment similar to the one synthesized by the game. Starting in the fifth week of the semester allowed for ramp-up on methodological training, giving students the chance to learn/review fundamentals like steepest ascent and ridge analyses (e.g., Myers et al., 2016, Chapters 5–6) before entering the game as players. Using R code provided in class, most students were able to perform runs in the first two weeks that improved upon yields from week zero, while getting a feel for the system. In subsequent weeks, new methodology was introduced which students were expected to try in the game. Most students had not, previously, been exposed to these topics, which leverage nonparametric response surfaces. Conceptual and implementational hurdles were anticipated by a careful timing of training (first examples in lecture and homework) and testing (live, in the game) stages in the course development. Details on how game play tracks course material, cookie crumbs to “catch up” stragglers, and windows into game progress are provided below.

Homeworks were assigned roughly every two weeks and each one, from the third onward, contained a problem on the game. Problem statements and solutions are provided with the supplementary material.
The third homework was the most prescriptive about what to do in the game. It instructed the student to fit a first-order model to the initial data set, reducing it via a step-wise selection procedure with BIC; in my own solution that selection retained three main effects. Then, after reducing to a first-order model having only those three components, they were asked to search for interactions. I found one.

With best fitted model in hand, students were asked to characterize a path of steepest ascent, and to obtain yield simulations along that path. This required determining values for the remaining (in my case four) inputs. No guidance was given here; I used a Latin hypercube sample, paired with six settings of the three active variables a short ways along the path of steepest ascent. Next, students were given several options about how to proceed, including a second-order ridge analysis, more exploration with space-filling designs, or more steepest ascent. In my own solution I did a bit of all three, and the result was a second-order fit to the data that had many relevant main effects, interactions, etc.

After the homework deadline, I released my solution so students could see what I did, and use it to “catch up” with other players during subsequent weeks. Then we transitioned to modern material involving Gaussian processes (GPs), presented from machine learning (Rasmussen and Williams, 2006) and computer surrogate modeling (Santner et al., 2003) perspectives. Both communities evangelize the potential for fitted GP predictive surfaces to guide searches for global optima in blackbox functions. Machine learning researchers call this Bayesian optimization, whereas computer modelers call it surrogate-assisted optimization (however increasingly they are adopting the machine learning terminology). The simplest variation involves optimizing the fitted GP predictive mean equations (Booker et al., 1999), in lieu of working directly with locally winnowed input–output data pairs.
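Rewinding to the third homework's prescription, the first-order fit, BIC-based step-wise reduction, and unit direction of steepest ascent might be sketched as follows. This is not my actual solution code: the data here are synthetic (a made-up linear truth on three inputs, and the name db), standing in for a player's downloaded database.

```r
# Synthetic stand-in for a player's database of runs (inputs plus yield y).
set.seed(1)
n <- 35
db <- data.frame(N=runif(n, 0, 6), P=runif(n, 0, 6), K=runif(n, 0, 6),
                 Na=runif(n, 0, 6), Ca=runif(n, 0, 6), Mg=runif(n, 0, 6),
                 Nx=runif(n, 0, 6))
db$y <- 2 + 0.5*db$N + 0.3*db$P - 0.2*db$K + rnorm(n, sd=0.1)

# First-order fit in all inputs, then step-wise reduction under a BIC
# penalty (k = log(n) turns step's default AIC into BIC).
fit <- lm(y ~ ., data=db)
fit.red <- step(fit, k=log(n), trace=0)

# Unit vector of steepest ascent from the retained first-order coefficients.
b <- coef(fit.red)[-1]        # drop the intercept
delta <- b / sqrt(sum(b^2))   # points "uphill" in the retained inputs
```

New runs would then be taken at x0 + t*delta for increasing step sizes t, with the inactive inputs filled in by other means (e.g., a Latin hypercube sample, as in my solution).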
The method of expected improvement (EI, Jones et al., 1998) and integrated variations (e.g., IECI, Gramacy and Lee, 2011) were subsequently introduced to better balance exploration and exploitation by incorporating degrees of predictive variability (local to global), a hallmark of statistical decision-making. This family of nonparametric approaches was reinforced in three questions spanning three subsequent homeworks. Solutions were provided (after the due dates) as plug-n-play R scripts leveraging mature laGP subroutines (Gramacy, 2016), converting a database file into a suggested new run without any additional human interaction.

As one example consider EI, which is based on the improvement: I(x) = max{0, f^min_n − Y(x)}. The quantity f^min_n is the best value of the objective so far, and Y(x) is a prediction of that objective as a function of the inputs x. Both are derived from the estimated response surface, and are thus random variables—a property inherited by I(x). Although there are many variations in the literature, the most common setup simplifies to treat f^min_n as a deterministic quantity derived from minimizing the mean predictive surface for Y(x), and supposes that surface to be Gaussian. In such cases, as arises when training a GP on input–output pairs (x_1, y_1), …, (x_n, y_n), the distribution of Y(x) is completely specified by μ_n(x) and σ_n(x), which have closed form expressions. Integrating I(x) over Y(x) is also analytically tractable in that Gaussian setting, leading to the expected improvement (EI):

E{I(x)} = (f^min_n − μ_n(x)) Φ((f^min_n − μ_n(x))/σ_n(x)) + σ_n(x) φ((f^min_n − μ_n(x))/σ_n(x)),

where Φ and φ are the standard normal cdf and pdf, respectively.
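In R, the closed form above amounts to just a few lines. The sketch below targets minimization, matching the formula; for the game's maximization of yield one would apply it to the negated response (the function name EI is my own, not laGP's interface):

```r
# Expected improvement under a Gaussian predictive distribution with
# mean mu, standard deviation sigma, and deterministic incumbent fmin.
EI <- function(mu, sigma, fmin)
{
  z <- (fmin - mu) / sigma
  (fmin - mu) * pnorm(z) + sigma * dnorm(z)
}
```

A sequential design step then maximizes EI over candidate inputs x, plugging in μ_n(x) and σ_n(x) from the fitted GP.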
Notice how EI naturally balances exploitation, μ_n(x) below f^min_n, and exploration, large σ_n(x), which is key to automating iterations of search for local optima (by maximizing E{I(x)} over x).

Homework 5, provided in the supplementary material, asked students to try EI in the game. That homework was assigned in week 7 of game play, and variations were considered in subsequent homeworks/weeks. Although timing is difficult to pinpoint exactly, the students’ implementation of EI-like methods, and the scripts provided as catch-up after the due date (week 9, see, e.g., yield_gp_ei.R in the supplementary material), are responsible for much of the progress realized in latter weeks. For example student “mds”, who had relatively mediocre success with classical RSM tools, saw big improvements [see Figure 2] from employing these more sophisticated, but more easily automated, tools.

In hopes that friendly competition would spur interest, I provided leaderboard-style views into players’ performance relative to one another, updated in real time. A live
Rmarkdown script compiled four views into player progress for web viewing. Two of those views, snapped during the final week of game play, are provided in Figure 2. On the x-axis is the week of game play (left panel) or the unique run number (right), and on the y-axis is a normalized yield response. Each player has a line in the plot, with color and type indicated by their initials (masking the pin). The responses shown have been de-noised in order to view pure progress, making normalization essential; otherwise players would know the true mean value of their best noisy response. The two other views provided by the Rmarkdown script show “raw” versions of these same plots, on the original scale—these are not shown here.

Figure 2: Two de-noised views into the real-time leaderboard during the final week of game play: best yield by week (left), and by run (right).

Observe from the panel that about half of the progress is made in the first five weeks, spanning around forty runs. Students “jh” and “fs” made rapid progress. By contrast, “hm” ends up at the same place in the end, but with more steady increments. The leaderboard would seem to partition students into three classes, comprising the top five, middle four, and lower four. Evidently, student “wt” gave up early and “ww” may have misjudged the appropriate balance of replicates versus unique runs. These students, while not having the strictly poorest overall performance in terms of max yield, received the lowest grades on the project. Their final writeup reinforced an impression of low engagement with many aspects of the competition, including a failure to capitalize on the “catch up” scripts provided as part of the homework keys. Finally, observe in the figure that my own strategy placed me fifth by these measures.
I favored replication over unique runs in hopes of obtaining better main effects, sensitivity indices, and estimates of variance over time. Very few students picked up on the small asymptotic effect of Mg, commented on the potential uselessness of Nx, or noticed the periodic effects in the noise. Although all three were evident in my own solution, that’s perhaps because I knew in advance what to look for. Despite those caveats, students with the best yield results on the leaderboard had remarkably tight estimates for the optimal coordinates of the five “active” inputs, relative both to one another and to the truth.

I have described a statistical optimization game updating a previous one in two senses. The first sense is that the game uses modern tools for implementation. Supplementary materials provide R code with support for a shiny app interface, cloud storage for instance-based hosting like shinyapps.io, and real-time views into progress via a leaderboard. Although the original version of the game was innovative, when introduced it was essentially unusable by others at the time owing to the nature of computing environments available in the early 1970s. The second update has to do with modern response surface methods. My re-casting of the game, and the homework problems from class which support it, emphasize a machine learning and computer surrogate modeling “hands-free” toolkit from the 21st century.

Upon reflection, the game perhaps could be revised to better emphasize these more modern tools. Students who made early progress on the leaderboard unfortunately realized diminishing returns from methodological upgrades developed later in the semester. In some sense, they had been “unlucky” in finding the right answer “too early” to benefit from the modern tools. I recall an office hours session where one of the better students, “fs”, was somewhat distraught by the lack of improvement in yield from his/her EI implementation.
Eventually, I had no choice but to reveal that s/he had essentially “solved it” so that s/he could move productively on to other things. In future play I may opt for a multi-modal blackbox in order to keep students engaged, and to demonstrate the explorative value that comes from EI and IECI-like heuristics appearing later in the syllabus. Another potentially exciting variation may entertain a real blackbox simulation, and perhaps one not involving agriculture, which could feel somewhat old-fashioned to the younger generation of data scientists. A promising example is the so-called assemble-to-order (ATO, Xie et al., 2012) solver. This eight-dimensional example is challenging because it has a spatially dependent noise structure.

Future versions of the class will cover heteroskedastic GP methodology (Binois et al., 2018b,a), via the hetGP package for R (Binois and Gramacy, 2017), in order to accommodate that kind of process. In its inaugural run, the game featured mild heteroskedasticity in time (i.e., over game weeks) in order to encourage regular game-play. The number of players performing runs in each week was 3, 10, 9, 10, 9, 9, 8, 10, 9, 9, 8, 8, and 10 (of 12 total), respectively. Although not a controlled experiment, these results suggest that that aspect of the game was a success. With homeworks due (at most) bi-weekly, it would be surprising to have 80% or higher engagement in weeks without homework due barring extra incentives. In the presence of spatial heteroskedastic effects, which is arguably more realistic than sinusoids in time, an alternate incentive for regular engagement may be needed.

Acknowledgments
I am grateful to students and colleagues at Virginia Tech who helped curate, and participate in, my class on modern response surfaces and computer experiments. Thanks to the TAS editorial team for many thoughtful comments.
A Backend details
Under the hood, the app maintains the player database as files quite similar to those offered for downloading. If the app is being hosted on a standalone node running a shiny server, those files may be stored locally in the app’s working directory. However, if the game is hosted in the cloud, say on shinyapps.io, the database can only temporarily be stored locally, while that instance is active. Ensuring that each of multiple running instances share the same game database, and guaranteeing its integrity after instances time-out due to inactivity (which causes the running directory to be purged), requires that the files be stored elsewhere, in a single persistently available location in the cloud. I hosted my game at http://gramacylab.shinyapps.io/yield, with database files accessed through the rdrop2 package (Ram and Yochum, 2017): the app uses rdrop2 calls to drop_get files into the local working directory, and subsequently drop_upload to sync those with new runs. (All of the player files are downloaded as part of the instance initialization, which can unfortunately be time consuming. However, the instance is then ready to serve multiple players, if needed.) Note that the Rmarkdown leaderboard would also pull from the same rdrop2 API. See the supplementary materials for more details. The default setup of those codes accommodates local (i.e., non-cloud) implementation, since that setup most expediently facilitates development and other tinkering. Commented code and description therein clarifies how to engage the rdrop2 features for production in the cloud.
References
Binois, M., Gramacy, R. B., Huang, J., and Ludkovski, M. (2018a). “Replication or exploration? Sequential design for stochastic simulation experiments.” Technometrics, to appear; preprint on arXiv:1710.03206.

Binois, M. and Gramacy, R. B. (2017). hetGP: Heteroskedastic Gaussian Process Modeling and Design under Replication. R package version 1.0.0.

Binois, M., Gramacy, R. B., and Ludkovski, M. (2018b). “Practical heteroskedastic Gaussian process modeling for large simulation experiments.” Journal of Computational and Graphical Statistics, to appear; preprint on arXiv:1611.05902.

Booker, A. J., Dennis, Jr., J. E., Frank, P. D., Serafini, D. B., Torczon, V., and Trosset, M. W. (1999). “A Rigorous Framework for Optimization of Expensive Functions by Surrogates.” Structural Optimization, 17, 1–13.

Box, G. E. and Draper, N. R. (1987). Empirical Model-Building and Response Surfaces. John Wiley & Sons.

Chang, W., Cheng, J., Allaire, J., Xie, Y., and McPherson, J. (2017). shiny: Web Application Framework for R. R package version 1.0.5.

Gramacy, R. (2016). “laGP: Large-Scale Spatial Modeling via Local Approximate Gaussian Processes in R.” Journal of Statistical Software, 72, 1, 1–46.

Gramacy, R. B. and Lee, H. K. H. (2011). “Optimization under Unknown Constraints.” In Bayesian Statistics 9, eds. J. Bernardo, S. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West, 229–256. Oxford University Press.

Jones, D. R., Schonlau, M., and Welch, W. J. (1998). “Efficient Global Optimization of Expensive Black Box Functions.” Journal of Global Optimization, 13, 455–492.

Lee, H. K. H. (2007). “Chocolate Chip Cookies as a Teaching Aid.” The American Statistician, 61, 4, 351–355.

Mead, R. and Freeman, K. H. (1973). “An Experiment Game Demonstrating the Importance of Planned Experiments.” Journal of the Royal Statistical Society. Series C (Applied Statistics), 22, 1, 1–6.

Myers, R. H., Montgomery, D. C., and Anderson-Cook, C. (2016). Response Surface Methodology: Process and Product Optimization Using Designed Experiments. John Wiley & Sons.

Nelder, J. (1966). “Inverse polynomials, a useful group of multi-factor response functions.” Biometrics, 22, 128–141.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Ram, K. and Yochum, C. (2017). rdrop2: Programmatic Interface to the ’Dropbox’ API. R package version 0.8.1.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.

Santner, T. J., Williams, B. J., and Notz, W. I. (2003). The Design and Analysis of Computer Experiments. Springer-Verlag, New York.