A shiny update to an old experiment game
Robert B. Gramacy
Department of Statistics, Virginia Tech∗

Abstract
Games can be a powerful tool for learning about statistical methodology. Effective game design involves a fine balance between caricature and realism, to simultaneously illustrate salient concepts in a controlled setting and serve as a testament to real-world applicability. Striking that balance is particularly challenging in response surface and design domains, where real-world scenarios often play out over long time scales, during which theories are revised, models and inferential techniques are improved, and knowledge is updated. Here I present a game, borrowing liberally from one first played over forty years ago, that attempts to achieve that balance while reinforcing a cascade of topics in modern nonparametric response surfaces, sequential design and optimization. The game embeds a blackbox simulation within a shiny app whose interface is designed to simulate a realistic information-availability setting, while offering a stimulating, competitive environment wherein students can try out new methodology, and ultimately appreciate its power and limitations. Interface, rules, timing with course material, and evaluation are described, along with a “case study” involving a cohort of students at Virginia Tech.
Key words: response surface, computer experiment, experimental design, Bayesian optimization, input sensitivity, teaching game
In-class games are a common way to encourage learning—to interject some fun and build intuition in a seemingly esoteric, or tedious technical landscape. A good game could be fundamental to retaining students, say in introductory statistics. One fine example uses chocolate chip cookies to illustrate aspects of sampling distributions (Lee, 2007). The long arc of an out-of-class game played over an entire semester is attempted rather less frequently. However for some topics, like experimental design and response surface optimization, that setting is quite natural: real-life applications play out on longer temporal scales, and in an inherently dynamic landscape. In this article I present such a game, designed to simulate a real-world setting that students might encounter on internship, which was played during a graduate course I recently gave at Virginia Tech. The game is an update of one first played over forty years ago (Mead and Freeman, 1973).

∗Contact: [email protected] or Hutcheson Hall (MCO439), 250 Drillfield Drive, Blacksburg, VA 24061

Mead and Freeman’s game was ahead of its time. Today, few are aware of the contribution. Although it is cited prominently in Box and Draper (1987), which is how I found it, that is one of just eight references to it in the literature. Perhaps this is because, for many decades (70s–90s, say) the setup of the game, requiring a custom computing environment with student access, was hard to replicate. Today, with R/CRAN (R Core Team, 2017) implementation and shiny (Chang et al., 2017) web interfaces, barriers have come way down.

The original game involves blackbox evaluation of agricultural yield as a function of six nutrient levels, borrowed from Nelder (1966) and reproduced in R as follows.
yield <- function(N, P, K, Na, Ca, Mg)
{
  l1 <- 0.015 + 0.0005*N + 0.001*P + 1/((N+5)*(P+2)) + 0.001*K + 0.1/(K+2)
  l2 <- 0.001*((2 + K + 0.5*Na)/(Ca+1)) + 0.004*((Ca+1)/(2 + K + 0.5*Na))
  l3 <- 0.02/(Mg+1)
  return(1/(l1 + l2 + l3))
}

Players obtain noisy yield evaluations, due to additive block and plot-within-block (Gaussian) effects, with the ultimate goal of maximizing yield. There are five computer sessions (simulating crop years). Multiple experiments can be undertaken in a single session but strategies can only be revised between sessions. My updated version still focuses on maximizing yield through a series of runs, and its inaugural run involved a yield form very similar to the one coded above. The specifics are detailed in Section 2. But the environment no longer considers blocks and plots.

My revisions to the game are primarily motivated by modern technology/application, and a desire to teach a more sophisticated statistical toolkit. Classical response surface methods and design emphasize low degree (first- and second-order) linear modeling. The resulting steepest ascent and ridge analyses (see, e.g., Box and Draper, 1987; Myers et al., 2016) enjoy a high degree of analytical tractability. Many relevant calculations can either be performed by hand, or with a graphing calculator. Yet such conveniences offer negligible advantage in modern applications demanding facilities that, historically, would have been considered supercomputing but are now simply routine numerics. Two prominent examples are information technology/e-commerce and computer simulation experiments. Modern response surface methods in those domains borrow heavily from geostatistics and machine learning with Gaussian processes, deep neural networks, and regression and classification trees.
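To give a concrete feel for the blackbox, the yield function above can be probed through a noisy wrapper. The wrapper below is purely illustrative: the constant noise level (sd=1) and the name noisy.yield are my own stand-ins, not the game's actual block/plot or time-varying noise structure.

```r
# Nelder's yield function, repeated from above so this snippet is self-contained.
yield <- function(N, P, K, Na, Ca, Mg)
{
  l1 <- 0.015 + 0.0005*N + 0.001*P + 1/((N+5)*(P+2)) + 0.001*K + 0.1/(K+2)
  l2 <- 0.001*((2 + K + 0.5*Na)/(Ca+1)) + 0.004*((Ca+1)/(2 + K + 0.5*Na))
  l3 <- 0.02/(Mg+1)
  return(1/(l1 + l2 + l3))
}

# Hypothetical noisy evaluation: additive Gaussian noise standing in for
# the block and plot-within-block effects of the original game.
noisy.yield <- function(N, P, K, Na, Ca, Mg, sd=1)
  yield(N, P, K, Na, Ca, Mg) + rnorm(1, sd=sd)

yield(0, 0, 0, 0, 0, 0)  # deterministic baseline at the origin
```

Players only ever see noisy evaluations of this sort through the app; the deterministic yield is hidden from them.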
Sequential design strategies like expected improvement (Jones et al., 1998) promise a more modular approach to (so-called Bayesian) optimization, allowing fancy models to be swapped in and out, and a degree of human-free automation with light coding. My teaching and research philosophy favors modularity in implementation, and the game was in part conceived to reinforce those aspects in practice.

Several aspects of the new game setup emphasize friendly competition as a means of enhancing the learning arena. Exploring a diverse, state-of-the-art toolkit benefits from sandboxing aspects of game play, but the superlative nature of the goal—of optimizing yield—organically encourages students to adapt and improve over time. Relative benchmarks also enhance the sense of realism, albeit through somewhat artificial means. At the same time, it remains important that students feel compelled to engage with the game in a way that is advantageous pedagogically, which for some could come at the expense of winning. Toward that end, Section 3 covers player benchmarking, timing of methodology with lecture material, and an assessment strategy designed to encourage regular engagement and the deployment of a wealth of tools. Some results from a real run of the game at Virginia Tech are provided. Section 4 concludes with lessons learned and ideas for future variations. An online supplement includes a suite of supporting codes and other materials.
The core of game play is facilitated by an R shiny app, shown in Figure 1, which serves as both a multi-player portal and an interface to the back-end database of player records. An Rmarkdown document, yield.Rmd, provided with the supplementary material, compiles a full set of instructions on how to use the app, the rules of the game, and some suggestions about strategy. An HTML rendering of that document is provided at http://bobby.gramacy.com/teaching/rsm/yield.html. The salient details follow.

2.1 The shiny app
Since all game players use the same app, interaction begins with a “logging in” phase. In advance of opening the game, I asked each player to provide me with their initials (2–4 characters) and a four-digit secret PIN, the combination of which comprises the login token. Logging in involves providing “url?token” in the browser search bar. In Figure 1 the URL points to a local shiny server, and the token is “rbg4036”, a combination of my initials (I played the game too) and my office number, 403G.

Once logged in, the player is presented with three blocks of game content. A greeting block provides details on spent and total budget for experimental runs. How the budget works, costs of runs, etc., is discussed in detail in Section 2.2. As long as the player has not fully spent, or over-spent, their budget, new runs may be performed via the second block on the page. Here, the player enters the coordinates of the next run. Until all fields are populated with valid (positive, non-empty) values, helpful error messages appear on the right. The last input, which is a slider, indicates the number of replicates desired—again discussed in more detail shortly. Once all entries are valid, the error messages are replaced by a “Run” button and a warning that there are no do-overs.

Performing a run causes the table in the final block of the page to be updated, with the week, inputs, and outputs of the new run appearing at the top of the table. The primary purpose of the table is to provide visual confirmation that the run has been successfully stored in the player’s database file. Buttons are provided to aid in browsing, however this is not intended as the main data-access vehicle.

(Competition between independent agents may not be the best way to pool efforts in a real-world setting.)

Figure 1: A view of the yield simulation session. User rbg4036 may inspect the run budget, perform new runs (if budget remains), browse historical runs, or download a text file.
A “Download” button at the bottom creates a text file which can be saved via browser support—usually into the ~/Downloads directory. Empty output fields are converted to NAs in the downloaded file.

The appendix contains some details about the backend of the shiny app, including special considerations required for hosting in the cloud, say on shinyapps.io.

2.2 Rules, startup and twists

Players are encouraged to collaborate on strategy, and the development of relevant mathematical calculations, but they may not share code or data. They start with a database containing an identical design of seven inputs, each with five replicate responses, whose noise structure is explained momentarily. Students were introduced to the game in the fifth week (game week zero), and could perform their first runs in the sixth week of a 15-week semester. Including Thanksgiving and final exam weeks tallies thirteen weeks of game play.

In each week, including week zero, students accrue 100 playing units to spend on runs, with full roll-over from previous weeks. The cost structure for runs favors replicates. Obtaining the first replicate costs ten units. Replicates two through four cost an additional three units each; five through seven cost two, and eight through ten cost one each. As long as a player’s account is in the black (i.e., positive balance), s/he can perform a new run with as many replicates as desired, up to the limit of ten. If performing that run causes the balance to be zero, or negative, future runs will not be allowed (the run box on the app disappears) until the following week, after 100 new units are added.

Replication is important for two game “twists” designed to encourage players to think about signal-to-noise trade-offs, and to nudge them to spend units regularly, rather than save them all until the end of the semester.
The first is that players are told that the variance of the additive noise on yield simulations is changing weekly, following a smooth process in time, and that they will need to provide an estimate of that variance over time for their final report. They are further warned that the variance may be increasing, effectively devaluing unused units. In fact the variance in week w followed a simple sinusoidal structure σ(w) = 0.1(cos(2π(w − ws)/10) + 1) for starting week ws, which peaks in the first and tenth week. The second wrinkle is a seventh input, Nx, which is unrelated to the response. Students are not told that one of the inputs is useless, however the final writeup instructions ask for sensitivity analyses for the inputs, including main, partial dependence, and total effects.

The class was offered as a 6000-level seminar, which graduate students in statistics usually take after they’ve completed the bulk of their coursework requirements for a terminal degree (Masters or Ph.D.). Most students were, therefore, not novices in the core pillars of the course: linear and nonlinear regression, design, etc. But few had previously worked in an environment similar to the one synthesized by the game. Starting in the fifth week of the semester allowed for ramp-up on methodological training, giving students the chance to learn/review fundamentals like steepest ascent and ridge analyses (e.g., Myers et al., 2016, Chapters 5–6) before entering the game as players. Using R code provided in class, most students were able to perform runs in the first two weeks that improved upon yields from week zero, while getting a feel for the system. In subsequent weeks, new methodology was introduced which students were expected to try in the game. Most students had not, previously, been exposed to these topics, which leverage nonparametric response surfaces. Conceptual and implementational hurdles were anticipated by a careful timing of training (first examples in lecture and homework) and testing (live, in the game) stages in the course development. Details on how game play tracks course material, cookie crumbs to “catch up” stragglers, and windows into game progress are provided below.

Homeworks were assigned roughly every two weeks and each one, from the third onward, contained a problem on the game. Problem statements and solutions are provided with the supplementary material.
The third homework was the most prescriptive about what to do in the game. It instructed the student to fit a first-order model to the initial data set, reducing it via a step-wise selection procedure with BIC; in my own solution that selection retained three main effects. Then, after reducing to a first-order model having only those three components, they were asked to search for interactions. I found one.

With best fitted model in hand, students were asked to characterize a path of steepest ascent, and to obtain yield simulations along that path. This required determining values for the remaining (in my case four) inputs. No guidance was given here; I used a Latin hypercube sample, paired with six settings of the three active variables a short ways along the path of steepest ascent. Next, students were given several options about how to proceed, including a second-order ridge analysis, more exploration with space-filling designs, or more steepest ascent. In my own solution I did a bit of all three, and the result was a second-order fit to the data that had many relevant main effects, interactions, etc.

After the homework deadline, I released my solution so students could see what I did, and use it to “catch up” with other players during subsequent weeks. Then we transitioned to modern material involving Gaussian processes (GPs), presented from machine learning (Rasmussen and Williams, 2006) and computer surrogate modeling (Santner et al., 2003) perspectives. Both communities evangelize the potential for fitted GP predictive surfaces to guide searches for global optima in blackbox functions. Machine learning researchers call this Bayesian optimization, whereas computer modelers call it surrogate-assisted optimization (however increasingly they are adopting the machine learning terminology). The simplest variation involves optimizing the fitted GP predictive mean equations (Booker et al., 1999), in lieu of working directly with locally winnowed input–output data pairs.
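Rewinding to the third homework's prescription, the first-order fit, BIC-based step-wise reduction, and unit direction of steepest ascent might be sketched as follows. This is not my actual solution code: the data here are synthetic (a made-up linear truth on three inputs, and the name db), standing in for a player's downloaded database.

```r
# Synthetic stand-in for a player's database of runs (inputs plus yield y).
set.seed(1)
n <- 35
db <- data.frame(N=runif(n, 0, 6), P=runif(n, 0, 6), K=runif(n, 0, 6),
                 Na=runif(n, 0, 6), Ca=runif(n, 0, 6), Mg=runif(n, 0, 6),
                 Nx=runif(n, 0, 6))
db$y <- 2 + 0.5*db$N + 0.3*db$P - 0.2*db$K + rnorm(n, sd=0.1)

# First-order fit in all inputs, then step-wise reduction under a BIC
# penalty (k = log(n) turns step's default AIC into BIC).
fit <- lm(y ~ ., data=db)
fit.red <- step(fit, k=log(n), trace=0)

# Unit vector of steepest ascent from the retained first-order coefficients.
b <- coef(fit.red)[-1]        # drop the intercept
delta <- b / sqrt(sum(b^2))   # points "uphill" in the retained inputs
```

New runs would then be taken at x0 + t*delta for increasing step sizes t, with the inactive inputs filled in by other means (e.g., a Latin hypercube sample, as in my solution).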
The method of expected improvement (EI, Jones et al., 1998) and integrated variations (e.g., IECI, Gramacy and Lee, 2011) were subsequently introduced to better balance exploration and exploitation by incorporating degrees of predictive variability (local to global), a hallmark of statistical decision-making. This family of nonparametric approaches was reinforced in three questions spanning three subsequent homeworks. Solutions were provided (after the due dates) as plug-n-play R scripts leveraging mature laGP subroutines (Gramacy, 2016), converting a database file into a suggested new run without any additional human interaction.

As one example consider EI, which is based on the improvement: I(x) = max{0, f^min_n − Y(x)}. The quantity f^min_n is the best value of the objective so far, and Y(x) is a prediction of that objective as a function of the inputs x. Both are derived from the estimated response surface, and are thus random variables—a property inherited by I(x). Although there are many variations in the literature, the most common setup simplifies to treat f^min_n as a deterministic quantity derived from minimizing the mean predictive surface for Y(x), and supposes that surface to be Gaussian. In such cases, as arises when training a GP on input–output pairs (x_1, y_1), …, (x_n, y_n), the distribution of Y(x) is completely specified by μ_n(x) and σ_n(x), which have closed form expressions. Integrating I(x) over Y(x) is also analytically tractable in that Gaussian setting, leading to the expected improvement (EI):

E{I(x)} = (f^min_n − μ_n(x)) Φ((f^min_n − μ_n(x))/σ_n(x)) + σ_n(x) φ((f^min_n − μ_n(x))/σ_n(x)),

where Φ and φ are the standard normal cdf and pdf, respectively.
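In R, the closed form above amounts to just a few lines. The sketch below targets minimization, matching the formula; for the game's maximization of yield one would apply it to the negated response (the function name EI is my own, not laGP's interface):

```r
# Expected improvement under a Gaussian predictive distribution with
# mean mu, standard deviation sigma, and deterministic incumbent fmin.
EI <- function(mu, sigma, fmin)
{
  z <- (fmin - mu) / sigma
  (fmin - mu) * pnorm(z) + sigma * dnorm(z)
}
```

A sequential design step then maximizes EI over candidate inputs x, plugging in μ_n(x) and σ_n(x) from the fitted GP.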
Notice how EI naturally balances exploitation, μ_n(x) below f^min_n, and exploration, large σ_n(x), which is key to automating iterations of search for local optima (by maximizing E{I(x)} over x).

Homework 5, provided in the supplementary material, asked students to try EI in the game. That homework was assigned in week 7 of game play, and variations were considered in subsequent homeworks/weeks. Although timing is difficult to pinpoint exactly, the students’ implementation of EI-like methods, and the scripts provided as catch-up after the due date (week 9, see, e.g., yield_gp_ei.R in the supplementary material), are responsible for much of the progress realized in latter weeks. For example student “mds”, who had relatively mediocre success with classical RSM tools, saw big improvements [see Figure 2] from employing these more sophisticated, but more easily automated, tools.

In hopes that friendly competition would spur interest, I provided leaderboard-style views into players’ performance relative to one another, updated in real time. A live
Rmarkdown script compiled four views into player progress for web viewing. Two of those views, snapped during the final week of game play, are provided in Figure 2. On the x-axis is the week of game play (left panel) or the unique run number (right), and on the y-axis is a normalized yield response. Each player has a line in the plot, with color and type indicated by their initials (masking the pin). The responses shown have been de-noised in order to view pure progress, making normalization essential; otherwise players would know the true mean value of their best noisy response. The two other views provided by the Rmarkdown script show “raw” versions of these same plots, on the original scale—these are not shown here.

Figure 2: Two de-noised views into the real-time leaderboard during the final week of game play: best yield by week (left), and by run (right).

Observe from the panel that about half of the progress is made in the first five weeks, spanning around forty runs. Students “jh” and “fs” made rapid progress. By contrast, “hm” ends up at the same place in the end, but with more steady increments. The leaderboard would seem to partition students into three classes, comprising the top five, middle four, and lower four. Evidently, student “wt” gave up early and “ww” may have misjudged the appropriate balance of replicates versus unique runs. These students, while not having the strictly poorest overall performance in terms of max yield, received the lowest grades on the project. Their final writeup reinforced an impression of low engagement with many aspects of the competition, including a failure to capitalize on the “catch up” scripts provided as part of the homework keys. Finally, observe in the figure that my own strategy placed me fifth by these measures.
I favored replication over unique runs in hopes of obtaining better main effects, sensitivity indices, and estimates of variance over time. Very few students picked up on the small asymptotic effect of Mg, commented on the potential uselessness of Nx, or noticed the periodic effects in the noise. Although all three were evident in my own solution, that’s perhaps because I knew in advance what to look for. Despite those caveats, students with the best yield results on the leaderboard had remarkably tight estimates for the optimal coordinates of the five “active” inputs, relative both to one another and to the truth.

I have described a statistical optimization game updating a previous one in two senses. The first sense is that the game uses modern tools for implementation. Supplementary materials provide R code with support for a shiny app interface, cloud storage for instance-based hosting like shinyapps.io, and real-time views into progress via a leaderboard. Although the original version of the game was innovative, when introduced it was essentially unusable by others at the time owing to the nature of computing environments available in the early 1970s. The second update has to do with modern response surface methods. My re-casting of the game, and the homework problems from class which support it, emphasize a machine learning and computer surrogate modeling “hands-free” toolkit from the 21st century.

Upon reflection, the game perhaps could be revised to better emphasize these more modern tools. Students who made early progress on the leaderboard unfortunately realized diminishing returns from methodological upgrades developed later in the semester. In some sense, they had been “unlucky” in finding the right answer “too early” to benefit from the modern tools. I recall an office hours session where one of the better students, “fs”, was somewhat distraught by the lack of improvement in yield from his/her EI implementation.
Eventually, I had no choice but to reveal that s/he had essentially “solved it” so that s/he could move productively on to other things. In future play I may opt for a multi-modal blackbox in order to keep students engaged, and to demonstrate the explorative value that comes from EI and IECI-like heuristics appearing later in the syllabus. Another potentially exciting variation may entertain a real blackbox simulation, and perhaps one not involving agriculture, which could feel somewhat old-fashioned to the younger generation of data scientists. A promising example is the so-called assemble-to-order (ATO, Xie et al., 2012) solver. This eight-dimensional example is challenging because it has a spatially dependent noise structure.

Future versions of the class will cover heteroskedastic GP methodology (Binois et al., 2018b,a), via the hetGP package for R (Binois and Gramacy, 2017), in order to accommodate that kind of process. In its inaugural run, the game featured mild heteroskedasticity in time (i.e., over game weeks) in order to encourage regular game-play. The number of players performing runs in each week was 3, 10, 9, 10, 9, 9, 8, 10, 9, 9, 8, 8, and 10 (of 12 total), respectively. Although not a controlled experiment, these results suggest that that aspect of the game was a success. With homeworks due (at most) bi-weekly, it would be surprising to have 80% or higher engagement in weeks without homework due barring extra incentives. In the presence of spatial heteroskedastic effects, which is arguably more realistic than sinusoids in time, an alternate incentive for regular engagement may be needed.

Acknowledgments
I am grateful to students and colleagues at Virginia Tech who helped curate, and participate in, my class on modern response surfaces and computer experiments. Thanks to the TAS editorial team for many thoughtful comments.
A Backend details
Under the hood, the app maintains the player database as files quite similar to those offered for downloading. If the app is being hosted on a standalone node running a shiny server, those files may be stored locally in the app’s working directory. However, if the game is hosted in the cloud, say on shinyapps.io, the database can only temporarily be stored locally, while that instance is active. Ensuring that each of multiple running instances share the same game database, and guaranteeing its integrity after instances time-out due to inactivity (which causes the running directory to be purged), requires that the files be stored elsewhere, in a single persistently available location in the cloud. I hosted my game at http://gramacylab.shinyapps.io/yield, with database files accessed through the rdrop2 package (Ram and Yochum, 2017): the app uses rdrop2 calls to drop_get files into the local working directory, and subsequently drop_upload to sync those with new runs. (All of the player files are downloaded as part of the instance initialization, which can unfortunately be time consuming. However, the instance is then ready to serve multiple players, if needed.) Note that the Rmarkdown leaderboard would also pull from the same rdrop2 API. See the supplementary materials for more details. The default setup of those codes accommodates local (i.e., non-cloud) implementation, since that setup most expediently facilitates development and other tinkering. Commented code and description therein clarifies how to engage the rdrop2 features for production in the cloud.
References
Binois, M., Gramacy, R. B., Huang, J., and Ludkovski, M. (2018a). “Replication or exploration? Sequential design for stochastic simulation experiments.” Technometrics, to appear; preprint on arXiv:1710.03206.

Binois, M. and Gramacy, R. B. (2017). hetGP: Heteroskedastic Gaussian Process Modeling and Design under Replication. R package version 1.0.0.

Binois, M., Gramacy, R. B., and Ludkovski, M. (2018b). “Practical heteroskedastic Gaussian process modeling for large simulation experiments.” Journal of Computational and Graphical Statistics, to appear; preprint on arXiv:1611.05902.

Booker, A. J., Dennis, Jr., J. E., Frank, P. D., Serafini, D. B., Torczon, V., and Trosset, M. W. (1999). “A Rigorous Framework for Optimization of Expensive Functions by Surrogates.” Structural Optimization, 17, 1–13.

Box, G. E. and Draper, N. R. (1987). Empirical Model-Building and Response Surfaces. John Wiley & Sons.

Chang, W., Cheng, J., Allaire, J., Xie, Y., and McPherson, J. (2017). shiny: Web Application Framework for R. R package version 1.0.5.

Gramacy, R. (2016). “laGP: Large-Scale Spatial Modeling via Local Approximate Gaussian Processes in R.” Journal of Statistical Software, 72, 1, 1–46.

Gramacy, R. B. and Lee, H. K. H. (2011). “Optimization under Unknown Constraints.” In Bayesian Statistics 9, eds. J. Bernardo, S. Bayarri, J. O. Berger, A. P. Dawid, D. Heckerman, A. F. M. Smith, and M. West, 229–256. Oxford University Press.

Jones, D. R., Schonlau, M., and Welch, W. J. (1998). “Efficient Global Optimization of Expensive Black Box Functions.” Journal of Global Optimization, 13, 455–492.

Lee, H. K. H. (2007). “Chocolate Chip Cookies as a Teaching Aid.” The American Statistician, 61, 4, 351–355.

Mead, R. and Freeman, K. H. (1973). “An Experiment Game Demonstrating the Importance of Planned Experiments.” Journal of the Royal Statistical Society. Series C (Applied Statistics), 22, 1, 1–6.

Myers, R. H., Montgomery, D. C., and Anderson-Cook, C. (2016). Response Surface Methodology: Process and Product Optimization Using Designed Experiments. John Wiley & Sons.

Nelder, J. (1966). “Inverse polynomials, a useful group of multi-factor response functions.” Biometrics, 22, 128–141.

R Core Team (2017). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria.

Ram, K. and Yochum, C. (2017). rdrop2: Programmatic Interface to the ’Dropbox’ API. R package version 0.8.1.

Rasmussen, C. E. and Williams, C. K. I. (2006). Gaussian Processes for Machine Learning. The MIT Press.

Santner, T. J., Williams, B. J., and Notz, W. I. (2003). The Design and Analysis of Computer Experiments. Springer-Verlag, New York.