When black box algorithms are (not) appropriate: a principled prediction-problem ontology
Jordan Rodu and Michael Baiocchi
University of Virginia and Stanford University
Abstract.
In the 1980s a new, extraordinarily productive way of reasoning about algorithms emerged. Though this type of reasoning has come to dominate areas of data science, it has been under-discussed and its impact under-appreciated. For example, it is the primary way we reason about "black box" algorithms. In this paper we analyze its current use (i.e., as "the common task framework") and its limitations; we find that a large class of prediction-problems is inappropriate for this type of reasoning. Further, we find the common task framework does not provide a foundation for the deployment of an algorithm in a real world situation. Building off of its core features, we identify a class of problems where this new form of reasoning can be used in deployment. We purposefully develop a novel framework so both technical and non-technical people can discuss and identify key features of their prediction problem and whether or not it is suitable for this new kind of reasoning.
Key words and phrases: machine learning, black box, algorithms, reasoning.
Jordan Rodu is Assistant Professor, Department of Statistics, University of Virginia, Halsey Hall, Charlottesville, Virginia 22903 (e-mail: [email protected]). Michael Baiocchi is Assistant Professor, Department of Epidemiology and Population Health, School of Medicine, Stanford University, Redwood Building, 150 Governors Lane, Stanford, California 94305 (e-mail: [email protected]).
1. INTRODUCTION
It is exciting to witness the development of flexible, fast, and useful predictive algorithms. Algorithms are driving cars, identifying breast cancers, enabling globe-spanning businesses, and organizing unprecedented amounts of information; prediction provides a strong foundation for technological innovation and scientific breakthroughs (Shmueli and Koppius, 2011). The excitement about these algorithms is warranted; these achievements are unparalleled in history. A very natural question for members of the business, government, and academic communities is: "how can we use them?" This simple question is more complicated than it first appears because modern predictive algorithms have several very distinctive features, including this one: some of the most successful algorithms are so complex that no person can describe the mathematical features of the algorithm that give rise to the high performance (a.k.a. "black box algorithms"). There is an important tension right now because these extraordinary black-box algorithms exist – with so much potential to do good – despite deep uncertainty about when and how to use them. And this tension is warranted: for all of their achievements, black-box algorithms have been shown to be unpredictably brittle in the real world. This is a consequence of how they are developed.

These algorithms have come into existence through a confluence of innovations. Some of these innovations are practical (e.g., the price of computing has continued to drop), some are market-based (e.g., online platforms have proved to be profitable business models and so corporations have funded much of this research and development), some are due to political decisions (e.g., emphasis on STEM fields has created a big pipeline of data scientists), but a major – yet also under-appreciated – shift has come from a new framework for assessing algorithms, a framework that does not require slow-moving mathematical proofs. While understanding black box algorithms is not possible, by understanding how they are being developed and assessed we can understand what situations are more – and less – compatible or safe for the use of black box algorithms. To provide guidance on using prediction algorithms, this paper offers a new framework for stakeholders (e.g., business people, government officials, non-statistically minded academics) to discuss and critique the use of these algorithms. For reasons which will become clear in section 2, we call this framework MARA(s).
Below we give a short introduction to the intellectual engine that has driven much of the recent innovation, and the one that has consequences for how algorithms are deployed.

The Common Task Framework (CTF) (Liberman, 2010; Donoho, 2017; Breiman, 2001) provides a fast, low-barriers-to-entry means for researchers to settle debates about the relative utility of competing algorithms. This is in contrast to the traditional use of mathematical descriptions of the behavior of an algorithm, or simulations of the algorithm's ability to recover parameters of a data generating function. Many readers are likely familiar with the CTF even if the name is unfamiliar; the NetFlix Prize (Bennett, Lanning and Others, 2007) and Kaggle competitions are excellent examples of this framework. The key features of the CTF are: (a) curated data that have been placed in a repository; (b) static data (all analysts have access to the same data); (c) a well defined task (e.g., predict y given a vector of inputs x for previously unobserved units of observation); (d) consensus on the evaluation metric (e.g., the mean squared error of the predictions from the algorithm on a set of observations); and (e) an evaluation data set with observations which have not been accessible to the analysts. Today, in practice, some of the features of the CTF are relaxed. In particular, outside of the major competitions, feature "e" is often self-policed – i.e., the analyst has direct access to the evaluation data set. When performed correctly, the CTF gives us a way of justifying a claim that "Algorithm A performs better than Algorithm B."
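To make features (a)–(e) concrete, the sketch below (ours, not drawn from the CTF literature) sets up a tiny CTF-style comparison; the data, the two competing "algorithms," and the function names are hypothetical stand-ins.

```python
# Minimal sketch of a CTF-style comparison: a fixed, curated data set,
# a well defined task, an agreed-upon metric, and a held-out evaluation set.
import numpy as np

rng = np.random.default_rng(0)

# (a, b) Static, curated data shared by all analysts.
X = rng.normal(size=(500, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.5, size=500)

# (e) Evaluation data withheld from the analysts during development.
X_train, y_train = X[:400], y[:400]
X_eval, y_eval = X[400:], y[400:]

# (c) The task: predict y from x for previously unobserved units.
def predict_mean(X_tr, y_tr, X_new):
    """Baseline 'Algorithm A': predict the training mean."""
    return np.full(len(X_new), y_tr.mean())

def predict_ols(X_tr, y_tr, X_new):
    """'Algorithm B': ordinary least squares."""
    beta, *_ = np.linalg.lstsq(X_tr, y_tr, rcond=None)
    return X_new @ beta

# (d) The agreed-upon evaluation metric.
def mse(y_true, y_pred):
    return float(np.mean((y_true - y_pred) ** 2))

# The "leader board": algorithms ranked purely by the outcome metric.
scores = {
    "Algorithm A (mean)": mse(y_eval, predict_mean(X_train, y_train, X_eval)),
    "Algorithm B (OLS)": mse(y_eval, predict_ols(X_train, y_train, X_eval)),
}
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name}: MSE = {score:.3f}")
```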
More than just a form of justification, the CTF provides an efficient environment for development. The data exist already. All analysts have access to these common data so many people can work on the problem at the same time. Fast computation takes the place of proving theorems, and performance is quickly assessed using held-out data. The consequences of a poorly performing prediction algorithm in the CTF are minimal – e.g., after a failure the analyst tweaks the algorithm and tries again. Fundamentally, the CTF takes complex real-world problems and sand-boxes them.

In the CTF, because there is a specific performance metric, there is little ambiguity in the relative ordering of the algorithms conditional on a particular dataset. The ordering gives rise to a ranking of the algorithms, and the public display of these rankings is called "leader boards." Leader boards are cited as using competition to motivate analysts to reach high levels of performance. The underlying logic of using leader boards is new and productive. Leader boards rely on a form of reasoning we call "outcome-reasoning" (discussed in detail in 4.1).

For the reasons described in the previous two paragraphs, outcome-reasoning – if appropriate – is preferred. However, many people are deploying black-box algorithms, which rely on outcome-reasoning, in problem settings where outcome-reasoning is unavailable. In these problem settings, "model-reasoning" should be used. Many current debates in the literatures about the suitability of black box models hinge on (mis)understandings about what kind of reasoning is appropriate for given problems. The goal of this paper is not to resolve these debates but to provide a useful framework for understanding the type of problem for which a solution is sought.

In this paper, we make three fundamental shifts in focus from how current debates about black box algorithms usually proceed. We make these shifts explicit here, to prevent reading this paper as contradicting existing decision-making frameworks that analysts are already using to assess black box algorithms. Instead of being in contradiction, the three shifts listed below indicate that our considerations occur earlier in the data-driven decision-making pipeline than do existing frameworks. The three shifts are:

1. Our ontology starts with stakeholders defining a prediction-problem. This means we think about how to elicit ideas and feedback from stakeholders. We work through several examples in Section 3, but we call your attention to the example on recidivism in subsection 3.4. An immediate implication of our ontology's focus on stakeholders is that changes in stakeholder membership are likely to change how we think about the prediction-problem.
2. We focus on crafting the problem and its consequences. This means we are engaging the problem quite early in the process – helping to understand, shape, and quantify the key issues. We do not start after the "prediction task" already exists; the task is not given to us. Like in experimental design, the stakeholders must first think about what they want to accomplish by using data, and then we help to turn that into a specification of the problem.
3. Our ontology focuses on features of the problem, rather than features of the algorithm. Problem features are a consequence of both real world constraints, and the interplay between those constraints and stakeholders' goals (see section 3 for detailed examples).
We designed this paper to be accessible to different readers. Policy-minded readers, scientists, and non-technical academics will find the information most useful to them in sections 2 and 3. Analysts and technical academics who are asked by colleagues to select appropriate algorithms will find that sections 2, 4, and 5 provide useful context and language to guide these conversations. This modularity leads to some slight repetition between sections, but even those in the algorithm research and development communities will benefit from the reinforcement of key concepts.
2. INTRODUCTION TO THE PRINCIPLED PREDICTION-PROBLEM ONTOLOGY
In any interesting prediction problem, errors will occur. Viewed one way, the Prediction-Problem Ontology is a framework for achieving buy-in from stakeholders before these errors start to accumulate. Imagine for a moment that a black box algorithm is shown to outperform all other existing algorithms in the training and test data. Based on this information alone, the algorithm is deployed. After deployment a terrible event occurs due to errors in the predictions. Who is accountable for the consequences of these errors? The reasoning that went into the algorithm's justification turned on knowledge owned by the analyst: Were the training and test data sets adequate? Were the algorithms selected for consideration adequate? Was the performance metric informative? Did the analyst protect against over-fitting? These are technical questions. How could stakeholders be held accountable if the algorithm cannot be interrogated by other means? So who is accountable in this situation? The decision-making turned on trust in the analyst's judgment.

We can avoid this dynamic by including non-technical stakeholders in the decision-making process. To do so we propose the Principled Prediction-Problem Ontology, which extends the features of the Common Task Framework. We call this framework MARA(s).

In MARA(s) we classify problems using four features ("problem-features"), which we refer to collectively with the mnemonic "MARA":

1. [measurement] ability to measure a function of individual predictions and actual outcomes on future data,
2. [adaptability] ability to adapt the algorithm on a useful timescale,
3. [resilience] tolerance for accumulated error in predictions, and
4. [agnosis] tolerance for potential incompatibility with stakeholder beliefs.

Stakeholders classify their problem as either "satisfying" or "not satisfying" each problem-feature individually. (In some settings it may be more appropriate to relax the binary classification.) If a problem satisfies the MARA problem-features then the problem is suitable for outcome-reasoning – that is, the powerful form of reasoning that is the foundation for the CTF. If the problem fails to satisfy even one of the features then the problem requires a more complex form of reasoning to justify the algorithm's deployment – i.e., model-reasoning (discussed in detail in 4.1). In section 2.1, we extend the MARA acronym to MARA(s) to emphasize the importance of stakeholder composition in deployment. Our framework's name, MARA(s), derives from the extended acronym.

Two dynamics contribute to this framework. First, MARA(s) extends the CTF to "live" problems, providing a principled foundation for assessing the generalizability and transportability of an algorithm into the real world. Note that the curation process that goes into creating a problem for use in the traditional CTF – i.e., abstracting data sets from their applied example, providing labeled outcomes of interest, and standardizing the task and performance metric – ensures satisfaction of the MARA problem-features. Algorithms that are developed in the CTF can be successfully applied to problems in the real world that similarly satisfy the MARA problem-features. However, these algorithms may be unsuitable for deployment in problems that do not satisfy the problem-features. MARA(s) provides language to clarify a problem's features and facilitate debate among stakeholders about the suitability of an algorithm.

The second, and most fundamental, dynamic is that MARA(s) starts with the stakeholders – that is, the people who are accountable for the performance of the algorithm. Starting with stakeholders has two major implications: (a) the focus of this assessment is based on understanding the problem itself rather than the algorithm, and (b) if different stakeholders are brought in and out of the group assessing the problem then the group can reach very different conclusions about the appropriateness of a black box algorithm (see the recidivism example below for how this works).

Fig 1. Workflow
This framework implies a workflow (see Fig 1) in which stakeholders first engage a prediction problem through MARA(s). This occurs prior to any technical considerations (e.g., using cross validation to assess the fit of the algorithm). Only after classification under MARA(s) can analysts identify suitable algorithms that adhere to the proper form of algorithmic reasoning, and assess the potential algorithms for their technical merits. Finally the analyst can deploy the algorithm with proper re-assessment, depending again on the class of reasoning chosen through MARA(s).
One of the fundamental tenets of MARA(s) is that it focuses energy in decision making on the problem at hand. This forces the question of who is defining the problem, or who is affected by the problem. These are the "stakeholders." Until this point, we have avoided specifying who the stakeholders are, as this depends on the context of the scenario under which the problem is formulated. In this subsection we spend a few paragraphs working through problems and focus on who stakeholders might be. The notion of the stakeholder is quite flexible and requires explicit consideration.

First, note that the definition of a stakeholder is contingent on several decisions. For instance, a hospital considering the adoption of software to automate or augment a task might very well consider the stakeholders to be the upper management who oversee the overall financial health of their institution. However, they might also consider the doctors or nurses as stakeholders, who might bring additional constraints to considerations of satisfaction of MARA. In some cases, the hospital might wish to consider a hypothetical patient as a stakeholder, who would in many scenarios impact how we think about the adaptability problem-feature of MARA(s). MARA(s) does not prescribe who is a stakeholder. Rather,
once stakeholders have been identified, MARA(s) aids in the discussion of the merits of various algorithms.

Now consider how a business might define stakeholders in several ways. In some scenarios, the stakeholders might be the C-suite, who are responsible for the big decisions a company has to make. Alternatively, a business might engage employees at any level, or even customers, in understanding whether their problem satisfies the MARA constraints. And even the notion of customer may have different interpretations here – current customers versus future customers – which may diverge in their interests. It is likely that in many settings, in order to anticipate more long-term consequences of their decisions, the company will need to include shareholders as stakeholders.

While the term "stakeholder" might conjure a business application or community mobilization, we intend for it to refer to a broad range of disciplines, including academia. A scientist, for instance, should define stakeholders when engaging MARA(s). In some cases, the stakeholder will be the scientist. In other cases, the scientist might wish to include her peers as hypothetical stakeholders, perhaps representing the pursuit of knowledge in a broad sense. In some cases it might make sense for the scientist to consider journal editors as stakeholders. With an eye towards getting published, the scientist might put more weight on the agnosis problem-feature when considering editors as stakeholders, for instance, than she otherwise might.

One of the most important examples of stakeholder involvement arises in government. Governing committees often seek involvement and input from a cross-section of the population, with particular emphasis on those most impacted by decisions. In large part, MARA(s) was developed with this kind of dynamic in mind: bridging the communication gap between non-technical stakeholders and analysts. While stakeholders may benefit from the use of these algorithms, they should not be required to understand the technical issues. With the appropriate framework, stakeholders can offer useful critiques and assessments of the important issues which can be incorporated into the design of the prediction-problem.

MARA(s) defines a framework for engaging decisions on deploying algorithms. There are two levels to this debate. The first is on the scope of the stakeholders. The second is on the satisfaction of MARA, given the stakeholders. While we
anticipate a lot of time and emphasis to be placed on whether or not a problem-feature is satisfied, it is usually most important to debate who should be included as stakeholders, as inclusion/exclusion will greatly impact the assessment of the MARA problem-features.

The importance of the definition of the stakeholders motivates our extension of the MARA acronym to MARA(s) in naming our framework. This extension reminds the user that the definition of stakeholders is an important aspect of our framework, while still keeping it separated from the MARA conditions by way of the parenthetical enclosure. For readers familiar with function notation, MARA(s) emphasizes that the satisfaction of MARA problem-features is a function of the stakeholder composition. While stakeholder selection is critical to MARA(s), the framework is not designed to facilitate the selection process. We hope that the process of stakeholder selection becomes an active area of thinking across many interested domains, and that MARA(s) helps to clarify the critical parameters of that debate.

We flag an important issue here: there can be deep ethical concerns about how stakeholders are included and excluded. We do not engage these concerns in this paper, but we do want to emphasize that we offer a language for critique: (i) "I should have been a stakeholder when the problem was being defined." (ii) "If I had been a stakeholder then I would have argued that this problem fails both adaptability and agnosis." It is an important feature of this framework to provide language that can express that different sets of stakeholders – e.g., s_1 and s_2 – will have different assessments of satisfying the MARA: MARA(s_1) ≠ MARA(s_2). A different theory is needed to think about how we should go about including and excluding stakeholders.
3. EXAMPLES
In this section we discuss several common prediction problems. Each example was chosen to highlight different aspects of MARA(s). Each of the MARA problem-features is considered, as well as how stakeholder selection impacts these considerations.
The goal of a recommendation system is to introduce users to products of interest. Training data for a recommendation system often is represented as a sparse matrix with individuals on the rows and products on the columns. If the i-th user has engaged with the j-th product and rated it, the ij entry of the matrix will contain the user-product rating. Other entries will be blank. To select algorithms, some of the known entries are obscured, and the task is to predict the rating for those user-product pairs. In deployment, algorithmic performance can be measured as a function of the individual outcome of each recommendation, which might take the form of a) acceptance of the recommendation with a subsequent high rating, b) acceptance with a low rating, c) acknowledgement of the recommendation without taking it, or d) no acknowledgement of the recommendation. Technically, only some of the predictions are verified in the wild. When the system predicts that the user will assign a low rating to a product, that product will not be recommended, and that rating might not be verified. But the ultimate goal of the recommendation system is to provide good recommendations to make money. Often there is a large pool of products, many of which could be recommended to the user with great success. The loss function, then, is asymmetric. If a product is recommended to a user, it is desirable that the product will obtain a high rating. On the other hand, if a product that would otherwise have been enjoyed by a user is not recommended, because of the large pool of potential products, missing this product is not such a big deal. In general, recommendation systems satisfy the MARA problem-features, though specific examples could be constructed in which some of the problem-features might fail to be satisfied.
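As a minimal illustration (ours) of the selection scheme just described – obscuring some known entries of the ratings matrix and predicting them – the sketch below uses a deliberately simple per-product mean as a hypothetical stand-in for a real recommendation algorithm.

```python
# Obscure a subset of observed user-product ratings, predict them, and score
# the predictions; the "column mean" predictor is a hypothetical placeholder.
import numpy as np

rng = np.random.default_rng(1)

n_users, n_items = 200, 50
ratings = rng.integers(1, 6, size=(n_users, n_items)).astype(float)
observed = rng.random((n_users, n_items)) < 0.10      # sparse: ~10% rated

# Obscure 20% of the observed entries to use as the evaluation set.
obs_idx = np.argwhere(observed)
rng.shuffle(obs_idx)
n_hidden = len(obs_idx) // 5
hidden_idx, train_idx = obs_idx[:n_hidden], obs_idx[n_hidden:]

train_mask = np.zeros_like(observed)
train_mask[tuple(train_idx.T)] = True

# Hypothetical predictor: each product's mean rating among training entries.
item_means = np.array([
    ratings[train_mask[:, j], j].mean() if train_mask[:, j].any() else 3.0
    for j in range(n_items)
])
preds = item_means[hidden_idx[:, 1]]
truth = ratings[tuple(hidden_idx.T)]

rmse = np.sqrt(np.mean((truth - preds) ** 2))
print(f"RMSE on obscured user-product pairs: {rmse:.3f}")
```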
Consider a high-frequency trading group with large reserves and a goal of profit maximizing. In order to construct a risk-diversified portfolio, one sub-goal in high frequency trading is to predict the instantaneous covariance between stocks. While one cannot directly assess the accuracy of the covariance prediction, one could use an external measure of performance that is a direct consequence of the prediction to monitor the success of the algorithm. For example, the group can monitor whether the ongoing use of the algorithm increases the value of their holdings. In this setting, all four problem-features are satisfied.

In contrast, a manager of family wealth may require an algorithm that can be evaluated by the family so that they can assess whether or not the algorithm's anticipated behavior matches their beliefs about the market, or to verify that the trading algorithm comports with their ethical concerns. In this case, agnosis would not be satisfied.
An interesting application of prediction algorithms is to use them as cheap measurements in lieu of obtaining expensive, gold-standard labels. Consider an automated system for triaging mild health symptoms. Instead of using a telephonic nurse-based system, automated prediction algorithms could be used to offer some level of diagnosis and either recommend that a patient seek further help or not. From the perspective of a health care administrator, triaging mild health symptoms may satisfy all four problem-features because the administrator considers outcomes of many patients who interact with the health system. But if administrators take as their goal the delivery of care to a particular patient then the problem fails measurement because the prediction is explicitly deployed in lieu of an actual diagnosis. That is, the point of such a prediction algorithm is to skip the burden of obtaining the desired measurement.

Importantly, if the patient is included as a stakeholder, then the adaptability problem-feature is also violated since this is a one-shot prediction for this patient. This is a general principle: problem-feature satisfaction depends on who is considered as a stakeholder. That is, in general,
MARA(s_i) ≠ MARA(s_j).

If an algorithm is used to predict recidivism and all defendants with a score above a certain threshold are incarcerated, then we are unable to observe the correctness of our predictions for people who are incarcerated. This arises from missingness in the outcome, and will happen in general when the prediction algorithm causes changes in the outcome. The recidivism problem also violates the resilience and agnosis problem-features. For the resilience problem-feature, the debate hinges on the stakeholders' concerns about depriving rights through unnecessary incarceration, balanced against possible future criminal acts. Failure to satisfy agnosis stems from two concerns. First, it is necessary to explain to the defendant why the decision to incarcerate was made. Second, even if the algorithm were a flawless predictor, if it did so through morally repugnant means then stakeholders would need to know that it is achieving its predictions this way and these means would need to be debated by stakeholders (these kinds of concerns are often referred to as "deontological" – roughly: having to do with ethical considerations about the actions taken rather than just concerns about the outcomes achieved).
Suppose we run a website that can place only one of three ads – corresponding to one of three items for purchase – for each customer who arrives at the website. We are unsure of which ad to place for a given customer. Let us consider whether a black box algorithm is suitable for use in learning an optimal assignment of ads. In this setting, a useful algorithm estimates how much the probability of buying a product changes if the website shows a certain ad to a particular customer. At first pass it may seem best to assign customers to the ad that will increase their probability of purchase the most – but that is true only if the analyst knows how the probabilities of purchasing change given an ad, up to some tolerance. The prediction problem is thus to learn these probabilities, and the question is whether or not this is a suitable problem for outcome-reasoning.

Similar to the recidivism example above, for any particular person we can only measure the outcome for one of the ad placements. It is possible, though, to target a useful function of these outcomes: the average outcome after exposure to the ad placement conditional on some set of covariates (this is often referred to as the conditional average treatment effect and can often be used to build up other popular estimands). This can be done by having the algorithm assign the ad placements in such a way that for any person there is some chance of seeing each ad – i.e., 0 < p_1, p_2, p_3 < 1. This is how randomized trials obtain causal estimates. In contrast to a uniform randomized controlled trial – where the assignment to treatment is proportional to rolling a die with three equally weighted outcomes – in this setting we can use some more sophisticated algorithm that may be able to obtain better estimates of the probabilities more efficiently. This kind of thinking has led to interesting work on contextual bandit theory, and other forms of adaptive trials.
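A minimal sketch (ours) of the idea above: keep every ad's assignment probability strictly between 0 and 1 while tilting assignment toward the ad that currently looks best. For brevity the sketch ignores covariates (a contextual bandit would condition the choice on x); the purchase probabilities and the simple epsilon-greedy rule are hypothetical choices, not a prescription.

```python
# Randomized ad assignment with a floor on exploration, so that
# 0 < p1, p2, p3 < 1 holds for every customer.
import numpy as np

rng = np.random.default_rng(2)

true_purchase_prob = np.array([0.05, 0.12, 0.08])   # unknown in practice
shown = np.ones(3)          # exposures per ad (start at 1 to avoid 0/0)
bought = np.zeros(3)        # purchases per ad
epsilon = 0.2               # floor on exploration

for t in range(5000):
    estimates = bought / shown
    # Epsilon-greedy: explore uniformly with prob. epsilon, otherwise exploit.
    if rng.random() < epsilon:
        ad = rng.integers(3)
    else:
        ad = int(np.argmax(estimates))
    # Every ad keeps assignment probability at least epsilon / 3 > 0, which is
    # what lets us estimate the average outcome under each ad placement.
    purchase = rng.random() < true_purchase_prob[ad]
    shown[ad] += 1
    bought[ad] += purchase

print("estimated purchase probabilities:", np.round(bought / shown, 3))
print("exposures per ad:", shown.astype(int))
```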
ARA(S) In most randomized controlled trials (RCTs), the goal is to assess the compat-ibility of beliefs with reality, thus traditional RCTs fail agnosis. When humansare the subjects in RCTs, issues related to resilience are considered by the insti-tutional review board. But when the MARA problem-features are satisfied, thisgives rise to a peculiar problem type: an atheoretic randomized controlled trialwith treatment assignment determined by a black box algorithm.
4. REASONING ABOUT ALGORITHMS
In this section we describe the types of reasoning that can be used for assessing an algorithm.
MARA(s) concentrates on the degree to which beliefs, or the current state of knowledge in the given domain, are used to constrain a model. While content knowledge is rarely ignored entirely, its utilization in assessment can vary from heuristically informing the choice of method to actively validating the learned parameters. We now introduce two forms of assessment that reside on either end of this spectrum: model-reasoning and outcome-reasoning.

Model-reasoning requires checking that the model conforms to current beliefs. For example, consider a linear regression; we can use model-reasoning by verifying that the direction of individual coefficients matches what is expected from our domain knowledge. We can think of these checks as a mapping from the model, or "parameters" of the model, to the space of current beliefs. In model-reasoning, it is therefore possible to hypothesize how a particular instantiation of a model, f̂, will perform on future data without reference to the data used to fit the algorithm. This provides solid ground on which experts in a field can debate and discuss the fitted algorithm and its suitability for future predictions in a concise manner, with discussions stemming from beliefs, and not potential difficult-to-find shortcomings of the data set or algorithm.

In contrast, outcome-reasoning relies very little on beliefs, which primarily enter into consideration through the choice of the performance metric. Outcome-reasoning is the reasoning used in the CTF. In MARA(s), outcome-reasoning is extended to the ongoing, out-of-sample prediction setting. If the four problem-features are satisfied then the analyst and stakeholders can monitor the performance metric during the deployment phase in order to assess whether or not their algorithm is working.

The two types of reasoning lead to two different kinds of thinking when comparing algorithms. Model-reasoning tends to involve discussions of parameters and the algorithm's ability to faithfully recover the parameters. That is, model-reasoning forces the analyst to think carefully about how changes in the covariates should be linked to variation in the outcome (e.g., should we predict that a taller person weighs more than a shorter person?). But when divergent algorithms are compared – e.g., say an ARMA(p,q) is compared to a decision tree – it is quite challenging to translate between different conceptualizations of how the input space is linked to the outcome space. Consequently, given the challenge of translating between algorithms using model-reasoning, comparisons tend to be pairwise and slow. In stark contrast, outcome-reasoning assiduously avoids any debates in the input space. Instead, outcome-reasoning operates in the space of the outcome – where all candidate algorithms must operate. An analogy to capture this dynamic: consider two economies. Model-reasoning is a bit like a barter-based economy; each transaction requires careful consideration of idiosyncratic features and how much the parties need each of the products. An economy that uses currency to store value allows a lower friction form of transaction; each product's value is translated into the currency and then comparisons can be made rapidly between different products. Now imagine these two economies and their ability to develop, innovate, and scale.
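To make the contrast concrete, the sketch below (ours) applies model-reasoning to the linear-regression example given earlier: the fitted coefficients are checked against stakeholder beliefs about their signs, with no reference to held-out data. The covariates, the data, and the expected signs are hypothetical.

```python
# Model-reasoning on a fitted linear regression: compare coefficient signs
# to domain beliefs, without touching held-out data.
import numpy as np

rng = np.random.default_rng(3)

# Toy data: weight (kg) as a function of height (cm) and daily exercise (hrs).
n = 300
height = rng.normal(170, 10, n)
exercise = rng.uniform(0, 2, n)
weight = 0.9 * height - 4.0 * exercise + rng.normal(0, 5, n)

X = np.column_stack([np.ones(n), height, exercise])
beta, *_ = np.linalg.lstsq(X, weight, rcond=None)

# Stakeholder beliefs: taller people weigh more; more exercise, lower weight.
expected_signs = {"height": +1, "exercise": -1}
fitted = {"height": beta[1], "exercise": beta[2]}
for name, sign in expected_signs.items():
    ok = np.sign(fitted[name]) == sign
    print(f"{name}: coefficient {fitted[name]:+.2f}, matches belief: {ok}")

# Outcome-reasoning would instead ignore beta entirely and simply track an
# error metric on held-out or future predictions.
```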
Many, though not all, concerns about black box algorithms can be framed as issues of extrapolation. While the CTF offers a foundation for comparing algorithms' performance on currently available data, the CTF alone does not offer a principled foundation for reasoning about the future, out-of-sample performance of an algorithm. Without access to model-reasoning, the mechanisms for reasoning through performance on future data, and not just held-out data, are limited. Such reasoning would require careful consideration of the interaction between properties of the data set and the properties of the algorithm. In a setting that uses a black box algorithm that requires massive training data, understanding the data itself can be impossible. Even under a smaller data regime, the task of unpacking the data/algorithm interaction can be difficult, if not impossible. Indeed,
one common fix when a black box fails is to add data to the training data set in hopes that a new fit on the new data might remedy the failure. The underlying cause of the failure with respect to the current f̂ often remains unknown.

An important distinction between the two modes of reasoning: model-reasoning allows for detailed debate to happen before the deployment of the algorithm, whereas outcome-reasoning affords assessment purely post-deployment.

We now discuss the four problem-features in more detail. Collectively, we refer to the four problem-features as MARA.
The first problem-feature is the ability to measure a function of the predictions and actual outcomes on future data. Let y* be the value of a future outcome associated with predictors x*, and let f̂(x) denote the estimated prediction function. For some agreed-upon notion of close, this problem-feature describes the ability to track whether f̂(x*) is close to y*, measured as g(y*, f̂(x*)), within some reasonable tolerance.

This is the most foundational problem-feature for the MARA(s) framework. If this problem-feature is not satisfied the analyst will not be able to verify whether the algorithm is performing well. The use of a black box model for a problem that does not satisfy measurement requires faith. If this problem-feature is satisfied then the algorithm's performance can be monitored after deployment by monitoring the error function g() (notably, both Google (Google, 2019) and Uber (Hermann et al., 2018) include monitoring predictions after algorithm deployment as a critical component of their machine learning workflows). Without this feedback mechanism, assessment must happen before deployment. In this case, our framework requires that the analyst pursue model-reasoning.
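A minimal sketch (ours) of what satisfying the measurement problem-feature permits: the error function g(y*, f̂(x*)) is tracked on incoming outcomes after deployment, and an alert is raised when a rolling average exceeds an agreed tolerance. The deployed predictor, the simulated drift, and the tolerance are all hypothetical.

```python
# Post-deployment monitoring of the error function g on future observations.
from collections import deque
import numpy as np

rng = np.random.default_rng(4)

def f_hat(x):
    """Stand-in for the deployed prediction function."""
    return 2.0 * x

def g(y_true, y_pred):
    """Agreed-upon error function, here squared error."""
    return (y_true - y_pred) ** 2

tolerance = 2.0
window = deque(maxlen=200)       # rolling window of recent errors

for t in range(2000):
    x_star = rng.normal()
    drift = 0.0 if t < 1000 else 3.0          # the world changes at t = 1000
    y_star = 2.0 * x_star + drift + rng.normal(scale=0.5)
    window.append(g(y_star, f_hat(x_star)))
    if len(window) == window.maxlen and np.mean(window) > tolerance:
        print(f"t={t}: rolling error {np.mean(window):.2f} exceeds tolerance")
        break
```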
Problem-feature 2 is the ability to adapt the algorithm on a useful timescale. In some settings, upon discovering errors in prediction, an algorithm can be updated quickly and will be presented with sufficient opportunity to update. In other settings, the underlying dynamics of the population of interest change at a rate such that those changes dominate algorithm adaptations from observed error. The latter situation renders predictions as one-shot extrapolations, at which point the observation of the function g() is useless.

For instance, predicting the outcome of the United States presidential election depends on measuring the ebb and flow of priorities of the voting population. With an algorithm assessed under outcome-reasoning, the lessons learned from prediction errors in one election may not be informative for the next election because the underlying priorities of the population may have shifted. Another common violation of adaptability occurs when the deployment of the algorithm itself changes the way the outcomes are generated; this phenomenon has been described many times in policy settings – the Lucas Critique, Goodhart's law, and Campbell's law being famous formulations.

Problem-feature 3 describes the stakeholders' tolerance for accumulated error in predictions. As errors accumulate, someone or some group will be held accountable. Some stakeholders will see errors in prediction as so intolerable as to bar any unjustified use of an algorithm, for example when an error in prediction may lead to a death or a false incarceration. On the other end, settings like recommendation algorithms may be viewed as having minimal consequences for errors in prediction. Most scenarios will be somewhere in between, where the stakeholders are willing to trade off some unaccounted error in predictions against the accumulation of value gained from better predictions. If the group deploying the algorithm has large reserves relative to the costs of accumulated errors then the problem at hand satisfies resilience.
Problem-feature 4 describes tolerance for incompatibility with stakeholder beliefs. Stakeholders will hold certain beliefs about the process being predicted. These beliefs may take the form of prior knowledge or scientific evidence (e.g., experience with how gravity works in this setting). Other beliefs may arise from moral or ethical concerns (e.g., racial information should not be used to assign credit scores). In some cases, if an algorithm reaches its predictions in a way that violates their beliefs then this dynamic can make stakeholders uncomfortable with deploying the algorithm.

The agnosis problem-feature requires both eliciting and clarifying the stakeholders' beliefs about the problem. Understanding agnosis also requires the analyst and stakeholders to gauge comfort with the algorithm violating their beliefs.
In Section 3 we took great care to isolate examples so that each problem-feature appeared as clear and distinct as possible. In reality, these features interact and in practice should be discussed collectively as MARA. We encourage the use of "MARA(s)" to emphasize the role of the stakeholders in assessing the four problem-features.

Model-reasoning requires deductive reasoning, meaning that we understand the mathematical structure of the model well enough that, once decoupled from the data it was fit on, stakeholders can reason about the future behavior of the algorithm. Methods of assessing an algorithm that are inductive cannot be used for model-reasoning. Inductive reasoning is contingent and depends on the data in hand (e.g., recycled predictions) or on details of hypothesized, future, out-of-sample data. Methods of assessment built off of these are attempting to approximate model-reasoning.
5. THINKING IN TERMS OF "OUTCOME-REASONING" INSTEAD OF "BLACK BOX ALGORITHMS"

5.1 Outcome-reasoning as a connection between black boxes
In this section we have a more detailed discussion of what we mean by "black box" algorithms. The classification laid out in Burrell is quite useful for discussing how an algorithm becomes opaque: [Type 1:] opacity as intentional corporate or state secrecy, [Type 2:] opacity as technical illiteracy, and [Type 3:] an opacity that arises from the characteristics of machine learning algorithms and the scale required to apply them usefully. The root causes of the opacity are interesting, and have implications for how to remove the opacity. Important research focuses on how to reduce the opacity of the algorithms without losing the strengths offered by these algorithms. Instead of focusing on the causes of the opacity, we focus on how one goes about convincing others that a black box algorithm is suitable for deployment.

For the sake of argument, consider an algorithm that satisfies all three definitions of black boxes as offered by Burrell. Taking this as our example, what do we mean when we say this algorithm "works well"? Few understand how this complex algorithm is implemented, and none argue the behavior of the algorithm is well understood. So describing its behavior is not an option; we cannot use model-reasoning. It appears that outcome-reasoning is the dominant way to reason about the performance of this kind of algorithm. People believe this complex algorithm works not because they believe the algorithm should work for some problem because of how the algorithm functions, but rather because people have observed these algorithms working – in a CTF sense – by outperforming other algorithms on prediction tasks. If these algorithms are not justified by outcome-reasoning then what argument, based on data, is used to justify these algorithms?

While there are three distinct causes of black boxes, recognize that regardless of type we justify a black box by using outcome-reasoning. While the CTF was first developed to address the development of Type 3 black boxes, the logic of outcome-reasoning has been co-opted to justify the other types of black boxes. With outcome-reasoning, organizations can try to convince stakeholders to use their algorithm while withholding details of how their Type 1 black box works. Without outcome-reasoning it is possible fewer Type 1 black boxes would be deployed, because it would be harder for corporations and state actors to convince stakeholders to accept the algorithm's utility without allowing a deeper interrogation of their algorithm (i.e., blocking any chance for model-reasoning). It is interesting to think about how outcome-reasoning has facilitated the proliferation of Type 2 black boxes – e.g., by lowering the barriers to assessing an algorithm and therefore lulling more stakeholders into feeling comfortable deploying these algorithms.

Outcome-reasoning does not require as detailed an engagement with the behavior of an algorithm, so more people can reason about the relative performance of algorithms than if they were required to use model-reasoning. It is possible that outcome-reasoning has led to overconfidence among non-technical stakeholders, as outcome-reasoning feels more accessible than model-reasoning.
In this section we try to clarify a miscommunication happening in debates about black box algorithms. We will motivate this example focusing on Type 3 black boxes, but the arguments need to be only slightly adapted in order to hold for Type 1 and 2 black boxes. To orient the discussion we look at the following question twice: "What happens if a black box algorithm fails?"
First, in a technical sense, what happens if a Type 3 black box algorithm is deployed in the real world and something goes wrong? Speaking loosely, there are four fixes that are commonly used to correct an algorithm that is not well understood: 1) add a human to the loop; 2) collect more data and retrain the algorithm; 3) use a different algorithm; or 4) force the algorithm to return a pre-specified output for inputs that have been identified as producing "unacceptably bad" outputs. All four fixes might work, though they could be difficult to implement (fix 1) or impossible to validate (fixes 2-4). Below we discuss each fix and its limitations.

The first fix is to add a human into the prediction process. For example, suppose an airplane's autopilot algorithm malfunctions under conditions that analysts are unable to describe a priori. While the algorithm can successfully pilot most of the time, an override can allow an alert pilot to take the controls when she detects an error, thereby inhibiting the autopilot algorithm. With the human-in-the-loop fix, there is a fundamental loss of the scalability which has been a hallmark of modern prediction. The second fix uses more training data, especially in areas of input space that the analyst believes to be problematic; however, it is difficult to reason about how to sample from the space of observations. From the perspective of a black box algorithm, it is not clear if two points are "close" or "far apart" in the input space – these algorithms take advantage of non-linear patterns in the space of inputs. For example, suppose we have two people, one who is 36 years old and another who is 36.1 years old. If we were working with an algorithm that uses smooth change in the covariate space to change its prediction then these two points might be seen to be "near" to each other and the algorithm would tend to return quite similar predictions, but there is no such guarantee of smoothness with a black box algorithm. Metaphorically, the black box algorithm "thinks" differently about changes in the covariate space than more accessible models, so it is hard for an analyst to think about how to get "useful" or "novel" or "divergent" data to use to retrain the black box algorithm. The third fix restarts the original search for an appropriate algorithm. As with the previous fix, it is impossible to say that the original failure is corrected (without deploying the algorithm and waiting to see if it fails again), and it is possible that new failures have been introduced. The fourth fix is a special case of the third fix, creating a hard patch wherein the analyst forces the algorithm to return an analyst-determined prediction in the parts of the input space which are most problematic. For instance, if an image labeler pairs otherwise innocuous labels and images in a way that is objectionable, then the analyst may force the algorithm to return a non-response whenever that input is paired with that label.
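A minimal sketch (ours) of the fourth fix, the hard patch: the black box is wrapped so that analyst-flagged inputs receive an analyst-determined response (here, abstention) instead of the black box's output. The black box, the flagged region, and the fallback are hypothetical placeholders.

```python
# A "hard patch": override the black box on inputs flagged as problematic.
def black_box_predict(x):
    """Stand-in for an opaque fitted predictor."""
    return x["age"] * 0.1 + x["income"] * 0.00001

def is_flagged(x):
    """Region of input space the analysts have flagged as problematic."""
    return x["age"] < 0 or x["income"] < 0

def patched_predict(x, fallback=None):
    """Return the analyst-determined fallback (or abstain) on flagged inputs."""
    if is_flagged(x):
        return fallback          # None signals "no prediction / defer to a human"
    return black_box_predict(x)

print(patched_predict({"age": 36, "income": 50000}))   # normal path
print(patched_predict({"age": -1, "income": 50000}))   # patched: None
```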
We reiterate that, despite these limitations, these four fixes may be appropriate for specific circumstances. In other cases, these limitations may present insurmountable objections, rendering these fixes useless.

Let us return to the orienting question: "What happens if a black box algorithm fails?" Miscommunication appears in the literature because this question is used differently by different people in order to highlight different types of concerns. Inside of data analyst communities, the current decision-making frameworks and conversations about improving prediction have understood this question in the technical ways outlined in the two paragraphs above. But the question should also be understood as asking: "How will responsibility be assigned?" There is real value to be gained from deploying complex algorithms in the real world, but how can non-technical stakeholders be made part of the decision-making process for deployment? If we do not offer a meaningful framework for non-technical stakeholders to interrogate the suitability of an algorithm for deployment then the entire burden is on the analyst. Our community's work on impressive statistical technologies has outpaced our work on means for including our non-technical collaborators at vital points in the development and deployment of these algorithms. This paper is motivated by the "How will responsibility be assigned?" style of questions.
6. USING MARA(S) TO REASON ABOUT A RECIDIVISM ALGORITHM
In this section, we work through an example and focus on how MARA makes reasoning about a prediction problem clearer. Consider an example where a particular county court wants to have a decision support tool to quantify the potential for recidivism. (Note: it is not clear to us that a decision support tool in this setting is a wise decision. But this is a scenario that has occurred, and it can be framed as a prediction problem.) The Court announces its interest and requests proposals from companies to create a decision support tool. Several hundred companies bid
on the contract.

Recognizing that this can be thought of as a prediction problem, the Court provides a data set to the companies and then holds a contest using the Common Task Framework. The Court can thus rank the performance of the algorithms using their desired metric(s). The three top performers are kept and move on to a new round of consideration. The results of the competition were as follows: Company A used a fantastically complex algorithm and out-performed the other algorithms in the contest. Company B, which performed noticeably less successfully than Company A, used a proprietary algorithm that Company B believes should not be shared publicly (perhaps because they are concerned about people exploiting weaknesses of the algorithm). Company C, which placed third in the competition, uses an algorithm that is based on linear regression, and they are willing to share it publicly. Because of its best-in-competition performance (and perhaps given other considerations such as speed and cost) the Court decides to move forward with Company A.

Given the MARA(s) framework, it should be recognizable that the data-driven part of the selection process above was based on outcome-reasoning, which may or may not be a good way to cut down the pool of competitors according to how stakeholders think about the MARA features. MARA(s) gives language to stakeholders (particularly non-technical stakeholders) to criticize the process as described above. The way we have told the story so far, it is not clear who selected the data set used for the competition, the prediction task(s), or the performance metric. Even in using outcome-reasoning, there are important roles for the stakeholders. It is likely that if the stakeholders are (i) judges, prosecutors, politicians, and law enforcement officers then they will have different priorities than a group of stakeholders that includes (ii) public defenders, victims' families, and prisoner advocates. Again, even within an outcome-reasoning situation there are vital roles for the stakeholders.

Potential stakeholders can offer more fundamental critiques of the process outlined above; they can reject outcome-reasoning. While the contest demonstrated the ordering on the data in existence, the real question for the stakeholders is to reason about deploying the algorithm on future, unseen data. In order to proceed the stakeholders need to have a discussion of how they, as a group, believe the prediction problem they are interested in satisfies the MARA conditions. Given the sensitivities around depriving citizens of their rights, it is likely that both resilience and agnosis are not satisfied for stakeholders like public defenders and prisoner advocates. Additionally, when an individual is incarcerated, it is not possible to assess the validity of the algorithm's prediction based on their outcome, so for anyone incarcerated there is no measurement available.

Without recourse to MARA(s) – given the relative prowess of the companies' prediction algorithms – it may feel hard for stakeholders to articulate their concerns and, further, to identify that outcome-reasoning should not be sufficient to convince them given the concerns they have with the prediction problem. Further, MARA(s) allows certain types of criticism to be clearer: (i) Company A's Type 3 black box is concerning because the stakeholders have strong beliefs, such as deontological concerns, that no one can reason about. It is likely Company A needs to be disqualified.
(ii) Company B's Type 1 black box may allow people within the company to reason about the future functioning of the algorithm (if it is an algorithm that they can model-reason about), but the reasoning is not being done by the stakeholders, and any assertion from Company B that the algorithm is consistent with the stakeholders' beliefs requires a level of trust and possible verification (i.e., social-psychological arguments). It needs to be clear that Company B has not "successfully" articulated a data-driven argument that can convince stakeholders to the same level as Company C. Competing successfully in a CTF event does not warrant deployment.

It is important for stakeholders to know, and have the language to hold others accountable for, getting buy-in pre-deployment. This buy-in process centers on understanding whether the MARA conditions are satisfied. If the MARA problem-features are not satisfied then stakeholders should request model-reasoning before deployment.
7. RELATED WORK
Recently the CTF, while continuing to yield huge success in algorithmic development, has seen a host of criticisms. One concern about the CTF is that datasets can be overfit over time (Sculley et al., 2018; Rogers, 2019; Van Calster et al., 2019; Ghosh, 2019). Despite all attempts to protect against overfitting, idiosyncratic aspects of a particular dataset are learned when heuristic improvements yield improved predictions over the state of the art. Additionally, and a potential
corollary of this criticism, given two sets of reference data, it is often unclear why an algorithm performs well on one dataset but poorly on the other. This has led some to call for more resources to be dedicated to understanding the theoretical underpinnings of the algorithms that have achieved such huge success, hoping to avoid a catastrophic failure in the future. There have been several debates on the relative merits of careful theoretical justification vs. rapid performance improvement (see Rahimi and Recht (2017) and the rebuttal in LeCun (2017); see also Barber (2019)). We do not enter this debate here. However, the substance of the debate is important in MARA(s). As can be seen in how these algorithms respond to different datasets, their performance is a complex interaction of the data, which can often be quite large, and difficult-to-uncover aspects of the algorithms. While some in the aforementioned debates call for more theoretical understanding of these algorithms ahead of rapid innovation, we take a different tack and ask when we can deploy a black box algorithm through outcome-reasoning to make predictions in the wild. For an algorithm whose future performance is justified using a measurement of the algorithm's success in the space of the outcome, as is done in the CTF, this framework recognizes an extension to the CTF, at the very least satisfying the measurement problem-feature. In problems that do not satisfy the measurement problem-feature, the algorithm is being used to extrapolate without a priori justification, and we have no way of measuring – or perhaps even being aware of – failures. By construction, such extrapolation does not exist in the static version of the CTF.

This paper also relates to debates in ethical machine learning through both the agnosis problem-feature and stakeholder inclusion. This literature is new, rapidly expanding, and impactful; we suggest interested readers consult the following as solid entry points: Corbett-Davies and Goel (2018); Lum and Isaac (2016); Kusner et al. (2017); Nabi and Shpitser (2018); Wiens et al. (2019). We use the MARA(s) framework in the examples in Section 3 to demonstrate how this framework can be used to clarify concerns of this nature – see the examples on recidivism (subsection 3.4) and prediction in lieu of measurement (subsection 3.3).

In the public literature, most discussion of the CTF has been undertaken by David Donoho (Donoho, 2017, 2019). (Note: it appears that much of the development of the CTF happened outside of the public-facing, academic literature. See section 9.) Of particular interest, Donoho develops the notion of hypothetical reasoning in Donoho (2019), exploring how analysts have developed "models" – formalizations of their beliefs into statements of probability models – to "... genuinely allow us to go far beyond the surface appearance of data and, by so doing, augment our intelligence." Using language developed for MARA(s) we might say that satisfying the agnosis problem-feature means forgoing the advantages Donoho identifies as accruing from hypothetical reasoning. That may be a reasonable choice in some settings, but it should give pause to researchers interested in generating solid scientific evidence.

Finally, the MARA(s) framework is related to work on the explainability/interpretability of algorithms.
The conversations on these topics have been happening for many years, and in several distinct literatures, so understanding the foundational concerns and identifying the through lines of thinking can be challenging. We direct readers to two touchstone pieces in the literature as a good place to start: Breiman (2001) and Shmueli et al. (2010). Cynthia Rudin (Rudin, 2018) explores the suitability of two types of models, explainable machine learning models and interpretable machine learning models, in the context of high risk and low risk predictions. In Rudin's dichotomy, she warns against using explainable models in high risk predictions due to our inability to make sense of the performance of the model despite the promise of explanation. Instead, for high risk scenarios, she urges practitioners to use interpretable models that can be linked directly to domain knowledge, and encourages researchers to put effort into finding suitable interpretable models where none exist. In terms of our framework, Rudin is exploring the joint impact of the resilience and agnosis problem-features. We direct the reader to Rudin's paper for details on why explainable machine learning models are not sufficient for what we identify as problems that require model-reasoning.
8. DISCUSSION
The MARA(s) framework focuses on features of the prediction problem at hand, rather than features of the algorithm. The problem itself is selected by the stakeholders, who have concerns that include accountability. Understanding how stakeholders see the problem is the critical first step towards selecting an
appropriate algorithm. This framework directs attention to the four problem-features that stakeholders should assess: measurement, adaptability, resilience, and agnosis ("MARA"). Once assessed, the appropriate method for reasoning about the algorithm can be selected.

In contrast to (but not in conflict with) MARA(s), there are other frameworks for decision-making about the suitability of an algorithm, technical in nature and useful for understanding the performance of different algorithms – e.g., diagnostic tools or asymptotic performance – but these are helpful after the method of reasoning has been selected.

While MARA(s) is a statement about the problem and not the algorithm, it does imply a loose structure on the set of possible algorithms. One way to think about this implied structure is that model-reasoning methods are decoupled from the data, allowing for deductive reasoning about future performance, while outcome-reasoning relies on contingent, inductive reasoning. Many of the current approaches to describing black-box algorithms are inductive in nature (see Rudin (2018)). While these can be quite useful, they are still a qualitatively different form of reasoning. This is a familiar distinction in the type of evidence we bring to problems, and the reader need not look hard to find examples in which deductive reasoning is a required component of our decision making. The gold standard of inductive reasoning is randomized trials, but in the most consequential settings, the result of the most solid form of inductive reasoning does not provide sufficient justification. For example, when approving a new drug, government agencies would not allow evidence from an atheoretical randomized trial to warrant approval. Instead agencies require a detailed scientific hypothesis about how the drug's mechanism causes the outcome. The addition of deductive reasoning and coherence across beliefs provides a firmer, evidence-based foundation. And yet, when appropriate, the use of outcome-reasoning is to be preferred because it is a powerful engine for producing the highest quality predictions.

Outcome-reasoning is the intellectual engine of modern prediction algorithms. If the data sets are interesting, the task is useful, and the performance metric describes an ordering that matches how the algorithm will be used, then outcome-reasoning leads to an extraordinary consequence: it allows an analyst to bypass the slow, technical challenge of mathematically describing the behavior of the algorithm. Instead, outcome-reasoning allows the analyst to look at the joint distribution of predicted and observed outcomes and then rank the performance of algorithms by creating statistical summaries. Outcome-reasoning leverages Tukey's insight (Tukey, 1986) that: "In a world in which the price of calculation continues to decrease rapidly, but the price of theorem proving continues to hold steady or increase, elementary economics indicates that we ought to spend a larger and larger fraction of our time on calculation."

The power and popularity of the CTF has inspired extensions to prediction domains that are not traditionally investigated inside the framework. For instance, in Wikle et al. (2017) the authors propose an extension to spatial prediction which, among other additions, includes an abundance of relevant data sets of differing characteristics on which the algorithm must succeed, and additional metrics, like assessment of prediction coverage.
It seems reasonable that the CTF-SP will enhance rapid innovation of algorithms for certain types of problems, which is an exciting prospect for the spatial forecasting community. However, we caution that, like all algorithms developed in the CTF, these algorithms are not qualified to be reasoned about using model-reasoning and should only be used in situations that permit outcome-reasoning.

The MARA(s) framework provides a language to help stakeholders and analysts communicate the key features of a problem and then guide the selection of an appropriate algorithm. This language can also be used by algorithm developers to help identify areas for innovation. For instance, in section 3 we discuss "prediction in lieu of measurement," and we are unaware of effective algorithms in that setting that could be used to provide model-reasoning. This gives analysts and stakeholders a way to identify critical gaps in the existing set of approaches.

The unpredictable brittleness of black-box algorithms has provoked concern and increased scrutiny. But black-box algorithms have had extraordinary success in some settings. We are concerned that a desire to use these powerful algorithms – combined with the facile strength of outcome-reasoning – has led to overconfidence among non-technical stakeholders. In contrast, model-reasoning can be more technically challenging to understand and has put limitations on the algorithms that can be deployed. We think these limitations are important to recognize. It has been hard to discuss these limitations because, for any given black-box algorithm, understanding why it might fail or when it might fail is challenging. In contrast, understanding which settings are appropriate for black-box deployment only requires understanding how such algorithms are developed – that is, using the Common Task Framework (CTF). The MARA(s) framework extends the CTF into real-world settings by isolating four problem-features – measurement, adaptability, resilience, and agnosis ("MARA") – that mark a problem as being more or less suitable for black-box algorithms. Further, we suggest that the compact notation MARA(s) makes it clear how the assessment of the problem features is a function of the stakeholders. We hope MARA(s) will help the two cultures of statistical modeling – and our stakeholders – communicate and reason about algorithms.
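To make the ranking procedure referenced earlier in this section concrete, the following is a minimal sketch of CTF-style outcome-reasoning: several candidate algorithms are fit on a shared training set and then ranked by a single, fixed statistical summary of predicted versus observed outcomes on a held-out evaluation set. The sketch is illustrative only; the synthetic data, the scikit-learn candidates, and the accuracy metric are placeholder assumptions, not components of the MARA(s) framework or of any particular CTF competition.

# Minimal sketch of CTF-style outcome-reasoning: fix a task, a data set,
# and a metric, then rank candidate algorithms by held-out performance.
# The candidates, data, and metric below are illustrative placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.metrics import accuracy_score

# A stand-in for the shared, labeled data set of the common task.
X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
X_train, X_eval, y_train, y_eval = train_test_split(
    X, y, test_size=0.25, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(random_state=0),
    "gradient_boosting": GradientBoostingClassifier(random_state=0),
}

# Rank purely by a statistical summary of predicted vs. observed outcomes;
# no mathematical description of any algorithm's behavior is required.
scores = {}
for name, model in candidates.items():
    model.fit(X_train, y_train)
    scores[name] = accuracy_score(y_eval, model.predict(X_eval))

for name, score in sorted(scores.items(), key=lambda kv: -kv[1]):
    print(f"{name}: held-out accuracy = {score:.3f}")

The point of the sketch is that the ranking depends only on the joint behavior of predictions and observed outcomes on the evaluation set; nothing in the procedure requires a mathematical account of why any candidate performs as it does.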
9. CORRESPONDENCE WITH MARK LIBERMAN: HISTORICAL PERSPECTIVE
Mark Liberman agreed to read a draft of our paper. As part of his feedback, he provided a historical perspective that is often downplayed in current discussions of the CTF, but that can – as most accounts of history do – provide guidance on how we might work inside the CTF to tackle new technical challenges (Liberman, 2020).

Liberman writes, "The [CTF] was developed as a way to manage and guide sponsored research on very hard problems that were far away from practically useful solutions – two or three decades away, as it turned out. It was NOT originally meant for the development and evaluation of real solutions to real problems, though obviously it can be (and has been) generalized in that direction." This was intentional, and it had implications for the choice of task and metric.

He says that "it was seen as a mistake to choose tasks that directly represent the real goal of the work." Instead, tasks were chosen that balanced several requirements, including their fitness for the CTF (i.e., cost-effective data set creation); their isolation of specific, current challenges that were not trivial but also not insurmountable; and their appeal to funders. "New tasks (or new versions of old tasks) should be introduced every year or so ... in order to check generalization and approach the real goal more closely."

The same careful thinking applied to metrics. Liberman writes, "Again, it was seen as a mistake to try to measure what you actually care about." Instead, metrics should be conceptually simple and easy to automate, and should serve to move research in the general direction of the ultimate goals – to this point, he stresses, "it was always explicit that these metrics were at best somewhat correlated with the (anyhow varied) research goals". While tasks were to be frequently updated or changed, "metrics should not be changed very often, though new metrics need to be added from time to time as appropriate." As examples of useful metrics in human language technology (HLT), Liberman points to "word error rate" and the BLEU metric (Papineni et al., 2002), which are flawed as metrics for real-world applications but have served HLT research well for several decades (a short sketch at the end of this section illustrates how easily a metric like word error rate can be automated). The key is to understand when each metric is useful in promoting progress towards the ultimate goals of research, and when it is not. We note that this notion of a task or method being useful, but not necessarily realistic for use in the "real world", should not seem unfamiliar, as it conjures George Box's well-known saying, "All models are wrong, but some are useful."

Liberman adds, "The issues in question were discussed and debated extensively in the period 1985-2005, and to a lesser extent since then." In particular, the choices of task and metric "need to be re-thought for shorter-term applications." The emergence of these shorter-term applications coincided with two important shifts in research drivers: 1) the technologies had become "commercially viable, so that a short-term outlook began to make sense", and 2) "Support for R&D in these areas shifted from the government to industry." The second point in particular has interesting implications for how the CTF is administered. Under government funding, the CTF was intentional: tasks and metrics were curated so as to promote progress as intended by the CTF, and evaluation data sets were guarded by a third party. Today, in contrast, the CTF exists in flavors.
Each flavor bears resemblance to the original design of the CTF, though with some characteristics modified or even entirely missing.

We mention in the introduction to this paper, for instance, that often an evaluation data set is no longer maintained by a third party (of course, there are many instances, like Kaggle competitions, where this is not the case). Much research occurs in environments that rely on self-policing: the researcher will create their own evaluation data set and hold their own bake-off between their current algorithm and other algorithms. Indeed, every slight modification to an algorithm is compared to a previous iteration through a process that mimics the CTF. On the other hand, in some competitions, aspects of the CTF competition culture might be absent. This happens when a company creates a proprietary data set with the intention of creating an algorithm to be released as a product. In this case, the company might develop algorithms in a manner that mimics the CTF, with a task, metric, labeled data set, and rapid empirical evaluation, but with no (or at least minimal) competition from analysts outside of the company.

We believe that MARA(s) applies to all flavors of the CTF, with special focus on technologies close to deployment (Liberman refers to these as "shorter-term"). We echo Liberman's urging that the community think through the implications that a deployment-focused CTF has for tasks and metrics, as well as the implications of relaxing or omitting characteristics of the original, carefully planned CTF. Because the deployment-focused CTF is linked to the real-world problem of interest, we stress that MARA(s) should be a cornerstone of these discussions.

In the original draft that we sent to Liberman, we used MARA(s) to explore the HLT subdomain of chatbots, which try to engage a user in a natural, informative conversation in a narrowly defined setting. However, compared to the rapid innovation in many natural language processing (NLP) tasks, the development and deployment of chatbots has been slow. Focusing on a problem-first analysis using MARA(s), it appears that the problem-feature of measurement might be difficult to satisfy, since the space of possible responses in a conversation is vast. After all, not only are there many ways to convey the same information, there are often many plausible types of information that would be appropriate for a particular response. But in the real world there is feedback on the predictions of a chatbot. A useless chatbot might garner complaints, or a conversation might be terminated early. A cleverly designed deployment could test the utility of a chatbot on an ongoing basis, serving as a proxy for the actual measurement of interest. Assuming the other MARA problem-features are satisfied (which depends on the problem of interest), outcome-reasoning could then be used.

The problem for chatbot development is that, because the space of possible responses is vast, it is essentially impossible to curate a data set for a static CTF. Our analysis pointed to this as an odd (though possibly not unusual) situation in which a problem is fit for outcome-reasoning but cannot fully benefit from the frictionless environment and rapid innovation promised by the traditional CTF. One interesting consequence is that it might be possible to mimic characteristics of the CTF through creative early deployment of chatbots. For instance (and we caution here that this is for illustration purposes only, as there are many issues we are ignoring that would need to be considered), an online marketplace, let's call it "WHAMazon", might be interested in using chatbots to help customers determine which product best suits their needs, expanding on a current service that requires human helpers. In order to mimic the CTF through early deployment, WHAMazon could invite (or entice with discounts) customers to interact with a chatbot for two minutes in a way that reflects the customer's needs at the time. Upfront, the customer would know that this was for fun and not intended for informational purposes.
This would minimize customer frustration and potential abandonment, while providing a continual supply of training data.

Given the nuanced history of the CTF that he provided, Liberman suggests that there is a lack of effective choices of tasks and metrics that lend themselves to rapid development. Adopting a development-focused (long-term) view, rather than a deployment-focused (short-term) view, might help to identify suitable choices. We agree with this assessment. Further, in the case where a company chooses the early-deployment strategy discussed above, the distinction between a development-focused CTF and a deployment-focused CTF should still affect the choice of task and metric. For instance, in a development-focused CTF design (and very much dependent on the current challenge being faced by the development team), WHAMazon could set up a scenario in which the customer was incentivized to identify, as early as possible in a conversation, whether they were conversing with a chatbot or a human. In a deployment-focused CTF (once WHAMazon feels it is close to being able to deploy its chatbot technology), it might record whether a customer purchased an item after having chatted with the chatbot, or requested to speak with a human after the two-minute window had expired.

In the case of chatbots, MARA(s) suggests that the CTF is a worthwhile and powerful paradigm for algorithmic development, while a careful understanding of the history of the CTF suggests that the development-focused CTF is the appropriate paradigm, at least for now. In order to harness the power of the CTF in as many domains as possible, we feel that a thorough understanding of the history of the framework is essential. Careful thinking about the CTF and the consequences of its development-focused and deployment-focused modes, combined with the considerations of MARA(s), can give real-world challenges a principled placement in the most effective and appropriate format for algorithm development.
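As promised earlier in this section, the following is a minimal sketch of word error rate, one of the metrics Liberman cites as conceptually simple and easy to automate: the word-level edit distance between a system transcript and a reference, normalized by the length of the reference. This is a generic textbook formulation offered for illustration only; it is not tied to any particular HLT evaluation or to the way any specific benchmark computes the metric.

# Minimal sketch of word error rate (WER): word-level edit distance between
# a hypothesis and a reference transcript, normalized by reference length.
# A generic textbook formulation, shown only to illustrate how easily such
# a metric can be automated.
def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming table for edit distance (substitutions,
    # insertions, and deletions each cost 1).
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Example: one substitution and one deletion over a six-word reference
# give WER = 2/6, roughly 0.33.
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))

A metric this mechanical is exactly what made rapid, automated bake-offs possible, and exactly why, as Liberman stresses, it is at best only somewhat correlated with the real goals of the research.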
10. ACKNOWLEDGMENTS
We are grateful to many for helpful feedback on early versions of this work, including Angel Christin, Mark Cullen, Devin Curry, David Donoho, Jamie Doyle, Steve Goodman, Karen Kafadar, Mark Liberman, Joshua Loftus, Kristian Lum, Ben Marafino, Blake McShane, Art Owens and his lab, Cynthia Rudin, Kris Sankaran, Joao Sedoc, Dylan Small, Larry Wasserman, Wen Zhou, The Quantitative Collaborative at the University of Virginia, and The Human and Machine Intelligence Group at the University of Virginia.
REFERENCES
Barber, G. (2019). Artificial Intelligence Confronts a 'Reproducibility' Crisis. Wired.
Bennett, J., Lanning, S. and others (2007). The Netflix Prize. In Proceedings of KDD Cup and Workshop.
Breiman, L. (2001). Statistical Modeling: The Two Cultures (with comments and a rejoinder by the author). Stat. Sci.
Burrell, J. (2016). How the machine thinks: Understanding opacity in machine learning algorithms. Big Data & Society.
Corbett-Davies, S. and Goel, S. (2018). The measure and mismeasure of fairness: A critical review of fair machine learning. arXiv preprint arXiv:1808.00023.
Donoho, D. (2017). 50 Years of Data Science. J. Comput. Graph. Stat.
Donoho, D. (2019). Comments on Michael Jordan's Essay "The AI Revolution Hasn't Happened Yet". Harvard Data Science Review. https://hdsr.mitpress.mit.edu/pub/rim3pvdw.
Ghosh, P. (2019). Machine learning 'causing science crisis'. BBC.
Google (2019). Machine learning workflow — AI Platform — Google Cloud. https://cloud.google.com/ml-engine/docs/ml-solutions-overview. Accessed: 2019-11-6.
Hermann, J., Del Balso, M., Kjelstrøm, K., Reinhold, E., Beinstein, A. and Sumers, T. (2018). Scaling Machine Learning at Uber with Michelangelo. https://eng.uber.com/scaling-michelangelo/. Accessed: 2019-11-6.
Kusner, M. J., Loftus, J., Russell, C. and Silva, R. (2017). Counterfactual fairness. In Advances in Neural Information Processing Systems.
LeCun, Y. (2017). Rebuttal to "Reflections on Random Kitchen Sinks". Accessed: 2019-11-5.
Liberman, M. (2010). Fred Jelinek. Comput. Linguist.
Liberman, M. (2020). Personal correspondence.
Lum, K. and Isaac, W. (2016). To predict and serve? Significance.
Nabi, R. and Shpitser, I. (2018). Fair inference on outcomes. In Thirty-Second AAAI Conference on Artificial Intelligence.
Papineni, K., Roukos, S., Ward, T. and Zhu, W.-J. (2002). BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics.
Rahimi, A. and Recht, B. (2017). Reflections on Random Kitchen Sinks. http://benjamin-recht.github.io/2017/12/05/kitchen-sinks/. Accessed: 2019-11-5.
Rogers, A. (2019). How the Transformers broke NLP leaderboards. https://hackingsemantics.xyz/2019/leaderboards/. Accessed: 2019-11-6.
Rudin, C. (2018). Stop Explaining Black Box Machine Learning Models for High Stakes Decisions and Use Interpretable Models Instead. arXiv.
Sculley, D., Snoek, J., Wiltschko, A. and Rahimi, A. (2018). Winner's Curse? On Pace, Progress, and Empirical Rigor. ICLR Workshop.
Shmueli, G. et al. (2010). To explain or to predict? Statistical Science.
Shmueli, G. and Koppius, O. R. (2011). Predictive analytics in information systems research. MIS Quarterly.
Tukey, J. W. (1986). Sunset salvo. The American Statistician.
Van Calster, B., Wynants, L., Timmerman, D., Steyerberg, E. W. and Collins, G. S. (2019). Predictive analytics in health care: how can we know it works? J. Am. Med. Inform. Assoc.
Wiens, J., Saria, S., Sendak, M., Ghassemi, M., Liu, V. X., Doshi-Velez, F., Jung, K., Heller, K., Kale, D., Saeed, M., Ossorio, P. N., Thadaney-Israni, S. and Goldenberg, A. (2019). Do no harm: a roadmap for responsible machine learning for health care. Nat. Med.
Wikle, C. K., Cressie, N., Zammit-Mangion, A. and Shumack, C. (2017). A common task framework (CTF) for objective comparison of spatial prediction methodologies.