Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork
Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, Daniel S. Weld
University of Washington | Microsoft Research | Allen Institute for Artificial Intelligence
Abstract
In many high-stakes domains such as criminal justice, finance, and healthcare, AI systems may recommend actions to a human expert responsible for final decisions, a context known as AI-advised decision making. When AI practitioners deploy the most accurate system in these domains, they implicitly assume that the system will function alone in the world. We argue that the most accurate AI is not necessarily the best teammate; for example, predictable performance is worth a slight sacrifice in AI accuracy. So, we propose training AI systems in a human-centered manner and directly optimizing for team performance. We study this proposal for a specific type of human-AI team, where the human overseer chooses to accept the AI recommendation or solve the task themselves. To optimize team performance, we maximize the team's expected utility, expressed in terms of the quality of the final decision, the cost of verifying, and the individual accuracies. Our experiments with linear and non-linear models on real-world, high-stakes datasets show that the improvements in utility, while small and varying across datasets and parameters (such as the cost of a mistake), are real and consistent with our definition of team utility. We discuss the shortcomings of current optimization approaches beyond well-studied loss functions such as log-loss, and encourage future work on human-centered optimization problems motivated by human-AI collaboration.
1 Introduction

Increasingly, humans work collaboratively with an AI teammate, for example, because the team may perform better than either the AI or the human alone [Nagar and Malone, 2011; Patel et al., 2019; Kamar et al., 2012], or because legal requirements may prohibit complete automation [GDPR, 2020; Nickelsburg, 2020]. For human-AI teams, just like for any team, optimizing the performance of the whole team is more important than optimizing the performance of an individual member.
Figure 1: In a human-AI team, a more accurate classifier (h1, left pane, learned using log-loss) may produce lower team utility than a less accurate one (h2, right pane). Suppose the human can either quickly accept the AI's recommendation or solve the task themselves, incurring a cost λ, to yield a more reliable result. The payoff matrix describes the utility of different outcomes. One optimal policy is for the human to accept recommendations when the AI is confident, but verify uncertain predictions (shown in the light grey region surrounding each hyperplane). While h2 is less accurate than h1 (because B is incorrectly classified), it results in a higher team utility: since h2 moved A outside the verify region, there are more correctly classified inputs on which the user can rely on the system.

Yet, to date, the AI community has for the most part focused on maximizing the individual accuracy of machine-learning models. This raises an important question: Is the most accurate AI the best possible teammate for a human? We argue that the answer is "No." We show this formally, but the intuition is simple. Consider human-human teams: is the best-ranked tennis player necessarily the best doubles teammate?
Clearly not: teamwork puts additional demands on participants besides high individual performance, such as the ability to complement and coordinate with one's partner. Similarly, creating high-performing human-AI teams may require training AI that exhibits additional human-centered properties that facilitate trust and delegation. Implicitly, this is the motivation behind much work on intelligible AI [Caruana et al., 2015; Weld and Bansal, 2019] and post-hoc explainable AI [Ribeiro et al., 2016], but we suggest that directly modeling the collaborative process may offer additional benefits.

Recent work emphasized the importance of better understanding how people transform AI recommendations into decisions [Kleinberg et al., 2018]. For instance, consider scenarios where a system outputs a recommendation on which it is uncertain. A rational user is likely to distrust such recommendations, since erroneous recommendations are often correlated with low confidence in the prediction [Hendrycks and Gimpel, 2018]. In this work we assume that the user will discard the recommendation and solve the task themselves, after incurring a cost (e.g., due to additional human effort). As a result, team performance depends on the AI accuracy only in the accept region, i.e., the region where a user is actually likely to rely on the AI. Thus, the singular objective of optimizing for AI accuracy (e.g., using log-loss) may hurt team performance when the model has a fixed inductive bias; team performance will instead benefit from improving the AI in the accept regions in Figure 1. While there exist other aspects of collaboration that can also be addressed via optimization techniques, such as model interpretability, supporting complementary skills, or enabling learning among partners, the problem we address in this paper is to account for team-based utility as a basis for collaboration.

In sum, we make the following contributions:

1. We highlight a novel, important problem in the field of human-centered artificial intelligence: the most accurate ML model may not lead to the highest team utility when paired with a human overseer.

2. We show that log-loss, the most popular loss function, is insufficient (as it ignores team utility) and develop a new loss function, team-loss, which overcomes its issues by calculating a team's expected utility.

3. We present experiments on multiple real-world datasets that compare the gains in utility achieved by team-loss and log-loss. We observed that while the gains are small and vary across datasets, they reflect the behavior encoded in the loss. We present further analysis to understand how and when team-loss results in a higher utility, for example, as a function of domain parameters such as the cost of a mistake.
2 AI-Advised Decision Making

We focus on a special case of AI-advised decision making where a classifier h gives recommendations to a human decision maker to help make decisions (Figure 2a). If h(x) denotes the classifier's output, a probability distribution over Y, the recommendation r consists of a label ŷ = arg max h(x) and a confidence value max h(x), i.e., r := (ŷ, max h(x)). Using this recommendation, the user computes a final decision d. The environment, in response, returns a utility that depends on the quality of the final decision and any cost incurred due to human effort. Let U denote the utility function. If the team classifies a sequence of instances, the objective of the team is to maximize the cumulative utility. Before deriving a closed-form equation for the objective, we characterize the form of the human-AI collaboration along with our assumptions. We study this particular, simple setting as a first step to explore the opportunities and challenges in team-centric optimization. If we cannot optimize for this simple setting, it may be much harder to optimize for more complex scenarios (discussed more in Section 4).

Figure 2: (a) A schematic of AI-advised decision making. (b) To make a decision, the human decision maker either accepts or overrides a recommendation. The Solve meta-decision is costlier than Accept.

Symbol | Description
a ∈ [0, 1] | Human accuracy
β ∈ R+ | Cost of mistake
h(x)[ŷ] = max h(x) | Confidence in the predicted label
d : X × R → Y | Human decision maker
h : X → [0, 1]^{|Y|} | Classifier
H | Classifier hypothesis space
λ ∈ R+ | Cost of human effort to Solve
m : X × R → M | Meta-decision function
M := {Accept, Solve} | Meta-decision space
ψ : H × X × Y → R | Expected team utility
r ∈ R | Recommendation
R := Y × [0, 1] | Recommendation space
U : M × Y → R | Utility function
X | Feature space
ŷ ∈ Y | Recommended label
Y | Label space

Table 1: Notation.

1. User either accepts the recommendation or solves the task themselves: The human computes the final decision by first making a meta-decision: Accept or Solve (Figure 2b). Accept passes off the recommendation as the final decision. In contrast, Solve ignores the recommendation and the user computes the final decision themselves. Let m denote the function that maps an input instance and a recommendation to a meta-decision in M = {Accept, Solve}. As a result, the optimal classifier h* maximizes the team's expected utility:

   h* = arg max_{h ∈ H} E_{x,y}[U(m, d)]    (1)

2. Mistakes are costly: A correct decision results in unit reward, whereas an incorrect decision results in a penalty β ≥ 0.

3. Solving the task is costly: Since it takes time and effort for the human to perform the task themselves (e.g., cognitive effort), we assume that the Solve meta-decision costs more than Accept. Further, without loss of generality, we assume λ units of cost to Solve and zero cost to Accept.

Using the above assumptions, we obtain the utility function shown in Figure 3; the value in each cell originates from subtracting the cost of the action from the environment's reward.

Meta-decision / Decision | Correct | Incorrect
Accept [A] | 1 | −β
Solve [S] | 1 − λ | −β − λ

Figure 3: Team utility w.r.t. meta-decision and decision accuracy.

4. Human is uniformly accurate: Let a ∈ [0, 1] denote the conditional probability that, if the user solves the task, they will make the correct decision, i.e.,

   P(d = y | m = S) = a    (2)

5. Human is rational: The user makes the meta-decision by comparing expected utilities. Further, the user trusts the classifier's confidence as an accurate indicator of the recommendation's reliability. As a result, the user will choose Accept if and only if the expected utility of accepting is at least that of solving:

   E[U(m = A)] ≥ E[U(m = S)]
   h(x)[ŷ] − (1 − h(x)[ŷ]) · β ≥ a − (1 − a) · β − λ
   h(x)[ŷ] ≥ a − λ / (1 + β)

Let c(β, λ, a) denote the minimum value of system confidence for which the user's meta-decision is Accept:

   c(β, λ, a) = a − λ / (1 + β)    (3)

This implies that the human follows a threshold-based policy when making meta-decisions:

   P(m = A) = 1    if h(x)[ŷ] ≥ c(β, λ, a)
            = 0    otherwise    (4)
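For intuition, consider illustrative parameter values of our choosing (they do not come from the paper's experiments): a human accuracy of a = 0.9, a Solve cost of λ = 0.2, and a mistake penalty of β = 1. Equation 3 then gives

   c(β, λ, a) = 0.9 − 0.2 / (1 + 1) = 0.8,

so a rational user accepts any recommendation made with confidence at least 0.8 and solves the task otherwise. A cheaper Solve action (smaller λ) or a more accurate human (larger a) raises the bar the classifier must clear.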
We now derive the expected utility of recommendations. Let ψ denote the expected utility of the classifier h paired with a decision maker d:

   ψ(x, y) = E[U(m, d)]
           = P(m = A) · [ P(d = y | m = A) · 1 + P(d ≠ y | m = A) · (−β) ]
             + P(m = S) · [ P(d = y | m = S) · (1 − λ) + P(d ≠ y | m = S) · (−β − λ) ]

Since upon Accept the human returns the classifier's recommendation, the probability that the final decision is correct is the same as the classifier's predicted probability of the correct decision:

   P(d = y | m = A) = h(x)[y]    (5)

Using Equations 2 and 5, we obtain the following equation for the expected utility of the team:

   ψ(x, y) = P(m = A) · [ (1 + β) · h(x)[y] − β ] + P(m = S) · [ (1 + β) · a − β − λ ]
           = P(m = A) · [ (1 + β) · (h(x)[y] − a) + λ ] + [ (1 + β) · a − β − λ ]    (6)

where the second term is a constant. Using Equations 4 and 6, we obtain the following expression for the expected utility:

   ψ(x, y) = (1 + β) · h(x)[y] − β    if h(x)[ŷ] ≥ c(β, λ, a)
           = (1 + β) · a − β − λ      otherwise    (7)

Figure 4a visualizes the expected team utility of the classifier's predictions as a function of confidence in the true label.

Figure 4: Visualization of expected utility and loss, for the case λ = 0.5, β = 1, and a = 1 (i.e., the human is perfectly accurate but it costs them half a unit of utility to solve the task). (a) In the Accept region the expected team utility is equal to the expected automation utility, while in the Solve region it is the same as the human utility. The negative team utility in the left-most region indicates scenarios where the AI gives high-confidence but incorrect recommendations to the human. (b) In the Accept region, team-loss behaves similarly to log-loss; however, in the Solve region it results in a constant loss.

Since gradient descent-based minimization of loss functions is common in machine learning, we transform the expected utility ψ into a loss function by negating it. We call this new loss function team-loss:

   L_team(x, y) = −log(ψ(x, y))    (8)

We take a logarithm before negating the utility to allow comparisons with log-loss, where the logarithmic nature of the loss is known to benefit optimization, for example, by heavily penalizing high-confidence mistakes. Figure 4b visualizes this new loss function.
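For concreteness, the following is a minimal PyTorch sketch of Equations 3, 7, and 8. The paper provides no reference implementation, so the function and parameter names, and the exact shift applied before the logarithm, are our choices, following the implementation footnote in Section 3.

```python
import torch

def team_loss(logits, targets, beta=1.0, lam=0.5, a=1.0, eps=1e-6):
    """Negative log of the expected team utility (Equations 7-8); a sketch.

    logits:  (N, C) unnormalized classifier scores
    targets: (N,)   integer ground-truth labels
    """
    probs = torch.softmax(logits, dim=1)
    conf = probs.max(dim=1).values                        # h(x)[y_hat]
    p_true = probs[torch.arange(len(targets)), targets]   # h(x)[y]

    c = a - lam / (1.0 + beta)                            # Accept threshold (Equation 3)
    accept_util = (1.0 + beta) * p_true - beta            # utility when the user Accepts
    solve_util = (1.0 + beta) * a - beta - lam            # constant utility when they Solve
    util = torch.where(conf >= c, accept_util,
                       torch.full_like(accept_util, solve_util))

    # Utility can be as low as -(beta + lam), so shift it above zero
    # before taking the log (see the implementation footnote in Section 3).
    return -torch.log(util + beta + lam + eps).mean()
```

Note that every example routed to the Solve branch receives a constant loss and hence contributes zero gradient, which foreshadows the optimization difficulty reported in Section 3.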
3 Experiments

We conducted experiments to answer the following research questions:

RQ1: Does the new loss function result in a classifier that improves team utility over the most accurate classifier?
RQ2: How do these improvements change with properties of the task, e.g., the cost of mistakes (β)?
RQ3: How do these improvements change with properties of the dataset, e.g., with the data distribution or dimensionality?

Metrics and Datasets. We compared the utility achieved by two models: the most accurate classifier, trained using log-loss, and a classifier optimized using team-loss, on the datasets described in Table 2. We experimented with two synthetic datasets and four real-world datasets with high stakes. The real datasets are from domains that are known to, or already, deploy AI to assist human decision makers.
Dataset    | Features | Examples | Positive fraction
Scenario1  | 2        | 10000    | 0.43
Moons      | 2        | 10000    | 0.50
German     | 24       | 1000     | 0.30
Fico       | 39       | 9861     | 0.52
Recidivism | 13       | 6172     | 0.46
Mimic      | 714      | 21139    | 0.13

Table 2: We used two synthetic datasets (Scenario1, Moons) and four real-world datasets from high-stakes domains that are known to be used in AI-assisted decision-making settings. The Mimic dataset has the most class imbalance.

The Scenario1 dataset refers to a dataset we created by sampling 10000 points from the data distribution described in Figure 1.
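The Moons row of Table 2 is consistent with scikit-learn's two-moons generator; assuming that is the source (the paper does not say), a comparable dataset can be produced as follows:

```python
from sklearn.datasets import make_moons

# 10,000 points in two interleaving half-circles with balanced labels,
# matching the Moons row of Table 2; the noise level is our guess.
X, y = make_moons(n_samples=10000, noise=0.2, random_state=0)
```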
Training Procedure. We experimented with two models: logistic regression and a multi-layer perceptron (two hidden layers with 50 and 100 units). For each task (defined by a choice of task parameters, dataset, model, and loss) we optimized the loss using stochastic gradient descent (SGD) and used standard, well-known training practices such as regularization, check-pointing, and learning-rate schedulers. We selected the best hyper-parameters using 5-fold cross-validation, including values for the learning rate, the batch size, the patience and decay factor of the learning-rate scheduler, and the weight of the L2 regularizer.

In our initial experiments training with team-loss using SGD, we observed that the classifier's loss would never decrease and in fact remained constant. This happened because, in practice, random initializations resulted in classifiers that were uncertain on all examples. And since, by definition, team-loss is flat in these uncertain regions (Figure 4b), the gradients were zero and uninformative. To overcome this issue, we initialized the classifiers with the (already converged) most accurate classifier.
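A minimal sketch of this two-stage recipe (our code: model, train_loader, and the hyper-parameters are placeholders, and team_loss refers to the sketch in Section 2):

```python
import copy
import torch
import torch.nn as nn

def fit(model, loader, loss_fn, epochs=20, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for X, y in loader:
            opt.zero_grad()
            loss_fn(model(X), y).backward()
            opt.step()
    return model

# Stage 1: train the most accurate classifier with ordinary log-loss.
base = fit(model, train_loader, nn.CrossEntropyLoss())

# Stage 2: warm-start from the converged model and fine-tune with team-loss,
# so optimization does not begin on the flat part of the loss surface.
team = fit(copy.deepcopy(base), train_loader, team_loss)
```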
RQ1: Experiments showed that when we used team-loss, the magnitude of the improvements in team utility over log-loss varied across datasets but the improvements were consistently observed (Table 3). We observed that team-loss often sacrifices classifier accuracy to improve team utility, the more desirable metric. For the linear classifier, this sacrifice is especially large on the synthetic datasets: Scenario1 (16%) and Moons (1%). For the MLP, team-loss sacrifices 2% accuracy to improve team utility. (Two implementation notes: since utility can be negative, before computing its logarithm we shift the utility function up, by subtracting its minimum value; and we report absolute rather than percentage improvements in utility because utility can be negative.)

While the metrics in Table 3 (change in accuracy and utility) provide a global understanding of the effect of team-loss, they do not explain how team-loss achieved the improvements or whether the behavior of the new models is consistent with intuition. Figure 5 visualizes the difference in behavior (averaged over 150 seeds) between the classifiers produced by log-loss and team-loss on the Scenario1 dataset.
Model  | Dataset   | Acc LL | Util LL | ΔAcc   | ΔUtil
Linear | Scenario1 | 0.86   | 0.59    | -0.16  | 0.165
Linear | Moons     | 0.89   | 0.81    | -0.01  | 0.020
Linear | German    | 0.75   | 0.61    | -0.004 | 0.009
Linear | Mimic     | 0.88   | 0.80    | -0.000 | 0.001
Linear | Recid     | 0.68   | 0.53    | 0.000  | 0.000
Linear | Fico      | 0.73   | 0.58    | 0.000  | -0.000
MLP    | Fico      | 0.72   | 0.56    | 0.01   | 0.018
MLP    | Scenario1 | 0.98   | 0.84    | -0.04  | 0.008
MLP    | Moons     | 1.00   | 0.99    | 0.00   | 0.007
MLP    | German    | 0.74   | 0.61    | -0.02  | 0.003
MLP    | Mimic     | 0.88   | 0.80    | 0.00   | 0.003
MLP    | Recid     | 0.67   | 0.52    | -0.00  | 0.001

Table 3: Differences in performance (accuracy and utility) of team-loss and log-loss for all datasets (averaged over 150 runs); LL indicates log-loss. Datasets are sorted in descending order of improvement in utility, and the analysis is divided by classifier type, linear and multi-layer perceptron. We observe that team-loss often sacrifices accuracy to improve utility. While the gains in utility are small, they are consistently observed across datasets.
Specifically, as shown in Figure 5, we visualize and compare:

V1. Calibration, using reliability curves, which compare system confidence with true accuracy. A perfectly calibrated system, for example, will be 80% accurate on regions where it is 80% confident. In practice, however, systems may be over- or under-confident.

V2. Distributions of confidence in predictions. For example, in Figure 5, team-loss makes more high-confidence predictions than log-loss.

V3. The fraction of total system accuracy contributed by different regions (of confidence values). The area under this curve indicates the system's total accuracy. Note that for our setup the area under the curve in the Accept region is more crucial than the area in the Solve region, since in the latter the human is expected to take over.

V4. Similar to (V3), the fourth sub-graph shows the fraction of total system utility contributed by different regions of confidence.

Figure 5: Differences between the behavior of linear classifiers learned using log-loss and team-loss on the Scenario1 and Moons datasets (averaged over 150 runs). Scenario1: team-loss 1) sacrifices accuracy in the Solve region, 2) makes fewer predictions in the Solve region and more high-confidence predictions in the right half of the Accept region (annotated as X), 3) reduces the contribution to system accuracy from the Solve region and increases it from the Accept region, and 4) results in a higher area under the curve, indicating an increase in overall utility. Moons: team-loss 1) improves accuracy in the Accept region, and 2) makes fewer very-high-confidence predictions (marked as Y) and more moderately-high-confidence predictions in the Accept region. Figure 6 shows similar visualizations for the real-world datasets.

If team-loss had not resulted in predictions different from log-loss, the curves in Figure 5 for the two loss functions would be indistinguishable. However, we observed that team-loss results in dramatically different predictions than log-loss. In fact, we noticed two types of behavior when team-loss improved utility:

B1. The first type of behavior was observed on the Scenario1 dataset (Figure 5) and is easier to understand, as it matches the intuition we set out in the beginning: the classifier trained with team-loss sacrifices accuracy on the uncertain examples in the Solve region to make more high-confidence predictions in the Accept region. This change improves system accuracy in the Accept region, which is where system accuracy matters and contributes to team utility. Later, we show that this same behavior is observed on the German dataset (Figure 6).

B2. The second type of behavior was observed on Moons (Figure 5), where the new loss increases accuracy in the Accept region at the cost of making fewer very-high-confidence predictions (e.g., when confidence is greater than 0.95, in the region marked Y). This change improves utility because the system's accuracy in the Accept region matters more than making very-high-confidence predictions.

In both behaviors, team-loss effectively increases the contribution to AI accuracy from the Accept region, i.e., the region where the AI's performance provides value to the team. In contrast, log-loss has no such considerations. Figure 6 shows a similar analysis on the real datasets for both the linear and MLP classifiers. When team-loss improves utility, we see one of the two behaviors described above.
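The bookkeeping behind these plots is straightforward to reproduce; the sketch below (our code, with illustrative defaults) splits test predictions at the threshold of Equation 3 and measures each region's contribution:

```python
import numpy as np

def region_report(probs, y, beta=1.0, lam=0.5, a=1.0):
    """probs: (N, C) predicted probabilities; y: (N,) true labels."""
    conf = probs.max(axis=1)
    correct = probs.argmax(axis=1) == y
    accept = conf >= a - lam / (1.0 + beta)     # Accept region (Equation 3)

    # Per-example utility: payoff matrix of Figure 3 in the Accept region,
    # and the human's expected utility in the Solve region.
    util = np.where(accept,
                    np.where(correct, 1.0, -beta),
                    a * (1.0 - lam) + (1.0 - a) * (-beta - lam))
    return {
        "accept_fraction": accept.mean(),
        "accept_accuracy": correct[accept].mean() if accept.any() else float("nan"),
        "mean_utility": util.mean(),
    }
```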
RQ2: Since the penalty of a mistake may be task-dependent (e.g., an incorrect diagnosis may be costlier than an incorrect loan approval), we varied the mistake penalty β to study its effect on the improvements from team-loss. Our experiments (Figure 7) showed that the difference in utilities depends on the cost of a mistake, and the highest difference is observed at a different value of β for each dataset. We also observed that, for our setup, as the mistake penalty increases, log-loss may achieve lower performance than the human-only baseline, so deploying automation is undesirable in these cases. For example, on Fico with β = 7, the linear model learned using log-loss achieves lower performance than the human baseline. Similarly, on Mimic with β = 500, deploying the MLP learned using log-loss is undesirable.

Figure 6: Comparison of the predictions of log-loss and team-loss on the real-world datasets when team-loss improves utility (150 seeds). (a) Linear classifier: On German, we observed B1, where team-loss, compared to log-loss, preserved accuracy and made more predictions in the Accept region, and sacrificed accuracy and prediction mass in the Solve region. In contrast, on Mimic, we observed B2, where team-loss increased accuracy in the Accept region but made fewer very-high-confidence predictions (e.g., confidence > 0.9). (b) MLP classifier: On Fico, we observed behavior B2, similar to Moons (Figure 5), where team-loss increased accuracy in the Accept region and reduced the number of very-high-confidence predictions (as for the linear classifier on Moons). In contrast, on the German dataset, we observed behavior B1, similar to the Scenario1 dataset (Figure 5), where team-loss sacrificed accuracy in the Solve region and increased the number of predictions in the Accept region.

Figure 7: Comparison of utility achieved by the two loss functions. Across values of β, in most datasets team-loss achieves higher utility than log-loss; however, the β value that results in the highest relative improvement differs across datasets. Interestingly, when log-loss results in lower utility than a human-only baseline (indicated by the dotted line), e.g., as seen on Recidivism as the penalty increases, team-loss still attempts to nudge its utility toward the human baseline.
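For reference, the dotted human-only baseline in Figure 7 is the constant Solve utility from Equation 7, which one can tabulate directly (the grid and parameter values below are illustrative, not the paper's):

```python
# Utility when the human solves every instance: (1 + beta) * a - beta - lam.
def human_only_utility(beta, lam=0.5, a=0.9):
    return (1 + beta) * a - beta - lam

for beta in [1, 2, 5, 7, 10]:
    print(beta, human_only_utility(beta))   # decreases as mistakes get costlier
```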
Model  | Dataset  | Acc LL | Util LL | ΔAcc  | ΔUtil
Linear | German-b | 0.72   | 0.56    | -0.01 | 0.004
Linear | Mimic-b  | 0.77   | 0.65    | 0.00  | –
MLP    | German-b | 0.74   | 0.57    | 0.00  | 0.024
MLP    | Mimic-b  | 0.93   | 0.87    | 0.00  | 0.002

Table 4: Performance on the German and Mimic datasets after correcting class imbalance. Bold (in the original) indicates settings where balancing the dataset improved the gains in utility compared to its original version. We observed that for the MLP, after balancing the German dataset, the gains in utility improved substantially, from 0.003 (see Table 3) to 0.024.
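The balancing step can be as simple as random over-sampling of the minority class; the paper does not specify its exact resampling method, so the following is one plausible sketch:

```python
import numpy as np

def oversample_positives(X, y, seed=0):
    """Duplicate random positive examples until classes are equally frequent.

    Assumes the positive class (y == 1) is the minority, as in the German
    and Mimic datasets (Table 2).
    """
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```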
RQ3: Since the gains from using team-loss were small and varied across datasets, we conducted experiments to investigate properties of the dataset that may have affected these improvements. While there are many properties of a dataset one could investigate, we studied the following two:

Figure 8: Relative performance of team-loss on the Moons dataset for the linear classifier as we varied the data distribution and moved more points towards the edges. "Fraction Moved" indicates the fraction of the total number of points that were moved towards the overlapping edges of the two moons.

1. Data distribution: On the Moons dataset, we observed that the linear model trained with team-loss increased utility by increasing confidence on the examples on the outer edges of the moons enough to move these examples to the Accept region. So, to test whether a different data distribution would benefit more from using team-loss, we created additional versions of Moons by systematically moving points from the middle of the circle towards its edges. Figure 8 shows the improvements in utility as we moved more data.

2. Class imbalance: While most of our datasets were balanced, German and Mimic had a lower percentage of positive instances (see Table 2). We conducted experiments on balanced versions of these two datasets to understand whether class imbalance affected our observations in the previous experiments. Table 4 shows the performance after we over-sampled the positive class to adjust for class imbalance in the two datasets. We observed that in both cases correcting the class imbalance increased the improvement from using team-loss.

We also examined the dimensionality of the datasets. Since team-loss may be harder to optimize than log-loss, an increase in data dimensions may affect the optimizer's (in this case SGD's) ability to optimize the team-loss objective. We also experimented with Adam as the optimizer; unfortunately, it did not provide any benefits. For a given dataset, we varied its dimensionality by using only a subset of features. However, we did not notice any correlation between dimensionality and improvements in utility.
4 Discussion

We conjecture two reasons for the small gains in utility on the real datasets when using team-loss: either there is no scope for improving utility on those dataset and model pairs, or our current optimization procedures are ineffective. Since we do not know the optimal (utility) solution for a given dataset and model, we cannot verify or reject the first conjecture. However, the results on the two synthetic datasets suggest the existence of situations where there is a significant gap between the utilities achieved using team-loss and log-loss.

It is possible that our current optimization procedures are ineffective for optimizing team-loss. One reason this might happen is that team-loss is more complex than log-loss: it introduces new plateaus in the loss surface and thus may increase the chances that optimization methods such as stochastic gradient descent get stuck in local minima. In fact, in our experiments, we observed that on the datasets where team-loss did not increase utility, it resulted in predictions identical to log-loss. This may happen, for example, if the most accurate classifier is a local minimum. Since we use the most accurate classifier to initialize the optimization of team-loss, this suggests that further optimization with the new loss did not manage to escape the potential local minimum.

While we propose a solution for a simplified form of human-AI teamwork (see the assumptions in Section 2), our observations have implications for human-AI teams in general. If we cannot optimize utility for our simplified case, it may be even harder to optimize utility in scenarios where users make Accept and Solve decisions using a richer, more complex mental model rather than relying on model confidence alone. Such scenarios are common when the system confidence is an unreliable indicator of performance (e.g., due to poor calibration) and, as a result, the user develops an understanding of system failures in terms of domain features. For example, Tesla drivers often override the Autopilot using features such as road and weather conditions. We can reduce this case, where users have a complex mental model, to the one we studied: specifically, we can construct a loss function that is constant when a prediction belongs to the Solve region described by the user's mental model, and is log-loss otherwise. This case may be harder to optimize because the resulting loss surface will contain more complex combinations of plateaus and local optima than the one we considered.
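To make the reduction concrete, here is a hypothetical sketch (the predicate, names, and constant are ours): given any function encoding the user's mental model over domain features, the generalized loss is log-loss where the user relies on the AI and a constant elsewhere.

```python
import torch
import torch.nn.functional as F

def mental_model_loss(logits, targets, X, solve_region, const=1.0):
    """Log-loss where the user relies on the AI; constant where they Solve.

    solve_region(X) -> (N,) boolean tensor encoding the user's mental model,
    e.g., built from domain features such as road or weather conditions.
    """
    nll = F.cross_entropy(logits, targets, reduction="none")
    return torch.where(solve_region(X), torch.full_like(nll, const), nll).mean()
```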
Figure 9: Difference between the predictions of team-loss and log-loss as we varied the data distribution of the Moons dataset. As we moved more points towards the outer edges of the moons, the behavior of team-loss changed from B2 to a combination of B2 and B1; for example, when 50% of the points were moved, team-loss both sacrificed accuracy in the Solve region and improved accuracy in the Accept region.

5 Related Work

Our approach is closely related to maximum-margin classifiers, such as an SVM optimized with the hinge loss [Burges, 1998], where a larger soft margin can be used to make high-confidence and accurate predictions. However, unlike our approach, it is not possible to directly plug the domain's payoff matrix (e.g., Figure 3) into such a model. Furthermore, the SVM's output and margin do not have an immediate probabilistic interpretation, which is crucial for our problem setting. One possible (though computationally intensive) solution direction is to convert the margin into probabilities, e.g., using post-hoc calibration such as Platt scaling [Platt, 1999], and to use cross-validation to select margin parameters that optimize team utility. While it is still an open question whether such an approach would be effective for SVM classifiers, in this work we focused our attention on gradient-based optimization.

Another related problem is cost-sensitive learning, where different mistakes incur different penalties; for example, false negatives may be costlier than false positives [Zadrozny et al., 2003]. A common solution here is up-weighting the inputs where mistakes are costlier. Also relevant is work on importance-based learning, where re-weighting helps learn from imbalanced data or speed up training. However, in our setup, re-weighting the inputs makes less sense: the weights would depend on the classifier's output, which has not been trained yet. An iterative approach may be possible, but our initial analysis showed that this approach is prone to oscillations, where the classifier may never converge. We leave exploring this avenue to future work.

A fundamental line of work that renders AI predictions more actionable (for humans) and better suited for teaming is confidence calibration, for example, using Bayesian models [Ghahramani, 2015; Beach, 1975; Gal and Ghahramani, 2016] or via post-hoc calibration [Platt, 1999; Zadrozny and Elkan, 2001; Guo et al., 2017; Niculescu-Mizil and Caruana, 2005]. A key difference between these methods and our approach is that team-loss re-trains the model to improve on the inputs on which users are more likely to rely on the AI's predictions. The same contrast distinguishes our approach from outlier-detection techniques [Hendrycks et al., 2018; Lee et al., 2017; Hodge and Austin, 2004].

More recent work that adjusts model behavior to accommodate collaboration is backward compatibility for AI [Bansal et al., 2019b], where the model considers user interactions with a previous version of the system to preserve trust across updates. Recent user studies showed that when users develop mental models of a system's mistakes, properties other than accuracy are also desirable for successful collaboration, for example, parsimonious and deterministic error boundaries [Bansal et al., 2019a]. Our approach is a first step towards implementing these desiderata within machine-learning optimization itself. Other approaches to human-centered optimization regularize or constrain model optimization for other human-centered requirements such as interpretability [Wu et al., 2019; Wu et al., 2018] or fairness [Jung et al., 2019; Zafar et al., 2015].
6 Conclusion

We studied the problem of training classifiers that optimize team performance, a metric that matters more for collaboration than mere automation accuracy. To support direct optimization of team performance, we devised a new loss function based on the expected utility of the human-AI team for decision making. Thorough investigations and visualizations of classifier behavior before and after optimizing with team-loss show that, when such optimization is effective, team-loss can fundamentally change model behavior and improve team utility. The changes in model behavior include either (i) sacrificing model accuracy in low-confidence regions for more accurate high-confidence predictions, or (ii) increasing accuracy in the Accept region through more accurate predictions but fewer highly confident ones. Such behaviors were observed on synthetic and real-world datasets from domains where AI is known to be employed in support of human decision makers. However, we also report that current optimization techniques were not always effective; in fact, sometimes they did not change model behavior at all, i.e., models remained identical even after fine-tuning with team-loss. Since team-loss clearly surfaces optimization challenges, mostly related to its flat curvature and potential local minima in the Solve region, we invite future work on machine-learning optimization and human-AI collaboration to jointly approach such challenges at the intersection of both fields.
Acknowledgments

This material is based upon work supported by ONR grant N00014-18-1-2193, the University of Washington WRF/Cable Professorship, the Allen Institute for Artificial Intelligence (AI2), and Microsoft Research. The authors thank Rich Caruana, Bryan Wilder, and Zeyuan Allen-Zhu for useful discussions and comments.

References

[Bansal et al., 2019a] Gagan Bansal, Besmira Nushi, Ece Kamar, Walter S. Lasecki, Daniel S. Weld, and Eric Horvitz. Beyond accuracy: The role of mental models in human-AI team performance. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, pages 2–11, 2019.

[Bansal et al., 2019b] Gagan Bansal, Besmira Nushi, Ece Kamar, Daniel S. Weld, Walter S. Lasecki, and Eric Horvitz. Updates in human-AI teams: Understanding and addressing the performance/compatibility tradeoff. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2429–2437, 2019.

[Beach, 1975] Barbara Heinrich Beach. Expert judgment about uncertainty: Bayesian decision making in realistic settings. Organizational Behavior and Human Performance, 14(1):10–59, 1975.

[Burges, 1998] Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[Caruana et al., 2015] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In KDD, 2015.

[Gal and Ghahramani, 2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.

[GDPR, 2020] GDPR. Art. 22 GDPR, automated individual decision-making, including profiling. https://gdpr-info.eu/art-22-gdpr/, 2020. [Online; accessed 14-January-2020].

[Ghahramani, 2015] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452, 2015.

[Guo et al., 2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330. JMLR.org, 2017.

[Hendrycks and Gimpel, 2018] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv:1610.02136v3, 2018.

[Hendrycks et al., 2018] Dan Hendrycks, Mantas Mazeika, and Thomas G. Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606, 2018.

[Hodge and Austin, 2004] Victoria Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.

[Jung et al., 2019] Christopher Jung, Michael Kearns, Seth Neel, Aaron Roth, Logan Stapleton, and Zhiwei Steven Wu. Eliciting and enforcing subjective individual fairness. arXiv preprint arXiv:1905.10660, 2019.

[Kamar et al., 2012] Ece Kamar, Severin Hacker, and Eric Horvitz. Combining human and machine intelligence in large-scale crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, pages 467–474. International Foundation for Autonomous Agents and Multiagent Systems, 2012.

[Kleinberg et al., 2018] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1):237–293, 2018.

[Lee et al., 2017] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017.

[Niculescu-Mizil and Caruana, 2005] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632. ACM, 2005.

[Patel et al., 2019] Bhavik N. Patel, Louis Rosenberg, Gregg Willcox, David Baltaxe, Mimi Lyons, Jeremy Irvin, Pranav Rajpurkar, Timothy Amrhein, Rajan Gupta, Safwan Halabi, et al. Human-machine partnership with artificial intelligence for chest radiograph diagnosis. NPJ Digital Medicine, 2(1):1–10, 2019.

[Platt, 1999] John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

[Ribeiro et al., 2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proc. of KDD, 2016.

[Weld and Bansal, 2019] Daniel S. Weld and Gagan Bansal. The challenge of crafting intelligible intelligence. Communications of the ACM, 62:70–79, 2019.

[Wu et al., 2018] Mike Wu, Michael C. Hughes, Sonali Parbhoo, Maurizio Zazzi, Volker Roth, and Finale Doshi-Velez. Beyond sparsity: Tree regularization of deep models for interpretability. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[Wu et al., 2019] Mike Wu, Sonali Parbhoo, Michael Hughes, Ryan Kindle, Leo Celi, Maurizio Zazzi, Volker Roth, and Finale Doshi-Velez. Regional tree regularization for interpretability in black box models. arXiv preprint arXiv:1908.04494, 2019.

[Zadrozny and Elkan, 2001] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, volume 1, pages 609–616. Citeseer, 2001.

[Zadrozny et al., 2003] Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-proportionate example weighting. In ICDM, volume 3, page 435, 2003.

[Zafar et al., 2015] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification. arXiv preprint arXiv:1507.05259, 2015.
Appendix A: Relaxing the Rationality Assumption

In Section 2 we assumed that the user acts rationally when making the meta-decision. We now relax this assumption and assume that with a small probability ε ∼ B(γ1, γ2) the user may (uniformly) randomly choose between Accept and Solve. Then, extending Equation 4, the user will Accept the system's recommendation with probability:

   P_ε(m = A) = 1 − ε    if h(x)[ŷ] ≥ c(β, λ, a)
              = ε        otherwise    (9)

In the above equation, when the model is confident, the probability decreases by ε because the user may decide to Solve. Similarly, when the model is not confident, the increase to ε (compared to Equation 4) indicates that the user may randomly decide to Accept an uncertain recommendation.

To simplify deriving the new equation for the expected utility, we re-write Equation 6 as:

   ψ(x, y) = ψ_A(x, y)    if h(x)[ŷ] ≥ c(β, λ, a)
           = ψ_S(x, y)    otherwise    (10)

Using the above two equations, we obtain the following equation for the expected utility when the user is not perfectly rational:

   ψ_ε(x, y) = (1 − ε) · ψ_A(x, y) + ε · ψ_S(x, y)    if h(x)[ŷ] ≥ c(β, λ, a)
             = (1 − ε) · ψ_S(x, y) + ε · ψ_A(x, y)    otherwise    (11)

The above equation says that, when the system is confident, instead of always obtaining ψ_A as in Equation 10, with a small probability ε the user obtains the expected utility associated with a Solve action. Similarly, when the system is uncertain, the user may sometimes obtain the expected utility associated with an Accept action. Qualitatively, this results in a worse best-case expected utility, an artifact of the user making a sub-optimal decision (to Solve) when automation would yield the highest utility. Similarly, the expected utility in the Solve region also decreases: the user may Accept uncertain recommendations. On the other hand, this improves the worst-case utility: the noisy user avoids some high-confidence mistakes that a rational user would not. However, unlike ψ, ψ_ε is strictly monotonic: ψ_A is a linear function and hence strictly monotonic, and the sum of a strictly monotonic function and a constant is strictly monotonic.
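A one-function sketch of Equation 11 (names are ours), useful for numerically checking the qualitative claims above:

```python
def noisy_team_utility(psi_accept, psi_solve, confident, eps):
    """Expected utility under an eps-noisy user (Equation 11).

    psi_accept, psi_solve: branch utilities from Equation 10
    confident: whether h(x)[y_hat] >= c(beta, lam, a)
    eps: probability of deviating from the rational meta-decision
    """
    if confident:
        return (1 - eps) * psi_accept + eps * psi_solve
    return (1 - eps) * psi_solve + eps * psi_accept
```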