Is the Most Accurate AI the Best Teammate? Optimizing AI for Teamwork
Gagan Bansal, Besmira Nushi, Ece Kamar, Eric Horvitz, Daniel S. Weld
University of Washington | Microsoft Research | Allen Institute for Artificial Intelligence
Abstract
In many high-stakes domains such as criminal justice, finance, and healthcare, AI systems may recommend actions to a human expert responsible for final decisions, a context known as AI-advised decision making. When AI practitioners deploy the most accurate system in these domains, they implicitly assume that the system will function alone in the world. We argue that the most accurate AI is not necessarily the best teammate; for example, predictable performance is worth a slight sacrifice in AI accuracy. So, we propose training AI systems in a human-centered manner and directly optimizing for team performance. We study this proposal for a specific type of human-AI team, where the human overseer chooses to accept the AI recommendation or solve the task themselves. To optimize team performance, we maximize the team's expected utility, expressed in terms of the quality of the final decision, the cost of verifying, and the individual accuracies. Our experiments with linear and non-linear models on real-world, high-stakes datasets show that the improvements in utility, while small and varying across datasets and parameters (such as the cost of a mistake), are real and consistent with our definition of team utility. We discuss the shortcomings of current optimization approaches beyond well-studied loss functions such as log-loss, and encourage future work on human-centered optimization problems motivated by human-AI collaboration.
1 Introduction

Increasingly, humans work collaboratively with an AI teammate, for example, because the team may perform better than either the AI or the human alone [Nagar and Malone, 2011; Patel et al., 2019; Kamar et al., 2012], or because legal requirements may prohibit complete automation [GDPR, 2020; Nickelsburg, 2020]. For human-AI teams, just like for any team, optimizing the performance of the whole team is more important than optimizing the performance of an individual member.
Figure 1: In a human-AI team, a more accurate classifier (h1, left pane, learned using log-loss) may produce lower team utility than a less accurate one (h2, right pane). Suppose the human can either quickly accept the AI's recommendation or solve the task themselves, incurring a cost λ, to yield a more reliable result. The payoff matrix describes the utility of different outcomes. One optimal policy is for the human to accept recommendations when the AI is confident, but verify uncertain predictions (shown in the light grey region surrounding each hyperplane). While h2 is less accurate than h1 (because B is incorrectly classified), it results in a higher team utility: since h2 moved A outside the verify region, there are more correctly classified inputs on which the user can rely on the system.

Yet, to date, the AI community has for the most part focused on maximizing the individual accuracy of machine-learning models. This raises an important question: Is the most accurate AI the best possible teammate for a human? We argue that the answer is "No." We show this formally, but the intuition is simple. Consider human-human teams: is the best-ranked tennis player necessarily the best doubles teammate?
Clearly not: teamwork puts additional demands on participants besides high individual performance, such as the ability to complement and coordinate with one's partner. Similarly, creating high-performing human-AI teams may require training AI that exhibits additional human-centered properties that facilitate trust and delegation. Implicitly, this is the motivation behind much work on intelligible AI [Caruana et al., 2015; Weld and Bansal, 2019] and post-hoc explainable AI [Ribeiro et al., 2016], but we suggest that directly modeling the collaborative process may offer additional benefits.

Recent work emphasized the importance of better understanding how people transform AI recommendations into decisions [Kleinberg et al., 2018]. For instance, consider scenarios where a system outputs a recommendation on which it is uncertain. A rational user is likely to distrust such recommendations, since erroneous recommendations are often correlated with low confidence in the prediction [Hendrycks and Gimpel, 2018]. In this work we assume that the user will discard the recommendation and solve the task themselves, after incurring a cost (e.g., due to additional human effort). As a result, team performance depends on the AI accuracy only in the accept region, i.e., the region where a user is actually likely to rely on the AI. Thus, the singular objective of optimizing for AI accuracy (e.g., using log-loss) may hurt team performance when the model has a fixed inductive bias; team performance will instead benefit from improving the AI in the accept regions in Figure 1. While there exist other aspects of collaboration that can also be addressed via optimization techniques, such as model interpretability, supporting complementary skills, or enabling learning among partners, the problem we address in this paper is to account for team-based utility as a basis for collaboration.

In sum, we make the following contributions:

1. We highlight a novel, important problem in the field of human-centered artificial intelligence: the most accurate ML model may not lead to the highest team utility when paired with a human overseer.

2. We show that log-loss, the most popular loss function, is insufficient (as it ignores team utility) and develop a new loss function, team-loss, which overcomes its issues by calculating a team's expected utility.

3. We present experiments on multiple real-world datasets that compare the gains in utility achieved by team-loss and log-loss. We observed that while the gains are small and vary across datasets, they reflect the behavior encoded in the loss. We present further analysis to understand how and when team-loss results in a higher utility, for example, as a function of domain parameters such as the cost of a mistake.
2 AI-Advised Decision Making

We focus on a special case of AI-advised decision making where a classifier h gives recommendations to a human decision maker to help make decisions (Figure 2a). If h(x) denotes the classifier's output, a probability distribution over Y, the recommendation r consists of a label ŷ = arg max h(x) and a confidence value max h(x), i.e., r := (ŷ, max h(x)). Using this recommendation, the user computes a final decision d. The environment, in response, returns a utility that depends on the quality of the final decision and any cost incurred due to human effort. Let U denote the utility function. If the team classifies a sequence of instances, the objective of the team is to maximize the cumulative utility. Before deriving a closed-form equation for the objective, we characterize the form of the human-AI collaboration along with our assumptions. We study this particular, simple setting as a first step to explore the opportunities and challenges in team-centric optimization. If we cannot optimize for this simple setting, it may be much harder to optimize for more complex scenarios (discussed more in Section 4).

Figure 2: (a) A schematic of AI-advised decision making. (b) To make a decision, the human decision maker either accepts or overrides a recommendation. The Solve meta-decision is costlier than Accept.

Symbol | Description
a ∈ [0, 1] | Human accuracy
β ∈ R+ | Cost of mistake
h(x)[ŷ] = max h(x) | Confidence in the predicted label
d : X × R → Y | Human decision maker
h : X → [0, 1]^{|Y|} | Classifier
H | Classifier hypothesis space
λ ∈ R+ | Cost of human effort to Solve
m : X × R → M | Meta-decision function
M := {Accept, Solve} | Meta-decision space
ψ : H × X × Y → R | Expected team utility
r ∈ R | Recommendation
R := Y × [0, 1] | Recommendation space
U : M × Y → R | Utility function
X | Feature space
ŷ ∈ Y | Recommended label
Y | Label space

Table 1: Notation.

1. User either accepts the recommendation or solves the task themselves: The human computes the final decision by first making a meta-decision: Accept or Solve (Figure 2b). Accept passes off the recommendation as the final decision. In contrast, Solve ignores the recommendation and the user computes the final decision themselves. Let m denote the function that maps an input instance and a recommendation to a meta-decision in M = {Accept, Solve}. As a result, the optimal classifier h* maximizes the team's expected utility:

   h* = arg max_{h ∈ H} E_{x,y}[U(m, d)]    (1)

2. Mistakes are costly: A correct decision results in unit reward, whereas an incorrect decision results in a penalty β ≥ 0.

3. Solving the task is costly: Since it takes time and effort for the human to perform the task themselves (e.g., cognitive effort), we assume that the Solve meta-decision costs more than Accept. Further, without loss of generality, we assume λ units of cost to Solve and zero cost to Accept.

Using the above assumptions, we obtain the utility function shown in Figure 3; the value in each cell originates from subtracting the cost of the action from the environment's reward.

Meta-decision / Decision | Correct | Incorrect
Accept [A] | 1 | −β
Solve [S] | 1 − λ | −β − λ

Figure 3: Team utility w.r.t. meta-decision and decision accuracy.

4. Human is uniformly accurate: Let a ∈ [0, 1] denote the conditional probability that, if the user solves the task, they will make the correct decision, i.e.,

   P(d = y | m = S) = a    (2)

5. Human is rational: The user makes the meta-decision by comparing expected utilities. Further, the user trusts the classifier's confidence as an accurate indicator of the recommendation's reliability. As a result, the user will choose Accept if and only if the expected utility of accepting is at least that of solving:

   E[U(m = A)] ≥ E[U(m = S)]
   h(x)[ŷ] − (1 − h(x)[ŷ]) · β ≥ a − (1 − a) · β − λ
   h(x)[ŷ] ≥ a − λ / (1 + β)

Let c(β, λ, a) denote the minimum value of system confidence for which the user's meta-decision is Accept:

   c(β, λ, a) = a − λ / (1 + β)    (3)

This implies that the human follows a threshold-based policy when making meta-decisions:

   P(m = A) = 1    if h(x)[ŷ] ≥ c(β, λ, a)
            = 0    otherwise    (4)
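For intuition, consider illustrative parameter values of our choosing (they do not come from the paper's experiments): a human accuracy of a = 0.9, a Solve cost of λ = 0.2, and a mistake penalty of β = 1. Equation 3 then gives

   c(β, λ, a) = 0.9 − 0.2 / (1 + 1) = 0.8,

so a rational user accepts any recommendation made with confidence at least 0.8 and solves the task otherwise. A cheaper Solve action (smaller λ) or a more accurate human (larger a) raises the bar the classifier must clear.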
We now derive the expected utility of recommendations. Let ψ denote the expected utility of the classifier h paired with a decision maker d:

   ψ(x, y) = E[U(m, d)]
           = P(m = A) · [ P(d = y | m = A) · 1 + P(d ≠ y | m = A) · (−β) ]
             + P(m = S) · [ P(d = y | m = S) · (1 − λ) + P(d ≠ y | m = S) · (−β − λ) ]

Since upon Accept the human returns the classifier's recommendation, the probability that the final decision is correct is the same as the classifier's predicted probability of the correct decision:

   P(d = y | m = A) = h(x)[y]    (5)

Using Equations 2 and 5, we obtain the following equation for the expected utility of the team:

   ψ(x, y) = P(m = A) · [ (1 + β) · h(x)[y] − β ] + P(m = S) · [ (1 + β) · a − β − λ ]
           = P(m = A) · [ (1 + β) · (h(x)[y] − a) + λ ] + [ (1 + β) · a − β − λ ]    (6)

where the second term is a constant. Using Equations 4 and 6, we obtain the following expression for the expected utility:

   ψ(x, y) = (1 + β) · h(x)[y] − β    if h(x)[ŷ] ≥ c(β, λ, a)
           = (1 + β) · a − β − λ      otherwise    (7)

Figure 4a visualizes the expected team utility of the classifier's predictions as a function of confidence in the true label.

Figure 4: Visualization of expected utility and loss, for the case λ = 0.5, β = 1, and a = 1 (i.e., the human is perfectly accurate but it costs them half a unit of utility to solve the task). (a) In the Accept region the expected team utility is equal to the expected automation utility, while in the Solve region it is the same as the human utility. The negative team utility in the left-most region indicates scenarios where the AI gives high-confidence but incorrect recommendations to the human. (b) In the Accept region, team-loss behaves similarly to log-loss; however, in the Solve region it results in a constant loss.

Since gradient descent-based minimization of loss functions is common in machine learning, we transform the expected utility ψ into a loss function by negating it. We call this new loss function team-loss:

   L_team(x, y) = −log(ψ(x, y))    (8)

We take a logarithm before negating the utility to allow comparisons with log-loss, where the logarithmic nature of the loss is known to benefit optimization, for example, by heavily penalizing high-confidence mistakes. Figure 4b visualizes this new loss function.
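For concreteness, the following is a minimal PyTorch sketch of Equations 3, 7, and 8. The paper provides no reference implementation, so the function and parameter names, and the exact shift applied before the logarithm, are our choices, following the implementation footnote in Section 3.

```python
import torch

def team_loss(logits, targets, beta=1.0, lam=0.5, a=1.0, eps=1e-6):
    """Negative log of the expected team utility (Equations 7-8); a sketch.

    logits:  (N, C) unnormalized classifier scores
    targets: (N,)   integer ground-truth labels
    """
    probs = torch.softmax(logits, dim=1)
    conf = probs.max(dim=1).values                        # h(x)[y_hat]
    p_true = probs[torch.arange(len(targets)), targets]   # h(x)[y]

    c = a - lam / (1.0 + beta)                            # Accept threshold (Equation 3)
    accept_util = (1.0 + beta) * p_true - beta            # utility when the user Accepts
    solve_util = (1.0 + beta) * a - beta - lam            # constant utility when they Solve
    util = torch.where(conf >= c, accept_util,
                       torch.full_like(accept_util, solve_util))

    # Utility can be as low as -(beta + lam), so shift it above zero
    # before taking the log (see the implementation footnote in Section 3).
    return -torch.log(util + beta + lam + eps).mean()
```

Note that every example routed to the Solve branch receives a constant loss and hence contributes zero gradient, which foreshadows the optimization difficulty reported in Section 3.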
3 Experiments

We conducted experiments to answer the following research questions:

RQ1: Does the new loss function result in a classifier that improves team utility over the most accurate classifier?
RQ2: How do these improvements change with properties of the task, e.g., the cost of mistakes (β)?
RQ3: How do these improvements change with properties of the dataset, e.g., with the data distribution or dimensionality?

Metrics and Datasets. We compared the utility achieved by two models: the most accurate classifier, trained using log-loss, and a classifier optimized using team-loss, on the datasets described in Table 2. We experimented with two synthetic datasets and four real-world datasets with high stakes. The real datasets are from domains that are known to, or already, deploy AI to assist human decision makers.
Dataset    | Features | Examples | Positive fraction
Scenario1  | 2        | 10000    | 0.43
Moons      | 2        | 10000    | 0.50
German     | 24       | 1000     | 0.30
Fico       | 39       | 9861     | 0.52
Recidivism | 13       | 6172     | 0.46
Mimic      | 714      | 21139    | 0.13

Table 2: We used two synthetic datasets (Scenario1, Moons) and four real-world datasets from high-stakes domains that are known to be used in AI-assisted decision-making settings. The Mimic dataset has the most class imbalance.

The Scenario1 dataset refers to a dataset we created by sampling 10000 points from the data distribution described in Figure 1.
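The Moons row of Table 2 is consistent with scikit-learn's two-moons generator; assuming that is the source (the paper does not say), a comparable dataset can be produced as follows:

```python
from sklearn.datasets import make_moons

# 10,000 points in two interleaving half-circles with balanced labels,
# matching the Moons row of Table 2; the noise level is our guess.
X, y = make_moons(n_samples=10000, noise=0.2, random_state=0)
```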
Training Procedure. We experimented with two models: logistic regression and a multi-layer perceptron (two hidden layers with 50 and 100 units). For each task (defined by a choice of task parameters, dataset, model, and loss) we optimized the loss using stochastic gradient descent (SGD) and used standard, well-known training practices such as regularization, check-pointing, and learning-rate schedulers. We selected the best hyper-parameters using 5-fold cross-validation, including values for the learning rate, the batch size, the patience and decay factor of the learning-rate scheduler, and the weight of the L2 regularizer.

In our initial experiments training with team-loss using SGD, we observed that the classifier's loss would never decrease and in fact remained constant. This happened because, in practice, random initializations resulted in classifiers that were uncertain on all examples. And since, by definition, team-loss is flat in these uncertain regions (Figure 4b), the gradients were zero and uninformative. To overcome this issue, we initialized the classifiers with the (already converged) most accurate classifier.
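A minimal sketch of this two-stage recipe (our code: model, train_loader, and the hyper-parameters are placeholders, and team_loss refers to the sketch in Section 2):

```python
import copy
import torch
import torch.nn as nn

def fit(model, loader, loss_fn, epochs=20, lr=1e-2):
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    for _ in range(epochs):
        for X, y in loader:
            opt.zero_grad()
            loss_fn(model(X), y).backward()
            opt.step()
    return model

# Stage 1: train the most accurate classifier with ordinary log-loss.
base = fit(model, train_loader, nn.CrossEntropyLoss())

# Stage 2: warm-start from the converged model and fine-tune with team-loss,
# so optimization does not begin on the flat part of the loss surface.
team = fit(copy.deepcopy(base), train_loader, team_loss)
```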
RQ1: Experiments showed that when we used team-loss, the magnitude of the improvements in team utility over log-loss varied across datasets but the improvements were consistently observed (Table 3). We observed that team-loss often sacrifices classifier accuracy to improve team utility, the more desirable metric. For the linear classifier, this sacrifice is especially large on the synthetic datasets: Scenario1 (16%) and Moons (1%). For the MLP, team-loss sacrifices 2% accuracy to improve team utility. (Two implementation notes: since utility can be negative, before computing its logarithm we shift the utility function up, by subtracting its minimum value; and we report absolute rather than percentage improvements in utility because utility can be negative.)

While the metrics in Table 3 (change in accuracy and utility) provide a global understanding of the effect of team-loss, they do not explain how team-loss achieved the improvements or whether the behavior of the new models is consistent with intuition. Figure 5 visualizes the difference in behavior (averaged over 150 seeds) between the classifiers produced by log-loss and team-loss on the Scenario1 dataset.
Model  | Dataset   | Acc LL | Util LL | ΔAcc   | ΔUtil
Linear | Scenario1 | 0.86   | 0.59    | -0.16  | 0.165
Linear | Moons     | 0.89   | 0.81    | -0.01  | 0.020
Linear | German    | 0.75   | 0.61    | -0.004 | 0.009
Linear | Mimic     | 0.88   | 0.80    | -0.000 | 0.001
Linear | Recid     | 0.68   | 0.53    | 0.000  | 0.000
Linear | Fico      | 0.73   | 0.58    | 0.000  | -0.000
MLP    | Fico      | 0.72   | 0.56    | 0.01   | 0.018
MLP    | Scenario1 | 0.98   | 0.84    | -0.04  | 0.008
MLP    | Moons     | 1.00   | 0.99    | 0.00   | 0.007
MLP    | German    | 0.74   | 0.61    | -0.02  | 0.003
MLP    | Mimic     | 0.88   | 0.80    | 0.00   | 0.003
MLP    | Recid     | 0.67   | 0.52    | -0.00  | 0.001

Table 3: Differences in performance (accuracy and utility) of team-loss and log-loss for all datasets (averaged over 150 runs); LL indicates log-loss. Datasets are sorted in descending order of improvement in utility, and the analysis is divided by classifier type, linear and multi-layer perceptron. We observe that team-loss often sacrifices accuracy to improve utility. While the gains in utility are small, they are consistently observed across datasets.
Specifically, as shown in Figure 5, we visualize and compare:

V1. Calibration, using reliability curves, which compare system confidence with true accuracy. A perfectly calibrated system, for example, will be 80% accurate on regions where it is 80% confident. In practice, however, systems may be over- or under-confident.

V2. Distributions of confidence in predictions. For example, in Figure 5, team-loss makes more high-confidence predictions than log-loss.

V3. The fraction of total system accuracy contributed by different regions (of confidence values). The area under this curve indicates the system's total accuracy. Note that for our setup the area under the curve in the Accept region is more crucial than the area in the Solve region, since in the latter the human is expected to take over.

V4. Similar to (V3), the fourth sub-graph shows the fraction of total system utility contributed by different regions of confidence.

Figure 5: Differences between the behavior of linear classifiers learned using log-loss and team-loss on the Scenario1 and Moons datasets (averaged over 150 runs). Scenario1: team-loss 1) sacrifices accuracy in the Solve region, 2) makes fewer predictions in the Solve region and more high-confidence predictions in the right half of the Accept region (annotated as X), 3) reduces the contribution to system accuracy from the Solve region and increases it from the Accept region, and 4) results in a higher area under the curve, indicating an increase in overall utility. Moons: team-loss 1) improves accuracy in the Accept region, and 2) makes fewer very-high-confidence predictions (marked as Y) and more moderately-high-confidence predictions in the Accept region. Figure 6 shows similar visualizations for the real-world datasets.

If team-loss had not resulted in predictions different from log-loss, the curves in Figure 5 for the two loss functions would be indistinguishable. However, we observed that team-loss results in dramatically different predictions than log-loss. In fact, we noticed two types of behavior when team-loss improved utility:

B1. The first type of behavior was observed on the Scenario1 dataset (Figure 5) and is easier to understand, as it matches the intuition we set out in the beginning: the classifier trained with team-loss sacrifices accuracy on the uncertain examples in the Solve region to make more high-confidence predictions in the Accept region. This change improves system accuracy in the Accept region, which is where system accuracy matters and contributes to team utility. Later, we show that this same behavior is observed on the German dataset (Figure 6).

B2. The second type of behavior was observed on Moons (Figure 5), where the new loss increases accuracy in the Accept region at the cost of making fewer very-high-confidence predictions (e.g., when confidence is greater than 0.95, in the region marked Y). This change improves utility because the system's accuracy in the Accept region matters more than making very-high-confidence predictions.

In both behaviors, team-loss effectively increases the contribution to AI accuracy from the Accept region, i.e., the region where the AI's performance provides value to the team. In contrast, log-loss has no such considerations. Figure 6 shows a similar analysis on the real datasets for both the linear and MLP classifiers. When team-loss improves utility, we see one of the two behaviors described above.
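The bookkeeping behind these plots is straightforward to reproduce; the sketch below (our code, with illustrative defaults) splits test predictions at the threshold of Equation 3 and measures each region's contribution:

```python
import numpy as np

def region_report(probs, y, beta=1.0, lam=0.5, a=1.0):
    """probs: (N, C) predicted probabilities; y: (N,) true labels."""
    conf = probs.max(axis=1)
    correct = probs.argmax(axis=1) == y
    accept = conf >= a - lam / (1.0 + beta)     # Accept region (Equation 3)

    # Per-example utility: payoff matrix of Figure 3 in the Accept region,
    # and the human's expected utility in the Solve region.
    util = np.where(accept,
                    np.where(correct, 1.0, -beta),
                    a * (1.0 - lam) + (1.0 - a) * (-beta - lam))
    return {
        "accept_fraction": accept.mean(),
        "accept_accuracy": correct[accept].mean() if accept.any() else float("nan"),
        "mean_utility": util.mean(),
    }
```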
RQ2: Since the penalty of a mistake may be task-dependent (e.g., an incorrect diagnosis may be costlier than an incorrect loan approval), we varied the mistake penalty β to study its effect on the improvements from team-loss. Our experiments (Figure 7) showed that the difference in utilities depends on the cost of a mistake, and the highest difference is observed at a different value of β for each dataset. We also observed that, for our setup, as the mistake penalty increases, log-loss may achieve lower performance than the human-only baseline, so deploying automation is undesirable in these cases. For example, on Fico with β = 7, the linear model learned using log-loss achieves lower performance than the human baseline. Similarly, on Mimic with β = 500, deploying the MLP learned using log-loss is undesirable.

Figure 6: Comparison of the predictions of log-loss and team-loss on the real-world datasets when team-loss improves utility (150 seeds). (a) Linear classifier: On German, we observed B1, where team-loss, compared to log-loss, preserved accuracy and made more predictions in the Accept region, and sacrificed accuracy and prediction mass in the Solve region. In contrast, on Mimic, we observed B2, where team-loss increased accuracy in the Accept region but made fewer very-high-confidence predictions (e.g., confidence > 0.9). (b) MLP classifier: On Fico, we observed behavior B2, similar to Moons (Figure 5), where team-loss increased accuracy in the Accept region and reduced the number of very-high-confidence predictions (as for the linear classifier on Moons). In contrast, on the German dataset, we observed behavior B1, similar to the Scenario1 dataset (Figure 5), where team-loss sacrificed accuracy in the Solve region and increased the number of predictions in the Accept region.

Figure 7: Comparison of utility achieved by the two loss functions. Across values of β, in most datasets team-loss achieves higher utility than log-loss; however, the β value that results in the highest relative improvement differs across datasets. Interestingly, when log-loss results in lower utility than a human-only baseline (indicated by the dotted line), e.g., as seen on Recidivism as the penalty increases, team-loss still attempts to nudge its utility toward the human baseline.
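For reference, the dotted human-only baseline in Figure 7 is the constant Solve utility from Equation 7, which one can tabulate directly (the grid and parameter values below are illustrative, not the paper's):

```python
# Utility when the human solves every instance: (1 + beta) * a - beta - lam.
def human_only_utility(beta, lam=0.5, a=0.9):
    return (1 + beta) * a - beta - lam

for beta in [1, 2, 5, 7, 10]:
    print(beta, human_only_utility(beta))   # decreases as mistakes get costlier
```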
Model  | Dataset  | Acc LL | Util LL | ΔAcc  | ΔUtil
Linear | German-b | 0.72   | 0.56    | -0.01 | 0.004
Linear | Mimic-b  | 0.77   | 0.65    | 0.00  | –
MLP    | German-b | 0.74   | 0.57    | 0.00  | 0.024
MLP    | Mimic-b  | 0.93   | 0.87    | 0.00  | 0.002

Table 4: Performance on the German and Mimic datasets after correcting class imbalance. Bold (in the original) indicates settings where balancing the dataset improved the gains in utility compared to its original version. We observed that for the MLP, after balancing the German dataset, the gains in utility improved substantially, from 0.003 (see Table 3) to 0.024.
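The balancing step can be as simple as random over-sampling of the minority class; the paper does not specify its exact resampling method, so the following is one plausible sketch:

```python
import numpy as np

def oversample_positives(X, y, seed=0):
    """Duplicate random positive examples until classes are equally frequent.

    Assumes the positive class (y == 1) is the minority, as in the German
    and Mimic datasets (Table 2).
    """
    rng = np.random.default_rng(seed)
    pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
    extra = rng.choice(pos, size=len(neg) - len(pos), replace=True)
    idx = np.concatenate([neg, pos, extra])
    rng.shuffle(idx)
    return X[idx], y[idx]
```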
RQ3: Since the gains from using team-loss were small and varied across datasets, we conducted experiments to investigate properties of the dataset that may have affected these improvements. While there are many properties of a dataset one could investigate, we studied the following two:

Figure 8: Relative performance of team-loss on the Moons dataset for the linear classifier as we varied the data distribution and moved more points towards the edges. "Fraction Moved" indicates the fraction of the total number of points that were moved towards the overlapping edges of the two moons.

1. Data distribution: On the Moons dataset, we observed that the linear model trained with team-loss increased utility by increasing confidence on the examples on the outer edges of the moons enough to move these examples to the Accept region. So, to test whether a different data distribution would benefit more from using team-loss, we created additional versions of Moons by systematically moving points from the middle of the circle towards its edges. Figure 8 shows the improvements in utility as we moved more data.

2. Class imbalance: While most of our datasets were balanced, German and Mimic had a lower percentage of positive instances (see Table 2). We conducted experiments on balanced versions of these two datasets to understand whether class imbalance affected our observations in the previous experiments. Table 4 shows the performance after we over-sampled the positive class to adjust for class imbalance in the two datasets. We observed that in both cases correcting the class imbalance increased the improvement from using team-loss.

We also examined the dimensionality of the datasets. Since team-loss may be harder to optimize than log-loss, an increase in data dimensions may affect the optimizer's (in this case SGD's) ability to optimize the team-loss objective. We also experimented with Adam as the optimizer; unfortunately, it did not provide any benefits. For a given dataset, we varied its dimensionality by using only a subset of features. However, we did not notice any correlation between dimensionality and improvements in utility.
4 Discussion

We conjecture two reasons for the small gains in utility on the real datasets when using team-loss: either there is no scope for improving utility on those dataset and model pairs, or our current optimization procedures are ineffective. Since we do not know the optimal (utility) solution for a given dataset and model, we cannot verify or reject the first conjecture. However, the results on the two synthetic datasets suggest the existence of situations where there is a significant gap between the utilities achieved using team-loss and log-loss.

It is possible that our current optimization procedures are ineffective for optimizing team-loss. One reason this might happen is that team-loss is more complex than log-loss: it introduces new plateaus in the loss surface and thus may increase the chances that optimization methods such as stochastic gradient descent get stuck in local minima. In fact, in our experiments, we observed that on the datasets where team-loss did not increase utility, it resulted in predictions identical to log-loss. This may happen, for example, if the most accurate classifier is a local minimum. Since we use the most accurate classifier to initialize the optimization of team-loss, this suggests that further optimization with the new loss did not manage to escape the potential local minimum.

While we propose a solution for a simplified form of human-AI teamwork (see the assumptions in Section 2), our observations have implications for human-AI teams in general. If we cannot optimize utility for our simplified case, it may be even harder to optimize utility in scenarios where users make Accept and Solve decisions using a richer, more complex mental model rather than relying on model confidence alone. Such scenarios are common when the system confidence is an unreliable indicator of performance (e.g., due to poor calibration) and, as a result, the user develops an understanding of system failures in terms of domain features. For example, Tesla drivers often override the Autopilot using features such as road and weather conditions. We can reduce this case, where users have a complex mental model, to the one we studied: specifically, we can construct a loss function that is constant when a prediction belongs to the Solve region described by the user's mental model, and is log-loss otherwise. This case may be harder to optimize because the resulting loss surface will contain more complex combinations of plateaus and local optima than the one we considered.
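To make the reduction concrete, here is a hypothetical sketch (the predicate, names, and constant are ours): given any function encoding the user's mental model over domain features, the generalized loss is log-loss where the user relies on the AI and a constant elsewhere.

```python
import torch
import torch.nn.functional as F

def mental_model_loss(logits, targets, X, solve_region, const=1.0):
    """Log-loss where the user relies on the AI; constant where they Solve.

    solve_region(X) -> (N,) boolean tensor encoding the user's mental model,
    e.g., built from domain features such as road or weather conditions.
    """
    nll = F.cross_entropy(logits, targets, reduction="none")
    return torch.where(solve_region(X), torch.full_like(nll, const), nll).mean()
```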
Figure 9: Difference between the predictions of team-loss and log-loss as we varied the data distribution of the Moons dataset. As we moved more points towards the outer edges of the moons, the behavior of team-loss changed from B2 to a combination of B2 and B1; for example, when 50% of the points were moved, team-loss both sacrificed accuracy in the Solve region and improved accuracy in the Accept region.

5 Related Work

Our approach is closely related to maximum-margin classifiers, such as an SVM optimized with the hinge loss [Burges, 1998], where a larger soft margin can be used to make high-confidence and accurate predictions. However, unlike our approach, it is not possible to directly plug the domain's payoff matrix (e.g., Figure 3) into such a model. Furthermore, the SVM's output and margin do not have an immediate probabilistic interpretation, which is crucial for our problem setting. One possible (though computationally intensive) solution direction is to convert the margin into probabilities, e.g., using post-hoc calibration such as Platt scaling [Platt, 1999], and to use cross-validation to select margin parameters that optimize team utility. While it is still an open question whether such an approach would be effective for SVM classifiers, in this work we focused our attention on gradient-based optimization.

Another related problem is cost-sensitive learning, where different mistakes incur different penalties; for example, false negatives may be costlier than false positives [Zadrozny et al., 2003]. A common solution here is up-weighting the inputs where mistakes are costlier. Also relevant is work on importance-based learning, where re-weighting helps learn from imbalanced data or speed up training. However, in our setup, re-weighting the inputs makes less sense: the weights would depend on the classifier's output, which has not been trained yet. An iterative approach may be possible, but our initial analysis showed that this approach is prone to oscillations, where the classifier may never converge. We leave exploring this avenue to future work.

A fundamental line of work that renders AI predictions more actionable (for humans) and better suited for teaming is confidence calibration, for example, using Bayesian models [Ghahramani, 2015; Beach, 1975; Gal and Ghahramani, 2016] or via post-hoc calibration [Platt, 1999; Zadrozny and Elkan, 2001; Guo et al., 2017; Niculescu-Mizil and Caruana, 2005]. A key difference between these methods and our approach is that team-loss re-trains the model to improve on the inputs on which users are more likely to rely on the AI's predictions. The same contrast distinguishes our approach from outlier-detection techniques [Hendrycks et al., 2018; Lee et al., 2017; Hodge and Austin, 2004].

More recent work that adjusts model behavior to accommodate collaboration is backward compatibility for AI [Bansal et al., 2019b], where the model considers user interactions with a previous version of the system to preserve trust across updates. Recent user studies showed that when users develop mental models of a system's mistakes, properties other than accuracy are also desirable for successful collaboration, for example, parsimonious and deterministic error boundaries [Bansal et al., 2019a]. Our approach is a first step towards implementing these desiderata within machine-learning optimization itself. Other approaches to human-centered optimization regularize or constrain model optimization for other human-centered requirements such as interpretability [Wu et al., 2019; Wu et al., 2018] or fairness [Jung et al., 2019; Zafar et al., 2015].
6 Conclusion

We studied the problem of training classifiers that optimize team performance, a metric that matters more for collaboration than mere automation accuracy. To support direct optimization of team performance, we devised a new loss function based on the expected utility of the human-AI team for decision making. Thorough investigations and visualizations of classifier behavior before and after optimizing with team-loss show that, when such optimization is effective, team-loss can fundamentally change model behavior and improve team utility. The changes in model behavior include either (i) sacrificing model accuracy in low-confidence regions for more accurate high-confidence predictions, or (ii) increasing accuracy in the Accept region through more accurate predictions but fewer highly confident ones. Such behaviors were observed on synthetic and real-world datasets from domains where AI is known to be employed in support of human decision makers. However, we also report that current optimization techniques were not always effective; in fact, sometimes they did not change model behavior at all, i.e., models remained identical even after fine-tuning with team-loss. Since team-loss clearly surfaces optimization challenges, mostly related to its flat curvature and potential local minima in the Solve region, we invite future work on machine-learning optimization and human-AI collaboration to jointly approach such challenges at the intersection of both fields.
Acknowledgments

This material is based upon work supported by ONR grant N00014-18-1-2193, the University of Washington WRF/Cable Professorship, the Allen Institute for Artificial Intelligence (AI2), and Microsoft Research. The authors thank Rich Caruana, Bryan Wilder, and Zeyuan Allen-Zhu for useful discussions and comments.

References

[Bansal et al., 2019a] Gagan Bansal, Besmira Nushi, Ece Kamar, Walter S. Lasecki, Daniel S. Weld, and Eric Horvitz. Beyond accuracy: The role of mental models in human-AI team performance. In Proceedings of the AAAI Conference on Human Computation and Crowdsourcing, volume 7, pages 2–11, 2019.

[Bansal et al., 2019b] Gagan Bansal, Besmira Nushi, Ece Kamar, Daniel S. Weld, Walter S. Lasecki, and Eric Horvitz. Updates in human-AI teams: Understanding and addressing the performance/compatibility tradeoff. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 2429–2437, 2019.

[Beach, 1975] Barbara Heinrich Beach. Expert judgment about uncertainty: Bayesian decision making in realistic settings. Organizational Behavior and Human Performance, 14(1):10–59, 1975.

[Burges, 1998] Christopher J. C. Burges. A tutorial on support vector machines for pattern recognition. Data Mining and Knowledge Discovery, 2(2):121–167, 1998.

[Caruana et al., 2015] Rich Caruana, Yin Lou, Johannes Gehrke, Paul Koch, Marc Sturm, and Noemie Elhadad. Intelligible models for healthcare: Predicting pneumonia risk and hospital 30-day readmission. In KDD, 2015.

[Gal and Ghahramani, 2016] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.

[GDPR, 2020] GDPR. Art. 22 GDPR, automated individual decision-making, including profiling. https://gdpr-info.eu/art-22-gdpr/, 2020. [Online; accessed 14-January-2020].

[Ghahramani, 2015] Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452, 2015.

[Guo et al., 2017] Chuan Guo, Geoff Pleiss, Yu Sun, and Kilian Q. Weinberger. On calibration of modern neural networks. In Proceedings of the 34th International Conference on Machine Learning, pages 1321–1330. JMLR.org, 2017.

[Hendrycks and Gimpel, 2018] Dan Hendrycks and Kevin Gimpel. A baseline for detecting misclassified and out-of-distribution examples in neural networks. arXiv:1610.02136v3, 2018.

[Hendrycks et al., 2018] Dan Hendrycks, Mantas Mazeika, and Thomas G. Dietterich. Deep anomaly detection with outlier exposure. arXiv preprint arXiv:1812.04606, 2018.

[Hodge and Austin, 2004] Victoria Hodge and Jim Austin. A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2):85–126, 2004.

[Jung et al., 2019] Christopher Jung, Michael Kearns, Seth Neel, Aaron Roth, Logan Stapleton, and Zhiwei Steven Wu. Eliciting and enforcing subjective individual fairness. arXiv preprint arXiv:1905.10660, 2019.

[Kamar et al., 2012] Ece Kamar, Severin Hacker, and Eric Horvitz. Combining human and machine intelligence in large-scale crowdsourcing. In Proceedings of the 11th International Conference on Autonomous Agents and Multiagent Systems, pages 467–474. International Foundation for Autonomous Agents and Multiagent Systems, 2012.

[Kleinberg et al., 2018] Jon Kleinberg, Himabindu Lakkaraju, Jure Leskovec, Jens Ludwig, and Sendhil Mullainathan. Human decisions and machine predictions. The Quarterly Journal of Economics, 133(1):237–293, 2018.

[Lee et al., 2017] Kimin Lee, Honglak Lee, Kibok Lee, and Jinwoo Shin. Training confidence-calibrated classifiers for detecting out-of-distribution samples. arXiv preprint arXiv:1711.09325, 2017.

[Niculescu-Mizil and Caruana, 2005] Alexandru Niculescu-Mizil and Rich Caruana. Predicting good probabilities with supervised learning. In Proceedings of the 22nd International Conference on Machine Learning, pages 625–632. ACM, 2005.

[Patel et al., 2019] Bhavik N. Patel, Louis Rosenberg, Gregg Willcox, David Baltaxe, Mimi Lyons, Jeremy Irvin, Pranav Rajpurkar, Timothy Amrhein, Rajan Gupta, Safwan Halabi, et al. Human-machine partnership with artificial intelligence for chest radiograph diagnosis. NPJ Digital Medicine, 2(1):1–10, 2019.

[Platt, 1999] John Platt. Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3):61–74, 1999.

[Ribeiro et al., 2016] Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In Proc. of KDD, 2016.

[Weld and Bansal, 2019] Daniel S. Weld and Gagan Bansal. The challenge of crafting intelligible intelligence. Communications of the ACM, 62:70–79, 2019.

[Wu et al., 2018] Mike Wu, Michael C. Hughes, Sonali Parbhoo, Maurizio Zazzi, Volker Roth, and Finale Doshi-Velez. Beyond sparsity: Tree regularization of deep models for interpretability. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[Wu et al., 2019] Mike Wu, Sonali Parbhoo, Michael Hughes, Ryan Kindle, Leo Celi, Maurizio Zazzi, Volker Roth, and Finale Doshi-Velez. Regional tree regularization for interpretability in black box models. arXiv preprint arXiv:1908.04494, 2019.

[Zadrozny and Elkan, 2001] Bianca Zadrozny and Charles Elkan. Obtaining calibrated probability estimates from decision trees and naive Bayesian classifiers. In ICML, volume 1, pages 609–616. Citeseer, 2001.

[Zadrozny et al., 2003] Bianca Zadrozny, John Langford, and Naoki Abe. Cost-sensitive learning by cost-proportionate example weighting. In ICDM, volume 3, page 435, 2003.

[Zafar et al., 2015] Muhammad Bilal Zafar, Isabel Valera, Manuel Gomez Rodriguez, and Krishna P. Gummadi. Fairness constraints: Mechanisms for fair classification. arXiv preprint arXiv:1507.05259, 2015.
Appendix A: Relaxing the Rationality Assumption

In Section 2 we assumed that the user acts rationally when making the meta-decision. We now relax this assumption and assume that with a small probability ε ∼ B(γ1, γ2) the user may (uniformly) randomly choose between Accept and Solve. Then, extending Equation 4, the user will Accept the system's recommendation with probability:

   P_ε(m = A) = 1 − ε    if h(x)[ŷ] ≥ c(β, λ, a)
              = ε        otherwise    (9)

In the above equation, when the model is confident, the probability decreases by ε because the user may decide to Solve. Similarly, when the model is not confident, the increase to ε (compared to Equation 4) indicates that the user may randomly decide to Accept an uncertain recommendation.

To simplify deriving the new equation for the expected utility, we re-write Equation 6 as:

   ψ(x, y) = ψ_A(x, y)    if h(x)[ŷ] ≥ c(β, λ, a)
           = ψ_S(x, y)    otherwise    (10)

Using the above two equations, we obtain the following equation for the expected utility when the user is not perfectly rational:

   ψ_ε(x, y) = (1 − ε) · ψ_A(x, y) + ε · ψ_S(x, y)    if h(x)[ŷ] ≥ c(β, λ, a)
             = (1 − ε) · ψ_S(x, y) + ε · ψ_A(x, y)    otherwise    (11)

The above equation says that, when the system is confident, instead of always obtaining ψ_A as in Equation 10, with a small probability ε the user obtains the expected utility associated with a Solve action. Similarly, when the system is uncertain, the user may sometimes obtain the expected utility associated with an Accept action. Qualitatively, this results in a worse best-case expected utility, an artifact of the user making a sub-optimal decision (to Solve) when automation would yield the highest utility. Similarly, the expected utility in the Solve region also decreases: the user may Accept uncertain recommendations. On the other hand, this improves the worst-case utility: the noisy user avoids some high-confidence mistakes that a rational user would not. However, unlike ψ, ψ_ε is strictly monotonic: ψ_A is a linear function and hence strictly monotonic, and the sum of a strictly monotonic function and a constant is strictly monotonic.
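A one-function sketch of Equation 11 (names are ours), useful for numerically checking the qualitative claims above:

```python
def noisy_team_utility(psi_accept, psi_solve, confident, eps):
    """Expected utility under an eps-noisy user (Equation 11).

    psi_accept, psi_solve: branch utilities from Equation 10
    confident: whether h(x)[y_hat] >= c(beta, lam, a)
    eps: probability of deviating from the rational meta-decision
    """
    if confident:
        return (1 - eps) * psi_accept + eps * psi_solve
    return (1 - eps) * psi_solve + eps * psi_accept
```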