Designing AI for Trust and Collaboration in Time-Constrained Medical Decisions: A Sociotechnical Lens
Maia Jacobs, Jeffrey He, Melanie F. Pradier, Barbara Lam, Andrew C. Ahn, Thomas H. McCoy, Roy H. Perlis, Finale Doshi-Velez, Krzysztof Z. Gajos
Maia Jacobs, Northwestern University
Jeffrey He, Harvard University
Melanie F. Pradier, Microsoft Research
Barbara Lam, Beth Israel Deaconess Medical Center
Andrew C. Ahn, Harvard Medical School and Beth Israel Deaconess Medical Center
Thomas H. McCoy, Massachusetts General Hospital and Harvard Medical School
Roy H. Perlis, Massachusetts General Hospital and Harvard Medical School
Finale Doshi-Velez, Harvard University
Krzysztof Z. Gajos, Harvard University
ABSTRACT
Major depressive disorder is a debilitating disease affecting 264 million people worldwide. While many antidepressant medications are available, few clinical guidelines support choosing among them. Decision support tools (DSTs) embodying machine learning models may help improve the treatment selection process, but often fail in clinical practice due to poor system integration.

We use an iterative, co-design process to investigate clinicians' perceptions of using DSTs in antidepressant treatment decisions. We identify ways in which DSTs need to engage with the healthcare sociotechnical system, including clinical processes, patient preferences, resource constraints, and domain knowledge. Our results suggest that clinical DSTs should be designed as multi-user systems that support patient-provider collaboration and offer on-demand explanations that address discrepancies between predictions and current standards of care. Through this work, we demonstrate how current trends in explainable AI may be inappropriate for clinical environments and consider paths towards designing these tools for real-world medical systems.
CCS CONCEPTS
• Human-centered computing → User centered design; • Applied computing → Health care information systems; • Information systems → Decision support systems.

KEYWORDS
decision support tools, healthcare, major depressive disorder
ACM Reference Format:
Maia Jacobs, Jeffrey He, Melanie F. Pradier, Barbara Lam, Andrew C. Ahn, Thomas H. McCoy, Roy H. Perlis, Finale Doshi-Velez, and Krzysztof Z. Gajos. 2021. Designing AI for Trust and Collaboration in Time-Constrained Medical Decisions: A Sociotechnical Lens. In CHI Conference on Human Factors in Computing Systems (CHI '21), May 8–13, 2021, Yokohama, Japan. ACM, New York, NY, USA, 14 pages. https://doi.org/10.1145/3411764.3445385
1 INTRODUCTION
Advances in artificial intelligence (AI) and machine learning (ML) offer opportunities to uncover complex data patterns. In medicine, the integration of AI tools could lead to a substantial paradigm shift in which human-AI collaboration becomes integrated in medical decision-making. Such AI-powered decision support tools (DSTs) may support many clinical practices, including diagnosing illnesses, selecting the optimal treatment for a patient, and predicting disease trajectories [72]. While the promise of AI in medicine is alluring, the use of these systems in healthcare has been discussed for decades, and yet few of these tools have resulted in successful implementation and use in clinical practice [41].

One area of healthcare that researchers have expected to benefit from the implementation of DSTs, but has yet to adopt such technological support, is major depressive disorder (MDD). Antidepressant medications are a common form of treatment for MDD, but selecting an effective treatment for a patient is a complex task. The majority of mental health care is initiated in primary care settings, yet the extent of training primary care providers (PCPs) receive in managing MDD can vary widely [63, 69], and contemporary treatment guidelines provide little support in choosing among them [31]. Limited guidelines as well as heterogeneity in patients' symptoms and in patients' tolerability of antidepressants often results in using trial and error to identify an effective treatment [65]. Currently, an estimated one-third of patients fail to reach remission even after four antidepressant trials [53]. The frequency of ineffective drug trials has resulted in the psychiatry community calling for more information on which treatments will be most effective for an individual patient [58]. In response, we have seen several studies focused on creating ML models to support MDD treatment decisions [25, 49, 57].

While ML models for antidepressant treatment selection exist, these systems are rarely integrated into clinical practice due to low user acceptance and a failure to account for user expectations in the system design [19, 33]. Therefore, motivated by Berg's sociotechnical approach, which highlights the importance of empirical research of healthcare practices in which the technology will be used [6], we consider the social, technical, and organizational issues that must be considered in the design of DSTs for MDD treatment selection. Using an iterative design process and two qualitative studies with PCPs, our findings raise many challenges to integrating ML-enabled tools into real clinical workflows. We found that intelligent decision support tools will need to seamlessly integrate into clinicians' time-constrained workflows, which typically involve short appointments with patients to understand their symptoms and make treatment decisions. We also found that clinicians' busy schedules influenced how they thought about trust as it relates to black-box ML models. Conversations revealed that PCPs wanted DSTs that: 1) engage patients in decision-making, 2) connect DST output to existing healthcare system processes, 3) do not require making decisions of trust at every interaction, and 4) compare and contrast DST output with existing standards of care.

Based on these findings, we discuss how using a sociotechnical lens challenges current trends in the design of explainable AI.
We highlight the need to design DSTs as multi-user systems that facilitate patient-provider-AI collaboration. Further, our work reveals issues with using explanations that encourage users to make determinations of trust for each prediction. We recommend that for time-constrained medical environments, we shift from designing explanations for every decision to on-demand explanations that contrast AI recommendations with current standards of care.

While we have seen many recent advances related to the use of HCI methods for designing AI tools, we have seen few studies that discuss what it will take to make these tools work within complex sociotechnical systems. Our contributions include the following:
(1) We use an iterative design process to create a prototype of an MDD decision support tool that integrates both patient-level prognostic predictions and treatment selection support.
(2) Based on primary care providers' feedback, we present the important facets of the healthcare sociotechnical system that must be considered in the design and development of machine learning tools for real-world clinical care.
(3) We discuss how using a sociotechnical lens presents new opportunities and challenges for both HCI and ML research.
2 RELATED WORK
In recent years, we have seen increased interest in embedding AI tools in a variety of contexts, such as the justice system [2, 62], the U.S. child welfare system [54], and medicine [22, 59, 68]. Past studies have identified a number of problems with implementing existing models into real-world workflows. One issue is that recent AI work has focused on improving the accuracy of the model, rather than the needs of the intended user [56], and improving model accuracy does not always correlate with overall performance once implemented in the real world [4, 5]. In the context of antidepressant treatment selection, Jacobs et al. found in a factorial experiment that AI recommendations did not improve treatment selection accuracy, highlighting the need for research that engages directly with clinicians to create tools that are interpretable and useful [29].

In the past few years, we have seen an upswing in CHI research that uses co-design methods to consider the real-world challenges, beyond accuracy, that must be considered in the design and development of these tools. HCI research has helped to advance the field of explainable AI, examining how people interact with AI tools and designing tools to help end-users understand the inner workings of machine learning models [3, 8, 10, 24, 46, 73]. We have also seen an increased use of co-design activities that give end-users a greater voice in how these systems should function [9, 36, 64, 71]. This research has revealed the importance of context in determining if and when end-users want computational assistance.

We assert that another problem is the lack of context awareness of the broader sociotechnical systems in which these tools are being embedded. In a recent paper, Selbst et al. discuss how a failure to use a sociotechnical lens can lead ML tools to be ineffective [56]. Beede et al. also recently showed how environmental factors in a clinical setting influenced the usability of a deep learning system [5]. Sociotechnical approaches have been important in healthcare for understanding how new technologies may be effectively integrated into the social processes that make up healthcare work [7]. Here we use clinical perspectives to consider the broader sociotechnical context that will be necessary to consider when creating AI systems for real-world use.
Decision support tools are computational systems created to facilitate medical decision-making [47]. These tools can be designed to provide a range of outputs, including treatment recommendations, prognosis predictions, and patient diagnoses [72]. Clinical DSTs have long attracted researchers due to their ability to perform tasks that exceed human capabilities, such as extracting information and patterns from large amounts of data. These data-driven tools offer the opportunity to improve health outcomes and reduce human errors in the decision-making process [41].

Despite many years of enthusiasm towards these technologies [41, 47], the vast majority of these tools fail once they are deployed in real-world health systems. Notably, DSTs are typically not abandoned due to poor performance, but rather due to failures in accounting for the complexity of the healthcare sociotechnical system [51]. Researchers have found that poor workflow integration, low context awareness, and a failure to incorporate clinicians' expectations have led to low user acceptance of DSTs [33, 41, 51, 60]. Several papers have discussed the relationship between poor DST integration and low user acceptance. For example, studies in medical informatics found that when DST alerts arise at inopportune times (from clinicians' vantage), the alerts become ignored or overridden [1, 35, 43]. In response, there have been calls both within and outside of the field of HCI to use CSCW and HCI methods to improve these tools [51, 60, 72].
Within HCI, recent work has helped to establish clinical expectations for DSTs and develop guidelines for improving clinician-AI interactions. For example, Cai et al. found that refinement tools were considered more helpful and easier to use than traditional tools when interacting with image retrieval systems [10]. In another study looking at image retrieval tools, Xie et al. developed a set of design recommendations for supporting clinician exploration and subsequent understanding of AI tools [70]. Finally, Yang et al. designed a DST prototype to facilitate artificial heart implant decisions [71]. Using concepts from unremarkable computing, they aimed to make AI prediction unobtrusive in clinicians' workflow and found that this integration supported clinicians' acceptance of the technology. These studies collectively provide insights into how DSTs may be better situated into clinical routines. Through conversations with clinicians we have found that successful DST implementation will also require a broader context awareness. We extend existing literature by identifying other aspects of the healthcare system that must be considered in the creation and deployment of novel DSTs.
Major depressive disorder (MDD) is a brain disorder characterized by depressed mood, loss of interest in daily activities, as well as change in associated symptoms such as sleep, energy, eating, concentration, and thoughts of death or suicide. Lifetime prevalence of MDD in the United States is estimated to exceed 15% [32]. Treatments for MDD supported by randomized, controlled trials include antidepressant medications, cognitive-behavioral therapy, and somatic therapies such as transcranial magnetic stimulation and electroconvulsive therapy [18].

Finding an effective antidepressant medication for someone diagnosed with MDD is an important but difficult task. Both mental health specialists and PCPs (including physicians and nurse practitioners) are authorized to write prescriptions. The majority of mental healthcare is provided within primary care settings [69]. However, primary care appointments last an average of only 20 minutes in the United States, and internationally primary care appointments can be as short as a few minutes [28]. Such short encounters are considered insufficient for effective mental health care [27]. Further, the training that PCPs receive on antidepressant treatment selection can be highly variable [63, 69].

To support treatment selection, the Canadian Network for Mood and Anxiety Treatments provides widely followed treatment guidelines for 25 antidepressants, organized as first-, second-, and (a small number of) third-line treatments [31]. First-line treatments are the recommended initial treatment options. If ineffective, the provider may try second- and then third-line treatments, which often have more severe side effects or drug interactions. While these treatment guidelines provide a useful resource, the heterogeneity in patients' symptoms and tolerability of antidepressants means that identifying an effective treatment for an individual remains a process of trial and error [65].

The trial and error involved in identifying effective treatment for a given individual has prompted calls for more integration of evidence-based medicine in the treatment of mental health disorders [40, 48, 58]. Current state-of-the-art models provide prognostic predictions and support treatment selection [25, 49]. A number of questions must be answered before implementing these models, particularly regarding clinician expectations and system integration. To support the future integration of DSTs into primary care, we seek to better understand PCPs' support needs and expectations for how DSTs will function in the healthcare system.
3 STUDY 1
Participants. We worked with primary care physicians who currently prescribe antidepressant treatments. We recruited participants at a large academic medical center. We sent information about the study to an email list of primary care providers (including physicians, nurse practitioners, and residents). We provided each participant with a $20 Amazon gift card to thank them for their time.
Procedure. This study was approved by the Harvard Institutional Review Board. We used semi-structured interviews and focus groups with no more than two participants. Each session lasted 30 minutes due to participants' busy schedules. We used the first 15 minutes to discuss existing decision-making processes and the second 15 minutes to discuss future-state ideas. To drive discussion on current-state processes we used four questions to guide the semi-structured interview: 1) What factors do you consider when selecting an antidepressant? 2) What do you do if you are unsure which antidepressant to select? 3) Where is there room for improvement in this process? 4) Is there any information you wish you had when selecting a treatment for a patient?

We also wished to include clinicians in co-design activities to discuss their expectations for future DSTs. As this study was scheduled for March 2020, all study activities were revised and done remotely using Zoom. Prior work has discussed the challenges of running remote design studies [37]. To facilitate future-state ideation, we designed low-fidelity prototypes to demonstrate a variety of features and information we could potentially derive based on existing ML research that uses electronic medical record data. The prototypes included the following features.
Patient-level prognostic predictions:
(1) Stability score (figure 1A): The probability of continued use of the same medication for at least 3 months [26].
(2) Dropout score (figure 1A): The probability of early treatment discontinuation following prescription while staying in the same health system [49].
(3) Stability and dropout feature importance (figure 1B): A widely used approach for explaining ML predictions that shows relevant features and their contributions to the prediction [21]. In this context, these features include electronic health record codes that contributed to the stability and dropout predictions.
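As a minimal illustration of how a feature-importance display such as figure 1B can be derived, the sketch below computes per-feature contributions to a single prediction from a linear model. The model, feature names, and data are hypothetical stand-ins, not the stability and dropout models used in the prototype.

```python
# Sketch of per-prediction feature importance: for a linear model, each
# feature's contribution is its coefficient times the feature value.
# All names and data below are illustrative stand-ins.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
feature_names = ["prior_ssri_trial", "visit_count_12mo", "comorbid_anxiety_code"]
X = rng.normal(size=(500, 3))  # stand-in for EHR-derived features
y = ((X @ np.array([0.8, -0.5, 0.3]) + rng.normal(size=500)) > 0).astype(int)

model = LogisticRegression().fit(X, y)  # stand-in for the stability model

def feature_contributions(x):
    """Per-feature contribution (coefficient * feature value) to one patient's score."""
    contribs = model.coef_[0] * x
    return sorted(zip(feature_names, contribs), key=lambda p: -abs(p[1]))

for name, contrib in feature_contributions(X[0]):
    print(f"{name:>24s}: {contrib:+.2f}")
```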
Treatment selection support:
(1) Personalized treatment recommendations (figure 1C): To support treatment selection, we used drug-specific rules based on drug interactions. The rules were curated by two collaborating psychopharmacologists with a mean of 12 years of clinical practice in the United States. Example rules include "for anxiety favor mirtazapine" and "for poor concentration favor bupropion". The interface shows personalized treatment recommendations by showing which rules are relevant to a patient based on their electronic medical records. Through these rules, the tool can show which treatments would be favorable or not favorable for a patient, based on their medical history and the drug side effects (a sketch of this rule matching appears below).

The prototype includes a fabricated patient scenario, as seen in figure 1A. The scenario was vetted by psychopharmacology experts as a reasonable example of a real-world patient. In the user study, we asked participants to reflect on these prototypes by considering which aspects they found helpful or unhelpful, what they would change, and other ideas for how technology may support antidepressant treatment selection.
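To make the rule matching concrete, the sketch below encodes a few condition-drug rules and surfaces the ones relevant to a patient's recorded conditions. The rule set is illustrative only; it is not the curated clinical rule base used in the prototype.

```python
# Illustrative rule-based treatment matching; not the curated clinical rule base.
FAVOR, AVOID = "favorable", "not favorable"

RULES = [
    {"condition": "anxiety",            "drug": "mirtazapine", "effect": FAVOR},
    {"condition": "poor_concentration", "drug": "bupropion",   "effect": FAVOR},
    {"condition": "seizure_history",    "drug": "bupropion",   "effect": AVOID},  # assumed example rule
]

def recommend(patient_conditions):
    """Group the rules that match this patient's conditions by drug."""
    matched = {}
    for rule in RULES:
        if rule["condition"] in patient_conditions:
            matched.setdefault(rule["drug"], []).append(
                (rule["condition"], rule["effect"]))
    return matched

# Re-running recommend() with an edited condition set mirrors the interactive
# toggling of conditions added in the prototype redesign.
print(recommend({"anxiety", "poor_concentration"}))
```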
Analysis. All interviews were audio recorded and transcribed. To analyze the data we used a grounded theory approach [13]. The first author open coded the transcripts, comparing and contrasting existing decision-making processes and prototype feedback, and looking for patterns in the dataset. The research team then met to discuss emerging themes and review associated data segments. Using the established themes, the first author re-coded the transcripts and met with team members to discuss new categories as they emerged in subsequent coding iterations and to validate the final themes.
Results. Ten PCPs volunteered to participate in this study (six physicians, three resident physicians, and one nurse practitioner). Participants had an average of 12.5 years of experience prescribing antidepressants (SD=12.1). Table 1 includes participant details across both of the user studies described in this paper, including participants' years of experience prescribing antidepressants, and in which of the two studies they participated.

Through the interviews participants affirmed many established treatment selection challenges. Most commonly, participants spoke about their limited familiarity with antidepressant treatments, particularly outside of the selective serotonin reuptake inhibitor (SSRI) class, which are the most commonly prescribed antidepressants. Limited knowledge beyond SSRIs can lead to challenges when caring for patients who are not seeing improvements from these treatments. Clinicians also frequently brought up the difficulty they experienced in connecting patients with psychopharmacology when their treatment needs exceeded their care providers' comfort level. Participants discussed the need for guidance when prescribing second- or third-line treatments, as there are fewer clinical standards. Further, clinicians said that they frequently worked with patients who stopped taking their prescribed medications for a variety of reasons, including stigma, negative side effects, or a lack of drug effectiveness. Based on these challenges, we talked with participants about how they currently make treatment decisions and their vision of data-driven support tools. A key result from these discussions was that effective decision support tools will need to engage with the broader healthcare system, not just an individual healthcare provider. In the remainder of this section, we describe clinicians' expectations for how DSTs may better engage with the healthcare system in order to support complex treatment decisions.

Table 1: Participant details, including years of experience prescribing antidepressants, and in which of the two user studies they participated

ID    Experience (years)    Study 1 (n=10)    Study 2 (n=8)
P1    20                    X
P2    19                    X                 X
P3    <1                    X                 X
P4    6                     X
P5    5                     X
P6    1.5                   X
P7    9                     X                 X
P8    16                    X
P9    41                    X                 X
P10   7                     X
P11   2                                       X
P12   <1                                      X
P13   1                                       X
P14   1                                       X
While the primary goal of this project was to address clinicians' treatment decision challenges, participants saw a clear opportunity to use DSTs to engage patients in the decision-making process.
All of the clinicians in this study emphasized that MDD treatment decisions are a collaborative process. Though clinicians wanted to engage patients in treatment conversations, several noted the lack of available information designed for patients. Current informational resources about medications are considered to be either too simple or too complex for patients. Clinicians saw the DST output, particularly regarding personalized treatment recommendations and their associated side effects, as information that could be presented directly to patients in order to involve them in decision-making conversations, thus supporting patient-provider collaboration:

"If I was trying to decide between two meds and I'm talking to a patient... it's even something that you could potentially show a patient or say, 'These are two choices. I think they're both maybe equally effective. This one may have more of the side effect or something.'" – P8

Engaging patients also means providing ways to integrate patient treatment preferences into the interactive system design. Clinicians consistently said that they considered patient preferences. However, such information is not available within an electronic medical record, and therefore not accounted for within current ML models. Clinicians requested a tool for sharing and collaborating with patients, emphasizing the need for an interactive interface:

"Having an option like, patient's also worried about this and that. You can click on the two major side effects and then based on that, a specific drug will come up." – P3

Overall, clinicians responded positively to the treatment recommendation feature. They felt such tools could remind them of treatment side effects, and promote discussions of drug effects with patients. The emphasis on patient preferences in the treatment selection process motivated the need for interactive tools in which clinicians and patients could change the input variables.

Figure 1: Features included in the initial prototype (from top to bottom): (A) a patient scenario with stability and dropout scores, (B) stability score feature importance explanation, (C) personalized treatment recommendations.
Participants frequently commented that DST predictions should be paired with recommendations for appropriate next steps in the clinical workflow, often involving other healthcare providers. Notably, while some participants helped us to connect the ML predictions to existing health system procedures, others found this task difficult. Such difficulties indicate that only displaying the prediction will likely be insufficient:

"I think [patient dropout] is definitely an issue and it is something that I think about. I'm wondering what I would do differently if the score was higher versus lower, and if that would affect decisions or not. I'm also trying to think of what resources we would use or not in different situations." – P1

In this case, while the participant indicated general interest in the DST recommendations, the conversation did not result in any specific conclusions on how to use the DST output to identify possible next steps. In contrast, some clinicians were able to connect the model output with appropriate and existing healthcare processes. This came up both when discussing patient dropout predictions as well as treatment recommendations. In the context of patient predictions, clinicians shared that DSTs could recommend existing processes for patients at risk of stopping their medication:

"I feel like what that would tell me is, if there is a lower stability and higher dropout that it would be important to then involve more of a care team, rather than just say, why don't you see me in six weeks for a follow up? I would say, let me have so-and-so in my clinic call you in two weeks." – P7

"Especially if patients have trouble coming in, it could be longer or oftentimes that second visit is canceled. We do have some practice options for follow up. One of our pharmacists will sometimes do phone follow up and titration of these medications, so you can involve other people." – P2

In the case of dropout prediction, participants discussed three ways in which PCPs could respond to a high dropout risk prediction: including behavioral therapy with patient counseling in the care plan, lowering the drug titration, and reducing follow-up times. Participants said that these steps could be useful for patients at risk of dropout and use resources and procedures already established within the clinic.

We also found that while some predictions helped clinicians to identify appropriate actions and next steps, other model predictions were viewed less favorably. While clinicians saw value in dropout risk scores, we did not receive such positive feedback regarding stability scores. Unlike the dropout scores, some clinicians did not believe that patient-level stability scores guided them towards any particular interventions or next steps, and some clinicians saw stability scores as potentially harmful for seemingly "stable" patients:

"I don't know if it would change the initial management because I would see a patient back four to six weeks regardless, but maybe any sequential follow-up appointments I'd feel more comfortable spacing those out a little bit more." – P6

"I was just kind of thinking through this, how it might change my counseling when I'm prescribing the medication. I'd have to think a little bit more about that because it's not like if somebody is thought to be stable I'd want them to be suffering at home with side effects and not be reaching out to me." – P4

As in these examples, clinicians raised important concerns about DST recommendations. Clinicians were interested in knowing when they should follow up sooner for patients who may be at risk of discontinuing treatment, but pointed out that stability scores risk inappropriately indicating that current follow-up times could be lengthened, leading to potentially negative consequences for the patient.

Clinicians also discussed expectations for direction and next steps when considering personalized treatment recommendations. Clinicians wanted tools that showed all appropriate treatment options, rather than only showing one at a time:

"Rather than make me feel like there's no good option it might point me to consider something else. It may or may not be appropriate, but certainly, I would think about it I guess." – P2

As shown in the above examples, we found that clinicians expected DSTs to integrate with their existing processes and use the DST predictions to show a path forward. By including PCPs in design discussions, we were able to uncover which clinical processes may be most appropriate for various patients and treatment predictions. Clinicians were critical in identifying the connections between DST output and existing healthcare processes. However, these connections were not apparent to all healthcare providers, indicating that decision support tools should explicitly suggest appropriate next steps within the technology design.
In the interviews, we were surprised that discussions of trust in the technology were rarely initiated by participants. Probing questions revealed that PCPs' limited time with patients would make in-the-moment determinations of trust nearly impossible. All participants stressed that they have limited time with each patient and must focus on understanding all of the factors that will influence their treatment decision, including the patients' medical history, symptoms, and treatment preferences.

While participants were interested in using technology to help in their decision-making, they would not have the time in these encounters to determine if they thought a tool was trustworthy. However, they also noted that new technology introduced into the clinic would require substantial validation through randomized controlled trials, supporting their trust in new tools. Participants said they would use it if other clinicians used it, and would make a one-time decision about whether the tool was helpful:

"I think the biggest thing is just getting behind how you validated your data, how you validated your model ... I don't know if you necessarily need to get into super nitty-gritty details" – P6

"If a major medical society is sort of putting this forth, my colleagues are using it, and I hear people saying that it works, then I am comfortable with it." – P7

When looking at the feature importance charts, participants overall found the information unnecessary in determining how they will care for a patient, and ill-fitted for their short patient appointments:

"[The features] just feel a little random, these things. Again, I don't know if it would help me. I'm just not sure how it's going to help me change what I'm going to do." – P8

Thus, we found that conversations of trust were not typically initiated by participants because participants expected that trust in the technology will not be decided at each decision point. This result contrasts with work in the field of explainable AI, which often looks at designing explanations for each prediction or recommendation. Based on this feedback, we see that data validation procedures should be findable, but not forced upon providers who are already focused on understanding many complicated facets of a patient's history, health status, and treatment needs.
4 PROTOTYPE REDESIGN
We redesigned the DST prototype to operationalize the guidelines established in study 1. Table 2 describes the features we included in the prototype. We first considered ways in which we could better account for patient preferences, as clinicians all described the treatment selection process as a mutual and collaborative decision. Therefore, we made the treatment recommendations more interactive, allowing clinicians to edit which aspects of a patient's medical history are being used to drive the recommendations. For example, a clinician could select 'fatigue' if the patient was experiencing this symptom, and antidepressant treatments that are recommended for treating fatigue (and do not have a negative interaction with other patient symptoms) would appear as favorable. Educating patients about the potential negative side effects of the treatments was also an important aspect of clinicians' work, as this communication could help in collaborative decision-making and help reduce the risk of the patient discontinuing the treatment. We therefore display on hover all of the rules associated with a treatment. By displaying all of the rules for a drug, both the PCP and patient can better understand the potential effects and ensure that the treatment they select is appropriate for the patient. Given the important role patients play in the treatment selection process, we also acknowledge the importance of creating patient-facing tools to support education and decision-making. For this study, we focused specifically on clinician-facing tools, but plan to look into designing multi-user systems for patient-provider collaboration in the future.

Our second goal was to connect the DST predictions with appropriate clinical processes. Based on the discussions from the first study, in which participants expressed interest in patient dropout scores, we aimed to make this more prominent in the second iteration. We found through discussions with our clinical collaborators that the dropout prediction was difficult to understand without knowing the distribution of predictions across all patients. We included a graph to help visualize this distribution and support further conversation of information needs. We also included next steps PCPs could consider for high dropout risk predictions. As mentioned previously, these steps included lowering the medication titration, scheduling earlier follow-up visits with the patient, and making sure patients were set up with additional behavioral therapy.

We also altered the interface for the treatment recommendations. Rather than asking clinicians to search for a specific treatment, we used a matrix design that allows clinicians to compare antidepressants side by side. Participants indicated that such comparisons were important for helping them consider all of the possible treatment options. We selected the matrix design in order to display a large amount of information in a glanceable display. Also, PCPs currently use a publicly available table that includes information about antidepressant treatments. The matrix design mirrored the table format that PCPs currently use, while adding the needed interactivity.

Finally, we aimed to design the prototype to work within the time constraints of the medical system. We first replaced the feature-importance explanations with information about the model validation process. To view this information, we added a link labeled "how dropout is calculated" below the dropout prediction. The link leads to a page that lists the steps used to validate the model. We expect that this page will expand with each validation study, including clinical trial protocols, results, and publications. Clinicians did express interest in how the model was validated and the results of any randomized trials or other validation studies of the technology. By making these details easily findable, our intention was to make this information accessible, while also respecting PCPs' limited time with patients. We aimed to make validation data available, without distracting from the information most critical to the decision-making process.
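One way to operationalize the distribution view and the top-quartile next steps described above is sketched below: the patient's dropout score is located within the population of predicted scores, and the recommended actions are attached only for the top quartile. The population scores here are simulated; a real system would use the model's predictions over the patient population.

```python
# Sketch: situate one patient's dropout score in the population distribution
# and attach next steps for the top quartile. Scores here are simulated.
import numpy as np

population_scores = np.random.default_rng(1).beta(2, 5, size=10_000)  # stand-in predictions

NEXT_STEPS = [  # drawn from clinicians' suggestions in study 1
    "schedule an earlier follow-up visit",
    "lower the medication titration",
    "set the patient up with additional behavioral therapy",
]

def dropout_summary(score):
    """Report the score's percentile and, for the top quartile, next steps."""
    percentile = (population_scores < score).mean() * 100
    lines = [f"Dropout score {score:.2f} ({percentile:.0f}th percentile of all patients)"]
    if percentile >= 75:  # top quartile triggers the recommended next steps
        lines += ["Recommended next steps:"] + [f"  - {step}" for step in NEXT_STEPS]
    return "\n".join(lines)

print(dropout_summary(0.55))
```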
5 STUDY 2
Participants. For this study we again worked with PCPs who currently prescribe antidepressant treatments. To recruit participants we worked with the same academic medical center and used the same recruitment method as with the first user study. We invited both new participants as well as those who participated in the prior study. We provided each participant with a $20 Amazon gift card to thank them for their time.
Procedure. This study was approved by the Harvard Institutional Review Board. Each study session lasted 30 minutes and was conducted remotely using Zoom. During the study session, participants were provided links to the prototype, asked to share their screen, and were able to interact with the prototype freely. After a participant was presented with the link, we asked the following questions in order to guide discussion and feedback:
(1) Imagine this patient is sitting in front of you, how would you make a treatment decision?
(2) What helped you make a decision?
(3) Did anything detract from making a decision? Or your confidence in the decision?
(4) If you were going to design this tool for a colleague, what would they need to make a decision? What would you change?

Due to our focus on walking through a decision process that replicates real-world decision-making, we worked with experts in clinical psychology and psychopharmacology to create a realistic patient scenario. We presented clinicians with a short summary of essential patient information, including age, gender, relevant comorbidities, and a prior ineffective SSRI trial, as shown in figure 2A.

Figure 2: Features included in the prototype redesign (from top to bottom): (A) patient information, (B) dropout score with links to further information about how dropout is defined and validation studies conducted on the tool, (C) interactive personalized treatment recommendations.

Table 2: Summary of changes made to the design of the MDD decision support prototype based on study 1 findings

Include patient preferences (support patient-provider communication, address missing variables):
• Make treatment recommendations interactive, so that clinicians and patients may edit the input variables based on changes to a patient's medical history or side effect preferences
• When hovering over a treatment, show all potential side effects for that antidepressant, in order to foster communication and education of potential medication effects

Recommend appropriate clinical processes (show a path forward, provide actionable information):
• For patients with a dropout risk prediction in the top quartile, present recommended next steps based on clinicians' suggestions
• Allow for viewing and comparing multiple antidepressant options

Understand system constraints (do not require determination of trust at every decision point):
• Refocus from model features to model validation process
• Present an overview of all model validation steps in a single screen that is accessible from the main interface, but not combined with patient details
Analysis. Similar to study 1, all sessions were audio recorded and transcribed. To analyze the data, we continued to use a grounded theory approach [13], first using an iterative inductive analysis to establish themes within the data. As in study 1, the first author open coded the transcripts and identified an initial set of themes. The research team then met to discuss themes and associated data segments, discuss discrepancies, and amend theme definitions. The first author then re-coded the data using the established themes and met with the research team to validate the final set of themes. During this process, we found many similar themes to study 1. Therefore we also ran a deductive coding process [17], using the codes from the first user study in a subsequent analysis to identify thematic overlap, allowing us to assess specific feedback related to the study 1 guidelines.
Results. Eight PCPs enrolled in this user study (six physicians, two residents, and one nurse practitioner), four of whom also participated in the first study. Participants in this study had an average of 9.4 years of experience prescribing antidepressants (SD=14.3). While we did not directly ask about the results of study 1, we found that much of the feedback re-emphasized the themes we previously discussed, helping to validate our previous findings and our approach to operationalizing those guidelines. The ability to interact with a high-fidelity prototype also led participants to discuss new opportunities and challenges.
Much of the prototype feedback re-emphasized the lessons we learned in the first user study. Participants discussed ways in which the prototype successfully met their needs and opportunities for the system to further align with their expectations.

Participants positively responded to the ways in which we integrated clinical processes by displaying relevant actions. Participants commented that the recommended steps associated with dropout risk predictions were actionable and aligned with what they typically do for patients when they are concerned about their response to treatment:

"One of the things that is the hardest about depression and anxiety is that patients who have really bad symptoms tend to not follow up or tend to not follow through with therapies just due to the nature of their disease. And so going into a room and already knowing, is there a high chance that this patient may fail on this therapy or may not adhere to this therapy, and going into the room knowing that this is someone who I need to talk to a little bit more or I need to follow up with a little bit more or who I need to schedule really close appointments for. I think that's probably the biggest help that you can offer." – P11
In addition to the positive feedback, some participants mentioned opportunities to continue to integrate the DST with existing clinical processes. For example, multiple clinicians discussed the benefit of connecting treatment recommendations with their prescription systems:

"I guess if I were using this in practice, I would probably click to see if prescribing information or something like that came up as a next step." – P2

Clinicians also responded positively to the ability to edit patient conditions to see how the inclusion or exclusion of various conditions changed treatment recommendations, allowing them to incorporate patient concerns:

"You can sort of click this and it helps to think about if I add poor concentration, if maybe that's another side effect she's having and it kind of modifies the medication regimen based off of that and bupropion is still appropriate for that. So I think that toggling conditions is super helpful." – P11

Clinicians also suggested additional conditions that should be included, such as pregnancy and suicidality, as these will influence their treatment decision. Clinicians also discussed ways in which the prototype could further support patient communication. Participants did respond that the treatment recommendations could facilitate patient counseling, as expressed in study 1. Even further, some clinicians saw an opportunity to use the information from the DST to create educational sources for patients:

"It would be nice if, after you make a selection, like say I click, okay, we're going to go with bupropion, if there was a patient-friendly educational handout that would just say, we're starting you on bupropion. As a reminder, side effects to expect, side effects that are less common. Just something that I could give the patient so that they remember why we went on this and what might be normal, because a lot of the time I'll end up writing that down. But if it's already in here and I can hit print, that might be useful." – P14

Finally, a goal with this prototype was to allow clinicians to establish trust within the tool while being more mindful of their time constraints. We found that, similar to study 1, discussions of trust were infrequently initiated by participants, but the link to model validation did pique participants' interest. Interacting with this part of the prototype led to conversations about the types of validations that clinicians would expect to see:

"If you could show that patients have a better response to treatment by use of the algorithm, that would be amazing. If you can show that patients actually are more likely to adhere to treatment, that would be important as well, or that patients are less likely to develop adverse side effects that leads to stopping medications. It would be nice to do a trial with outcomes like that." – P9

These discussions reemphasized the type of validation methods that would help clinicians to establish trust. Overall, we found that participants positively responded to the design concepts, reemphasized the expectations established in the previous study, and discussed ways in which the system could further align with these expectations in subsequent designs. As we discuss in the next section, interacting with the prototype also revealed a new way in which DSTs must engage with the broader healthcare system.
While participants responded positively to the overall prototype design, we found that when the recommendations diverged from clinical knowledge or guidelines participants became confused and would often abandon the recommendation. There were two ways in which the prototype differed from clinicians' expectations. First, the system predicted a high dropout risk for a patient who participants would not typically assume to be a high-risk patient:

"To me, when I think of someone who's a high risk of dropout, it's like a person with substance abuse or with bipolar disorder. Like, okay, obviously, they're not going to show up. There's a high chance they won't adhere to it. But for this person who seems like a relatively bread and butter, middle-aged, healthy person, the fact that she's on the higher end of dropout is kind of eye opening to me, and it almost makes me wonder then, who is on the lower end of dropout?" – P7

Surprising predictions, such as this one, led to greater interest in the depression score and how it was calculated. For example, as in the quote below, participants' next step was to understand what factors were leading to this unexpectedly high dropout prediction:

"And I have to like, look more into this, but the dropout probability, is it because of the side effects that you're dropping out? Or is it because the medication is not effective." – P3

Here, we see that when faced with unexpected model output that contrasted with participants' expectations or mental model, participants found it challenging to identify appropriate next steps. We also found that treatment recommendations at times contrasted with participants' expectations. Specifically, treatments that were listed as either favorable or neutral included medications that clinicians would not typically prescribe to patients. In some cases, clinicians indicated that seeing a surprising treatment recommendation would give them pause and lead to more reflection about the patient's case and medication needs:

"So truthfully, I would take a step back because it's not that common that nortriptyline is a medication I think about as a first or even a second or third line agent, unless they have other conditions that I know [tricyclic antidepressants] can treat. So I would really take a step back and think about the patient's pain. Do they have really bad migraines, that I think will get significant benefit from the TCAs. It would definitely give me pause if that was the most favorable medication to come up as a suggestion on this." – P11

The examples we presented in this section describe reactions to scenarios in which the output of the machine learning model contrasts with clinical experiences or standards of care. Importantly, as new machine learning models continue to advance, so may the opportunities for model output to contrast with clinicians' expectations or existing care standards. Our findings reveal a need for researchers to consider how to adapt DST system designs for instances in which the machine learning model output contrasts with existing domain knowledge.
6 DISCUSSION
Through conversations on their expectations for AI support, clinicians revealed a number of critical aspects of the sociotechnical healthcare system that need to be considered in the design of novel decision support tools. Specifically, our work highlights the importance of including patient preferences, recommending clinical processes, understanding system constraints, and engaging with domain knowledge. In the remainder of this section, we discuss how the prototype feedback reveals lessons for how we design AI for healthcare systems and we reflect on the implications of this work for HCI research.

These findings provide concrete implications for the specific context of MDD. We believe the lessons learned from this work may be transferable to healthcare systems and processes that share the characteristics emphasized by the participants in this work: 1) patients are deeply involved in treatment decisions, and 2) providers have short and infrequent appointments with patients. In particular, we expect these results will be useful for designing ML-embedded DSTs for other primary care decisions. Future work should look at how needs and expectations change for medical specialists, healthcare systems in which patients are unwilling or unable to participate in care decisions, and for patients directly.
The clinicians who participated in this study encouraged us to use DSTs to foster greater collaboration with patients. While AI research continues to focus on improving information accuracy for clinicians, modern clinical care aspires toward shared decision-making, with patients and clinicians working together to make decisions [12, 39]. Consequently, clinicians' feedback challenged the idea that AI-driven innovations in healthcare will be single-user systems.

We expect that designing multi-user systems that engage patients and their healthcare providers may have several benefits. First, such tools can help patients have a greater voice in their healthcare decisions. Health tools that directly interact with patients can help promote patient activation by increasing access to important health information and providing new ways to engage in their health care [55, 61]. Studies have connected increased patient activation to improved healthcare experiences and health outcomes [23]. A recent study also found that including patient-facing DSTs can improve clinicians' adherence to recommended practices when compared to DSTs that were only clinician-facing [66]. Finally, we believe multi-user systems can support time management during clinical encounters. DSTs will likely be able to communicate healthcare options quickly to patients, and may provide tailored educational materials that are continuously available to patients. Such technological support may address both the time constraints in primary care settings and the cognitive constraints of the patient, who can experience information overload in clinical encounters [34]. Yet, very few studies have looked at creating AI tools for patients [30]. In an initial attempt to promote more inclusion of patients' preferences, we used interaction to help tailor treatment recommendations. However, we see this as a small step towards a larger problem. In the future, we intend to co-design such systems with patients directly, in order to amplify their voices in their own care decisions.
We see a clear need for DSTs to explicitly draw the connections between the model output and actionable next steps. A consistent theme within this study was that clinicians wanted tools that provided actionable interventions, connecting predictions to appropriate clinical processes. Often, these processes involve additional healthcare providers. Thus, decisions in healthcare, such as treatment selection, are not siloed tasks. Rather, these healthcare decisions affect many other aspects of care. For MDD, this meant connecting patients with behavior therapy, pharmacology, and nurses who could follow up with them and track their progress. The ability of some clinicians to connect DST output to existing clinical processes demonstrates the benefit of integrating participatory design methods into DST development workflows.

We also found that clinicians were thoughtful in considering the possible adverse effects of DST predictions. This became clear as clinicians considered the potential effects of stability and dropout predictions. While clinicians saw value in tailoring care for patients with high dropout risk, some clinicians were wary of stability scores. Some clinicians commented that they were unable to identify a clear next step for patients with high stability scores, while others were concerned that these patients would not receive the needed attention. We have seen several examples in recent years of AI predictions leading to biased or unfair behaviors [20, 45]. Engaging clinicians or other target users of AI tools in design fiction methods [44] may be another useful step in the AI design process.
While explaining black-box models is a consistent theme in AI work, we need best practices for adapting the design of DSTs for time-constrained environments. In the case of MDD, and primary care settings more generally, time constraints will consistently need to be considered in the design of any novel health tools. Time constraints have been cited as the most common barrier to the adoption of new decision support tools [14].

Designing for fast-paced, time-constrained work environments has important design implications, particularly in the context of current explainable AI research. In our work, we found that due to time limitations, clinicians wanted to determine their trust in the technology one time, rather than at each decision point. Therefore, clinicians wanted DSTs to display the evidence-based methods used to validate the tool (such as randomized controlled trial results), rather than individual explanations that focus on model features. Prior work has also noted that clinicians wanted ML tools to more closely align with evidence-based medicine methods [64]. In our iterative design, our shift from model features as explanations to tool validation steps helped to better reflect the evidence-based medicine process.
Recent literature in explainable AI has made important progress towards improving transparency of AI models by creating interpretable explanations for the model's output [67]. These explanations can influence decision-makers' understanding of the model and perceived fairness of these tools [15]. However, in high-stakes or time-critical environments, this process places time and mental burden on users. In the case of primary care, our results indicate that explanations for each model prediction would be unusable due to the limited time clinicians have with each patient. Our results therefore align with recent discussions of the potential problems of using explainable AI in clinical settings [52]. However, without the time to review an AI prediction or recommendation in detail, greater responsibility needs to be placed up front to determine when the system is likely to err, and when the tool should perhaps not be shown altogether.
We found that when the DST output did not align with clinicalknowledge or guidelines, clinicians were left confused, with mostopting to abandon the recommendation. While explanations to sup-port transparency at each decision point were seen as unhelpfulin general, more details in the case of surprising predictions wereviewed more favorably. In these cases, clinicians requested infor-mation about causal factors that would allow them to interveneappropriately. This is notably distinct from the feature-importanceexplanations we produced in study 1, which included correlatedfeatures produced by the model. Prior studies have also found thatfeature-based explanations were inadequate for helping cliniciansidentify appropriate interventions [71].We do not yet have established best practices for dealing withcontrasting information, but helping clinicians identify the bestway to proceed is critical. Contrasting information in medicine canresult in clinical uncertainty and adverse effects for patients [11].The introduction of AI in clinical settings will likely increase theprevalence of contrasting information, as ML models trained onvast data sets have the promise to uncover nuanced relationshipsthat are not encoded in existing medical training, guidelines, orclinician’s expectations. We see this as an important opportunityfor future work to develop best practices for cases in which DSTrecommendations diverge from domain knowledge, and show whydivergence is happening in a way that is both understandable tothe user and presents actionable next steps.Based on our results, we see an opportunity to present on-demand explanations as differentials from existing clinicalguidelines.
This is consistent with the emerging understanding in the explainable AI community that, in human-human discourse, explanations are typically contrastive [42, 50]. Rather than providing all evidence in support of a recommendation, such explanations could visualize how and why a machine learning model's output diverges from existing guidelines or expert knowledge. This will require machine learning models to reason robustly about users' existing mental models. In healthcare, the use of standardized clinical guidelines (e.g., [31, 38] for MDD treatment selection) means that aspects of these mental models are established, encodable, and may therefore be used in both the development of the machine learning models and the design of DST interfaces. Ehsan and Riedl have also suggested that allowing users to voice skepticism in AI models may afford new interactions that encourage users to consider the limitations of the technology [16]. In the context of healthcare decisions, allowing clinicians and patients to voice skepticism and highlight surprising DST outputs may allow the underlying models to adapt as medical guidelines continue to evolve. A critical area for future work is designing tools that allow users to identify contrasting information in time-constrained environments and determine how best to proceed.
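To make the idea of guideline differentials concrete, consider the following sketch. It is illustrative only: the encoded first-line list and treatment names are placeholders, not a validated knowledge base, and a real system would need a faithful encoding of guidelines such as [31, 38].

# A minimal sketch of guideline-differential explanations: surface, on
# demand, only the places where the model departs from an encoded
# first-line guideline, rather than all evidence for a recommendation.
from typing import List

FIRST_LINE = ["sertraline", "escitalopram", "bupropion"]  # illustrative only

def guideline_differential(model_ranking: List[str],
                           first_line: List[str] = FIRST_LINE,
                           k: int = 3) -> List[str]:
    """Return treatments in the model's top-k that diverge from first-line
    guidance; an empty list means the output is guideline-consistent."""
    return [t for t in model_ranking[:k] if t not in first_line]

def on_demand_explanation(model_ranking: List[str]) -> str:
    diverging = guideline_differential(model_ranking)
    if not diverging:
        return "Consistent with first-line guidance; no differential to show."
    return ("Diverges from first-line guidance for: " + ", ".join(diverging)
            + ". Patient-specific rationale available on request.")

The design choice here is that detailed, patient-specific rationale is generated only for the diverging subset, matching our finding that clinicians wanted extra detail only for surprising predictions.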
LIMITATIONS
The application of machine learning in healthcare brings numerous practical, ethical, and legal issues. Here we address one challenge in implementing these tools in the real world: considering the broader healthcare system. In this study we used an iterative design process to guide discussions of clinician expectations. We do not expect that our design is the optimal visualization, and much work remains in the area of data visualization to represent both the distribution of ML predictions and the uncertainty within the model. Further, this study focused solely on the perspective of clinicians. While we see this as an important first step, future studies should engage with other stakeholders, such as patients, nurses, pharmacists, and therapists.
CONCLUSION
In this paper, we report on aspects of the healthcare sociotechnical system that should be considered in the design of machine learning decision support tools. Based on co-design studies with primary care providers, we identified four important aspects of the care system that influence how we design decision support tools for real-world use: patients' preferences, clinical processes that often include multiple healthcare providers, the constraints of the healthcare system, and existing domain knowledge. We posit that by making these aspects of healthcare central to the design of DSTs, we may develop tools that are capable of supporting the collaborative nature of healthcare, identifying potential adverse events caused by ML predictions, working within time-critical environments, and recognizing conflicting information. We do not expect that the sociotechnical factors discussed here represent the full set of sociotechnical considerations that need to be included in AI design. Rather, we present this as an initial step in a broader research agenda determining how the new wave of intelligent systems must account for the complexity of medical work.
ACKNOWLEDGMENTS
This research was supported in part by the Harvard Data Science Initiative. We thank all of the participants for taking time to share their insights and expertise.
REFERENCES
[1] A. D. Bryant, G. S. Fletcher, and T. H. Payne. 2014. Drug interaction alert override rates in the Meaningful Use era. Appl Clin Inf (2014).
[2] The John M. Olin Center for Law, Economics, and Business Fellows' Discussion Paper Series (2019), 1–44.
[3] Saleema Amershi, Dan Weld, Mihaela Vorvoreanu, Adam Fourney, Besmira Nushi, Penny Collisson, Jina Suh, Shamsi Iqbal, Paul N. Bennett, Kori Inkpen, Jaime Teevan, Ruth Kikin-Gil, and Eric Horvitz. 2019. Guidelines for Human-AI Interaction. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–13.
[4] Gagan Bansal, Besmira Nushi, Ece Kamar, Walter S. Lasecki, Daniel S. Weld, and Eric Horvitz. 2019. Beyond Accuracy: The Role of Mental Models in Human-AI Team Performance. Proceedings of the AAAI Conference on Human Computation and Crowdsourcing 7, 1 (2019), 2–11.
[5] Emma Beede, Elizabeth Baylor, Fred Hersch, Anna Iurchenko, Lauren Wilcox, Paisan Raumviboonsuk, and Laura Vardoulakis. 2020. A Human-Centered Evaluation of a Deep Learning System Deployed in Clinics for the Detection of Diabetic Retinopathy. In CHI Conference on Human Factors in Computing Systems (CHI '20). 1–12.
[6] Marc Berg. 1999. Patient care information systems and health care work: a sociotechnical approach. International Journal of Medical Informatics (1999), 87–101. https://doi.org/10.1016/S1386-5056(99)00011-8
[7] Marc Berg, J. Aarts, and J. Van der Lei. 2003. ICT in Health Care: Sociotechnical Approaches. Methods of Information in Medicine 42, 4 (2003), 297–301. https://doi.org/10.1267/METH03040297
[8] Zana Buçinca, Phoebe Lin, Krzysztof Z. Gajos, and Elena L. Glassman. 2020. Proxy tasks and subjective measures can be misleading in evaluating explainable AI systems. In Proceedings of the 25th International Conference on Intelligent User Interfaces. 454–464. https://doi.org/10.1145/3377325.3377498 arXiv:2001.08298
[9] Adrian Bussone, Simone Stumpf, and Dympna O'Sullivan. 2015. The role of explanations on trust and reliance in clinical decision support systems. In Proceedings of the 2015 IEEE International Conference on Healthcare Informatics (ICHI 2015). 160–169. https://doi.org/10.1109/ICHI.2015.26
[10] Carrie J. Cai, Emily Reif, Narayan Hegde, Jason Hipp, Been Kim, Daniel Smilkov, Martin Wattenberg, Fernanda Viegas, Greg S. Corrado, Martin C. Stumpe, and Michael Terry. 2019. Human-Centered Tools for Coping with Imperfect Algorithms During Medical Decision-Making. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. https://doi.org/10.1145/3290605.3300234
[11] Delesha M. Carpenter, Lorie L. Geryk, Annie T. Chen, Rebekah H. Nagler, Nathan F. Dieckmann, and Paul K. J. Han. 2016. Conflicting health information: a critical research need. Health Expectations 19, 6 (2016), 1173–1182. https://doi.org/10.1111/hex.12438
[12] Cathy Charles, Amiram Gafni, and Tim Whelan. 1997. Shared decision-making in the medical encounter: What does it mean? (Or it takes, at least two to tango). Social Science and Medicine 44, 5 (1997), 681–692. https://doi.org/10.1016/S0277-9536(96)00221-3
[13] Kathy Charmaz and Linda Liska Belgrave. 2012. Qualitative interviewing and grounded theory analysis. The SAGE Handbook of Interview Research: The Complexity of the Craft (2012), 347–366. https://doi.org/10.4135/9781452218403.n25
[14] Srikant Devaraj, Sushil K. Sharma, Dyan J. Fausto, Sara Viernes, and Hadi Kharrazi. 2014. Barriers and Facilitators to Clinical Decision Support Systems Adoption: A Systematic Review. Journal of Business Administration Research 3, 2 (2014). https://doi.org/10.5430/jbar.v3n2p36
[15] Jonathan Dodge, Q. Vera Liao, Yunfeng Zhang, Rachel K. E. Bellamy, and Casey Dugan. 2019. Explaining models: An empirical study of how explanations impact fairness judgment. In International Conference on Intelligent User Interfaces. 275–285. https://doi.org/10.1145/3301275.3302310
[16] Upol Ehsan and Mark O. Riedl. 2020. Human-centered Explainable AI: Towards a Reflective Sociotechnical Approach. arXiv preprint arXiv:2002.01092. http://arxiv.org/abs/2002.01092
[17] Satu Elo and Helvi Kyngäs. 2008. The qualitative content analysis process. Journal of Advanced Nursing 62, 1 (2008), 107–115. https://doi.org/10.1111/j.1365-2648.2007.04569.x
[18] Maurizio Fava and Kenneth S. Kendler. 2000. Major Depressive Disorder. Neuron 28 (2000), 335–341. https://doi.org/10.1016/B978-012373947-6.00245-2
[19] Paolo Fusar-Poli, Ziad Hijazi, Daniel Stahl, and Ewout W. Steyerberg. 2018. The Science of Prognosis in Psychiatry: A Review. JAMA Psychiatry 75, 12 (2018), 1280–1288. https://doi.org/10.1001/jamapsychiatry.2018.2530
[20] Ben Green and Yiling Chen. 2019. The principles and limits of algorithm-in-the-loop decision making. Proceedings of the ACM on Human-Computer Interaction (2019).
[21] Comput. Surveys 51, 5 (2018).
[22] Jianxing He, Sally L. Baxter, Jie Xu, Jiming Xu, Xingtao Zhou, and Kang Zhang. 2019. The practical implementation of artificial intelligence technologies in medicine. Nature Medicine 25, 1 (2019), 30–36. https://doi.org/10.1038/s41591-018-0307-0
[23] Judith H. Hibbard and Jessica Greene. 2013. What the evidence shows about patient activation: Better health outcomes and care experiences; fewer data on costs. Health Affairs 32, 2 (2013), 207–214. https://doi.org/10.1377/hlthaff.2012.1061
[24] Fred Hohman, Andrew Head, Rich Caruana, Robert DeLine, and Steven M. Drucker. 2019. Gamut: A Design Probe to Understand How Data Scientists Understand Machine Learning Models. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI '19). 1–13. https://doi.org/10.1145/3290605.3300809
[25] Michael C. Hughes, Gabriel Hope, Leah Weiner, Thomas H. McCoy, Roy H. Perlis, Erik Sudderth, and Finale Doshi-Velez. 2018. Semi-Supervised Prediction-Constrained Topic Models. In Proceedings of the 21st International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 84.
[26] Michael C. Hughes, Melanie F. Pradier, Andrew Slavin Ross, Thomas H. McCoy, Roy H. Perlis, and Finale Doshi-Velez. 2020. Assessment of a Prediction Model for Antidepressant Treatment Stability Using Supervised Topic Models. JAMA Network Open 3, 5 (2020), e205308. https://doi.org/10.1001/jamanetworkopen.2020.5308
[27] Catherine Hutton and Jane Gunn. 2007. Do longer consultations improve the management of psychological problems in general practice? A systematic literature review. BMC Health Services Research (2007).
[28] BMJ Open 7, 10 (2017), 1–15. https://doi.org/10.1136/bmjopen-2017-017902
[29] Maia Jacobs, Melanie F. Pradier, Thomas H. McCoy Jr., Roy H. Perlis, Finale Doshi-Velez, and Krzysztof Z. Gajos. 2021. How machine learning recommendations influence clinician treatment selections: the example of antidepressant selection. Translational Psychiatry (2021). In press.
[30] Aditya V. Karhade, Paul T. Ogink, Quirina C. B. S. Thio, Thomas D. Cha, William B. Gormley, Stuart H. Hershman, Timothy R. Smith, Jianren Mao, Andrew J. Schoenfeld, Christopher M. Bono, and Joseph H. Schwab. 2019. Development of machine learning algorithms for prediction of prolonged opioid prescription after surgery for lumbar disc herniation. Spine Journal 19, 11 (2019), 1764–1771. https://doi.org/10.1016/j.spinee.2019.06.002
[31] Sidney H. Kennedy, Raymond W. Lam, Roger S. McIntyre, S. Valérie Tourjman, Venkat Bhat, Pierre Blier, Mehrul Hasnain, Fabrice Jollant, Anthony J. Levitt, Glenda M. MacQueen, Shane J. McInerney, Diane McIntosh, Roumen V. Milev, Daniel J. Müller, Sagar V. Parikh, Norma L. Pearson, Arun V. Ravindran, and Rudolf Uher. 2016. Canadian Network for Mood and Anxiety Treatments (CANMAT) 2016 clinical guidelines for the management of adults with major depressive disorder: Section 3. Pharmacological Treatments. Canadian Journal of Psychiatry 61, 9 (2016), 540–560. https://doi.org/10.1177/0706743716659417
[32] Ronald C. Kessler, P. Berglund, O. Demler, R. Jin, D. Koretz, K. R. Merikangas, A. J. Rush, E. E. Walters, A. Wang, Barry Rovner, and Robin Casten. 2003. The epidemiology of major depressive disorder. Evidence-Based Eye Care 4, 4 (2003), 186–187. https://doi.org/10.1097/00132578-200310000-00002
[33] Saif Khairat, David Marc, William Crosby, and Ali Al Sanousi. 2018. Reasons for physicians not adopting clinical decision support systems: Critical analysis. Journal of Medical Internet Research 20, 4 (2018). https://doi.org/10.2196/medinform.8912
[34] Israa Khaleel, Barbara C. Wimmer, Gregory M. Peterson, Syed Tabish Razi Zaidi, Erin Roehrer, Elizabeth Cummings, and Kenneth Lee. 2020. Health information overload among health consumers: A scoping review. Patient Education and Counseling (2020).
[35] Journal of the American Medical Informatics Association 14, 1 (2007), 29–40. https://doi.org/10.1197/jamia.M2170
[36] Yuhan Luo, Peiyi Liu, and Eun Kyoung Choe. 2019. Co-Designing Food Trackers with Dietitians. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–13. https://doi.org/10.1145/3290605.3300822
[37] Haley MacLeod, Ben Jelen, Annu Prabhakar, Lora Oehlberg, Katie Siek, and Kay Connelly. 2017. A Guide to Using Asynchronous Remote Communities (ARC) for Researching Distributed Populations. EAI Endorsed Transactions on Pervasive Health and Technology 3, 11 (2017), 152898. https://doi.org/10.4108/eai.18-7-2017.152898
[38] Glenda MacQueen, Pasqualina Santaguida, Homa Keshavarz, Natalia Jaworska, Mitchell Levine, Joseph Beyene, and Parminder Raina. 2017. Systematic review of clinical practice guidelines for failed antidepressant treatment response in major depressive disorder, dysthymia, and subthreshold depression in adults. The Canadian Journal of Psychiatry 62, 1 (2017), 11–23.
[39] Gregory Makoul and Marla L. Clayman. 2006. An integrative model of shared decision making in medical encounters. Patient Education and Counseling 60, 3 (2006), 301–312. https://doi.org/10.1016/j.pec.2005.06.010
[40] Andreas Menke. 2018. Precision pharmacotherapy: psychiatry's future direction in preventing, diagnosing, and treating mental disorders. Pharmacogenomics and Personalized Medicine 11 (2018), 211–222. https://doi.org/10.2147/PGPM.S146110
[41] B. Middleton, D. F. Sittig, and A. Wright. 2016. Clinical Decision Support: a 25 Year Retrospective and a 25 Year Vision. IMIA Yearbook of Medical Informatics (2016), 103–116. https://doi.org/10.15265/IYS-2016-s034
[42] Tim Miller. 2019. Explanation in artificial intelligence: Insights from the social sciences. Artificial Intelligence 267 (2019), 1–38. https://doi.org/10.1016/j.artint.2018.07.007 arXiv:1706.07269
[43] Karen C. Nanji, Sarah P. Slight, Diane L. Seger, Insook Cho, Julie M. Fiskio, Lisa M. Redden, Lynn A. Volk, and David W. Bates. 2014. Overrides of medication-related clinical decision support alerts in outpatients. Journal of the American Medical Informatics Association 21, 3 (2014), 487–491. https://doi.org/10.1136/amiajnl-2013-001813
[44] Renee Noortman, Britta F. Schulte, Paul Marshall, Saskia Bakker, and Anna L. Cox. 2019. Hawkeye – Deploying a design fiction probe. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–14. https://doi.org/10.1145/3290605.3300652
[45] Ziad Obermeyer, Brian Powers, Christine Vogeli, and Sendhil Mullainathan. 2019. Dissecting racial bias in an algorithm used to manage the health of populations. Science 366, 6464 (2019), 447–453.
[46] Proceedings of the 2018 CHI Conference on Human Factors in Computing Systems (2018), 649:1–649:13. https://doi.org/10.1145/3173574.3174223
[47] Jerome A. Osheroff, Jonathan M. Teich, Blackford Middleton, Elaine B. Steen, Adam Wright, and Don E. Detmer. 2007. A Roadmap for National Action on Clinical Decision Support. Journal of the American Medical Informatics Association 14, 2 (2007), 141–145. https://doi.org/10.1197/jamia.M2334
[48] Roy H. Perlis. 2016. Abandoning personalization to get to precision in the pharmacotherapy of depression. World Psychiatry 15 (2016), 228–235.
[49] Melanie F. Pradier, Thomas H. McCoy, Michael Hughes, Roy H. Perlis, and Finale Doshi-Velez. 2020. Predicting treatment dropout after antidepressant initiation. Translational Psychiatry 10, 1 (2020). https://doi.org/10.1038/s41398-020-0716-y
[50] Melanie F. Pradier, Javier Zazo, Sonali Parbhoo, Roy H. Perlis, Maurizio Zazzi, and Finale Doshi-Velez. 2021. Preferential Mixture-of-Experts: Interpretable Models that Rely on Human Expertise As Much As Possible. In AMIA 2021 Virtual Informatics Summit.
[51] Wanda Pratt, Madhu C. Reddy, David W. McDonald, Peter Tarczy-Hornoch, and John H. Gennari. 2004. Incorporating ideas from computer-supported cooperative work. Journal of Biomedical Informatics 37, 2 (2004), 128–137. https://doi.org/10.1016/j.jbi.2004.04.001
[52] Cynthia Rudin. 2019. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence 1, 5 (2019), 206–215. https://doi.org/10.1038/s42256-019-0048-x arXiv:1811.10154
[53] A. Rush. 2006. Acute and Longer-Term Outcomes in Depressed Outpatients Requiring One or Several Treatment Steps: A STAR*D Report. American Journal of Psychiatry 163, 11 (2006), 1905–1917.
[54] CHI '20. arXiv:2003.03541
[55] Kumiko O. Schnock, Julia E. Snyder, Theresa E. Fuller, Megan Duckworth, Maxwell Grant, Catherine Yoon, Stuart Lipsitz, Anuj K. Dalal, David W. Bates, and Patricia C. Dykes. 2019. Acute care patient portal intervention: Portal use and patient activation. Journal of Medical Internet Research 21, 7 (2019), 1–11. https://doi.org/10.2196/13336
[56] Andrew D. Selbst, Danah Boyd, Sorelle A. Friedler, Suresh Venkatasubramanian, and Janet Vertesi. 2019. Fairness and abstraction in sociotechnical systems. In Proceedings of the 2019 Conference on Fairness, Accountability, and Transparency. 59–68. https://doi.org/10.1145/3287560.3287598
[57] Adrian B. R. Shatte, Delyse M. Hutchinson, and Samantha J. Teague. 2019. Machine learning in mental health: A scoping review of methods and applications. Psychological Medicine 49, 9 (2019), 1426–1448. https://doi.org/10.1017/S0033291719000151
[58] Gregory E. Simon and Roy H. Perlis. 2010. Personalized medicine for depression: Can we match patients with treatments? American Journal of Psychiatry (2010).
[59] Physiology & Behavior.
[60] Journal of Biomedical Informatics 41, 2 (2008), 387–392. https://doi.org/10.1016/j.jbi.2007.09.003
[61] Michael Solomon, Stephen L. Wagner, and James Goes. 2012. Effects of a Web-based intervention for adults with chronic conditions on patient activation: online randomized controlled trial. Journal of Medical Internet Research 14, 1 (2012), 1–13. https://doi.org/10.2196/jmir.1924
[62] Megan Stevenson. 2018. Assessing risk assessment in action. Minnesota Law Review (2018).
[63] Prim Care Clin Office Pract (2007), 571–592.
[64] Sana Tonekaboni, Shalmali Joshi, Melissa D. McCradden, and Anna Goldenberg. 2019. What Clinicians Want: Contextualizing Explainable Machine Learning for Clinical End Use. (2019), 1–19. arXiv:1905.05134 http://arxiv.org/abs/1905.05134
[65] Madhukar H. Trivedi and Ella J. Daly. 2008. Treatment strategies to improve and sustain remission in major depressive disorder. Dialogues in Clinical Neuroscience 10, 4 (2008), 377–384.
[66] Stijn Van de Velde, Annemie Heselmans, Nicolas Delvaux, Linn Brandt, Luis Marco-Ruiz, David Spitaels, Hanne Cloetens, Tiina Kortteisto, Pavel Roshanov, Ilkka Kunnamo, Bert Aertgeerts, Per Olav Vandvik, and Signe Flottorp. 2018. A systematic review of trials evaluating success factors of interventions with computerised clinical decision support. Implementation Science 13, 1 (2018), 1–11. https://doi.org/10.1186/s13012-018-0790-1
[67] Danding Wang, Qian Yang, Ashraf Abdul, and Brian Y. Lim. 2019. Designing Theory-Driven User-Centric Explainable AI. In CHI Conference on Human Factors in Computing Systems (CHI '19). 1–15.
[68] Jenna Wiens, Suchi Saria, Mark Sendak, Marzyeh Ghassemi, Vincent X. Liu, Finale Doshi-Velez, Kenneth Jung, Katherine Heller, David Kale, Mohammed Saeed, Pilar N. Ossorio, Sonoo Thadaney-Israni, and Anna Goldenberg. 2019. Do no harm: a roadmap for responsible machine learning for health care. Nature Medicine 25 (2019). https://doi.org/10.1038/s41591-019-0548-6
[69] Nicole J. Wolf and Derek R. Hopko. 2008. Psychosocial and pharmacological interventions for depressed adults in primary care: A critical review. Clinical Psychology Review 28, 1 (2008), 131–161. https://doi.org/10.1016/j.cpr.2007.04.004
[70] Yao Xie, Melody Chen, David Kao, Ge Gao, and Xiang 'Anthony' Chen. 2020. CheXplain: Enabling Physicians to Explore and Understand Data-Driven, AI-Enabled Medical Imaging Analysis. In CHI Conference on Human Factors in Computing Systems (CHI '20). 1–13. https://doi.org/10.1145/3313831.3376807 arXiv:2001.05149
[71] Qian Yang, Aaron Steinfeld, and John Zimmerman. 2019. Unremarkable AI: Fitting intelligent decision support into critical, clinical decision-making processes. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems. 1–11. https://doi.org/10.1145/3290605.3300468 arXiv:1904.09612
[72] Qian Yang, John Zimmerman, and Aaron Steinfeld. 2015. Review of Medical Decision Support Tools: Emerging Opportunity for Interaction Design. In IASDR 2015 Interplay. 1–16. https://doi.org/10.13140/RG.2.1.1441.3284
[73] Ming Yin, Jennifer Wortman Vaughan, and Hanna Wallach. 2019. Understanding the Effect of Accuracy on Trust in Machine Learning Models. In Proceedings of the 2019 CHI Conference on Human Factors in Computing Systems (CHI '19).