Nature Medicine | 2021

DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence.


Abstract


To the Editor—Recent years have seen exponential growth in the number of artificial intelligence (AI) algorithms published in the medical literature, yet clinical impact in terms of patient outcomes remains to be demonstrated. One likely explanation for this so-called ‘AI chasm’1 is an overemphasis on the technical aspects of the proposed algorithms, with insufficient attention given to the factors that affect their interaction with human users. As clinicians occupy, and are likely to keep occupying, the central role in patient care, it is essential to focus the development and evaluation of AI-based clinical algorithms on their potential to augment rather than replace human intelligence. However, AI-based decision-support systems pose unique challenges to the traditional medical decision-making process, such as their frequent lack of explainability (the so-called ‘black box’ problem) or their tendency to sometimes produce unexpected results. Bridging the gap between algorithm development and bedside application while keeping humans at the center of the design and evaluation process is therefore a complicated task, and current guidance is incomplete.

We make the case for a robust early and small-scale clinical evaluation stage, between in silico algorithm development/validation (covered by the upcoming TRIPOD-AI statement2 and STARD-AI statement3) and large-scale clinical trials evaluating AI interventions (covered by the CONSORT-AI statement4). This step can be compared to a phase 1/2 trial for drug development or (a much closer analogy, given the relationship between users’ characteristics and the intervention’s effectiveness) IDEAL stage 2a/2b for surgical innovation5–7. Four key arguments support the need for this intermediary development stage and its adequate reporting.

Human decision-making processes are complex and subject to many biases. It cannot be expected, even in the case of directive models, that human users will exactly follow all of an algorithm’s recommendations, especially if these users remain accountable for their decisions8. To accurately evaluate an algorithm’s performance and avoid the research waste of conducting expensive large-scale trials with decision-support systems whose interaction with human users is inadequate, it is essential to assess the actual impact of an algorithm on its users’ decisions at an early stage. Additionally, consideration should be given to the difference between the development population and the target patient population, to ensure the algorithm’s relevance in the implementation settings. Therefore, the assisted human performance and algorithm usability (not merely the algorithm’s stand-alone outputs) need to be evaluated in the target clinical environment and reported as outcomes.

Because it cannot be assumed that users’ decisions will mirror the algorithm’s recommendations, it is also crucially important to test the safety profile of new algorithms not only in silico but also when they are used to influence human decisions. Skipping this step and moving directly to large-scale trials would expose a considerable number of patients to an unknown risk of harm, which is ethically unacceptable. Suboptimal safety standards led to disastrous consequences in the early days of pharmacological trials; there is no need to repeat these mistakes with clinical AI.

The evaluation of human factors (ergonomics) should happen as early as possible and requires iterative evaluation–design cycles. Technical requirements often evolve as a system starts being used, and users’ expectations of a system also change during the initial exposure period. For example, users might wish for an additional key variable to make sense of the algorithm’s recommendations, which in turn would require developers to access an entirely different section of the electronic patient record. From an economic viewpoint, the sooner the evaluation of human factors occurs, the more cost-effective it is likely to be. Finally, iterative design modification is difficult and inappropriate during large-scale trials: it would create a serious risk of invalidating the summative evaluation’s conclusions, as the intervention being tested is likely to have changed during the trial. Early formative evaluation and rapid prototyping are therefore essential before large-scale trials.

Large-scale clinical trials are complex and expensive endeavors that require careful preparation. A well-thought-out design is essential for the production of valid and meaningful conclusions and requires background information about the intervention under evaluation. Not all such background information can be inferred from in silico evaluation; some data have to be collected in small-scale prospective studies. For example, the most appropriate outcomes for the trial, the expected effect size, the optimal inclusion and exclusion criteria for the user population, the evolution of the users’ trust in the algorithm, and the most appropriate timing of decision support are crucial pieces of information that should be known to the investigators when trial protocols are drafted, and these could be derived from early formative evaluation. Other important considerations, such as how best to use the output of the algorithm or how this output is to be communicated to patients, could also be investigated at this stage.

We believe that clear and transparent reporting on these aspects will not only avoid preventable harm and research waste but also play a key role in transforming AI from a promising technology into an evidence-based component of modern medicine. This is why we have started a Delphi process9,10 to reach expert consensus on the key information items that should be reported during the ‘Developmental and Exploratory Clinical Investigation of DEcision-support systems driven by Artificial Intelligence’ (DECIDE-AI). The creation of the DECIDE-AI guidelines will be an open and transparent process, and we

DOI 10.1038/s41591-021-01229-5
Language English
Journal Nature Medicine
