Archive | 2021

Natural language inference for clinical registry curation

Abstract


Clinical registries (structured databases of demographic, diagnosis, and treatment information for patients with specific diseases or phenotypes) play vital roles in high-quality retrospective studies, operational planning for health systems, and assessment of patient eligibility for research, including clinical trials. However, registry building has historically relied on manual curation, a time- and resource-intensive process that is vulnerable to human error. Here we convert registry curation into a natural language inference (NLI) problem, applying five state-of-the-art, pretrained, deep-learning-based NLI models to clinical, laboratory, and pathology notes to infer information about 43 breast oncology registry fields. We evaluate the models' inferences against a manually curated, 7439-patient breast oncology research database. The NLI models show considerable variation in performance, both within and across registry fields. One model, ALBERT, outperforms the others (BART, RoBERTa, XLNet, and ELECTRA) on 22 of the 43 fields. A detailed error analysis reveals that incorrect inferences arise primarily from the models' misinterpretations of temporality (they interpret historical findings as current and vice versa), as well as from confusion over the subtle terminology and abbreviation variants common in clinical notes. Nevertheless, modern NLI methods show promise for increasing the efficiency of registry curation, even when used out of the box with no additional training. To our knowledge, this is the first time NLI has been applied to a clinical problem that is not part of a conference shared task or other computer science benchmark.
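The abstract does not say how registry fields are turned into NLI inputs, so the following is only a minimal sketch of one plausible framing, not the authors' method. It assumes each candidate field value is written as a hypothesis sentence, the note serves as the premise, and the value with the highest entailment score is chosen. The names `infer_field`, `build_hypotheses`, and `toy_scorer`, the hypothesis template, and the example note are all hypothetical. `toy_scorer` is a word-overlap stand-in so the example runs without any model; in practice it would be replaced by a pretrained NLI model's entailment probability.

```python
import re
from typing import Callable, Dict, List, Tuple

# Hypothetical interface: returns an entailment score in [0, 1]
# for a (premise, hypothesis) pair.
EntailmentScorer = Callable[[str, str], float]


def build_hypotheses(template: str, values: List[str]) -> Dict[str, str]:
    """Map each candidate field value to a natural-language hypothesis."""
    return {value: template.format(value=value) for value in values}


def infer_field(note: str, template: str, values: List[str],
                scorer: EntailmentScorer) -> Tuple[str, float]:
    """Return the candidate value whose hypothesis scores highest against the note."""
    hypotheses = build_hypotheses(template, values)
    scores = {value: scorer(note, hyp) for value, hyp in hypotheses.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]


def toy_scorer(premise: str, hypothesis: str) -> float:
    """Stand-in for a real NLI model (assumption): the fraction of
    hypothesis words that also appear in the premise."""
    premise_words = set(re.findall(r"[a-z]+", premise.lower()))
    hypothesis_words = re.findall(r"[a-z]+", hypothesis.lower())
    if not hypothesis_words:
        return 0.0
    hits = sum(word in premise_words for word in hypothesis_words)
    return hits / len(hypothesis_words)


# Invented example note and field; not taken from the study data.
note = "Pathology: invasive ductal carcinoma, estrogen receptor positive."
value, score = infer_field(
    note,
    template="The tumor is estrogen receptor {value}.",
    values=["positive", "negative"],
    scorer=toy_scorer,
)
```

With this invented note, the sketch selects "positive". A word-overlap scorer has no notion of negation or timing, so it would fail on exactly the kinds of cases the error analysis highlights, such as historical findings read as current; it is here only to show the premise and hypothesis plumbing.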

DOI 10.1101/2021.06.14.21258493
Language English