The Journal of Urology | 2019

PD58-09 EXTRACTING STRUCTURED INFORMATION FROM PATHOLOGY REPORTS USING NATURAL LANGUAGE PROCESSING AND MACHINE LEARNING

 
 
 
 
 
 
 

Abstract


INTRODUCTION AND OBJECTIVES: Detailed pathologic information for the estimated 1.74 million Americans diagnosed with cancer every year is locked away as unstructured free text, unavailable for use without manual abstraction. Our objective was to optimize natural language processing (NLP) algorithms to extract detailed pathologic data from cancer pathology reports. Building on current pipelines, we developed feature-specific optimizations for information extraction from prostate cancer pathology reports and evaluated whether high-quality information extraction was possible with minimal training data.
METHODS: We used a corpus of 3,232 free-text pathology reports from radical prostatectomy specimens at UCSF, each with detailed manual annotations for 20 data elements, such as Gleason scores, margin status, extracapsular extension, seminal vesicle invasion, tumor volume, and numbers of lymph nodes (positive and dissected). The full corpus was divided so that the training, validation, development test, and true test sets contained 65%, 15%, 10%, and 10% of the reports, respectively. We then created an NLP pipeline using NLTK and investigated the performance of multiple machine learning methods using scikit-learn and PyTorch. We applied random forests, support vector machines, boosting, logistic regression, and convolutional networks to the full training dataset as well as to randomly selected subsets of 8, 16, 32, 64, and 128 reports.
RESULTS: We calculated the F1 evaluation metric weighted by support (the number of true instances for each label) for each data field using the development test set. When working with the full training corpus (n=2066), convolutional networks perform the best (mean weighted F1 0.968 across all 12 clinical data elements). However, when less annotated training data is available, boosting typically performs best. Moreover, with only 32 labeled reports we are able to achieve a mean weighted F1 score of 0.91 across fields.
CONCLUSIONS: An NLP pipeline using both traditional statistical and machine learning-based methods can extract detailed prostate cancer pathology data from unstructured free-text pathology reports with high accuracy, even with small sets of training data.
Table. No title available.
Source of Funding: None
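
The four-way corpus split and the support-weighted F1 metric described in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function names `split_corpus` and `weighted_f1`, the fixed random seed, and the rounding of split boundaries are assumptions made for illustration, and the example uses only the Python standard library rather than the NLTK, scikit-learn, or PyTorch pipeline the abstract mentions.

```python
import random
from collections import Counter


def split_corpus(reports, fractions=(0.65, 0.15, 0.10, 0.10), seed=0):
    """Shuffle reports and cut them into consecutive parts by fraction.

    Hypothetical helper: the default fractions mirror the 65/15/10/10
    split stated in the abstract; the seed and rounding are assumptions.
    """
    items = list(reports)
    random.Random(seed).shuffle(items)
    n = len(items)
    parts, start, cumulative = [], 0, 0.0
    for i, frac in enumerate(fractions):
        cumulative += frac
        # The last part takes whatever remains so that no report is dropped.
        end = n if i == len(fractions) - 1 else round(cumulative * n)
        parts.append(items[start:end])
        start = end
    return parts


def weighted_f1(y_true, y_pred):
    """Per-label F1 averaged with weights equal to each label's support,
    i.e. the number of true instances of that label."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    support = Counter(y_true)
    total = len(y_true)
    score = 0.0
    for label, n_true in support.items():
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        n_pred = sum(1 for p in y_pred if p == label)
        precision = tp / n_pred if n_pred else 0.0
        recall = tp / n_true
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        score += f1 * n_true / total
    return score


if __name__ == "__main__":
    train, validation, dev_test, true_test = split_corpus(range(100))
    print(len(train), len(validation), len(dev_test), len(true_test))
    # Toy margin-status labels, invented purely for demonstration.
    gold = ["negative", "negative", "negative", "positive"]
    pred = ["negative", "negative", "positive", "positive"]
    print(round(weighted_f1(gold, pred), 4))
```

Weighting by support means that frequent labels dominate the score for a field, so a classifier that handles a rare label poorly can still report a high weighted F1 on that field.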

Volume 201
Pages e1031–e1032
DOI 10.1097/01.JU.0000557177.97226.63
Language English
Journal The Journal of Urology
