Multi-Label Annotation of Chest Abdomen Pelvis Computed Tomography Text Reports Using Deep Learning

Abstract

Purpose: To develop high throughput multi-label annotators for body (chest, abdomen, and pelvis) Computed Tomography (CT) reports that can be applied across a variety of abnormalities, organs, and disease states. Approach: We used a dictionary approach to develop rule-based algorithms (RBA) for extraction of disease labels from radiology text reports. We targeted three organ systems (lungs/pleura, liver/gallbladder, kidneys/ureters) with four diseases per system based on their prevalence in our dataset. To expand the algorithms beyond pre-defined keywords, attention-guided recurrent neural networks (RNN) were trained using the RBA-extracted labels to classify reports as being positive for one or more diseases or normal for each organ system. Confounding effects on model performance were evaluated using random initialization or pre-trained embedding as well as different sizes of training datasets. Performance was evaluated using the receiver operating characteristic (ROC) area under the curve (AUC) against 2,158 manually obtained labels. Results: Our models extracted disease labels from 261,229 radiology reports of 112,501 unique subjects. Pre-trained models outperformed random initialization across all diseases. As the training dataset size was reduced, performance was robust except for a few diseases with relatively small number of cases. Pre-trained classification AUCs achieved > 0.95 for all five disease outcomes across all three organ systems. Conclusions: Our label-extracting pipeline was able to encompass a variety of cases and diseases by generalizing beyond strict rules with exceptional accuracy. This method can be easily adapted to enable automated labeling of hospital-scale medical data sets for training image-based disease classifiers.

Multi-Label Annotation of Text Reports from Computed Tomography of the Chest, Abdomen, and Pelvis Using Deep Learning

Vincent M. D’Anniballe*,a, Fakrul I. Tushar*,a,b, Khrystyna Faryna a,b, Songyue Han a,c, Maciej A. Mazurowski a, Geoffrey D. Rubin a, Joseph Y. Lo a

a Carl E. Ravin Advanced Imaging Laboratories, Department of Radiology, Duke University School of Medicine, 2525 Erwin Rd. Ste. 302, Durham, USA, 27705
b Erasmus+ Joint Master In Medical Imaging and Applications, University of Girona, Girona, Spain
c School of Software Engineering, South China University of Technology, Guangzhou, Guangdong
*These authors contributed equally to this work

Keywords : weak supervision, attention RNN, rule-based algorithm, natural language processing, Computed Tomography.

1 Introduction

Machine learning algorithms have demonstrated considerable potential as disease classifiers for medical images. However, the majority of algorithms are specialized for a single organ or disease, making their utility narrow in scope. This limited scope is mainly attributed to a scarcity of training data, since curating datasets for image-based classifiers has traditionally relied on radiologist annotation of the disease or its sequelae. As an alternative to image-based labeling, automated extraction of disease labels from radiology report text has the potential to address this training data scarcity and to avoid human annotation efforts (1-4).

Rule-based algorithms (RBA) are a conventional method for mining report text that utilize simple logic based on pre-defined keywords or patterns. In a landmark study, Wang et al. (5) used an RBA to extract labels of 8 thorax diseases from 108,948 chest X-ray reports to effectively train an image-based disease classifier. Using a similar method for CT, Draelos et al. (6) demonstrated the broad applicability of RBA-obtained labels by mining the more complex reports associated with over 36,000 chest CT volumes to train a classifier for 83 chest abnormalities. However, a major limitation of RBAs is that their performance and scope are reliant on the completeness of dictionaries containing pre-defined keywords. Furthermore, the radiologist’s interpretation that accompanies a CT is usually composed in a free or semi-structured text form, rendering the extraction of disease labels using simple logical rules a nontrivial task (7).

To improve their utility, RBA-extracted labels can be used to train neural networks that deviate from strict rules by learning salient semantic features, a form of natural language processing (NLP) (8, 9). For example, Steinkamp et al. (10) trained a recurrent neural network (RNN) to classify disease in pathology reports written in unseen formats, suggesting the network had learned a generalizable encoding of the semantics. Building upon this NLP approach, Yuan et al. (11) combined a pre-trained word embedding model with a deep learning-based sentence encoder to classify pulmonary nodules in a diverse set of radiology reports from different universities. While promising, it is often difficult to determine which semantic or structural features of the reports the model perceives as most salient. To improve the interpretability of NLP-based classifiers, an attention-guided RNN (12) can be used to project the attention vector onto report text (13), allowing the user to visualize the words that a model is giving the most weight to when classifying an abnormality.

In this study, we propose a framework for automated, multi-disease label extraction of body (chest, abdomen, and pelvis) CT reports based on attention-guided RNNs trained on RBA-extracted labels. For each organ system (lungs/pleura, liver/gallbladder, kidneys/ureters), an RNN was trained to classify reports as being positive for one or more of four different diseases or normal. Although there has been extensive work in radiology report labeling, to our knowledge, there are no related works that demonstrate the utility of an RBA to train deep learning-based NLP disease classifiers in such a breadth of organ systems, diseases, and body CT reports.

This study is a considerable expansion of our previous studies (14, 15) and the main contributions are threefold: 1) To further refine our RBAs to meet the challenges of free-text narration in radiology CT reports. 2) To broaden the utility of our extracted labels by training attention-guided RNNs to perform multi-label disease classification of CT reports. 3) To determine alternative factors that influence disease classification performance, including random vs. pre-trained embedding and different sizes of training datasets.

2 Materials and Methods

In this section, we first describe the dataset that was used. Then, we outline the development processes of our RBA and the subsequent addition of an attention-guided RNN to enable multi-label classification of radiology reports. Figure 1 displays the overall workflow of this paper.

Fig. 1

Complete workflow. Radiology reports extracted from our health system were deidentified and the findings sections were isolated. The reports were analyzed by an RBA and an attention-guided RNN to classify each report for 5 different outcomes (one or more of four disease states or normal) per organ system (lungs/pleura, liver/gallbladder, kidneys/ureters). A separate RBA and RNN were used for each organ system.

A total of 261,229 structured chest, abdomen, and pelvis CT reports from 112,501 unique subjects, dated between 2012 and 2017, were extracted from the health system of our institution with IRB approval and deidentified. A representative example of a radiology CT report is shown in Figure 2, which contains protocol, indication, technique, findings, and impression sections. The distribution of CT protocols is shown in Figure 3.

Fig. 2

Representative example of a body CT radiology report within our dataset.

Fig. 3

Distribution of CT protocols in our dataset. CAP = chest, abdomen, and pelvis; C = chest; AP = abdomen-pelvis; A = abdomen; P = pelvis; CA = chest-abdomen; CP = chest-pelvis.

A separate RBA was created for the lungs/pleura, liver/gallbladder, and kidneys/ureters. Each RBA was limited to the findings section of the CT reports to minimize the influence of biasing information referenced in other sections (e.g., indication for exam, patient history, technique factors, and comparison with priors) and to ensure that the automated annotation reflected image information in the current exam. For example, the impression section could describe a diagnosis based on patient history that could not be made using solely image-based information. For the purpose of RNN training, reports were filtered by protocol name to ensure organ-relevant scans were used for each model. For example, only protocols that included the entire chest (CAP, C, CA, and CP) were used to train the lungs/pleura model.

In order to select target disease and organ keywords for the RBA dictionary, we computed term frequency–inverse document frequency (TF-IDF) (16) on the findings sections of a random batch of 3,500 radiology reports. Informed by the prevalence of organ and disease keywords, we intentionally selected the three organ systems and four abnormal findings for each system such that the labels varied widely in location, appearance, and disease manifestations. For lungs/pleura, the four findings selected were atelectasis, nodule/mass, emphysema, and effusion; for liver/gallbladder, stone, lesion, dilatation, and fatty liver; and for kidneys/ureters, stone, lesion, atrophy, and cyst. A board-certified radiologist (G.D.R.) provided guidance to group the TF-IDF terms into several categories, specifically: a) single-organ descriptors specific to each organ, e.g., pleural effusion or steatosis, b) multi-organ descriptors applicable to numerous organs, e.g., nodule or stone, c) negation terms indicating absence of disease, e.g., no or without, d) qualifier terms describing confounding conditions, e.g., however, or e) normal terms suggesting normal anatomy in the absence of other diseases and abnormalities, e.g., unremarkable. Appendix 1 displays the dictionary terms and their descriptor type for each organ system. The dictionary used in this study was substantially refined and expanded from our previous study (14) by adding more terms and introducing wild-card entries to handle misspellings or grammatical errors.

Figure 4 displays an overview of the RBA’s flowchart and logic. Although a separate RBA was created for each organ system, the workflow was the same. After the dictionary was refined, report text was converted to lowercase and each sentence was tokenized. In summary, the RBA was deployed on each sentence, and the number of potential diseases was counted first using the logic for the multi-organ descriptor and then the single-organ descriptor. If no potential disease labels were detected, then the normal descriptor logic was finally applied to verify normality. This process was repeated for each disease outcome, allowing a report to be positive for one or more diseases or normal. Note that in this study an organ system was defined as normal not only by excluding the four diseases studied but also in the absence of dozens of abnormalities and disease states that were not otherwise analyzed, as shown in Appendix 1. If the RBA failed to categorize the report definitively as positive for disease or normal (e.g., there was no mention of the organ system), then the report was labeled as uncertain and was not included in this study.

Upon manual review, we observed that many reports were incorrectly labeled normal due to excessively long sentences, which were either complex sentences with multiple clauses or sentences fused together by grammatical errors (e.g., missing periods). Such sentences were impractical to analyze with simple logic, so each sentence had to satisfy a length threshold before it could contribute to the normal outcome, another feature that made this RBA noticeably different from previous implementations.
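The sentence-level logic described above can be illustrated with a minimal Python sketch for the lungs/pleura. This is not the study's implementation: the term lists are a small hypothetical subset of Appendix 1, negation is handled at the whole-sentence level for simplicity, the function names are invented for illustration, and the length threshold of 12 tokens is an assumed value because the threshold used in the study is not stated.

```python
import re

# Hypothetical miniature dictionary (the study's full term lists are in Appendix 1).
ORGAN_TERMS = ("lung", "pulmonary", "bases", "pleural")
SINGLE_ORGAN_TERMS = {"atelecta": "atelectasis", "emphysema": "emphysema",
                      "pleural effusion": "effusion"}
MULTI_ORGAN_TERMS = {"nodule": "nodule", "mass": "nodule", "effusion": "effusion"}
NEGATION_TERMS = ("no", "not", "without", "negative")
NORMAL_TERMS = ("normal", "unremarkable", "clear")
MAX_NORMAL_SENTENCE_TOKENS = 12  # assumed value, not taken from the paper


def split_sentences(findings):
    """Lowercase the findings text and split it into sentences (naive splitter)."""
    return [s.strip() for s in re.split(r"[.!?]+", findings.lower()) if s.strip()]


def has_term(sentence, term):
    """Prefix match at a word boundary, emulating wild-card dictionary entries."""
    return re.search(r"\b" + re.escape(term), sentence) is not None


def label_report(findings):
    """Return a set of disease labels, {'normal'}, or {'uncertain'}."""
    diseases = set()
    normal_seen = False
    for sentence in split_sentences(findings):
        tokens = sentence.split()
        mentions_organ = any(has_term(sentence, t) for t in ORGAN_TERMS)
        negated = any(tok in NEGATION_TERMS for tok in tokens)
        if negated:
            continue  # simplification: a negated sentence contributes nothing
        # Multi-organ descriptors count only when the organ is also mentioned.
        if mentions_organ:
            for term, label in MULTI_ORGAN_TERMS.items():
                if has_term(sentence, term):
                    diseases.add(label)
        # Single-organ descriptors are specific enough to count on their own.
        for term, label in SINGLE_ORGAN_TERMS.items():
            if has_term(sentence, term):
                diseases.add(label)
        # Normal logic: short sentence naming the organ with a normal term.
        if (mentions_organ and len(tokens) <= MAX_NORMAL_SENTENCE_TOKENS
                and any(has_term(sentence, t) for t in NORMAL_TERMS)):
            normal_seen = True
    if diseases:
        return diseases
    return {"normal"} if normal_seen else {"uncertain"}
```

For example, `label_report("The lungs are clear.")` yields the normal outcome, while a report that never mentions the lungs or pleura is returned as uncertain. The naive sentence splitter would break on decimal measurements such as "1.8 cm", which a real implementation would need to handle.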

Fig. 4

Overview of the RBA. (Top) The findings section of each report was extracted, then the text was converted to lowercase and each sentence was tokenized. The RBA was deployed on each sentence, and the number of diseases was counted using the multi-organ descriptor first and then the single-organ descriptor logic. If no disease labels were detected, the normal descriptor logic was applied. This process was repeated for each disease allowing a report to be positive for one or more diseases or normal. (Bottom) The normal, multi-organ, and single organ descriptor logics.

From the full set of 261,229 reports, the lungs/pleura RBA classified a total of 165,659 reports from 74,944 subjects, the liver/gallbladder RBA classified 96,532 reports from 50,086 subjects, and the kidneys/ureters RBA classified 87,334 reports from 46,527 subjects. Note that the full set of cases does not correspond to the sum of reports for each organ system due to overlap of disease labels, where a single subject could have multiple findings across multiple organ systems. Figure 5 displays the disease distribution by organ system. Reports were randomly divided by subject into subsets for training (70%), validation (15%), and testing (15%) the RNN model. Since the RNN depends on labels generated by the RBA, we manually validated the quality of the RBA labels. From the above test set, a test subset of 2,158 (lungs/pleura=771, liver/gallbladder=652, kidneys/ureters=749) CT reports were randomly selected, and 2,875 labels (lungs/pleura=1,154, liver/gallbladder=787, kidneys/ureters=934) were manually obtained by a Master of Biomedical Science graduate with gross anatomy training (V.M.D.) who was supervised by a board-certified radiologist (G.D.R.). This reference set was used to compare performance of the final RBA against our previous versions.
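Dividing reports by subject, rather than by report, keeps all reports from one patient in the same subset. A minimal sketch of such a split is shown below; the function name, the input mapping from report ID to subject ID, and the fixed seed are illustrative assumptions rather than details taken from the study.

```python
import random


def split_by_subject(report_subjects, fractions=(0.70, 0.15, 0.15), seed=0):
    """Split report IDs into train/val/test so that no subject spans two subsets.

    report_subjects: dict mapping report_id -> subject_id (hypothetical input format).
    """
    subjects = sorted(set(report_subjects.values()))
    random.Random(seed).shuffle(subjects)
    n_train = round(fractions[0] * len(subjects))
    n_val = round(fractions[1] * len(subjects))
    assignment = {}
    for i, subject in enumerate(subjects):
        if i < n_train:
            assignment[subject] = "train"
        elif i < n_train + n_val:
            assignment[subject] = "val"
        else:
            assignment[subject] = "test"
    splits = {"train": [], "val": [], "test": []}
    for report_id, subject in report_subjects.items():
        splits[assignment[subject]].append(report_id)
    return splits
```

Because the fractions apply to subjects, the resulting report counts only approximate 70/15/15 when subjects have different numbers of reports.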

Fig. 5

Frequency of reports for each disease and organ system.

A separate RNN was trained for each organ system using labels extracted from its corresponding RBA. The neural networks (Fig. 1) used in this study consisted of an embedding layer, a bidirectional long short-term memory (BiLSTM) layer, an attention mechanism, a dense layer, and a final classification layer (17, 18). The BiLSTM layer was composed of 200 units and produced a sequential output. It was followed by a dropout layer (rate 0.2) to prevent overfitting. The attention mechanism began with a time-distributed dense layer that reduced each hidden state of the sequential output to a single score, giving a 3-dimensional tensor (batch size, maximum sequence length, 1). This tensor was reshaped to 2-dimensional form (batch size, maximum sequence length) and passed through a softmax activation, which assigned a weight to each hidden state to produce the attention vector. The dot product of the attention vector and the sequential output of the BiLSTM was the final output of the attention mechanism. It was then followed by the dense and classification layers. Since the disease outcomes were not mutually exclusive, we used a weighted binary cross-entropy loss and modeled the outputs as independent Bernoulli distributions, one per label, with sigmoid activation.
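To make the attention step and the loss concrete, the following pure-Python sketch computes attention weights and the pooled output for a single report, and a weighted binary cross-entropy over its labels. It is a didactic sketch, not the TensorFlow model: the scoring weights stand in for the time-distributed dense layer, and the use of a per-label positive-class weight is an assumption, since the weighting scheme is not described.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, score_weights, score_bias=0.0):
    """Score each hidden state, softmax the scores, and return the weighted sum.

    hidden_states: list of T vectors of length H (the BiLSTM sequential output).
    score_weights: vector of length H (stand-in for the one-unit dense layer).
    """
    scores = [sum(w * h for w, h in zip(score_weights, state)) + score_bias
              for state in hidden_states]
    weights = softmax(scores)  # the attention vector, one weight per token
    width = len(hidden_states[0])
    context = [sum(weights[t] * hidden_states[t][j] for t in range(len(hidden_states)))
               for j in range(width)]
    return weights, context


def weighted_bce(y_true, y_prob, pos_weights, eps=1e-7):
    """Mean binary cross-entropy over labels, up-weighting positives (assumed scheme)."""
    total = 0.0
    for y, p, w in zip(y_true, y_prob, pos_weights):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += -(w * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)
```

The returned `weights` list is what would be projected onto the report text to visualize which tokens the model emphasizes.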

In this study, we compared the multi-label classification performance of two embedding approaches: embeddings pre-trained on the PubMed+MIMIC-III (19) dataset, and a randomly initialized embedding layer without pre-training. Embeddings of 200 dimensions were used in both experiments. Afterwards, we analyzed the effect of training data size on classification performance by training with 20%, 40%, 60%, 80%, or 100% of the training dataset. To prepare the training data, a pre-processing step was applied to each report. All numbers and punctuation were removed from each findings section, and the text was then converted to lowercase and tokenized. The sequence of tokens was then zero-padded to a length of 650 tokens per sample. The models were trained for 50 epochs using a batch size of 512, and the model with the minimum validation loss was selected as final. We used the Adam optimizer with a constant learning rate of 0.0001. The models were implemented in Python using the TensorFlow framework. Training duration was approximately 30 minutes for each model using 2 TITAN RTX GPUs. All models’ weights and code will be made publicly available upon publication (https://gitlab.oit.duke.edu/railabs).
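The pre-processing step can be sketched in a few lines of Python. The vocabulary construction, the reserved ID for unknown tokens, and the choice to pad and truncate at the end of the sequence are assumptions made for illustration; the text specifies only the removal of numbers and punctuation, lowercasing, tokenization, and zero padding to 650 tokens.

```python
import re

MAX_LEN = 650  # padded sequence length stated in the text
PAD_ID = 0     # zero padding
UNK_ID = 1     # assumed ID for out-of-vocabulary tokens


def tokenize(findings):
    """Lowercase, drop digits and punctuation, and split on whitespace."""
    text = re.sub(r"[^a-z\s]", " ", findings.lower())
    return text.split()


def build_vocab(token_lists):
    """Assign integer IDs starting at 2 in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for token in tokens:
            vocab.setdefault(token, len(vocab) + 2)
    return vocab


def encode(tokens, vocab, max_len=MAX_LEN):
    """Map tokens to IDs, truncate to max_len, and zero-pad at the end."""
    ids = [vocab.get(token, UNK_ID) for token in tokens][:max_len]
    return ids + [PAD_ID] * (max_len - len(ids))
```

A report such as "Nodule measuring 1.8 x 2.1 cm, series 2." reduces to the tokens `nodule`, `measuring`, `x`, `cm`, `series` before encoding.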

3 Results

Table I displays the labeling accuracy and F-score of previously reported RBAs and the final RBAs for the binary labels (present/absent) for each disease and organ system. Performance was calculated based on the manually annotated test set of 2,158 CT reports with 2,875 labels. The performance of the final RBAs was equal to or greater than both previously reported RBAs (14, 15) for all diseases, with accuracy ranging from 91% to 99% and F-score from 0.85 to 0.98.

Table I

Comparison of classification performance between previously reported RBAs and the final RBAs using the manually annotated test set.

Organ  Label  Han et al. (14)  Faryna et al. (15)  Final RBA

Lungs/Pleura
Atelectasis 251 0.86 0.74 0.97 0.95
Nodule 296 0.77 0.74
Emphysema 193 0.82 0.45 0.98 0.96
Effusion 205 0.82 0.53 0.84 0.58
Normal 209 0.79 0.44 0.96 0.94

Liver/Gallbladder
Stone 144 0.87 0.62 0.95 0.9
Lesion 224 0.92 0.88 0.94 0.91
Dilatation 87 0.86 0.1 0.9 0.7
Fatty 166 0.97 0.94
Normal 166 0.94 0.9 0.95 0.9

Kidneys/Ureters
Stone 174 0.91 0.82
Atrophy 94 0.96 0.85
Lesion 238 0.91 0.87
Cyst 234 0.95 0.92
Normal 194 0.94 0.89
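The accuracy and F-score reported for each binary label (present/absent) can be computed from the manual reference labels and the RBA output as sketched below. The sketch assumes that "F-score" refers to the F1 score, the harmonic mean of precision and recall, which the text does not state explicitly; the function name is illustrative.

```python
def binary_metrics(y_true, y_pred):
    """Return (accuracy, F1) for one binary label; inputs are lists of 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return accuracy, f1
```

Reporting both metrics matters for low-prevalence labels, where accuracy can stay high even when many positive cases are missed.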

Table II displays the classification performance of the attention-guided RNN with and without pre-trained embedding when applied to the test set of 23,411 reports for lungs/pleura, 13,402 reports for liver/gallbladder, and 11,954 reports for kidneys/ureters. Models with pre-trained embeddings outperformed models with randomly initialized embeddings for all organ systems and diseases.

Table II

Performance of randomly initialized versus pre-trained embeddings using the test set containing 48,758 reports (23,411 reports for lungs/pleura; 13,402 reports for liver/gallbladder and 11,954 reports for kidneys/ureters).

Organ  Label

Lungs/Pleura
Atelectasis 9329 0.9968 (0.9961-0.9974) (0.9967-0.9997)
Nodule 10183 0.9913 (0.9904-0.9922) (0.9928-0.9943)
Emphysema 3659 0.9972 (0.9963-0.9982) (0.9972-0.9987)
Effusion 5625 0.9975 (0.9970-0.9980) (0.9980-0.9989)
Normal 3110 (0.9985-0.9995) (0.9982-0.9997)

Liver/Gallbladder
Stone 1981 0.7849 (0.7739-0.7059) (0.9721-0.9801)
Lesion 6463 0.9675 (0.9646-0.9700) (0.9936-0.9955)
Dilatation 1497 0.8120 (0.8013-0.8228) (0.9906-0.9945)
Fatty 1795 0.9984 (0.9851-0.9917) (0.9986-0.9996)
Normal 3162 0.9745 (0.9716-0.9773) (0.9950-0.9974)

Kidneys/Ureters
Stone 2548 0.9562 (0.9514-0.9609) (0.9764-0.9819)
Atrophy 750 0.9523 (0.9436-0.9611) (0.9936-0.9973)
Lesion 4817 0.9757 (0.9731-0.9783) (0.9886-0.9915)
Cyst 4164 0.9862 (0.9843-0.9881) (0.9914-0.9939)
Normal 2048 0.9909 (0.9890-0.9928) (0.9980-0.9992)
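The AUC values in Table II summarize how well each model ranks positive reports above negative ones for a given label. The sketch below computes the AUC from its rank-based (Mann-Whitney) definition and adds a percentile bootstrap interval. The bootstrap is an assumption made for illustration, since the method used to obtain the intervals in Table II is not described; the function names and defaults are also illustrative.

```python
import random


def roc_auc(y_true, scores):
    """AUC as the probability that a positive outscores a negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def bootstrap_auc_ci(y_true, scores, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the AUC (assumed method, not from the paper)."""
    rng = random.Random(seed)
    n = len(y_true)
    aucs = []
    while len(aucs) < n_boot:
        idx = [rng.randrange(n) for _ in range(n)]
        ys = [y_true[i] for i in idx]
        if 0 < sum(ys) < n:  # resample must contain both classes
            aucs.append(roc_auc(ys, [scores[i] for i in idx]))
    aucs.sort()
    lower = aucs[int((alpha / 2) * n_boot)]
    upper = aucs[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper
```

The pairwise loop is quadratic in the number of reports, so a sort-based rank computation would be preferable for test sets of the size used in this study.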

Figure 6 displays examples of the output vectors produced by the attention mechanism for each organ system. Figure 7 displays the classification performance of the attention-guided RNN with pre-trained embedding when different portions of training data were used. Figure 7(a) displays the number of reports used in the training dataset after randomly splitting in 20% increments for lungs/pleura, liver/gallbladder, kidneys/ureters. Figure 7(b) displays the classification performance after training with each increment. AUCs reached > 0.95 for all classes in each organ system when using the complete dataset in the pre-trained models. Although the performance tended to improve as more training samples were used, most labels showed a robust plateau such that performances were still within the confidence intervals for 100% of the data. The most notable drops in performance were classes with smaller sample size (e.g., stone and dilatation for liver/gallbladder and atrophy for kidneys/ureters).

Fig. 6

Examples of attention vectors projected on the findings section of radiology reports. (Top panel) a report positive for nodule in the lungs/pleura. (Middle panel) a normal report for liver/gallbladder. (Bottom panel) a report positive for stone in the kidneys/ureters. As part of standard pre-processing, all numbers and punctuation were removed and text was converted to lowercase.

Fig. 7

Comparison of performance using different sizes of training data in the pre-trained models for the lungs/pleura, liver/gallbladder, and kidneys/ureters. (a) Number of reports randomly split into 20%, 40%, 60%, 80%, and 100% of the total training dataset for each disease by organ system. (b) Performance on the test set of models trained with randomly split 20%, 40%, 60%, 80%, and 100% of the training data for each disease by organ system.

4 Discussion

Although deep learning-based disease classification algorithms have recently achieved exceptional accuracy, they often suffer from limited diversity of diseases and organ systems. This narrow scope is largely due to inadequate amounts of curated CT data where human-annotation efforts are required. As an alternative, the work described here sought to develop high-throughput, multi-disease label extractors for body CT reports that were broad in scope and could be easily adapted to new keywords and diseases. As the foundation of our NLP algorithm, we developed RBAs that utilized simple rules to extract precise labels from radiology reports with 91-99% accuracy for all four diseases or normal across all three organ systems. However, the RBAs alone could not provide labels for our entire dataset because radiology reports often contain variability in writing, grammar, and even variation in descriptors for the same disease between radiologists (20). To overcome this obstacle, we demonstrated that an attention-guided RNN can be trained using RBA-annotated reports to learn salient semantic features and generalize beyond simple rules or keywords to encompass more reports. Our final disease classification pipeline performed with an AUC of > 0.95 for all diseases and organ systems.

Recent works investigating deep learning-based radiology report annotation have achieved similar performances, although the majority are limited to a specific disease or organ system (21). Examples include classification of pulmonary emboli in thoracic CT reports with AUC from 0.93-0.99 (22), annotation of mammography reports with a keyword accuracy of 0.96 (23), and identification of femur fractures with an F1 score of 0.97 (24). The utility of automated labeling has been demonstrated by efficiently annotating large radiology report datasets to develop image-based CT classifiers, even without specific knowledge regarding disease location (6, 25).
Further inspired by the recent wide application of deep learning-based methods in different clinical NLP tasks (10, 26-32) and effective application of word embedding (33-35), we also experimented with a multi-label disease classifier using pre-trained and randomly initialized embedding layers. As expected, the attention-guided RNNs with pre-trained embedding outperformed the randomly initialized models in all classes across all organ systems. Additionally, we observed that performance improved steadily with increasing number of cases. Lower-frequency classes were affected more than higher-frequency classes, exemplified by atrophic kidneys, where performance dropped markedly at around 500 training cases (20% of total available cases).

The body CT dataset used in this study was dominated by two types of exams: chest and abdomen-pelvis CTs. In many reports, one or more of the three organ systems were out of view and not mentioned at all by the radiologist. For example, if a chest CT did not mention the kidneys, the report would be labeled as uncertain for that organ system by our RBA. However, in specific studies such as abdomen-pelvis CT, large organs such as the lungs were often still described even if they were not completely visible, e.g., “Limited view of the lung bases appear clear.” This short sentence would satisfy the logic of the RBA to label the report appropriately as normal for the lungs.

There are several limitations to this study. As a general limitation of RBA techniques, it was not possible to provide disease labels for all reports within our dataset. This was often because the sentences in a report did not satisfy the pre-defined, strict rules. To mitigate this effect, future work should expand the dictionary through discovery of new and potentially uncommon language uses.
Another limitation is that, unlike when radiologists annotate images manually, the labels derived from reports tend to describe all or much of an organ system (e.g., “bibasilar atelectasis”) and only in some cases provide limited disease extent and location (e.g., “nodule measuring 1.8 x 2.1 cm on series 2 image 60”). Furthermore, our dataset suffered from notable class imbalance, including a low prevalence of normal cases as well as multi-fold differences between diseases, although this represented the natural prevalence within our study population. The dataset also came from a single health system, which comprises multiple hospitals but may share similarities in the reporting patterns of its radiologists. Finally, this initial demonstration focused on building three separate classifiers rather than a single multi-organ model. Independent processing of diseases could have simplified the challenges imposed by multiple organ interaction, and in future work we will consider the feasibility of a single-model, multi-organ approach.

Overall, the disease labeling pipeline described here offered numerous advantages. By using deep learning-based NLP, our algorithms were able to generalize beyond pre-defined rules and label a vast and heterogeneous dataset as positive for one or more diseases or normal for three different organ systems. To the best of our knowledge, this was a first attempt at using RBA-extracted labels to train an attention-guided RNN to annotate a diverse set of diseases in a hospital-scale dataset of body CT reports. Ultimately, the work described here sought to facilitate future research in image-based disease classification algorithms by providing a general framework for labeling vast amounts of hospital-scale data in a manner that is both cost and time efficient.

5 Disclosures

The authors declare that there is no conflict of interest.

6 Acknowledgements/Funding Sources

The work was supported by seed funding from Duke Cancer Institute as part of NIH/NCI P30–CA014236, MAIA Erasmus + University of Girona, and a GPU equipment grant from Nvidia Corp.

Appendix 1. Dictionary terms used in this study.

Organ anatomy
Lungs/Pleura: lung, pulmonary, lower|upper|middle lobe, centrilobular, perifissural, left|right base, bases, basilar, bronch, trachea, airspace, airway
Liver/Gallbladder: liver, hepatic, hepato, gallbladder, thegallbladder, gall bladder, biliary, bile, left|right|caudate|quadrate lobe
Kidneys/Ureters: kidney, renal, nephr, ureter, cort, medul, caliectasis, UVJ

Single-organ disease descriptors
Lungs/Pleura: pneumothorax, emphysema, pneumoni, ground glass, aspiration, bronchiectasis, atelecta, embol, air trapping, pleural effusion, pneumonectomy
Liver/Gallbladder: steatosis, cirrho, cholecystectomy, gallstone, cholelithiasis
Kidneys/Ureters: hydronephrosis, hydroureter, nephrectomy, pelvicaliectasis, uropathy, ureterectasis, nephrolithiasis

Multi-organ disease descriptors
mass, opaci, calcul, stone, scar, metas, malignan, cancer, tumor, neoplasm, lithiasis, atroph, recurren, hyperenhanc, hypoenhanc, aneurysm, lesion, nodule, nodular, calcifi, opacit, effusion, resect, thromb, infect, infarct, inflam, fluid, consolidate, degenerative, dissect, collaps, fissure, edema, cyst, focus, angioma, spiculated, architectural distortion, lytic, pathologic, defect, hernia, biops, encasement, fibroid, hemorrhage, multilocul, distension, stricture, obstructi, hypodens, hyperdens, hypoattenuat, hyperattenuat, necrosis, irregular, ectasia, destructi, dilat, granuloma, enlarged, abscess, stent, fatty infiltr, stenosis, delay, carcinoma, adenoma, atrophy, hemangioma, density, surgically absent

Negation
no, non, other, not, none, without, rather, negative, with regards to, however is no, are no, no evidence, noevidence, limited exam for the evaluation

Qualifiers
acute, new, size, contour, attenuation, caliber, however, morphological

Normal
normal, unremarkable, negative exam, patent, clear, no abnormalit, without abnormalit

7 References

1. E. Pons et al., "Natural language processing in radiology: a systematic review," Radiology (2), 329-343 (2016).
2. K. J. Dreyer et al., "Application of recently developed computer algorithm for automatic classification of unstructured radiology reports: validation study," Radiology (2), 323-329 (2005).
3. I. Solti et al., "Automated classification of radiology reports for acute lung injury: comparison of keyword and machine learning based natural language processing approaches,"
J. Am. Med. Inform. Assoc. (5), 913-916 (2012).
5. X. Wang et al., "ChestX-Ray8: Hospital-Scale Chest X-Ray Database and Benchmarks on Weakly-Supervised Classification and Localization of Common Thorax Diseases," in , pp. 3462-3471 (2017).
6. R. L. Draelos et al., "Machine-learning-based multiple abnormality prediction with large-scale chest computed tomography volumes," Med. Image Anal. 101857 (2021).
7. D. Ganeshan et al., "Structured reporting in radiology," Acad. Radiol. (1), 66-73 (2018).
8. L. Deng and Y. Liu, Deep learning in natural language processing, Springer (2018).
9. D. Bahdanau, K. Cho, and Y. Bengio, "Neural machine translation by jointly learning to align and translate," arXiv preprint arXiv:1409.0473 (2014).
10. J. M. Steinkamp et al., "Automated Organ-Level Classification of Free-Text Pathology Reports to Support a Radiology Follow-up Tracking Engine," Radiology: Artificial Intelligence (5), e180052 (2019).
11. J. Yuan, H. Zhu, and A. Tahmasebi, "Classification of Pulmonary Nodular Findings based on Characterization of Change using Radiology Reports," AMIA Jt Summits Transl Sci Proc 285-294 (2019).
12. C. Raffel and D. P. Ellis, "Feed-forward networks with attention can solve some long-term memory problems," arXiv preprint arXiv:1512.08756 (2015).
13. Y. Wang et al., "Attention-based LSTM for aspect-level sentiment classification," Proceedings of the 2016 conference on empirical methods in natural language processing
Medical Imaging 2019: Computer-Aided Diagnosis (2019).
15. K. Faryna et al., "Attention-guided classification of abnormalities in semi-structured computed tomography reports," in Medical Imaging 2020: Computer-Aided Diagnosis (2020).
16. H. C. Wu et al., "Interpreting TF-IDF term weights as making relevance decisions," ACM Trans. Inf. Syst. (3), Article 13 (2008).
17. S. Hochreiter and J. Schmidhuber, "Long Short-Term Memory," Neural Comput. (8), 1735–1780 (1997).
18. Q. Zhou and H. Wu, "NLP at IEST 2018: BiLSTM-attention and LSTM-attention via soft voting in emotion classification," Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis
Scientific Data (1), 52 (2019).
20. A. Brady et al., "Discrepancy and error in radiology: concepts, causes and consequences," Ulster Med J (1), 3-9 (2012).
21. V. Sorin et al., "Deep learning for natural language processing in radiology—fundamentals and a systematic review," Journal of the American College of Radiology (5), 639-648 (2020).
22. I. Banerjee et al., "Comparative effectiveness of convolutional neural network (CNN) and recurrent neural network (RNN) architectures for radiology text report classification," Artif. Intell. Med. 79-88 (2019).
23. R. G. Short et al., "Comprehensive word-level classification of screening mammography reports using a neural network sequence labeling approach," J. Digit. Imaging (5), 685-692 (2019).
24. Y. Wang et al., "A clinical text classification paradigm using weak supervision and deep representation," BMC Med. Inform. Decis. Mak. (1), 1-13 (2019).
25. F. I. Tushar et al., "Weakly Supervised Multi-Organ Multi-Disease Classification of Body CT Scans," arXiv preprint arXiv:2008.01158 (2020).
26. W. H. Weng et al., "Medical subdomain classification of clinical notes using a machine learning-based natural language processing approach," BMC Med Inform Decis Mak (1), 155 (2017).
27. A. Spandorfer et al., "Deep learning to convert unstructured CT pulmonary angiography reports into structured reports," Eur Radiol Exp (1), 37 (2019).
28. H.-C. Shin et al., "Learning to Read Chest X-Rays: Recurrent Neural Cascade Model for Automated Image Annotation," in , pp. 2497-2506 (2016).
29. J. Laserson et al., "TextRay: Mining Clinical Reports to Gain a Broad Understanding of Chest X-Rays," in Medical Image Computing and Computer Assisted Intervention – MICCAI 2018, pp. 553-561 (2018).
30. C. Kim et al., "Natural language processing and machine learning algorithm to identify brain MRI reports with acute ischemic stroke," PLoS One (2), e0212778 (2019).
31. E. Choi et al., "Using recurrent neural network models for early detection of heart failure onset," J Am Med Inform Assoc (2), 361-370 (2017).
32. M. J. Willemink et al., "Preparing Medical Imaging Data for Machine Learning," Radiology (1), 4-15 (2020).
33. Y. Zhang et al., "Drug-drug interaction extraction via hierarchical RNNs on sequence and shortest dependency paths," Bioinformatics (5), 828-835 (2018).
34. D. Tang et al., "Learning sentiment-specific word embedding for twitter sentiment classification," Proceedings of the 52nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)
J Biomed Inform 12-20 (2018).