A Systematic Review of Natural Language Processing Applied to Radiology Reports
Arlene Casey, Emma Davidson, Michael Poon, Hang Dong, Daniel Duma, Andreas Grivas, Claire Grover, Víctor Suárez-Paniagua, Richard Tobin, William Whiteley, Honghan Wu, Beatrice Alex
Affiliations: School of Literatures, Languages and Cultures (LLC), University of Edinburgh; Centre for Clinical Brain Sciences, University of Edinburgh; Centre for Medical Informatics, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh; Health Data Research UK; Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh; Nuffield Department of Population Health, University of Oxford; Institute of Health Informatics, University College London; Edinburgh Futures Institute, University of Edinburgh

* Corresponding author: arlene.casey AT ed.ac.uk
Abstract

Background
Natural language processing (NLP) has a significant role in advancing healthcare and has been found to be key in extracting structured information from radiology reports. Understanding recent developments in NLP application to radiology is of significance but recent reviews on this are limited. This study systematically assesses and quantifies recent literature in NLP applied to radiology reports.
Methods
We conduct an automated literature search yielding 4,799 results, using automated filtering, metadata enriching steps and citation search combined with manual review. Our analysis is based on 21 variables including radiology characteristics, NLP methodology, performance, study, and clinical application characteristics.
Results
We present a comprehensive analysis of the 164 publications retrieved, with publications in 2019 almost triple those in 2015. Each publication is categorised into one of six clinical application categories. Deep learning use increases over the period, but conventional machine learning approaches are still prevalent. Deep learning remains challenged when data is scarce, and there is little evidence of adoption into clinical practice. Despite 17% of studies reporting F1 scores greater than 0.85, it is hard to comparatively evaluate these approaches given that most of them use different datasets. Only 14 studies made their data available and 15 their code, with 10 externally validating results.
Conclusions
Automated understanding of the clinical narratives of radiology reports has the potential to enhance the healthcare process, and we show that research in this field continues to grow. Reproducibility and explainability of models are important if the domain is to move applications into clinical use. More could be done to share code, enabling validation of methods on different institutional data, and to reduce heterogeneity in the reporting of study properties, allowing inter-study comparisons. Our results have significance for researchers in the field, providing a systematic synthesis of existing work to build on, identifying gaps and opportunities for collaboration, and avoiding duplication.

Background
Medical imaging examinations interpreted by radiologists in the form of narrative reports are used to support and confirm diagnosis in clinical practice. Being able to accurately and quickly identify the information stored in radiologists' narratives has the potential to reduce workloads, support clinicians in their decision processes, triage patients to get urgent care or identify patients for research purposes. However, whilst these reports are generally considered more restricted in vocabulary than other electronic health records (EHR), e.g. clinical notes, it is still difficult to access this information efficiently at scale [Bates et al., 2016]. This is due to the unstructured nature of these reports, and Natural Language Processing (NLP) is key to obtaining structured information from radiology reports [Pons et al., 2016].

NLP applied to radiology reports is shown to be a growing field in earlier reviews [Pons et al., 2016, Cai et al., 2016]. In recent years there has been an even more extensive growth in NLP research in general, and in deep learning methods in particular, which is not seen in the earlier reviews. A more recent review of NLP applied to radiology-related research exists, but it focuses on one NLP technique only, deep learning models [Sorin et al., 2020]. Our paper provides a more comprehensive review, comparing and contrasting all NLP methodologies as they are applied to radiology.

It is of significance to understand and synthesise recent developments specific to NLP in the radiology research field, as this will assist researchers to gain a broader understanding of the field and provide insight into methods and techniques, supporting and promoting new developments. Therefore, we carry out a systematic review of research output on NLP applications in radiology from 2015 onward, allowing for a more up-to-date analysis of the area. An additional listing of our synthesis of publications, detailing their clinical and technical categories along with anatomical scan regions, can be made available on request. Also, different to the existing work, we look at both the clinical application areas NLP is being applied in and the trends in NLP methods. We describe and discuss study properties, e.g. data size, performance and annotation details, quantifying these in relation to both the clinical application areas and NLP methods. Having a more detailed understanding of these properties allows us to make recommendations for future NLP research applied to radiology datasets, supporting improvements and progress in this domain.

Related Work
Amongst pre-existing reviews in this area, [Pons et al., 2016] was the first that was both specific to NLP on radiology reports and systematic in methodology. Their literature search identified 67 studies published in the period up to October 2014. They examined the NLP methods used, summarised their performance and extracted the studies' clinical applications, which they assigned to five broad categories delineating their purpose. Since Pons et al.'s paper, several reviews have emerged with the broader remit of NLP applied to electronic health data, which includes radiology reports. [Kreimeyer et al., 2017] conducted a systematic review of NLP systems with a specific focus on coding free text into clinical terminologies and structured data capture. The systematic review by [Spasic and Nenadic, 2020] specifically examined machine learning approaches to NLP (2015-2019) in more general clinical text data, and a further methodical review was carried out by [Wu et al., 2020] to synthesise literature on deep learning in clinical NLP (up to April 2019), although they did not follow the PRISMA guidelines completely. With radiology reports as their particular focus, [Cai et al., 2016] published, the same year as Pons et al.'s review, an instructive narrative review outlining the fundamentals of NLP techniques applied in radiology. More recently, [Sorin et al., 2020] published a systematic review focused on deep learning radiology-related research. They identified 10 relevant papers in their search (up to September 2019) and examined their deep learning models, comparing these with traditional NLP models, and also considered their clinical applications but did not employ a specific categorisation. We build on this corpus of related work, most specifically Pons et al.'s work. In our initial synthesis of clinical applications we adopt their application categories and further expand upon these to reflect the nature of subsequent literature captured in our work. Additionally, we quantify and compare properties of the studies reviewed and provide a series of recommendations for future NLP research applied to radiology datasets in order to promote improvements and progress in this domain.
Methods
Our methodology followed the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) [Moher et al., 2015], and the protocol is registered on protocols.io.
Eligibility for Literature Inclusion and Search Strategy
We included studies using NLP on radiology reports of any imaging modality and anatomical region for NLP technical development, clinical support, or epidemiological research. Exclusion criteria included: (1) case reports; (2) published before 2015; (3) in a language other than English; (4) processing of radiology images; (5) reviews, conference abstracts, comments, patents, or editorials; (6) not reporting outcomes of interest; (7) not radiology reports; (8) not using NLP methods; (9) not available in full text; (10) duplicates.

We used Publish or Perish [Harzing A. W., 2007], a citation retrieval and analysis software program, to search Google Scholar. Google Scholar has a similar coverage to other databases [Gehanno et al., 2013] and is easier to integrate into search pipelines. We conducted an initial pilot search following the process described here, but the search terms were too specific and restricted the number of publications. However, we did include papers found in the pilot search in full-text screening. We used the following search query, restricted to research articles published in English between January 2015 and October 2019: ("radiology" OR "radiologist") AND ("natural language" OR "text mining" OR "information extraction" OR "document classification" OR "word2vec") NOT patent. We automated the addition of publication metadata and applied filtering to remove irrelevant publications. These automated steps are described in Table 1 and Table 2.

Table 1: Metadata enriching steps undertaken for each publication
1. Match the paper with its DOI via the Crossref API
2. If DOI matched, check Semantic Scholar for metadata/abstract
3. If no DOI match and no abstract, search PubMed for abstract
4. Search arXiv (for a pre-print)
5. If no PDF link, search Unpaywall for available open access versions
6. If PDF but no separate abstract via Semantic Scholar/PubMed, extract abstract from the PDF
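For illustration, the following is a minimal sketch of how steps 1 and 2 of the enrichment pipeline could be implemented against the public Crossref and Semantic Scholar REST APIs. The error handling and field selection are simplified and are not the review's actual implementation.

```python
# Sketch of metadata enrichment steps 1-2 (Table 1): title -> DOI via
# Crossref, then DOI -> abstract via the Semantic Scholar Graph API.
import requests

def find_doi(title):
    """Step 1: match a paper title to a DOI via the Crossref API."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 1},
        timeout=30,
    )
    items = resp.json().get("message", {}).get("items", [])
    return items[0].get("DOI") if items else None

def fetch_abstract(doi):
    """Step 2: look up metadata/abstract on Semantic Scholar by DOI."""
    resp = requests.get(
        f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}",
        params={"fields": "title,abstract,year"},
        timeout=30,
    )
    return resp.json().get("abstract") if resp.status_code == 200 else None

doi = find_doi("Natural Language Processing Technologies in Radiology")
if doi:
    print(doi, fetch_abstract(doi))
```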
Table 2: Automated filtering steps to remove irrelevant publications
1. Document language is English
2. Word 'patent' in title or URL
3. Year of publication out of range

In addition to the query search, another method to find papers is to conduct a citation search [Briscoe et al., 2020]. The citation search compiled a list of publications that cite the Pons et al. review and the articles cited in Pons' review. To do this, we use a snowballing method [Wohlin, 2014] to follow the forward citation branch for each publication in this list, i.e. finding every article that cites the publications in our list. The branching factor here is large, so we filter at every stage and automatically add metadata.
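The forward-citation snowballing amounts to a filtered breadth-first walk of the citation graph. The sketch below illustrates this under stated assumptions: the get_citing_papers helper is hypothetical (in practice it could wrap a citation API), and keep stands for the automated filters of Table 2.

```python
# Sketch of forward-citation snowballing: a breadth-first walk of the
# citation graph from seed papers, filtering at every stage to keep the
# large branching factor manageable. get_citing_papers is a hypothetical
# callable returning dicts with an "id" key for papers citing a given id.
from collections import deque

def snowball(seed_ids, get_citing_papers, keep, max_depth=2):
    seen = set(seed_ids)
    queue = deque((paper_id, 0) for paper_id in seed_ids)
    results = []
    while queue:
        paper_id, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for citing in get_citing_papers(paper_id):
            if citing["id"] in seen:
                continue
            seen.add(citing["id"])
            if keep(citing):  # automated filters (language, year, 'patent', ...)
                results.append(citing)
                queue.append((citing["id"], depth + 1))
    return results
```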
Manual Review of Literature
Four reviewers (three NLP researchers [AG, DD and HD] and one epidemiologist [MTCP]) independently screened all titles and abstracts with the Rayyan online platform and discussed disagreements. Fleiss' kappa [Fleiss, 1971] agreement between reviewers was 0.70, indicating substantial agreement [Landis and Koch, 1977]. After this screening process, each full-text article was reviewed by a team of eight (six NLP researchers and two epidemiologists) and double reviewed by an NLP researcher. We resolved any discrepancies by discussion in regular meetings.
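For reference, Fleiss' kappa for multiple raters can be computed as in the following minimal sketch, assuming statsmodels; the screening decisions shown are illustrative toy data, not the review's actual labels.

```python
# Fleiss' kappa on toy screening decisions (0 = exclude, 1 = include):
# rows are abstracts screened, columns are the four reviewers' decisions.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

decisions = np.array([
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
])
table, _ = aggregate_raters(decisions)  # per-subject category counts
print(f"Fleiss' kappa = {fleiss_kappa(table):.2f}")
```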
Data Extraction for Analysis
We extracted data on: primary clinical application and technical objective, data source(s), study period, radiology report language, anatomical region, imaging modality, disease area, dataset size, annotated set size, training/validation/test set size, external validation performed, domain expert used, number of annotators, inter-annotator agreement, NLP technique(s) used, best-reported results (recall, precision and F1 score), availability of dataset, and availability of code.
Results
The literature search yielded 4,799 possibly relevant publications, from which our automated exclusion process removed 4,402; during both our screening processes, a further 233 were removed, leaving 164 publications. See Figure 1 for details of exclusions at each step.
General Characteristics
Clinical Application Categories
Figure 1: PRISMA diagram for search publication retrieval

Table 3: Scan modality
Scan Modality           No. Studies
Multiple Modalities     38
MRI                     16
CT                      36
X-Ray                   18
Mammogram               5
Ultrasound              4
Not specified           47
TOTAL                   164

Table 4: Image sampling method
Sampling Method         No. Studies
Consecutive Images      33
Non-Consecutive Images  38
Not specified           93
TOTAL                   164

Table 5: Anatomical region scanned
Anatomical Region       No. Studies
Mixed                   45
Thorax                  31
Head/Neck               25
Abdomen                 15
Breast                  15
Extremities             8
Spine                   5
Other                   1
Unspecified             19
TOTAL                   164

Table 6: Disease category
Disease Category             No. Studies
Not specific disease related 40
Oncology                     39
Various                      20
Musculoskeletal              10
Cerebrovascular              13
Other                        13
Respiratory                  10
Trauma                       7
Cardiovascular               6
Gastrointestinal             3
Hepatobiliary                2
Genitourinary                1
TOTAL                        164

Table 7: Radiology report language
Report Language         No. Studies
English                 142
Chinese                 5
Spanish                 4
German                  3
Italian                 2
French                  2
Hebrew                  1
Polish                  1
Brazilian Portuguese    1
Unspecified             3
TOTAL                   164

In synthesis of the literature, each publication was classified by its primary clinical purpose. Pons' work in 2016 categorised publications into five broad categories: Diagnostic Surveillance, Cohort Building for Epidemiological Studies, Query-based Case Retrieval, Quality Assessment of Radiological Practice and Clinical Support Services. We found some changes in this categorisation schema and our categorisation consisted of six categories:
Diagnostic Surveillance, Disease Information and Classification, Quality Compliance, Cohort/Epidemiology, Language Discovery and Knowledge Structure, and Technical NLP. The main difference is that we found no evidence for a category of Clinical Support Services, which described applications that had been integrated into the workflow to assist. Despite the increase in the number of publications, very few were in clinical use, with more focus on the category of Disease Information and Classification. We describe each clinical application area in more detail below and, where applicable, how our categories differ from the earlier findings. A listing of all publications and their corresponding clinical application category can be made available on request. Table 8 shows the clinical application category by the technical classification and Figure 2 shows the breakdown of clinical application category by publication year. There were more publications in 2019 compared with 2015 for all categories except Language Discovery & Knowledge Structure, which fell by ≈25% (Figure 2).

Table 8: Clinical Application Category by Technical Objective
Application Category                  Information  Report/Sentence  Lexicon/Ontology  Clustering
                                      Extraction   Classification   Discovery
                                      (n=81)       (n=73)           (n=9)             (n=1)
Disease Information & Classification  14           31               -                 -
Diagnostic Surveillance               28           17               -                 -
Quality Compliance                    7            14               -                 -
Cohort-Epid.                          6            10               -                 -
Language Discovery & Knowledge        13           4                9                 1
Technical NLP                         6            4                -                 -

Figure 2: Clinical application of publication by year

Diagnostic Surveillance
A large proportion of studies in this category focused on extracting disease information for patient or disease surveillance, e.g. investigating tumour characteristics [Peng et al., 2019, Bozkurt et al., 2019]; changes over time [Hassanpour et al., 2017] and worsening/progression or improvement/response to treatment [Kehl et al., 2019, Chen et al., 2018]; identifying correct anatomical labels [Cotik et al., 2018]; and organ measurements and temporality [Sevenster et al., 2015a]. Studies also investigated pairing measurements between reports [Sevenster et al., 2015b] and linking reports to monitor changes by providing an integrated view of consecutive examinations [Oberkampf et al., 2016]. Studies focused specifically on breast imaging findings investigated aspects such as BI-RADS MRI descriptors (shape, size, margin) and final assessment categories (benign, malignant etc.), e.g. [Liu et al., 2019, Gupta et al., 2018, Castro et al., 2017, Short et al., 2019, Lacson et al., 2017, 2015]. Studies focused on tumour information, e.g. for liver [Yim et al., 2016b] and hepatocellular carcinoma (HCC) [Yim et al., 2017, 2016a], and one study extracted information for structuring subdural haematoma characteristics in reports [Pruitt et al., 2019].

Studies in this category also investigated incidental findings, including on lung imaging [Farjah et al., 2016, Karunakaran et al., 2017, Tan et al., 2018], with [Farjah et al., 2016] additionally extracting the nodule size; for trauma patients [Trivedi et al., 2019]; and looking for silent brain infarction and white matter disease [Fu et al., 2019]. Other studies focused on prioritising/triaging reports, detecting follow-up recommendations and linking a follow-up exam to the initial recommendation report, or bio-surveillance of infectious conditions, such as invasive mould disease.
Disease Information and Classification
Disease Information and Classification publications use reports to identify information that may be aggregated according to classification systems. These publications focused solely on classifying a disease occurrence or extracting information about a disease, with no focus on the overall clinical application. This category was not found in Pons' work. Methods considered a range of conditions including intracranial haemorrhage [Jnawali et al., 2019, Banerjee et al., 2017], aneurysms [Kłos et al., 2018], brain metastases [Deshmukh et al., 2019] and ischaemic stroke [Kim et al., 2019, Garg et al., 2019], and several classified on types and severity of conditions, e.g. [Deshmukh et al., 2019, Shin et al., 2017, Wheater et al., 2019, Gorinski et al., 2019, Alex et al., 2019]. Studies focused on breast imaging considered aspects such as predicting lesion malignancy from BI-RADS descriptors [Bozkurt et al., 2016], breast cancer subtypes [Patel et al., 2017], and extracting or inferring BI-RADS categories, such as [Banerjee et al., 2019a, Miao et al., 2018]. Two studies focused on abdominal images and hepatocellular carcinoma (HCC) staging and CLIP scoring. Chest imaging reports were used to detect pulmonary embolism, e.g. [Dunne et al., 2015, Banerjee et al., 2019b, Chen et al., 2017], bacterial pneumonia [Meystre et al., 2017], and Lung-RADS categories [Beyer et al., 2017]. Functional imaging was also included, such as echocardiograms, extracting measurements to evaluate heart failure, including left ventricular ejection fractions (LVEF). Other studies investigated classification of fractures and abnormalities and the prediction of ICD codes from imaging reports.
Language Discovery and Knowledge Structure
Language Discovery and Knowledge Structure publications investigate the structure of language in reports and how this might be optimised to facilitate decision support and communication. Pons et al. reported on applications of Query-based Case Retrieval, which has similarities to Language Discovery and Knowledge Structure but is not the same. Their category contains studies that retrieve cases and conditions that are not predefined and in some instances could be used for research purposes or are motivated by educational purposes. Our category is broader and encompasses papers that investigated different aspects of language, including variability, complexity, simplification and normalisation, to support extraction and classification tasks.

Studies focus on exploring lexicon coverage and methods to support language simplification for patients, looking at sources such as the consumer health vocabulary [Qenam et al., 2017] and the French lexical network (JDM) [Lafourcade and Ramadier, 2017]. Other works studied the variability and complexity of report language, comparing free-text and structured reports and radiologists. Also investigated was how ontologies and lexicons could be combined with other NLP methods to represent knowledge that can support clinicians. This work included improving report reading efficiency [Hong and Zhang, 2015]; finding similar reports [Comelli et al., 2015]; normalising phrases to support classification and extraction tasks, such as entity recognition in Spanish reports [Cotik et al., 2015]; imputing semantic classes for labelling [Johnson et al., 2015]; supporting search [Mujjiga et al., 2019]; and discovering semantic relations [Lafourcade and Ramadier, 2016].
Quality and Compliance
Quality and Compliance publications use reports to assess the quality and safety of practice and reporting, similar to Pons' category. Works considered how patient indications for scans adhered to guidance, e.g. [Shelmerdine et al., 2019, Mabotuwana et al., 2018b, Dalal et al., 2020, Bobbin et al., 2017, Kwan et al., 2019, Mabotuwana et al., 2018a], protocol selection [Brown and Marotta, 2017, Trivedi et al., 2018, Zhang et al., 2018, Brown and Marotta, 2018, Yan et al., 2016], and the impact of guideline changes on practice, such as [Kang et al., 2019]. Also investigated was diagnostic utilisation and yield, based on clinicians or on patients, which can be useful for hospital planning and for clinicians to study their work patterns, e.g. [Brown and Kachura, 2019]. Other studies in this category looked at specific aspects of quality, such as classification of long bone fractures to support quality improvement in paediatric medicine [Grundmeier et al., 2016], automatic identification of reports that have critical findings for auditing purposes [Heilbrun et al., 2019], deriving a query-based quality measure to compare structured and free-text report variability [Maros et al., 2018], and [Minn et al., 2015], who describe a method to fix errors in gender or laterality in a report.
Cohort and Epidemiology
This category is similar to Pons' earlier review, but we treated the studies in this category slightly differently, attempting to differentiate which papers described methods for creating cohorts for research purposes and which also reported the outcomes of an epidemiological analysis. Ten studies use NLP to create specific cohorts for research purposes and six reported the performance of their tools. Out of these papers, the majority (n=8) created cohorts for specific medical conditions including fatty liver disease [Goldshtein et al., 2020, Redman et al., 2017], hepatocellular cancer [Sada et al., 2016], ureteric stones [Li and Elliot, 2019], vertebral fracture [Tan and Heagerty, 2019], traumatic brain injury [Yadav et al., 2016, Mahan et al., 2019], and leptomeningeal disease secondary to metastatic breast cancer [Brizzi et al., 2019]. Five papers identified cohorts focused on particular radiology findings including ground glass opacities (GGO) [Van Haren et al., 2019], cerebral microbleeds (CMB) [Noorbakhsh-Sabet et al., 2018], pulmonary nodules [Gould et al., 2015, Huhdanpaa et al., 2018], changes in the spine correlated to back pain [Bates et al., 2016] and identifying radiological evidence of people having suffered a fall. One paper focused on identifying abnormalities of specific anatomical regions of the ear within an audiology imaging database [Masino et al., 2016], and another paper aimed to create a cohort of people with any rare disease (within existing ontologies: the Orphanet Rare Disease Ontology, ORDO, and the Radiology Gamuts Ontology, RGO). Lastly, one paper took a different approach of screening reports to create a cohort of people with contraindications for MRI, seeking to prevent iatrogenic events. Amongst the epidemiology studies there were various analytical aims, but they primarily focused on estimating the prevalence or incidence of conditions or imaging findings and looking for associations of these conditions/findings with specific population demographics, associated factors or comorbidities. The focus of one study differed in that it applied NLP to healthcare evaluation, investigating the association of palliative care consultations and measures of high-quality end-of-life (EOL) care [Brizzi et al., 2019].
Technical NLP
This category is for publications that have a primary technical aim that is not focused on a radiology report outcome, e.g. detecting negation in reports, spelling correction [Zech et al., 2019], fact checking [Zhang et al., 2019, Steinkamp et al., 2019], methods for sample selection, and crowd-sourced annotation [Cocos et al., 2017]. This category did not occur in Pons' earlier review.
NLP Methods in Use
NLP methods capture the different techniques an author applied, broken down into rules, machine learning methods, deep learning, ontologies, lexicons and word embeddings. We discriminate machine learning from deep learning, using the former to represent traditional machine learning methods.

Over half of the studies only applied one type of NLP method, and just over a quarter of the studies compared or combined methods in hybrid approaches. The remaining studies either used a bespoke proprietary system or focused on building ontologies or similarity measures (Figure 3). Rule-based method use remains almost constant across the period, whereas use of machine learning decreases and use of deep learning methods rises, from five publications in 2017 to twenty-four publications in 2019 (Figure 4).

Figure 3: NLP method breakdown

Table 9: Breakdown of NLP Method
ML (n=74)             No. Studies    Deep Learning (n=36)    No. Studies
SVM                   34             RNN variants            14
Logistic Regression   23             CNN                     10
Random Forest         18             Other                   5
Naïve Bayes           17             Compare CNN, RNN        4
Maximum Entropy       7              Combine CNN+RNN         3
Decision Trees        4

A variety of machine classifier algorithms were used, with SVM and Logistic Regression being the most common (Table 9). Recurrent Neural Network (RNN) variants were the most common type of deep learning architecture. RNN methods were split between long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM), bidirectional gated recurrent unit (Bi-GRU), and standard RNN approaches. Four of these studies additionally added a Conditional Random Field (CRF) for the final label generation step. Convolutional Neural Networks (CNN) were the second most common architecture explored. Eight studies additionally used an attention mechanism as part of their deep learning architecture. Other neural approaches included feed-forward neural networks, fully connected neural networks, the proprietary neural system IBM Watson [Trivedi et al., 2018] and Snorkel [Ratner et al., 2018]. Several studies proposed combined architectures, such as [Zhu et al., 2019, Short et al., 2019].

Figure 4: NLP method by year
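To make concrete the kind of SVM and logistic regression classifiers tallied in Table 9, the following is a minimal sketch assuming scikit-learn; the report texts and labels are illustrative placeholders, not data from any reviewed study.

```python
# A bag-of-words baseline of the kind common in the reviewed studies:
# TF-IDF features feeding an SVM and a logistic regression classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reports = [
    "No acute intracranial haemorrhage identified.",
    "Large right MCA territory infarct with mass effect.",
    "Normal study. No focal abnormality.",
    "Acute subdural haematoma along the left convexity.",
]
labels = [0, 1, 0, 1]  # 1 = abnormal finding present

for clf in (LinearSVC(), LogisticRegression(max_iter=1000)):
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(reports, labels)
    print(type(clf).__name__,
          model.predict(["Subtle infarct in the left MCA territory."]))
```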
NLP Method Features
Most rule-based and machine classifying approaches used features based on bag-of-words, part-of-speech, term frequency, and phrases, with only two studies alternatively using word embeddings. Three studies used feature engineering with deep learning rather than word embeddings. Thirty-three studies used domain knowledge to support building features for their methods, such as developing lexicons or selecting terms and phrases. Comparison of embedding methods is difficult as many studies did not describe their embedding method. Of those that did, Word2Vec [Mikolov et al., 2013] was the most popular (n=19), followed by GloVe embeddings [Pennington et al., 2014] (n=6), FastText [Mikolov et al., 2018] (n=3), ELMo [Peters et al., 2018] (n=1) and BERT [Devlin et al., 2018] (n=1). Ontologies or lexicon look-ups were used in 100 studies; however, even though publications increased over the period in real terms, 20% fewer studies employed ontologies or lexicons in 2019 compared to 2015. The most widely used resources were UMLS [National Library of Medicine, 2021b] (n=15), RadLex [RSNA, 2021] (n=20) and SNOMED CT [National Library of Medicine, 2021a] (n=14). Most studies used these as features for normalising words and phrases for classification, but this was mainly those using rule-based or machine learning classifiers, with only six studies using ontologies as input to their deep learning architecture. Three of those investigated how existing ontologies can be combined with word embeddings to create domain-specific mappings, with authors pointing to this avoiding the need for large amounts of annotated data. Other approaches looked to extend existing medical resources using a frequent-phrases approach, e.g. [Bulu et al., 2018]. Works also used the derived concepts and relations, visualising these to support activities such as report reading and report querying (e.g. [Hassanpour and Langlotz, 2016, Zhao et al., 2018]).
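For illustration, training domain-specific Word2Vec embeddings on a report corpus looks roughly like the following sketch, assuming gensim; the two tokenised sentences stand in for a real corpus of reports.

```python
# Minimal Word2Vec training on (toy) tokenised radiology sentences.
from gensim.models import Word2Vec

sentences = [
    ["no", "acute", "intracranial", "haemorrhage"],
    ["acute", "subdural", "haematoma", "noted"],
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv.most_similar("acute", topn=2))
```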
Annotation and Inter-Annotator Agreement
Eighty-nine studies used at least two annotators, 75 did not specify any annotation details, and only one study used a single annotator. Whilst 69 studies used a domain expert for annotation (a clinician or radiologist), only 56 studies reported the inter-annotator agreement. Some studies mention annotation but do not report on agreement or annotators. Inter-annotator agreement values for kappa range from 0.43 to perfect agreement at 1. Whilst most studies reported agreement by Cohen's kappa [Cohen, 1960], some reported precision or percent agreement. Studies reported annotation data sizes differently, e.g. at the sentence or patient level. Studies also considered ground-truth labels from coding schemes such as ICD or BI-RADS categories as annotated data. Of studies which detailed human annotation at the radiology report level, only 45 specified inter-annotator agreement and/or the number of annotators. Annotated report numbers for these studies vary, with 15 papers having annotated fewer than 500 reports, 12 between 500 and 1,000, 15 between 1,000 and 3,000, and 3 between 4,000 and 8,288.
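The pairwise Cohen's kappa most studies reported can be computed as in this short sketch, assuming scikit-learn and illustrative labels from two annotators:

```python
# Pairwise Cohen's kappa between two annotators on toy binary labels.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1, 0, 0]
annotator_b = [1, 0, 1, 0, 0, 1, 1, 0]
print(f"Cohen's kappa = {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```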
Data Sources and Availability
Only 14 studies reported that their data is available, and 15 studies reported that their code is available. Most studies sourced their data from medical institutions, a number of studies did not specify where their data was from, and some studies used publicly available datasets: MIMIC-III (n=5), MIMIC-II (n=1), MIMIC-CXR (n=1), Radcore (n=5) or STRIDE (n=2). Four studies used combined electronic health records, such as clinical notes or pathology reports.

Reporting on data size and splits differed across studies, with some not giving exact data sizes and others reporting numbers of sentences, patients, or mixed data sources rather than radiology reports. Studies reporting data sizes at the radiology report level account for n=135, or 82.32% of the studies (Table 10). The biggest variation of data size by NLP method is in studies that apply other methods or are rule-based. Machine learning also varies in size; however, the median value is lower compared to rule-based methods.

Table 10: NLP method by data size properties (minimum data size, maximum data size and median value), for studies reporting in numbers of radiology reports
NLP Method               Min Size    Max Size      Median
Compare Methods          513         2,167,445     2,845
Hybrid Methods           40          34,926        918
Deep Learning (Only)     120         1,567,581     5,000
Machine Learning (Only)  101         2,977,739     2,531
Rules (Only)             31          10,000,000    8,000
Other                    25          12,377,743    10,000

Table 11: Grouped data size and number of studies in each group, only for studies reporting in numbers of radiology reports
Data Size Group    No. Studies (%)
< 200              9 (6.7)
200 to < 500       6 (4.4)
500 to < ...       ...

NLP Performance and Evaluation Measures
Performance metrics applied for evaluation of methods vary widely, with authors using precision (positive predictive value, PPV), recall (sensitivity), specificity, the area under the curve (AUC) or accuracy. We observed a wide variety in the evaluation methodology employed concerning test or validation datasets. Different approaches were taken in generating splits for testing and validation, including k-fold cross-validation. Ninety-nine studies reported on training and test data splits, of which only 59 studies included a validation set. Only 10 studies validated their algorithm using an external dataset from another institution, another modality, or a different patient population. The most widely used metrics for reporting performance were precision (PPV) and recall (sensitivity), reported in 47% of studies. However, even though many studies compared methods and reported on the top-performing method, very few studies carried out significance testing on these comparisons. Issues of heterogeneity make it difficult and unrealistic to compare performance between methods applied; hence, we use summary measures as a broad overview (Figure 5). Performance reported varies, but both the mean and median values for the F1 score appear higher for methods using rule-based only or deep learning only methods. Whilst differences are less discernible between F1 scores for application areas, Diagnostic Surveillance looks on average lower than other categories.

Figure 5: Application Category and NLP Method, Mean and Median Summaries. Mean value is indicated by a vertical bar, the box shows error bars and the asterisk is the median value.
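For reference, the standard definitions relating the headline metrics discussed above, with TP, FP and FN denoting true positives, false positives and false negatives:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```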
Discussion and Future Directions
Our work shows there has been a considerable increase in the number of publications using NLP on radiology reports over the recent time period. Compared to the 67 publications retrieved in the earlier review of [Pons et al., 2016], we retrieved 164 publications. In this section we discuss and offer some insight into the observations and trends in how NLP is being applied to radiology, and make some recommendations that may benefit the field going forward.
Clinical Applications and NLP Methods in Radiology
The clinical applications of the publications are similar to the earlier review of Pons et al., but whilst we observe an increase in research output, we also highlight that there appears to be even less focus on clinical application compared to their review. Like many other fields applying NLP, the use of deep learning has increased, with RNN architectures being the most popular. This is also observed in a review of NLP in clinical text [Wu et al., 2020]. However, although deep learning use increases, rules and traditional machine classifiers are still prevalent and often used as baselines to compare deep learning architectures against. One reason for traditional methods remaining popular is their interpretability compared to deep learning models. Understanding the features that drive a model prediction can support decision-making in the clinical domain, but the complex layers of non-linear data transformations deep learning is composed of do not easily support transparency [Shickel et al., 2018]. This may also help explain why, in our synthesis of the literature, we observed less focus on discussing clinical application and more emphasis on the disease classification or information task only. Advances in the interpretability of deep learning models are critical to their adoption in clinical practice.

Other challenges exist for deep learning, such as only having access to small or imbalanced datasets. Chen et al. [Chen et al., 2019] review deep learning methods within healthcare and point to these challenges resulting in poor performance, while the same datasets can perform well with traditional machine learning methods. We found several studies highlight this: when data is scarce or datasets imbalanced, they introduced hybrid approaches of rules and deep learning to improve performance, particularly in the
Diagnostic Surveillance category. Yang et al. [Yang et al., 2018] observed rules performing better for some entity types, such as time and size, which are proportionally lower than some of the other entities in their train and test sets; hence they combine a bidirectional LSTM and CRF with rules for entity recognition. Peng et al. [Peng et al., 2019] comment that combining rules and the neural architecture complements each other, with deep learning being more balanced between precision and recall, but the rule-based method having higher precision and lower recall. The authors reason that this provides better performance as rules can capture rare disease cases, particularly when multi-class labelling is needed, whilst deep learning architectures perform worse in instances with fewer data points (a schematic sketch of this hybrid pattern is given at the end of this subsection).

In addition to its need for large-scale data, deep learning can be computationally costly. The use of pre-trained models and embeddings may alleviate some of this burden. Pre-trained models often only require fine-tuning, which can reduce computation cost. Language comprehension pre-learned from other tasks can then be inherited from the parent models, meaning fewer domain-specific labelled examples may be needed [Wood et al., 2020]. This use of pre-trained information also supports generalisability, e.g. [Banerjee et al., 2019b] show that their model trained on one dataset can generalise to other institutional datasets.

Embedding use has increased, which is expected with the application of deep learning approaches, but many rule-based and machine classifiers continue to use traditional count-based features, e.g. bag-of-words and n-grams. Recent evidence [Ong et al., 2020] suggests that the trend of continuing to use feature engineering with traditional machine learning methods does produce better performance on radiology reports than using domain-specific word embeddings. Banerjee et al. [Banerjee et al., 2017] found that there was not much difference between a uni-gram approach and a Word2vec embedding, hypothesising this was due to their narrow domain, intracranial haemorrhage. However, the NLP research field has seen a move towards bidirectional encoder representations from transformers (BERT) based embedding models, which is not reflected in our analysis, with only one study using BERT-generated embeddings [Deshmukh et al., 2019]. Embeddings from BERT are thought to be superior as they can deliver better contextual representations and result in improved task performance. Whilst more publications since our review period have used BERT-based embeddings with radiology reports, e.g. [Wood et al., 2020, Smit et al., 2020a], not all outperform traditional methods [Grivas et al., 2020]. Recent evidence shows that embeddings generated by BERT fail to show a generalisable understanding of negation [Ettinger, 2020], an essential factor in interpreting radiology reports effectively. Specialised BERT models have been introduced, such as ClinicalBERT [Alsentzer et al., 2019] or BlueBERT [Smit et al., 2020a]. BlueBERT has been shown to outperform ClinicalBERT when considering chest radiology [Smit et al., 2020b], but more exploration of the performance gains versus the benefits of generalisability is needed for radiology text.

All NLP models have in common that they need large amounts of labelled data for model training [Yasaka and Abe, 2018].
Several studies [Percha et al., 2018, Tahmasebi et al., 2019, Banerjee et al., 2018] explored combining word embeddings and ontologies to create domain-specific mappings, and they suggest this can avoid a need for large amounts of annotated data. Additionally, [Percha et al., 2018, Tahmasebi et al., 2019] highlight that such combinations could boost coverage and performance compared to more conventional techniques for concept normalisation.

The number of publications using medical lexical knowledge resources is still relatively low, even though a recent trend in the general NLP field is to enhance deep learning with external knowledge [Young et al., 2018]. This was also observed by [Wu et al., 2020], where only 18% of the deep learning studies in their review utilised knowledge resources. Although pre-training supports learning previously known facts, it could introduce unwanted bias, hindering performance. The inclusion of domain expertise through resources such as medical lexical knowledge may help reduce this unwanted bias [Wu et al., 2020]. Exploration of how this domain expertise can be incorporated into deep learning architectures could in future improve performance when less labelled data is available.
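Returning to the hybrid rule/model pattern flagged earlier in this subsection, the following schematic sketch shows one common arrangement: high-precision rules fire first, capturing rare but lexically distinctive cases, and a learned classifier handles everything else. The patterns and the classifier stub are illustrative placeholders, not any reviewed study's actual system.

```python
# Schematic rule-first hybrid: rules are precise but low-recall, so a
# trained model acts as the fallback for reports no rule matches.
import re

RARE_CASE_RULES = {
    "subdural_haematoma": re.compile(r"\bsubdural\s+haematoma\b", re.I),
    "invasive_mould": re.compile(r"\binvasive\s+mould\b", re.I),
}

def classify(report, model_predict):
    for label, pattern in RARE_CASE_RULES.items():
        if pattern.search(report):
            return label          # rule hit: take the high-precision answer
    return model_predict(report)  # otherwise defer to the trained model

# usage with a trivial stand-in model
print(classify("Acute subdural haematoma on the left.", lambda r: "other"))
```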
Task Knowledge
Knowledge about the disease area of interest, and how aspects of this disease are linguistically expressed, is useful and could promote better-performing solutions. Whilst [Donnelly et al., 2019] find high variability between radiologists, with metric values (e.g. number of syntactic or clinical terms based on ontology mapping) being significantly greater in free-text than structured reports, [Xie et al., 2019], who look specifically at anatomical areas, find less evidence for variability. Zech et al. [Zech et al., 2018] suggest that the highly specialised nature of each imaging modality creates different sub-languages, and the ability to discover these labels (i.e. disease mentions) reflects the consistency with which labels are referred to. For example, edema is referred to very consistently whereas other labels are not, such as infarction/ischaemic. Understanding the language and the context of entity mentions could help promote novel ideas on how to solve problems more effectively. For example, [Yim et al., 2017] discuss how the accuracy of predicting malignancy is affected by cues being outside their window of consideration, and [Yim et al., 2018] observe problems of co-reference resolution within a report due to long-range dependencies. Both these studies use traditional NLP approaches, but we observed novel neural architectures being proposed to improve performance in similar tasks, specifically capturing long-range context and dependency learning, e.g. [Zhu et al., 2019, Short et al., 2019]. This understanding requires close cooperation of healthcare professionals and data scientists, which is different to some other fields where more disconnection is present [Chen et al., 2019].
Study Heterogeneity, a Need for Reporting Standards
Most studies reviewed could be described as proof-of-concept and were not trialled in a clinical setting. Pons et al. [Pons et al., 2016] hypothesised that a lack of clinical application may stem from uncertainty around minimal performance requirements hampering implementations, evidence-based practice requiring justification and transparency of decisions, and the inability to compare to human performance, as human agreement is often unknown. These hypotheses are still valid, and we see little evidence that these problems are solved.

Human annotation is generally considered the gold standard for measuring human performance, and whilst many studies reported that they used annotated data, overall, reporting was inconsistent. Steps were undertaken to measure inter-annotator agreement (IAA), but in many studies this was not directly comparable to the evaluation undertaken of the NLP methods. The size of the data being used to draw experimental conclusions from is important, and accurate reporting of these measures is essential to ensure reproducibility and comparison in further studies. Reporting on the training, test and validation splits varied, with some studies not giving details and not using held-out validation sets.

Most studies use retrospective data from single institutions, but this can lead to a model over-fitting and, thus, not generalising well when applied in a new setting. Overcoming the problem of data availability is challenging due to privacy and ethics concerns, but essential to ensure that the performance of models can be investigated across institutions, modalities, and methods. Availability of data would allow agreed benchmarks to be developed within the field, against which algorithm improvements can be measured. External validation of applied methods was extremely low, although this is likely due to the availability of external datasets. Making code available would enable researchers to report how external systems perform on their data. However, only 15 studies reported that their code is available. To be able to compare systems, there is a need for common datasets to be available to benchmark and compare systems against.

Whilst reported figures for precision and recall generally look high, more evidence is needed for accurate comparison to human performance. A wide variety of performance measures were used, with some studies only reporting one measure, e.g. accuracy or F1 scores, with these likely representing the best performance obtained. Individual studies are often not directly comparable on such measures, but nonetheless clarity and consistency in reporting are desirable. Many studies making model comparisons did not carry out any significance testing for these comparisons.

We make the following recommendations to help move the field forward, enable more inter-study comparisons, and increase study reproducibility:

1. Clarity in reporting study properties is required: (a) Data characteristics, including the size and type of dataset, should be detailed, e.g. the number of reports, sentences or patients and, if patients, how many reports per patient. The training, test and validation data split should be evident, as should the source of the data. (b) Annotation characteristics, including the methodology used to develop the annotation, should be reported, e.g. annotation set size and annotator details (how many, their expertise). (c) Performance metrics should include a range of metrics (precision, recall, F1, accuracy), not just one overall value.
2. Significance testing should be carried out when a comparison between methods is made (a minimal sketch is given after this list).

3. Data and code availability are encouraged. While making data available will often be challenging due to privacy concerns, researchers should make code available to enable inter-study comparisons and external validation of methods.

4. Common datasets should be used to benchmark and compare systems.
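As a minimal sketch of the kind of significance test recommendation 2 calls for, the following paired bootstrap compares two systems' F1 on the same test set; the predictions and labels are synthetic placeholders.

```python
# Paired bootstrap test: resample test items with replacement and check how
# often system B matches or beats system A on F1.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
pred_a = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)  # stronger system
pred_b = np.where(rng.random(200) < 0.7, y_true, 1 - y_true)  # weaker system

observed = f1_score(y_true, pred_a) - f1_score(y_true, pred_b)
deltas = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))  # resampled test set
    deltas.append(f1_score(y_true[idx], pred_a[idx]) -
                  f1_score(y_true[idx], pred_b[idx]))
p = float(np.mean(np.array(deltas) <= 0))  # one-sided p-value
print(f"delta F1 = {observed:.3f}, bootstrap p = {p:.3f}")
```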
Limitations of Study
Publication search is subject to bias in search methods, and it is likely that our search strategy inevitably missed some publications. Whilst we tried to be precise and objective during our review process, some of the data collected, and the assignment of publications to categories, was difficult to agree on and subjective. For example, many of the publications could have belonged to more than one category. One reason for this was how diverse in structure the content was, which was in some ways reflected by the different domains papers were published in. It is also possible that certain keywords were missed in recording data elements due to the reviewers' own biases and research experience.
Conclusions
This paper presents a systematic review of publications using NLP on radiology reports during the period 2015 to October 2019. We show there has been substantial growth in the field, particularly amongst researchers using deep learning methods. Whilst deep learning use has increased, as seen in NLP research in general, it faces challenges of lower performance when data is scarce or when labelled data is unavailable, and it is not widely used in clinical practice, perhaps due to the difficulties in the interpretability of such models. Traditional machine learning and rule-based methods are, therefore, still widely in use. The use of domain expertise, such as medical lexical knowledge, should be explored further to enhance performance when data is scarce. The clinical domain faces challenges due to privacy and ethics in sharing data, but overcoming this would enable the development of benchmarks to measure algorithm performance and test model robustness across institutions. Commonly agreed datasets to compare the performance of tools against would help support the community in inter-study comparisons and validation of systems. The work we present here has the potential to inform researchers about applications of NLP to radiology and to lead to more reliable and responsible research in the domain.
Acknowledgements
Not applicable
Funding
This research was supported by the Alan Turing Institute, MRC, HDR UK and the Chief Scientist Office. B.A., A.C., D.D., A.G. and C.G. have been supported by the Alan Turing Institute via Turing Fellowships (B.A., C.G.) and Turing project funding (EPSRC grant EP/N510129/1). A.G. was also funded by an MRC Mental Health Data Pathfinder Award (MRC-MCPC17209). H.W. is an MRC/Rutherford Fellow, HDR UK (MR/S004149/1). H.D. is supported by the HDR UK National Phenomics Resource Project. V.S-P. is supported by the HDR UK National Text Analytics Implementation Project. W.W. is supported by a Scottish Senior Clinical Fellowship (CAF/17/01).
Abbreviations
NLP - natural language processing
e.g. - for example
ICD - International Classification of Diseases
BI-RADS - Breast Imaging-Reporting and Data System
IAA - inter-annotator agreement
No. - number
UMLS - Unified Medical Language System
ELMo - Embeddings from Language Models
BERT - bidirectional encoder representations from transformers
SVM - support vector machine
CNN - convolutional neural network
LSTM - long short-term memory
Bi-LSTM - bi-directional long short-term memory
Bi-GRU - bi-directional gated recurrent unit
CRF - conditional random field
GloVe - Global Vectors for Word Representation
Bibliography
Beatrice Alex, Claire Grover, Richard Tobin, Cathie Sudlow, Grant Mair, and WilliamWhiteley. Text mining brain imaging reports.
Journal of Biomedical Semantics , 10(1):23, November 2019. ISSN 2041-1480. doi: 10.1186/s13326-019-0211-7. URL https://doi.org/10.1186/s13326-019-0211-7 .Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Nau-mann, and Matthew McDermott. Publicly Available Clinical BERT Embeddings. In
Proceedings of the 2nd Clinical Natural Language Processing Workshop , pages 72–78,Minneapolis, Minnesota, USA, June 2019. Association for Computational Linguistics.doi: 10.18653/v1/W19-1909. URL .Imon Banerjee, Sriraman Madhavan, Roger Eric Goldman, and Daniel L. Rubin. Intel-ligent Word Embeddings of Free-Text Radiology Reports.
AMIA Annual SymposiumProceedings , pages 411–420, 2017. ISSN 1942-597X. URL .Imon Banerjee, Matthew C. Chen, Matthew P. Lungren, and Daniel L. Rubin. Radiologyreport annotation using intelligent word embeddings: Applied to multi-institutionalchest CT cohort.
Journal of Biomedical Informatics , 77:11–20, January 2018. ISSN1532-0464. doi: 10.1016/j.jbi.2017.11.012. URL .Imon Banerjee, Selen Bozkurt, Emel Alkim, Hersh Sagreiya, Allison W. Kurian, andDaniel L. Rubin. Automatic inference of BI-RADS final assessment categoriesfrom narrative mammography report findings.
Journal of Biomedical Informatics ,92:103137, April 2019a. ISSN 1532-0464. doi: 10.1016/j.jbi.2019.103137. URL .Imon Banerjee, Yuan Ling, Matthew C. Chen, Sadid A. Hasan, Curtis P. Langlotz,Nathaniel Moradzadeh, Brian Chapman, Timothy Amrhein, David Mong, Daniel L.Rubin, Oladimeji Farri, and Matthew P. Lungren. Comparative effectiveness ofconvolutional neural network (CNN) and recurrent neural network (RNN) archi-tectures for radiology text report classification.
Artificial Intelligence in Medicine ,97:79–88, June 2019b. ISSN 0933-3657. doi: 10.1016/j.artmed.2018.11.004. URL .Jonathan Bates, Samah J. Fodeh, Cynthia A. Brandt, and Julie A. Womack. Classifi-cation of radiology reports for falls in an HIV study cohort.
Journal of the Ameri-can Medical Informatics Association , 23(e1):e113–e117, April 2016. ISSN 1067-5027.doi: 10.1093/jamia/ocv155. URL https://academic.oup.com/jamia/article/23/e1/e113/2379897 . 22ebastian E. Beyer, Brady J. McKee, Shawn M. Regis, Andrea B. McKee, SebastianFlacke, Gilan El Saadawi, and Christoph Wald. Automatic Lung-RADS ™ classificationwith a natural language processing system. Journal of Thoracic Disease , 9(9):3114–3122, September 2017. ISSN 2072-1439. doi: 10.21037/jtd.2017.08.13. URL .Mark D. Bobbin, Ivan K. Ip, V. Anik Sahni, Atul B. Shinagare, and Ramin Kho-rasani. Focal Cystic Pancreatic Lesion Follow-up Recommendations After Publica-tion of ACR White Paper on Managing Incidental Findings.
Journal of the Amer-ican College of Radiology , 14(6):757–764, June 2017. ISSN 1546-1440. doi: 10.1016/j.jacr.2017.01.044. URL .Selen Bozkurt, Francisco Gimenez, Elizabeth S. Burnside, Kemal H. Gulkesen, andDaniel L. Rubin. Using automatically extracted information from mammographyreports for decision-support.
Journal of Biomedical Informatics , 62:224–231, Au-gust 2016. ISSN 1532-0464. doi: 10.1016/j.jbi.2016.07.001. URL .Selen Bozkurt, Emel Alkim, Imon Banerjee, and Daniel L. Rubin. Automated Detectionof Measurements and Their Descriptors in Radiology Reports Using a Hybrid NaturalLanguage Processing Algorithm.
Journal of Digital Imaging , 32(4):544–553, August2019. ISSN 1618-727X. doi: 10.1007/s10278-019-00237-9. URL https://doi.org/10.1007/s10278-019-00237-9 .Simon Briscoe, Alison Bethel, and Morwenna Rogers. Conduct and reporting of citationsearching in Cochrane systematic reviews: A cross-sectional study.
Research SynthesisMethods , 11(2):169–180, 2020. ISSN 1759-2887. doi: 10.1002/jrsm.1355. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/jrsm.1355 .Kate Brizzi, Sophia N. Zupanc, Brooks V. Udelsman, James A. Tulsky, Alexi A.Wright, Hanneke Poort, and Charlotta Lindvall. Natural Language Processing toAssess Palliative Care and End-of-Life Process Measures in Patients With BreastCancer With Leptomeningeal Disease.
American Journal of Hospice and PalliativeMedicine , 37(5):371–376, 2019. doi: https://doi.org/10.1177/1049909119885585. URL https://journals.sagepub.com/doi/abs/10.1177/1049909119885585 .A. D. Brown and J. R. Kachura. Natural Language Processing of Radiology Reportsin Patients With Hepatocellular Carcinoma to Predict Radiology Resource Utiliza-tion.
Journal of the American College of Radiology , 16(6):840–844, June 2019. ISSN1546-1440. doi: 10.1016/j.jacr.2018.12.004. URL .Andrew D. Brown and Thomas R. Marotta. A Natural Language Processing-based Model to Automate MRI Brain Protocol Selection and Prioritization.
Aca-demic Radiology , 24(2):160–166, February 2017. ISSN 1076-6332. doi: 10.1016/23.acra.2016.09.013. URL .Andrew D. Brown and Thomas R. Marotta. Using machine learning for sequence-level automated MRI protocol selection in neuroradiology.
Journal of the Amer-ican Medical Informatics Association , 25(5):568–571, May 2018. ISSN 1067-5027.doi: 10.1093/jamia/ocx125. URL https://academic.oup.com/jamia/article/25/5/568/4569611 .Hakan Bulu, Dorothy A. Sippo, Janie M. Lee, Elizabeth S. Burnside, and Daniel L.Rubin. Proposing New RadLex Terms by Analyzing Free-Text Mammography Re-ports.
Journal of Digital Imaging , 31(5):596–603, October 2018. ISSN 1618-727X. doi:10.1007/s10278-018-0064-0. URL https://doi.org/10.1007/s10278-018-0064-0 .Tianrun Cai, Andreas A. Giannopoulos, Sheng Yu, Tatiana Kelil, Beth Ripley,Kanako K. Kumamaru, Frank J. Rybicki, and Dimitrios Mitsouras. Natural LanguageProcessing Technologies in Radiology Research and Clinical Applications.
Radio-Graphics , 36(1):176–191, January 2016. ISSN 0271-5333. doi: 10.1148/rg.2016150080.URL https://pubs.rsna.org/doi/full/10.1148/rg.2016150080 .Sergio M. Castro, Eugene Tseytlin, Olga Medvedeva, Kevin Mitchell, ShyamVisweswaran, Tanja Bekhuis, and Rebecca S. Jacobson. Automated annotationand classification of BI-RADS assessment from radiology reports.
Journal ofBiomedical Informatics , 69:177–187, May 2017. ISSN 1532-0464. doi: 10.1016/j.jbi.2017.04.011. URL .David Chen, Sijia Liu, Paul Kingsbury, Sunghwan Sohn, Curtis B. Storlie, Elizabeth B.Habermann, James M. Naessens, David W. Larson, and Hongfang Liu. Deep learningand alternative learning strategies for retrospective real-world clinical data. npj DigitalMedicine , 2(1):1–5, May 2019. ISSN 2398-6352. doi: 10.1038/s41746-019-0122-0. URL .Matthew C. Chen, Robyn L. Ball, Lingyao Yang, Nathaniel Moradzadeh, Brian E. Chap-man, David B. Larson, Curtis P. Langlotz, Timothy J. Amrhein, and Matthew P.Lungren. Deep Learning to Classify Radiology Free-Text Reports.
Radiology , 286(3):845–852, November 2017. ISSN 0033-8419. doi: 10.1148/radiol.2017171115. URL https://pubs.rsna.org/doi/full/10.1148/radiol.2017171115 .Po-Hao Chen, Hanna Zafar, Maya Galperin-Aizenberg, and Tessa Cook. Integrat-ing Natural Language Processing and Machine Learning Algorithms to CategorizeOncologic Response in Radiology Reports.
Journal of Digital Imaging , 31(2):178–184, April 2018. ISSN 1618-727X. doi: 10.1007/s10278-017-0027-x. URL https://doi.org/10.1007/s10278-017-0027-x .24nne Cocos, Ting Qian, Chris Callison-Burch, and Aaron J. Masino. Crowd con-trol: Effectively utilizing unscreened crowd workers for biomedical data annota-tion.
Journal of Biomedical Informatics , 69:86–92, May 2017. ISSN 1532-0464.doi: 10.1016/j.jbi.2017.04.003. URL .Jacob Cohen. A Coefficient of Agreement for Nominal Scales.
Educational and Psy-chological Measurement , 20(1):37–46, April 1960. ISSN 0013-1644. doi: 10.1177/001316446002000104. URL https://doi.org/10.1177/001316446002000104 .A. Comelli, L. Agnello, and S. Vitabile. An ontology-based retrieval system for mammo-graphic reports. In ,pages 1001–1006, Larnaca, July 2015. IEEE. doi: 10.1109/ISCC.2015.7405644.Viviana Cotik, Dario Filippo, and Jose Castano. An Approach for Automatic Classifica-tion of Radiology Reports in Spanish.
Studies in Health Technology and Informatics ,216:634–638, jan 2015. ISSN 0926-9630, 1879-8365. URL https://europepmc.org/article/med/26262128 .Viviana Cotik, Horacio Rodr´ıguez, and Jorge Vivaldi. Spanish Named Entity Recogni-tion in the Biomedical Domain. In Juan Antonio Lossio-Ventura, Denisse Mu˜nante,and Hugo Alatrista-Salas, editors,
Sandeep Dalal, Vadiraj Hombal, Wei-Hung Weng, Gabe Mankovich, Thusitha Mabotuwana, Christopher S. Hall, Joseph Fuller, Bruce E. Lehnert, and Martin L. Gunn. Determining Follow-Up Imaging Study Using Radiology Reports. Journal of Digital Imaging, 33(1):121–130, February 2020. ISSN 1618-727X. doi: 10.1007/s10278-019-00260-w. URL https://doi.org/10.1007/s10278-019-00260-w.
Neil Deshmukh, Selin Gumustop, Romane Gauriau, Varun Buch, Bradley Wright, Christopher Bridge, Ram Naidu, Katherine Andriole, and Bernardo Bizzo. Semi-Supervised Natural Language Approach for Fine-Grained Classification of Medical Reports. arXiv:1910.13573 [cs.LG], November 2019. URL http://arxiv.org/abs/1910.13573.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Lane F. Donnelly, Robert Grzeszczuk, Carolina V. Guimaraes, Wei Zhang, and George S. Bisset III. Using a Natural Language Processing and Machine Learning Algorithm Program to Analyze Inter-Radiologist Report Style Variation and Compare Variation Between Radiologists When Using Highly Structured Versus More Free Text Reporting. Current Problems in Diagnostic Radiology, 48(6):524–530, November 2019. ISSN 0363-0188. doi: 10.1067/j.cpradiol.2018.09.005.
Ruth M. Dunne, Ivan K. Ip, Sarah Abbett, Esteban F. Gershanik, Ali S. Raja, Andetta Hunsaker, and Ramin Khorasani. Effect of Evidence-based Clinical Decision Support on the Use and Yield of CT Pulmonary Angiographic Imaging in Hospitalized Patients. Radiology, 276(1):167–174, February 2015. ISSN 0033-8419. doi: 10.1148/radiol.15141208. URL https://pubs.rsna.org/doi/full/10.1148/radiol.15141208.
Allyson Ettinger. What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models. Transactions of the Association for Computational Linguistics, 8:34–48, January 2020. doi: 10.1162/tacl_a_00298. URL https://doi.org/10.1162/tacl_a_00298.
Farhood Farjah, Scott Halgrim, Diana S. M. Buist, Michael K. Gould, Steven B. Zeliadt, Elizabeth T. Loggers, and David S. Carrell. An Automated Method for Identifying Individuals with a Lung Nodule Can Be Feasibly Implemented Across Health Systems. eGEMs, 4(1):1254, August 2016. ISSN 2327-9214. doi: 10.13063/2327-9214.1254.
Joseph L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382, 1971. ISSN 1939-1455 (electronic), 0033-2909 (print). doi: 10.1037/h0031619.
Sunyang Fu, Lester Y. Leung, Yanshan Wang, Anne-Olivia Raulli, David F. Kallmes, Kristin A. Kinsman, Kristoff B. Nelson, Michael S. Clark, Patrick H. Luetmer, Paul R. Kingsbury, David M. Kent, and Hongfang Liu. Natural Language Processing for the Identification of Silent Brain Infarcts From Neuroimaging Reports. JMIR Medical Informatics, 7(2):e12109, 2019. doi: 10.2196/12109. URL https://medinform.jmir.org/2019/2/e12109/.
Ravi Garg, Elissa Oh, Andrew Naidech, Konrad Kording, and Shyam Prabhakaran. Automating Ischemic Stroke Subtype Classification Using Machine Learning and Natural Language Processing. Journal of Stroke and Cerebrovascular Diseases, 28(7):2045–2051, July 2019. ISSN 1052-3057. doi: 10.1016/j.jstrokecerebrovasdis.2019.02.004.
Jean-François Gehanno, Laetitia Rollin, and Stefan Darmoni. Is the coverage of Google Scholar enough to be used alone for systematic reviews. BMC Medical Informatics and Decision Making, 13:7, 2013. doi: 10.1186/1472-6947-13-7.
Inbal Goldshtein, Gabriel Chodick, Ilan Kochba, Nitsan Gal, Muriel Webb, and Oren Shibolet. Identification and Characterization of Nonalcoholic Fatty Liver Disease. Clinical Gastroenterology and Hepatology, 18(8):1887–1889, July 2020. ISSN 1542-3565. doi: 10.1016/j.cgh.2019.08.007.
Philip John Gorinski, Honghan Wu, Claire Grover, Richard Tobin, Conn Talbot, Heather Whalley, Cathie Sudlow, William Whiteley, and Beatrice Alex. Named Entity Recognition for Electronic Health Records: A Comparison of Rule-based and Machine Learning Approaches. arXiv:1903.03985 [cs.CL], June 2019. URL http://arxiv.org/abs/1903.03985.
Michael K. Gould, Tania Tang, In-Lu Amy Liu, Janet Lee, Chengyi Zheng, Kim N. Danforth, Anne E. Kosco, Jamie L. Di Fiore, and David E. Suh. Recent Trends in the Identification of Incidental Pulmonary Nodules. American Journal of Respiratory and Critical Care Medicine, 192(10):1208–1214, July 2015. ISSN 1073-449X. doi: 10.1164/rccm.201505-0990OC.
A. Grivas, B. Alex, C. Grover, R. Tobin, and W. Whiteley. Not a cute stroke: Analysis of Rule- and Neural Network-Based Information Extraction Systems for Brain Radiology Reports. In Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, 2020.
Robert W. Grundmeier, Aaron J. Masino, T. Charles Casper, Jonathan M. Dean, Jamie Bell, Rene Enriquez, Sara Deakyne, James M. Chamberlain, and Elizabeth R. Alpern. Identification of Long Bone Fractures in Radiology Reports Using Natural Language Processing to Support Healthcare Quality Improvement. Applied Clinical Informatics, 7(4):1051–1068, November 2016. ISSN 1869-0327. doi: 10.4338/ACI-2016-08-RA-0129.
Anupama Gupta, Imon Banerjee, and Daniel L. Rubin. Automatic information extraction from unstructured mammography reports using distributed semantics. Journal of Biomedical Informatics, 78:78–86, February 2018. ISSN 1532-0464. doi: 10.1016/j.jbi.2017.12.016.
A. W. Harzing. Publish or Perish. 2007. URL https://harzing.com/resources/publish-or-perish.
Saeed Hassanpour and Curtis P. Langlotz. Unsupervised Topic Modeling in a Large Free Text Radiology Report Repository. Journal of Digital Imaging, 29(1):59–62, February 2016. ISSN 1618-727X. doi: 10.1007/s10278-015-9823-3. URL https://doi.org/10.1007/s10278-015-9823-3.
Saeed Hassanpour, Graham Bay, and Curtis P. Langlotz. Characterization of Change and Significance for Clinical Findings in Radiology Reports Through Natural Language Processing. Journal of Digital Imaging, 30(3):314–322, June 2017. ISSN 1618-727X. doi: 10.1007/s10278-016-9931-8. URL https://doi.org/10.1007/s10278-016-9931-8.
Marta E. Heilbrun, Brian E. Chapman, Evan Narasimhan, Neel Patel, and Danielle Mowery. Feasibility of Natural Language Processing–Assisted Auditing of Critical Findings in Chest Radiology. Journal of the American College of Radiology, 16(9, Part B):1299–1304, September 2019. ISSN 1546-1440. doi: 10.1016/j.jacr.2019.05.038.
Yi Hong and Jin Zhang. Investigation of Terminology Coverage in Radiology Reporting Templates and Free-text Reports. International Journal of Knowledge Content Development & Technology, 5:5–14, 2015. doi: 10.5865/IJKCT.2015.5.1.005.
Hannu T. Huhdanpaa, W. Katherine Tan, Sean D. Rundell, Pradeep Suri, Falgun H. Chokshi, Bryan A. Comstock, Patrick J. Heagerty, Kathryn T. James, Andrew L. Avins, Srdjan S. Nedeljkovic, David R. Nerenz, David F. Kallmes, Patrick H. Luetmer, Karen J. Sherman, Nancy L. Organ, Brent Griffith, Curtis P. Langlotz, David Carrell, Saeed Hassanpour, and Jeffrey G. Jarvik. Using Natural Language Processing of Free-Text Radiology Reports to Identify Type 1 Modic Endplate Changes. Journal of Digital Imaging, 31(1):84–90, February 2018. ISSN 1618-727X. doi: 10.1007/s10278-017-0013-3. URL https://doi.org/10.1007/s10278-017-0013-3.
K. Jnawali, M. R. Arbabshirani, A. E. Ulloa, N. Rao, and A. A. Patel. Automatic Classification of Radiological Report for Intracranial Hemorrhage. In 2019 IEEE International Conference on Semantic Computing (ICSC), pages 187–190, Newport Beach, CA, USA, January 2019. IEEE. doi: 10.1109/ICOSC.2019.8665578.
E. Johnson, W. C. Baughman, and G. Ozsoyoglu. A method for imputation of semantic class in diagnostic radiology text. In 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 750–755, Washington, DC, November 2015. IEEE. doi: 10.1109/BIBM.2015.7359780.
Stella K. Kang, Kira Garry, Ryan Chung, William H. Moore, Eduardo Iturrate, Jordan L. Swartz, Danny C. Kim, Leora I. Horwitz, and Saul Blecker. Natural Language Processing for Identification of Incidental Pulmonary Nodules in Radiology Reports. Journal of the American College of Radiology, 16(11):1587–1594, November 2019. ISSN 1546-1440. doi: 10.1016/j.jacr.2019.04.026.
B. Karunakaran, D. Misra, K. Marshall, D. Mathrawala, and S. Kethireddy. Closing the loop — Finding lung cancer patients using NLP. In 2017 IEEE International Conference on Big Data (Big Data), pages 2452–2461, Boston, MA, December 2017. IEEE. doi: 10.1109/BigData.2017.8258203.
Kenneth L. Kehl, Haitham Elmarakeby, Mizuki Nishino, Eliezer M. Van Allen, Eva M. Lepisto, Michael J. Hassett, Bruce E. Johnson, and Deborah Schrag. Assessment of Deep Natural Language Processing in Ascertaining Oncologic Outcomes From Radiology Reports. JAMA Oncology, 5(10):1421–1429, October 2019. ISSN 2374-2437. doi: 10.1001/jamaoncol.2019.1800. URL https://doi.org/10.1001/jamaoncol.2019.1800.
Chulho Kim, Vivienne Zhu, Jihad Obeid, and Leslie Lenert. Natural language processing and machine learning algorithm to identify brain MRI reports with acute ischemic stroke. PLOS ONE, 14(2):e0212778, February 2019. ISSN 1932-6203. doi: 10.1371/journal.pone.0212778. URL https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0212778.
Kory Kreimeyer, Matthew Foster, Abhishek Pandey, Nina Arya, Gwendolyn Halford, Sandra F. Jones, Richard Forshee, Mark Walderhaug, and Taxiarchis Botsis. Natural language processing systems for capturing and standardizing unstructured clinical information: A systematic review. Journal of Biomedical Informatics, 73:14–29, 2017. ISSN 1532-0480. doi: 10.1016/j.jbi.2017.07.012.
Janice L. Kwan, Darya Yermak, Lezlie Markell, Narinder S. Paul, Kaveh J. Shojania, and Peter Cram. Follow Up of Incidental High-Risk Pulmonary Nodules on Computed Tomography Pulmonary Angiography at Care Transitions. Journal of Hospital Medicine, 14(6):349–352, June 2019. doi: 10.12788/jhm.3128. URL https://europepmc.org/article/med/30794133.
Monika Kłos, Jarosław Żyłkowski, and Dominik Spinczyk. Automatic Classification of Text Documents Presenting Radiology Examinations. In Ewa Pietka, Pawel Badura, Jacek Kawa, and Wojciech Wieclawek, editors, Proceedings of the 6th International Conference on Information Technology in Biomedicine (ITIB 2018), Advances in Intelligent Systems and Computing, pages 495–505. Springer International Publishing, 2018. ISBN 978-3-319-91211-0. doi: 10.1007/978-3-319-91211-0_43.
Ronilda Lacson, Kimberly Harris, Phyllis Brawarsky, Tor D. Tosteson, Tracy Onega, Anna N. A. Tosteson, Abby Kaye, Irina Gonzalez, Robyn Birdwell, and Jennifer S. Haas. Evaluation of an Automated Information Extraction Tool for Imaging Data Elements to Populate a Breast Cancer Screening Registry. Journal of Digital Imaging, 28(5):567–575, October 2015. ISSN 1618-727X. doi: 10.1007/s10278-014-9762-4. URL https://doi.org/10.1007/s10278-014-9762-4.
Ronilda Lacson, Martha E. Goodrich, Kimberly Harris, Phyllis Brawarsky, and Jennifer S. Haas. Assessing Inaccuracies in Automated Information Extraction of Breast Imaging Findings. Journal of Digital Imaging, 30(2):228–233, April 2017. ISSN 1618-727X. doi: 10.1007/s10278-016-9927-4. URL https://doi.org/10.1007/s10278-016-9927-4.
M. Lafourcade and Lionel Ramadier. Radiological text simplification using a general knowledge base. In Intelligent Text Processing (CICLing 2017), Budapest, Hungary, 2017. doi: 10.1007/978-3-319-77116-8.
In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, 2016. European Language Resources Association (ELRA). ISBN 978-2-9517408-9-1. URL https://hal.archives-ouvertes.fr/hal-01382320.
J. Richard Landis and Gary G. Koch. The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1):159–174, 1977. ISSN 0006-341X. doi: 10.2307/2529310.
Andrew Yu Li and Nikki Elliot. Natural language processing to identify ureteric stones in radiology reports. Journal of Medical Imaging and Radiation Oncology, 63(3):307–310, 2019. ISSN 1754-9485. doi: 10.1111/1754-9485.12861. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1754-9485.12861.
Yi Liu, Li-Na Zhu, Qing Liu, Chao Han, Xiao-Dong Zhang, and Xiao-Ying Wang. Automatic extraction of imaging observation and assessment categories from breast magnetic resonance imaging reports with natural language processing. Chinese Medical Journal, 132(14):1673–1680, July 2019. ISSN 0366-6999. doi: 10.1097/CM9.0000000000000301.
Thusitha Mabotuwana, Christopher S. Hall, Joel Tieder, and Martin L. Gunn. Improving Quality of Follow-Up Imaging Recommendations in Radiology. AMIA Annual Symposium Proceedings, 2017:1196–1204, April 2018a. ISSN 1942-597X.
Thusitha Mabotuwana, Vadiraj Hombal, Sandeep Dalal, Christopher S. Hall, and Martin Gunn. Determining Adherence to Follow-up Imaging Recommendations. Journal of the American College of Radiology, 15(3, Part A):422–428, March 2018b. ISSN 1546-1440. doi: 10.1016/j.jacr.2017.11.022.
Margaret Mahan, Daniel Rafter, Hannah Casey, Marta Engelking, Tessneem Abdallah, Charles Truwit, Mark Oswood, and Uzma Samadani. tbiExtractor: A framework for extracting traumatic brain injury common data elements from radiology reports. bioRxiv 585331, 2019. doi: 10.1101/585331.
Máté E. Maros, Ralf Wenz, Alex Förster, Matthias F. Froelich, Christoph Groden, Wieland H. Sommer, Stefan O. Schönberg, Thomas Henzler, and Holger Wenz. Objective Comparison Using Guideline-based Query of Conventional Radiological Reports and Structured Reports. In Vivo, 32(4):843–849, January 2018. ISSN 0258-851X, 1791-7549. doi: 10.21873/invivo.11318. URL http://iv.iiarjournals.org/content/32/4/843.
Aaron J. Masino, Robert W. Grundmeier, Jeffrey W. Pennington, John A. Germiller, and E. Bryan Crenshaw. Temporal bone radiology report classification using open source machine learning and natural langue processing libraries. BMC Medical Informatics and Decision Making, 16(1):65, June 2016. ISSN 1472-6947. doi: 10.1186/s12911-016-0306-3. URL https://doi.org/10.1186/s12911-016-0306-3.
Stephane Meystre, Ramkiran Gouripeddi, Joel Tieder, Jeffrey Simmons, Rajendu Srivastava, and Samir Shah. Enhancing Comparative Effectiveness Research With Automated Pediatric Pneumonia Detection in a Multi-Institutional Clinical Repository: A PHIS+ Pilot Study. Journal of Medical Internet Research, 19(5):e162, 2017. doi: 10.2196/jmir.6887.
Shumei Miao, Tingyu Xu, Yonghui Wu, Hui Xie, Jingqi Wang, Shenqi Jing, Yaoyun Zhang, Xiaoliang Zhang, Yinshuang Yang, Xin Zhang, Tao Shan, Li Wang, Hua Xu, Shui Wang, and Yun Liu. Extraction of BI-RADS findings from breast ultrasound reports in Chinese using deep learning approaches. International Journal of Medical Informatics, 119:17–21, November 2018. ISSN 1386-5056. doi: 10.1016/j.ijmedinf.2018.08.009.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. 2013. URL http://arxiv.org/abs/1301.3781.
Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
Matthew J. Minn, Arash R. Zandieh, and Ross W. Filice. Improving Radiology Report Quality by Rapidly Notifying Radiologist of Report Errors. Journal of Digital Imaging, 28(4):492–498, August 2015. ISSN 1618-727X. doi: 10.1007/s10278-015-9781-9. URL https://doi.org/10.1007/s10278-015-9781-9.
David Moher, Larissa Shamseer, Mike Clarke, Davina Ghersi, Alessandro Liberati, Mark Petticrew, Paul Shekelle, and Lesley A. Stewart. Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015 statement. Systematic Reviews, 4(1):1, December 2015. ISSN 2046-4053. doi: 10.1186/2046-4053-4-1. URL https://systematicreviewsjournal.biomedcentral.com/articles/10.1186/2046-4053-4-1.
Srikanth Mujjiga, Vamsi Krishna, Kalyan Chakravarthi, and Vijayananda J. Identifying Semantics in Clinical Reports Using Neural Machine Translation. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):9552–9557, July 2019. ISSN 2374-3468. doi: 10.1609/aaai.v33i01.33019552.
National Library of Medicine. SNOMED CT. 2021a.
National Library of Medicine. Unified Medical Language System. 2021b.
Nariman Noorbakhsh-Sabet, Georgios Tsivgoulis, Shima Shahjouei, Yirui Hu, Nitin Goyal, Andrei V. Alexandrov, and Ramin Zand. Racial Difference in Cerebral Microbleed Burden Among a Patient Population in the Mid-South United States. Journal of Stroke and Cerebrovascular Diseases, 27(10):2657–2661, October 2018. ISSN 1052-3057. doi: 10.1016/j.jstrokecerebrovasdis.2018.05.031.
Heiner Oberkampf, Sonja Zillner, James A. Overton, Bernhard Bauer, Alexander Cavallaro, Michael Uder, and Matthias Hammon. Semantic representation of reported measurements in radiology. BMC Medical Informatics and Decision Making, 16(1):5, January 2016. ISSN 1472-6947. doi: 10.1186/s12911-016-0248-9. URL https://doi.org/10.1186/s12911-016-0248-9.
Charlene Jennifer Ong, Agni Orfanoudaki, Rebecca Zhang, Francois Pierre M. Caprasse, Meghan Hutch, Liang Ma, Darian Fard, Oluwafemi Balogun, Matthew I. Miller, Margaret Minnig, Hanife Saglam, Brenton Prescott, David M. Greer, Stelios Smirnakis, and Dimitris Bertsimas. Machine learning and natural language processing methods to identify ischemic stroke, acuity and location from radiology reports. PLOS ONE, 15(6):e0234908, June 2020. ISSN 1932-6203. doi: 10.1371/journal.pone.0234908. URL https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0234908.
Tejal A. Patel, Mamta Puppala, Richard O. Ogunti, Joe E. Ensor, Tiancheng He, Jitesh B. Shewale, Donna P. Ankerst, Virginia G. Kaklamani, Angel A. Rodriguez, Stephen T. C. Wong, and Jenny C. Chang. Correlating mammographic and pathologic findings in clinical decision support using natural language processing and data mining methods. Cancer, 123(1):114–121, January 2017. ISSN 1097-0142. doi: 10.1002/cncr.30245.
Y. Peng, K. Yan, V. Sandfort, R. M. Summers, and Z. Lu. A self-attention based deep learning method for lesion attribute detection from CT reports. In 2019 IEEE International Conference on Healthcare Informatics (ICHI), pages 1–5, Xi'an, China, June 2019. IEEE Computer Society. doi: 10.1109/ICHI.2019.8904668.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
Bethany Percha, Yuhao Zhang, Selen Bozkurt, Daniel Rubin, Russ B. Altman, and Curtis P. Langlotz. Expanding a radiology lexicon using contextual patterns in radiology reports. Journal of the American Medical Informatics Association, 25(6):679–685, June 2018. doi: 10.1093/jamia/ocx152. URL https://academic.oup.com/jamia/article/25/6/679/4797401.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. CoRR, abs/1802.05365, 2018. URL http://arxiv.org/abs/1802.05365.
Ewoud Pons, Loes M. M. Braun, M. G. Myriam Hunink, and Jan A. Kors. Natural Language Processing in Radiology: A Systematic Review. Radiology, 279(2):329–343, April 2016. ISSN 0033-8419. doi: 10.1148/radiol.16142770. URL https://pubs.rsna.org/doi/10.1148/radiol.16142770.
Peter Pruitt, Andrew Naidech, Jonathan Van Ornam, Pierre Borczuk, and William Thompson. A natural language processing algorithm to extract characteristics of subdural hematoma from head CT reports. Emergency Radiology, 26(3):301–306, June 2019. ISSN 1438-1435. doi: 10.1007/s10140-019-01673-4. URL https://doi.org/10.1007/s10140-019-01673-4.
Basel Qenam, Tae Youn Kim, Mark J. Carroll, and Michael Hogarth. Text Simplification Using Consumer Health Vocabulary to Generate Patient-Centered Radiology Reporting: Translation and Evaluation. Journal of Medical Internet Research, 19(12):e417, 2017. doi: 10.2196/jmir.8536.
Alex Ratner, Braden Hancock, Jared Dunnmon, Roger Goldman, and Christopher Ré. Snorkel MeTaL: Weak Supervision for Multi-Task Learning. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning (DEEM '18), pages 1–4, Houston, TX, USA, 2018. ACM. ISBN 978-1-4503-5828-6. doi: 10.1145/3209889.3209898. URL https://doi.org/10.1145/3209889.3209898.
Joseph S. Redman, Yamini Natarajan, Jason K. Hou, Jingqi Wang, Muzammil Hanif, Hua Feng, Jennifer R. Kramer, Roxanne Desiderio, Hua Xu, Hashem B. El-Serag, and Fasiha Kanwal. Accurate Identification of Fatty Liver Disease in Data Warehouse Utilizing Natural Language Processing. Digestive Diseases and Sciences, 62(10):2713–2718, October 2017. ISSN 1573-2568. doi: 10.1007/s10620-017-4721-9. URL https://doi.org/10.1007/s10620-017-4721-9.
RSNA. RadLex. 2021. URL http://radlex.org/.
Yvonne Sada, Jason Hou, Peter Richardson, Hashem El-Serag, and Jessica Davila. Validation of Case Finding Algorithms for Hepatocellular Cancer from Administrative Data and Electronic Health Records using Natural Language Processing. Medical Care, 54(2):e9–e14, February 2016. ISSN 0025-7079. doi: 10.1097/MLR.0b013e3182a30373.
M. Sevenster, J. Buurman, P. Liu, J. F. Peters, and P. J. Chang. Natural Language Processing Techniques for Extracting and Categorizing Finding Measurements in Narrative Radiology Reports. Applied Clinical Informatics, 6(3):600–610, 2015a. ISSN 1869-0327. doi: 10.4338/ACI-2014-11-RA-0110.
Merlijn Sevenster, Jeffrey Bozeman, Andrea Cowhy, and William Trost. A natural language processing pipeline for pairing measurements uniquely across free-text CT reports. Journal of Biomedical Informatics, 53:36–48, February 2015b. ISSN 1532-0464. doi: 10.1016/j.jbi.2014.08.015.
S. C. Shelmerdine, M. Singh, W. Norman, R. Jones, N. J. Sebire, and O. J. Arthurs. Automated data extraction and report analysis in computer-aided radiology audit: practice implications from post-mortem paediatric imaging. Clinical Radiology, 74(9):733.e11–733.e18, September 2019. ISSN 0009-9260. doi: 10.1016/j.crad.2019.04.021.
B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE Journal of Biomedical and Health Informatics, 22(5):1589–1604, September 2018. ISSN 2168-2208. doi: 10.1109/JBHI.2017.2767063.
B. Shin, F. H. Chokshi, T. Lee, and J. D. Choi. Classification of radiology reports using neural attention models. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 4363–4370, Anchorage, AK, May 2017. IEEE. doi: 10.1109/IJCNN.2017.7966408.
Ryan G. Short, John Bralich, Dave Bogaty, and Nicholas T. Befera. Comprehensive Word-Level Classification of Screening Mammography Reports Using a Neural Network Sequence Labeling Approach. Journal of Digital Imaging, 32(5):685–692, October 2019. ISSN 1618-727X. doi: 10.1007/s10278-018-0141-4. URL https://doi.org/10.1007/s10278-018-0141-4.
Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Ng, and Matthew Lungren. Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1500–1519, Online, November 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.117.
Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y. Ng, and Matthew P. Lungren. CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT. CoRR, abs/2004.09167, 2020b. URL https://arxiv.org/abs/2004.09167.
V. Sorin, Y. Barash, E. Konen, and E. Klang. Deep Learning for Natural Language Processing in Radiology-Fundamentals and a Systematic Review. Journal of the American College of Radiology: JACR, 17(5):639–648, 2020. doi: 10.1016/j.jacr.2019.12.026.
Irena Spasic and Goran Nenadic. Clinical Text Data in Machine Learning: Systematic Review. JMIR Medical Informatics, 8(3):e17984, March 2020. ISSN 2291-9694. doi: 10.2196/17984.
Jackson M. Steinkamp, Charles Chambers, Darco Lalevic, Hanna M. Zafar, and Tessa S. Cook. Toward Complete Structured Information Extraction from Radiology Reports Using Machine Learning. Journal of Digital Imaging, 32(4):554–564, August 2019. ISSN 1618-727X. doi: 10.1007/s10278-019-00234-y. URL https://doi.org/10.1007/s10278-019-00234-y.
Amir M. Tahmasebi, Henghui Zhu, Gabriel Mankovich, Peter Prinsen, Prescott Klassen, Sam Pilato, Rob van Ommering, Pritesh Patel, Martin L. Gunn, and Paul Chang. Automatic Normalization of Anatomical Phrases in Radiology Reports Using Unsupervised Learning. Journal of Digital Imaging, 32(1):6–18, February 2019. ISSN 1618-727X. doi: 10.1007/s10278-018-0116-5. URL https://doi.org/10.1007/s10278-018-0116-5.
W. Katherine Tan and Patrick J. Heagerty. Surrogate-guided sampling designs for classification of rare outcomes from electronic medical records data. arXiv:1904.00412 [stat.ME], March 2019. URL http://arxiv.org/abs/1904.00412.
W. Katherine Tan, Saeed Hassanpour, Patrick J. Heagerty, Sean D. Rundell, Pradeep Suri, Hannu T. Huhdanpaa, Kathryn James, David S. Carrell, Curtis P. Langlotz, Nancy L. Organ, Eric N. Meier, Karen J. Sherman, David F. Kallmes, Patrick H. Luetmer, Brent Griffith, David R. Nerenz, and Jeffrey G. Jarvik. Comparison of Natural Language Processing Rules-based and Machine-learning Systems to Identify Lumbar Spine Imaging Findings Related to Low Back Pain. Academic Radiology, 25(11):1422–1432, November 2018. ISSN 1076-6332. doi: 10.1016/j.acra.2018.03.008.
Gaurav Trivedi, Charmgil Hong, Esmaeel R. Dadashzadeh, Robert M. Handzel, Harry Hochheiser, and Shyam Visweswaran. Identifying incidental findings from radiology reports of trauma patients: An evaluation of automated feature representation methods. International Journal of Medical Informatics, 129:81–87, September 2019. ISSN 1386-5056. doi: 10.1016/j.ijmedinf.2019.05.021.
Hari Trivedi, Joseph Mesterhazy, Benjamin Laguna, Thienkhai Vu, and Jae Ho Sohn. Automatic Determination of the Need for Intravenous Contrast in Musculoskeletal MRI Examinations Using IBM Watson's Natural Language Processing Algorithm. Journal of Digital Imaging, 31(2):245–251, April 2018. ISSN 1618-727X. doi: 10.1007/s10278-017-0021-3. URL https://doi.org/10.1007/s10278-017-0021-3.
Robert M. Van Haren, Arlene M. Correa, Boris Sepesi, David C. Rice, Wayne L. Hofstetter, Reza J. Mehran, Ara A. Vaporciyan, Garrett L. Walsh, Jack A. Roth, Stephen G. Swisher, and Mara B. Antonoff. Ground Glass Lesions on Chest Imaging: Evaluation of Reported Incidence in Cancer Patients Using Natural Language Processing. The Annals of Thoracic Surgery, 107(3):936–940, March 2019. ISSN 0003-4975. doi: 10.1016/j.athoracsur.2018.09.016.
Emily Wheater, Grant Mair, Cathie Sudlow, Beatrice Alex, Claire Grover, and William Whiteley. A validated natural language processing algorithm for brain imaging phenotypes from radiology reports in UK electronic health records. BMC Medical Informatics and Decision Making, 19(1):184, September 2019. ISSN 1472-6947. doi: 10.1186/s12911-019-0908-7. URL https://doi.org/10.1186/s12911-019-0908-7.
Claes Wohlin. Guidelines for Snowballing in Systematic Literature Studies and a Replication in Software Engineering. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (EASE '14), London, England, United Kingdom, 2014. Association for Computing Machinery, New York, NY, USA. ISBN 978-1-4503-2476-2. doi: 10.1145/2601248.2601268. URL https://doi.org/10.1145/2601248.2601268.
David A. Wood, Jeremy Lynch, Sina Kafiabadi, Emily Guilhem, Aisha Al Busaidi, Antanas Montvila, Thomas Varsavsky, Juveria Siddiqui, Naveen Gadapa, Matthew Townend, Martin Kiik, Keena Patel, Gareth Barker, Sebastian Ourselin, James H. Cole, and Thomas C. Booth. Automated Labelling using an Attention model for Radiology reports of MRI scans (ALARM). arXiv:2002.06588 [cs.CV], 2020. URL http://arxiv.org/abs/2002.06588.
Stephen Wu, Kirk Roberts, Surabhi Datta, Jingcheng Du, Zongcheng Ji, Yuqi Si, Sarvesh Soni, Qiong Wang, Qiang Wei, Yang Xiang, Bo Zhao, and Hua Xu. Deep learning in clinical natural language processing: a methodical review. Journal of the American Medical Informatics Association: JAMIA, 27(3):457–470, 2020. ISSN 1527-974X. doi: 10.1093/jamia/ocz200.
Zhe Xie, Yuanyuan Yang, Mingqing Wang, Ming Li, Haozhe Huang, Dezhong Zheng, Rong Shu, and Tonghui Ling. Introducing Information Extraction to Radiology Information Systems to Improve the Efficiency on Reading Reports. Methods of Information in Medicine, 58(2-03):94–106, 2019. ISSN 2511-705X. doi: 10.1055/s-0039-1694992.
Kabir Yadav, Efsun Sarioglu, Hyeong-Ah Choi, Walter B. Cartwright, Pamela S. Hinds, and James M. Chamberlain. Automated Outcome Classification of Computed Tomography Imaging Reports for Pediatric Traumatic Brain Injury. Academic Emergency Medicine, 23(2):171–178, 2016. ISSN 1553-2712. doi: 10.1111/acem.12859. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/acem.12859.
Zihao Yan, Ivan K. Ip, Ali S. Raja, Anurag Gupta, Joshua M. Kosowsky, and Ramin Khorasani. Yield of CT Pulmonary Angiography in the Emergency Department When Providers Override Evidence-based Clinical Decision Support. Radiology, 282(3):717–725, September 2016. ISSN 0033-8419. doi: 10.1148/radiol.2016151985. URL https://pubs.rsna.org/doi/full/10.1148/radiol.2016151985.
Hongmei Yang, Lin Li, Ridong Yang, and Yi Zhou. Towards Automated Knowledge Discovery of Hepatocellular Carcinoma: Extract Patient Information from Chinese Clinical Reports. In Proceedings of the 2nd International Conference on Medical and Health Informatics (ICMHI '18), pages 111–116, New York, NY, USA, June 2018. ACM. ISBN 978-1-4503-6389-1. doi: 10.1145/3239438.3239445. URL https://doi.org/10.1145/3239438.3239445.
Koichiro Yasaka and Osamu Abe. Deep learning and artificial intelligence in radiology: Current applications and future directions. PLOS Medicine, 15(11):e1002707, November 2018. ISSN 1549-1676. doi: 10.1371/journal.pmed.1002707. URL https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002707.
Wen-wai Yim, Tyler Denman, Sharon W. Kwan, and Meliha Yetisgen. Tumor information extraction in radiology reports for hepatocellular carcinoma patients. AMIA Summits on Translational Science Proceedings, 2016:455–464, July 2016a. ISSN 2153-4063.
Wen-wai Yim, Sharon W. Kwan, and Meliha Yetisgen. Tumor reference resolution and characteristic extraction in radiology reports for liver cancer stage prediction. Journal of Biomedical Informatics, 64:179–191, December 2016b. ISSN 1532-0464. doi: 10.1016/j.jbi.2016.10.005.
Wen-wai Yim, Sharon W. Kwan, and Meliha Yetisgen. Classifying tumor event attributes in radiology reports. Journal of the Association for Information Science and Technology, 68(11):2662–2674, 2017. ISSN 2330-1643. doi: 10.1002/asi.23937. URL https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/asi.23937.
Wen-wai Yim, Sharon W. Kwan, Guy Johnson, and Meliha Yetisgen. Classification of hepatocellular carcinoma stages from free-text clinical and radiology reports. AMIA Annual Symposium Proceedings, 2017:1858–1867, April 2018. ISSN 1942-597X.
T. Young, D. Hazarika, S. Poria, and E. Cambria. Recent Trends in Deep Learning Based Natural Language Processing [Review Article]. IEEE Computational Intelligence Magazine, 13(3):55–75, August 2018. ISSN 1556-6048. doi: 10.1109/MCI.2018.2840738.
John Zech, Margaret Pain, Joseph Titano, Marcus Badgeley, Javin Schefflein, Andres Su, Anthony Costa, Joshua Bederson, Joseph Lehar, and Eric Karl Oermann. Natural Language–based Machine Learning Models for the Annotation of Clinical Radiology Reports. Radiology, 287(2):570–580, January 2018. ISSN 0033-8419. doi: 10.1148/radiol.2018171093. URL https://pubs.rsna.org/doi/full/10.1148/radiol.2018171093.
John Zech, Jessica Forde, Joseph J. Titano, Deepak Kaji, Anthony Costa, and Eric Karl Oermann. Detecting insertion, substitution, and deletion errors in radiology reports using neural sequence-to-sequence models. Annals of Translational Medicine, 7(11), June 2019. ISSN 2305-5839. doi: 10.21037/atm.2018.08.11.
A. Y. Zhang, S. S. W. Lam, N. Liu, Y. Pang, L. L. Chan, and P. H. Tang. Development of a Radiology Decision Support System for the Classification of MRI Brain Scans. In 2018 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT), pages 107–115, December 2018. doi: 10.1109/BDCAT.2018.00021.
Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christopher D. Manning, and Curtis P. Langlotz. Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports. arXiv:1911.02541 [cs.CL], 2019. URL http://arxiv.org/abs/1911.02541.
Yiqing Zhao, Nooshin J. Fesharaki, Hongfang Liu, and Jake Luo. Using data-driven sublanguage pattern mining to induce knowledge models: application in medical image reports knowledge representation. BMC Medical Informatics and Decision Making, 18(1):61, July 2018. ISSN 1472-6947. doi: 10.1186/s12911-018-0645-3. URL https://doi.org/10.1186/s12911-018-0645-3.
Henghui Zhu, Ioannis Ch. Paschalidis, Christopher Hall, and Amir Tahmasebi. Context-Driven Concept Annotation in Radiology Reports: Anatomical Phrase Labeling. AMIA Summits on Translational Science Proceedings, 2019:232–241, May 2019. ISSN 2153-4063.