A Systematic Review of Natural Language Processing Applied to Radiology Reports
Arlene Casey, Emma Davidson, Michael Poon, Hang Dong, Daniel Duma, Andreas Grivas, Claire Grover, Víctor Suárez-Paniagua, Richard Tobin, William Whiteley, Honghan Wu, Beatrice Alex
Affiliations: School of Literatures, Languages and Cultures (LLC), University of Edinburgh; Centre for Clinical Brain Sciences, University of Edinburgh; Centre for Medical Informatics, Usher Institute of Population Health Sciences and Informatics, University of Edinburgh; Health Data Research UK; Institute for Language, Cognition and Computation, School of Informatics, University of Edinburgh; Nuffield Department of Population Health, University of Oxford; Institute of Health Informatics, University College London; Edinburgh Futures Institute, University of Edinburgh

* Corresponding author: arlene.casey AT ed.ac.uk
Abstract

Background
Natural language processing (NLP) has a significant role in advancing healthcare and has been found to be key in extracting structured information from radiology reports. Understanding recent developments in NLP application to radiology is of significance but recent reviews on this are limited. This study systematically assesses and quantifies recent literature in NLP applied to radiology reports.
Methods
We conduct an automated literature search yielding 4,799 results, using automated filtering, metadata enriching steps and citation search combined with manual review. Our analysis is based on 21 variables including radiology characteristics, NLP methodology, performance, study, and clinical application characteristics.
Results
We present a comprehensive analysis of the 164 publications retrieved, with publications in 2019 almost triple those in 2015. Each publication is categorised into one of six clinical application categories. Deep learning use increases over the period, but conventional machine learning approaches are still prevalent. Deep learning remains challenged when data is scarce, and there is little evidence of adoption into clinical practice. Despite 17% of studies reporting F1 scores greater than 0.85, it is hard to comparatively evaluate these approaches given that most of them use different datasets. Only 14 studies made their data available and 15 their code, with 10 externally validating results.
Conclusions
Automated understanding of the clinical narratives of radiology reports has the potential to enhance the healthcare process, and we show that research in this field continues to grow. Reproducibility and explainability of models are important if the domain is to move applications into clinical use. More could be done to share code, enabling validation of methods on different institutional data, and to reduce heterogeneity in the reporting of study properties, allowing inter-study comparisons. Our results have significance for researchers in the field, providing a systematic synthesis of existing work to build on, identifying gaps and opportunities for collaboration, and avoiding duplication.

Background
Medical imaging examinations interpreted by radiologists in the form of narrative reports are used to support and confirm diagnosis in clinical practice. Being able to accurately and quickly identify the information stored in radiologists' narratives has the potential to reduce workloads, support clinicians in their decision processes, triage patients to get urgent care or identify patients for research purposes. However, whilst these reports are generally considered more restricted in vocabulary than other electronic health records (EHR), e.g. clinical notes, it is still difficult to access this information efficiently at scale [Bates et al., 2016]. This is due to the unstructured nature of these reports, and Natural Language Processing (NLP) is key to obtaining structured information from radiology reports [Pons et al., 2016].

NLP applied to radiology reports is shown to be a growing field in earlier reviews [Pons et al., 2016, Cai et al., 2016]. In recent years there has been an even more extensive growth in NLP research in general, and in deep learning methods in particular, which is not seen in the earlier reviews. A more recent review of NLP applied to radiology-related research exists, but it focuses on one NLP technique only, deep learning models [Sorin et al., 2020]. Our paper provides a more comprehensive review, comparing and contrasting all NLP methodologies as they are applied to radiology.

It is of significance to understand and synthesise recent developments specific to NLP in the radiology research field, as this will assist researchers to gain a broader understanding of the field and provide insight into methods and techniques, supporting and promoting new developments. Therefore, we carry out a systematic review of research output on NLP applications in radiology from 2015 onward, allowing for a more up-to-date analysis of the area. An additional listing of our synthesis of publications, detailing their clinical and technical categories along with anatomical scan regions, can be made available on request. Also, different to the existing work, we look at both the clinical application areas NLP is being applied in and the trends in NLP methods. We describe and discuss study properties, e.g. data size, performance and annotation details, quantifying these in relation to both the clinical application areas and NLP methods. Having a more detailed understanding of these properties allows us to make recommendations for future NLP research applied to radiology datasets, supporting improvements and progress in this domain.

Related Work
Amongst pre-existing reviews in this area, [Pons et al., 2016] was the first that was both specific to NLP on radiology reports and systematic in methodology. Their literature search identified 67 studies published in the period up to October 2014. They examined the NLP methods used, summarised their performance and extracted the studies' clinical applications, which they assigned to five broad categories delineating their purpose. Since Pons et al.'s paper, several reviews have emerged with the broader remit of NLP applied to electronic health data, which includes radiology reports. [Kreimeyer et al., 2017] conducted a systematic review of NLP systems with a specific focus on coding free text into clinical terminologies and structured data capture. The systematic review by [Spasic and Nenadic, 2020] specifically examined machine learning approaches to NLP (2015-2019) in more general clinical text data, and a further methodical review was carried out by [Wu et al., 2020] to synthesise literature on deep learning in clinical NLP (up to April 2019), although they did not follow the PRISMA guidelines completely. With radiology reports as their particular focus, [Cai et al., 2016] published, the same year as Pons et al.'s review, an instructive narrative review outlining the fundamentals of NLP techniques applied in radiology. More recently, [Sorin et al., 2020] published a systematic review focused on deep learning radiology-related research. They identified 10 relevant papers in their search (up to September 2019) and examined their deep learning models, comparing these with traditional NLP models, and also considered their clinical applications but did not employ a specific categorisation. We build on this corpus of related work, most specifically Pons et al.'s work. In our initial synthesis of clinical applications we adopt their application categories and further expand upon these to reflect the nature of subsequent literature captured in our work. Additionally, we quantify and compare properties of the studies reviewed and provide a series of recommendations for future NLP research applied to radiology datasets in order to promote improvements and progress in this domain.
Methods
Our methodology followed the Preferred Reporting Items for Systematic Reviews and Meta-Analysis (PRISMA) [Moher et al., 2015], and the protocol is registered on protocols.io.
Eligibility for Literature Inclusion and Search Strategy
We included studies using NLP on radiology reports of any imaging modality and anatomical region for NLP technical development, clinical support, or epidemiological research. Exclusion criteria included: (1) case reports; (2) published before 2015; (3) in a language other than English; (4) processing of radiology images; (5) reviews, conference abstracts, comments, patents, or editorials; (6) not reporting outcomes of interest; (7) not radiology reports; (8) not using NLP methods; (9) not available in full text; (10) duplicates.

We used Publish or Perish [Harzing A. W., 2007], a citation retrieval and analysis software program, to search Google Scholar. Google Scholar has a similar coverage to other databases [Gehanno et al., 2013] and is easier to integrate into search pipelines. We conducted an initial pilot search following the process described here, but the search terms were too specific and restricted the number of publications. However, we did include papers found in the pilot search in full-text screening. We used the following search query, restricted to research articles published in English between January 2015 and October 2019: ("radiology" OR "radiologist") AND ("natural language" OR "text mining" OR "information extraction" OR "document classification" OR "word2vec") NOT patent. We automated the addition of publication metadata and applied filtering to remove irrelevant publications. These automated steps are described in Table 1 and Table 2.

Table 1: Metadata enriching steps undertaken for each publication
1. Match the paper with its DOI via the Crossref API
2. If DOI matched, check Semantic Scholar for metadata/abstract
3. If no DOI match and no abstract, search PubMed for abstract
4. Search arXiv (for a pre-print)
5. If no PDF link, search Unpaywall for available open access versions
6. If PDF but no separate abstract via Semantic Scholar/PubMed, extract abstract from the PDF
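For illustration, the following is a minimal sketch of how steps 1 and 2 of the enrichment pipeline could be implemented against the public Crossref and Semantic Scholar REST APIs. The error handling and field selection are simplified and are not the review's actual implementation.

```python
# Sketch of metadata enrichment steps 1-2 (Table 1): title -> DOI via
# Crossref, then DOI -> abstract via the Semantic Scholar Graph API.
import requests

def find_doi(title):
    """Step 1: match a paper title to a DOI via the Crossref API."""
    resp = requests.get(
        "https://api.crossref.org/works",
        params={"query.bibliographic": title, "rows": 1},
        timeout=30,
    )
    items = resp.json().get("message", {}).get("items", [])
    return items[0].get("DOI") if items else None

def fetch_abstract(doi):
    """Step 2: look up metadata/abstract on Semantic Scholar by DOI."""
    resp = requests.get(
        f"https://api.semanticscholar.org/graph/v1/paper/DOI:{doi}",
        params={"fields": "title,abstract,year"},
        timeout=30,
    )
    return resp.json().get("abstract") if resp.status_code == 200 else None

doi = find_doi("Natural Language Processing Technologies in Radiology")
if doi:
    print(doi, fetch_abstract(doi))
```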
Table 2: Automated filtering steps to remove irrelevant publications
1. Document language is English
2. Word 'patent' in title or URL
3. Year of publication out of range

In addition to the query search, another method to find papers is to conduct a citation search [Briscoe et al., 2020]. The citation search compiled a list of publications that cite the Pons et al. review and the articles cited in Pons' review. To do this, we use a snowballing method [Wohlin, 2014] to follow the forward citation branch for each publication in this list, i.e. finding every article that cites the publications in our list. The branching factor here is large, so we filter at every stage and automatically add metadata.
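The forward-citation snowballing amounts to a filtered breadth-first walk of the citation graph. The sketch below illustrates this under stated assumptions: the get_citing_papers helper is hypothetical (in practice it could wrap a citation API), and keep stands for the automated filters of Table 2.

```python
# Sketch of forward-citation snowballing: a breadth-first walk of the
# citation graph from seed papers, filtering at every stage to keep the
# large branching factor manageable. get_citing_papers is a hypothetical
# callable returning dicts with an "id" key for papers citing a given id.
from collections import deque

def snowball(seed_ids, get_citing_papers, keep, max_depth=2):
    seen = set(seed_ids)
    queue = deque((paper_id, 0) for paper_id in seed_ids)
    results = []
    while queue:
        paper_id, depth = queue.popleft()
        if depth >= max_depth:
            continue
        for citing in get_citing_papers(paper_id):
            if citing["id"] in seen:
                continue
            seen.add(citing["id"])
            if keep(citing):  # automated filters (language, year, 'patent', ...)
                results.append(citing)
                queue.append((citing["id"], depth + 1))
    return results
```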
Manual Review of Literature
Four reviewers (three NLP researchers [AG, DD and HD] and one epidemiologist [MTCP]) independently screened all titles and abstracts with the Rayyan online platform and discussed disagreements. Fleiss' kappa [Fleiss, 1971] agreement between reviewers was 0.70, indicating substantial agreement [Landis and Koch, 1977]. After this screening process, each full-text article was reviewed by a team of eight (six NLP researchers and two epidemiologists) and double reviewed by an NLP researcher. We resolved any discrepancies by discussion in regular meetings.
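For reference, Fleiss' kappa for multiple raters can be computed as in the following minimal sketch, assuming statsmodels; the screening decisions shown are illustrative toy data, not the review's actual labels.

```python
# Fleiss' kappa on toy screening decisions (0 = exclude, 1 = include):
# rows are abstracts screened, columns are the four reviewers' decisions.
import numpy as np
from statsmodels.stats.inter_rater import aggregate_raters, fleiss_kappa

decisions = np.array([
    [1, 1, 1, 1],
    [0, 0, 1, 0],
    [1, 1, 0, 1],
    [0, 0, 0, 0],
    [1, 0, 1, 1],
])
table, _ = aggregate_raters(decisions)  # per-subject category counts
print(f"Fleiss' kappa = {fleiss_kappa(table):.2f}")
```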
Data Extraction for Analysis
We extracted data on: primary clinical application and technical objective, data source(s), study period, radiology report language, anatomical region, imaging modality, disease area, dataset size, annotated set size, training/validation/test set size, external validation performed, domain expert used, number of annotators, inter-annotator agreement, NLP technique(s) used, best-reported results (recall, precision and F1 score), availability of dataset, and availability of code.
Results
The literature search yielded 4,799 possibly relevant publications, from which our automated exclusion process removed 4,402; during both our screening processes, a further 233 were removed, leaving 164 publications. See Figure 1 for details of exclusions at each step.
General Characteristics
Clinical Application Categories
Figure 1: PRISMA diagram for search publication retrieval

Table 3: Scan modality
Scan Modality           No. Studies
Multiple Modalities     38
MRI                     16
CT                      36
X-Ray                   18
Mammogram               5
Ultrasound              4
Not specified           47
TOTAL                   164

Table 4: Image sampling method
Sampling Method         No. Studies
Consecutive Images      33
Non-Consecutive Images  38
Not specified           93
TOTAL                   164

Table 5: Anatomical region scanned
Anatomical Region       No. Studies
Mixed                   45
Thorax                  31
Head/Neck               25
Abdomen                 15
Breast                  15
Extremities             8
Spine                   5
Other                   1
Unspecified             19
TOTAL                   164

Table 6: Disease category
Disease Category             No. Studies
Not specific disease related 40
Oncology                     39
Various                      20
Musculoskeletal              10
Cerebrovascular              13
Other                        13
Respiratory                  10
Trauma                       7
Cardiovascular               6
Gastrointestinal             3
Hepatobiliary                2
Genitourinary                1
TOTAL                        164

Table 7: Radiology report language
Report Language         No. Studies
English                 142
Chinese                 5
Spanish                 4
German                  3
Italian                 2
French                  2
Hebrew                  1
Polish                  1
Brazilian Portuguese    1
Unspecified             3
TOTAL                   164

In synthesis of the literature, each publication was classified by its primary clinical purpose. Pons' work in 2016 categorised publications into five broad categories: Diagnostic Surveillance, Cohort Building for Epidemiological Studies, Query-based Case Retrieval, Quality Assessment of Radiological Practice and Clinical Support Services. We found some changes in this categorisation schema and our categorisation consisted of six categories:
Diagnostic Surveillance, Disease Information and Classification, Quality Compliance, Cohort/Epidemiology, Language Discovery and Knowledge Structure, and Technical NLP. The main difference is that we found no evidence for a category of Clinical Support Services, which described applications that had been integrated into the workflow to assist. Despite the increase in the number of publications, very few were in clinical use, with more focus on the category of Disease Information and Classification. We describe each clinical application area in more detail below and, where applicable, how our categories differ from the earlier findings. A listing of all publications and their corresponding clinical application category can be made available on request. Table 8 shows the clinical application category by the technical classification and Figure 2 shows the breakdown of clinical application category by publication year. There were more publications in 2019 compared with 2015 for all categories except Language Discovery & Knowledge Structure, which fell by ≈25% (Figure 2).

Table 8: Clinical Application Category by Technical Objective
Application Category                  Information  Report/Sentence  Lexicon/Ontology  Clustering
                                      Extraction   Classification   Discovery
                                      (n=81)       (n=73)           (n=9)             (n=1)
Disease Information & Classification  14           31               -                 -
Diagnostic Surveillance               28           17               -                 -
Quality Compliance                    7            14               -                 -
Cohort-Epid.                          6            10               -                 -
Language Discovery & Knowledge        13           4                9                 1
Technical NLP                         6            4                -                 -

Figure 2: Clinical application of publication by year

Diagnostic Surveillance
A large proportion of studies in this category focused on extracting disease information for patient or disease surveillance, e.g. investigating tumour characteristics [Peng et al., 2019, Bozkurt et al., 2019]; changes over time [Hassanpour et al., 2017] and worsening/progression or improvement/response to treatment [Kehl et al., 2019, Chen et al., 2018]; identifying correct anatomical labels [Cotik et al., 2018]; and organ measurements and temporality [Sevenster et al., 2015a]. Studies also investigated pairing measurements between reports [Sevenster et al., 2015b] and linking reports to monitor changes by providing an integrated view of consecutive examinations [Oberkampf et al., 2016]. Studies focused specifically on breast imaging findings investigated aspects such as BI-RADS MRI descriptors (shape, size, margin) and final assessment categories (benign, malignant etc.), e.g. [Liu et al., 2019, Gupta et al., 2018, Castro et al., 2017, Short et al., 2019, Lacson et al., 2017, 2015]. Studies focused on tumour information, e.g. for liver [Yim et al., 2016b] and hepatocellular carcinoma (HCC) [Yim et al., 2017, 2016a], and one study extracted information for structuring subdural haematoma characteristics in reports [Pruitt et al., 2019].

Studies in this category also investigated incidental findings, including on lung imaging [Farjah et al., 2016, Karunakaran et al., 2017, Tan et al., 2018], with [Farjah et al., 2016] additionally extracting the nodule size; for trauma patients [Trivedi et al., 2019]; and looking for silent brain infarction and white matter disease [Fu et al., 2019]. Other studies focused on prioritising/triaging reports, detecting follow-up recommendations and linking a follow-up exam to the initial recommendation report, or bio-surveillance of infectious conditions, such as invasive mould disease.
Disease Information and Classification
Disease Information and Classification publications use reports to identify information that may be aggregated according to classification systems. These publications focused solely on classifying a disease occurrence or extracting information about a disease, with no focus on the overall clinical application. This category was not found in Pons' work. Methods considered a range of conditions including intracranial haemorrhage [Jnawali et al., 2019, Banerjee et al., 2017], aneurysms [Kłos et al., 2018], brain metastases [Deshmukh et al., 2019] and ischaemic stroke [Kim et al., 2019, Garg et al., 2019], and several classified on types and severity of conditions, e.g. [Deshmukh et al., 2019, Shin et al., 2017, Wheater et al., 2019, Gorinski et al., 2019, Alex et al., 2019]. Studies focused on breast imaging considered aspects such as predicting lesion malignancy from BI-RADS descriptors [Bozkurt et al., 2016], breast cancer subtypes [Patel et al., 2017], and extracting or inferring BI-RADS categories, such as [Banerjee et al., 2019a, Miao et al., 2018]. Two studies focused on abdominal images and hepatocellular carcinoma (HCC) staging and CLIP scoring. Chest imaging reports were used to detect pulmonary embolism, e.g. [Dunne et al., 2015, Banerjee et al., 2019b, Chen et al., 2017], bacterial pneumonia [Meystre et al., 2017], and Lung-RADS categories [Beyer et al., 2017]. Functional imaging was also included, such as echocardiograms, extracting measurements to evaluate heart failure, including left ventricular ejection fractions (LVEF). Other studies investigated classification of fractures and abnormalities and the prediction of ICD codes from imaging reports.
Language Discovery and Knowledge Structure
Language Discovery and Knowledge Structure publications investigate the structure of language in reports and how this might be optimised to facilitate decision support and communication. Pons et al. reported on applications of Query-based Case Retrieval, which has similarities to Language Discovery and Knowledge Structure but is not the same. Their category contains studies that retrieve cases and conditions that are not predefined and in some instances could be used for research purposes or are motivated by educational purposes. Our category is broader and encompasses papers that investigated different aspects of language, including variability, complexity, simplification and normalisation, to support extraction and classification tasks.

Studies focus on exploring lexicon coverage and methods to support language simplification for patients, looking at sources such as the consumer health vocabulary [Qenam et al., 2017] and the French lexical network (JDM) [Lafourcade and Ramadier, 2017]. Other works studied the variability and complexity of report language, comparing free-text and structured reports and radiologists. Also investigated was how ontologies and lexicons could be combined with other NLP methods to represent knowledge that can support clinicians. This work included improving report reading efficiency [Hong and Zhang, 2015]; finding similar reports [Comelli et al., 2015]; normalising phrases to support classification and extraction tasks, such as entity recognition in Spanish reports [Cotik et al., 2015]; imputing semantic classes for labelling [Johnson et al., 2015]; supporting search [Mujjiga et al., 2019]; and discovering semantic relations [Lafourcade and Ramadier, 2016].
Quality and Compliance
Quality and Compliance publications use reports to assess the quality and safety of practice and reporting, similar to Pons' category. Works considered how patient indications for scans adhered to guidance, e.g. [Shelmerdine et al., 2019, Mabotuwana et al., 2018b, Dalal et al., 2020, Bobbin et al., 2017, Kwan et al., 2019, Mabotuwana et al., 2018a], protocol selection [Brown and Marotta, 2017, Trivedi et al., 2018, Zhang et al., 2018, Brown and Marotta, 2018, Yan et al., 2016], and the impact of guideline changes on practice, such as [Kang et al., 2019]. Also investigated was diagnostic utilisation and yield, based on clinicians or on patients, which can be useful for hospital planning and for clinicians to study their work patterns, e.g. [Brown and Kachura, 2019]. Other studies in this category looked at specific aspects of quality, such as classification of long bone fractures to support quality improvement in paediatric medicine [Grundmeier et al., 2016], automatic identification of reports that have critical findings for auditing purposes [Heilbrun et al., 2019], deriving a query-based quality measure to compare structured and free-text report variability [Maros et al., 2018], and [Minn et al., 2015], who describe a method to fix errors in gender or laterality in a report.
Cohort and Epidemiology
This category is similar to Pons' earlier review, but we treated the studies in this category slightly differently, attempting to differentiate which papers described methods for creating cohorts for research purposes and which also reported the outcomes of an epidemiological analysis. Ten studies use NLP to create specific cohorts for research purposes and six reported the performance of their tools. Out of these papers, the majority (n=8) created cohorts for specific medical conditions including fatty liver disease [Goldshtein et al., 2020, Redman et al., 2017], hepatocellular cancer [Sada et al., 2016], ureteric stones [Li and Elliot, 2019], vertebral fracture [Tan and Heagerty, 2019], traumatic brain injury [Yadav et al., 2016, Mahan et al., 2019], and leptomeningeal disease secondary to metastatic breast cancer [Brizzi et al., 2019]. Five papers identified cohorts focused on particular radiology findings including ground glass opacities (GGO) [Van Haren et al., 2019], cerebral microbleeds (CMB) [Noorbakhsh-Sabet et al., 2018], pulmonary nodules [Gould et al., 2015, Huhdanpaa et al., 2018], changes in the spine correlated to back pain [Bates et al., 2016] and identifying radiological evidence of people having suffered a fall. One paper focused on identifying abnormalities of specific anatomical regions of the ear within an audiology imaging database [Masino et al., 2016], and another paper aimed to create a cohort of people with any rare disease (within existing ontologies: the Orphanet Rare Disease Ontology, ORDO, and the Radiology Gamuts Ontology, RGO). Lastly, one paper took a different approach of screening reports to create a cohort of people with contraindications for MRI, seeking to prevent iatrogenic events. Amongst the epidemiology studies there were various analytical aims, but they primarily focused on estimating the prevalence or incidence of conditions or imaging findings and looking for associations of these conditions/findings with specific population demographics, associated factors or comorbidities. The focus of one study differed in that it applied NLP to healthcare evaluation, investigating the association of palliative care consultations and measures of high-quality end-of-life (EOL) care [Brizzi et al., 2019].
Technical NLP
This category is for publications that have a primary technical aim that is not focused on a radiology report outcome, e.g. detecting negation in reports, spelling correction [Zech et al., 2019], fact checking [Zhang et al., 2019, Steinkamp et al., 2019], methods for sample selection, and crowd-sourced annotation [Cocos et al., 2017]. This category did not occur in Pons' earlier review.
NLP Methods in Use
NLP methods capture the different techniques an author applied, broken down into rules, machine learning methods, deep learning, ontologies, lexicons and word embeddings. We discriminate machine learning from deep learning, using the former to represent traditional machine learning methods.

Over half of the studies only applied one type of NLP method, and just over a quarter of the studies compared or combined methods in hybrid approaches. The remaining studies either used a bespoke proprietary system or focused on building ontologies or similarity measures (Figure 3). Rule-based method use remains almost constant across the period, whereas use of machine learning decreases and use of deep learning methods rises, from five publications in 2017 to twenty-four publications in 2019 (Figure 4).

Figure 3: NLP method breakdown

Table 9: Breakdown of NLP Method
ML (n=74)             No. Studies    Deep Learning (n=36)    No. Studies
SVM                   34             RNN variants            14
Logistic Regression   23             CNN                     10
Random Forest         18             Other                   5
Naïve Bayes           17             Compare CNN, RNN        4
Maximum Entropy       7              Combine CNN+RNN         3
Decision Trees        4

A variety of machine classifier algorithms were used, with SVM and Logistic Regression being the most common (Table 9). Recurrent Neural Network (RNN) variants were the most common type of deep learning architecture. RNN methods were split between long short-term memory (LSTM) and bidirectional LSTM (Bi-LSTM), bidirectional gated recurrent unit (Bi-GRU), and standard RNN approaches. Four of these studies additionally added a Conditional Random Field (CRF) for the final label generation step. Convolutional Neural Networks (CNN) were the second most common architecture explored. Eight studies additionally used an attention mechanism as part of their deep learning architecture. Other neural approaches included feed-forward neural networks, fully connected neural networks, the proprietary neural system IBM Watson [Trivedi et al., 2018] and Snorkel [Ratner et al., 2018]. Several studies proposed combined architectures, such as [Zhu et al., 2019, Short et al., 2019].

Figure 4: NLP method by year
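To make concrete the kind of SVM and logistic regression classifiers tallied in Table 9, the following is a minimal sketch assuming scikit-learn; the report texts and labels are illustrative placeholders, not data from any reviewed study.

```python
# A bag-of-words baseline of the kind common in the reviewed studies:
# TF-IDF features feeding an SVM and a logistic regression classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

reports = [
    "No acute intracranial haemorrhage identified.",
    "Large right MCA territory infarct with mass effect.",
    "Normal study. No focal abnormality.",
    "Acute subdural haematoma along the left convexity.",
]
labels = [0, 1, 0, 1]  # 1 = abnormal finding present

for clf in (LinearSVC(), LogisticRegression(max_iter=1000)):
    model = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), clf)
    model.fit(reports, labels)
    print(type(clf).__name__,
          model.predict(["Subtle infarct in the left MCA territory."]))
```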
NLP Method Features
Most rule-based and machine classifying approaches used features based on bag-of-words, part-of-speech, term frequency, and phrases, with only two studies alternatively using word embeddings. Three studies used feature engineering with deep learning rather than word embeddings. Thirty-three studies used domain knowledge to support building features for their methods, such as developing lexicons or selecting terms and phrases. Comparison of embedding methods is difficult as many studies did not describe their embedding method. Of those that did, Word2Vec [Mikolov et al., 2013] was the most popular (n=19), followed by GloVe embeddings [Pennington et al., 2014] (n=6), FastText [Mikolov et al., 2018] (n=3), ELMo [Peters et al., 2018] (n=1) and BERT [Devlin et al., 2018] (n=1). Ontologies or lexicon look-ups were used in 100 studies; however, even though publications increased over the period in real terms, 20% fewer studies employed ontologies or lexicons in 2019 compared to 2015. The most widely used resources were UMLS [National Library of Medicine, 2021b] (n=15), RadLex [RSNA, 2021] (n=20) and SNOMED CT [National Library of Medicine, 2021a] (n=14). Most studies used these as features for normalising words and phrases for classification, but this was mainly those using rule-based or machine learning classifiers, with only six studies using ontologies as input to their deep learning architecture. Three of those investigated how existing ontologies can be combined with word embeddings to create domain-specific mappings, with authors pointing to this avoiding the need for large amounts of annotated data. Other approaches looked to extend existing medical resources using a frequent-phrases approach, e.g. [Bulu et al., 2018]. Works also used the derived concepts and relations, visualising these to support activities such as report reading and report querying (e.g. [Hassanpour and Langlotz, 2016, Zhao et al., 2018]).
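For illustration, training domain-specific Word2Vec embeddings on a report corpus looks roughly like the following sketch, assuming gensim; the two tokenised sentences stand in for a real corpus of reports.

```python
# Minimal Word2Vec training on (toy) tokenised radiology sentences.
from gensim.models import Word2Vec

sentences = [
    ["no", "acute", "intracranial", "haemorrhage"],
    ["acute", "subdural", "haematoma", "noted"],
]
model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)
print(model.wv.most_similar("acute", topn=2))
```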
Annotation and Inter-Annotator Agreement
Eighty-nine studies used at least two annotators, 75 did not specify any annotation details, and only one study used a single annotator. Whilst 69 studies used a domain expert for annotation (a clinician or radiologist), only 56 studies reported the inter-annotator agreement. Some studies mention annotation but do not report on agreement or annotators. Inter-annotator agreement values for kappa range from 0.43 to perfect agreement at 1. Whilst most studies reported agreement by Cohen's kappa [Cohen, 1960], some reported precision or percent agreement. Studies reported annotation data sizes differently, e.g. at the sentence or patient level. Studies also considered ground-truth labels from coding schemes such as ICD or BI-RADS categories as annotated data. Of studies which detailed human annotation at the radiology report level, only 45 specified inter-annotator agreement and/or the number of annotators. Annotated report numbers for these studies vary, with 15 papers having annotated fewer than 500 reports, 12 between 500 and 1,000, 15 between 1,000 and 3,000, and 3 between 4,000 and 8,288.
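The pairwise Cohen's kappa most studies reported can be computed as in this short sketch, assuming scikit-learn and illustrative labels from two annotators:

```python
# Pairwise Cohen's kappa between two annotators on toy binary labels.
from sklearn.metrics import cohen_kappa_score

annotator_a = [1, 0, 1, 1, 0, 1, 0, 0]
annotator_b = [1, 0, 1, 0, 0, 1, 1, 0]
print(f"Cohen's kappa = {cohen_kappa_score(annotator_a, annotator_b):.2f}")
```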
Data Sources and Availability
Only 14 studies reported that their data is available, and 15 studies reported that their code is available. Most studies sourced their data from medical institutions, a number of studies did not specify where their data was from, and some studies used publicly available datasets: MIMIC-III (n=5), MIMIC-II (n=1), MIMIC-CXR (n=1), Radcore (n=5) or STRIDE (n=2). Four studies used combined electronic health records, such as clinical notes or pathology reports.

Reporting on data size and splits differed across studies, with some not giving exact data sizes and others reporting numbers of sentences, patients, or mixed data sources rather than radiology reports. Studies reporting data sizes at the radiology report level account for n=135, or 82.32% of the studies (Table 10). The biggest variation of data size by NLP method is in studies that apply other methods or are rule-based. Machine learning also varies in size; however, the median value is lower compared to rule-based methods.

Table 10: NLP method by data size properties (minimum data size, maximum data size and median value), for studies reporting in numbers of radiology reports
NLP Method               Min Size    Max Size      Median
Compare Methods          513         2,167,445     2,845
Hybrid Methods           40          34,926        918
Deep Learning (Only)     120         1,567,581     5,000
Machine Learning (Only)  101         2,977,739     2,531
Rules (Only)             31          10,000,000    8,000
Other                    25          12,377,743    10,000

Table 11: Grouped data size and number of studies in each group, only for studies reporting in numbers of radiology reports
Data Size Group    No. Studies (%)
< 200              9 (6.7)
200 to < 500       6 (4.4)
500 to < ...       ...

NLP Performance and Evaluation Measures
Performance metrics applied for evaluation of methods vary widely, with authors using precision (positive predictive value, PPV), recall (sensitivity), specificity, the area under the curve (AUC) or accuracy. We observed a wide variety in the evaluation methodology employed concerning test or validation datasets. Different approaches were taken in generating splits for testing and validation, including k-fold cross-validation. Ninety-nine studies reported on training and test data splits, of which only 59 studies included a validation set. Only 10 studies validated their algorithm using an external dataset from another institution, another modality, or a different patient population. The most widely used metrics for reporting performance were precision (PPV) and recall (sensitivity), reported in 47% of studies. However, even though many studies compared methods and reported on the top-performing method, very few studies carried out significance testing on these comparisons. Issues of heterogeneity make it difficult and unrealistic to compare performance between methods applied; hence, we use summary measures as a broad overview (Figure 5). Performance reported varies, but both the mean and median values for the F1 score appear higher for methods using rule-based only or deep learning only methods. Whilst differences are less discernible between F1 scores for application areas, Diagnostic Surveillance looks on average lower than other categories.

Figure 5: Application Category and NLP Method, Mean and Median Summaries. Mean value is indicated by a vertical bar, the box shows error bars and the asterisk is the median value.
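For reference, the standard definitions relating the headline metrics discussed above, with TP, FP and FN denoting true positives, false positives and false negatives:

```latex
\mathrm{Precision} = \frac{TP}{TP + FP}, \qquad
\mathrm{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}
```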
Discussion and Future Directions
Our work shows there has been a considerable increase in the number of publications using NLP on radiology reports over the recent time period. Compared to the 67 publications retrieved in the earlier review of [Pons et al., 2016], we retrieved 164 publications. In this section we discuss and offer some insight into the observations and trends in how NLP is being applied to radiology, and make some recommendations that may benefit the field going forward.
Clinical Applications and NLP Methods in Radiology
The clinical applications of the publications are similar to the earlier review of Pons et al., but whilst we observe an increase in research output, we also highlight that there appears to be even less focus on clinical application compared to their review. Like many other fields applying NLP, the use of deep learning has increased, with RNN architectures being the most popular. This is also observed in a review of NLP in clinical text [Wu et al., 2020]. However, although deep learning use increases, rules and traditional machine classifiers are still prevalent and often used as baselines to compare deep learning architectures against. One reason for traditional methods remaining popular is their interpretability compared to deep learning models. Understanding the features that drive a model prediction can support decision-making in the clinical domain, but the complex layers of non-linear data transformations deep learning is composed of do not easily support transparency [Shickel et al., 2018]. This may also help explain why, in our synthesis of the literature, we observed less focus on discussing clinical application and more emphasis on the disease classification or information task only. Advances in the interpretability of deep learning models are critical to their adoption in clinical practice.

Other challenges exist for deep learning, such as only having access to small or imbalanced datasets. Chen et al. [Chen et al., 2019] review deep learning methods within healthcare and point to these challenges resulting in poor performance, while the same datasets can perform well with traditional machine learning methods. We found several studies highlight this: when data is scarce or datasets imbalanced, they introduced hybrid approaches of rules and deep learning to improve performance, particularly in the
Diagnostic Surveillance category. Yang et al. [Yang et al., 2018] observed rules performing better for some entity types, such as time and size, which are proportionally lower than some of the other entities in their train and test sets; hence they combine a bidirectional LSTM and CRF with rules for entity recognition. Peng et al. [Peng et al., 2019] comment that combining rules and the neural architecture complements each other, with deep learning being more balanced between precision and recall, but the rule-based method having higher precision and lower recall. The authors reason that this provides better performance as rules can capture rare disease cases, particularly when multi-class labelling is needed, whilst deep learning architectures perform worse in instances with fewer data points (a schematic sketch of this hybrid pattern is given at the end of this subsection).

In addition to its need for large-scale data, deep learning can be computationally costly. The use of pre-trained models and embeddings may alleviate some of this burden. Pre-trained models often only require fine-tuning, which can reduce computation cost. Language comprehension pre-learned from other tasks can then be inherited from the parent models, meaning fewer domain-specific labelled examples may be needed [Wood et al., 2020]. This use of pre-trained information also supports generalisability, e.g. [Banerjee et al., 2019b] show that their model trained on one dataset can generalise to other institutional datasets.

Embedding use has increased, which is expected with the application of deep learning approaches, but many rule-based and machine classifiers continue to use traditional count-based features, e.g. bag-of-words and n-grams. Recent evidence [Ong et al., 2020] suggests that the trend of continuing to use feature engineering with traditional machine learning methods does produce better performance on radiology reports than using domain-specific word embeddings. Banerjee et al. [Banerjee et al., 2017] found that there was not much difference between a uni-gram approach and a Word2vec embedding, hypothesising this was due to their narrow domain, intracranial haemorrhage. However, the NLP research field has seen a move towards bidirectional encoder representations from transformers (BERT) based embedding models, which is not reflected in our analysis, with only one study using BERT-generated embeddings [Deshmukh et al., 2019]. Embeddings from BERT are thought to be superior as they can deliver better contextual representations and result in improved task performance. Whilst more publications since our review period have used BERT-based embeddings with radiology reports, e.g. [Wood et al., 2020, Smit et al., 2020a], not all outperform traditional methods [Grivas et al., 2020]. Recent evidence shows that embeddings generated by BERT fail to show a generalisable understanding of negation [Ettinger, 2020], an essential factor in interpreting radiology reports effectively. Specialised BERT models have been introduced, such as ClinicalBERT [Alsentzer et al., 2019] or BlueBERT [Smit et al., 2020a]. BlueBERT has been shown to outperform ClinicalBERT when considering chest radiology [Smit et al., 2020b], but more exploration of the performance gains versus the benefits of generalisability is needed for radiology text.

All NLP models have in common that they need large amounts of labelled data for model training [Yasaka and Abe, 2018].
Several studies [Percha et al., 2018, Tahmasebi et al., 2019, Banerjee et al., 2018] explored combining word embeddings and ontologies to create domain-specific mappings, and they suggest this can avoid a need for large amounts of annotated data. Additionally, [Percha et al., 2018, Tahmasebi et al., 2019] highlight that such combinations could boost coverage and performance compared to more conventional techniques for concept normalisation.

The number of publications using medical lexical knowledge resources is still relatively low, even though a recent trend in the general NLP field is to enhance deep learning with external knowledge [Young et al., 2018]. This was also observed by [Wu et al., 2020], where only 18% of the deep learning studies in their review utilised knowledge resources. Although pre-training supports learning previously known facts, it could introduce unwanted bias, hindering performance. The inclusion of domain expertise through resources such as medical lexical knowledge may help reduce this unwanted bias [Wu et al., 2020]. Exploration of how this domain expertise can be incorporated into deep learning architectures could in future improve performance when less labelled data is available.
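Returning to the hybrid rule/model pattern flagged earlier in this subsection, the following schematic sketch shows one common arrangement: high-precision rules fire first, capturing rare but lexically distinctive cases, and a learned classifier handles everything else. The patterns and the classifier stub are illustrative placeholders, not any reviewed study's actual system.

```python
# Schematic rule-first hybrid: rules are precise but low-recall, so a
# trained model acts as the fallback for reports no rule matches.
import re

RARE_CASE_RULES = {
    "subdural_haematoma": re.compile(r"\bsubdural\s+haematoma\b", re.I),
    "invasive_mould": re.compile(r"\binvasive\s+mould\b", re.I),
}

def classify(report, model_predict):
    for label, pattern in RARE_CASE_RULES.items():
        if pattern.search(report):
            return label          # rule hit: take the high-precision answer
    return model_predict(report)  # otherwise defer to the trained model

# usage with a trivial stand-in model
print(classify("Acute subdural haematoma on the left.", lambda r: "other"))
```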
Task Knowledge
Knowledge about the disease area of interest, and how aspects of this disease are linguistically expressed, is useful and could promote better-performing solutions. Whilst [Donnelly et al., 2019] find high variability between radiologists, with metric values (e.g. number of syntactic or clinical terms based on ontology mapping) being significantly greater in free-text than structured reports, [Xie et al., 2019], who look specifically at anatomical areas, find less evidence for variability. Zech et al. [Zech et al., 2018] suggest that the highly specialised nature of each imaging modality creates different sub-languages, and the ability to discover these labels (i.e. disease mentions) reflects the consistency with which labels are referred to. For example, edema is referred to very consistently whereas other labels are not, such as infarction/ischaemic. Understanding the language and the context of entity mentions could help promote novel ideas on how to solve problems more effectively. For example, [Yim et al., 2017] discuss how the accuracy of predicting malignancy is affected by cues being outside their window of consideration, and [Yim et al., 2018] observe problems of co-reference resolution within a report due to long-range dependencies. Both these studies use traditional NLP approaches, but we observed novel neural architectures being proposed to improve performance in similar tasks, specifically capturing long-range context and dependency learning, e.g. [Zhu et al., 2019, Short et al., 2019]. This understanding requires close cooperation of healthcare professionals and data scientists, which is different to some other fields where more disconnection is present [Chen et al., 2019].
Study Heterogeneity, a Need for Reporting Standards
Most studies reviewed could be described as proof-of-concept and were not trialled in a clinical setting. Pons et al. [Pons et al., 2016] hypothesised that a lack of clinical application may stem from uncertainty around minimal performance requirements hampering implementations, evidence-based practice requiring justification and transparency of decisions, and the inability to compare to human performance, as human agreement is often unknown. These hypotheses are still valid, and we see little evidence that these problems are solved.

Human annotation is generally considered the gold standard for measuring human performance, and whilst many studies reported that they used annotated data, overall, reporting was inconsistent. Steps were undertaken to measure inter-annotator agreement (IAA), but in many studies this was not directly comparable to the evaluation undertaken of the NLP methods. The size of the data being used to draw experimental conclusions from is important, and accurate reporting of these measures is essential to ensure reproducibility and comparison in further studies. Reporting on the training, test and validation splits varied, with some studies not giving details and not using held-out validation sets.

Most studies use retrospective data from single institutions, but this can lead to a model over-fitting and, thus, not generalising well when applied in a new setting. Overcoming the problem of data availability is challenging due to privacy and ethics concerns, but essential to ensure that the performance of models can be investigated across institutions, modalities, and methods. Availability of data would allow agreed benchmarks to be developed within the field, against which algorithm improvements can be measured. External validation of applied methods was extremely low, although this is likely due to the availability of external datasets. Making code available would enable researchers to report how external systems perform on their data. However, only 15 studies reported that their code is available. To be able to compare systems, there is a need for common datasets to be available to benchmark and compare systems against.

Whilst reported figures for precision and recall generally look high, more evidence is needed for accurate comparison to human performance. A wide variety of performance measures were used, with some studies only reporting one measure, e.g. accuracy or F1 scores, with these likely representing the best performance obtained. Individual studies are often not directly comparable on such measures, but nonetheless clarity and consistency in reporting are desirable. Many studies making model comparisons did not carry out any significance testing for these comparisons.

We make the following recommendations to help move the field forward, enable more inter-study comparisons, and increase study reproducibility:

1. Clarity in reporting study properties is required: (a) Data characteristics, including the size and type of dataset, should be detailed, e.g. the number of reports, sentences or patients and, if patients, how many reports per patient. The training, test and validation data split should be evident, as should the source of the data. (b) Annotation characteristics, including the methodology used to develop the annotation, should be reported, e.g. annotation set size and annotator details (how many, their expertise). (c) Performance metrics should include a range of metrics (precision, recall, F1, accuracy), not just one overall value.
2. Significance testing should be carried out when a comparison between methods is made (a minimal sketch is given after this list).

3. Data and code availability are encouraged. While making data available will often be challenging due to privacy concerns, researchers should make code available to enable inter-study comparisons and external validation of methods.

4. Common datasets should be used to benchmark and compare systems.
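As a minimal sketch of the kind of significance test recommendation 2 calls for, the following paired bootstrap compares two systems' F1 on the same test set; the predictions and labels are synthetic placeholders.

```python
# Paired bootstrap test: resample test items with replacement and check how
# often system B matches or beats system A on F1.
import numpy as np
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, 200)
pred_a = np.where(rng.random(200) < 0.8, y_true, 1 - y_true)  # stronger system
pred_b = np.where(rng.random(200) < 0.7, y_true, 1 - y_true)  # weaker system

observed = f1_score(y_true, pred_a) - f1_score(y_true, pred_b)
deltas = []
for _ in range(2000):
    idx = rng.integers(0, len(y_true), len(y_true))  # resampled test set
    deltas.append(f1_score(y_true[idx], pred_a[idx]) -
                  f1_score(y_true[idx], pred_b[idx]))
p = float(np.mean(np.array(deltas) <= 0))  # one-sided p-value
print(f"delta F1 = {observed:.3f}, bootstrap p = {p:.3f}")
```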
Limitations of Study
Publication search is subject to bias in search methods, and it is likely that our search strategy inevitably missed some publications. Whilst we tried to be precise and objective during our review process, some of the data collected, and the assignment of publications to categories, was difficult to agree on and subjective. For example, many of the publications could have belonged to more than one category. One reason for this was how diverse in structure the content was, which was in some ways reflected by the different domains papers were published in. It is also possible that certain keywords were missed in recording data elements due to the reviewers' own biases and research experience.
Conclusions
This paper presents a systematic review of publications using NLP on radiology reports during the period 2015 to October 2019. We show there has been substantial growth in the field, particularly amongst researchers using deep learning methods. Whilst deep learning use has increased, as seen in NLP research in general, it faces challenges of lower performance when data is scarce or when labelled data is unavailable, and it is not widely used in clinical practice, perhaps due to the difficulties in the interpretability of such models. Traditional machine learning and rule-based methods are, therefore, still widely in use. The use of domain expertise, such as medical lexical knowledge, should be explored further to enhance performance when data is scarce. The clinical domain faces challenges due to privacy and ethics in sharing data, but overcoming this would enable the development of benchmarks to measure algorithm performance and test model robustness across institutions. Commonly agreed datasets to compare the performance of tools against would help support the community in inter-study comparisons and validation of systems. The work we present here has the potential to inform researchers about applications of NLP to radiology and to lead to more reliable and responsible research in the domain.
Acknowledgements
Not applicable
Funding
This research was supported by the Alan Turing Institute, MRC, HDR UK and the Chief Scientist Office. B.A., A.C., D.D., A.G. and C.G. have been supported by the Alan Turing Institute via Turing Fellowships (B.A., C.G.) and Turing project funding (EPSRC grant EP/N510129/1). A.G. was also funded by an MRC Mental Health Data Pathfinder Award (MRC-MCPC17209). H.W. is an MRC/Rutherford Fellow, HDR UK (MR/S004149/1). H.D. is supported by the HDR UK National Phenomics Resource Project. V.S-P. is supported by the HDR UK National Text Analytics Implementation Project. W.W. is supported by a Scottish Senior Clinical Fellowship (CAF/17/01).
Abbreviations
NLP - natural language processing
e.g. - for example
ICD - International Classification of Diseases
BI-RADS - Breast Imaging-Reporting and Data System
IAA - inter-annotator agreement
No. - number
UMLS - Unified Medical Language System
ELMo - Embeddings from Language Models
BERT - bidirectional encoder representations from transformers
SVM - support vector machine
CNN - convolutional neural network
LSTM - long short-term memory
Bi-LSTM - bi-directional long short-term memory
Bi-GRU - bi-directional gated recurrent unit
CRF - conditional random field
GloVe - Global Vectors for Word Representation
Bibliography
Beatrice Alex, Claire Grover, Richard Tobin, Cathie Sudlow, Grant Mair, and WilliamWhiteley. Text mining brain imaging reports.
Journal of Biomedical Semantics , 10(1):23, November 2019. ISSN 2041-1480. doi: 10.1186/s13326-019-0211-7. URL https://doi.org/10.1186/s13326-019-0211-7 .Emily Alsentzer, John Murphy, William Boag, Wei-Hung Weng, Di Jindi, Tristan Nau-mann, and Matthew McDermott. Publicly Available Clinical BERT Embeddings. In
Proceedings of the 2nd Clinical Natural Language Processing Workshop , pages 72–78,Minneapolis, Minnesota, USA, June 2019. Association for Computational Linguistics.doi: 10.18653/v1/W19-1909. URL .Imon Banerjee, Sriraman Madhavan, Roger Eric Goldman, and Daniel L. Rubin. Intel-ligent Word Embeddings of Free-Text Radiology Reports.
AMIA Annual SymposiumProceedings , pages 411–420, 2017. ISSN 1942-597X. URL .Imon Banerjee, Matthew C. Chen, Matthew P. Lungren, and Daniel L. Rubin. Radiologyreport annotation using intelligent word embeddings: Applied to multi-institutionalchest CT cohort.
Journal of Biomedical Informatics , 77:11–20, January 2018. ISSN1532-0464. doi: 10.1016/j.jbi.2017.11.012. URL .Imon Banerjee, Selen Bozkurt, Emel Alkim, Hersh Sagreiya, Allison W. Kurian, andDaniel L. Rubin. Automatic inference of BI-RADS final assessment categoriesfrom narrative mammography report findings.
Journal of Biomedical Informatics ,92:103137, April 2019a. ISSN 1532-0464. doi: 10.1016/j.jbi.2019.103137. URL .Imon Banerjee, Yuan Ling, Matthew C. Chen, Sadid A. Hasan, Curtis P. Langlotz,Nathaniel Moradzadeh, Brian Chapman, Timothy Amrhein, David Mong, Daniel L.Rubin, Oladimeji Farri, and Matthew P. Lungren. Comparative effectiveness ofconvolutional neural network (CNN) and recurrent neural network (RNN) archi-tectures for radiology text report classification.
Artificial Intelligence in Medicine ,97:79–88, June 2019b. ISSN 0933-3657. doi: 10.1016/j.artmed.2018.11.004. URL .Jonathan Bates, Samah J. Fodeh, Cynthia A. Brandt, and Julie A. Womack. Classifi-cation of radiology reports for falls in an HIV study cohort.
Journal of the Ameri-can Medical Informatics Association , 23(e1):e113–e117, April 2016. ISSN 1067-5027.doi: 10.1093/jamia/ocv155. URL https://academic.oup.com/jamia/article/23/e1/e113/2379897 . 22ebastian E. Beyer, Brady J. McKee, Shawn M. Regis, Andrea B. McKee, SebastianFlacke, Gilan El Saadawi, and Christoph Wald. Automatic Lung-RADS ™ classificationwith a natural language processing system. Journal of Thoracic Disease , 9(9):3114–3122, September 2017. ISSN 2072-1439. doi: 10.21037/jtd.2017.08.13. URL .Mark D. Bobbin, Ivan K. Ip, V. Anik Sahni, Atul B. Shinagare, and Ramin Kho-rasani. Focal Cystic Pancreatic Lesion Follow-up Recommendations After Publica-tion of ACR White Paper on Managing Incidental Findings.
Journal of the Amer-ican College of Radiology , 14(6):757–764, June 2017. ISSN 1546-1440. doi: 10.1016/j.jacr.2017.01.044. URL .Selen Bozkurt, Francisco Gimenez, Elizabeth S. Burnside, Kemal H. Gulkesen, andDaniel L. Rubin. Using automatically extracted information from mammographyreports for decision-support.
Journal of Biomedical Informatics , 62:224–231, Au-gust 2016. ISSN 1532-0464. doi: 10.1016/j.jbi.2016.07.001. URL .Selen Bozkurt, Emel Alkim, Imon Banerjee, and Daniel L. Rubin. Automated Detectionof Measurements and Their Descriptors in Radiology Reports Using a Hybrid NaturalLanguage Processing Algorithm.
Journal of Digital Imaging , 32(4):544–553, August2019. ISSN 1618-727X. doi: 10.1007/s10278-019-00237-9. URL https://doi.org/10.1007/s10278-019-00237-9 .Simon Briscoe, Alison Bethel, and Morwenna Rogers. Conduct and reporting of citationsearching in Cochrane systematic reviews: A cross-sectional study.
Research SynthesisMethods , 11(2):169–180, 2020. ISSN 1759-2887. doi: 10.1002/jrsm.1355. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/jrsm.1355 .Kate Brizzi, Sophia N. Zupanc, Brooks V. Udelsman, James A. Tulsky, Alexi A.Wright, Hanneke Poort, and Charlotta Lindvall. Natural Language Processing toAssess Palliative Care and End-of-Life Process Measures in Patients With BreastCancer With Leptomeningeal Disease.
American Journal of Hospice and PalliativeMedicine , 37(5):371–376, 2019. doi: https://doi.org/10.1177/1049909119885585. URL https://journals.sagepub.com/doi/abs/10.1177/1049909119885585 .A. D. Brown and J. R. Kachura. Natural Language Processing of Radiology Reportsin Patients With Hepatocellular Carcinoma to Predict Radiology Resource Utiliza-tion.
Journal of the American College of Radiology , 16(6):840–844, June 2019. ISSN1546-1440. doi: 10.1016/j.jacr.2018.12.004. URL .Andrew D. Brown and Thomas R. Marotta. A Natural Language Processing-based Model to Automate MRI Brain Protocol Selection and Prioritization.
Aca-demic Radiology , 24(2):160–166, February 2017. ISSN 1076-6332. doi: 10.1016/23.acra.2016.09.013. URL .Andrew D. Brown and Thomas R. Marotta. Using machine learning for sequence-level automated MRI protocol selection in neuroradiology.
Journal of the Amer-ican Medical Informatics Association , 25(5):568–571, May 2018. ISSN 1067-5027.doi: 10.1093/jamia/ocx125. URL https://academic.oup.com/jamia/article/25/5/568/4569611 .Hakan Bulu, Dorothy A. Sippo, Janie M. Lee, Elizabeth S. Burnside, and Daniel L.Rubin. Proposing New RadLex Terms by Analyzing Free-Text Mammography Re-ports.
Journal of Digital Imaging , 31(5):596–603, October 2018. ISSN 1618-727X. doi:10.1007/s10278-018-0064-0. URL https://doi.org/10.1007/s10278-018-0064-0 .Tianrun Cai, Andreas A. Giannopoulos, Sheng Yu, Tatiana Kelil, Beth Ripley,Kanako K. Kumamaru, Frank J. Rybicki, and Dimitrios Mitsouras. Natural LanguageProcessing Technologies in Radiology Research and Clinical Applications.
Radio-Graphics , 36(1):176–191, January 2016. ISSN 0271-5333. doi: 10.1148/rg.2016150080.URL https://pubs.rsna.org/doi/full/10.1148/rg.2016150080 .Sergio M. Castro, Eugene Tseytlin, Olga Medvedeva, Kevin Mitchell, ShyamVisweswaran, Tanja Bekhuis, and Rebecca S. Jacobson. Automated annotationand classification of BI-RADS assessment from radiology reports.
Journal ofBiomedical Informatics , 69:177–187, May 2017. ISSN 1532-0464. doi: 10.1016/j.jbi.2017.04.011. URL .David Chen, Sijia Liu, Paul Kingsbury, Sunghwan Sohn, Curtis B. Storlie, Elizabeth B.Habermann, James M. Naessens, David W. Larson, and Hongfang Liu. Deep learningand alternative learning strategies for retrospective real-world clinical data. npj DigitalMedicine , 2(1):1–5, May 2019. ISSN 2398-6352. doi: 10.1038/s41746-019-0122-0. URL .Matthew C. Chen, Robyn L. Ball, Lingyao Yang, Nathaniel Moradzadeh, Brian E. Chap-man, David B. Larson, Curtis P. Langlotz, Timothy J. Amrhein, and Matthew P.Lungren. Deep Learning to Classify Radiology Free-Text Reports.
Radiology , 286(3):845–852, November 2017. ISSN 0033-8419. doi: 10.1148/radiol.2017171115. URL https://pubs.rsna.org/doi/full/10.1148/radiol.2017171115 .Po-Hao Chen, Hanna Zafar, Maya Galperin-Aizenberg, and Tessa Cook. Integrat-ing Natural Language Processing and Machine Learning Algorithms to CategorizeOncologic Response in Radiology Reports.
Journal of Digital Imaging , 31(2):178–184, April 2018. ISSN 1618-727X. doi: 10.1007/s10278-017-0027-x. URL https://doi.org/10.1007/s10278-017-0027-x .24nne Cocos, Ting Qian, Chris Callison-Burch, and Aaron J. Masino. Crowd con-trol: Effectively utilizing unscreened crowd workers for biomedical data annota-tion.
Journal of Biomedical Informatics , 69:86–92, May 2017. ISSN 1532-0464.doi: 10.1016/j.jbi.2017.04.003. URL .Jacob Cohen. A Coefficient of Agreement for Nominal Scales.
Educational and Psy-chological Measurement , 20(1):37–46, April 1960. ISSN 0013-1644. doi: 10.1177/001316446002000104. URL https://doi.org/10.1177/001316446002000104 .A. Comelli, L. Agnello, and S. Vitabile. An ontology-based retrieval system for mammo-graphic reports. In ,pages 1001–1006, Larnaca, July 2015. IEEE. doi: 10.1109/ISCC.2015.7405644.Viviana Cotik, Dario Filippo, and Jose Castano. An Approach for Automatic Classifica-tion of Radiology Reports in Spanish.
Studies in Health Technology and Informatics ,216:634–638, jan 2015. ISSN 0926-9630, 1879-8365. URL https://europepmc.org/article/med/26262128 .Viviana Cotik, Horacio Rodr´ıguez, and Jorge Vivaldi. Spanish Named Entity Recogni-tion in the Biomedical Domain. In Juan Antonio Lossio-Ventura, Denisse Mu˜nante,and Hugo Alatrista-Salas, editors,
Sandeep Dalal, Vadiraj Hombal, Wei-Hung Weng, Gabe Mankovich, Thusitha Mabotuwana, Christopher S. Hall, Joseph Fuller, Bruce E. Lehnert, and Martin L. Gunn. Determining Follow-Up Imaging Study Using Radiology Reports. Journal of Digital Imaging, 33(1):121–130, February 2020. ISSN 1618-727X. doi: 10.1007/s10278-019-00260-w. URL https://doi.org/10.1007/s10278-019-00260-w.
Neil Deshmukh, Selin Gumustop, Romane Gauriau, Varun Buch, Bradley Wright, Christopher Bridge, Ram Naidu, Katherine Andriole, and Bernardo Bizzo. Semi-Supervised Natural Language Approach for Fine-Grained Classification of Medical Reports. arXiv:1910.13573 [cs.LG], November 2019. URL http://arxiv.org/abs/1910.13573.
Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Lane F. Donnelly, Robert Grzeszczuk, Carolina V. Guimaraes, Wei Zhang, and George S. Bisset III. Using a Natural Language Processing and Machine Learning Algorithm Program to Analyze Inter-Radiologist Report Style Variation and Compare Variation Between Radiologists When Using Highly Structured Versus More Free Text Reporting. Current Problems in Diagnostic Radiology, 48(6):524–530, November 2019. ISSN 0363-0188. doi: 10.1067/j.cpradiol.2018.09.005.
Ruth M. Dunne, Ivan K. Ip, Sarah Abbett, Esteban F. Gershanik, Ali S. Raja, Andetta Hunsaker, and Ramin Khorasani. Effect of Evidence-based Clinical Decision Support on the Use and Yield of CT Pulmonary Angiographic Imaging in Hospitalized Patients. Radiology, 276(1):167–174, February 2015. ISSN 0033-8419. doi: 10.1148/radiol.15141208. URL https://pubs.rsna.org/doi/full/10.1148/radiol.15141208.
Allyson Ettinger. What BERT Is Not: Lessons from a New Suite of Psycholinguistic Diagnostics for Language Models. Transactions of the Association for Computational Linguistics, 8:34–48, January 2020. doi: 10.1162/tacl_a_00298. URL https://doi.org/10.1162/tacl_a_00298.
Farhood Farjah, Scott Halgrim, Diana S. M. Buist, Michael K. Gould, Steven B. Zeliadt, Elizabeth T. Loggers, and David S. Carrell. An Automated Method for Identifying Individuals with a Lung Nodule Can Be Feasibly Implemented Across Health Systems. eGEMs, 4(1):1254, August 2016. ISSN 2327-9214. doi: 10.13063/2327-9214.1254.
Joseph L. Fleiss. Measuring nominal scale agreement among many raters. Psychological Bulletin, 76(5):378–382, 1971. ISSN 1939-1455 (electronic), 0033-2909 (print). doi: 10.1037/h0031619.
Sunyang Fu, Lester Y. Leung, Yanshan Wang, Anne-Olivia Raulli, David F. Kallmes, Kristin A. Kinsman, Kristoff B. Nelson, Michael S. Clark, Patrick H. Luetmer, Paul R. Kingsbury, David M. Kent, and Hongfang Liu. Natural Language Processing for the Identification of Silent Brain Infarcts From Neuroimaging Reports. JMIR Medical Informatics, 7(2):e12109, 2019. doi: 10.2196/12109. URL https://medinform.jmir.org/2019/2/e12109/.
Ravi Garg, Elissa Oh, Andrew Naidech, Konrad Kording, and Shyam Prabhakaran. Automating Ischemic Stroke Subtype Classification Using Machine Learning and Natural Language Processing. Journal of Stroke and Cerebrovascular Diseases, 28(7):2045–2051, July 2019. ISSN 1052-3057. doi: 10.1016/j.jstrokecerebrovasdis.2019.02.004.
Jean-François Gehanno, Laetitia Rollin, and Stefan Darmoni. Is the coverage of Google Scholar enough to be used alone for systematic reviews. BMC Medical Informatics and Decision Making, 13:7, 2013. doi: 10.1186/1472-6947-13-7.
Inbal Goldshtein, Gabriel Chodick, Ilan Kochba, Nitsan Gal, Muriel Webb, and Oren Shibolet. Identification and Characterization of Nonalcoholic Fatty Liver Disease. Clinical Gastroenterology and Hepatology, 18(8):1887–1889, July 2020. ISSN 1542-3565. doi: 10.1016/j.cgh.2019.08.007.
Philip John Gorinski, Honghan Wu, Claire Grover, Richard Tobin, Conn Talbot, Heather Whalley, Cathie Sudlow, William Whiteley, and Beatrice Alex. Named Entity Recognition for Electronic Health Records: A Comparison of Rule-based and Machine Learning Approaches. arXiv:1903.03985 [cs.CL], June 2019. URL http://arxiv.org/abs/1903.03985.
Michael K. Gould, Tania Tang, In-Lu Amy Liu, Janet Lee, Chengyi Zheng, Kim N. Danforth, Anne E. Kosco, Jamie L. Di Fiore, and David E. Suh. Recent Trends in the Identification of Incidental Pulmonary Nodules. American Journal of Respiratory and Critical Care Medicine, 192(10):1208–1214, July 2015. ISSN 1073-449X. doi: 10.1164/rccm.201505-0990OC.
A. Grivas, B. Alex, C. Grover, R. Tobin, and W. Whiteley. Not a cute stroke: Analysis of Rule- and Neural Network-Based Information Extraction Systems for Brain Radiology Reports. In Proceedings of the 11th International Workshop on Health Text Mining and Information Analysis, 2020.
Robert W. Grundmeier, Aaron J. Masino, T. Charles Casper, Jonathan M. Dean, Jamie Bell, Rene Enriquez, Sara Deakyne, James M. Chamberlain, and Elizabeth R. Alpern. Identification of Long Bone Fractures in Radiology Reports Using Natural Language Processing to Support Healthcare Quality Improvement. Applied Clinical Informatics, 7(4):1051–1068, November 2016. ISSN 1869-0327. doi: 10.4338/ACI-2016-08-RA-0129.
Anupama Gupta, Imon Banerjee, and Daniel L. Rubin. Automatic information extraction from unstructured mammography reports using distributed semantics. Journal of Biomedical Informatics, 78:78–86, February 2018. ISSN 1532-0464. doi: 10.1016/j.jbi.2017.12.016.
A. W. Harzing. Publish or Perish. 2007. URL https://harzing.com/resources/publish-or-perish.
Saeed Hassanpour and Curtis P. Langlotz. Unsupervised Topic Modeling in a Large Free Text Radiology Report Repository. Journal of Digital Imaging, 29(1):59–62, February 2016. ISSN 1618-727X. doi: 10.1007/s10278-015-9823-3. URL https://doi.org/10.1007/s10278-015-9823-3.
Saeed Hassanpour, Graham Bay, and Curtis P. Langlotz. Characterization of Change and Significance for Clinical Findings in Radiology Reports Through Natural Language Processing. Journal of Digital Imaging, 30(3):314–322, June 2017. ISSN 1618-727X. doi: 10.1007/s10278-016-9931-8. URL https://doi.org/10.1007/s10278-016-9931-8.
Marta E. Heilbrun, Brian E. Chapman, Evan Narasimhan, Neel Patel, and Danielle Mowery. Feasibility of Natural Language Processing–Assisted Auditing of Critical Findings in Chest Radiology. Journal of the American College of Radiology, 16(9, Part B):1299–1304, September 2019. ISSN 1546-1440. doi: 10.1016/j.jacr.2019.05.038.
Yi Hong and Jin Zhang. Investigation of Terminology Coverage in Radiology Reporting Templates and Free-text Reports. International Journal of Knowledge Content Development & Technology, 5:5–14, 2015. doi: 10.5865/IJKCT.2015.5.1.005.
Hannu T. Huhdanpaa, W. Katherine Tan, Sean D. Rundell, Pradeep Suri, Falgun H. Chokshi, Bryan A. Comstock, Patrick J. Heagerty, Kathryn T. James, Andrew L. Avins, Srdjan S. Nedeljkovic, David R. Nerenz, David F. Kallmes, Patrick H. Luetmer, Karen J. Sherman, Nancy L. Organ, Brent Griffith, Curtis P. Langlotz, David Carrell, Saeed Hassanpour, and Jeffrey G. Jarvik. Using Natural Language Processing of Free-Text Radiology Reports to Identify Type 1 Modic Endplate Changes. Journal of Digital Imaging, 31(1):84–90, February 2018. ISSN 1618-727X. doi: 10.1007/s10278-017-0013-3. URL https://doi.org/10.1007/s10278-017-0013-3.
K. Jnawali, M. R. Arbabshirani, A. E. Ulloa, N. Rao, and A. A. Patel. Automatic Classification of Radiological Report for Intracranial Hemorrhage. In 2019 IEEE International Conference on Semantic Computing (ICSC), pages 187–190, Newport Beach, CA, USA, January 2019. IEEE. doi: 10.1109/ICOSC.2019.8665578.
E. Johnson, W. C. Baughman, and G. Ozsoyoglu. A method for imputation of semantic class in diagnostic radiology text. In 2015 IEEE International Conference on Bioinformatics and Biomedicine (BIBM), pages 750–755, Washington, DC, November 2015. IEEE. doi: 10.1109/BIBM.2015.7359780.
Stella K. Kang, Kira Garry, Ryan Chung, William H. Moore, Eduardo Iturrate, Jordan L. Swartz, Danny C. Kim, Leora I. Horwitz, and Saul Blecker. Natural Language Processing for Identification of Incidental Pulmonary Nodules in Radiology Reports. Journal of the American College of Radiology, 16(11):1587–1594, November 2019. ISSN 1546-1440. doi: 10.1016/j.jacr.2019.04.026.
B. Karunakaran, D. Misra, K. Marshall, D. Mathrawala, and S. Kethireddy. Closing the loop — Finding lung cancer patients using NLP. In 2017 IEEE International Conference on Big Data (Big Data), pages 2452–2461, Boston, MA, December 2017. IEEE. doi: 10.1109/BigData.2017.8258203.
Kenneth L. Kehl, Haitham Elmarakeby, Mizuki Nishino, Eliezer M. Van Allen, Eva M. Lepisto, Michael J. Hassett, Bruce E. Johnson, and Deborah Schrag. Assessment of Deep Natural Language Processing in Ascertaining Oncologic Outcomes From Radiology Reports. JAMA Oncology, 5(10):1421–1429, October 2019. ISSN 2374-2437. doi: 10.1001/jamaoncol.2019.1800. URL https://doi.org/10.1001/jamaoncol.2019.1800.
Chulho Kim, Vivienne Zhu, Jihad Obeid, and Leslie Lenert. Natural language processing and machine learning algorithm to identify brain MRI reports with acute ischemic stroke. PLOS ONE, 14(2):e0212778, February 2019. ISSN 1932-6203. doi: 10.1371/journal.pone.0212778. URL https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0212778.
Kory Kreimeyer, Matthew Foster, Abhishek Pandey, Nina Arya, Gwendolyn Halford, Sandra F. Jones, Richard Forshee, Mark Walderhaug, and Taxiarchis Botsis. Natural language processing systems for capturing and standardizing unstructured clinical information: A systematic review. Journal of Biomedical Informatics, 73:14–29, 2017. ISSN 1532-0480. doi: 10.1016/j.jbi.2017.07.012.
Janice L. Kwan, Darya Yermak, Lezlie Markell, Narinder S. Paul, Kaveh J. Shojania, and Peter Cram. Follow Up of Incidental High-Risk Pulmonary Nodules on Computed Tomography Pulmonary Angiography at Care Transitions. Journal of Hospital Medicine, 14(6):349–352, June 2019. doi: 10.12788/jhm.3128. URL https://europepmc.org/article/med/30794133.
Monika Kłos, Jarosław Żyłkowski, and Dominik Spinczyk. Automatic Classification of Text Documents Presenting Radiology Examinations. In Ewa Pietka, Pawel Badura, Jacek Kawa, and Wojciech Wieclawek, editors, Proceedings of the 6th International Conference on Information Technology in Biomedicine (ITIB 2018), Advances in Intelligent Systems and Computing, pages 495–505. Springer International Publishing, 2018. ISBN 978-3-319-91211-0. doi: 10.1007/978-3-319-91211-0_43.
Ronilda Lacson, Kimberly Harris, Phyllis Brawarsky, Tor D. Tosteson, Tracy Onega, Anna N. A. Tosteson, Abby Kaye, Irina Gonzalez, Robyn Birdwell, and Jennifer S. Haas. Evaluation of an Automated Information Extraction Tool for Imaging Data Elements to Populate a Breast Cancer Screening Registry. Journal of Digital Imaging, 28(5):567–575, October 2015. ISSN 1618-727X. doi: 10.1007/s10278-014-9762-4. URL https://doi.org/10.1007/s10278-014-9762-4.
Ronilda Lacson, Martha E. Goodrich, Kimberly Harris, Phyllis Brawarsky, and Jennifer S. Haas. Assessing Inaccuracies in Automated Information Extraction of Breast Imaging Findings. Journal of Digital Imaging, 30(2):228–233, April 2017. ISSN 1618-727X. doi: 10.1007/s10278-016-9927-4. URL https://doi.org/10.1007/s10278-016-9927-4.
M. Lafourcade and Lionel Ramadier. Radiological text simplification using a general knowledge base. In Intelligent Text Processing (CICLing 2017), Budapest, Hungary, 2017. doi: 10.1007/978-3-319-77116-8.
In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), Portorož, Slovenia, 2016. European Language Resources Association (ELRA). ISBN 978-2-9517408-9-1. URL https://hal.archives-ouvertes.fr/hal-01382320.
J. Richard Landis and Gary G. Koch. The Measurement of Observer Agreement for Categorical Data. Biometrics, 33(1):159–174, 1977. ISSN 0006-341X. doi: 10.2307/2529310.
Andrew Yu Li and Nikki Elliot. Natural language processing to identify ureteric stones in radiology reports. Journal of Medical Imaging and Radiation Oncology, 63(3):307–310, 2019. ISSN 1754-9485. doi: 10.1111/1754-9485.12861. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/1754-9485.12861.
Yi Liu, Li-Na Zhu, Qing Liu, Chao Han, Xiao-Dong Zhang, and Xiao-Ying Wang. Automatic extraction of imaging observation and assessment categories from breast magnetic resonance imaging reports with natural language processing. Chinese Medical Journal, 132(14):1673–1680, July 2019. ISSN 0366-6999. doi: 10.1097/CM9.0000000000000301.
Thusitha Mabotuwana, Christopher S. Hall, Joel Tieder, and Martin L. Gunn. Improving Quality of Follow-Up Imaging Recommendations in Radiology. AMIA Annual Symposium Proceedings, 2017:1196–1204, April 2018a. ISSN 1942-597X.
Thusitha Mabotuwana, Vadiraj Hombal, Sandeep Dalal, Christopher S. Hall, and Martin Gunn. Determining Adherence to Follow-up Imaging Recommendations. Journal of the American College of Radiology, 15(3, Part A):422–428, March 2018b. ISSN 1546-1440. doi: 10.1016/j.jacr.2017.11.022.
Margaret Mahan, Daniel Rafter, Hannah Casey, Marta Engelking, Tessneem Abdallah, Charles Truwit, Mark Oswood, and Uzma Samadani. tbiExtractor: A framework for extracting traumatic brain injury common data elements from radiology reports. bioRxiv 585331, 2019. doi: 10.1101/585331.
Máté E. Maros, Ralf Wenz, Alex Förster, Matthias F. Froelich, Christoph Groden, Wieland H. Sommer, Stefan O. Schönberg, Thomas Henzler, and Holger Wenz. Objective Comparison Using Guideline-based Query of Conventional Radiological Reports and Structured Reports. In Vivo, 32(4):843–849, January 2018. ISSN 0258-851X, 1791-7549. doi: 10.21873/invivo.11318. URL http://iv.iiarjournals.org/content/32/4/843.
Aaron J. Masino, Robert W. Grundmeier, Jeffrey W. Pennington, John A. Germiller, and E. Bryan Crenshaw. Temporal bone radiology report classification using open source machine learning and natural langue processing libraries. BMC Medical Informatics and Decision Making, 16(1):65, June 2016. ISSN 1472-6947. doi: 10.1186/s12911-016-0306-3. URL https://doi.org/10.1186/s12911-016-0306-3.
Stephane Meystre, Ramkiran Gouripeddi, Joel Tieder, Jeffrey Simmons, Rajendu Srivastava, and Samir Shah. Enhancing Comparative Effectiveness Research With Automated Pediatric Pneumonia Detection in a Multi-Institutional Clinical Repository: A PHIS+ Pilot Study. Journal of Medical Internet Research, 19(5):e162, 2017. doi: 10.2196/jmir.6887.
Shumei Miao, Tingyu Xu, Yonghui Wu, Hui Xie, Jingqi Wang, Shenqi Jing, Yaoyun Zhang, Xiaoliang Zhang, Yinshuang Yang, Xin Zhang, Tao Shan, Li Wang, Hua Xu, Shui Wang, and Yun Liu. Extraction of BI-RADS findings from breast ultrasound reports in Chinese using deep learning approaches. International Journal of Medical Informatics, 119:17–21, November 2018. ISSN 1386-5056. doi: 10.1016/j.ijmedinf.2018.08.009.
Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. 2013. URL http://arxiv.org/abs/1301.3781.
Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, and Armand Joulin. Advances in Pre-Training Distributed Word Representations. In Proceedings of the International Conference on Language Resources and Evaluation (LREC 2018), 2018.
Matthew J. Minn, Arash R. Zandieh, and Ross W. Filice. Improving Radiology Report Quality by Rapidly Notifying Radiologist of Report Errors. Journal of Digital Imaging, 28(4):492–498, August 2015. ISSN 1618-727X. doi: 10.1007/s10278-015-9781-9. URL https://doi.org/10.1007/s10278-015-9781-9.
David Moher, Larissa Shamseer, Mike Clarke, Davina Ghersi, Alessandro Liberati, Mark Petticrew, Paul Shekelle, and Lesley A. Stewart. Preferred reporting items for systematic review and meta-analysis protocols (PRISMA-P) 2015 statement. Systematic Reviews, 4(1):1, December 2015. ISSN 2046-4053. doi: 10.1186/2046-4053-4-1. URL https://systematicreviewsjournal.biomedcentral.com/articles/10.1186/2046-4053-4-1.
Srikanth Mujjiga, Vamsi Krishna, Kalyan Chakravarthi, and Vijayananda J. Identifying Semantics in Clinical Reports Using Neural Machine Translation. Proceedings of the AAAI Conference on Artificial Intelligence, 33(01):9552–9557, July 2019. ISSN 2374-3468. doi: 10.1609/aaai.v33i01.33019552.
National Library of Medicine. SNOMED CT. 2021a.
National Library of Medicine. Unified Medical Language System. 2021b.
Nariman Noorbakhsh-Sabet, Georgios Tsivgoulis, Shima Shahjouei, Yirui Hu, Nitin Goyal, Andrei V. Alexandrov, and Ramin Zand. Racial Difference in Cerebral Microbleed Burden Among a Patient Population in the Mid-South United States. Journal of Stroke and Cerebrovascular Diseases, 27(10):2657–2661, October 2018. ISSN 1052-3057. doi: 10.1016/j.jstrokecerebrovasdis.2018.05.031.
Heiner Oberkampf, Sonja Zillner, James A. Overton, Bernhard Bauer, Alexander Cavallaro, Michael Uder, and Matthias Hammon. Semantic representation of reported measurements in radiology. BMC Medical Informatics and Decision Making, 16(1):5, January 2016. ISSN 1472-6947. doi: 10.1186/s12911-016-0248-9. URL https://doi.org/10.1186/s12911-016-0248-9.
Charlene Jennifer Ong, Agni Orfanoudaki, Rebecca Zhang, Francois Pierre M. Caprasse, Meghan Hutch, Liang Ma, Darian Fard, Oluwafemi Balogun, Matthew I. Miller, Margaret Minnig, Hanife Saglam, Brenton Prescott, David M. Greer, Stelios Smirnakis, and Dimitris Bertsimas. Machine learning and natural language processing methods to identify ischemic stroke, acuity and location from radiology reports. PLOS ONE, 15(6):e0234908, June 2020. ISSN 1932-6203. doi: 10.1371/journal.pone.0234908. URL https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0234908.
Tejal A. Patel, Mamta Puppala, Richard O. Ogunti, Joe E. Ensor, Tiancheng He, Jitesh B. Shewale, Donna P. Ankerst, Virginia G. Kaklamani, Angel A. Rodriguez, Stephen T. C. Wong, and Jenny C. Chang. Correlating mammographic and pathologic findings in clinical decision support using natural language processing and data mining methods. Cancer, 123(1):114–121, January 2017. ISSN 1097-0142. doi: 10.1002/cncr.30245.
Y. Peng, K. Yan, V. Sandfort, R. M. Summers, and Z. Lu. A self-attention based deep learning method for lesion attribute detection from CT reports. In 2019 IEEE International Conference on Healthcare Informatics (ICHI), pages 1–5, Xi'an, China, June 2019. IEEE Computer Society. doi: 10.1109/ICHI.2019.8904668.
Jeffrey Pennington, Richard Socher, and Christopher D. Manning. GloVe: Global vectors for word representation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1532–1543, 2014.
Bethany Percha, Yuhao Zhang, Selen Bozkurt, Daniel Rubin, Russ B. Altman, and Curtis P. Langlotz. Expanding a radiology lexicon using contextual patterns in radiology reports. Journal of the American Medical Informatics Association, 25(6):679–685, June 2018. doi: 10.1093/jamia/ocx152. URL https://academic.oup.com/jamia/article/25/6/679/4797401.
Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. Deep contextualized word representations. CoRR, abs/1802.05365, 2018. URL http://arxiv.org/abs/1802.05365.
Ewoud Pons, Loes M. M. Braun, M. G. Myriam Hunink, and Jan A. Kors. Natural Language Processing in Radiology: A Systematic Review. Radiology, 279(2):329–343, April 2016. ISSN 0033-8419. doi: 10.1148/radiol.16142770. URL https://pubs.rsna.org/doi/10.1148/radiol.16142770.
Peter Pruitt, Andrew Naidech, Jonathan Van Ornam, Pierre Borczuk, and William Thompson. A natural language processing algorithm to extract characteristics of subdural hematoma from head CT reports. Emergency Radiology, 26(3):301–306, June 2019. ISSN 1438-1435. doi: 10.1007/s10140-019-01673-4. URL https://doi.org/10.1007/s10140-019-01673-4.
Basel Qenam, Tae Youn Kim, Mark J. Carroll, and Michael Hogarth. Text Simplification Using Consumer Health Vocabulary to Generate Patient-Centered Radiology Reporting: Translation and Evaluation. Journal of Medical Internet Research, 19(12):e417, 2017. doi: 10.2196/jmir.8536.
Alex Ratner, Braden Hancock, Jared Dunnmon, Roger Goldman, and Christopher Ré. Snorkel MeTaL: Weak Supervision for Multi-Task Learning. In Proceedings of the Second Workshop on Data Management for End-To-End Machine Learning (DEEM '18), pages 1–4, Houston, TX, USA, 2018. ACM. ISBN 978-1-4503-5828-6. doi: 10.1145/3209889.3209898. URL https://doi.org/10.1145/3209889.3209898.
Joseph S. Redman, Yamini Natarajan, Jason K. Hou, Jingqi Wang, Muzammil Hanif, Hua Feng, Jennifer R. Kramer, Roxanne Desiderio, Hua Xu, Hashem B. El-Serag, and Fasiha Kanwal. Accurate Identification of Fatty Liver Disease in Data Warehouse Utilizing Natural Language Processing. Digestive Diseases and Sciences, 62(10):2713–2718, October 2017. ISSN 1573-2568. doi: 10.1007/s10620-017-4721-9. URL https://doi.org/10.1007/s10620-017-4721-9.
RSNA. RadLex. 2021. URL http://radlex.org/.
Yvonne Sada, Jason Hou, Peter Richardson, Hashem El-Serag, and Jessica Davila. Validation of Case Finding Algorithms for Hepatocellular Cancer from Administrative Data and Electronic Health Records using Natural Language Processing. Medical Care, 54(2):e9–e14, February 2016. ISSN 0025-7079. doi: 10.1097/MLR.0b013e3182a30373.
M. Sevenster, J. Buurman, P. Liu, J. F. Peters, and P. J. Chang. Natural Language Processing Techniques for Extracting and Categorizing Finding Measurements in Narrative Radiology Reports. Applied Clinical Informatics, 6(3):600–610, 2015a. ISSN 1869-0327. doi: 10.4338/ACI-2014-11-RA-0110.
Merlijn Sevenster, Jeffrey Bozeman, Andrea Cowhy, and William Trost. A natural language processing pipeline for pairing measurements uniquely across free-text CT reports. Journal of Biomedical Informatics, 53:36–48, February 2015b. ISSN 1532-0464. doi: 10.1016/j.jbi.2014.08.015.
S. C. Shelmerdine, M. Singh, W. Norman, R. Jones, N. J. Sebire, and O. J. Arthurs. Automated data extraction and report analysis in computer-aided radiology audit: practice implications from post-mortem paediatric imaging. Clinical Radiology, 74(9):733.e11–733.e18, September 2019. ISSN 0009-9260. doi: 10.1016/j.crad.2019.04.021.
B. Shickel, P. J. Tighe, A. Bihorac, and P. Rashidi. Deep EHR: A Survey of Recent Advances in Deep Learning Techniques for Electronic Health Record (EHR) Analysis. IEEE Journal of Biomedical and Health Informatics, 22(5):1589–1604, September 2018. ISSN 2168-2208. doi: 10.1109/JBHI.2017.2767063.
B. Shin, F. H. Chokshi, T. Lee, and J. D. Choi. Classification of radiology reports using neural attention models. In 2017 International Joint Conference on Neural Networks (IJCNN), pages 4363–4370, Anchorage, AK, May 2017. IEEE. doi: 10.1109/IJCNN.2017.7966408.
Ryan G. Short, John Bralich, Dave Bogaty, and Nicholas T. Befera. Comprehensive Word-Level Classification of Screening Mammography Reports Using a Neural Network Sequence Labeling Approach. Journal of Digital Imaging, 32(5):685–692, October 2019. ISSN 1618-727X. doi: 10.1007/s10278-018-0141-4. URL https://doi.org/10.1007/s10278-018-0141-4.
Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Ng, and Matthew Lungren. Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 1500–1519, Online, November 2020a. Association for Computational Linguistics. doi: 10.18653/v1/2020.emnlp-main.117.
Akshay Smit, Saahil Jain, Pranav Rajpurkar, Anuj Pareek, Andrew Y. Ng, and Matthew P. Lungren. CheXbert: Combining Automatic Labelers and Expert Annotations for Accurate Radiology Report Labeling Using BERT. CoRR, abs/2004.09167, 2020b. URL https://arxiv.org/abs/2004.09167.
V. Sorin, Y. Barash, E. Konen, and E. Klang. Deep Learning for Natural Language Processing in Radiology-Fundamentals and a Systematic Review. Journal of the American College of Radiology: JACR, 17(5):639–648, 2020. doi: 10.1016/j.jacr.2019.12.026.
Irena Spasic and Goran Nenadic. Clinical Text Data in Machine Learning: Systematic Review. JMIR Medical Informatics, 8(3):e17984, March 2020. ISSN 2291-9694. doi: 10.2196/17984.
Jackson M. Steinkamp, Charles Chambers, Darco Lalevic, Hanna M. Zafar, and Tessa S. Cook. Toward Complete Structured Information Extraction from Radiology Reports Using Machine Learning. Journal of Digital Imaging, 32(4):554–564, August 2019. ISSN 1618-727X. doi: 10.1007/s10278-019-00234-y. URL https://doi.org/10.1007/s10278-019-00234-y.
Amir M. Tahmasebi, Henghui Zhu, Gabriel Mankovich, Peter Prinsen, Prescott Klassen, Sam Pilato, Rob van Ommering, Pritesh Patel, Martin L. Gunn, and Paul Chang. Automatic Normalization of Anatomical Phrases in Radiology Reports Using Unsupervised Learning. Journal of Digital Imaging, 32(1):6–18, February 2019. ISSN 1618-727X. doi: 10.1007/s10278-018-0116-5. URL https://doi.org/10.1007/s10278-018-0116-5.
W. Katherine Tan and Patrick J. Heagerty. Surrogate-guided sampling designs for classification of rare outcomes from electronic medical records data. arXiv:1904.00412 [stat.ME], March 2019. URL http://arxiv.org/abs/1904.00412.
W. Katherine Tan, Saeed Hassanpour, Patrick J. Heagerty, Sean D. Rundell, Pradeep Suri, Hannu T. Huhdanpaa, Kathryn James, David S. Carrell, Curtis P. Langlotz, Nancy L. Organ, Eric N. Meier, Karen J. Sherman, David F. Kallmes, Patrick H. Luetmer, Brent Griffith, David R. Nerenz, and Jeffrey G. Jarvik. Comparison of Natural Language Processing Rules-based and Machine-learning Systems to Identify Lumbar Spine Imaging Findings Related to Low Back Pain. Academic Radiology, 25(11):1422–1432, November 2018. ISSN 1076-6332. doi: 10.1016/j.acra.2018.03.008.
Gaurav Trivedi, Charmgil Hong, Esmaeel R. Dadashzadeh, Robert M. Handzel, Harry Hochheiser, and Shyam Visweswaran. Identifying incidental findings from radiology reports of trauma patients: An evaluation of automated feature representation methods. International Journal of Medical Informatics, 129:81–87, September 2019. ISSN 1386-5056. doi: 10.1016/j.ijmedinf.2019.05.021.
Hari Trivedi, Joseph Mesterhazy, Benjamin Laguna, Thienkhai Vu, and Jae Ho Sohn. Automatic Determination of the Need for Intravenous Contrast in Musculoskeletal MRI Examinations Using IBM Watson's Natural Language Processing Algorithm. Journal of Digital Imaging, 31(2):245–251, April 2018. ISSN 1618-727X. doi: 10.1007/s10278-017-0021-3. URL https://doi.org/10.1007/s10278-017-0021-3.
Robert M. Van Haren, Arlene M. Correa, Boris Sepesi, David C. Rice, Wayne L. Hofstetter, Reza J. Mehran, Ara A. Vaporciyan, Garrett L. Walsh, Jack A. Roth, Stephen G. Swisher, and Mara B. Antonoff. Ground Glass Lesions on Chest Imaging: Evaluation of Reported Incidence in Cancer Patients Using Natural Language Processing. The Annals of Thoracic Surgery, 107(3):936–940, March 2019. ISSN 0003-4975. doi: 10.1016/j.athoracsur.2018.09.016.
Emily Wheater, Grant Mair, Cathie Sudlow, Beatrice Alex, Claire Grover, and William Whiteley. A validated natural language processing algorithm for brain imaging phenotypes from radiology reports in UK electronic health records. BMC Medical Informatics and Decision Making, 19(1):184, September 2019. ISSN 1472-6947. doi: 10.1186/s12911-019-0908-7. URL https://doi.org/10.1186/s12911-019-0908-7.
Claes Wohlin. Guidelines for Snowballing in Systematic Literature Studies and a Replication in Software Engineering. In Proceedings of the 18th International Conference on Evaluation and Assessment in Software Engineering (EASE '14), London, England, United Kingdom, 2014. Association for Computing Machinery, New York, NY, USA. ISBN 978-1-4503-2476-2. doi: 10.1145/2601248.2601268. URL https://doi.org/10.1145/2601248.2601268.
David A. Wood, Jeremy Lynch, Sina Kafiabadi, Emily Guilhem, Aisha Al Busaidi, Antanas Montvila, Thomas Varsavsky, Juveria Siddiqui, Naveen Gadapa, Matthew Townend, Martin Kiik, Keena Patel, Gareth Barker, Sebastian Ourselin, James H. Cole, and Thomas C. Booth. Automated Labelling using an Attention model for Radiology reports of MRI scans (ALARM). arXiv:2002.06588 [cs.CV], 2020. URL http://arxiv.org/abs/2002.06588.
Stephen Wu, Kirk Roberts, Surabhi Datta, Jingcheng Du, Zongcheng Ji, Yuqi Si, Sarvesh Soni, Qiong Wang, Qiang Wei, Yang Xiang, Bo Zhao, and Hua Xu. Deep learning in clinical natural language processing: a methodical review. Journal of the American Medical Informatics Association: JAMIA, 27(3):457–470, 2020. ISSN 1527-974X. doi: 10.1093/jamia/ocz200.
Zhe Xie, Yuanyuan Yang, Mingqing Wang, Ming Li, Haozhe Huang, Dezhong Zheng, Rong Shu, and Tonghui Ling. Introducing Information Extraction to Radiology Information Systems to Improve the Efficiency on Reading Reports. Methods of Information in Medicine, 58(2-03):94–106, 2019. ISSN 2511-705X. doi: 10.1055/s-0039-1694992.
Kabir Yadav, Efsun Sarioglu, Hyeong-Ah Choi, Walter B. Cartwright, Pamela S. Hinds, and James M. Chamberlain. Automated Outcome Classification of Computed Tomography Imaging Reports for Pediatric Traumatic Brain Injury. Academic Emergency Medicine, 23(2):171–178, 2016. ISSN 1553-2712. doi: 10.1111/acem.12859. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/acem.12859.
Zihao Yan, Ivan K. Ip, Ali S. Raja, Anurag Gupta, Joshua M. Kosowsky, and Ramin Khorasani. Yield of CT Pulmonary Angiography in the Emergency Department When Providers Override Evidence-based Clinical Decision Support. Radiology, 282(3):717–725, September 2016. ISSN 0033-8419. doi: 10.1148/radiol.2016151985. URL https://pubs.rsna.org/doi/full/10.1148/radiol.2016151985.
Hongmei Yang, Lin Li, Ridong Yang, and Yi Zhou. Towards Automated Knowledge Discovery of Hepatocellular Carcinoma: Extract Patient Information from Chinese Clinical Reports. In Proceedings of the 2nd International Conference on Medical and Health Informatics (ICMHI '18), pages 111–116, New York, NY, USA, June 2018. ACM. ISBN 978-1-4503-6389-1. doi: 10.1145/3239438.3239445. URL https://doi.org/10.1145/3239438.3239445.
Koichiro Yasaka and Osamu Abe. Deep learning and artificial intelligence in radiology: Current applications and future directions. PLOS Medicine, 15(11):e1002707, November 2018. ISSN 1549-1676. doi: 10.1371/journal.pmed.1002707. URL https://journals.plos.org/plosmedicine/article?id=10.1371/journal.pmed.1002707.
Wen-wai Yim, Tyler Denman, Sharon W. Kwan, and Meliha Yetisgen. Tumor information extraction in radiology reports for hepatocellular carcinoma patients. AMIA Summits on Translational Science Proceedings, 2016:455–464, July 2016a. ISSN 2153-4063.
Wen-wai Yim, Sharon W. Kwan, and Meliha Yetisgen. Tumor reference resolution and characteristic extraction in radiology reports for liver cancer stage prediction. Journal of Biomedical Informatics, 64:179–191, December 2016b. ISSN 1532-0464. doi: 10.1016/j.jbi.2016.10.005.
Wen-wai Yim, Sharon W. Kwan, and Meliha Yetisgen. Classifying tumor event attributes in radiology reports. Journal of the Association for Information Science and Technology, 68(11):2662–2674, 2017. ISSN 2330-1643. doi: 10.1002/asi.23937. URL https://asistdl.onlinelibrary.wiley.com/doi/abs/10.1002/asi.23937.
Wen-wai Yim, Sharon W. Kwan, Guy Johnson, and Meliha Yetisgen. Classification of hepatocellular carcinoma stages from free-text clinical and radiology reports. AMIA Annual Symposium Proceedings, 2017:1858–1867, April 2018. ISSN 1942-597X.
T. Young, D. Hazarika, S. Poria, and E. Cambria. Recent Trends in Deep Learning Based Natural Language Processing [Review Article]. IEEE Computational Intelligence Magazine, 13(3):55–75, August 2018. ISSN 1556-6048. doi: 10.1109/MCI.2018.2840738.
John Zech, Margaret Pain, Joseph Titano, Marcus Badgeley, Javin Schefflein, Andres Su, Anthony Costa, Joshua Bederson, Joseph Lehar, and Eric Karl Oermann. Natural Language–based Machine Learning Models for the Annotation of Clinical Radiology Reports. Radiology, 287(2):570–580, January 2018. ISSN 0033-8419. doi: 10.1148/radiol.2018171093. URL https://pubs.rsna.org/doi/full/10.1148/radiol.2018171093.
John Zech, Jessica Forde, Joseph J. Titano, Deepak Kaji, Anthony Costa, and Eric Karl Oermann. Detecting insertion, substitution, and deletion errors in radiology reports using neural sequence-to-sequence models. Annals of Translational Medicine, 7(11), June 2019. ISSN 2305-5839. doi: 10.21037/atm.2018.08.11.
A. Y. Zhang, S. S. W. Lam, N. Liu, Y. Pang, L. L. Chan, and P. H. Tang. Development of a Radiology Decision Support System for the Classification of MRI Brain Scans. In 2018 IEEE/ACM International Conference on Big Data Computing, Applications and Technologies (BDCAT), pages 107–115, December 2018. doi: 10.1109/BDCAT.2018.00021.
Yuhao Zhang, Derek Merck, Emily Bao Tsai, Christopher D. Manning, and Curtis P. Langlotz. Optimizing the Factual Correctness of a Summary: A Study of Summarizing Radiology Reports. arXiv:1911.02541 [cs.CL], 2019. URL http://arxiv.org/abs/1911.02541.
Yiqing Zhao, Nooshin J. Fesharaki, Hongfang Liu, and Jake Luo. Using data-driven sublanguage pattern mining to induce knowledge models: application in medical image reports knowledge representation. BMC Medical Informatics and Decision Making, 18(1):61, July 2018. ISSN 1472-6947. doi: 10.1186/s12911-018-0645-3. URL https://doi.org/10.1186/s12911-018-0645-3.
Henghui Zhu, Ioannis Ch. Paschalidis, Christopher Hall, and Amir Tahmasebi. Context-Driven Concept Annotation in Radiology Reports: Anatomical Phrase Labeling. AMIA Summits on Translational Science Proceedings, 2019:232–241, May 2019. ISSN 2153-4063.