Deep metric learning for multi-labelled radiographs
Mauro Annarumma
Department of Biomedical Engineering, King's College London
[email protected]

Giovanni Montana
Department of Biomedical Engineering, King's College London
[email protected]
ABSTRACT
Many radiological studies can reveal the presence of several co-existing abnormalities, each one represented by a distinct visual pattern. In this article we address the problem of learning a distance metric for plain radiographs that captures a notion of "radiological similarity": two chest radiographs are considered to be similar if they share similar abnormalities. Deep convolutional neural networks (DCNs) are used to learn a low-dimensional embedding for the radiographs that is equipped with the desired metric. Two loss functions are proposed to deal with multi-labelled images and potentially noisy labels. We report on a large-scale study involving over 745,000 chest radiographs whose labels were automatically extracted from free-text radiological reports through a natural language processing system. Using 4,500 validated exams, we demonstrate that the methodology performs satisfactorily on clustering and image retrieval tasks. Remarkably, the learned metric separates normal exams from those having radiological abnormalities.
CCS Concepts
• Computing methodologies → Dimensionality reduction and manifold learning; Neural networks; Visual content-based indexing and retrieval; • Applied computing → Imaging;
Keywords
deep metric learning, convolutional networks, x-rays
1. INTRODUCTION
Permission to make digital or hard copies of part or all of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for third-party components of this work must be honored. For all other uses, contact the owner/author(s).
SAC 2018, April 9–13, 2018, Pau, France
Copyright held by the owner/author(s). ACM 978-1-4503-5191-1/18/04. DOI: https://doi.org/10.1145/3167132.3167379

Chest radiographs are performed to diagnose and monitor a wide range of conditions affecting lungs, heart, bones, and soft tissues. Despite being commonly performed, their reading is challenging and interpretation discrepancies can occur. There is a need to develop machine learning algorithms that can assist the reporting radiologist. In this work we address the problem of learning a distance metric for chest radiographs using a very large repository of historical exams that have already been reported. An ideal metric should be able to cluster together radiographs presenting similar radiological abnormalities and place them far away from exams with normal radiological appearance. Learning a suitable metric would enable a variety of applications, from automated retrieval of radiologically similar exams, for teaching and training, to their automated prioritization based on visual patterns.

The problem we discuss here is challenging for several reasons. First, the number of potential abnormalities that can be observed in a chest radiograph can be quite large. Visual patterns detected in radiographs are important cues used by clinicians when making a diagnosis. Often, at reporting time, the clinician will describe the visual pattern using descriptors (e.g. "enlarged heart") or state the exact medical pathology associated with the visual pattern (e.g. "consolidation in the right lower lobe"). A metric learning algorithm should be able to deal with any such labels and their potential overlaps. Second, the labels may not always be accurate or comprehensive, since not all abnormalities present in an image are necessarily reported, e.g. due to omissions or when deemed unimportant by the radiologist. When these labels are automatically obtained from free-text reports, as we do in this work, mislabelling errors may also occur.
Third, certain abnormalities are less frequently observed than others, and may not even exist in the training dataset.

To support this study, we have prepared a large repository consisting of over 745,000 chest radiograph examinations extracted from the PACS (Picture Archiving and Communication System) of a large teaching hospital in London. To our knowledge, this is the largest chest radiograph repository to ever be deployed in a machine learning study. Due to the large sample size, manual annotation of all the exams is unfeasible. All the historical free-text reports have been parsed using a Natural Language Processing (NLP) system, which has identified and classified any mention of radiological abnormalities. As a result of this process, each film has been automatically assigned one or multiple labels.

Our contributions are the following. First, we discuss the problem of deep metric learning with multi-labelled images and propose two versions of a loss function specifically designed to deal with overlapping and potentially noisy labels. At the core of the architecture, a DCN is used to learn compact image representations capturing the visual patterns described by the labels. Second, we report on a large-scale evaluation of the proposed methodology using a manually curated subset of over 4,500 exams. Each historical radiological report was reviewed by two independent clinicians who extracted all the labels associated with the films. We report comparative results for two tasks, clustering and image retrieval, and provide evidence that the learned metric can be used to cluster radiographs with a normal appearance as well as clusters of abnormal exams with co-occurring abnormalities.

Figure 1: Examples of pairs of images that are placed close to each other in the learned embedding space shown in Fig. 3. A1 was incorrectly reported, but a second reading shows the presence of pleural effusion and a medical device, which justifies its proximity to A2. B1 was labelled as "normal", but a second reading reveals some degree of cardiomegaly and, as such, the scan is placed close to B2. An extract from the original reports can be found under each image. Fig. 3 contains the legend for the labels. Report extracts: A1: "The lungs and pleural spaces are clear. No pneumothorax. The heart is not enlarged." (wrong report, referring to another x-ray); A2: "Large left-sided pleural effusion with almost complete collapse of left lower lobe. Right-sided thoracostomy tube."; B1: "The heart size is at the upper limits of normal, the lungs are clear."; B2: "The heart is enlarged. No active lung lesion."
2. RELATED WORK
2.1 Deep metric learning
The first attempt at using neural networks to learn an embedding space was the Siamese Network [1][2], which used a contrastive loss to train the network to distinguish between pairs of examples. Schroff et al. [10] combined a Siamese architecture with a triplet loss [19] and applied the resulting model to the face verification problem, obtaining nearly human performance. Other approaches have been proposed more recently in order to better exploit the information in each mini-batch; e.g. Song et al. [14] proposed a loss with a lifted structure, while Sohn [12] proposed a tuplet loss. They both use all the possible example pairs within each mini-batch. All these methods use a query or anchor image x_a, which is compared with positive elements (images sharing the same label) and negative elements (images with a different label). Several of these methods also implement a hard data mining approach whereby samples within a given pair or triplet are selected in such a way as to represent the hardest positive or negative example with respect to the given anchor. This strategy improves both the convergence speed and the final discriminative performance. In FaceNet [10], pairs of anchor and positive samples are randomly selected, while negative samples are selected from a subset of the training set using a semi-hard negative algorithm. Recently, Wu et al. [20] proposed a novel off-line mining strategy that selects, over the entire training set, the optimal positive and negative elements for each anchor. A different learning framework that does not require the training data to be processed in paired format has also been proposed recently [13].

2.2 Computer-aided diagnosis

The use of computer-aided diagnosis (CAD) systems in medical imaging goes back more than half a century [17]. Over the years the methodologies powering CAD systems have evolved substantially, from rule-based engines to artificial neural networks. In recent years, CAD developers have started to adopt deep learning strategies in a number of medical application domains. For instance, Geras et al. [4] have developed a DCN model able to handle multiple views of high-resolution screening mammograms, which are commonly used to screen for breast cancer. For applications to plain chest radiographs, standard DCNs have been used to predict pulmonary tuberculosis [6] and an architecture involving DCNs and recurrent neural networks has been trained to perform automatic image annotation [11]. Wang et al. [18] have used a database of chest x-rays with more than 100,000 frontal-view images and associated radiological reports in an attempt to detect commonly occurring thoracic diseases.
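To make the anchor/positive/negative vocabulary concrete, the triplet loss and a FaceNet-style semi-hard negative selection can be sketched in numpy. This is our own illustrative sketch, not code from the cited papers; function names and the fallback behaviour are our choices.

```python
import numpy as np

def triplet_loss(anchor, positive, negative, alpha=0.2):
    """Hinge on the margin: the positive should sit at least `alpha`
    closer to the anchor than the negative does."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + alpha)

def semi_hard_negative(anchor, positive, candidates, alpha=0.2):
    """Pick a negative farther than the positive but still inside the
    margin band (semi-hard); fall back to the hardest negative."""
    d_pos = np.linalg.norm(anchor - positive)
    d_negs = np.linalg.norm(candidates - anchor, axis=1)
    band = np.where((d_negs > d_pos) & (d_negs < d_pos + alpha))[0]
    idx = band[np.argmin(d_negs[band])] if len(band) else np.argmin(d_negs)
    return candidates[idx]
```

Semi-hard selection avoids the collapsed solutions that the very hardest (possibly mislabelled) negatives can induce, which is the motivation given in FaceNet.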
3. DEEP METRIC LEARNING WITH MULTI-LABELLED IMAGES
3.1 Problem formulation
In the remainder of this article we assume that each chest radiograph $x \in \mathbb{R}^w$ is associated with any of $l$ possible labels contained in a set $\mathcal{L}$. We collect all the labels describing $x$ in a set $\mathcal{L}(x)$, whilst all the remaining labels are identified by $\bar{\mathcal{L}}(x) = \mathcal{L} - \mathcal{L}(x)$. Our aim is to learn a non-linear embedding $f(x)$ that maps each $x$ onto a feature space $\mathbb{R}^d$ where $d \ll w$. In this subspace, the Euclidean distance among groups of similar images should be small and, conversely, the distance between dissimilar images should be large. The distance should be robust to anatomical variability within the normal range as well as geometric distortions and noise. Most importantly, it should be able to capture a notion of radiological similarity, i.e. two images are expected to be more similar to each other if they share similar radiological abnormalities. We require the embedding function, $f_\theta(\cdot)$, to depend only upon a learnable parameter vector $\theta$. No assumptions about this function can be made besides differentiability with respect to $\theta$. Consequently, the learned distance, $d_\theta(f_\theta(x_i), f_\theta(x_j))$, also depends on $\theta$.

Figure 2: An illustration of metric learning using the triplet (a), ML2 (b) and ML2+ (c) losses compared to an ideal metric (d). Each shape represents a label and overlapping shapes indicate co-occurring labels. The dotted arcs indicate the margin bounds depending on $\alpha$. See the text for further details.

While the definition of positive and negative elements is straightforward for applications involving mutually exclusive labels, it becomes more ambiguous when each image is allowed to have non-mutually exclusive labels. Restrictive assumptions would need to be made in order to use existing approaches based on the contrastive loss [2], the triplet loss [10] and others [14, 12].
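The target notion — distance proportional to the number of abnormalities two exams do not share — can be made concrete with a Jaccard-style distance on the label sets. This is our illustration of the idea, not a formula prescribed at this point in the paper, although the overlap quantity $\tau$ introduced later is exactly of this form.

```python
def label_distance(labels_a: set, labels_b: set) -> float:
    """Jaccard distance between two label sets: 0 when the label sets
    coincide, 1 when the exams share no abnormality."""
    union = labels_a | labels_b
    if not union:  # two "normal" exams (empty label sets) are maximally similar
        return 0.0
    return 1.0 - len(labels_a & labels_b) / len(union)
```

For example, two exams sharing one of two abnormalities sit at distance 0.5, halfway between identical and disjoint findings.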
The simplest approach would be to assume that $x_i$ and $x_j$ are positive with respect to each other only when they share exactly the same labels, i.e. when $\mathcal{L}(x_i) = \mathcal{L}(x_j)$; conversely, they would be interpreted as negative elements when the equality is not satisfied. However, assuming that two films are radiologically similar only when they share exactly the same abnormalities is too strong an assumption. Adopting this strategy would also result in much larger sample sizes for elements with frequently co-occurring labels compared to elements characterised by less frequent labels, thus hindering the learning process. Furthermore, since each individual label in both $\mathcal{L}(x_i)$ and $\mathcal{L}(x_j)$ is expected to be noisy, requiring the co-occurrence of exactly all the labels may be too restrictive.

A much less restrictive approach would be to assume that $x_i$ and $x_j$ are positive when they have at least one common label, i.e. when $\mathcal{L}(x_i) \cap \mathcal{L}(x_j) \neq \emptyset$. Under this definition, both the contrastive and triplet losses could still be used. This approach is still far from ideal, though, because this definition is invariant to the degree of overlap between $\mathcal{L}(x_i)$ and $\mathcal{L}(x_j)$. Ideally, the learned distance between any two images should be proportional to the number of abnormalities they do not share. Fig. 2d illustrates this ideal situation. The triplet loss would struggle to satisfy this requirement as it does not take the global structure of the embedding space into consideration [12] and does not explicitly account for overlapping labels; see Fig. 2a. In the next section, we propose two loss functions that are designed to overcome the above limitations.

3.2 Proposed loss functions

We begin by assuming that $x_i$ and $x_j$ are positive when $\mathcal{L}(x_i) \cap \mathcal{L}(x_j) \neq \emptyset$. Given an anchor $x_a$, our approach starts by retrieving $l$ randomly selected images, one for each label in $\mathcal{L}$.
The images are then grouped into two non-overlapping sets: one containing the $p$ positive elements, $P(x_a) = \{x^+_1, ..., x^+_p\}$, and one containing the $n$ remaining negative elements, $N(x_a) = \{x^-_1, ..., x^-_n\}$, where $p + n = l$. An ideal metric should ensure that $x_a$ is kept as close as possible to all the elements in $P$ whilst being kept away from all the elements in $N$. Accordingly, the loss function to be minimised can be defined as

$$L(x_a, P, N) = \frac{1}{np} \sum_{i=1}^{p} \sum_{j=1}^{n} \max\Big(0,\; L_{tpl}(x_a, x^+_i, x^-_j)\Big)$$

$$L_{tpl}(x_a, x^+, x^-) = d\big(f_\theta(x_a), f_\theta(x^+)\big) - d\big(f_\theta(x_a), f_\theta(x^-)\big) + \alpha$$

where the positive scalar $\alpha$ represents a margin to be enforced between positive and negative pairs. This formulation can be seen as the average of the triplet losses derived from all the possible triplets $\{x_a, x^+_i, x^-_j\}$ where $x^+_i \in P$ and $x^-_j \in N$. The expression above can be simplified by pre-selecting the negative element $x^-_j$ having the largest contribution (see also Song et al. [14]), i.e. yielding

$$L^-(x_a, N) = \max_j \Big[\alpha - d\big(f_\theta(x_a), f_\theta(x^-_j)\big)\Big]$$

In this way, we obtain a more tractable optimisation problem

$$L(x_a, P, N) = \frac{1}{p} \sum_{i=1}^{p} \max\Big(0,\; d\big(f_\theta(x_a), f_\theta(x^+_i)\big) + L^-(x_a, N)\Big)$$

which can be further simplified by using a smooth upper bound for $L^-(x_a, N)$,

$$\hat{L}^-(x_a, N) = \log\Big(\sum_{j=1}^{n} e^{\alpha - d\left(f_\theta(x_a), f_\theta(x^-_j)\right)}\Big) \geq L^-(x_a, N)$$

Table 1: Dataset sample sizes
Class             Train    Validation  Test    GL Set
Normal            86863    10857       10865   558
Cardiomegaly      40312    5084        5315    374
Medical device    105880   13287       13616   850
Pleural effusion  66980    8398        8676    642
Pneumothorax      20003    2519        2613    212
Total             261678   32802       33494   2051

The above loss does not directly address the issue arising when some elements in $P(x_a)$ have labels that are not in $\mathcal{L}(x_a)$. Without imposing further constraints on how the elements in $P$ are selected, the loss will force $d\big(f_\theta(x_a), f_\theta(x^+)\big)$ to become as small as possible regardless of the number of labels that $x_a$ and $x^+$ actually have in common. This problem is addressed by introducing a quantity, $\tau$, that represents the degree of overlap between the labels associated with $x_a$ and those associated with its positive elements, i.e.

$$\tau = \frac{|\mathcal{L}(x_a) \cup \mathcal{L}(x^+_i)| - |\mathcal{L}(x_a) \cap \mathcal{L}(x^+_i)|}{|\mathcal{L}(x_a) \cup \mathcal{L}(x^+_i)|}$$

Clearly, $\tau$ is equal to 0 when $|\mathcal{L}(x_a) \cap \mathcal{L}(x^+_i)| = |\mathcal{L}(x_a) \cup \mathcal{L}(x^+_i)|$ and to 1 when $\mathcal{L}(x_a) \cap \mathcal{L}(x^+_i) = \emptyset$. By allowing $d\big(f(x_a), f(x^+_i)\big)$ to be a fraction $\tau$ of $\alpha$, we obtain the proposed ML2 (Metric Learning for Multi-Label) loss, i.e.

$$L_{ML2} = \frac{1}{p} \sum_{i=1}^{p} \max\Big(0,\; d\big(f_\theta(x_a), f_\theta(x^+_i)\big) - \alpha\tau + \hat{L}^-(x_a, N)\Big)$$

An illustrative example of its inner workings is provided in Fig. 2b. We also propose a different version of the loss, which relies on a different definition of positive elements. In this case, for each label in $\mathcal{L}(x_a)$, a positive element is strictly required to have only that particular label. The quantity $\tau$ then simplifies to $\tau = (p-1)/p$, since $|\mathcal{L}(x_a) \cap \mathcal{L}(x^+_i)| = 1$ and $|\mathcal{L}(x_a) \cup \mathcal{L}(x^+_i)| = p$. An illustration is provided in Fig. 2c; we call this version ML2+.

For applications involving a large number of classes, a memory-efficient implementation of the two methods above can be obtained by reducing the elements in $P$ and $N$ using a hard class mining approach. In this case, $P$ and $N$ depend only on a subset of all $l$ labels, chosen by determining which labels contribute the most to the overall loss (see Sohn [12]).
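The ML2 computation for a single anchor can be sketched in numpy. This is a simplified sketch of the loss as stated in the text, not the authors' implementation; array shapes and names are our assumptions, and Euclidean distance stands in for $d(\cdot,\cdot)$.

```python
import numpy as np

def ml2_loss(f_a, f_pos, f_neg, labels_a, labels_pos, alpha=0.2):
    """ML2 loss for one anchor.
    f_a: (d,) anchor embedding; f_pos: (p, d) positive embeddings;
    f_neg: (n, d) negative embeddings; labels_a: set of anchor labels;
    labels_pos: list of label sets, one per positive element."""
    d_neg = np.linalg.norm(f_neg - f_a, axis=1)
    # smooth (log-sum-exp) upper bound on the hardest-negative term
    l_neg = np.log(np.sum(np.exp(alpha - d_neg)))
    total = 0.0
    for f_p, labels_p in zip(f_pos, labels_pos):
        union = labels_a | labels_p
        # tau: Jaccard distance between anchor and positive label sets
        tau = (len(union) - len(labels_a & labels_p)) / len(union)
        d_pos = np.linalg.norm(f_a - f_p)
        # positives with weak label overlap are allowed a slack of alpha*tau
        total += max(0.0, d_pos - alpha * tau + l_neg)
    return total / len(f_pos)
```

Under ML2+, where each positive carries exactly one of the anchor's $p$ labels, the `tau` computed inside the loop reduces to the constant $(p-1)/p$.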
4. LARGE-SCALE METRIC LEARNING FOR CHEST RADIOGRAPHS
For this study, we obtained a large dataset consisting of 745,480 historical chest radiographs extracted from the PACS system of Guy's & St Thomas' NHS Foundation Trust, serving a large, diverse population in South London. Our dataset covers the period between January 2005 and March 2016. The radiographs were taken using 40 different scanners across more than 100 departments. For a large portion of these exams, we had both the radiological report as well as the associated plain film. The reports were written by 276 different readers, including consultant and trainee radiologists and accredited reporting radiographers. All the examinations were anonymised with no patient-identifiable data or referral information. The size of the images ranges from 734 × 734 to 4400 × … pixels. A manually validated subset of exams, the Golden Set, was used to assess and compare the performance of the metric algorithms.
Given the large number of reports available for the study, obtaining manual labels for each exam was unfeasible. Instead, all the written reports were processed using an NLP system specifically developed to model radiological language [3]. The system was trained to detect any mention of radiological abnormalities and their negations. Labels were chosen to allow all common radiological findings to be allocated to a group along with other films sharing similar appearances. The labels were adapted from Hansell et al. [5] and were meant to capture discrete radiological findings (e.g. cardiomegaly, medical device, pleural effusion) rather than giving a final diagnosis (e.g. pulmonary oedema), which requires clinical judgement to combine the current findings with previous imaging, clinical history, and laboratory results. For this study, we used l = 4 different labels, i.e. cardiomegaly, medical devices (e.g. pacemakers, lines, and tubes), pleural effusion and pneumothorax. The NLP system also identified all "normal" exams, i.e. those where no abnormalities were mentioned in the report. Cumulatively, the normal and abnormal labels used here represent 68% of all the reported visual patterns in our database.

A validation study was carried out to assess how accurately the NLP system extracted the 4 clinical labels, plus the normal class, from the written reports. Two independent clinicians were presented with the original radiological reports and manually generated the labels from the reports. This study generated the Golden Set, which is used here purely for performance evaluation purposes. In Table 2 we report the precision, sensitivity, specificity and F1 score of the NLP system on the Golden Set.

Table 2: NLP labelling performance on the Golden Set
Class             Prec.   Sens.   Spec.   F1
Normal            98.98   97.33   99.85   98.15
Cardiomegaly      99.59   99.39   99.95   99.49
Medical device    98.52   94.34   99.27   96.39
Pleural effusion  96.80   91.42   99.36   94.03
Pneumothorax      77.07   96.88   98.05   85.85

Table 3: Proposed architecture based on the Inception v3 network for 1211 × … input images.

Standard DCN architectures, such as Inception v3, were originally designed to model natural images, such as those in the ImageNet dataset [9]. These images are typically scaled down to 299 × 299 pixels, even though higher resolution images are available. In many studies, down-scaling natural images has been shown to be a good compromise between the amount of information that is lost and computational efficiency. However, in a medical imaging setting, every detail in an image matters, at least in principle. Thus, arbitrarily reducing the resolution of the images is generally considered suboptimal [4]. For this reason, in our study we have implemented a slightly modified version of Inception v3 that is able to handle 1211 × … input images (Table 3).
5. EXPERIMENTAL RESULTS
5.1 Training strategy
The representation $f_\theta(x)$ was learned using an Inception v3 architecture [15], resulting in an $m$-dimensional mapping under the constraint that $\|f_\theta(x)\| = 1$. We call $g_\psi(x)$ the output of the last convolutional layer and define our final layer as

$$f_\theta(x) = \frac{g_\psi(x)\beta + b}{\|g_\psi(x)\beta + b\|}$$

where $\beta$ and $b$ are, respectively, the weights and bias of the last layer. All the results presented here use $m = 64$, because the use of larger dimensions did not introduce any significant improvements. All images were rescaled to a standard size of 299 × 299 (1211 × … for the high-resolution experiments). We considered two training regimes: one where $f_\theta(x)$ was learned end-to-end from the raw images, and one where pre-training was used instead, as is commonly done in other works [12][14]. The proposed ML2 and ML2+ losses were compared to more traditional metric learning approaches based on contrastive and triplet losses sharing the same architecture.

Stochastic Gradient Descent (SGD) was used for the optimisation process. When we started from randomly initialised weights, the total number of iterations was 90,000 and, every 25,000 iterations, the learning rate was decreased by a factor of 10. When the weights were instead pre-trained on the classification task, the total number of iterations was 27,000 and the learning rate was decreased every 8,000 iterations. In both experimental setups the mini-batch size was 36 when the contrastive and triplet losses were used, and 10 for our proposed losses. We tested different values of $\alpha$, which, for the results shown in this work, was set to 0.2. During training, the model with the best value of NMI on the validation set is kept as the best model and used during the testing phase.

Positive and negative elements were randomly sampled. The noisiness of our labels prevented us from exploiting any sampling techniques (e.g. hardest negative mining), since all those methods take the reliability of the labels for granted.
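The final embedding layer described above — a linear projection followed by L2 normalisation onto the unit hypersphere — can be sketched in numpy. The feature size of 2048 and the vector shape of the bias are our assumptions for illustration; the exact dimensions are not stated here.

```python
import numpy as np

def embedding_head(g, beta, b):
    """Final embedding layer: linear projection then L2 normalisation,
    enforcing the unit-norm constraint ||f(x)|| = 1.
    g: (k,) flattened output of the last convolutional layer;
    beta: (k, m) weight matrix; b: (m,) bias (shape assumed)."""
    z = g @ beta + b
    return z / np.linalg.norm(z)

rng = np.random.default_rng(0)
g = rng.standard_normal(2048)            # illustrative feature size
beta = rng.standard_normal((2048, 64))   # m = 64 as in the paper
b = rng.standard_normal(64)
f = embedding_head(g, beta, b)
```

Normalising the embeddings keeps all pairwise Euclidean distances bounded (at most 2), which makes a fixed margin $\alpha$ meaningful across the whole dataset.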
For the pre-training of our DCN, we used a multi-label binary cross-entropy loss. Given our 4 possible labels, we defined an equal number of binary classifiers with the aim of predicting the presence or absence of each label. The output of each binary classifier $l^i_\phi(x)$ is

$$l^i_\phi(x) = \mathrm{LogSoftMax}\big(g_\psi(x)\beta_i + b_i\big)$$

where $\beta_i$ and $b_i$ are weights and biases distinct from those defined above. The loss function is equal to the average of the negative log-likelihoods of $l^i_\phi(x)$ for each $i = 1, ..., 4$,

$$L\big(l_\phi(x), y\big) = -\frac{1}{4}\sum_{i=1}^{4} y_i \cdot l^i_\phi(x)$$

where $y$ is the labels vector; $y_i$ is equal to $(1, 0)$ when the $i$-th abnormality is present in the image $x$, and to $(0, 1)$ otherwise.
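The pre-training loss above — one 2-way log-softmax classifier per label, averaged — can be sketched as follows. This is our numpy illustration of the stated formula; shapes and names are our assumptions.

```python
import numpy as np

def log_softmax(z):
    """Numerically stable log-softmax over a 1-D logit vector."""
    z = z - z.max()
    return z - np.log(np.sum(np.exp(z)))

def multilabel_bce(g, betas, biases, y):
    """Average negative log-likelihood over l per-label binary classifiers.
    g: (k,) features; betas: (l, k, 2); biases: (l, 2);
    y: (l, 2) one-hot targets, (1, 0) = label present, (0, 1) = absent."""
    losses = []
    for beta_i, b_i, y_i in zip(betas, biases, y):
        logp = log_softmax(g @ beta_i + b_i)  # 2-way log-probabilities
        losses.append(-np.dot(y_i, logp))     # pick out the true class
    return np.mean(losses)
```

Each label gets its own independent present/absent head, so co-occurring abnormalities are handled naturally during pre-training.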
5.2 Evaluation

We assessed the performance of the proposed losses on two different tasks: (i) clustering, evaluated with the normalized mutual information (NMI) metric, and (ii) image retrieval, evaluated with the Recall@K metric; see Manning et al. [7] for a complete account of these metrics.

Table 4 shows the empirical results obtained after learning the metrics on the 263,513 training images and testing them on the Golden Set. When learning without pre-training (i.e. starting from random weights), ML2+ outperforms ML2 on both tasks and largely improves upon the other alternative losses. When using a pre-trained architecture, improvements can be observed across all methods, and ML2+ obtains a slightly better performance than ML2. Based on these results, we demonstrate the superior performance of our proposed losses with respect to the baselines; moreover, we suspect that ML2+ is able to converge to a better optimum more easily than ML2. In the same table we also report the results obtained with a DCN operating on 1211 × … images, which suggest that a resolution of 299 × 299 pixels may be sufficiently informative.

Figure 3 shows a 2-dimensional representation of the 2,051 exams contained in the Golden Set. This representation was obtained by means of dimensionality reduction using t-distributed Stochastic Neighbor Embedding (t-SNE) [16], which effectively projects the 64-dimensional embeddings extracted from the best model onto 2 dimensions for visualisation purposes. Remarkably, this projection shows that the normal exams are mostly concentrated in a well-separated cluster; moreover, other clusters of exams sharing similar abnormalities have also been identified. The chest radiographs marked with a circle can be seen in Figure 1. These are two examples of radiographs that were originally labelled as normal but ended up being placed away from the cloud of normal exams. A second reading of these exams revealed unreported abnormalities, thus confirming that their position within the embedding was justified.

In a separate task, we tried to predict whether a given chest radiograph contains a radiological abnormality. For this task, we compared the performance of the DCN architecture trained as a multi-label classifier using a cross-entropy loss (the same described above and used for pre-training) against logistic regression classifiers trained on the feature embeddings extracted from our DCNs trained with a metric loss. In Table 5 we present the results we obtained. Performance is evaluated in terms of precision, sensitivity, specificity and F1 score. We used the F1 score instead of accuracy because normal and abnormal exams are not balanced in our data, and in such a case comparing performance using accuracy can be misleading. In comparison to the baseline model, the models based on the learned embeddings obtain better performance, showing a higher proficiency in discriminating between normal and abnormal exams.
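The Recall@K retrieval metric used above can be sketched in numpy. This is a simplified single-label illustration (our code, not the evaluation harness used in the paper); the multi-label setting would additionally need a notion of when a retrieved neighbour counts as relevant.

```python
import numpy as np

def recall_at_k(embeddings, labels, k):
    """Recall@K: fraction of queries whose k nearest neighbours (excluding
    the query itself) contain at least one example with the same label."""
    diffs = embeddings[:, None, :] - embeddings[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # a query may not retrieve itself
    hits = 0
    for i in range(len(embeddings)):
        nn = np.argsort(dists[i])[:k]
        hits += any(labels[j] == labels[i] for j in nn)
    return hits / len(embeddings)
```

With unit-norm embeddings, ranking by Euclidean distance is equivalent to ranking by cosine similarity, so either can be used here.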
6. CONCLUSIONS
In this article we have proposed two loss functions for metric learning with multi-labelled medical images. Their performance has been tested on a very large dataset of chest radiographs. Our initial results demonstrate that learning a metric that captures a notion of radiological similarity is indeed possible; most importantly, the learned metric places normal radiographs far away from the exams that have been reported to contain one or multiple abnormalities. This is a striking result, given the complexity of the visual patterns to be discovered, the degree of noise characterising the radiological labels, and the large variety of scanners and readers included in our study. It is also an important step towards the fully-automated reading of chest radiographs, as being able to recognize normal radiological structures on plain film is key to interpreting any abnormal findings.
Acknowledgments
The authors thank NVIDIA for providing access to a DGX-1 server, which sped up the training and evaluation of all the deep learning algorithms used in this work.
Table 4: Clustering and retrieval results of the different metric learning losses in terms of NMI and R@1,2,4,8 ("–" denotes configurations not run; "…" denotes values not recoverable).
                   Without pre-training                   With pre-training
                   NMI     R@1     R@2     R@4     R@8    NMI     R@1     R@2     R@4     R@8
Contrastive        17.57   32.86   46.76   60.90   74.65  37.76   52.71   64.16   74.99   84.11
Triplet            27.24   41.30   55.58   69.58   81.08  39.46   52.22   66.02   78.25   86.49
ML2                35.79   47.05   62.21   76.26   84.84  …       …       …       …       …
ML2+ (high res.)   –       –       –       –       –      40.80   54.90   68.11   79.62   86.64

Table 5: The classification performance for abnormal exams obtained (i) when the network is trained directly on the classification task (Cross-entropy), (ii) using the embeddings extracted from a network trained with a triplet loss to train a logistic regression classifier (LR on triplet embedding) and (iii) using the embeddings extracted from a network trained with our proposed loss, ML2+, to train a logistic regression classifier (LR on ML2+ embedding). ("…" denotes values not recoverable.)
Method                    Prec.   Sens.   Spec.   F1
Cross-entropy             94.86   …       …       …
LR on triplet embedding   …       …       …       …
LR on ML2+ embedding      …       …       …       …

Figure 3: 2-dimensional embedding of all chest radiographs contained in the golden dataset, learned through the ML2+ loss and visualised via multi-dimensional scaling. Each exam is represented as a point, with different shapes and colors identifying the multiple labels (legend: medical device, normal, pneumothorax, pleural effusion, cardiomegaly; the circled points correspond to the pairs (A1, A2) and (B1, B2) of Fig. 1). A well-separated cluster of "normal" radiographs (green triangles) and exams featuring an enlarged heart are clearly visible. See Fig. 1 for the circled images.

7. REFERENCES
[1] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "siamese" time delay neural network. In NIPS, pages 737–744, 1994.
[2] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification. In CVPR, volume 1, pages 539–546. IEEE, 2005.
[3] S. Cornegruta, R. Bakewell, S. Withey, and G. Montana. Modelling radiological language with bidirectional long short-term memory networks. 2016.
[4] K. J. Geras, S. Wolfson, S. G. Kim, L. Moy, and K. Cho. High-resolution breast cancer screening with multi-view deep convolutional neural networks. CoRR, abs/1703.07047, 2017.
[5] D. M. Hansell, A. A. Bankier, H. MacMahon, T. C. McLoud, N. L. Muller, and J. Remy. Fleischner society: glossary of terms for thoracic imaging. Radiology, 246(3):697–722, 2008.
[6] P. Lakhani and B. Sundaram. Deep learning at chest radiography: Automated classification of pulmonary tuberculosis by using convolutional neural networks. Radiology, 284(2):574–582, 2017. PMID: 28436741.
[7] C. D. Manning, P. Raghavan, and H. Schütze. Introduction to Information Retrieval. Cambridge University Press, Cambridge, UK, 2008.
[8] E. Pesce, P.-P. Ypsilantis, S. Withey, R. Bakewell, V. Goh, and G. Montana. Learning to detect chest radiographs containing lung nodules using visual attention networks. ArXiv e-prints, Dec. 2017.
[9] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 115(3):211–252, 2015.
[10] F. Schroff, D. Kalenichenko, and J. Philbin. FaceNet: A unified embedding for face recognition and clustering. In CVPR, pages 815–823, 2015.
[11] H. Shin, K. Roberts, L. Lu, D. Demner-Fushman, J. Yao, and R. M. Summers. Learning to read chest x-rays: Recurrent neural cascade model for automated image annotation. In CVPR, pages 2497–2506, 2016.
[12] K. Sohn. Improved deep metric learning with multi-class n-pair loss objective. In NIPS, pages 1849–1857, 2016.
[13] H. O. Song, S. Jegelka, V. Rathod, and K. Murphy. Deep metric learning via facility location. In CVPR, 2017.
[14] H. O. Song, Y. Xiang, S. Jegelka, and S. Savarese. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.
[15] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. In CVPR, pages 2818–2826, 2016.
[16] L. van der Maaten. Accelerating t-SNE using tree-based algorithms. Journal of Machine Learning Research, 15:3221–3245, 2014.
[17] B. van Ginneken. Fifty years of computer analysis in chest imaging: rule-based, machine learning, deep learning. Radiological Physics and Technology, 10(1):23–32, 2017.
[18] X. Wang, Y. Peng, L. Lu, Z. Lu, M. Bagheri, and R. M. Summers. ChestX-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. CoRR, abs/1705.02315, 2017.
[19] K. Q. Weinberger, J. Blitzer, and L. K. Saul. Distance metric learning for large margin nearest neighbor classification. In Y. Weiss, P. B. Schölkopf, and J. C. Platt, editors, Advances in Neural Information Processing Systems 18, pages 1473–1480. MIT Press, 2006.
[20] C. Wu, R. Manmatha, A. J. Smola, and P. Krähenbühl. Sampling matters in deep embedding learning. CoRR, abs/1706.07567, 2017.