Comparing Different Deep Learning Architectures for Classification of Chest Radiographs
Keno K. Bressem, Lisa Adams, Christoph Erxleben, Bernd Hamm, Stefan Niehues, Janis Vahldiek
Department of Radiology, Charité Universitätsmedizin Berlin
keno-kyrill.bressem(at)charite.de
Abstract
Chest radiographs are among the most frequently acquired images in radiology and are often the subject of computer vision research. However, most of the models used to classify chest radiographs are derived from openly available deep neural networks, trained on large image datasets. These datasets routinely differ from chest radiographs in that they are mostly color images and contain many possible image classes, while radiographs are grayscale images and often contain fewer image classes. Therefore, very deep neural networks, which can represent more complex relationships in image features, might not be required for the comparatively simpler task of classifying grayscale chest radiographs. We compared fifteen different architectures of artificial neural networks regarding training time and performance on the openly available CheXpert dataset to identify the most suitable models for deep learning tasks on chest radiographs. We could show that smaller networks such as ResNet-34, AlexNet or VGG-16 have the potential to classify chest radiographs as precisely as deeper neural networks such as DenseNet-201 or ResNet-152, while being less computationally demanding.

Introduction

Chest radiographs are among the most frequently used imaging procedures in radiology. They have been widely employed in the field of computer vision, as chest radiographs are a standardized technique and, compared to other radiological examinations such as computed tomography or magnetic resonance imaging, contain a smaller group of relevant pathologies. Although many artificial neural networks for the classification of chest radiographs have been developed, the topic is still the subject of intensive research. Only a few groups design their own networks from scratch; most use already established architectures, such as ResNet-50 or DenseNet-121 (with 50 and 121 representing the number of layers within the respective neural network) [3][5][7][2][14][11].
These neural networks have often been trained on large, openly available datasets, such as ImageNet, and are therefore already able to recognize numerous image features. When training a model for a new task, such as the classification of chest radiographs, the use of pre-trained networks may improve the training speed and accuracy of the new model, since important image features that have already been learned can be transferred to the new task and do not have to be learned again. However, the feature space of freely available datasets such as ImageNet differs from chest radiographs, as they contain color images and more categories. The ImageNet challenge includes 1,000 possible categories per image, while CheXpert, a large freely available dataset of chest radiographs, only distinguishes between 14 categories (or classes) [13]. Although the ImageNet challenge showed a trend towards higher accuracies for deeper networks, this may not be fully transferable to radiology. In radiology, sometimes only limited features of an image can be decisive for the diagnosis. Therefore, images cannot be scaled down arbitrarily, as the required information would otherwise be lost. However, the more complex a neural network architecture is, the more resources are required for training and deployment of such an algorithm. As up-scaling the input-image resolution exponentially increases memory usage during training for large neural networks that evaluate many parameters, the size of a mini-batch needs to be reduced earlier and more strongly, potentially affecting optimizers such as stochastic gradient descent. Therefore, it is currently not clear which of the available artificial neural networks designed for and trained on the ImageNet dataset will perform best for the classification of chest radiographs. The hypothesis of this work is that shallow networks are already sufficient for the classification of radiographs and might even outperform deeper networks while requiring fewer resources.
Therefore, we systematically examine the performance of fifteen openly available artificial neural network architectures in order to identify the most suitable ones for the basic classification of chest radiographs.
Methods
Data preparation
The freely available CheXpert dataset consists of 224,316 chest radiographs of 65,240 patients. Fourteen findings have been annotated for each image: enlarged cardiomediastinum, cardiomegaly, lung opacity, lung lesion, edema, consolidation, pneumonia, atelectasis, pneumothorax, pleural effusion, pleural other, fracture and support devices. Each finding can be annotated as present (1), absent (NA) or uncertain (-1). Similar to previous work on the classification of the CheXpert dataset [7][16], we trained the networks on a subset of labels: cardiomegaly, edema, consolidation, atelectasis and pleural effusion. As we only aim at network comparison and not at maximal precision of a neural network, each image with an uncertainty label was excluded from this analysis; other approaches such as zero imputation or self-training were not adopted. Furthermore, only frontal radiographs were used, leaving 135,494 images from 53,388 patients for training. CheXpert offers an additional dataset of 235 images (201 images after excluding uncertainty labels and lateral radiographs), annotated by two independent radiologists, which is intended as an evaluation dataset and was therefore used for this purpose.
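The filtering described above can be sketched with pandas, assuming a CheXpert-style CSV with one column per finding (1 = present, -1 = uncertain, NaN = absent) and a "Frontal/Lateral" column; the tiny inline table and its column names are illustrative, not the real dataset:

```python
# Sketch of the label filtering described above on a toy CheXpert-style table.
import pandas as pd

TARGETS = ["Cardiomegaly", "Edema", "Consolidation", "Atelectasis", "Pleural Effusion"]

df = pd.DataFrame({
    "Path": ["p1.jpg", "p2.jpg", "p3.jpg", "p4.jpg"],
    "Frontal/Lateral": ["Frontal", "Lateral", "Frontal", "Frontal"],
    "Cardiomegaly": [1.0, 1.0, -1.0, float("nan")],
    "Edema": [float("nan"), 1.0, 1.0, 1.0],
    "Consolidation": [1.0, float("nan"), 1.0, 1.0],
    "Atelectasis": [float("nan")] * 4,
    "Pleural Effusion": [1.0, 1.0, 1.0, 1.0],
})

# Keep only frontal radiographs ...
frontal = df[df["Frontal/Lateral"] == "Frontal"]
# ... and drop every image carrying an uncertainty label (-1) in any target finding.
certain = frontal[(frontal[TARGETS] != -1.0).all(axis=1)]
print(list(certain["Path"]))  # ['p1.jpg', 'p4.jpg']
```

Note that NaN (absent) survives the `!= -1.0` comparison, so only explicit uncertainty labels are removed, matching the exclusion rule above.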
Data augmentation
For the first and second training session, the images were scaled to 320 x 320 pixels using bilinear interpolation, and pixel values were normalized. During training, multiple image transformations were applied: flipping of the images along the horizontal and vertical axis, rotation of up to 10°, zooming of up to 110%, adding of random lighting, and symmetric warping.
Evaluation
Evaluation was performed using the "R" statistical environment, including the "tidyverse" and "ROCR" libraries [12][20][18]. Predictions on the validation dataset of the five models for each network architecture were pooled so that the models could be evaluated as a consortium. For each individual prediction as well as for the pooled predictions, receiver operating characteristic (ROC) curves and precision recall curves (PRC) were plotted, and the areas under each curve were calculated (AUROC and AUPRC). AUROC and AUPRC were chosen as they enable a comparison of different models independent of a chosen threshold for the classification.
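The same metrics can be computed in Python (the paper used R with ROCR); the sketch below reads "pooling" as averaging the five runs' predicted probabilities per image, which is one plausible interpretation of the consortium evaluation, and uses synthetic data:

```python
# Illustrative AUROC/AUPRC computation for five pooled runs (synthetic data).
import numpy as np
from sklearn.metrics import auc, precision_recall_curve, roc_auc_score

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=200)   # ground truth for one finding, 200 images
# Five training runs of one architecture, each predicting the same 200 images:
runs = [np.clip(y_true + rng.normal(0, 0.6, size=200), 0, 1) for _ in range(5)]

pooled = np.mean(runs, axis=0)          # consortium prediction
auroc = roc_auc_score(y_true, pooled)
precision, recall, _ = precision_recall_curve(y_true, pooled)
auprc = auc(recall, precision)          # area under the PR curve
print(round(auroc, 3), round(auprc, 3))
```

Both areas are threshold-independent summaries, which is why they suit a comparison across fifteen architectures.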
Results
The CheXpert validation dataset consists of 234 studies of 200 patients, which were not used for training and contain no uncertainty labels. After excluding lateral radiographs (n = 32), 202 images of 200 patients remained. The dataset presents class imbalances (% positives for each finding: cardiomegaly 33%, edema 21%, consolidation 16%, atelectasis 37%, pleural effusion 32%), so that the AUPRC as well as the AUROC can be considered equally important measurements of network performance. The performance of the tested networks is compared to the AUROC reported by Irvin et al. [7]. However, only values for AUROC, but not for AUPRC, are provided there. In most cases, the best results were achieved with a batch size of 32, so all the information provided below refers to models trained with this batch size. Results achieved with a smaller batch size of 16 are explicitly mentioned.
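Class imbalance is why the AUPRC matters here: for an uninformative classifier the AUPRC baseline equals the prevalence of the finding, while the AUROC baseline stays at 0.5 regardless of imbalance. A small sketch with illustrative numbers (average precision is used as the AUPRC estimate):

```python
# For a no-discrimination model, average precision equals the prevalence.
import numpy as np
from sklearn.metrics import average_precision_score

y_true = np.array([1] * 16 + [0] * 84)   # e.g. 16% positives, as for consolidation
y_const = np.full(100, 0.5)              # constant score: no discrimination

ap = average_precision_score(y_true, y_const)
print(ap)  # 0.16 -> the prevalence
```

A model's AUPRC therefore has to be judged against the per-finding prevalence, not against a fixed 0.5 baseline as for the AUROC.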
Area under the Receiver OperatingCharacteristic Curve
Deeper artificial neural networks generally achieved higher AUROC values than shallow networks (Table 1 and Figures 1-3). Regarding the pooled AUROC for the detection of the five pathologies, ResNet-152 (0.882), DenseNet-161 (0.881) and ResNet-50 (0.881) performed best (Irvin et al. CheXpert baseline 0.889) [7]. Broken down by individual findings, the most accurate detection of atelectasis was achieved by ResNet-18 (0.816, batch size 16), ResNet-101 (0.813, batch size 16), VGG-19 (0.813, batch size 16) and ResNet-50 (0.811). For detection of cardiomegaly, the best four models surpassed the CheXpert baseline of 0.828 (ResNet-34 0.840, ResNet-152 0.836, DenseNet-161 0.834, ResNet-50 0.832). For consolidation, the highest AUROC was achieved using ResNet-152 (0.917), ResNet-50 (0.916) and DenseNet-161 (0.913). Pulmonary edema was most accurately detected using DenseNet-161 (0.923), DenseNet-169 (0.922) and DenseNet-201 (0.922). For pleural effusion, the four best models were ResNet-152 (0.937), ResNet-101 (0.936), ResNet-50 (0.934) and DenseNet-169 (0.934), all of which performed superior to the CheXpert baseline of 0.928.
Area under the Precision Recall Curve
For the AUPRC, shallower artificial neural networks achieved higher values than deeper network architectures (Table 2 and Figures 4-6). The highest pooled values for the AUPRC were achieved by training VGG-16 (0.709), AlexNet (0.701) and ResNet-34 (0.688). For atelectasis, VGG-16 and AlexNet both achieved the highest AUPRC of 0.732, followed by ResNet-34 with 0.652. Cardiomegaly was most accurately detected by SqueezeNet-1.0 (0.565), AlexNet (0.565) and VGG-13 (0.563). SqueezeNet-1.0 also achieved the highest AUPRC values for consolidation (0.815), followed by ResNet-152 (0.810) and ResNet-50 (0.809). The best classifications of pulmonary edema were achieved by DenseNet-169, DenseNet-161 (both 0.743) and DenseNet-201 (0.742). Finally, for pleural effusion, ResNet-101 and ResNet-152 achieved the highest AUPRC of 0.591, followed by ResNet-50 (0.590).
Overall best Performance
Considering both AUROC and AUPRC, the best performance was achieved by VGG-16 (AUROC: 0.856, AUPRC: 0.709), ResNet-34 (AUROC: 0.872, AUPRC: 0.688) and AlexNet (AUROC: 0.839, AUPRC: 0.701), all with a batch size of 32.
Training time
Fourteen different network architectures were trained ten times each with a multilabel classification head (five times each for a batch size of 16 or 32 at an input-image resolution of 320 x 320 pixels) and once with a binary classification head for each finding, resulting in 210 individual training runs. Overall, training took 340 hours. As expected, the training of deeper networks required more time than the training of shallower networks. For an image resolution of 320 x 320 pixels, the training of AlexNet required the least amount of time, with a time per epoch of 2:29 to 2:50 minutes and a total duration of 20 minutes for a batch size of 32. Using a smaller batch size of 16, the time per epoch rose to 2:59 - 3:06 minutes and the total duration to 24 minutes. In contrast, using a batch size of 16, training of a DenseNet-201 took the longest, with 5:11 hours and epochs requiring 41 minutes. For a batch size of 32, training a DenseNet-169 required the largest amount of time, with 3:06 hours (epochs between 21 and 27 minutes). Increasing the batch size from 16 to 32 led to an average acceleration of training by 29.9%.

Discussion
In the present work, different architectures of artificial neural networks are analyzed with respect to their performance for the classification of chest radiographs. We could show that more complex neural networks do not necessarily perform better than shallow networks. Instead, an accurate classification of chest radiographs may be achieved with comparably shallow networks, such as AlexNet (8 layers), ResNet-34 or VGG-16, which surpass even complex deep networks such as ResNet-152 or DenseNet-201. The use of smaller neural networks has the advantage that hardware requirements and training time are lower compared to deeper networks. Shorter training times allow testing more hyperparameters, simplifying the overall training process. Lower hardware requirements also enable the use of higher image resolutions. This could be of relevance for the evaluation of chest radiographs with a native resolution of 2048 x 2048 px to 4280 x 4280 px, where specific findings, such as a small pneumothorax, require larger input-image resolutions, because otherwise the crucial information regarding their presence could be lost due to downscaling. Furthermore, shorter training times might simplify the integration of improvement methods into the training process, such as the implementation of 'human in the loop' annotations. 'Human in the loop' implies that the training of a network is supervised by a human expert, who may intervene and correct the network at critical steps. For example, the human expert can check the misclassifications with the highest loss for incorrect labels, thus effectively reducing label noise. With shorter training times, such feedback loops can be executed faster. In the CheXpert dataset, which was used as the groundwork for the present analysis, labels for the images were generated using a specifically developed natural language processing tool, which did not produce perfect labels.
For example, the F1 scores for the mention and subsequent negation of cardiomegaly were 0.973 and 0.909, and the F1 score for an uncertainty label was 0.727. Therefore, it can be assumed that there is a certain amount of noise in the training data, which might affect the accuracy of the models trained on it. Implementing a human-in-the-loop approach for partially correcting the label noise could further improve the performance of networks trained on the CheXpert dataset [8]. Our findings differ from the techniques applied in previous literature, where deeper network architectures, mainly a DenseNet-121, were used instead of small networks to classify the CheXpert dataset [11][1][15]. The authors of the CheXpert dataset achieved an average overall AUROC of 0.889 [7] using a DenseNet-121, which was not surpassed by any of the models used in our analysis, although the differences between the best performing networks and the CheXpert baseline were smaller than 0.01. It should be noted, however, that in our analysis the hyperparameters for the models were probably not selected as precisely as in the original CheXpert paper by Irvin et al., since the focus of this work was more on comparing the architectures and not on the complete optimization of one specific network. Still, we identified models which achieved higher AUROC values in two of the five findings (cardiomegaly and effusion). Pham et al. also used a DenseNet-121 as the basis for their model and proposed the most accurate model for the CheXpert dataset, with a mean AUROC of 0.940 for the five selected findings [11]. Their good results are probably due to the hierarchical structure of the classification framework, which takes into account correlations between different labels, and the application of a label-smoothing technique, which also allows the use of uncertainty labels (which were excluded in our present work). Allaouzi et al. similarly used a DenseNet-121 and created three different models for the classification of CheXpert and ChestX-ray14, yielding an AUC of 0.72 for atelectasis, 0.87-0.88 for cardiomegaly, 0.74-0.77 for consolidation, 0.86-0.87 for edema and 0.90 for effusion [1]. Except for cardiomegaly, we achieved better values with several models (e.g. ResNet-34, ResNet-50, AlexNet, VGG-16). We would interpret this as evidence that complex deep networks are not necessarily superior to shallower networks for chest X-ray classification. At least for the CheXpert dataset, it seems that methods optimizing the handling of uncertainty labels and the hierarchical structure of the data are important to improve model performance. Sabottke et al. trained a ResNet-32 for the classification of chest radiographs and are therefore one of the few groups using a smaller network [15]. With an AUROC of 0.809 for atelectasis, 0.925 for cardiomegaly, 0.888 for edema and 0.859 for effusion, their network did not perform as well as some of our tested networks. Raghu et al. employed a ResNet-50, an Inception-v3 as well as a custom-designed small network. Similar to our findings, they observed that smaller networks showed a comparable performance to deeper networks [13].
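The 'human in the loop' review step discussed above, surfacing the misclassifications with the highest loss for expert inspection, can be sketched in a few lines; function name and toy data are illustrative:

```python
# Sketch of the 'human in the loop' step: rank validation images by their
# loss so an expert reviews the most suspicious labels first.
import numpy as np

def top_loss_indices(y_true: np.ndarray, y_prob: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k samples with the highest binary cross-entropy."""
    eps = 1e-7
    p = np.clip(y_prob, eps, 1 - eps)
    bce = -(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))
    return np.argsort(bce)[::-1][:k]

# A confidently wrong prediction (index 2) should be surfaced first:
labels = np.array([1.0, 0.0, 1.0, 0.0])
probs = np.array([0.9, 0.2, 0.05, 0.4])   # index 2: labeled 1, predicted 0.05
print(top_loss_indices(labels, probs, k=2))   # [2 3]
```

An expert then confirms or corrects the labels of these top-k samples before the next training round, shrinking the label noise with each pass.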
Conclusion
In the present work, we could show that smaller artificial neural networks for the classification of chest radiographs can perform similarly to, or even surpass, deeper and very deep neural networks. In contrast to many previous studies, which mostly used a DenseNet-121, we achieved the best results with up to 95% smaller networks. Using smaller networks therefore has the advantage of lower hardware requirements, as they require less GPU RAM and can be trained faster without loss of performance.

Tables
Table 1 Area under the Receiver Operating Characteristic Curve
Network Batchsize Atelectasis Cardiomegaly Consolidation Edema Effusion Pooled
CheXpert baseline 16 0.818 0.828 0.938 0.934 0.928 0.889
ResNet-18 16 0.816 0.797 0.905 0.868 0.899 0.857
ResNet-34 16 0.799 0.798 0.902 0.891 0.905 0.859
ResNet-50 16 0.798 0.799 0.890 0.880 0.913 0.856
ResNet-101 16 0.813 0.810 0.905 0.889 0.907 0.865
ResNet-152 16 0.801 0.809 0.908 0.896 0.916 0.866
DenseNet-121 16 0.809 0.794 0.895 0.883 0.906 0.857
DenseNet-161 16 0.800 0.817 0.885 0.900 0.923 0.865
DenseNet-169 16 0.805 0.795 0.898 0.891 0.909 0.860
DenseNet-201 16 0.805 0.812 0.891 0.886 0.916 0.862
AlexNet 16 0.790 0.755 0.857 0.894 0.881 0.835
SqueezeNet-1.0 16 0.761 0.755 0.833 0.907 0.885 0.828
SqueezeNet-1.1 16 0.767 0.764 0.880 0.903 0.879 0.839
VGG-13 16 0.798 0.752 0.886 0.867 0.872 0.835
VGG-16 16 0.809 0.766 0.892 0.879 0.883 0.846
VGG-19 16 0.811 0.786 0.901 0.890 0.884 0.854
ResNet-18 32 0.796 0.822 0.908 0.903 0.911 0.868
ResNet-34 32 0.797
Table 1 shows the different areas under the receiver operating characteristic curve (AUROC) for each of the network architectures and each individual finding, as well as the pooled AUROC per model. According to the pooled AUROC, ResNet-152, ResNet-50 and DenseNet-161 were the best models, while SqueezeNet and AlexNet showed the poorest performance. For cardiomegaly, ResNet-34, ResNet-50, ResNet-152 and DenseNet-161 surpassed the CheXpert baseline provided by Irvin et al. ResNet-50, ResNet-101, ResNet-152 and DenseNet-169 also surpassed the CheXpert baseline for pleural effusion. A batch size of 32 often led to better results compared to a batch size of 16.

Table 2 Area under the Precision Recall Curve
Network Batchsize Atelectasis Cardiomegaly Consolidation Edema Effusion Pooled
ResNet-18 16 0.500 0.559 0.806 0.727 0.580 0.634
ResNet-34 16 0.506 0.560 0.804 0.735 0.580 0.637
ResNet-50 16 0.501 0.557 0.802 0.733 0.585 0.636
ResNet-101 16 0.499 0.558 0.765 0.735 0.582 0.628
ResNet-152 16 0.503 0.559 0.808 0.737 0.584 0.638
DenseNet-121 16 0.503 0.554 0.802 0.733 0.580 0.634
DenseNet-161 16 0.501 0.557 0.799 0.736 0.587 0.636
DenseNet-169 16 0.500 0.560 0.805 0.733 0.582 0.636
DenseNet-201 16 0.320 0.555 0.445 0.734 0.582 0.527
AlexNet 16 0.543 0.565 0.490 0.733 0.577 0.582
SqueezeNet-1.0 16 0.509 0.565 0.425 0.736 0.576 0.562
SqueezeNet-1.1 16 0.505 0.563 0.400 0.733 0.575 0.555
VGG-13 16 0.502 0.563 0.761 0.726 0.574 0.625
VGG-16 16 0.501 0.559 0.797 0.733 0.577 0.633
VGG-19 16 0.500 0.558 0.808 0.731 0.577 0.635
ResNet-18 32 0.502 0.557 0.805 0.736 0.582 0.636
ResNet-34 32 0.652 0.556 0.806 0.737 0.585 0.667
ResNet-50 32 0.497 0.555 0.809 0.740 0.590 0.638
ResNet-101 32 0.500 0.558 0.808 0.740 0.591 0.639
ResNet-152 32 0.502 0.559 0.810 0.741 0.591 0.641
DenseNet-121 32 0.500 0.558 0.793 0.736 0.587 0.635
DenseNet-161 32 0.499 0.556 0.808 0.743 0.589 0.639
DenseNet-169 32 0.499 0.556 0.805 0.743 0.588 0.638
DenseNet-201 32 0.502 0.555 0.808 0.742 0.589 0.639
AlexNet 32 0.720 0.562 0.789 0.731 0.578 0.676
SqueezeNet-1.0 32 0.354 0.562 0.815 0.738 0.580 0.610
SqueezeNet-1.1 32 0.506 0.563 0.804 0.731 0.577 0.636
VGG-13 32 0.501 0.560 0.799 0.735 0.578 0.635
VGG-16 32 0.732 0.561 0.804 0.739 0.582 0.684
VGG-19 32 0.501 0.562 0.800 0.740 0.585 0.638
Table 2 shows the area under the precision recall curve (AUPRC) for all networks and findings. In contrast to the AUROC, where deeper models achieved higher values, shallower networks yielded the best results for the AUPRC (ResNet-34, AlexNet, VGG-16). DenseNet-201 and SqueezeNet showed the lowest AUPRC values. Again, a batch size of 32 appeared to deliver better results compared to a batch size of 16.

Table 3 Duration of Training
Network Batchsize Duration/Epoch Duration/Training
ResNet-18 16 6 min 50 min
ResNet-34 16 10 min 1 h 13 min
ResNet-50 16 11 min - 13 min 1 h 40 min
ResNet-101 16 19 min - 25 min 2 h 47 min
ResNet-152 16 27 min - 28 min 4 h 7 min
SqueezeNet-1.0 16 4 min - 6 min 39 min
SqueezeNet-1.1 16 4 min 37 min
AlexNet 16 3 min 24 min
VGG-13 16 12 min 1 h 49 min
VGG-16 16 20 min - 21 min 2 h 14 min
VGG-19 16 24 min 2 h 40 min
DenseNet-121 16 23 min - 25 min 3 h 7 min
DenseNet-169 16 31 min - 34 min 4 h 21 min
DenseNet-161 16 29 min - 36 min 4 h 17 min
DenseNet-201 16 41 min 5 h 11 min
ResNet-18 32 4 min 31 min
ResNet-34 32 5 min - 7 min 45 min
ResNet-50 32 8 min 1 h 16 min
ResNet-101 32 13 min 2 h 8 min
ResNet-152 32 21 min - 26 min 2 h 58 min
SqueezeNet-1.0 32 3 min - 4 min 28 min
SqueezeNet-1.1 32 3 min 25 min
AlexNet 32 2 min - 3 min 20 min
VGG-13 32 10 min - 14 min 1 h 31 min
VGG-16 32 17 min 1 h 47 min
VGG-19 32 13 min 2 h 2 min
DenseNet-121 32 12 min - 16 min 1 h 49 min
DenseNet-169 32 17 min 2 h 25 min
DenseNet-161 32 21 min - 27 min 3 h 6 min
DenseNet-201 32 20 min 2 h 52 min
Table 3 provides an overview of the training time per epoch (duration/epoch) and the overall training time (duration/training) for each neural network. The times given are the averages of five training runs, rounded to the nearest minute.

Figures
Receiver Operating Characteristic Curves
Figures 1, 2 and 3 display the ROC curves for all models. The colored lines represent a single training; black lines represent the pooled performance over five trainings.
Figure 1 ROC curves for ResNet-18, ResNet-34, ResNet-50, ResNet-101 and ResNet-152. Each network has one panel per finding (atelectasis, cardiomegaly, consolidation, edema, pleural effusion), plotting true positive rate against false positive rate.

Figure 2 ROC curves for DenseNet-121, DenseNet-161, DenseNet-169, DenseNet-201 and AlexNet, with the same panel layout as Figure 1.

Figure 3 ROC curves for SqueezeNet-1.0, SqueezeNet-1.1, VGG-13, VGG-16 and VGG-19, with the same panel layout as Figure 1.

Precision Recall Curves

Figures 4, 5 and 6 display the precision recall curves for all models. The colored lines represent a single training; black lines represent the pooled performance over five trainings.
Figure 4 Precision recall curves for ResNet-18, ResNet-34, ResNet-50, ResNet-101 and ResNet-152. Each network has one panel per finding (atelectasis, cardiomegaly, consolidation, edema, pleural effusion), plotting recall against precision.

Figure 5 Precision recall curves for DenseNet-121, DenseNet-161, DenseNet-169, DenseNet-201 and AlexNet, with the same panel layout as Figure 4.
Figure 6 Precision recall curves for SqueezeNet-1.0, SqueezeNet-1.1, VGG-13, VGG-16 and VGG-19, with the same panel layout as Figure 4.

References

[1] Imane Allaouzi and Mohamed Ben Ahmed. "A novel approach for multi-label chest X-ray classification of common thorax diseases". In: IEEE Access (2019).
[2] arXiv preprint arXiv:1901.07441 (2019).
[3] Kaiming He et al. "Deep residual learning for image recognition". In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2016, pp. 770-778.
[4] Jeremy Howard et al. fastai. https://github.com/fastai/fastai. 2018.
[5] Gao Huang et al. "Densely connected convolutional networks". In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2017, pp. 4700-4708.
[6] Forrest N Iandola et al. "SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size". In: arXiv preprint arXiv:1602.07360 (2016).
[7] Jeremy Irvin et al. "CheXpert: A large chest radiograph dataset with uncertainty labels and expert comparison". In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 33. 2019, pp. 590-597.
[8] Davood Karimi et al. "Deep learning with noisy labels: exploring techniques and remedies in medical image analysis". In: arXiv preprint arXiv:1912.02911 (2019).
[9] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. "ImageNet Classification with Deep Convolutional Neural Networks". In: Advances in Neural Information Processing Systems 25. Ed. by F. Pereira et al. Curran Associates, Inc., 2012, pp. 1097-1105. url: http://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf.
[10] Adam Paszke et al. "PyTorch: An Imperative Style, High-Performance Deep Learning Library". In: Advances in Neural Information Processing Systems 32. Ed. by H. Wallach et al. Curran Associates, Inc., 2019, pp. 8024-8035. url: http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf.
[11] Hieu H Pham et al. "Interpreting chest X-rays via CNNs that exploit disease dependencies and uncertainty labels". In: arXiv preprint arXiv:1911.06475 (2019).
[12] R Core Team. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. Vienna, Austria, 2019. url: https://www.R-project.org/.
[13] Maithra Raghu et al. "Transfusion: Understanding transfer learning for medical imaging". In: Advances in Neural Information Processing Systems. 2019, pp. 3342-3352.
[14] Pranav Rajpurkar et al. "CheXNet: Radiologist-level pneumonia detection on chest x-rays with deep learning". In: arXiv preprint arXiv:1711.05225 (2017).
[15] Carl F. Sabottke and Bradley M. Spieler. "The Effect of Image Resolution on Deep Learning in Radiography". In: Radiology: Artificial Intelligence. doi: 10.1148/ryai.2019190015.
[16] Carl F Sabottke and Bradley M Spieler. "The effect of image resolution on deep learning in radiography". In: Radiology: Artificial Intelligence.
[17] Karen Simonyan and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition". In: arXiv preprint arXiv:1409.1556 (2014).
[18] T. Sing et al. "ROCR: visualizing classifier performance in R". In: Bioinformatics. url: http://rocr.bioinf.mpi-sb.mpg.de.
[19] Leslie N Smith. "Cyclical learning rates for training neural networks". In: 2017 IEEE Winter Conference on Applications of Computer Vision (WACV). IEEE. 2017, pp. 464-472.
[20] Hadley Wickham. tidyverse: Easily Install and Load the 'Tidyverse'. R package version 1.2.1. 2017.