Flow-Mixup: Classifying Multi-labeled Medical Images with Corrupted Labels
Jintai Chen, Hongyun Yu, Ruiwei Feng, Danny Z. Chen, Jian Wu
Jintai Chen
College of Computer Science and Technology, Zhejiang University
Hangzhou, China
[email protected]

Hongyun Yu
College of Computer Science and Technology, Zhejiang University
Hangzhou, China
[email protected]

Ruiwei Feng
College of Computer Science and Technology, Zhejiang University
Hangzhou, China
[email protected]

Danny Z. Chen
Department of Computer Science and Engineering, University of Notre Dame
Notre Dame, IN 46556, USA
[email protected]

Jian Wu (corresponding author)
Zhejiang University School of Medicine, Zhejiang University
Hangzhou, China
[email protected]
Abstract—In clinical practice, medical image interpretation often involves multi-labeled classification, since the affected parts of a patient tend to present multiple symptoms or comorbidities. Recently, deep learning based frameworks have attained expert-level performance on medical image interpretation, which can be attributed partially to large amounts of accurate annotations. However, manually annotating massive amounts of medical images is impractical, while automatic annotation is fast but imprecise (possibly introducing corrupted labels). In this work, we propose a new regularization approach, called Flow-Mixup, for multi-labeled medical image classification with corrupted labels. Flow-Mixup guides the models to capture robust features for each abnormality, thus helping handle corrupted labels effectively and making it possible to apply automatic annotation. Specifically, Flow-Mixup decouples the extracted features by adding constraints to the hidden states of the models. Also, Flow-Mixup is more stable and effective compared to other known regularization methods, as shown by theoretical and empirical analyses. Experiments on two electrocardiogram datasets and a chest X-ray dataset containing corrupted labels verify that Flow-Mixup is effective and insensitive to corrupted labels.
Index Terms—deep learning, regularization, multi-labeled image classification, mixup
I. INTRODUCTION
Medical image classification is critical in clinical practice (e.g., for early detection of diseases). However, medical image classification is still a challenging task due to large intra-class variations, blurred boundaries between abnormalities, inconclusive abnormality patterns, etc. Furthermore, it is common for a patient to suffer multiple symptoms simultaneously, which present several kinds of abnormalities and some complex comorbidities in the medical images. Therefore, medical image interpretation is commonly a multi-labeled image classification process. Recently, supervised deep learning models have attained high performance thanks to large amounts of well-annotated data for model training. However, it is time-consuming to annotate medical images manually by medical experts, while automatic annotation (e.g., automatically extracting labels from reports [22]) is fast but possibly introduces considerable corrupted (incorrect) labels.

Many methods, such as model ensemble [17], weighted loss functions [8], and label hierarchy [4], have been widely utilized in multi-labeled medical image classification. However, dealing with corrupted labels of multi-labeled medical images was rarely studied, although it is a basic issue for using automatically annotated labels. It was shown that regularization methods can hinder the memorization of models while preserving their generalization ability [2], which is advantageous for tackling label corruption. Many known regularization methods (e.g., Mixup [25], Manifold Mixup [21]) were proposed for single-output tasks, but do not meet the needs of multi-output tasks [24]. Besides, the existence of complex correlations among abnormalities was confirmed in the literature [3], [22], which requires additional considerations in model training. Thus, multi-labeled medical image classification with corrupted labels is a challenging problem and requires further research effort.

To this end, in this paper, we propose a new regularization approach called Flow-Mixup for multi-labeled medical image classification with label corruption. Specifically, we introduce a new dimension called the "flow dimension" for the feature tensors in hidden states and apply a novel Mixing module to a selected hidden state (in this paper, a "hidden state" denotes the general output of a model layer, and a "feature" refers to a particular representation of some data). Thus, the model layers ahead of the selected hidden state are restricted to learning a nonlinear function while the subsequent layers are restricted to learning a linear function. Flow-Mixup guides the nonlinear part to decouple the complex features (where the features of abnormalities are correlative) into abnormality-specific features before feeding the features to the linear part. The decoupling is guaranteed as the linear function requires its input features to lie in a linearly separable space. We compare Flow-Mixup with Mixup [25] and Manifold Mixup [21] to highlight the characteristics of Flow-Mixup.

This work makes three main contributions:
1) We propose a new regularization method called Flow-Mixup for multi-labeled medical image classification, and show that Flow-Mixup is insensitive to corrupted labels.
2) We compare Flow-Mixup with Mixup [25] and Manifold Mixup [21], and show that the "correlation conflicts" phenomenon and the "distribution shift" phenomenon occur when using Mixup or Manifold Mixup.
3) Experiments on several multi-labeled medical image classification datasets with corrupted labels verify that our Flow-Mixup outperforms known regularization methods.

II. RELATED WORK
A. Multi-labeled Medical Image Classification
Various automatic medical image interpretation applications involve multi-labeled image classification tasks, such as chest X-ray (CXR) interpretation [3], [5], [14], [17], [22], electrocardiogram (ECG) monitoring [7], [12], [18], comorbidity identification of Alcohol Use Disorder and human immunodeficiency virus infection [1], bone fracture type diagnosis [16], etc. To better handle multi-labeled classification tasks, a new loss function was proposed to guide deep learning models to search the subspace of abnormality features [14], and label hierarchy [4] and matrix completion [1] methods were also used in correlative feature refinement. In [6], an approach was specially designed to calculate the uncertainty of automated diagnosis. Abnormality location perception was considered in [8], [19] for CXR image classification. Also, adversarial learning approaches were designed for data augmentation and disease severity assessment in CXR [15], [23] and ECG [7] classification. A large dataset [22] catalyzed multi-labeled classification methods on CXR images. However, the labels in this dataset were mined from radiology reports by natural language processing (NLP), and the text-mined labels were somewhat corrupted. Most of the known supervised multi-labeled classification methods focused on tackling feature correlations among abnormalities, but few of them considered label corruption.
B. Regularization Methods
Regularization methods are useful for dealing with label corruption [2]. Kurmann et al. [14] managed to drive class-specific features into different affine subspaces and enlarge the distances between the subspaces; this method outperformed the vanilla methods in multi-labeled CXR image classification. Many data augmentation methods were used to deal with multi-labeled medical image classification [3], [8], [17], which had a similar effect as regularization methods. The state-of-the-art regularization methods for single-labeled classification are Mixup [9], [25] and Manifold Mixup [21], which introduce linear constraints into the models. However, both of them are not very suitable for multi-labeled classification because of the "correlation conflicts" and "distribution shift" phenomena (discussed in Sec. IV). Mixup ignores the feature correlations among abnormalities, while Manifold Mixup is often unstable in training. In this paper, we propose Flow-Mixup for multi-labeled medical image classification, which avoids the drawbacks of Mixup and Manifold Mixup.

III. APPROACH
A. Preliminaries
Mixup [25] introduced a linear constraint to single-labeled classification and achieved good performance. Considering a deep learning classifier as a function $h(\cdot)$, the standard Mixup is defined as:
$$h(px_p + qx_q) = py_p + qy_q \quad (1)$$
where $x_p$ and $x_q$ are two input images while $y_p$ and $y_q$ are the corresponding labels, with $q = 1 - p$. Mixup regularization restricts the whole model (the function $h$) to be a linear function, as $h(px_p + qx_q) = py_p + qy_q = ph(x_p) + qh(x_q)$. Similarly, Manifold Mixup [21] applies the mixing operation of Eq. (1) to a hidden state, and restricts the subsequent parts of the model to learning a linear function. Note that a "linear function" and a "nonlinear function" are different from a "linear layer" and a "nonlinear layer" of a neural network, as the former concepts are about the learning objectives while the latter concepts are about the model entities.
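For concreteness, the mixing operation of Eq. (1) can be sketched in a few lines of PyTorch; `mixup_batch` is a hypothetical helper name, and the labels are assumed to be float one-hot or multi-hot vectors so that they can be interpolated:

```python
import torch

def mixup_batch(x, y, alpha=1.0):
    # Sample the mixing coefficient p ~ Beta(alpha, alpha), as in [25].
    p = torch.distributions.Beta(alpha, alpha).sample()
    # Pair each sample with a randomly index-shuffled partner (x_q, y_q).
    idx = torch.randperm(x.size(0))
    x_mixed = p * x + (1 - p) * x[idx]  # p*x_p + q*x_q with q = 1 - p
    y_mixed = p * y + (1 - p) * y[idx]  # p*y_p + q*y_q
    return x_mixed, y_mixed
```

Training then minimizes the usual loss on (x_mixed, y_mixed), which is what enforces the linear constraint $h(px_p + qx_q) = py_p + qy_q$.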
B. An Overview of Flow-Mixup

In this paper, we propose a new regularization approach, Flow-Mixup, for multi-labeled medical image classification. Consider a deep learning classifier $h(x) = f(g(x))$, where $g$ is a nonlinear function and $f$ is a linear function. A training forward pass with Flow-Mixup takes several steps. First, before training, we select a hidden state $s$ to split the model into a nonlinear part and a linear part, so that $s = g(x)$ and $y = f(s)$, where $y$ is the model output. Second, we process the data (e.g., the images) forward to the selected hidden state, and apply a new Mixing module to the features in the hidden state (our Mixing module is detailed below). After being processed by the Mixing module, the features continue the forward propagation until the output. With the Mixing module, Flow-Mixup restricts the front part of the model to learning a nonlinear function, and the rest of the model serves as a linear function. In dealing with multi-labeled medical images, the nonlinear function extracts abnormality-specific features, and the linear function (the subsequent part of the model) projects the abnormality-specific features into the label spaces. The constraint on the nonlinear part is guaranteed, as the output of the nonlinear part is fed to the linear part, which requires its input to lie in a linearly separable space. Different from Manifold Mixup, the special Mixing module introduces an extra flow dimension, so simultaneously using several Mixing modules in a model is allowed.
Fig. 1. The left part shows the structure of the Mixing module. The right part gives the details of the mixing operation of Flow-Mixup. A copy of the feature maps is made and processed by the mixing operation, and then is batch-wise concatenated to the original ones.
C. Mixing Module
Generally, the tensors of an image in deep learning models have 4 dimensions: the batch dimension, the channel dimension, and the width and height dimensions. Our proposed Flow-Mixup introduces a new dimension, called the flow dimension. As shown in the left part of Fig. 1, assume that the original feature $z$ has a flow dimension of size 1 before being processed by the Mixing module; then the output of the Mixing module $[z, z_{mixed}]$ has a flow dimension of size 2. The flow size is increased by the feature concatenation operation. After the features are fed to the Mixing module, the first step is to make a copy of these features. Then, the feature copy is processed by a mixing operation and concatenated to the original features along the flow dimension. The forward process of the Mixing module is defined as:
$$z' = M(p, z) = [z, \mathrm{Mixing}(p, z)] = [z, z_{mixed}] \quad (2)$$
where the mixing operation $\mathrm{Mixing}(\cdot, \cdot)$ transforms a feature copy $(z_{copy}, y)$ into two mini-copies $((z, y), (z', y'))$ and applies the standard Mixup to them by $(z_{mixed}, y_{mixed}) = (pz + (1-p)z', py + (1-p)y')$, as illustrated in the right part of Fig. 1. $(z, y)$ and $(z', y')$ are obtained by applying a random index-shuffle to $(z_{copy}, y)$. $p$ is randomly sampled from a beta distribution, $p \sim B(\alpha, \alpha)$, where $\alpha$ is a hyper-parameter controlling the mixing degree [25]. $[\cdot, \cdot]$ indicates flow-wise concatenation, which increases the flow dimension size. Following Eq. (2), the feature $z$ is transformed into $z'$ with a doubled flow size. Since the flow size is doubled in the forward propagation, the Mixing module shall halve the gradients in the back-propagation in order to keep the magnitudes of the gradients. The back-propagation of the Mixing module is defined as:
$$\mathrm{grad} = (\mathrm{grad}' + \mathrm{grad}_m) / 2 \quad (3)$$
where $\mathrm{grad}'$ indicates the gradients of the original features, and $\mathrm{grad}_m$ represents the gradients of the mixed features (see Fig. 1). In this way, the Mixing module can be applied to several hidden states simultaneously with the original features being preserved, as shown in Fig. 2(b). Note that the regularization approach cannot entirely restrict the subsequent layers to be a linear function, and thus applying several Mixing modules is helpful in strengthening the linear constraints. In implementation, if a hidden state is the last one (see Fig. 2(b)) or there is only one state (see Fig. 2(a)) to which the Mixing module is applied, it is optional to compute the forward propagation of the original features to the output layer. If the original features do not go forward, the Mixing module degrades into the common Mixup operation, computing $z' = z_{mixed}$ in the forward propagation and $\mathrm{grad} = \mathrm{grad}_m$ in the back-propagation.
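The following PyTorch sketch shows one way the Mixing module could be realized, assuming the flow dimension is implemented as batch-wise concatenation (as Fig. 1 suggests); the class names are ours, and Eq. (3) is realized by halving the gradients that flow back into the pre-mixing features:

```python
import torch
import torch.nn as nn

class _GradHalve(torch.autograd.Function):
    """Identity in the forward pass; halves gradients in the backward
    pass, realizing grad = (grad' + grad_m) / 2 of Eq. (3)."""
    @staticmethod
    def forward(ctx, z):
        return z.view_as(z)

    @staticmethod
    def backward(ctx, grad):
        return grad / 2

class MixingModule(nn.Module):
    """A sketch of Eq. (2): copy the features, mix the copy with an
    index-shuffled version of itself, and concatenate along the flow
    dimension (realized here as the batch dimension)."""
    def __init__(self, alpha=3.0):
        super().__init__()
        self.alpha = alpha

    def forward(self, z, y):
        z = _GradHalve.apply(z)  # gradients of both branches get halved
        p = torch.distributions.Beta(self.alpha, self.alpha).sample()
        idx = torch.randperm(z.size(0))  # random index-shuffle
        z_mixed = p * z + (1 - p) * z[idx]
        y_mixed = p * y + (1 - p) * y[idx]
        # [z, z_mixed]: the flow size doubles; labels are doubled to match.
        return torch.cat([z, z_mixed], 0), torch.cat([y, y_mixed], 0)
```

With Op = False, one would return only (z_mixed, y_mixed) and skip the gradient halving, recovering the plain Mixup behavior described above.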
IV. ANALYSIS AND COMPARISONS

This section discusses the feasibility and characteristics of Flow-Mixup, and its differences from the known regularization methods Mixup [25] and Manifold Mixup [21].
A. Feasibility of Flow-Mixup
Hypothesis 1:
A learned sequential deep learning classifier for multi-labeled images can be reformulated as a composition of some linear functions and some nonlinear functions.
Fig. 2. Illustrating two strategies to use Flow-Mixup. (a) The Mixing module is applied to only one hidden state. (b) Mixing modules are applied to two hidden states. The numbers along the models indicate the flow dimension sizes, with and without the optional forward path of the original features.

Since feature correlations among abnormalities exist and the labels lie independently in the label space, a deep learning classifier needs to learn nonlinear functions in order to decouple the correlative features. Thus, it is reasonable to regard a learned sequential classifier as a composition of multiple nonlinear functions and linear functions.

Based on Hypothesis 1, a learned sequential deep learning classifier $h(\cdot)$ can be mathematically decoupled by:
$$h(x) = f_1 \circ g_1 \circ f_2 \circ g_2 \circ \cdots \circ f_a \circ g_b(x) \quad (4)$$
where the functions $f_i$ ($i \in \{1, 2, \ldots, a\}$) and $g_j$ ($j \in \{1, 2, \ldots, b\}$) belong to a linear function family $\mathcal{F}$ and a nonlinear function family $\mathcal{G}$, respectively, and $f_i \circ g_j(x) = g_j(f_i(x))$. In practice, in Eq. (4), the order and the concrete expressions of $f_i$ and $g_j$ are obtained by learning from the data.

Theorem 1:
A learned sequential deep learning classifier $h(\cdot)$ for multi-labeled images can be reformulated as a composition of some linear functions and some nonlinear functions, in a sequence where the nonlinear functions $\tilde{g}_i$ appear first and the linear functions $\tilde{f}_j$ follow, as:
$$h(x) = \tilde{g}_1 \circ \tilde{g}_2 \circ \cdots \circ \tilde{g}_c \circ \tilde{f}_1 \circ \tilde{f}_2 \circ \cdots \circ \tilde{f}_d(x) \quad (5)$$

Proof:
A commutative law for linear and nonlinear functions can be proved as follows. Assume $f \circ g(x) = y$ for a linear function $f$ and a nonlinear function $g$. Then there exist $\tilde{f} \in \mathcal{F}$ and $\tilde{g} \in \mathcal{G}$ such that $f \circ g(x) = (f \circ g \circ \tilde{f}^{-1}) \circ \tilde{f}(x) = \tilde{g} \circ \tilde{f}(x) = y$ (because a linear function is invertible), with $\tilde{g} = f \circ g \circ \tilde{f}^{-1}$. Thus, $f \circ g(x) = \tilde{g} \circ \tilde{f}(x)$. By applying this commutative law repeatedly, it is easy to prove that a model $h(x)$ under Hypothesis 1 can be rewritten as:
$$h(x) = \tilde{g}_1 \circ \tilde{g}_2 \circ \cdots \circ \tilde{g}_b \circ \tilde{f}_1 \circ \tilde{f}_2 \circ \cdots \circ \tilde{f}_a(x) \quad (6)$$
where $d = a$ and $c = b$. The solution thus constructed (i.e., Eq. (6)) verifies the theorem, which suggests that any multi-labeled image classifier can find a solution under the constraint of Flow-Mixup if the equation of the original classifier has a solution. □
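To make the repeated application explicit, here is a worked two-pair instance (our illustration, added for clarity): each swap moves one linear factor past a nonlinear factor, so finitely many swaps bring all linear factors behind the nonlinear ones.

```latex
\begin{align*}
  h &= f_1 \circ g_1 \circ f_2 \circ g_2
      && \text{Eq.~(4) with } a = b = 2 \\
    &= \tilde{g}_1 \circ \tilde{f}_1 \circ f_2 \circ g_2
      && \text{swap the pair } (f_1, g_1) \\
    &= \tilde{g}_1 \circ \tilde{f}_1 \circ \tilde{g}_2 \circ \tilde{f}_2
      && \text{swap the pair } (f_2, g_2) \\
    &= \tilde{g}_1 \circ \hat{g}_2 \circ \hat{f}_1 \circ \tilde{f}_2
      && \text{swap the pair } (\tilde{f}_1, \tilde{g}_2)
\end{align*}
```

The last line has the form of Eq. (5), with c = 2 nonlinear factors followed by d = 2 linear factors.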
B. Comparisons with Mixup

As discussed in previous work [3], the features of abnormalities can be correlative, and thus may not be linearly separable. In other words, the inherent correlation of abnormalities might conflict with the linear constraint of Mixup. Thus, training a multi-labeled image classifier with Mixup regularization may result in a performance decrease. As illustrated in Fig. 3, such "correlation conflicts" happen because the boundary line of two classes cannot deal with the data belonging to both of these two classes, after mapping the data manifold to a low-dimensional space satisfying the Mixup linear constraint. In contrast, with our Flow-Mixup, the correlative features of abnormalities are first decoupled into abnormality-specific features by the nonlinear functions, and such features lie in a linearly separable space.
C. Comparisons with Manifold Mixup
Manifold Mixup [21] allows applying a mixing operation to several hidden states in the training process. However, this mixing operation cannot be performed simultaneously: Manifold Mixup randomly selects one of these hidden states to apply the mixing operation in every training iteration, and consequently suffers two drawbacks. (1) Updating parameters in every iteration affects the final parameters. Since the mixing operation is applied only with some probability, it is hard to know exactly what degree of data mixing is applied to a hidden state, and thus it is difficult to determine the hyper-parameters for the mixing operation. (2) Since the training condition of a hidden state (whether a mixing operation is used) keeps changing, the training process is unstable and suffers a "distribution shift" phenomenon. "Distribution shift" means that the objective feature distribution is changed. Ideally, using a mixing operation on a hidden state restricts the features to lie in a linearly separable space. However, Manifold Mixup keeps changing the constraint on the hidden states, which leads to an unstable training process and decreases the performance.

Fig. 3. Illustrating "correlation conflicts". Assume that the yellow balls and blue balls indicate samples with two different labels, while the orange balls are samples with both of the two labels. One can see the Mixup linear constraint conflicting with the original correlation.

Fig. 4. A visualization of the R² values of different hidden states with or without Mixup on the training set of CIFAR-10. The R² Ratio is the ratio of the R² values without Mixup to the R² values with Mixup. One can see that the feature distributions with and without Mixup are quite different.

To observe the occurrence of the "distribution shift" phenomenon in model training, we compare the feature distributions on the training set of CIFAR-10, as shown in Fig. 4. We train the PreAct-ResNet-32 model [10] on the training set of CIFAR-10 with Mixup (applied to the data input and the output of every residual block, with $p \sim B(1.0, 1.0)$) and without Mixup. Then we collect the output of every residual block and the model output. To avoid the influence of the classification results, we apply the k-means clustering algorithm (partitioning into k = 10 classes) to the collected features of every block output and the model output. Then we calculate the average value of R² (similar to the R² in the analysis of variance) to observe the feature distributions: R² = 1 − SSI/SST, where SSI is the intra-cluster sum of squares and SST is the total sum of squares. R² presents the percentage of the total variance coming from the inter-cluster variance; the higher R² is, the clearer the boundaries of the clusters are. SSI and SST are defined by:
$$\mathrm{SST}^{(i)} = \frac{1}{V^{(i)}} \sum_{j=1}^{N} \sum_{v=1}^{V^{(i)}} \big(z^{(i)}_{j,v} - \bar{\bar{z}}^{(i)}_{v}\big)^2, \qquad \mathrm{SSI}^{(i)} = \frac{1}{V^{(i)}} \sum_{c=1}^{C} \sum_{j=1}^{N_c} \sum_{v=1}^{V^{(i)}} \big(z^{(i)}_{c,j,v} - \bar{z}^{(i)}_{c,v}\big)^2 \quad (7)$$
where $C$ indicates the number of clusters, $N$ is the number of images, and $N_c$ is the number of images belonging to the $c$-th cluster. $z^{(i)}_j$ is the feature of the $j$-th image in the $i$-th hidden state. $V^{(i)}$ denotes the feature size of one data item in the $i$-th hidden state, i.e., $V = D \times H \times W$, where $D$, $H$, and $W$ are the channel, height, and width dimension sizes. $\bar{\bar{z}}^{(i)}$ and $\bar{z}^{(i)}_c$ denote the data-wise average features in the $i$-th hidden state and the data-wise average features of the $c$-th cluster in the $i$-th hidden state, respectively. As shown in Fig. 4, one can see that the R² of the features learned with Mixup is evidently higher than without any mixing operations. Thus, the "distribution shift" phenomenon happens when using Manifold Mixup, as the objective feature distributions are very different with and without mixing operations.
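A sketch of how the R² statistic could be computed (our reconstruction; the helper name is hypothetical). Note that the 1/V⁽ⁱ⁾ factors in Eq. (7) cancel in the ratio SSI/SST, so they are omitted here:

```python
import numpy as np
from sklearn.cluster import KMeans

def r_squared(features, n_clusters=10):
    """features: (N, V) array of flattened block outputs for one
    hidden state. Returns R^2 = 1 - SSI / SST as in Sec. IV-C."""
    labels = KMeans(n_clusters=n_clusters).fit_predict(features)
    sst = ((features - features.mean(axis=0)) ** 2).sum()  # total
    ssi = sum(((features[labels == c]
                - features[labels == c].mean(axis=0)) ** 2).sum()
              for c in range(n_clusters))                  # intra-cluster
    return 1.0 - ssi / sst
```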
V. EXPERIMENTS

A. Datasets
To evaluate our Flow-Mixup approach on multi-labeled medical image classification tasks, we conduct experiments on the ChestX-ray14 dataset [22] and two ECG record datasets of the Alibaba Tianchi Cloud Competition (https://tianchi.aliyun.com/competition/entrance/231754/introduction). These datasets are for multi-labeled medical image classification. The ChestX-ray14 dataset [22] consists of 112,120 CXR images of size 1,024 × 1,024 each. The corresponding labels cover 14 abnormalities extracted from radiology reports by natural language processing (NLP), and some of the CXR images are assigned more than one label. As estimated by the data collectors [22], there is ∼10% label corruption. For ECG classification, we use the preliminary competition ECG dataset (the ECG-55 dataset) containing 55 arrhythmia categories, and a selected ECG dataset (the ECG-12 dataset) containing the 12 most common arrhythmia categories, in which the ECG records are selected from the preliminary dataset and the final competition dataset. The ECG-55 dataset consists of 31,779 8-lead ECG records and ECG-12 has 34,664 8-lead ECG records. The ECGs are 10-second records sampled at a frequency of 500 Hertz. An ECG record can be treated as a special one-dimensional image and processed with 1-D convolutions [12], [18]. Example samples of the ChestX-ray14 and ECG datasets are shown in Fig. 5. In the experiments, for the ChestX-ray14 dataset we follow the official split, and for the ECG datasets we randomly split each dataset into training, validation, and test parts by 7:1:2, since the official test set is not available.
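Since each record has 8 leads sampled at 500 Hz for 10 seconds, an ECG batch is simply a (batch, 8, 5000) tensor processed by 1-D convolutions; a minimal sketch:

```python
import torch
import torch.nn as nn

# An 8-lead, 10 s ECG at 500 Hz is a (leads, samples) = (8, 5000) "1-D image".
ecg_batch = torch.randn(32, 8, 5000)  # a batch of 32 records
conv = nn.Conv1d(in_channels=8, out_channels=64, kernel_size=3, padding=1)
print(conv(ecg_batch).shape)  # torch.Size([32, 64, 5000])
```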
B. Experimental Setups
We use DenseNet-121 [11] as the CXR classifier baseline and ResNet-34 [10] as the ECG classifier baseline. Two convolutional layers are added ahead of the DenseNet-121 network, as was done similarly in [8], both with a kernel size of 3 and a stride of 2. For CXR image classification, we follow the weighted binary cross-entropy loss function [8], [22], weighting the loss term for an abnormality with its inverse proportion. In ResNet-34, 1-D convolution kernels of size 3 are used to replace the 2-D convolution kernels. During training, we set the batch size to 32 and employ the Adam optimizer [13] with β₁ = 0. and β₂ = 0. . The learning rate is initialized as − and is reduced by × when the validation loss reaches a plateau. We run 50 epochs for CXR image classification and 200 epochs for ECG classification. To validate Flow-Mixup and compare it with the known regularization methods on multi-labeled classification with label corruption, we replace the labels with corrupted labels with a given probability. The label corruption rates for the ECG records are ×, ×, ×, and ×, while for the CXR images the label corruption rates are ×, ×, ×, and ×, as the original labels are already corrupted. The mixing operation is applied to the input of the third and fifth ResBlock and DenseBlock, for both Manifold Mixup and Flow-Mixup. We report the average AUC (area under the ROC curve) over the 14 kinds of abnormalities on the CXR test set, and report Macro-F1 (macro-averaged F1 scores) on the two ECG test sets.

TABLE I. Comparison of regularization methods using DenseNet-121, in AUC (%) on the ChestX-ray14 test set under four label corruption rates. Here, α is the hyper-parameter of the beta distributions in the mixing operation, and "Op" indicates whether the original features go forward (see Fig. 2). The best and the second best results are marked as bold and underlined, respectively.

Regularization                   | Best Last Diff. | Best Last Diff. | Best Last Diff. | Best Last Diff.
ERM                              | 74.2 69.4 4.8   | 73.1 69.9 3.2   | 72.8 69.3 3.5   | 72.2 67.3 4.9
Mixup (α = 1.0)                  | 77.0 76.2 0.8   | 76.8 76.4 0.4   | 76.4 …    …     | …    …    …
Mixup (α = 3.0)                  | 76.7 75.8 0.9   | 76.3 75.6 0.7   | 76.3 75.6 0.7   | 76.1 …    …
Manifold Mixup (α = 3.0)         | 77.3 75.5 1.8   | 76.6 75.3 1.3   | 76.6 74.8 1.8   | 75.8 74.3 1.5
Flow-Mixup (α = 3.0, Op = False) | …    …    …     | …    …    …     | …    …    …     | …    …    …
Flow-Mixup (α = 3.0, Op = True)  | 76.9 76.5 0.4   | 76.9 …    …     | …    …    …     | …    …    …

TABLE II. Comparison of regularization methods using ResNet-34, in Macro-F1 on the ECG test sets under four label corruption rates. α and "Op" are as in Table I. The best and the second best results are marked as bold and underlined, respectively.

Dataset | Regularization                   | Macro-F1 under the four corruption rates
ECG-12  | ERM                              | 0.6617 0.6531 0.6238 0.5590
        | Mixup (α = 1.0)                  | 0.6773 0.6581 0.6337 0.5774
        | Mixup (α = 3.0)                  | 0.6575 0.6225 0.6195 0.5894
        | Manifold Mixup (α = 3.0)         | 0.6436 0.6389 0.6378 0.5822
        | Flow-Mixup (α = 3.0, Op = False) | …      …      …      …
        | Flow-Mixup (α = 3.0, Op = True)  | 0.6846 0.6784 …      …
ECG-55  | ERM                              | …      …      …      …
        | Mixup (α = 1.0)                  | 0.5543 0.5119 0.4992 0.3474
        | Mixup (α = 3.0)                  | 0.5509 0.5245 0.4953 0.4563
        | Manifold Mixup (α = 3.0)         | 0.5512 0.5390 0.4966 …
        | Flow-Mixup (α = 3.0, Op = False) | 0.5551 0.5416 …      …
        | Flow-Mixup (α = 3.0, Op = True)  | …      …      …      …

Fig. 5. (a) A chest X-ray image from the ChestX-ray14 dataset. (b) An 8-lead electrocardiogram record from the ECG datasets.
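The paper does not spell out the corruption protocol beyond "replace the labels with corrupted labels with a given probability"; one plausible reading, sketched here with a hypothetical helper, is to flip each binary label entry independently at the given rate:

```python
import torch

def corrupt_multilabels(y, rate):
    """y: (N, C) float multi-hot labels. Each entry is independently
    flipped with probability `rate` (an assumed protocol, for
    illustration only)."""
    flip = torch.rand_like(y) < rate
    return torch.where(flip, 1.0 - y, y)
```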
C. Experimental Results

1) Performance Comparison: The experimental results on the ChestX-ray14 dataset are reported in Table I, and the results on the ECG-55 and ECG-12 datasets are in Table II. We compare the indicators of our proposed Flow-Mixup with those of several known state-of-the-art regularization methods, including the Empirical Risk Minimization (ERM) principle [20], Mixup [25], and Manifold Mixup [21]. We report the best and the last performances in CXR classification; we report only the best performances in ECG classification, since the last performances are very close to the best performances on the ECG datasets. One can see that Flow-Mixup outperforms the other regularization methods in dealing with various degrees of label corruption, which validates its capability. Flow-Mixup attains better performances than Mixup, which might result from the abnormality-specific features extracted by the nonlinear part (see Sec. IV-B). In ECG classification, Flow-Mixup outperforms Mixup and Manifold Mixup, and a similar conclusion can be derived. Further, one can see from Fig. 6 that Flow-Mixup outperforms Mixup in F1 scores and AUCs in most classes.

Fig. 6. Illustrating the AUC and F1 score for every abnormality with Mixup (α = 3.0) and Flow-Mixup (α = 3.0, Op = False). In Subfig. (b), we show just 11 abnormality categories, excluding a normal category. Note that Mixup achieves an F1 score of 0.0 on the left bundle branch block abnormality.
2) Correlation Conflict Reduction:
To further evaluate Flow-Mixup's ability to reduce correlation conflicts, we compare the F1 score and AUC of every abnormality between Mixup and Flow-Mixup on the ChestX-ray14 test set and the ECG-12 test set, respectively. The histograms of these F1 scores and AUCs are shown in Fig. 6, both with ∼10% label corruption. For easy comparison, we define two new indicators for every class: "Performance Ratio" = (Performance_Mixup / Performance_Flow-Mixup)^r (r is an exponent; r = 10.0 for CXR images and r = 1.0 for ECG records), and "Independent Ratio" = n_c / m_c, where n_c is the number of images with only the class c, and m_c is the number of all images with the class c, including multi-labeled images. The "Performance Ratio" (the "AUC Ratio" for CXRs and the "F1 Ratio" for ECGs in Fig. 6) indicates the relative performance of Mixup and Flow-Mixup on every abnormality, while the "Independent Ratio" suggests to what degree a class is independent in a dataset. The performances are normalized before computing the "Performance Ratio". Thus, one can see whether the relative performances are related to class independence by checking how well the "Performance Ratio" and the "Independent Ratio" coincide. In Fig. 6, the "Performance Ratio" curves coincide with the "Independent Ratio" curves (with Spearman correlation coefficients ≈ 0. ), indicating that Flow-Mixup obtains better performance on relatively dependent classes. Hence, we believe Flow-Mixup is able to reduce the correlation conflicts.
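Both indicators are straightforward to compute per class; a sketch with hypothetical helper names:

```python
import numpy as np

def performance_ratio(perf_mixup, perf_flow, r):
    """(Performance_Mixup / Performance_Flow-Mixup)^r per class, on
    normalized scores (AUC for CXRs, F1 for ECGs)."""
    return (np.asarray(perf_mixup) / np.asarray(perf_flow)) ** r

def independent_ratio(labels):
    """n_c / m_c per class: labels is an (N, C) multi-hot array; n_c
    counts images carrying only class c, m_c all images with class c."""
    single = labels.sum(axis=1) == 1   # single-labeled images
    n_c = labels[single].sum(axis=0)
    m_c = labels.sum(axis=0)
    return n_c / m_c
```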
3) Distribution Shift Reduction:
To evaluate Flow-Mixup's ability to reduce the "distribution shift" phenomenon, we compute the differences (Diff.s) between the best AUCs and the last AUCs on the ChestX-ray14 test set, shown in Table III. The Diff.s on the ECG test sets are not reported since the best and the last Macro-F1 scores are very close. Further, we compute the variances of the normalized performance indicators (AUCs for CXR images and Macro-F1 scores for ECG records) over a number of epochs on the test sets, as:
$$\mathrm{Var}(I) = \frac{1}{n} \sum_{e=1}^{n} (I_e - \bar{I})^2 \quad (8)$$
where $I$ is the normalized performance on the test set, $\bar{I}$ is the average performance, and $e$ is the index of epochs. The normalization method is min-max normalization. The variances are computed over the early 20 epochs on the ChestX-ray14 dataset ($n = 20$) and over the early 100 epochs on the two ECG datasets ($n = 100$), as the indicators fluctuate only slightly in the remaining epochs. As shown in Table III, Flow-Mixup has lower variances than Manifold Mixup, which suggests that Flow-Mixup is more stable. Comparing the Diff.s and variances, it is obvious that training models with Manifold Mixup is not as stable as with Flow-Mixup, which might be due to the instability caused by "distribution shift", as discussed in Sec. IV-C.
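Eq. (8) is the plain variance of the min-max normalized per-epoch scores; a minimal sketch:

```python
import numpy as np

def epoch_variance(scores):
    """scores: per-epoch test AUCs (or Macro-F1s) over the first n
    epochs; assumed non-constant so min-max normalization is defined."""
    s = np.asarray(scores, dtype=float)
    s = (s - s.min()) / (s.max() - s.min())  # min-max normalization
    return ((s - s.mean()) ** 2).mean()      # (1/n) * sum_e (I_e - I_bar)^2
```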
4) Hyperparameter α: As suggested in [21], [25], setting the mixing degree α > . is recommended when dealing with corrupted labels. In our tasks, we find that the models perform well with α > . . Flow-Mixup seems to be insensitive to α, and the results fluctuate within ∼ . on the ECG datasets and within ∼ . on the ChestX-ray dataset for different α ∈ [1., .].

TABLE III. Variances of the normalized performance indicators for Manifold Mixup (α = 3.0) and Flow-Mixup (α = 3.0, Op = False) over n epochs on the test sets: AUC variances for the ChestX-ray14 test set (n = 20) and Macro-F1 variances for the ECG-55 and ECG-12 test sets (n = 100). The lower variances are marked as bold.

VI. CONCLUSIONS
In this paper, we proposed a new regularization approach, Flow-Mixup, for multi-labeled medical image classification with corrupted labels. Guided by Flow-Mixup, a deep learning classifier extracts abnormality-specific features and then maps such features into the label space. Experiments verified that Flow-Mixup can handle datasets containing corrupted labels, and thus makes it possible to apply automatic annotation. Besides, we compared Flow-Mixup with the common Mixup and Manifold Mixup methods, highlighted the characteristics of Flow-Mixup, and discussed the "correlation conflicts" and "distribution shift" phenomena that occur when using Mixup or Manifold Mixup.

VII. ACKNOWLEDGEMENTS
This research was partially supported by the National Research and Development Program of China under grants No. 2019YFB1404802, No. 2019YFC0118802, and No. 2018AAA0102102, the National Natural Science Foundation of China under grant No. 61672453, the Zhejiang University Education Foundation under grants No. K18-511120-004, No. K17-511120-017, and No. K17-518051-02, the Zhejiang public welfare technology research project under grant No. LGF20F020013, and the Key Laboratory of Medical Neurobiology of Zhejiang Province. D. Z. Chen's research was supported in part by NSF Grant CCF-1617735.

REFERENCES

[1] Ehsan Adeli, Dongjin Kwon, and Kilian M. Pohl. Multi-label transduction for identifying disease comorbidity patterns. In MICCAI, 2018.
[2] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In ICML, 2017.
[3] Angelica I. Aviles-Rivero, Nicolas Papadakis, Ruoteng Li, Philip Sellars, Qingnan Fan, Robby T. Tan, and Carola-Bibiane Schönlieb. GraphXNET: Chest X-ray classification under extreme minimal supervision. In MICCAI, 2019.
[4] Haomin Chen, Shun Miao, Daguang Xu, Gregory D. Hager, and Adam P. Harrison. Deep hierarchical multi-label classification of chest X-ray images. In ICLR, 2019.
[5] Mark Cicero, Alexander Bilbily, Errol Colak, Tim Dowdell, Bruce Gray, Kuhan Perampaladas, and Joseph Barfett. Training and validating a deep convolutional neural network for computer-aided detection and classification of abnormalities on frontal chest radiographs. Investigative Radiology, 2017.
[6] Florin C. Ghesu, Bogdan Georgescu, Eli Gibson, Sebastian Guendel, Mannudeep K. Kalra, Ramandeep Singh, Subba R. Digumarthy, Sasa Grbic, and Dorin Comaniciu. Quantifying and leveraging classification uncertainty for chest radiograph assessment. In MICCAI, 2019.
[7] Tomer Golany and Kira Radinsky. PGANs: Personalized generative adversarial networks for ECG synthesis to improve patient-specific deep ECG classification. In AAAI, 2019.
[8] Sebastian Guendel, Sasa Grbic, Bogdan Georgescu, Siqi Liu, Andreas Maier, and Dorin Comaniciu. Learning to recognize abnormalities in chest X-rays with location-aware dense networks. In Iberoamerican Congress on Pattern Recognition, 2018.
[9] Hongyu Guo, Yongyi Mao, and Richong Zhang. Mixup as locally linear out-of-manifold regularization. In AAAI, 2019.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[11] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[12] Mohammad Kachuee, Shayan Fazeli, and Majid Sarrafzadeh. ECG heartbeat classification: A deep transferable representation. In IEEE International Conference on Healthcare Informatics, 2018.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] Thomas Kurmann, Pablo Márquez-Neila, Sebastian Wolf, and Raphael Sznitman. Deep multi-label classification in affine subspaces. In MICCAI, 2019.
[15] Ricardo Bigolin Lanfredi, Joyce D. Schroeder, Clement Vachet, and Tolga Tasdizen. Adversarial regression training for visualizing the progression of chronic obstructive pulmonary disease with chest X-rays. In MICCAI, 2019.
[16] Keon Myung Lee, Sang Yeon Lee, Chan Sik Han, and Seung Myung Choi. Long bone fracture type classification for limited number of CT data with deep learning. In ACM Symposium on Applied Computing, 2020.
[17] Pranav Rajpurkar, Jeremy Irvin, Robyn L. Ball, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis P. Langlotz, et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Medicine, 2018.
[18] Yichen Shen, Maxime Voisin, Alireza Aliamiri, Anand Avati, Awni Hannun, and Andrew Ng. Ambulatory atrial fibrillation monitoring using wearable photoplethysmography with deep learning. In KDD, 2019.
[19] Saeid Asgari Taghanaki, Mohammad Havaei, Tess Berthier, Francis Dutil, Lisa Di Jorio, Ghassan Hamarneh, and Yoshua Bengio. InfoMask: Masked variational latent representation to localize chest disease. In MICCAI, 2019.
[20] Vladimir N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 1971.
[21] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold Mixup: Better representations by interpolating hidden states. In ICML, 2019.
[22] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In CVPR, 2017.
[23] Yunyan Xing, Zongyuan Ge, Rui Zeng, Dwarikanath Mahapatra, Jarrel Seah, Meng Law, and Tom Drummond. Adversarial pulmonary pathology translation for pairwise chest X-ray data augmentation. In MICCAI, 2019.
[24] Donna Xu, Yaxin Shi, Ivor W. Tsang, Yew-Soon Ong, Chen Gong, and Xiaobo Shen. Survey on multi-output learning. IEEE Transactions on Neural Networks and Learning Systems, 2019.
[25] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. Mixup: Beyond empirical risk minimization. In ICLR, 2018.