Flow-Mixup: Classifying Multi-labeled Medical Images with Corrupted Labels
Jintai Chen, Hongyun Yu, Ruiwei Feng, Danny Z. Chen, Jian Wu
Jintai Chen
College of Computer Science and Technology, Zhejiang University
Hangzhou, China
[email protected]

Hongyun Yu
College of Computer Science and Technology, Zhejiang University
Hangzhou, China
[email protected]

Ruiwei Feng
College of Computer Science and Technology, Zhejiang University
Hangzhou, China
[email protected]

Danny Z. Chen
Department of Computer Science and Engineering, University of Notre Dame
Notre Dame, IN 46556, USA
[email protected]

Jian Wu (corresponding author)
Zhejiang University School of Medicine, Zhejiang University
Hangzhou, China
[email protected]
Abstract—In clinical practice, medical image interpretation often involves multi-labeled classification, since the affected parts of a patient tend to present multiple symptoms or comorbidities. Recently, deep learning based frameworks have attained expert-level performance on medical image interpretation, which can be attributed partially to large amounts of accurate annotations. However, manually annotating massive amounts of medical images is impractical, while automatic annotation is fast but imprecise (possibly introducing corrupted labels). In this work, we propose a new regularization approach, called Flow-Mixup, for multi-labeled medical image classification with corrupted labels. Flow-Mixup guides the models to capture robust features for each abnormality, thus helping handle corrupted labels effectively and making it possible to apply automatic annotation. Specifically, Flow-Mixup decouples the extracted features by adding constraints to the hidden states of the models. Also, Flow-Mixup is more stable and effective compared to other known regularization methods, as shown by theoretical and empirical analyses. Experiments on two electrocardiogram datasets and a chest X-ray dataset containing corrupted labels verify that Flow-Mixup is effective and insensitive to corrupted labels.
Index Terms—deep learning, regularization, multi-labeled image classification, mixup
I. INTRODUCTION
Medical image classification is critical in clinical practice (e.g., for early detection of diseases). However, medical image classification is still a challenging task due to large intra-class variations, blurred boundaries between abnormalities, inconclusive abnormality patterns, etc. Furthermore, it is common for a patient to suffer multiple symptoms simultaneously, which present several kinds of abnormalities and some complex comorbidities in the medical images. Therefore, medical image interpretation is commonly a multi-labeled image classification process. Recently, supervised deep learning models have attained high performance thanks to large amounts of well-annotated data for model training. However, it is time-consuming to annotate medical images manually by medical experts, while automatic annotation (e.g., automatically extracting labels from reports [22]) is fast but possibly introduces considerable corrupted (incorrect) labels.

Many methods, such as model ensemble [17], weighted loss functions [8], and label hierarchy [4], have been widely utilized in multi-labeled medical image classification. However, dealing with corrupted labels of multi-labeled medical images was rarely studied, although it is a basic issue for using automatically annotated labels. It was shown that regularization methods can hinder the memorization of models while preserving their generalization ability [2], which is advantageous for tackling label corruption. Many known regularization methods (e.g., Mixup [25], Manifold Mixup [21]) were proposed for single-output tasks, but do not meet the needs of multi-output tasks [24]. Besides, the existence of complex correlations among abnormalities was confirmed in the literature [3], [22], which requires additional considerations in model training. Thus, multi-labeled medical image classification with corrupted labels is a challenging problem and requires further research effort.

To this end, in this paper, we propose a new regularization approach called Flow-Mixup for multi-labeled medical image classification with label corruption. Specifically, we introduce a new dimension called the "flow dimension" for the feature tensors in hidden states and apply a novel Mixing module to a selected hidden state (in this paper, a "hidden state" denotes the general output of a model layer, and a "feature" refers to a particular representation of some data). Thus, the model layers ahead of the selected hidden state are restricted to learning a nonlinear function while the subsequent layers are restricted to learning a linear function. Flow-Mixup guides the nonlinear part to decouple the complex features (where the features of abnormalities are correlative) into abnormality-specific features before feeding the features to the linear part. The decoupling is guaranteed as the linear function requires its input features to lie in a linearly separable space. We compare Flow-Mixup with Mixup [25] and Manifold Mixup [21] to highlight the characteristics of Flow-Mixup.

This work makes three main contributions:
1) We propose a new regularization method called Flow-Mixup for multi-labeled medical image classification, and show that Flow-Mixup is insensitive to corrupted labels.
2) We compare Flow-Mixup with Mixup [25] and Manifold Mixup [21], and show that the "correlation conflicts" phenomenon and the "distribution shift" phenomenon occur when using Mixup or Manifold Mixup.
3) Experiments on several multi-labeled medical image classification datasets with corrupted labels verify that our Flow-Mixup outperforms known regularization methods.

II. RELATED WORK
A. Multi-labeled Medical Image Classification
Various automatic medical image interpretation applications involve multi-labeled image classification tasks, such as chest X-ray (CXR) interpretation [3], [5], [14], [17], [22], electrocardiogram (ECG) monitoring [7], [12], [18], comorbidity identification of Alcohol Use Disorder and human immunodeficiency virus infection [1], bone fracture type diagnosis [16], etc. To better handle multi-labeled classification tasks, a new loss function was proposed to guide deep learning models to search the subspace of abnormality features [14], and label hierarchy [4] and matrix completion [1] methods were also used in correlative feature refinement. In [6], an approach was specially designed to calculate the uncertainty of automated diagnosis. Abnormality location perception was considered in [8], [19] for CXR image classification. Also, adversarial learning approaches were designed for data augmentation and disease severity assessment in CXR [15], [23] and ECG [7] classification. A large dataset [22] catalyzed multi-labeled classification methods on CXR images. However, the labels in this dataset were mined from radiology reports by natural language processing (NLP), and the text-mined labels were somewhat corrupted. Most of the known supervised multi-labeled classification methods focused on tackling feature correlations among abnormalities, but few of them considered label corruption.
B. Regularization Methods
Regularization methods are useful for dealing with label corruption [2]. Kurmann et al. [14] managed to drive class-specific features into different affine subspaces and enlarge the distances between the subspaces; this method outperformed the vanilla methods in multi-labeled CXR image classification. Many data augmentation methods were used to deal with multi-labeled medical image classification [3], [8], [17], which had a similar effect as regularization methods. The state-of-the-art regularization methods for single-labeled classification are Mixup [9], [25] and Manifold Mixup [21], which introduce linear constraints into the models. However, both of them are not very suitable for multi-labeled classification because of the "correlation conflicts" and "distribution shift" phenomena (discussed in Sec. IV). Mixup ignores the feature correlations among abnormalities, while Manifold Mixup is often unstable in training. In this paper, we propose Flow-Mixup for multi-labeled medical image classification, which avoids the drawbacks of Mixup and Manifold Mixup.

III. APPROACH
A. Preliminaries
Mixup [25] introduced a linear constraint to single-labeled classification and achieved good performance. Considering a deep learning classifier as a function $h(\cdot)$, the standard Mixup is defined as:
$$h(px_p + qx_q) = py_p + qy_q \quad (1)$$
where $x_p$ and $x_q$ are two input images while $y_p$ and $y_q$ are the corresponding labels, with $q = 1 - p$. Mixup regularization restricts the whole model (the function $h$) to be a linear function, as $h(px_p + qx_q) = py_p + qy_q = ph(x_p) + qh(x_q)$. Similarly, Manifold Mixup [21] applies the mixing operation of Eq. (1) to a hidden state, and restricts the subsequent parts of the model to learning a linear function. Note that a "linear function" and a "nonlinear function" are different from a "linear layer" and a "nonlinear layer" of a neural network, as the former concepts are about the learning objectives while the latter concepts are about the model entities.
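For concreteness, the mixing operation of Eq. (1) can be sketched in a few lines of PyTorch; `mixup_batch` is a hypothetical helper name, and the labels are assumed to be float one-hot or multi-hot vectors so that they can be interpolated:

```python
import torch

def mixup_batch(x, y, alpha=1.0):
    # Sample the mixing coefficient p ~ Beta(alpha, alpha), as in [25].
    p = torch.distributions.Beta(alpha, alpha).sample()
    # Pair each sample with a randomly index-shuffled partner (x_q, y_q).
    idx = torch.randperm(x.size(0))
    x_mixed = p * x + (1 - p) * x[idx]  # p*x_p + q*x_q with q = 1 - p
    y_mixed = p * y + (1 - p) * y[idx]  # p*y_p + q*y_q
    return x_mixed, y_mixed
```

Training then minimizes the usual loss on (x_mixed, y_mixed), which is what enforces the linear constraint $h(px_p + qx_q) = py_p + qy_q$.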
B. An Overview of Flow-Mixup

In this paper, we propose a new regularization approach, Flow-Mixup, for multi-labeled medical image classification. Consider a deep learning classifier $h(x) = f(g(x))$, where $g$ is a nonlinear function and $f$ is a linear function. A training forward pass with Flow-Mixup takes several steps. First, before training, we select a hidden state $s$ to split the model into a nonlinear part and a linear part, so that $s = g(x)$ and $y = f(s)$, where $y$ is the model output. Second, we process the data (e.g., the images) forward to the selected hidden state, and apply a new Mixing module to the features in the hidden state (our Mixing module is detailed below). After being processed by the Mixing module, the features continue the forward propagation until the output. With the Mixing module, Flow-Mixup restricts the front part of the model to learning a nonlinear function, and the rest of the model serves as a linear function. In dealing with multi-labeled medical images, the nonlinear function extracts abnormality-specific features, and the linear function (the subsequent part of the model) projects the abnormality-specific features into the label spaces. The constraint on the nonlinear part is guaranteed, as the output of the nonlinear part is fed to the linear part, which requires its input to lie in a linearly separable space. Different from Manifold Mixup, the special Mixing module introduces an extra flow dimension, so simultaneously using several Mixing modules in a model is allowed.
Fig. 1. The left part shows the structure of the Mixing module. The right part gives the details of the mixing operation of Flow-Mixup. A copy of the feature maps is made and processed by the mixing operation, and then is batch-wise concatenated to the original ones.
C. Mixing Module
Generally, the tensors of an image in deep learning models have 4 dimensions: the batch dimension, the channel dimension, and the width and height dimensions. Our proposed Flow-Mixup introduces a new dimension, called the flow dimension. As shown in the left part of Fig. 1, assume that the original feature $z$ has a flow dimension of size 1 before being processed by the Mixing module; then the output of the Mixing module $[z, z_{mixed}]$ has a flow dimension of size 2. The flow size is increased by the feature concatenation operation. After the features are fed to the Mixing module, the first step is to make a copy of these features. Then, the feature copy is processed by a mixing operation and concatenated to the original features along the flow dimension. The forward process of the Mixing module is defined as:
$$z' = M(p, z) = [z, \mathrm{Mixing}(p, z)] = [z, z_{mixed}] \quad (2)$$
where the mixing operation $\mathrm{Mixing}(\cdot, \cdot)$ transforms a feature copy $(z_{copy}, y)$ into two mini-copies $((z, y), (z', y'))$ and applies the standard Mixup to them by $(z_{mixed}, y_{mixed}) = (pz + (1-p)z', py + (1-p)y')$, as illustrated in the right part of Fig. 1. $(z, y)$ and $(z', y')$ are obtained by applying a random index-shuffle to $(z_{copy}, y)$. $p$ is randomly sampled from a beta distribution, $p \sim B(\alpha, \alpha)$, where $\alpha$ is a hyper-parameter controlling the mixing degree [25]. $[\cdot, \cdot]$ indicates flow-wise concatenation, which increases the flow dimension size. Following Eq. (2), the feature $z$ is transformed into $z'$ with a doubled flow size. Since the flow size is doubled in the forward propagation, the Mixing module shall halve the gradients in the back-propagation in order to keep the magnitudes of the gradients. The back-propagation of the Mixing module is defined as:
$$\mathrm{grad} = (\mathrm{grad}' + \mathrm{grad}_m) / 2 \quad (3)$$
where $\mathrm{grad}'$ indicates the gradients of the original features, and $\mathrm{grad}_m$ represents the gradients of the mixed features (see Fig. 1). In this way, the Mixing module can be applied to several hidden states simultaneously with the original features being preserved, as shown in Fig. 2(b). Note that the regularization approach cannot entirely restrict the subsequent layers to be a linear function, and thus applying several Mixing modules is helpful in strengthening the linear constraints. In implementation, if a hidden state is the last one (see Fig. 2(b)) or there is only one state (see Fig. 2(a)) to which the Mixing module is applied, it is optional to compute the forward propagation of the original features to the output layer. If the original features do not go forward, the Mixing module degrades into the common Mixup operation, computing $z' = z_{mixed}$ in the forward propagation and $\mathrm{grad} = \mathrm{grad}_m$ in the back-propagation.
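The following PyTorch sketch shows one way the Mixing module could be realized, assuming the flow dimension is implemented as batch-wise concatenation (as Fig. 1 suggests); the class names are ours, and Eq. (3) is realized by halving the gradients that flow back into the pre-mixing features:

```python
import torch
import torch.nn as nn

class _GradHalve(torch.autograd.Function):
    """Identity in the forward pass; halves gradients in the backward
    pass, realizing grad = (grad' + grad_m) / 2 of Eq. (3)."""
    @staticmethod
    def forward(ctx, z):
        return z.view_as(z)

    @staticmethod
    def backward(ctx, grad):
        return grad / 2

class MixingModule(nn.Module):
    """A sketch of Eq. (2): copy the features, mix the copy with an
    index-shuffled version of itself, and concatenate along the flow
    dimension (realized here as the batch dimension)."""
    def __init__(self, alpha=3.0):
        super().__init__()
        self.alpha = alpha

    def forward(self, z, y):
        z = _GradHalve.apply(z)  # gradients of both branches get halved
        p = torch.distributions.Beta(self.alpha, self.alpha).sample()
        idx = torch.randperm(z.size(0))  # random index-shuffle
        z_mixed = p * z + (1 - p) * z[idx]
        y_mixed = p * y + (1 - p) * y[idx]
        # [z, z_mixed]: the flow size doubles; labels are doubled to match.
        return torch.cat([z, z_mixed], 0), torch.cat([y, y_mixed], 0)
```

With Op = False, one would return only (z_mixed, y_mixed) and skip the gradient halving, recovering the plain Mixup behavior described above.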
IV. ANALYSIS AND COMPARISONS

This section discusses the feasibility and characteristics of Flow-Mixup, and its differences from the known regularization methods Mixup [25] and Manifold Mixup [21].
A. Feasibility of Flow-Mixup
Hypothesis 1:
A learned sequential deep learning classifier for multi-labeled images can be reformulated as a composition of some linear functions and some nonlinear functions.
Fig. 2. Illustrating two strategies to use Flow-Mixup. (a) The Mixing module is applied to only one hidden state. (b) Mixing modules are applied to two hidden states. The numbers along the models indicate the flow dimension sizes, with and without the optional forward path of the original features.

Since feature correlations among abnormalities exist and the labels lie independently in the label space, a deep learning classifier needs to learn nonlinear functions in order to decouple the correlative features. Thus, it is reasonable to regard a learned sequential classifier as a composition of multiple nonlinear functions and linear functions.

Based on Hypothesis 1, a learned sequential deep learning classifier $h(\cdot)$ can be mathematically decoupled by:
$$h(x) = f_1 \circ g_1 \circ f_2 \circ g_2 \circ \cdots \circ f_a \circ g_b(x) \quad (4)$$
where the functions $f_i$ ($i \in \{1, 2, \ldots, a\}$) and $g_j$ ($j \in \{1, 2, \ldots, b\}$) belong to a linear function family $\mathcal{F}$ and a nonlinear function family $\mathcal{G}$, respectively, and $f_i \circ g_j(x) = g_j(f_i(x))$. In practice, in Eq. (4), the order and the concrete expressions of $f_i$ and $g_j$ are obtained by learning from the data.

Theorem 1:
A learned sequential deep learning classifier $h(\cdot)$ for multi-labeled images can be reformulated as a composition of some linear functions and some nonlinear functions, in a sequence where the nonlinear functions $\tilde{g}_i$ appear first and the linear functions $\tilde{f}_j$ follow, as:
$$h(x) = \tilde{g}_1 \circ \tilde{g}_2 \circ \cdots \circ \tilde{g}_c \circ \tilde{f}_1 \circ \tilde{f}_2 \circ \cdots \circ \tilde{f}_d(x) \quad (5)$$

Proof:
A commutative law for linear and nonlinear functions can be proved as follows. Assume $f \circ g(x) = y$ for a linear function $f$ and a nonlinear function $g$. Then there exist $\tilde{f} \in \mathcal{F}$ and $\tilde{g} \in \mathcal{G}$ such that $f \circ g(x) = (f \circ g \circ \tilde{f}^{-1}) \circ \tilde{f}(x) = \tilde{g} \circ \tilde{f}(x) = y$ (because a linear function is invertible), with $\tilde{g} = f \circ g \circ \tilde{f}^{-1}$. Thus, $f \circ g(x) = \tilde{g} \circ \tilde{f}(x)$. By applying this commutative law repeatedly, it is easy to prove that a model $h(x)$ under Hypothesis 1 can be rewritten as:
$$h(x) = \tilde{g}_1 \circ \tilde{g}_2 \circ \cdots \circ \tilde{g}_b \circ \tilde{f}_1 \circ \tilde{f}_2 \circ \cdots \circ \tilde{f}_a(x) \quad (6)$$
where $d = a$ and $c = b$. The solution thus constructed (i.e., Eq. (6)) verifies the theorem, which suggests that any multi-labeled image classifier can find a solution under the constraint of Flow-Mixup if the equation of the original classifier has a solution. □
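To make the repeated application explicit, here is a worked two-pair instance (our illustration, added for clarity): each swap moves one linear factor past a nonlinear factor, so finitely many swaps bring all linear factors behind the nonlinear ones.

```latex
\begin{align*}
  h &= f_1 \circ g_1 \circ f_2 \circ g_2
      && \text{Eq.~(4) with } a = b = 2 \\
    &= \tilde{g}_1 \circ \tilde{f}_1 \circ f_2 \circ g_2
      && \text{swap the pair } (f_1, g_1) \\
    &= \tilde{g}_1 \circ \tilde{f}_1 \circ \tilde{g}_2 \circ \tilde{f}_2
      && \text{swap the pair } (f_2, g_2) \\
    &= \tilde{g}_1 \circ \hat{g}_2 \circ \hat{f}_1 \circ \tilde{f}_2
      && \text{swap the pair } (\tilde{f}_1, \tilde{g}_2)
\end{align*}
```

The last line has the form of Eq. (5), with c = 2 nonlinear factors followed by d = 2 linear factors.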
B. Comparisons with Mixup

As discussed in previous work [3], the features of abnormalities can be correlative, and thus may not be linearly separable. In other words, the inherent correlation of abnormalities might conflict with the linear constraint of Mixup. Thus, training a multi-labeled image classifier with Mixup regularization may result in a performance decrease. As illustrated in Fig. 3, such "correlation conflicts" happen because the boundary line of two classes cannot deal with the data belonging to both of these two classes, after mapping the data manifold to a low-dimensional space satisfying the Mixup linear constraint. In contrast, with our Flow-Mixup, the correlative features of abnormalities are first decoupled into abnormality-specific features by the nonlinear functions, and such features lie in a linearly separable space.
C. Comparisons with Manifold Mixup
Manifold Mixup [21] allows applying a mixing operation to several hidden states in the training process. However, this mixing operation cannot be performed simultaneously: Manifold Mixup randomly selects one of these hidden states to apply the mixing operation in every training iteration, and consequently suffers two drawbacks. (1) Updating parameters in every iteration affects the final parameters. Since the mixing operation is applied only with some probability, it is hard to know exactly what degree of data mixing is applied to a hidden state, and thus it is difficult to determine the hyper-parameters for the mixing operation. (2) Since the training condition of a hidden state (whether a mixing operation is used) keeps changing, the training process is unstable and suffers a "distribution shift" phenomenon. "Distribution shift" means that the objective feature distribution is changed. Ideally, using a mixing operation on a hidden state restricts the features to lie in a linearly separable space. However, Manifold Mixup keeps changing the constraint on the hidden states, which leads to an unstable training process and decreases the performance.

Fig. 3. Illustrating "correlation conflicts". Assume that the yellow balls and blue balls indicate samples with two different labels, while the orange balls are samples with both of the two labels. One can see the Mixup linear constraint conflicting with the original correlation.

Fig. 4. A visualization of the R² values of different hidden states with or without Mixup on the training set of CIFAR-10. The R² Ratio is the ratio of the R² values without Mixup to the R² values with Mixup. One can see that the feature distributions with and without Mixup are quite different.

To observe the occurrence of the "distribution shift" phenomenon in model training, we compare the feature distributions on the training set of CIFAR-10, as shown in Fig. 4. We train the PreAct-ResNet-32 model [10] on the training set of CIFAR-10 with Mixup (applied to the data input and the output of every residual block, with $p \sim B(1.0, 1.0)$) and without Mixup. Then we collect the output of every residual block and the model output. To avoid the influence of the classification results, we apply the k-means clustering algorithm (partitioning into k = 10 classes) to the collected features of every block output and the model output. Then we calculate the average value of R² (similar to the R² in the analysis of variance) to observe the feature distributions: R² = 1 − SSI/SST, where SSI is the intra-cluster sum of squares and SST is the total sum of squares. R² presents the percentage of the total variance coming from the inter-cluster variance; the higher R² is, the clearer the boundaries of the clusters are. SSI and SST are defined by:
$$\mathrm{SST}^{(i)} = \frac{1}{V^{(i)}} \sum_{j=1}^{N} \sum_{v=1}^{V^{(i)}} \big(z^{(i)}_{j,v} - \bar{\bar{z}}^{(i)}_{v}\big)^2, \qquad \mathrm{SSI}^{(i)} = \frac{1}{V^{(i)}} \sum_{c=1}^{C} \sum_{j=1}^{N_c} \sum_{v=1}^{V^{(i)}} \big(z^{(i)}_{c,j,v} - \bar{z}^{(i)}_{c,v}\big)^2 \quad (7)$$
where $C$ indicates the number of clusters, $N$ is the number of images, and $N_c$ is the number of images belonging to the $c$-th cluster. $z^{(i)}_j$ is the feature of the $j$-th image in the $i$-th hidden state. $V^{(i)}$ denotes the feature size of one data item in the $i$-th hidden state, i.e., $V = D \times H \times W$, where $D$, $H$, and $W$ are the channel, height, and width dimension sizes. $\bar{\bar{z}}^{(i)}$ and $\bar{z}^{(i)}_c$ denote the data-wise average features in the $i$-th hidden state and the data-wise average features of the $c$-th cluster in the $i$-th hidden state, respectively. As shown in Fig. 4, one can see that the R² of the features learned with Mixup is evidently higher than without any mixing operations. Thus, the "distribution shift" phenomenon happens when using Manifold Mixup, as the objective feature distributions are very different with and without mixing operations.
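A sketch of how the R² statistic could be computed (our reconstruction; the helper name is hypothetical). Note that the 1/V⁽ⁱ⁾ factors in Eq. (7) cancel in the ratio SSI/SST, so they are omitted here:

```python
import numpy as np
from sklearn.cluster import KMeans

def r_squared(features, n_clusters=10):
    """features: (N, V) array of flattened block outputs for one
    hidden state. Returns R^2 = 1 - SSI / SST as in Sec. IV-C."""
    labels = KMeans(n_clusters=n_clusters).fit_predict(features)
    sst = ((features - features.mean(axis=0)) ** 2).sum()  # total
    ssi = sum(((features[labels == c]
                - features[labels == c].mean(axis=0)) ** 2).sum()
              for c in range(n_clusters))                  # intra-cluster
    return 1.0 - ssi / sst
```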
V. EXPERIMENTS

A. Datasets
To evaluate our Flow-Mixup approach on multi-labeled medical image classification tasks, we conduct experiments on the ChestX-ray14 dataset [22] and two ECG record datasets of the Alibaba Tianchi Cloud Competition (https://tianchi.aliyun.com/competition/entrance/231754/introduction). These datasets are for multi-labeled medical image classification. The ChestX-ray14 dataset [22] consists of 112,120 CXR images of size 1,024 × 1,024 each. The corresponding labels cover 14 abnormalities extracted from radiology reports by natural language processing (NLP), and some of the CXR images are assigned more than one label. As estimated by the data collectors [22], there is ∼10% label corruption. For ECG classification, we use the preliminary competition ECG dataset (the ECG-55 dataset) containing 55 arrhythmia categories, and a selected ECG dataset (the ECG-12 dataset) containing the 12 most common arrhythmia categories, in which the ECG records are selected from the preliminary dataset and the final competition dataset. The ECG-55 dataset consists of 31,779 8-lead ECG records and ECG-12 has 34,664 8-lead ECG records. The ECGs are 10-second records sampled at a frequency of 500 Hertz. An ECG record can be treated as a special one-dimensional image and processed with 1-D convolutions [12], [18]. Example samples of the ChestX-ray14 and ECG datasets are shown in Fig. 5. In the experiments, for the ChestX-ray14 dataset we follow the official split, and for the ECG datasets we randomly split each dataset into training, validation, and test parts by 7:1:2, since the official test set is not available.
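Since each record has 8 leads sampled at 500 Hz for 10 seconds, an ECG batch is simply a (batch, 8, 5000) tensor processed by 1-D convolutions; a minimal sketch:

```python
import torch
import torch.nn as nn

# An 8-lead, 10 s ECG at 500 Hz is a (leads, samples) = (8, 5000) "1-D image".
ecg_batch = torch.randn(32, 8, 5000)  # a batch of 32 records
conv = nn.Conv1d(in_channels=8, out_channels=64, kernel_size=3, padding=1)
print(conv(ecg_batch).shape)  # torch.Size([32, 64, 5000])
```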
B. Experimental Setups
We use DenseNet-121 [11] as the CXR classifier baseline and ResNet-34 [10] as the ECG classifier baseline. Two convolutional layers are added ahead of the DenseNet-121 network, as was done similarly in [8], both with a kernel size of 3 and a stride of 2. For CXR image classification, we follow the weighted binary cross-entropy loss function [8], [22], weighting the loss term for an abnormality with its inverse proportion. In ResNet-34, 1-D convolution kernels of size 3 are used to replace the 2-D convolution kernels. During training, we set the batch size to 32 and employ the Adam optimizer [13] with β₁ = 0. and β₂ = 0. . The learning rate is initialized as − and is reduced by × when the validation loss reaches a plateau. We run 50 epochs for CXR image classification and 200 epochs for ECG classification. To validate Flow-Mixup and compare it with the known regularization methods on multi-labeled classification with label corruption, we replace the labels with corrupted labels with a given probability. The label corruption rates for the ECG records are ×, ×, ×, and ×, while for the CXR images the label corruption rates are ×, ×, ×, and ×, as the original labels are already corrupted. The mixing operation is applied to the input of the third and fifth ResBlock and DenseBlock, for both Manifold Mixup and Flow-Mixup. We report the average AUC (area under the ROC curve) over the 14 kinds of abnormalities on the CXR test set, and report Macro-F1 (macro-averaged F1 scores) on the two ECG test sets.

TABLE I. Comparison of regularization methods using DenseNet-121, in AUC (%) on the ChestX-ray14 test set under four label corruption rates. Here, α is the hyper-parameter of the beta distributions in the mixing operation, and "Op" indicates whether the original features go forward (see Fig. 2). The best and the second best results are marked as bold and underlined, respectively.

Regularization                   | Best Last Diff. | Best Last Diff. | Best Last Diff. | Best Last Diff.
ERM                              | 74.2 69.4 4.8   | 73.1 69.9 3.2   | 72.8 69.3 3.5   | 72.2 67.3 4.9
Mixup (α = 1.0)                  | 77.0 76.2 0.8   | 76.8 76.4 0.4   | 76.4 …    …     | …    …    …
Mixup (α = 3.0)                  | 76.7 75.8 0.9   | 76.3 75.6 0.7   | 76.3 75.6 0.7   | 76.1 …    …
Manifold Mixup (α = 3.0)         | 77.3 75.5 1.8   | 76.6 75.3 1.3   | 76.6 74.8 1.8   | 75.8 74.3 1.5
Flow-Mixup (α = 3.0, Op = False) | …    …    …     | …    …    …     | …    …    …     | …    …    …
Flow-Mixup (α = 3.0, Op = True)  | 76.9 76.5 0.4   | 76.9 …    …     | …    …    …     | …    …    …

TABLE II. Comparison of regularization methods using ResNet-34, in Macro-F1 on the ECG test sets under four label corruption rates. α and "Op" are as in Table I. The best and the second best results are marked as bold and underlined, respectively.

Dataset | Regularization                   | Macro-F1 under the four corruption rates
ECG-12  | ERM                              | 0.6617 0.6531 0.6238 0.5590
        | Mixup (α = 1.0)                  | 0.6773 0.6581 0.6337 0.5774
        | Mixup (α = 3.0)                  | 0.6575 0.6225 0.6195 0.5894
        | Manifold Mixup (α = 3.0)         | 0.6436 0.6389 0.6378 0.5822
        | Flow-Mixup (α = 3.0, Op = False) | …      …      …      …
        | Flow-Mixup (α = 3.0, Op = True)  | 0.6846 0.6784 …      …
ECG-55  | ERM                              | …      …      …      …
        | Mixup (α = 1.0)                  | 0.5543 0.5119 0.4992 0.3474
        | Mixup (α = 3.0)                  | 0.5509 0.5245 0.4953 0.4563
        | Manifold Mixup (α = 3.0)         | 0.5512 0.5390 0.4966 …
        | Flow-Mixup (α = 3.0, Op = False) | 0.5551 0.5416 …      …
        | Flow-Mixup (α = 3.0, Op = True)  | …      …      …      …

Fig. 5. (a) A chest X-ray image from the ChestX-ray14 dataset. (b) An 8-lead electrocardiogram record from the ECG datasets.
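The paper does not spell out the corruption protocol beyond "replace the labels with corrupted labels with a given probability"; one plausible reading, sketched here with a hypothetical helper, is to flip each binary label entry independently at the given rate:

```python
import torch

def corrupt_multilabels(y, rate):
    """y: (N, C) float multi-hot labels. Each entry is independently
    flipped with probability `rate` (an assumed protocol, for
    illustration only)."""
    flip = torch.rand_like(y) < rate
    return torch.where(flip, 1.0 - y, y)
```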
C. Experimental Results

1) Performance Comparison: The experimental results on the ChestX-ray14 dataset are reported in Table I, and the results on the ECG-55 and ECG-12 datasets are in Table II. We compare the indicators of our proposed Flow-Mixup with those of several known state-of-the-art regularization methods, including the Empirical Risk Minimization (ERM) principle [20], Mixup [25], and Manifold Mixup [21]. We report the best and the last performances in CXR classification; we report only the best performances in ECG classification, since the last performances are very close to the best performances on the ECG datasets. One can see that Flow-Mixup outperforms the other regularization methods in dealing with various degrees of label corruption, which validates its capability. Flow-Mixup attains better performances than Mixup, which might result from the abnormality-specific features extracted by the nonlinear part (see Sec. IV-B). In ECG classification, Flow-Mixup outperforms Mixup and Manifold Mixup, and a similar conclusion can be derived. Further, one can see from Fig. 6 that Flow-Mixup outperforms Mixup in F1 scores and AUCs in most classes.

Fig. 6. Illustrating the AUC and F1 score for every abnormality with Mixup (α = 3.0) and Flow-Mixup (α = 3.0, Op = False). In Subfig. (b), we show just 11 abnormality categories, excluding a normal category. Note that Mixup achieves an F1 score of 0.0 on the left bundle branch block abnormality.
2) Correlation Conflict Reduction:
To further evaluate Flow-Mixup's ability to reduce correlation conflicts, we compare the F1 score and AUC of every abnormality between Mixup and Flow-Mixup on the ChestX-ray14 test set and the ECG-12 test set, respectively. The histograms of these F1 scores and AUCs are shown in Fig. 6, both with ∼10% label corruption. For easy comparison, we define two new indicators for every class: "Performance Ratio" = (Performance_Mixup / Performance_Flow-Mixup)^r (r is an exponent; r = 10.0 for CXR images and r = 1.0 for ECG records), and "Independent Ratio" = n_c / m_c, where n_c is the number of images with only the class c, and m_c is the number of all images with the class c, including multi-labeled images. The "Performance Ratio" (the "AUC Ratio" for CXRs and the "F1 Ratio" for ECGs in Fig. 6) indicates the relative performance of Mixup and Flow-Mixup on every abnormality, while the "Independent Ratio" suggests to what degree a class is independent in a dataset. The performances are normalized before computing the "Performance Ratio". Thus, one can see whether the relative performances are related to class independence by checking how well the "Performance Ratio" and the "Independent Ratio" coincide. In Fig. 6, the "Performance Ratio" curves coincide with the "Independent Ratio" curves (with Spearman correlation coefficients ≈ 0. ), indicating that Flow-Mixup obtains better performance on relatively dependent classes. Hence, we believe Flow-Mixup is able to reduce the correlation conflicts.
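Both indicators are straightforward to compute per class; a sketch with hypothetical helper names:

```python
import numpy as np

def performance_ratio(perf_mixup, perf_flow, r):
    """(Performance_Mixup / Performance_Flow-Mixup)^r per class, on
    normalized scores (AUC for CXRs, F1 for ECGs)."""
    return (np.asarray(perf_mixup) / np.asarray(perf_flow)) ** r

def independent_ratio(labels):
    """n_c / m_c per class: labels is an (N, C) multi-hot array; n_c
    counts images carrying only class c, m_c all images with class c."""
    single = labels.sum(axis=1) == 1   # single-labeled images
    n_c = labels[single].sum(axis=0)
    m_c = labels.sum(axis=0)
    return n_c / m_c
```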
3) Distribution Shift Reduction:
To evaluate Flow-Mixup's ability to reduce the "distribution shift" phenomenon, we compute the differences (Diff.s) between the best AUCs and the last AUCs on the ChestX-ray14 test set, shown in Table III. The Diff.s on the ECG test sets are not reported since the best and the last Macro-F1 scores are very close. Further, we compute the variances of the normalized performance indicators (AUCs for CXR images and Macro-F1 scores for ECG records) over a number of epochs on the test sets, as:
$$\mathrm{Var}(I) = \frac{1}{n} \sum_{e=1}^{n} (I_e - \bar{I})^2 \quad (8)$$
where $I$ is the normalized performance on the test set, $\bar{I}$ is the average performance, and $e$ is the index of epochs. The normalization method is min-max normalization. The variances are computed over the early 20 epochs on the ChestX-ray14 dataset ($n = 20$) and over the early 100 epochs on the two ECG datasets ($n = 100$), as the indicators fluctuate only slightly in the remaining epochs. As shown in Table III, Flow-Mixup has lower variances than Manifold Mixup, which suggests that Flow-Mixup is more stable. Comparing the Diff.s and variances, it is obvious that training models with Manifold Mixup is not as stable as with Flow-Mixup, which might be due to the instability caused by "distribution shift", as discussed in Sec. IV-C.
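Eq. (8) is the plain variance of the min-max normalized per-epoch scores; a minimal sketch:

```python
import numpy as np

def epoch_variance(scores):
    """scores: per-epoch test AUCs (or Macro-F1s) over the first n
    epochs; assumed non-constant so min-max normalization is defined."""
    s = np.asarray(scores, dtype=float)
    s = (s - s.min()) / (s.max() - s.min())  # min-max normalization
    return ((s - s.mean()) ** 2).mean()      # (1/n) * sum_e (I_e - I_bar)^2
```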
4) Hyperparameter α: As suggested in [21], [25], setting the mixing degree α > . is recommended when dealing with corrupted labels. In our tasks, we find that the models perform well with α > . . Flow-Mixup seems to be insensitive to α, and the results fluctuate within ∼ . on the ECG datasets and within ∼ . on the ChestX-ray dataset for different α ∈ [1., .].

TABLE III. Variances of the normalized performance indicators for Manifold Mixup (α = 3.0) and Flow-Mixup (α = 3.0, Op = False) over n epochs on the test sets: AUC variances for the ChestX-ray14 test set (n = 20) and Macro-F1 variances for the ECG-55 and ECG-12 test sets (n = 100). The lower variances are marked as bold.

VI. CONCLUSIONS
In this paper, we proposed a new regularization approach, Flow-Mixup, for multi-labeled medical image classification with corrupted labels. Guided by Flow-Mixup, a deep learning classifier extracts abnormality-specific features and then maps such features into the label space. Experiments verified that Flow-Mixup can handle datasets containing corrupted labels, and thus makes it possible to apply automatic annotation. Besides, we compared Flow-Mixup with the common Mixup and Manifold Mixup methods, highlighted the characteristics of Flow-Mixup, and discussed the "correlation conflicts" and "distribution shift" phenomena that occur when using Mixup or Manifold Mixup.

VII. ACKNOWLEDGEMENTS
This research was partially supported by the National Research and Development Program of China under grants No. 2019YFB1404802, No. 2019YFC0118802, and No. 2018AAA0102102, the National Natural Science Foundation of China under grant No. 61672453, the Zhejiang University Education Foundation under grants No. K18-511120-004, No. K17-511120-017, and No. K17-518051-02, the Zhejiang public welfare technology research project under grant No. LGF20F020013, and the Key Laboratory of Medical Neurobiology of Zhejiang Province. D. Z. Chen's research was supported in part by NSF Grant CCF-1617735.

REFERENCES

[1] Ehsan Adeli, Dongjin Kwon, and Kilian M. Pohl. Multi-label transduction for identifying disease comorbidity patterns. In MICCAI, 2018.
[2] Devansh Arpit, Stanisław Jastrzębski, Nicolas Ballas, David Krueger, Emmanuel Bengio, Maxinder S. Kanwal, Tegan Maharaj, Asja Fischer, Aaron Courville, Yoshua Bengio, et al. A closer look at memorization in deep networks. In ICML, 2017.
[3] Angelica I. Aviles-Rivero, Nicolas Papadakis, Ruoteng Li, Philip Sellars, Qingnan Fan, Robby T. Tan, and Carola-Bibiane Schönlieb. GraphXNET: Chest X-ray classification under extreme minimal supervision. In MICCAI, 2019.
[4] Haomin Chen, Shun Miao, Daguang Xu, Gregory D. Hager, and Adam P. Harrison. Deep hierarchical multi-label classification of chest X-ray images. In ICLR, 2019.
[5] Mark Cicero, Alexander Bilbily, Errol Colak, Tim Dowdell, Bruce Gray, Kuhan Perampaladas, and Joseph Barfett. Training and validating a deep convolutional neural network for computer-aided detection and classification of abnormalities on frontal chest radiographs. Investigative Radiology, 2017.
[6] Florin C. Ghesu, Bogdan Georgescu, Eli Gibson, Sebastian Guendel, Mannudeep K. Kalra, Ramandeep Singh, Subba R. Digumarthy, Sasa Grbic, and Dorin Comaniciu. Quantifying and leveraging classification uncertainty for chest radiograph assessment. In MICCAI, 2019.
[7] Tomer Golany and Kira Radinsky. PGANs: Personalized generative adversarial networks for ECG synthesis to improve patient-specific deep ECG classification. In AAAI, 2019.
[8] Sebastian Guendel, Sasa Grbic, Bogdan Georgescu, Siqi Liu, Andreas Maier, and Dorin Comaniciu. Learning to recognize abnormalities in chest X-rays with location-aware dense networks. In Iberoamerican Congress on Pattern Recognition, 2018.
[9] Hongyu Guo, Yongyi Mao, and Richong Zhang. Mixup as locally linear out-of-manifold regularization. In AAAI, 2019.
[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In ECCV, 2016.
[11] Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q. Weinberger. Densely connected convolutional networks. In CVPR, 2017.
[12] Mohammad Kachuee, Shayan Fazeli, and Majid Sarrafzadeh. ECG heartbeat classification: A deep transferable representation. In IEEE International Conference on Healthcare Informatics, 2018.
[13] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
[14] Thomas Kurmann, Pablo Márquez-Neila, Sebastian Wolf, and Raphael Sznitman. Deep multi-label classification in affine subspaces. In MICCAI, 2019.
[15] Ricardo Bigolin Lanfredi, Joyce D. Schroeder, Clement Vachet, and Tolga Tasdizen. Adversarial regression training for visualizing the progression of chronic obstructive pulmonary disease with chest X-rays. In MICCAI, 2019.
[16] Keon Myung Lee, Sang Yeon Lee, Chan Sik Han, and Seung Myung Choi. Long bone fracture type classification for limited number of CT data with deep learning. In ACM Symposium on Applied Computing, 2020.
[17] Pranav Rajpurkar, Jeremy Irvin, Robyn L. Ball, Kaylie Zhu, Brandon Yang, Hershel Mehta, Tony Duan, Daisy Ding, Aarti Bagul, Curtis P. Langlotz, et al. Deep learning for chest radiograph diagnosis: A retrospective comparison of the CheXNeXt algorithm to practicing radiologists. PLoS Medicine, 2018.
[18] Yichen Shen, Maxime Voisin, Alireza Aliamiri, Anand Avati, Awni Hannun, and Andrew Ng. Ambulatory atrial fibrillation monitoring using wearable photoplethysmography with deep learning. In KDD, 2019.
[19] Saeid Asgari Taghanaki, Mohammad Havaei, Tess Berthier, Francis Dutil, Lisa Di Jorio, Ghassan Hamarneh, and Yoshua Bengio. InfoMask: Masked variational latent representation to localize chest disease. In MICCAI, 2019.
[20] Vladimir N. Vapnik and A. Ya. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Its Applications, 1971.
[21] Vikas Verma, Alex Lamb, Christopher Beckham, Amir Najafi, Ioannis Mitliagkas, David Lopez-Paz, and Yoshua Bengio. Manifold Mixup: Better representations by interpolating hidden states. In ICML, 2019.
[22] Xiaosong Wang, Yifan Peng, Le Lu, Zhiyong Lu, Mohammadhadi Bagheri, and Ronald M. Summers. ChestX-ray8: Hospital-scale chest X-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases. In CVPR, 2017.
[23] Yunyan Xing, Zongyuan Ge, Rui Zeng, Dwarikanath Mahapatra, Jarrel Seah, Meng Law, and Tom Drummond. Adversarial pulmonary pathology translation for pairwise chest X-ray data augmentation. In MICCAI, 2019.
[24] Donna Xu, Yaxin Shi, Ivor W. Tsang, Yew-Soon Ong, Chen Gong, and Xiaobo Shen. Survey on multi-output learning. IEEE Transactions on Neural Networks and Learning Systems, 2019.
[25] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. Mixup: Beyond empirical risk minimization. In ICLR, 2018.