COIN: Contrastive Identifier Network for Breast Mass Diagnosis in Mammography
Heyi Li, Dongdong Chen, William H. Nailon, Mike E. Davies, Fellow, IEEE, and David Laurenson
Abstract—Computer-aided breast cancer diagnosis in mammography is a challenging problem, stemming from mammographical data scarcity and data entanglement. In particular, data scarcity is attributed to privacy concerns and the expense of annotation, while data entanglement is due to the high similarity between benign and malignant masses, whose manifolds reside in a lower-dimensional space with a very small margin. To address these two challenges, we propose a deep learning framework, named Contrastive Identifier Network (COIN), which integrates adversarial augmentation and manifold-based contrastive learning. First, we employ adversarial learning to create both on- and off-distribution mass-contained ROIs. We then propose a novel contrastive loss built upon a signed graph. Finally, the neural network is optimized in a contrastive learning manner, with the purpose of improving the deep model's discriminativity on the extended dataset. In particular, with COIN, data samples from the same category are pulled close, whereas those with different labels are pushed further apart in the deep latent space. COIN outperforms the state-of-the-art related algorithms for breast cancer diagnosis by a considerable margin, achieving 93.4% accuracy and a 95.0% AUC score. The code will be released on ***.
Index Terms—Deep Learning, Breast Cancer Diagnosis, Contrastive Learning, Adversarial Learning, Manifold Learning
I. INTRODUCTION
Breast cancer is widely acknowledged as the most frequently diagnosed cancer and the second most fatal disease for women around the world [1]. Although no effective method has been discovered for prevention, mammography screening is advantageous for early breast mass diagnosis (BMD), which has practically increased the associated survival rates along with early treatment [2]. Screening mammography is particularly useful when tumours are invasive (measuring < cm) and too small to be palpable or cause symptoms [3]. However, manual interpretation has been limited by wide variations in pathology and the potential fatigue of human experts [2]. Double reading is thereby employed in many western countries [4, 5], and has been proven to increase both the sensitivity and specificity of the interpretations. In recent years, computer-assisted interventions have been designed and employed to benefit researchers and doctors as an alternative to a human double reader for optimal healthcare [6, 7].

H. Li, D. Chen, William H. Nailon, Mike E. Davies, and David Laurenson are with the School of Engineering, the University of Edinburgh, Edinburgh, EH9 3JL, U.K. (e-mail: {Heyi.Li, D.Chen, W.Nailon, Mike.Davies, Dave.Laurenson}@ed.ac.uk). This work was supported in part by the ERC C-SENSE project ERC ADG-2015-694888.

Fig. 1: An illustration of the BMD challenges with an INbreast dataset example: Q1 (data scarcity) and Q2 (data entanglement). Red stands for the 2D t-SNE [8] embedding of malignant masses and blue for that of benign lesions. The four images are corresponding mass examples.

A. Classical Methods for BMD
Breast mass classification between benign and malignant lesions is one of the most important and challenging tasks for commercial computer-aided diagnosis systems (CADs). This is not only because of the small proportion of cancerous cases among all screenings, but also due to their high similarity. This characteristic is illustrated in Fig. 1, where benign and malignant masses are visually very similar and embed in an intersecting manner under t-SNE visualization [8]. Although the speed of development of CADs has not been as rapid as that of medical imaging techniques, the situation has improved as machine learning approaches advance [9]. When dealing with the classification or diagnosis task, finding or learning distinctive features of cancerous masses and their surrounding tissues is the most important step, so that inherent regularities or patterns can be well described [2]. Traditionally, meaningful features were hand-engineered by domain experts [10], which instills task-specific knowledge [11]. However, the major drawback of this process is clear: machine learning engineers have to develop the essential algorithms with help from medical domain experts. Additionally, manually designed features may introduce a strong bias into the training of the algorithm, resulting in limited performance [12], e.g. a high false positive rate and low specificity [13].

B. Deep Learning Methods for BMD
In recent years, owing to the success of deep neural networks (deep learning) [14] in various computer perception tasks [15], a noticeable shift from rule-based, problem-specific solutions to increasingly generic, problem-agnostic algorithms has been seen in mammographical CADs [16]–[23]. Specifically, [19] and [20] claimed that features extracted by a CNN can achieve better performance for breast mass discrimination than various hand-crafted features. However, passing a mammographical mass through the lower-dimensional classification bottleneck is very difficult for CNN models, yielding imprecise predictions. This is not only because of the low signal-to-noise ratio of the screening images, as in other medical imaging modalities [2], but also because breast masses in mammography suffer from two other major problems:

• Q1 - Data scarcity [24, 25], which is difficult to solve due to patients' privacy and the tremendous annotation workload for human experts;

• Q2 - Data entanglement, which is very challenging compared to natural image recognition problems, and is attributed to the small margin between the benign and malignant data manifolds (Fig. 1).

The recent efforts that have been made on these two major problems are discussed in detail in Sec. II.

C. Our contribution
Based on all of the above observations, in this paper we propose a new deep convolutional neural network, called Contrastive Identifier Network (COIN), in which contrastive learning and manifold learning are integrated for breast mass classification (benign vs. malignant). In particular, we propose to employ adversarial learning for data augmentation, so that both on- and off-manifold samples with more distinctive features are created in an unsupervised fashion. We also propose a novel triplet contrastive loss, which exploits the merit of a signed similarity graph; in this way, the locality of the manifold is approximated as the deep network is trained. By incorporating these two methods into the deep neural network, we solve the manifold embedding problem by a learning process, instead of computing the expensive eigenvalue decomposition required for standard graph spectral learning [26]. By integrating these two methods, feature discriminativity is improved in the deep latent space (Fig. 3): data samples from the same class are pulled close, while those with different labels are pushed away. Consequently, the intra-class difference is minimized and, more importantly, the inter-class manifold margin is maximized in the deep representation space. A preliminary version of this work appeared in [27]. This paper extends [27] with further discussion and experiments, so as to prove the effectiveness of our motivation for solving data scarcity (Q1) and data entanglement (Q2).

II. RELATED WORK
In this section, we introduce the existing solutions, and their limitations, for solving Q1 (data scarcity) and Q2 (data entanglement).

A. Approaches to Q1

In order to alleviate the data scarcity problem, the works in [18, 24, 25, 28] applied classical affine or elastic transformations for data augmentation in mammography (e.g. flips, rotations, random crops, etc.). These methods are straightforward and effective for increasing the total amount of training data. However, the distributions of the generated samples are not clear, and generated samples from unknown distributions are likely to cause even worse generalization [29]. Accordingly, adversarial learning [30] has been employed to generate synthetic images on the manifold of real mammograms, benefiting from its powerful ability to learn the underlying distribution implicitly, without modeling the original data prior. So far, only one application to mammography has been reported that addresses the breast mass classification problem [31], in which both benign and malignant mass-contained ROIs are created by a conditional generative adversarial net (GAN). However, the performance is less encouraging: their experiments showed only a limited AUC improvement compared to conventional augmentation methods [31]. This is potentially because GAN-based augmentations disregard the importance of off-distribution samples that are located close to the real data manifold [32]. We believe these off-distribution samples may also play a very important role in increasing discriminativity while training the model.
B. Approaches to Q2

In order to mitigate the challenge of data entanglement, many efforts have used CNNs to increase the discriminativity of latent features in the BMD problem. For example, some researchers have proposed extracting segmentation-related features with CNNs, either from radiologists' pixel-level annotations [25] or from semantic masks generated by automatic segmentation algorithms [28]. This type of algorithm was originally inspired by the importance of hand-crafted shape and boundary features [2]. Although these algorithms have improved diagnosis performance, they are typically complicated to construct, due to their multi-problem structures, multi-phase training, or large numbers of parameters, and these aspects are especially challenging for medical experts. More recently, contrastive learning has shown great promise as a powerful discriminative approach in various computer vision models [33]–[37]. Nevertheless, to the best of our knowledge, this method has never been employed in any mammography-related problem. In essence, the family of contrastive objective functions aims to enlarge the distances between feature vector pairs in the deep latent space in a self-supervised manner [36]. Although feature vectors can be separated from each other by this technique, the inherent structural and geometrical features of the data are ignored, so features in the latent space cannot be enhanced across classes. Manifold learning, on the other hand, can mitigate this dilemma by preserving the topological locality of the data [15]. It is widely employed as a non-linear dimensionality reduction method, since in real applications data typically resides on a low-dimensional manifold embedded in a high-dimensional ambient space [38]. However, there are few approaches using manifold learning to solve classification problems. In fact, there are neither studies on manifold analysis for mammography nor uses of manifold learning to alleviate the high data-similarity problem. It is therefore very meaningful to conduct preliminary studies on using manifold learning for mammography screening diagnosis.

Fig. 2: Augmented mass ROIs by conditional GAN [39] (first row), and positive and negative neighbors by our proposed adversarial augmentation method in the second and third rows, respectively.

III. METHODOLOGY
In this section, after discussing the notations and problem formulation used in this paper, we formally introduce the details of COIN, which consists of three steps, as demonstrated in Fig. 3: 1) adversarial augmentation for mammography; 2) a signed graph Laplacian built upon the augmented data; 3) the proposed contrastive loss and the overall objective function. Additionally, we present the details of the constructed deep network and the corresponding implementation.
A. Notations and Problem Formulation
Given a dataset D = {(x_i, y_i)}_{i=1}^N, x_i ∈ R^{H×W} is a real-valued grayscale ROI and y_i is the corresponding mass diagnosis label. Note that each ROI contains only one mass, cropped and resized into the fixed size H × W from a certain mammogram, where H and W are equal. With the defined dataset D, let D_c = {(x_{i,c}, y_{i,c})}_{i=1}^{N_c} be the sub-dataset with N_c samples from the c-th category, where c ∈ {Benign, Malignant}, and x_{i,c} ∈ X_c and y_{i,c} ∈ Y_c are an arbitrary data sample and its label in this sub-dataset.

The main targets solved by COIN can be formulated as follows. (1) Given a mass-contained mammogram ROI, adversarial augmentation (discussed in Sec. III-B) is first employed for each mass category, one by one, so that both on-distribution and off-distribution samples of each class are created: x_{i,c} → {x^+_{i,c}, x^-_{i,c}}, where x^+_{i,c} ∈ X^+_c is positive (indistinguishable from the real masses in X_c by the discriminator) and x^-_{i,c} ∈ X^-_c is negative (distinguishable by the discriminator). (2) For each mass category, with the expanded dataset {X_c, X^+_c, X^-_c}, the local signed graph is then constructed. (3) Based on the results of the preceding two steps, the contrastive loss is optimized within the local signed graph in the deep latent space, learning a nonlinear embedding x_i → h(x_i) under which the margin between the manifolds of the two categories is maximized. Finally, the latent features are transformed into a diagnosis label with a softmax function: h(x_i) → y_i.

TABLE I: BMD performance of the constructed deep CNN with conventional and CGAN [39] augmentations on the benchmark INbreast dataset.

  Augmentation Method           Accuracy   AUC
  Baseline (no augmentation)    83%        0.85
  Conventional augmentation     87%        0.88
  CGAN augmentation [39]        88%        0.89
  Proposed augmentation         89%        0.92

B. Adversarial Augmentation for Data Scarcity (Q1)

1) Motivation:
As previously mentioned, data scarcity and the high resemblance between benign and cancerous masses are the two major causes [25] of the limitations of mammographical CADs, which typically exhibit high false positive rates and low sensitivity. Recent studies [31, 40] and [17] have employed GANs to create new mammogram instances. In particular, Wu et al. [31] proposed an infilling method, by which generated masses are synthesized into normal mammogram tissue. By utilizing a class-conditioned GAN, the new samples produced by the generator are forced onto the same distribution as the original data. Yet, they have ignored the importance of surrounding tissues, where blood-vessel textures play a vital role in diagnosing cancerous lesions. This may explain the limited improvement of their approach over affine augmentation.

It is therefore natural to directly employ a conditional GAN [39] to create mass-contained ROIs from either the benign or malignant class, for the purpose of enlarging the training set while preserving the surrounding contextual features. Specifically, the generator in [39] maps an observed image x_{i,c} from class c and random noise ω to the output estimation x^+_{i,c}, i.e. {x_{i,c}, ω} → x^+_{i,c}. The discriminator involves two mapping components: one is the distinguishing mapping {x_{i,c}, x^+_{i,c}} → z_{i,c}, where z_{i,c} is the predicted probability of being a real data image; the other is a distance-conditional guidance, by which the deep latent features of a created sample are mapped to those of the real data sample, i.e. f(x^+_{i,c}) → f(x_{i,c}), where f(·) is the non-linear function learned by the CNN.
As described in [39], the generator is constructed as an auto-encoder with skips, and the discriminator applies a dual-path CNN architecture with VGG-19 [41] as the backbone network [42].

Fig. 3: An illustration of the proposed COIN framework for BMD, which consists of three steps: adversarial augmentation, signed graph construction, and joint optimization. In the figure, samples lie on the benign manifold M_b and the malignant manifold M_m. In the first step (adversarial data augmentation), positive neighbors are created with Eq. (2) and negative neighbors with Eq. (3), for the benign and malignant manifolds separately. After that, a signed graph is built upon both original and augmented samples as in Eq. (4). Finally, the joint loss of Eq. (7) is optimized in the deep latent space, so that the margin between the benign and malignant manifolds is maximized.

The augmentation samples generated by the conditional GAN [39] are shown in the first row of Fig. 2, and the empirical classification comparison is shown in Table I. As shown in Fig. 2, the conditional GAN [39] exhibits limited ability in extracting low-frequency features, focusing instead on high-frequency information when compared with the original mass samples. The shapes of the augmented masses are in fact very similar to the realistic ones. In addition, the spiculated lines and blood vessels are vividly shown in the mass surroundings, and mass boundaries can be seen with high contrast. Yet, the generated lesions are visually very noisy, especially in the regions within the masses, where textural features are barely depicted. Also, as shown in the first row of Fig. 2, no surrounding tissue has been generated as background in the last subfigure. In order to examine the effectiveness in increasing model discriminativity, we empirically compare the breast mass diagnosis performance (classification accuracy and AUC score) in Table I. It can be seen that both augmentation methods increase the breast mass diagnosis performance over the baseline model by a similarly small margin, although the model complexity of the conditional GAN is much higher than that of affine transformations. This limitation of GAN-based methods may stem from neglecting samples that the discriminator can distinguish but that are located very close to the original data distribution. These off-manifold samples are highly similar to the original data and may confer diverse benefits to classifier discriminativity when used in training alongside on-distribution samples.
2) Proposed algorithm:
In order to overcome this defect, found in previous works and in our experiments on the cGAN [39], we wish to enlarge the mammography dataset while creating more distinctive samples. Inspired by Yu et al.'s recent research on the open-category classification problem [32], we propose to use adversarial learning to augment mammographical masses with an optimization-free algorithm. In this way, we augment the original dataset with both positive neighbors, i.e. new instances that lie on the original data manifold, and negative neighbors, i.e. augmented samples that lie off the original data manifold (both illustrated in Fig. 3).

Specifically, augmented data samples are generated for each class c separately. For every mass type, the positive neighbors X^+_c and the negative neighbors X^-_c are created with the same model but with different objective functions. In particular, the positive neighbors X^+_c are generated samples that cannot be separated from X_c by the discriminator, while the negative neighbors X^-_c are the ones that can be separated. Finally, the expanded dataset for class c is of the form X̃_c = {X_c ∪ X^+_c ∪ X^-_c}, and the whole dataset is X̃ = ∪_c X̃_c.

In terms of the generator, random noise ω is used to corrupt selected seed points, which are a number of randomly selected samples from X_c. This step is simply a noise addition, so no optimization with any objective function is involved. By applying the generator, new instances, including both the positive neighbors X^+_c and the negative neighbors X^-_c of samples from class c, are created. All of the new sample nodes are close to the original data points, no matter whether they are positive or negative neighbors. After the new instances are generated, the resulting samples are fed into the discriminator network, which is trained to distinguish the augmented samples from the original data instances.
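The optimization-free "generator" step described above can be sketched in a few lines of NumPy; the noise scale and ROI size here are illustrative assumptions, not values from the paper:

```python
import numpy as np

def generate_candidates(seed_pool, T=200, noise_scale=0.05, rng=None):
    """Corrupt randomly selected seed ROIs with additive noise.

    This mirrors the optimization-free generator above: no objective is
    involved, candidates simply scatter around the seed points.
    `noise_scale` is a hypothetical knob, not from the paper.
    """
    rng = np.random.default_rng(rng)
    idx = rng.integers(0, len(seed_pool), size=T)            # seed selection
    seeds = seed_pool[idx]
    omega = rng.normal(0.0, noise_scale, size=seeds.shape)   # noise ω
    return np.clip(seeds + omega, 0.0, 1.0)                  # stay in image range

# e.g. 10 grayscale ROIs of size 8x8; candidates inherit that shape
pool = np.random.default_rng(0).random((10, 8, 8))
cands = generate_candidates(pool, T=50, rng=1)
```

The candidates are then passed to the discriminator, which decides whether each one becomes a positive or a negative neighbor.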
We adopt an SVM classifier as the discriminator for each type of neighbor of class c, by which the generated samples are discriminated as the "real" or "fake" category. The output of the discriminator P_D, ranging over [0, 1], indicates how "real" the generated mass is, where P_D = 1 represents real and P_D = 0 denotes generated. The corresponding probability score of the SVM is calculated by the logistic sigmoid of the output signed distance, which is formulated as

    P_D(x) = exp(d̃(x)) / (exp(d̃(x)) + 1),    (1)

where d̃(x) is the signed distance to the decision boundary.

With the built generator and discriminator, we create the new masses one by one, training two SVM classifiers, for the positive and the negative neighbors, separately. Regarding the creation of positive neighbors, let x be a desired new sample for class c, and P_D(x; X_c, X^+_c) be the output probability score of the discriminator trained for positive neighbors. Here, the aim is to generate new samples that are as similar as possible to the original instances, so the discriminator is trained on the union of x and {X_c, X^+_c}. Note that X^+_c represents the already existing positive neighbors, and is initialized as empty. For each training batch, T generated samples {x_t}_{t=1}^T and T original data images from X_c (for data balance) are used as the input of the discriminator, and the weights are updated. After being fully trained, we select only the single best generated sample in each batch, according to the objective

    argmax_x  P_D(x; X_c, X^+_c ∪ {x_t}_{t=1}^T) − γ max{0, r_1 − min_{x_i ∈ X^+_c} d(x, x_i)},    (2)

where d(·) is a distance measure and γ weights the distance regularization. This regularization term forces the generated points to differ by a minimum distance r_1, giving the generator better generalization.

Regarding the creation of negative neighbors, let P_D(x; X_c, X^-_c) be the corresponding output of the discriminator, predicting the probability of x being labeled as a "real" data sample from class c. X^-_c is the existing negative neighbor set and is initialized as empty. In this scenario, we would like to select generated samples that are not only off the original data manifold but also located close to the original data; in this way, the new samples can provide discriminative information. Specifically, in a training batch, we select the desired negative neighbor x from the T generated samples according to the objective

    argmin_x  P_D(x; X_c, X^-_c ∪ {x_t}_{t=1}^T) + γ max{0, r_2 − min_{x_j ∈ X^-_c} d(x, x_j)}
              + γ max{0, min_{x_i ∈ X_c} d(x, x_i) − r_3},    (3)

where the first regularization term forces generated points to keep a minimum distance r_2 apart from each other, and the added distance restriction forces new points to be scattered close to X_c, so that the minimum distance of x to the original images is at most r_3. The distance measure d(·) in (2) and (3) is set to be the angular cosine distance because of its superior discriminative information [43]. Let ρ = min_{x_i, x_j ∈ X_c} d(x_i, x_j); we then set the radius parameters r_1, r_2 = ρ and r_3 = 3ρ for X_c. Further, T = 200, and γ is a fixed weight.

For the optimization of (2) and (3), we employ the derivative-free optimization method proposed in [44], which considers the problem argmax_{x ∈ X} f(x). Instead of calculating gradients with respect to each parameter, this technique samples a number of candidate solutions x, from which feedback information is learned to search for better solutions.
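Concretely, the batch-wise selection behind Eq. (2) can be sketched as follows; this is a minimal sketch in which the candidate scores stand in for the trained SVM discriminator outputs, and the function names, default radius, and weight are our own illustrative choices:

```python
import numpy as np

def cosine_dist(a, b):
    """Angular cosine distance d(., .) used in Eqs. (2)-(3)."""
    a, b = np.ravel(a), np.ravel(b)
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12)

def select_positive(cands, scores, existing_pos, r1=0.5, gamma=10.0):
    """Pick the candidate maximizing Eq. (2): a high 'real' score,
    penalized when closer than r1 to an already accepted neighbor."""
    best, best_val = None, -np.inf
    for x, p in zip(cands, scores):
        penalty = 0.0
        if existing_pos:                      # X_c^+ starts empty
            nearest = min(cosine_dist(x, xi) for xi in existing_pos)
            penalty = gamma * max(0.0, r1 - nearest)
        if p - penalty > best_val:
            best, best_val = x, p - penalty
    return best

# a near-duplicate of an accepted neighbor loses despite its higher raw score
accepted = [np.array([1.0, 0.0])]
picked = select_positive([np.array([1.0, 0.0]), np.array([0.0, 1.0])],
                         [0.9, 0.8], accepted)
```

Negative-neighbor selection under Eq. (3) follows the same pattern with the score minimized and the extra closeness term added.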
The advantage of this method is its ability to optimize problems with poor mathematical properties, such as non-convexity, non-differentiability, and many local optima [44].

C. Contrastive Learning to Enhance Discriminativity (Q2)
Investigators have achieved promising diagnosis performance for mammography by using deep neural networks. Yet one major limiting factor for continued progress is that deep models disregard the structural features of the data. We consider integrating the inherent geometrical factors of the data into CNNs with the merits of contrastive learning. By doing this, samples originating from the same distribution are forced to be close, whereas samples belonging to different categories are pushed away in the embedding space; thus, the model's discriminativity is expected to improve.
1) Motivation:
Contrastive learning was initially proposed to solve the manifold embedding problem in a self-supervised manner [45] and has hence been extensively applied in representation learning [34, 46]. This is attributed to its promising ability to improve a model's discriminativity by measuring similarities between correlated sample pairs, instead of directly computing sample-wise loss functions (e.g. softmax, hinge, or mean squared error loss). Specifically, for a certain anchor sample, only one positive or negative pair is used for the calculation [36]. Positive pairs can be selected by data augmentation or co-occurrence [37], while negative pairs are typically data samples uniformly sampled from other classes. The triplet loss [47] works in a similar manner but in a supervised way, where labeled triplets, rather than unlabeled neighboring sample pairs, are selected for the loss calculation. Specifically, the triplet loss depends on triplet-correlated samples, which include one positive pair (belonging to the same class as the anchor) and one negative pair (from other classes) [48]. Although contrastive learning is effective at separating dense samples in the deep latent space, the typical triplet loss is not suitable for classifying mammographic breast masses. In fact, random selection of negative and positive pairs can lead to worse generalization than the baseline, since the margins between mammogram manifolds of different classes are very small. On the contrary, with the use of manifold learning, approximated by a designed local signed graph, contrastive learning is able to preserve manifold locality knowledge, thus maximizing the manifold margin through the penalty imposed by the selected neighboring positive and negative samples.
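For reference, the vanilla supervised triplet loss [47] discussed above is a one-line hinge; it is this randomly paired form, rather than the locality-aware graph variant of Secs. III-C2 and III-C3, that generalizes poorly on closely entangled mammogram manifolds:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Textbook triplet loss: the anchor should sit at least `margin`
    closer to the positive than to the negative.  In the vanilla form,
    the pairs are chosen at random, ignoring manifold locality."""
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(0.0, d_pos - d_neg + margin)

a = np.array([0.0, 0.0])   # anchor
p = np.array([0.1, 0.0])   # same-class sample
n = np.array([2.0, 0.0])   # other-class sample
```

With these toy points the loss is zero, since the negative is already more than a margin further away than the positive; swapping p and n makes the hinge active.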
2) Signed Similarity Graph:
Graph embeddings are trained with distributional context knowledge, which can boost performance in various pattern recognition tasks. Here, we aim to incorporate the signed graph Laplacian regularizer [49] to learn a discriminative data representation H(X) with a deep neural network, where discriminative means that the intra-class data manifold structure is preserved in the latent space and the (slightly differing) inter-manifold margins are maximized.

Using the supervision of the adversarial augmentation in Sec. III-B, we build a signed graph upon the expanded data X̃. Given X̃_c = {X_c, X^+_c, X^-_c} for class c, and the data of all other classes X̃_{-c} = ∪_{t=1,...,C; t≠c} {X_t, X^+_t, X^-_t}, for any x_i ∈ X̃_c the corresponding elements of the signed graph are built as

    φ_ij = { +1,  x_j ∈ {X_c ∪ X^+_c}_{n^+_i},
           { −1,  x_j ∈ {X̃_{-c} ∪ X^-_c}_{n^-_i},    (4)

where {·}_{n^+_i} ({·}_{n^-_i}) denotes the n^+ (n^-) nearest neighborhood of x_i, used to approximate the locality of the manifold.
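A small NumPy sketch of the construction in Eq. (4); the Euclidean nearest-neighbor search is an illustrative assumption, and the augmented positives and negatives are assumed here to have been merged into `X` with their class labels:

```python
import numpy as np

def signed_graph(X, labels, n_pos=1, n_neg=4):
    """Build the signed adjacency of Eq. (4) with a k-NN rule.

    phi[i, j] = +1 for the n_pos nearest same-class samples of x_i,
    and -1 for the n_neg nearest samples from the complementary set.
    The (n_pos, n_neg) defaults follow the best setting reported in
    Fig. 5; the Euclidean metric here is an illustrative choice.
    """
    N = len(X)
    phi = np.zeros((N, N), dtype=int)
    D = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    for i in range(N):
        same = np.where((labels == labels[i]) & (np.arange(N) != i))[0]
        diff = np.where(labels != labels[i])[0]
        for j in same[np.argsort(D[i, same])][:n_pos]:
            phi[i, j] = +1
        for j in diff[np.argsort(D[i, diff])][:n_neg]:
            phi[i, j] = -1
    return phi

# two tight clusters, one per class
X = np.array([[0.0], [0.1], [5.0], [5.1]])
labels = np.array([0, 0, 1, 1])
phi = signed_graph(X, labels, n_pos=1, n_neg=1)
```

Each row of `phi` then defines the positive and negative neighborhoods that enter the contrastive loss of the next subsection.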
3) Triplet contrastive loss:
Then, we compute the structure preservation in the deep representation space (directly behind the softmax layer, as shown in Fig. 4) H = {h(x_i)}_{i=1}^N, where N = |X̃|. The signed graph Laplacian regularizer is defined as

    J_g(X̃, Φ) = Σ_{i,j} { φ_ij · dist(h(x_i), h(x_j)),               if φ_ij > 0,
                         { max(0, m + φ_ij · dist(h(x_i), h(x_j))),  if φ_ij < 0,    (5)

where dist(·) is a distance metric for the dissimilarity between h(x_i) and h(x_j). It encourages similar examples to be close, and dissimilar ones to keep a distance of at least m from each other, where m > 0 is a margin.

Note that instead of calculating the manifold embedding by solving an eigenvalue decomposition, we learn the embedding H with a deep neural network. Specifically, inspired by the depth-wise separable convolutions [50] that are extensively employed to learn mappings with a series of factorized filters, we build stacks of depth-wise separable convolutions with a topological architecture similar to that in [50] to learn such deep representations (Fig. 4).

Therefore, by minimizing (5), it is expected that if two connected nodes x_i and x_j are from the same class (i.e. φ_ij is positive), h(x_i) and h(x_j) are also close to each other, and vice versa. Benefiting from such learned discriminativity, we train a simple softmax classifier to predict the class label, i.e.,

    J_l = −(1/N) Σ_{i=1}^N Σ_{c=1}^C δ_c(y_i) log P(y_i | x_i; θ),    (6)

where δ_c(y_i) = 1 when y_i = c and 0 otherwise, and θ is the parameter set of the neural network.
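A minimal numerical sketch of Eq. (5), with squared Euclidean distance as our illustrative choice for dist(·):

```python
import numpy as np

def graph_loss(H, phi, m=1.0):
    """Signed graph Laplacian regularizer J_g of Eq. (5): pull
    phi > 0 pairs together, hinge-push phi < 0 pairs at least m apart."""
    J = 0.0
    for i in range(len(H)):
        for j in range(len(H)):
            d = float(np.sum((H[i] - H[j]) ** 2))
            if phi[i, j] > 0:
                J += phi[i, j] * d
            elif phi[i, j] < 0:
                J += max(0.0, m + phi[i, j] * d)   # hinge with margin m
    return J

# three 1-D embeddings: (0,1) same class, (0,2) opposite classes
H = np.array([[0.0], [0.2], [0.5]])
phi = np.zeros((3, 3)); phi[0, 1] = 1; phi[0, 2] = -1
# J = 1 * 0.04 + max(0, 1 - 0.25) = 0.79
```

The total objective of Eq. (7) then simply adds this regularizer, weighted by λ, to the cross-entropy of Eq. (6).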
4) Total Loss:
Finally, by incorporating the signed Laplacian regularizer (5) and the classification loss (6), the total objective of COIN is accordingly defined as

    J = J_l + λ J_g,    (7)

where λ ≥ 0 is the regularization trade-off parameter, which controls the smoothness of the hidden representations.

D. Network Architecture and Implementation
The proposed CNN model is constructed with the architecture shown in Fig. 4. In the first four convolutional layers, down-sampling convolutional blocks (DC blocks), each involving two separable convolutions, are employed. Specifically, the separable convolution operators decompose standard convolutions into consecutive depth-wise and point-wise operations. After that, a pooling layer halves the spatial size of the feature maps, and the output of the down-sampling layer is obtained through the ReLU nonlinearity. The four DC blocks transform the original input into feature maps of progressively halved spatial size. Sequentially, seven separable convolutional layers are appended, reducing the total number of parameters, before three fully connected layers with equal numbers of neurons. The obtained latent features of the enlarged dataset are then regularized with the proposed contrastive loss in Sec. III-C. Finally, the learned features are classified into binary classes (0 denotes "Benign" and 1 represents "Malignant").

Fig. 4: The deep neural network architecture constructed in COIN to extract deep latent features. "DC block" represents a down-sampling convolutional block, "RC block" is a residual convolutional block, and "SConv" is a separable convolution.

IV. EXPERIMENTS
In this section, extensive experiments are conducted to validate the proposed algorithm. We first examine the quality of masses generated by both adversarial augmentation modules. We then expand the original dataset with the augmented data and build the signed graph. To better evaluate the performance, we validate the proposed algorithm on a small FFDM mammography dataset: the INbreast dataset [51].
A. Adversarial Augmentation Performance
To visually examine the quality of the images generated by the proposed adversarial augmentation strategy, Fig. 2 shows augmented examples for the benign and malignant categories (blue stands for benign and red for malignant masses). It is noticeable that the difference between positive (second row) and negative neighbors (third row) within each category is subtle. Visually, it is very difficult to differentiate them within each mass type, not only for the masses themselves but also for the contextual or background tissues. This indicates that the generated negative neighbors are challenging to recognize, and thus tend to play an important role in increasing the model's discriminative ability. When we compare the samples generated by our proposed method with cGAN-generated samples (first row), we notice that the generated positive and negative samples of both benign and malignant categories are less noisy, with a more balanced concentration of low- and high-frequency signals. From the left column subfigures, it can be seen that both negative and positive neighbors of benign masses are oval or round in shape with relatively smooth boundaries, very similar to those of the original INbreast data (Fig. 1). Additionally, the textural and contextual features of the generated and real samples are visually highly alike. From the right column of Fig. 2, it can be seen that the shapes of the resulting malignant masses (including both positive and negative neighbors) are mostly irregular, and the boundaries are fuzzy with spiculated vessels. These characteristics match the malignant masses in the original INbreast dataset (Fig. 1).

In order to further evaluate the effectiveness of the proposed adversarial augmentation, we design a series of experiments to test the discriminability of the generated mass samples. As shown in Tab. I, we evaluate the classification performance of the proposed CNN architecture (Fig. 4) with different augmentation algorithms: original INbreast data (baseline), conventional augmentation (flips and rotations), cGAN augmentation [39], and the proposed adversarial augmentation (positive neighbors only, i.e. (n+, n−) = (5, 0) and λ = 0). Note that we optimize the CNN model with the cross-entropy loss. From Tab. I, we can see that all augmentation algorithms improve the classification performance over the baseline model. Conventional augmentation and the cGAN [39] achieve similar discriminative performance, whereas the proposed augmentation outperforms the other listed methods in both accuracy and AUC score, achieving 89% accuracy and a 0.92 AUC score.

Fig. 5: BMD performance (accuracy and AUC score) of COIN on INbreast vs. various hyper-parameters λ, n+ and n−. (a) shows the performance with different n+ positive neighbors and n− negative neighbors when λ equals 1, and (b) depicts various regularizer parameters λ with n+ = 1 and n− = 4.

Fig. 6: t-SNE plots for the test set of the INbreast dataset. (a), (b) and (c) show the embeddings of latent features trained by COIN with various learning configurations: (a) n+ = 0, n− = 0; (b) n+ ≠ 0, n− = 0; (c) n+ ≠ 0, n− ≠ 0.

B. Signed Graph Laplacian performance
Determining the optimal values of hyper-parameters is a major challenge in deep learning. To explore COIN's performance with different Signed graph configurations, the number of positive neighbors n+ and the number of negative neighbors n− are first grid searched with a fixed regularization parameter λ = 1, as shown in Fig. 5a. The best performance occurs when n+ = 1 and n− = 4, which improves the accuracy by at least 8% and the AUC score by at least 12% compared to no graph regularization. This confirms the effectiveness of the signed graph regularization and validates the importance of negative neighbors for improving discriminability and maximizing the manifold margin. In addition, the results show that COIN achieves good performance only when both n+ and n− are considered in the corresponding signed graph construction. Furthermore, we fix the best-performing Signed graph configuration to evaluate the λ value, and obtain the best AUC score and accuracy at λ = 1. These results indicate that the deep latent features extracted by the deep network and the data's inherent structural features are both important when distinguishing malignant breast masses from benign ones.

TABLE II: Breast mass diagnosis performance comparison of the proposed COIN and related state-of-the-art methods on the INbreast test set.

Methodology | Accuracy | AUC
Domingues et al. (2012) [52] | 89% | N/A
Dhungel et al. (2016) [25] | 91% | 0.76
Zhu et al. (2017) [18] | 90% | 0.89
Shams et al. (2018) [17] | 93% | 0.92
Li et al. (2019) [28] | 88% | 0.92
COIN (ours) | 93.4% | 0.95

To visually observe the performance of data manifold learning, we further explore the learned feature embeddings plotted by t-SNE for the test set (Fig. 6). For the purpose of an ablation study, we explore the performance of COIN with different learning configurations. Fig. 6a shows COIN without any intra-class or inter-class Signed graph regularization (provided by positive or negative neighbors, respectively). Fig. 6b shows the learning performance when COIN is regularized only by intra-class regularization, i.e. without negative neighbors. Fig. 6c illustrates COIN learning when both intra- and inter-class regularization are employed. Comparing these three conditions, the worst performance is obtained with no regularization (Fig. 6a), where samples of the two categories are highly intersected. When the model is trained with intra-class regularization (Fig. 6b), it achieves better discriminability, with 15% of samples misclassified. COIN with both negative and positive regularization (Fig. 6c) achieves the best embedding of the test data, where 82 out of 88 masses, or approximately 93% of the test samples, are correctly identified. Additionally, we attach the original mass images for some randomly selected misclassified masses in Fig. 6. We can notice that the misclassified malignant mass samples are particularly similar to the benign masses surrounding them, and vice versa.
This indicates that COIN correctly categorizes breast masses in most cases, apart from extremely hard examples.

C. Comparison to the state-of-the-art
Finally, to further explore the effectiveness of COIN, we compare the proposed algorithm with state-of-the-art methods in Tab. II, where the results of other works are taken from their original papers. COIN outperforms the state-of-the-art with a mean accuracy of 93.4% and an AUC score of 0.95. Compared with the second-best algorithm [17], COIN's AUC score is significantly higher (by 3%), with experiments conducted on the whole dataset and without any pre-processing, post-processing or transfer learning.

V. CONCLUSIONS
In this paper, we have proposed a novel deep framework, COIN, to address two crucial challenges of the BMD problem, i.e. data scarcity and data entanglement. COIN integrates adversarial augmentation and contrastive learning. In particular, the proposed adversarial augmentation not only enlarges the dataset but also enhances the discriminability of the diagnosis model. The proposed contrastive learning further improves the model's discriminative ability by exploiting the manifold geometry of the data, which is valuable for mammography lesions of high resemblance. Experiments have shown that COIN surpasses state-of-the-art algorithms for the BMD problem.
REFERENCES

[1] P. Boyle, B. Levin et al., World Cancer Report 2008. IARC Press, International Agency for Research on Cancer, 2008.
[2] A. Oliver, J. Freixenet, J. Marti, E. Perez, J. Pont, E. R. Denton, and R. Zwiggelaar, "A review of automatic mass detection and segmentation in mammographic images," Medical Image Analysis, vol. 14, no. 2, pp. 87–110, 2010.
[3] C. DeSantis, J. Ma, L. Bryan, and A. Jemal, "Breast cancer statistics, 2013," CA: A Cancer Journal for Clinicians, vol. 64, no. 1, pp. 52–62, 2014.
[4] R. Blanks, M. Wallis, and S. Moss, "A comparison of cancer detection rates achieved by breast cancer screening programmes by number of readers, for one and two view mammography: results from the UK National Health Service Breast Screening Programme," Journal of Medical Screening, vol. 5, no. 4, pp. 195–201, 1998.
[5] J. Brown, S. Bryan, and R. Warren, "Mammography screening: an incremental cost effectiveness analysis of double versus single reading of mammograms," BMJ, vol. 312, no. 7034, pp. 809–812, 1996.
[6] D. Shen, G. Wu, and H.-I. Suk, "Deep learning in medical image analysis," Annual Review of Biomedical Engineering, vol. 19, pp. 221–248, 2017.
[7] S. M. McKinney, M. Sieniek, V. Godbole, J. Godwin, N. Antropova, H. Ashrafian, T. Back, M. Chesus, G. C. Corrado, A. Darzi et al., "International evaluation of an AI system for breast cancer screening," Nature, vol. 577, no. 7788, pp. 89–94, 2020.
[8] L. v. d. Maaten and G. Hinton, "Visualizing data using t-SNE," Journal of Machine Learning Research, vol. 9, no. Nov, pp. 2579–2605, 2008.
[9] Z. Jiao, X. Gao, Y. Wang, and J. Li, "A parasitic metric learning net for breast mass classification based on mammography," Pattern Recognition, vol. 75, pp. 292–301, 2018.
[10] C. Varela, S. Timp, and N. Karssemeijer, "Use of border information in the classification of mammographic masses," Physics in Medicine & Biology, vol. 51, no. 2, p. 425, 2006.
[11] T. Kooi, G. Litjens, B. Van Ginneken, A. Gubern-Mérida, C. I. Sánchez, R. Mann, A. den Heeten, and N. Karssemeijer, "Large scale deep learning for computer aided detection of mammographic lesions," Medical Image Analysis, vol. 35, pp. 303–312, 2017.
[12] A. Jalalian, S. B. Mashohor, H. R. Mahmud, M. I. B. Saripan, A. R. B. Ramli, and B. Karasfi, "Computer-aided detection/diagnosis of breast cancer in mammography and ultrasound: a review," Clinical Imaging, vol. 37, no. 3, pp. 420–426, 2013.
[13] A. Malich, D. R. Fischer, and J. Böttcher, "CAD for mammography: the technique, results, current role and further developments," European Radiology, vol. 16, no. 7, p. 1449, 2006.
[14] Y. LeCun, Y. Bengio, and G. Hinton, "Deep learning," Nature, vol. 521, no. 7553, p. 436, 2015.
[15] D. Chen, J. Lv, and Z. Yi, "Unsupervised multi-manifold clustering by learning deep representation," in Workshops at the 31st AAAI Conference on Artificial Intelligence (AAAI), 2017, pp. 385–391.
[16] G. Carneiro, J. Nascimento, and A. P. Bradley, "Automated analysis of unregistered multi-view mammograms with deep learning," IEEE Transactions on Medical Imaging, vol. 36, no. 11, pp. 2355–2365, 2017.
[17] S. Shams, R. Platania, J. Zhang, J. Kim, and S.-J. Park, "Deep generative breast cancer screening and diagnosis," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2018, pp. 859–867.
[18] W. Zhu, Q. Lou, Y. S. Vang, and X. Xie, "Deep multi-instance networks with sparse label assignment for whole mammogram classification," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 603–611.
[19] J. Arevalo, F. A. González, R. Ramos-Pollán, J. L. Oliveira, and M. A. G. Lopez, "Representation learning for mammography mass lesion classification with convolutional neural networks," Computer Methods and Programs in Biomedicine, vol. 127, pp. 248–257, 2016.
[20] T. Kooi, B. van Ginneken, N. Karssemeijer, and A. den Heeten, "Discriminating solitary cysts from soft tissue lesions in mammography using a pretrained deep convolutional neural network," Medical Physics, vol. 44, no. 3, pp. 1017–1027, 2017.
[21] W. Lotter, G. Sorensen, and D. Cox, "A multi-scale CNN and curriculum learning strategy for mammogram classification," in Deep Learning in Medical Image Analysis and Multimodal Learning for Clinical Decision Support. Springer, 2017, pp. 169–177.
[22] D. Chen and M. E. Davies, "Deep decomposition learning for inverse imaging problems," in Proceedings of the European Conference on Computer Vision (ECCV), 2020.
[23] D. Chen, M. E. Davies, and M. Golbabaee, "Compressive MR fingerprinting reconstruction with neural proximal gradient iterations," in International Conference on Medical Image Computing and Computer-Assisted Intervention (MICCAI), 2020.
[24] H. Li, D. Chen, W. H. Nailon, M. E. Davies, and D. Laurenson, "Improved breast mass segmentation in mammograms with conditional residual U-Net," in Image Analysis for Moving Organ, Breast, and Thoracic Images. Springer, 2018, pp. 81–89.
[25] N. Dhungel, G. Carneiro, and A. P. Bradley, "The automated learning of deep features for breast mass classification from mammograms," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 106–114.
[26] D. Chen, J. C. Lv, and Z. Yi, "A local non-negative pursuit method for intrinsic manifold structure preservation," in AAAI, 2014, pp. 1745–1751.
[27] H. Li, D. Chen, W. H. Nailon, M. E. Davies, and D. I. Laurenson, "Signed Laplacian deep learning with adversarial augmentation for improved mammography diagnosis," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2019, pp. 486–494.
[28] H. Li, D. Chen, W. H. Nailon, M. E. Davies, and D. Laurenson, "A deep dual-path network for improved mammogram image processing," International Conference on Acoustics, Speech and Signal Processing, 2019.
[29] S. C. Wong, A. Gatt, V. Stamatescu, and M. D. McDonnell, "Understanding data augmentation for classification: when to warp?" IEEE, 2016, pp. 1–6.
[30] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[31] E. Wu, K. Wu, D. Cox, and W. Lotter, "Conditional infilling GANs for data augmentation in mammogram classification," in Image Analysis for Moving Organ, Breast, and Thoracic Images. Springer, 2018, pp. 98–106.
[32] Y. Yu, W.-Y. Qu, N. Li, and Z. Guo, "Open-category classification by adversarial sample generation," International Joint Conference on Artificial Intelligence, 2017.
[33] Q. Xie, M.-T. Luong, E. Hovy, and Q. V. Le, "Self-training with noisy student improves ImageNet classification," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 10687–10698.
[34] R. D. Hjelm, A. Fedorov, S. Lavoie-Marchildon, K. Grewal, P. Bachman, A. Trischler, and Y. Bengio, "Learning deep representations by mutual information estimation and maximization," arXiv preprint arXiv:1808.06670, 2018.
[35] O. J. Hénaff, A. Srinivas, J. De Fauw, A. Razavi, C. Doersch, S. Eslami, and A. v. d. Oord, "Data-efficient image recognition with contrastive predictive coding," arXiv preprint arXiv:1905.09272, 2019.
[36] K. He, H. Fan, Y. Wu, S. Xie, and R. Girshick, "Momentum contrast for unsupervised visual representation learning," in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2020, pp. 9729–9738.
[37] P. Khosla, P. Teterwak, C. Wang, A. Sarna, Y. Tian, P. Isola, A. Maschinot, C. Liu, and D. Krishnan, "Supervised contrastive learning," arXiv preprint arXiv:2004.11362, 2020.
[38] H. S. Seung and D. D. Lee, "The manifold ways of perception," Science, vol. 290, no. 5500, pp. 2268–2269, 2000.
[39] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[40] A. Antoniou, A. Storkey, and H. Edwards, "Data augmentation generative adversarial networks," arXiv preprint arXiv:1711.04340, 2017.
[41] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[42] C. Li and M. Wand, "Precomputed real-time texture synthesis with Markovian generative adversarial networks," in European Conference on Computer Vision. Springer, 2016, pp. 702–716.
[43] V. Nair and G. E. Hinton, "Rectified linear units improve restricted Boltzmann machines," in Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 807–814.
[44] Y. Yu, H. Qian, and Y.-Q. Hu, "Derivative-free optimization via classification," in Thirtieth AAAI Conference on Artificial Intelligence, 2016.
[45] R. Hadsell, S. Chopra, and Y. LeCun, "Dimensionality reduction by learning an invariant mapping," vol. 2. IEEE, 2006, pp. 1735–1742.
[46] Z. Wu, Y. Xiong, S. X. Yu, and D. Lin, "Unsupervised feature learning via non-parametric instance discrimination," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 3733–3742.
[47] F. Schroff, D. Kalenichenko, and J. Philbin, "FaceNet: A unified embedding for face recognition and clustering," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 815–823.
[48] W. Ge, "Deep metric learning with hierarchical triplet loss," in Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 269–285.
[49] D. Chen, J. Lv, and M. E. Davies, "Learning discriminative representation with signed Laplacian restricted Boltzmann machine," arXiv preprint arXiv:1808.09389, 2018.
[50] F. Chollet, "Xception: Deep learning with depthwise separable convolutions," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1251–1258.
[51] I. C. Moreira, I. Amaral, I. Domingues, A. Cardoso, M. J. Cardoso, and J. S. Cardoso, "INbreast: toward a full-field digital mammographic database," Academic Radiology, vol. 19, no. 2, pp. 236–248, 2012.
[52] I. Domingues, E. Sales, J. Cardoso, and W. Pereira, "INbreast-database masses characterization,"