Unsupervised Task Design to Meta-Train Medical Image Classifiers⋆

Gabriel Maicas†, Cuong Nguyen†, Farbod Motlagh†, Jacinto C. Nascimento‡, Gustavo Carneiro†

† Australian Institute for Machine Learning, The University of Adelaide
‡ Institute for Systems and Robotics, Instituto Superior Tecnico, Portugal

⋆ Supported by the Australian Research Council through grant DP180103232.
Abstract.
Meta-training has been empirically demonstrated to be the most effective pre-training method for few-shot learning of medical image classifiers (i.e., classifiers modelled with small training sets). However, the effectiveness of meta-training relies on the availability of a reasonable number of hand-designed classification tasks, which are costly to obtain, and consequently rarely available. In this paper, we propose a new method to unsupervisedly design a large number of classification tasks to meta-train medical image classifiers. We evaluate our method on a breast dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) data set that has been used to benchmark few-shot training methods of medical image classifiers. Our results show that the proposed unsupervised task design to meta-train medical image classifiers builds a pre-trained model that, after fine-tuning, produces better classification results than other unsupervised and supervised pre-training methods, and competitive results with respect to meta-training that relies on hand-designed classification tasks.
Keywords: meta-training, unsupervised learning, unsupervised task design, breast image analysis, magnetic resonance imaging, few-shot, pre-training, clustering.
1 Introduction

The accuracy and robustness of deep learning based medical image classifiers is generally positively correlated with the size of the annotated training set used during the modelling process [1]. However, large annotated training sets are expensive and not readily available for some medical image analysis applications, such as breast screening from DCE-MRI [2]. Therefore, training medical image classifiers with small annotated training sets has become a highly investigated topic, particularly after the advent of deep learning [1].

The most competitive medical image classifiers are currently based on convolutional neural networks (CNNs) [1], which need large training sets to be properly modelled. To reduce the need for such large annotated sets, pre-training approaches have been explored in medical image analysis, where the most relevant for our paper are: 1) supervised pre-training using independent data sets [5], where the model is pre-trained by solving a classification problem on a different data set; 2) unsupervised pre-training using clustering [3], where the model is pre-trained by performing clustering without any knowledge about the ground-truth labels; and 3) unsupervised pre-training using input reconstruction [6], where the model is pre-trained by reconstructing the input images of the training set. Arguably, the main issue with these pre-training methods is that their objective functions are irrelevant to the medical image classifier being developed downstream. Alternatively, the need for pre-training methods can be alleviated with other types of training methods, such as multiple instance learning (MIL) [7] or multi-task learning [8], but both still need large training sets. More recently, a pre-trained model produced by supervised meta-training (i.e., a meta-training process that depends on hand-designed classification tasks) showed superior performance compared to the previously described pre-training methods [4]. Nevertheless, these promising meta-training results are counterbalanced by the unappealing need for an expensive hand-designing process to produce the classification tasks [4]. Given the high cost of this process, the availability of a large number of hand-designed classification tasks is rare, which hampers the exploration of meta-training for medical image classifiers.

In this paper, we propose a new method to unsupervisedly produce a large number of classification tasks to meta-train medical image classifiers. To this end, we use deep clustering [3] to automatically build image clusters that can be grouped in different ways to enable the design of multiple classification tasks employed in the meta-training process – see Fig. 1. We evaluate our method on the breast screening classification task from a breast DCE-MRI data set that has been used to benchmark few-shot training algorithms of medical image classifiers [4]. Results show that our proposed approach produces classification results that are significantly better than other unsupervised and supervised pre-training methods, and competitive with supervised meta-training.

Fig. 1: Unsupervised task design to meta-train medical image classifiers. Deep clustering [3] produces a set of clusters that are used in the unsupervised design of classification tasks. These tasks are used in a meta-training process to produce a pre-trained model that can be fine-tuned to new classification tasks using small labelled training sets, represented in this paper by the breast screening problem from DCE-MRI [4].
2 Related Work

DCE-MRI is a recommended imaging modality in breast screening programs for patients at high risk [9]. However, DCE-MRI interpretation is time-consuming and prone to high inter-observer variability [10]. Thus, computer-aided diagnosis (CAD) systems are being developed to assist radiologists in increasing their diagnosis sensitivity [11] and specificity [12], and in reducing analysis time. However, the development of CAD systems for breast DCE-MRI is challenging due in part to the small size of the annotated data sets available for training.

Meta-training has been shown to be an effective strategy to improve the learning of classifiers using relatively small training sets [13]. For instance, Maicas et al. [4] proposed the use of hand-designed breast classification tasks to meta-train a model that was then fine-tuned to solve the breast screening task. Results showed that this method improves over other strategies for training classifiers from small data sets, such as MIL [7] and multi-task learning [8]. However, the method proposed in [4] relies on costly hand-designed classification tasks.

Similarly to our paper, Hsu et al. [14] proposed an unsupervised method to design computer vision classification tasks for meta-training. Results showed that this approach produced worse classification performance than meta-training modelled with hand-designed tasks (i.e., supervised meta-training). We believe that the reason behind this drop in performance lies in the large number of hand-designed tasks already available for supervised meta-training in computer vision applications [14], enabling a good classification performance baseline. The difficulty of obtaining a large number of hand-designed tasks for medical image classification problems means that the number of these hand-designed tasks will be small, which may result in a relatively low classification performance baseline. We hypothesize that our proposed method, which unsupervisedly designs a large number of classification tasks to meta-train a medical image classifier, can achieve a classification performance that is at least comparable to supervised meta-training [4] trained with a small number of hand-designed tasks. Our proposed method has the advantage that it does not rely on costly hand-designed tasks.
3 Methods

3.1 Data Set

The data set is represented by $\mathcal{D} = \{(v_i, t_i, b_i, y_i)\}_{i=1}^{|\mathcal{D}|}$, where $v : \Omega \to \mathbb{R}$ corresponds to the first DCE-MRI subtraction volume ($\Omega$ denotes the volume lattice) [15], $t : \Omega \to \mathbb{R}$ represents the T1-weighted MRI volume, used only to separate the left and right breast regions of the volume, $b \in \{\text{left}, \text{right}\}$ indicates the left or right breast, and $y \in \mathcal{Y} = \{0, 1\}$ denotes the classification label: no malignant findings, or malignant findings, respectively.

3.2 Unsupervised Task Design

The proposed unsupervised task design method builds several binary classification tasks from image groups formed by deep clustering [3]. The training of deep clustering alternates the optimisation of two objective functions [3]. We denote by $f_\theta(v) \in \mathbb{R}^D$ the $\theta$-parameterised model that produces the unsupervised learning features, and by $g_\omega(f_\theta(v)) \in \{0,1\}^K$ the $\omega$-parameterised classifier, placed on top of $f_\theta(.)$, that produces a pseudo-label representing one of the $K$ unknown classes. The first objective function is the cross-entropy loss $\ell(.)$ with respect to the pseudo-labels $\{\tilde{y}_i\}_{i=1}^{|\mathcal{D}|}$, with $\tilde{y} \in \tilde{\mathcal{Y}} = \{0,1\}^K$:

$$\min_{\theta, \omega} \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \ell\big(g_\omega(f_\theta(v_i)), \tilde{y}_i\big), \qquad (1)$$

which is used to estimate the optimal $\theta^*$ and $\omega^*$. The second objective function finds the $K$ centroids, denoted by $C \in \mathbb{R}^{D \times K}$, and the pseudo-labels $\tilde{y}$ with

$$\min_{C} \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \min_{\tilde{y}_i} \big\| f_\theta(v_i) - C \tilde{y}_i \big\|_2^2, \qquad (2)$$

where $\tilde{y}_i$ is a $K$-dimensional one-hot vector.

Each step of the optimisation above generates new values for the model parameters, centroids and pseudo-labels. We extend deep clustering [3] with a model selection process based on maximising the Silhouette coefficient, which measures clustering quality [16], with

$$\kappa = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \frac{b(i) - a(i)}{\max(a(i), b(i))}, \qquad (3)$$

where $a(i)$ represents the average $\ell_2$ distance between $f_\theta(v_i)$ and all points $f_\theta(v_j)$ with $i \neq j$ and $\tilde{y}_i = \tilde{y}_j$; and $b(i)$ denotes the smallest average $\ell_2$ distance between $f_\theta(v_i)$ and the points $f_\theta(v_j)$ with $i \neq j$ and $\tilde{y}_i \neq \tilde{y}_j$.

The unsupervised design of classification tasks is based on the formation of $L$ binary classification problems derived from the pseudo-labels obtained from (2). Each of these $L$ binary classification problems is built by randomly selecting two nonempty and disjoint subsets $\mathcal{K}_l^{(0)}$ and $\mathcal{K}_l^{(1)}$ from the pseudo-label set $\{1, 2, \ldots, K\}$ and labelling their corresponding data points as class 0 and class 1, respectively. Note that the number of classification tasks for a given $K$ is

$$L = \sum_{i=1}^{K-1} \sum_{k=1}^{\min(i, K-i)} \binom{K}{i} \binom{K-i}{k} \left(\frac{1}{2}\right)^{\delta(i-k)},$$

where $\binom{A}{B}$ denotes the binomial coefficient and $\delta(.)$ represents the Dirac delta function (the factor $(1/2)^{\delta(i-k)}$ avoids double-counting label-symmetric pairs of equal-sized subsets).
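To make the task-design procedure concrete, the sketch below clusters pre-computed deep features, selects $K$ with the Silhouette coefficient in (3), samples one binary task by drawing two nonempty, disjoint subsets of the pseudo-label set, and counts the available tasks $L$. It is a minimal illustration under stated assumptions: scikit-learn's KMeans stands in for the deep-clustering assignment step of (2), the features are synthetic, and all function names are ours rather than from any released code.

```python
# Minimal sketch of the unsupervised task design (Sec. 3.2), assuming the deep
# features f_theta(v) are already computed; KMeans stands in for Eq. (2).
import random
from math import comb

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score


def cluster_and_select_k(features, candidate_ks=(3, 4, 5)):
    """Cluster the features for each candidate K and keep the clustering with
    the highest Silhouette coefficient (Eq. 3)."""
    best = None
    for k in candidate_ks:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(features)
        kappa = silhouette_score(features, labels)
        if best is None or kappa > best[2]:
            best = (k, labels, kappa)
    return best  # (K, pseudo-labels, silhouette)


def sample_binary_task(pseudo_labels, k, rng):
    """Build one binary task: draw two nonempty, disjoint subsets of the
    pseudo-label set and relabel the corresponding images as class 0 / 1."""
    clusters = list(range(k))
    rng.shuffle(clusters)
    i = rng.randint(1, k - 1)             # |K_l^(0)| >= 1
    j = rng.randint(1, k - i)             # |K_l^(1)| >= 1, disjoint from K_l^(0)
    class0, class1 = set(clusters[:i]), set(clusters[i:i + j])
    keep = [n for n, c in enumerate(pseudo_labels) if c in class0 | class1]
    y = [int(pseudo_labels[n] in class1) for n in keep]
    return np.array(keep), np.array(y)    # image indices, binary task labels


def count_tasks(k):
    """Closed-form task count L from Sec. 3.2: unordered pairs of nonempty,
    disjoint subsets of the K cluster labels."""
    total = 0
    for i in range(1, k):
        for j in range(1, min(i, k - i) + 1):
            pairs = comb(k, i) * comb(k - i, j)
            total += pairs // 2 if i == j else pairs  # halve equal-size pairs
    return total


rng = random.Random(0)
feats = np.random.RandomState(0).randn(200, 64)       # toy stand-in features
k, pseudo, kappa = cluster_and_select_k(feats)
idx, labels = sample_binary_task(pseudo, k, rng)
print(k, round(kappa, 3), count_tasks(k), labels[:10])
```

Under this counting, $K = 3, 4, 5$ yield $L = 6$, $25$ and $90$ candidate tasks, respectively, illustrating how quickly the task pool grows with the number of clusters.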
3.3 Meta-Training

Meta-training estimates the parameters of a meta-learner so that it can be used as a pre-trained model that is efficiently fine-tuned to previously unseen classification tasks using small annotated training sets [13]. The algorithm assumes that there exists a task distribution $\mathcal{T}$ from which each classification task $\mathcal{T}_l$ is drawn, where each task comprises a training set $\{v_i^{(l,t)}, \tilde{y}_i^{(l,t)}\}_{i=1}^{M}$ and a testing set $\{v_i^{(l,v)}, \tilde{y}_i^{(l,v)}\}_{i=1}^{N}$, with $M \ll N$ and $M + N = |\mathcal{T}_l|$. Meta-training iteratively samples $T$ tasks from $\mathcal{T}$, and re-trains a multi-target classifier for those tasks using the training and testing sets defined above.

We use MAML meta-training [17], which consists of a Bayesian hierarchical model, where $\psi$ denotes the classifier meta-parameter, and $\phi_l$ represents the parameter for task $\mathcal{T}_l$. The meta-training objective function is defined by

$$\max_{\psi} \log p\big(\mathcal{Y}^{(v)}_{l=1..T} \,\big|\, \mathcal{Y}^{(t)}_{l=1..T}, \mathcal{V}^{(v)}_{l=1..T}, \mathcal{V}^{(t)}_{l=1..T}, \psi\big), \qquad (4)$$

where $T$ is the number of tasks per meta-training iteration, $\mathcal{Y}^{(v)}_l = \{\tilde{y}_i^{(l,v)}\}_{i=1}^{N}$, $\mathcal{Y}^{(t)}_l = \{\tilde{y}_i^{(l,t)}\}_{i=1}^{M}$, $\mathcal{V}^{(v)}_l = \{v_i^{(l,v)}\}_{i=1}^{N}$, and $\mathcal{V}^{(t)}_l = \{v_i^{(l,t)}\}_{i=1}^{M}$. In (4), we have

$$\log p\big(\mathcal{Y}^{(v)}_{l=1..T} \,\big|\, \mathcal{Y}^{(t)}_{l=1..T}, \mathcal{V}^{(v)}_{l=1..T}, \mathcal{V}^{(t)}_{l=1..T}, \psi\big) \geq \sum_{l=1}^{T} \mathbb{E}_{p(\phi_l | \mathcal{Y}^{(t)}_l, \mathcal{V}^{(t)}_l, \psi)}\Big[\log p\big(\mathcal{Y}^{(v)}_l \,\big|\, \mathcal{V}^{(v)}_l, \phi_l\big)\Big], \qquad (5)$$

where the lower bound is derived from Jensen's inequality [18]. Therefore, the maximisation in (4) is approximated by the lower-bound maximisation in (5), where the posterior $p(\phi_l | \mathcal{Y}^{(t)}_l, \mathcal{V}^{(t)}_l, \psi)$ is approximated with a Dirac delta function at a locally optimal task-specific model parameter $\phi^*_l$, i.e., $p(\phi_l | \mathcal{Y}^{(t)}_l, \mathcal{V}^{(t)}_l, \psi) = \delta(\phi_l - \phi^*_l)$. The locally optimal model parameter $\phi^*_l$ is obtained with truncated gradient descent initialised at the meta-parameters $\psi$:

$$\phi^*_l = \psi - \alpha \nabla_{\phi_l} \Big[ -\log p\big(\mathcal{Y}^{(t)}_l \,\big|\, \mathcal{V}^{(t)}_l, \phi_l\big) \Big], \qquad (6)$$

where $\alpha$ is the learning rate, and the truncated gradient descent consists of a single step of (6). Maximising the lower bound of the log-likelihood in (5) corresponds to the MAML algorithm in [13], which produces a pre-trained model that can quickly learn new tasks drawn from $\mathcal{T}$.
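The single-step truncated adaptation in (6) followed by the meta-update of $\psi$ can be written compactly in modern autodiff frameworks. The PyTorch sketch below illustrates only this inner/outer loop: the linear model, the synthetic task sampler and the hyper-parameter values are placeholder assumptions, not the paper's 3D DenseNet, task pool or settings.

```python
# Minimal MAML sketch for Eqs. (4)-(6): one inner gradient step per task
# (Eq. 6), then a meta-update of psi over T sampled tasks (Eqs. 4-5).
import torch
import torch.nn.functional as F

torch.manual_seed(0)
# psi: meta-parameters of a tiny linear classifier (placeholder model).
meta_params = [(0.1 * torch.randn(2, 16)).requires_grad_(),
               torch.zeros(2, requires_grad=True)]

def forward(params, x):
    w, b = params
    return x @ w.t() + b                   # logits for a binary task

def sample_task(m=4, n=12):
    # Placeholder for a task T_l drawn from the unsupervised task pool:
    # random features with a synthetic binary labelling.
    x = torch.randn(m + n, 16)
    y = (x[:, 0] > 0).long()
    return (x[:m], y[:m]), (x[m:], y[m:])  # training (M) / testing (N) sets

alpha, T = 0.1, 4                          # inner LR, tasks per meta-iteration
meta_opt = torch.optim.Adam(meta_params, lr=1e-2)

for meta_iter in range(100):
    meta_opt.zero_grad()
    for _ in range(T):
        (xt, yt), (xv, yv) = sample_task()
        # Eq. (6): phi*_l = psi - alpha * grad of the task training loss.
        inner_loss = F.cross_entropy(forward(meta_params, xt), yt)
        grads = torch.autograd.grad(inner_loss, meta_params, create_graph=True)
        phi = [p - alpha * g for p, g in zip(meta_params, grads)]
        # Eq. (5): maximise the testing-set log-likelihood under phi*_l.
        outer_loss = F.cross_entropy(forward(phi, xv), yv) / T
        outer_loss.backward()              # accumulates gradients w.r.t. psi
    meta_opt.step()
```

`create_graph=True` keeps the inner step differentiable, so the meta-update also captures the second-order term; dropping it recovers first-order MAML.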
4 Experiments

We evaluate our proposed method on a breast DCE-MRI data set [2] (formally defined in Sec. 3.1), which has previously been used to evaluate few-shot training methods [4]. To allow a fair comparison with previous papers, we split the data set in a patient-wise manner into the same training, validation and testing sets, containing 45, 13, and 59 patients, respectively. We use the T1-weighted MRI to automatically extract the left and right breast regions from the first DCE-MRI subtraction volume [4]. Each breast region is resized to a fixed-size volume [4]. For the breast screening problem, only breasts that contain malignant finding(s) are considered positive, while breasts with only benign findings or no findings are considered negative. There are 30, 9, and 38 positive and 60, 17, and 80 negative breasts in the training, validation and testing sets, respectively.

The model $f_\theta(v)$ that unsupervisedly produces the volume features is a 3D DenseNet [19] composed of five dense blocks, each containing two dense layers. The features are the input to the deep clustering algorithm, explained in Sec. 3.2, with the number of clusters $K \in \{3, 4, 5\}$. The model that is meta-trained, and then fine-tuned, has the same architecture as $f_\theta(.)$. During meta-training, we use a fixed meta learning rate $\alpha$ in (6). At each meta-iteration, a meta-batch of $T = 4$ classification tasks is sampled according to a random or a curriculum learning strategy [4]. The meta-trained model is fine-tuned to the breast screening task using the entire training set, where model selection is performed using the validation set and results are reported on the test set.

The evaluation of the breast screening problem is based on the area under the ROC curve (AUC). We also measure the standard error, using an estimate based on the Wilcoxon test [20], computed on the testing set. In this evaluation, we study the type of task sampling for meta-training, i.e., random or curriculum learning [4], and the influence of the number of clusters $K$ in (1) used to build the tasks. We compare our method (U-MT) with the previously proposed supervised meta-training for the cases where the breast screening task is included (S-MT (S)) and not included (S-MT (NS)) in the meta-training process. We also compare our method with: a) a DenseNet trained from scratch on the breast screening task; b) the DenseNet from (a) fine-tuned with MIL [7]; c) a DenseNet trained with multi-tasking (using hand-designed tasks) [4]; d) a DenseNet pre-trained as a variational autoencoder (i.e., unsupervised training) and fine-tuned for the breast screening task; and e) a DenseNet pre-trained with deep clustering (i.e., unsupervised training) and fine-tuned for the breast screening task. All DenseNet models of these competing methods have the same architecture as the meta-trained model described above. The rationale for baselines (d) and (e) is to evaluate the effect of pre-training based on a reconstruction or a clustering scheme; for this purpose, we present results based on nearest neighbour classification and on the fine-tuned classification model.
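For reference, the sketch below computes the AUC and a Wilcoxon-statistic-based standard error on a held-out test set. We assume the estimator cited as [20] takes the familiar Hanley–McNeil form; the toy scores are illustrative, with only the class balance (38 positive / 80 negative, as in our test set) taken from the paper.

```python
# AUC and a Wilcoxon/Hanley-McNeil standard error on a held-out test set.
import numpy as np
from sklearn.metrics import roc_auc_score

def auc_with_se(y_true, y_score):
    a = roc_auc_score(y_true, y_score)
    n_pos = int(np.sum(y_true == 1))
    n_neg = int(np.sum(y_true == 0))
    q1 = a / (2.0 - a)             # P(two random positives both outrank a negative)
    q2 = 2.0 * a ** 2 / (1.0 + a)  # P(a positive outranks two random negatives)
    var = (a * (1.0 - a) + (n_pos - 1) * (q1 - a ** 2)
           + (n_neg - 1) * (q2 - a ** 2)) / (n_pos * n_neg)
    return a, float(np.sqrt(var))

rng = np.random.default_rng(0)
y = np.concatenate([np.ones(38), np.zeros(80)])           # test-set class balance
s = np.concatenate([rng.normal(1.0, 1.0, 38), rng.normal(0.0, 1.0, 80)])
print("AUC = %.2f +/- %.2f" % auc_with_se(y, s))
```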
We show the AUC results (± standard error) for the breast screening baselines in Tab. 1. Table 2 presents the results of meta-training, as a function of $K \in \{3, 4, 5\}$, with supervised and unsupervised task design, using random and curriculum learning task sampling. Figure 2 presents examples of breast screening classification.

Table 1: AUC results (± standard error) for the breast screening baselines.

Training Method                                          | AUC
From Scratch [19]                                        | . ± .
MIL-based fine-tuning [7]                                | . ± .
Multi-Task [8]                                           | . ± .
Variational Autoencoder + Nearest Neighbour              | . ± .
Variational Autoencoder + Fine-Tune on breast screening  | . ± .
Deep Clustering + Nearest Neighbour                      | . ± .
Deep Clustering + Fine-Tune on breast screening          | . ± .

We measure the statistical significance of the difference in performance between our best performing approaches (Random with K = 5 and Curriculum with K = 5) and all baseline methods, obtaining a p-value p ≤ 0.05 in all cases (unpaired two-tailed t-test). Also, comparing our newly proposed U-MT (Random with K = 5) with S-MT (S) (Curriculum with K = 3) [4], we obtain a p-value p > 0.05.

Table 2: AUC for the breast screening task for our proposed method (U-MT) as a function of the number of image clusters K and the task sampling method (random and curriculum). We also present the results of supervised meta-training [4] (S-MT) for the cases where the breast screening task is included (labelled S) and not included (labelled NS) in the meta-training tasks. N/A indicates that the experiment is not feasible due to the lack of extra ground-truth labels.

              | Random                           | Curriculum
              | K = 3    | K = 4    | K = 5      | K = 3    | K = 4    | K = 5
S-MT [4] (S)  | . ± .    | N/A      | N/A        | . ± .    | N/A      | N/A
S-MT [4] (NS) | . ± .    | N/A      | N/A        | . ± .    | N/A      | N/A
U-MT (Ours)   | . ± .05  | 0. ± .04 | 0. ± .04   | 0. ± .04 | 0. ± .04 | 0. ± .

Fig. 2: Examples of breast screening diagnosis produced by our approach. Image (2a) shows the correct positive diagnosis of a breast containing a malignant tumour. Image (2b) shows the correct negative diagnosis of a breast with a benign tumour. Image (2c) shows the incorrect positive classification of a breast containing no tumours. Image (2d) shows the correct negative diagnosis of a breast with a benign tumour.
5 Conclusion

We have presented a new method that unsupervisedly designs classification tasks to meta-train medical image classifiers. Our method significantly outperforms several baselines consisting of traditional pre-training methods based on a variational autoencoder, deep clustering, MIL, and multi-task learning (see Tab. 1). Our method also produces results comparable to the state of the art set by meta-training using hand-designed tasks [4] (see Tab. 2). However, instead of using manually defined labels during meta-training, we unsupervisedly build classification tasks, which allows us to build a larger set of tasks than the hand-designed ones. Also from Tab. 2, we notice that a larger number of tasks, which increases with the number of clusters (Sec. 3.2), generally implies better AUC results. This confirms our initial hypothesis that, differently from computer vision problems, automatically building tasks is of great importance for medical image classification problems, where image labels that allow a large number of tasks are costly to obtain. We also observe that sampling tasks according to curriculum learning provides a good improvement in accuracy compared to random task sampling for a small number of clusters (K = 3), but not for a larger number of tasks (K = 5). We hypothesize that meta-training with curriculum learning sampling needs a larger number of meta-iterations to learn a curriculum that is better than random task sampling. Given the large number of tasks for K ∈ {4, 5}, the meta-training process converged before the curriculum learning algorithm did, a point that deserves further research.

References
1. Litjens, G., Kooi, T., Bejnordi, B.E., Setio, A.A.A., Ciompi, F., Ghafoorian, M., van der Laak, J.A., van Ginneken, B., Sánchez, C.I.: A survey on deep learning in medical image analysis. Medical Image Analysis (2017)
2. McClymont, D., Mehnert, A., Trakic, A., Kennedy, D., Crozier, S.: Fully automatic lesion segmentation in breast MRI using mean-shift and graph-cuts on a region adjacency graph. JMRI (2014)
3. Caron, M., Bojanowski, P., Joulin, A., Douze, M.: Deep clustering for unsupervised learning of visual features. In: ECCV (2018)
4. Maicas, G., Bradley, A.P., Nascimento, J.C., Reid, I., Carneiro, G.: Training medical image analysis systems like radiologists. In: MICCAI (2018)
5. Bar, Y., Diamant, I., Wolf, L., Lieberman, S., Konen, E., Greenspan, H.: Chest pathology detection using deep learning with non-medical training. In: ISBI (2015)
6. Dong, L.F., Gan, Y.Z., Mao, X.L., Yang, Y.B., Shen, C.: Learning deep representations using convolutional auto-encoders with symmetric skip connections. In: ICASSP (2018)
7. Zhu, W., Lou, Q., Vang, Y.S., Xie, X.: Deep multi-instance networks with sparse label assignment for whole mammogram classification. In: MICCAI (2017)
8. Xue, W., Brahm, G., et al.: Full left ventricle quantification via deep multitask relationships learning. Medical Image Analysis (2018)
9. Mainiero, M.B., Moy, L., Baron, P., Didwania, A.D., Green, E.D., Heller, S.L., Holbrook, A.I., Lee, S.J., Lewin, A.A., Lourenco, A.P., et al.: ACR Appropriateness Criteria® breast cancer screening. JACR (2017)