Privacy Preserving Domain Adaptation for Semantic Segmentation of Medical Images
Serban Stan, University of Southern California, [email protected]
Mohammad Rostami, USC Information Sciences Institute, [email protected]
Abstract
Convolutional neural networks (CNNs) have led to significant improvements in tasks involving semantic segmentation of images. CNNs are vulnerable in the area of biomedical image segmentation because of the distributional gap between source and target domains with different data modalities, which leads to domain shift. Domain shift makes data annotation in new modalities necessary because models must be retrained from scratch. Unsupervised domain adaptation (UDA) has been proposed to adapt a model to new modalities using solely unlabeled target domain data. Common UDA algorithms require access to data points in the source domain, which may not be feasible in medical imaging due to privacy concerns. In this work, we develop an algorithm for UDA in a privacy-constrained setting, where the source domain data is inaccessible. Our idea is based on encoding the information from the source samples into a prototypical distribution that serves as an intermediate distribution for aligning the target domain distribution with the source domain distribution. We demonstrate the effectiveness of our algorithm by comparing it to state-of-the-art approaches on two medical image semantic segmentation datasets.
1. Introduction
Employing CNNs in semantic segmentation tasks has proven extremely helpful in various applications, including object tracking [64, 2, 71], self-driving cars [29, 20], and medical image analysis [28, 54, 1, 27]. This success, however, is conditioned on the availability of large manually annotated datasets to supervise the training of state-of-the-art (SOTA) network structures [60, 57]. This assumption is not always supported in practice, especially in fields such as medical image segmentation, where annotating data requires the input of experts, and privacy policies make sharing data for model training time-consuming, and at times impossible. A characteristic of data in the area of medical image segmentation is the existence of domain shift between different imaging modalities, which results from using imaging devices based on entirely different electromagnetic principles, e.g., CT vs. MRI. When a domain gap exists between the distributions of the training (source) and the testing (target) data, the performance of CNNs can degrade significantly. This makes continual data annotation necessary for model retraining. Domain shift is a major area of concern, as data annotation is a challenging procedure even for the simplest semantic segmentation tasks [36]. Annotating medical images is also expensive, as annotation can be performed only by physicians, who undergo years of training to obtain domain expertise. Unsupervised domain adaptation (UDA) is a learning setting aimed at reducing the domain gap without data annotation in the target domain. The goal is to adapt a source-trained model for improved generalization in the target domain using solely unannotated data [19, 63, 51, 66]. The core idea in UDA is to achieve knowledge transfer from the source domain to the target domain by aligning the latent features of the two domains in an embedding space.
This idea has been implemented either using adversarial learning [21, 12, 62, 4], by directly minimizing the distance between the distributions of the latent features with respect to a probability metric [8, 58, 34], or through a combination of the two [10, 53]. While existing UDA algorithms have been successful in reducing the domain gap, the vast majority of them require sharing data between the source and target domains for distribution alignment. This requirement limits the applicability of most existing works when sharing data may not be possible, e.g., sharing data is heavily regulated in healthcare domains due to privacy or security concerns. Until recently, there has been little exploration of UDA when access to the source domain is limited [31, 52, 47]. These works address UDA only for classification tasks, which limits their applicability to the problem of organ semantic segmentation [69].
Contribution: we develop a UDA algorithm for semantic segmentation when privacy and security are major areas of concern. Additionally, our work provides a method for semantic segmentation in a continual learning regime [49, 50]. Our approach is able to mitigate the domain gap without having direct access to the source data. We learn a prototypical distribution for the source domain, and transfer knowledge between the source and target domains by distribution alignment between the learned prototypical distribution and the latent features of the target domain. We validate our algorithm on two medical image segmentation datasets, and observe performance comparable to SOTA methods based on joint training.
2. Related Work
SOTA semantic segmentation algorithms use deep neural network architectures to exploit large annotated datasets [38, 43, 35, 18]. These approaches are based on training a CNN encoder using manually annotated segmentation maps to learn a latent embedding of the data. An up-sampling decoder combined with a classifier is then used to infer pixel-wise estimations of the true semantic labels. Performance of such methods is high when large amounts of annotated data are available for supervised training. However, these methods are not suitable when the goal is to transfer knowledge between different domains [51, 44]. Model adaptation to target domains has been explored in both semi-supervised and unsupervised settings. Semi-supervised approaches rely on the presence of a small number of annotated target data samples [42, 65]. For example, a weakly supervised signal on the target domain can be obtained using bounding boxes. However, manual data annotation of even a small number of images is still a considerable bottleneck in the area of medical imaging because only trained professionals can perform this task. For this reason, UDA algorithms are more appealing for healthcare applications. UDA approaches have explored two main strategies to reduce the domain gap. A large number of works rely on generative adversarial networks (GANs) [67, 25]. The core idea is to train a GAN such that data points of both domains can be mapped into a domain-agnostic embedding space [21]. To this end, a cross-domain discriminator network is trained to classify whether an input data point belongs to the source domain or the target domain. The discriminator network receives its input from a feature generator network. The discriminator network is fooled by the feature generator network, which is trained as an adversary to generate domain-agnostic features at its output.
As a result, a classifier network that is trained using the source domain annotated data at the output of the generator network will generalize well in the target domain [39, 13]. Areas of weakness for GANs are mode collapse and the need for fine-tuning the model hyper-parameters. A second approach for UDA is to align the distributions of the two domains directly in a latent embedding space [46]. A shared encoder is used to generate latent embeddings for both domains, and it is then trained such that the distance between the domain-specific distributions is minimized with respect to a probability distance metric at the output of the shared encoder [34, 16, 33, 48, 37]. Selecting the proper distance metric has been the major focus of research for these approaches. Optimal transport has been found particularly suitable for deep-learning-based UDA due to its suitability for gradient-based optimization [11]. Building upon our prior work [56], we benefit from the Sliced Wasserstein Distance (SWD) [34] variant of optimal transport. SWD has properties similar to optimal transport, but also has a closed-form solution that allows it to be computed efficiently. Both of the above-mentioned approaches have been found helpful in various medical semantic segmentation applications [22, 68, 5, 24]. However, both UDA strategies require direct access to the source domain data to compute their corresponding loss functions. To relax this requirement, UDA has recently been explored in a source-free setting in order to address applications in which the source domain is not directly accessible [31, 52]. Both Kundu et al. [31] and Saltori et al. [52] target image classification, and benefit from generative adversarial learning to generate pseudo-data points that are similar to the source domain data in the absence of actual source samples. While both approaches are suitable for classification problems, extending them to semantic segmentation of medical images is not trivial.
First, training models that can generate realistic medical images is considerably more challenging due to the importance of fine details. Second, one may argue that if the generated images are too similar to the real images, the privacy of human subjects in the training data may still be compromised. Our work is the first of its kind and is based on a dramatically different approach. We develop a source-free UDA algorithm that performs the distribution alignment of two domains in an embedding space by using an intermediate prototypical distribution.
3. Problem Formulation
Consider a source domain D_S = (X_S, Y_S) with annotated data and a target domain D_T = (X_T) with unannotated data that, despite having different input spaces X_S and X_T, e.g., due to using different medical imaging techniques, share the same segmentation map space Y, e.g., the same tissue/organ classes. Following the standard UDA pipeline, the goal is to learn a segmentation mapping function for the target domain by transferring knowledge from the source domain. To this end, we must learn a function f_θ(·): {X_S ∪ X_T} → {Y} with learnable parameters θ, e.g., a deep neural network, such that given an input image x*, the function returns a segmentation mask ŷ that best approximates the ground-truth segmentation mask y*.

Figure 1: Diagram of our proposed method. We first perform supervised training on source MR images. Using the source embeddings, we characterize a prototypical distribution via a GMM in the latent space. We then perform source-free adaptation by matching the embeddings of the target CT images to the learnt GMM distribution, and fine-tune the classifier on GMM samples. Finally, we verify the improved performance that our model gains from model adaptation.

Given the annotated training dataset {(x_i^s, y_i^s)}_{i=1}^N in the source domain, it is straightforward to train a segmentation model that generalizes well in the source domain by solving an empirical risk minimization (ERM) problem:

θ̂ = arg min_θ (1/N) Σ_{i=1}^N L(y_i^s, f_θ(x_i^s)),   (1)

where L is a proper loss function. For example, we can use the pixel-wise cross-entropy loss, defined as:

L_ce(y*, ŷ) = − Σ_{i=1}^W Σ_{j=1}^H Σ_{k=1}^K y*_{ijk} log ŷ_{ijk},

where K denotes the number of segmentation classes, and W, H represent the width and the height of the input images, respectively.
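As an illustration, the pixel-wise cross-entropy loss above can be computed directly from its definition. A minimal NumPy sketch follows; the function name, array shapes, and toy values are our own, not the paper's code:

```python
import numpy as np

def pixelwise_cross_entropy(y_true, y_pred, eps=1e-12):
    """Pixel-wise cross-entropy L_ce(y*, y-hat).

    y_true: (W, H, K) one-hot ground-truth segmentation mask.
    y_pred: (W, H, K) predicted class probabilities per pixel.
    Returns the scalar loss summed over all pixels and classes.
    """
    return -np.sum(y_true * np.log(y_pred + eps))

# Toy example: a 2x2 image with K = 3 classes.
y_true = np.zeros((2, 2, 3))
y_true[..., 0] = 1.0                    # every pixel belongs to class 0
y_pred = np.full((2, 2, 3), 1.0 / 3.0)  # uniform (maximally uncertain) prediction
loss = pixelwise_cross_entropy(y_true, y_pred)
# loss ≈ 4 * ln(3): each of the 4 pixels contributes -ln(1/3)
```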
Each pixel label y*_{ij} is represented as a one-hot vector of size K, and ŷ_{ij} is the prediction vector that assigns a probability weight to each label. Due to the existence of a domain gap across the two domains, i.e., a discrepancy between the source domain distribution p_s(X) and the target domain distribution p_t(X), the source-trained model in Eq. (1) may generalize poorly in the target domain. We want to benefit from the information encoded in the target domain unannotated dataset {x_i^t}_{i=1}^M to improve the model generalization in the target domain. We follow the common strategy of domain alignment in a shared embedding space to tackle UDA. Consider our model f to be a deep convolutional neural network (CNN). Let f = φ ∘ χ ∘ ψ, where ψ(·): R^{W×H×C} → R^{U×V} is a CNN encoder, χ(·): R^{U×V} → R^{W×H} is an up-scaling CNN decoder, and φ(·): R^{W×H} → R^{W×H} is a classification network that takes latent space representations as inputs and assigns label-probability values. We model the shared embedding space as the output space of the subnetwork χ ∘ ψ(·). Solving UDA then reduces to aligning the distributions of the source domain and the target domain in the embedding space. This translates into minimizing the distributional discrepancy between the distributions χ ∘ ψ(p_s(·)) and χ ∘ ψ(p_t(·)) in the embedding space. A large group of UDA algorithms select a probability distribution metric D(·, ·), e.g., SWD or KL-divergence, and then use the source and the target domain data points, X_S = [x_1^s, ..., x_N^s] and X_T = [x_1^t, ..., x_M^t], to minimize the loss term D(χ ∘ ψ(p_s(·)), χ ∘ ψ(p_t(·))) as a regularizer.
However, this constrains the user to have access to the source domain data in order to compute D(χ ∘ ψ(p_s(·)), χ ∘ ψ(p_t(·))), which couples the two domains. In medical image segmentation and other similar avenues in which privacy or security is crucial, sharing the source domain data is not possible. As a result, UDA based on the above pipeline for domain alignment of feature embeddings would be inoperative. To tackle this challenge, we provide a solution that aligns the two domains without sharing the source domain data, by benefiting from an intermediate distribution that is learned in the source domain.
4. Proposed Algorithm
Our proposed approach is based on using a prototypical distribution P_Z as a surrogate for the learned distribution of the source domain in the embedding space. Upon training f_θ using Eq. (1), the embedding space becomes discriminative for the source domain. This means that the source distribution in the embedding space will be a multi-modal distribution, where each mode denotes one of the classes. This distribution can be modeled as a Gaussian mixture model (GMM). To develop a source-free UDA algorithm, we can draw random samples from the GMM and, instead of relying on the source data, align the target domain distribution with the prototypical distribution in the embedding space. In other words, we estimate the term D(χ ∘ ψ(p_s(·)), χ ∘ ψ(p_t(·))) with D(P_Z(·), χ ∘ ψ(p_t(·))), which does not depend on source samples. In our formulation, we use the Sliced Wasserstein Distance as the distribution metric for minimizing the domain discrepancy. A visual description of our approach is presented in Figure 1. More specifically, the feature extractor χ ∘ ψ transforms the input distribution p_s(·) into the prototypical distribution P_Z(·) = χ ∘ ψ(p_s(·)), based on which the classifier φ assigns the label probabilities. This distribution will have K modes. Our key idea is to approximate P_Z(·) via a GMM with K components, where each component encodes one class:

P_Z(z) = Σ_{k=1}^K α_k p_k(z) = Σ_{k=1}^K α_k N(z | μ_k, Σ_k),

where α_k represents the mixture probability for class k ∈ {1, ..., K}, μ_k represents the mean of Gaussian k, and Σ_k is the covariance matrix of the k-th component. When the network f is trained on the source domain, we can estimate the GMM parameters from the latent features obtained from the source training samples {(χ(ψ(x^s))_{ij}, y^s_{ij})}. Moreover, we do not need to estimate the GMM parameters using unsupervised algorithms such as expectation maximization (EM), as we have direct access to the labels Y_S. To improve class separation in the prototypical distribution P_Z, we only use high-confidence samples in each class for estimating the parameters of p_k(·). We use a confidence threshold parameter ρ, and discard all samples for which the classifier confidence on its prediction p_ij is strictly less than ρ. This step helps to cancel out class outliers. Let S_ρ = {(x^s_{ij}, y^s_{ij}) | φ(χ(ψ(x)))_{ij} > ρ} be the source data pixels on which the classifier φ assigns confidence greater than ρ. Also, let S_{ρ,k} = {(x, y) | (x, y) ∈ S_ρ, y = k}. We can then generate empirical estimates for α_k, μ_k, and Σ_k as:

α̂_k = |S_{ρ,k}| / |S_ρ|,
μ̂_k = (1 / |S_{ρ,k}|) Σ_{(x,y) ∈ S_{ρ,k}} χ(ψ(x)),
Σ̂_k = (1 / |S_{ρ,k}|) Σ_{(x,y) ∈ S_{ρ,k}} (χ(ψ(x)) − μ̂_k)^T (χ(ψ(x)) − μ̂_k).   (2)

Given the estimated parameters α̂_k, μ̂_k, Σ̂_k of the prototypical distribution, we can perform domain alignment. The goal is to adapt the model such that the target latent distribution χ ∘ ψ(p_t(X)) matches the distribution P_Z in the embedding space. To this end, we can generate a pseudo-dataset (Z_P, Y_P) by drawing random samples from the GMM and align χ ∘ ψ(X_T) with Z_P as:

L_adapt = L_ce(φ(Z_P), Y_P) + λ D(χ(ψ(X_T)), Z_P)   (3)

The first term in Eq. (3) involves fine-tuning the classifier on samples from (Z_P, Y_P) to ensure that the classifier continues to generalize well. The second term enforces the distributional alignment. As a result, the updated model will generalize on the target domain.
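The empirical estimates in Eq. (2) amount to per-class means and covariances over confident pixels. A minimal NumPy sketch under our own naming follows; the function signature and flat array layout are illustrative assumptions, not the authors' code:

```python
import numpy as np

def estimate_gmm_params(features, labels, confidences, rho, num_classes):
    """Empirical GMM estimates in the spirit of Eq. (2).

    features:    (n, d) latent feature vectors, one per pixel.
    labels:      (n,)   ground-truth class index per pixel.
    confidences: (n,)   classifier confidence for each pixel.
    Only pixels with confidence > rho (the set S_rho) are kept.
    """
    keep = confidences > rho
    feats, labs = features[keep], labels[keep]
    alphas, mus, sigmas = [], [], []
    for k in range(num_classes):
        fk = feats[labs == k]                    # the subset S_{rho,k}
        alphas.append(len(fk) / len(feats))      # mixture weight alpha_k
        mu = fk.mean(axis=0)                     # empirical mean mu_k
        mus.append(mu)
        diff = fk - mu
        sigmas.append(diff.T @ diff / len(fk))   # empirical covariance Sigma_k
    return np.array(alphas), np.array(mus), np.array(sigmas)
```

Note that each class must retain at least one confident pixel for the covariance estimate to be defined; in practice the threshold ρ is chosen so that this holds.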
Since the source samples are not used in Eq. (3), the privacy of the source domain is also preserved. Note that the source and target domains must share initial similarities; otherwise, domain alignment as described above may produce overlap of disjoint classes. The last ingredient of our approach is the selection of the distance metric D(·, ·). We use SWD for this purpose. SWD is an approximation of the high-dimensional Wasserstein Distance (WD) [41, 17], which has been shown to be helpful in deep learning. Computing WD directly is challenging; however, WD can be empirically approximated via SWD [30]. SWD achieves this by approximating the high-dimensional optimal transport problem via 1D Wasserstein Distance instances, obtained by performing L random unit-sphere projections o_i of its two input distributions:

D(P, Q) = (1/L) Σ_{i=1}^L D_{1-WD}(⟨P, o_i⟩, ⟨Q, o_i⟩),   (4)

where P and Q are the input distributions, and ⟨P, o_i⟩ and ⟨Q, o_i⟩ denote the one-dimensional distributional slices. In Eq. (4), the term D_{1-WD} denotes the 1D WD metric. Since WD has a closed-form solution in one dimension, Eq. (4) can be computed efficiently. We present the pseudocode of the above-described approach, called Source-Free Semantic segmentation (SFS), in Algorithm 1.
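A Monte-Carlo version of Eq. (4) for empirical samples can be sketched as follows. Sorting the projected samples implements the closed-form 1D solution; the function name, the squared Wasserstein-2 cost, and the equal-sample-size assumption are our own simplifications:

```python
import numpy as np

def sliced_wasserstein(P, Q, num_projections=50, rng=None):
    """Monte-Carlo sliced Wasserstein distance between two sample sets.

    P, Q: (n, d) sample matrices (assumed equal size so sorted samples
    pair directly). For each random unit direction o_i, samples are
    projected to 1D, where WD has a closed form via sorting.
    """
    rng = np.random.default_rng(rng)
    d = P.shape[1]
    total = 0.0
    for _ in range(num_projections):
        o = rng.normal(size=d)
        o /= np.linalg.norm(o)           # random direction on the unit sphere
        p, q = np.sort(P @ o), np.sort(Q @ o)
        total += np.mean((p - q) ** 2)   # squared 1D Wasserstein-2 distance
    return total / num_projections
```

In the adaptation loss, this quantity is computed between a batch of target latent features and a batch of GMM samples, and is differentiable with respect to the features when implemented in an autodiff framework.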
5. Theoretical Analysis
We propose that Algorithm 1 is effective because an upper bound of the expected error for the target domain is minimized as a result of domain alignment.

Algorithm 1: SFS(λ, ρ)
Initial Training:
  Input: source domain dataset D_S = (X_S, Y_S)
  Training on the source domain: θ̂ = arg min_θ Σ_i L(f_θ(x_i^s), y_i^s)
  Prototypical distribution estimation: use Eq. (2), set ρ, and estimate α̂_j, μ̂_j, and Σ̂_j
Model Adaptation:
  Input: target dataset D_T = (X_T)
  Pseudo-dataset generation: D_P = (Z_P, Y_P) = ([z_1^p, ..., z_{N_p}^p], [y_1^p, ..., y_{N_p}^p]), where z_i^p ∼ P_Z(z), 1 ≤ i ≤ N_p
  for itr = 1, ..., ITR do
    draw random batches from D_T and D_P
    update the model by solving Eq. (3)
  end for
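The pseudo-dataset generation step of Algorithm 1 can be sketched as ancestral sampling from the estimated GMM: first pick a component per sample according to the mixture weights, then draw from that Gaussian. The function name and shapes below are illustrative assumptions:

```python
import numpy as np

def sample_pseudo_dataset(alphas, mus, sigmas, n_samples, rng=None):
    """Draw a pseudo-dataset (Z_P, Y_P) from the prototypical GMM.

    alphas: (K,)       mixture weights summing to 1.
    mus:    (K, d)     component means.
    sigmas: (K, d, d)  component covariance matrices.
    """
    rng = np.random.default_rng(rng)
    # Pick one mixture component per sample; the index doubles as the label.
    ys = rng.choice(len(alphas), size=n_samples, p=alphas)
    zs = np.stack([rng.multivariate_normal(mus[k], sigmas[k]) for k in ys])
    return zs, ys
```

The resulting (zs, ys) pairs play the role of (Z_P, Y_P) in Eq. (3): they are fed to the classifier for the cross-entropy term and matched against target features via SWD.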
We analyze the problem in a standard PAC-learning setting. Consider the set of classifier sub-networks H = {ψ_w(·) | ψ_w(·): Z → R^K, w ∈ R^W} to be our hypothesis space. Let e_S and e_T denote the expected error of the optimal model in this space on the source and target domains, respectively. Let ψ_{w*} be the model that minimizes the combined source and target expected error e_C(w), defined as w* = arg min_w e_C(w) = arg min_w {e_S + e_T}. This model is the best model within the hypothesis space in terms of generalizability on both domains. Additionally, consider that μ̂_S = (1/N) Σ_{n=1}^N δ(χ(ψ(x_n^s))) and μ̂_T = (1/M) Σ_{m=1}^M δ(χ(ψ(x_m^t))) are the empirical source distribution and the empirical target distribution in the embedding space, both built from the available data points. Let μ̂_P = (1/N_p) Σ_{q=1}^{N_p} δ(z_q) denote the empirical prototypical distribution, built from the generated pseudo-dataset. Since we build our pseudo-dataset from points with confident labels, we can set ρ = E_{z∼P̂_Z(z)}(L(ψ(z), ψ_ŵ(z))).

Theorem 1: Consider that we generate a pseudo-dataset using the prototypical distribution and perform UDA according to Algorithm 1. Then:

e_T ≤ e_S + W(μ̂_S, μ̂_P) + W(μ̂_T, μ̂_P) + (1 − ρ) + e_{C'}(w*) + √((2 log(1/ξ))/ζ) (√(1/N) + √(1/M) + 2√(1/N_p)),   (5)

where W(·, ·) denotes the WD distance and ξ is a constant dependent on the loss function L(·). Proof: the proof is included in the Appendix.

According to Theorem 1, Algorithm 1 minimizes the upper bound expressed in Eq. (5) for the target domain expected risk. The source expected risk is minimized when we train the model on the source domain. The second term in Eq. (5) is minimized when the GMM is fitted on the source domain distribution. The third term of the Eq. (5) upper bound is minimized because it is the second term in Eq. (3).
The fourth term is small when ρ ≈ 1. The term e_{C'}(w*) will be small if the domains are related to the extent that a jointly trained model can generalize well on both domains, e.g., there should be no label mismatch between similar classes across the two domains. The last term in Eq. (5) is negligible if the training datasets are large enough.
6. Experimental Validation
Our code is included in the Supplementary Material. We evaluate our algorithm on the following datasets.
Multi-Modality Whole Heart Segmentation Dataset (MMWHS) [73]: this dataset consists of multi-modality whole-heart images obtained on different imaging devices at different imaging sites. Segmentation maps are provided for the MRI 3D heart images and CT 3D heart images, which exhibit a domain gap. Following the UDA setup, we use the MRI images as the source domain and the CT images as the target domain. We perform UDA with respect to four of the available segmentation classes: ascending aorta (AA), left ventricle blood cavity (LVC), left atrium blood cavity (LAC), and myocardium of the left ventricle (MYO). We use the same experimental setup and parsed dataset used by Dou et al. [14] for a fair comparison. For the MRI source domain, we use augmented samples from the MRI 3D instances. The target domain consists of augmented samples from the 3D CT images, and we report results on the held-out CT instances, as proposed by Chen et al. [7]. Each 3D segmentation map used for assessing test performance is normalized to have zero mean and unit variance.
CHAOS MR → Multi-Atlas Labeling Beyond the Cranial Vault: the second domain adaptation task consists of data from two different datasets. As the source domain, we consider the 2019 CHAOS MR dataset [26], previously used in the 2019 CHAOS Grand Challenge. The dataset consists of both MR and CT scans with segmentation maps for the following abdominal organs: liver, right kidney, left kidney, and spleen. Similar to [7], we use the T2-SPIR MR images as our source domain. Each scan is centered to zero mean and unit variance, and values more than three standard deviations away from the mean are clipped. The obtained MR scans are split between training and validation. The target domain is represented by the dataset presented in the Multi-Atlas Labeling Beyond the Cranial Vault MICCAI 2015 Challenge [32]. We utilize the CT scans in the training set for which segmentation maps are provided, splitting them between adaptation and evaluation. CT values were clipped to a HU window following literature [70]. The images were re-sampled to a fixed axial view size, the background was cropped such that the distance between any labeled pixel and the image borders is at least a fixed number of pixels, and the scans were again resized. Finally, each 3D scan was normalized independently to zero mean and unit variance, and values more than three standard deviations from the mean were clipped. Data augmentation was performed as follows on both the training MR and training CT instances: (1) random rotations, (2) multiplying image intensities by a random factor, (3) adding random Gaussian noise, (4) random cropping. Both of the above problems involve 3D scans; however, our network architecture receives 2D images, each image consisting of three channels. To circumvent this discrepancy, we follow the methodology of Chen et al. [6].
We introduce higher-dimensional features into the 2D images by creating images from groups of three consecutive scan slices, and using as labels the segmentation map of the middle slice. Implementation details, including network architecture, learning schedule, batch selection, hardware setup, etc., are included in the Appendix. We use two main metrics for evaluation: the Dice coefficient and the average symmetric surface distance (ASSD), both of which have been used in the literature. The Dice coefficient is a popular choice in medical image analysis works that address semantic segmentation [6, 7, 70]. It is used for direct evaluation of segmentation map accuracy. The average symmetric surface distance is a metric that has been used [59, 7, 15] to assess the quality of the borders of predicted segmentation maps. A good segmentation will have a high Dice coefficient and a low ASSD value; depending on the application, one metric may be more appropriate than the other. We compare our approach to other state-of-the-art techniques developed for unsupervised image segmentation. To the best of our knowledge, no prior work addresses source-free UDA for semantic segmentation, so we compare our work against existing UDA techniques that need source data for alignment. We compare against the adversarial approaches PnP-AdaNet [12], SynSeg-Net [23], AdaOutput [61], CycleGAN [72], CyCADA [21], and SIFA [7]. These recently developed methods for semantic segmentation serve as upper bounds for our method because we do not use the source domain data. We reiterate that the advantage of our method is privacy preservation, and we do not claim the best performance.
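For a binary mask, the Dice coefficient described above reduces to 2|A ∩ B| / (|A| + |B|). A generic sketch, not the paper's evaluation code:

```python
import numpy as np

def dice_coefficient(pred, target, eps=1e-8):
    """Dice = 2|A ∩ B| / (|A| + |B|) for two binary masks.

    pred, target: arrays of 0/1 values of the same shape; eps avoids
    division by zero when both masks are empty.
    """
    pred, target = pred.astype(bool), target.astype(bool)
    inter = np.logical_and(pred, target).sum()
    return 2.0 * inter / (pred.sum() + target.sum() + eps)
```

For multi-class segmentation maps, the coefficient is typically computed per class (one binary mask per organ) and then averaged, which matches the per-class columns reported in Tables 1 and 2.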
We observe that our method is comparable to SOTA approaches on the MMWHS dataset, despite the additional source-free constraint. We also note that we outperform all the other methods on two of the segmentation classes, AA and LAC, while having the second-best Dice scores on the remaining two classes, LVC and MYO. However, our method trails the other approaches when compared with respect to ASSD. This shows that our domain alignment approach successfully maps each class in the target embedding to its corresponding vicinity using the prototypical distribution, but lacks the refinement offered by adversarial approaches. In our second segmentation task, we again observe a large increase in segmentation performance between the Source-Only model and the post-adaptation model. The increase in performance along the Dice metric is larger than the increase observed on the MMWHS dataset. Again, we notice that while the Dice score is comparable to the adversarial approaches, our model once again trails in terms of ASSD. We also note that unlike the MMWHS dataset, where we tested our model on data made available by [7], the training/testing dataset for the CHAOS MR → Multi-Atlas CT problem is prepared by following the instructions provided by Chen et al. [7]. We provide results for qualitative assessment. In Figure 5, we present the improvement in segmentation on CT scans from both the cardiac and abdominal organ datasets. In both cases, the supervised models are able to obtain a near-perfect visual similarity to the ground-truth segmentation mask. These represent the upper-bound performance we compare against. The MR-only trained model (row 2) achieves reasonable visual performance on CT scans given that it was trained on a different dataset, an insight supported by the results in Tables 1 and 2.
Post-adaptation, the quality of the segmentation maps becomes much closer to the supervised regime; however, fine details on image borders are missed by our model, for example in several of the displayed images. This is again in line with the observed ASSD performance of our model. In conclusion, while we observe significant gains with respect to the Dice coefficient, which directly measures segmentation accuracy, the improvement in surface distance is not as large as in methods maintaining access to the source domain during adaptation. However, our performance is competitive given the advantage of providing privacy for the source domain. We empirically demonstrate that our theoretical analysis explains why our algorithm works. To achieve this, we analyze the shift in the latent embedding before and after adaptation, and discuss how the ρ parameter influences the prototypical distribution approximation in the latent space. The other hyper-parameter used in our approach, the regularizer λ, was empirically observed to not significantly influence results. In order to visualize these embeddings, we use UMAP [40], which allows embedding high-dimensional distributions into 2D space for visualization. Figure 2 showcases the impact of our algorithm on the latent distribution of the cardiac CT dataset. In Figure 2a, we show the latent embedding of the GMM prototypical distribution learned on the cardiac MR embeddings. Figure 2b exemplifies the distribution of the target CT samples before adaptation. As we can see from Table 1, the source-trained model is able to achieve some level of class separation without any adaptation, which is confirmed in Figure 2b. Even so, we observe non-trivial overlap between the latent embeddings of two of the classes. In Figure 2c, we observe that this overlap is reduced after adaptation. We also observe that the latent embedding of the target CT samples is shifted towards the prototypical distribution.
For completeness, we repeat the same analysis for the organ segmentation dataset. We observe a similar behavior in the shift of the target embeddings towards the learned prototypical distribution; however, compared to the heart segmentation dataset, this shift is visually less pronounced. We also investigate the impact of the value of the ρ parameter on our prototypical distribution. In Figure 4, we present the UMAP visualization of the learned GMM embeddings for three different values of ρ. We observe that while some classes are separated even for small values of ρ, selecting high-confidence samples to learn the GMM yields a prototypical distribution with high separability. Following Equation 5, we use a high value of ρ in our experiments; however, we note this value may be dataset-specific. Due to space constraints, we have included additional ablation studies in the Appendix.

Figure 2: Indirect distribution matching in the embedding space: (a) drawn samples from the GMM trained on the cardiac MR distribution, (b) representations of the cardiac CT test samples prior to model adaptation, (c) representations of the cardiac CT test samples after domain alignment. The four colors correspond to the four cardiac classes: AA, LAC, LVC, MYO.
Figure 3: Indirect distribution matching in the embedding space: (a) drawn samples from the GMM trained on the CHAOS MR distribution, (b) representations of the Multi-Atlas CT test samples prior to model adaptation, (c) representations of the Multi-Atlas CT test samples after domain alignment. The four colors correspond to the four abdominal organ classes: liver, right kidney, left kidney, spleen.

Figure 4: Learnt Gaussian embeddings on the cardiac dataset. From left to right, we present samples from the learnt GMM for three increasing values of ρ, starting at ρ = 0.
7. Conclusions
We developed a novel algorithm for performing unsupervised domain adaptation of semantic segmentation models in a source-free learning setting to preserve privacy for the source domain. After supervised training on a source domain, our algorithm is able to generalize to new domains without having access to source samples. Our idea is based on estimating a prototypical distribution via a GMM and then using it to align the two distributions indirectly. We provided theoretical analysis to demonstrate why our method is effective. We also empirically demonstrated that our algorithm is competitive on two real-world datasets, even when compared against state-of-the-art approaches in medical semantic segmentation that require access to the source data. Moreover, given the source-free nature of our adaptation approach, our algorithm is the first of its kind for settings where privacy of the source domain is a major concern.

                       Dice                                  Average Symmetric Surface Distance
Method                 AA    LAC   LVC   MYO   Average       AA    LAC   LVC   MYO   Average
Source-Only [7]        28.4  27.7  4.0   8.7   17.2          20.6  16.2  N/A   48.4  N/A
PnP-AdaNet [12]        74.0  68.9  61.9  50.8  63.9          12.8  6.3   17.4  14.7  12.8
SynSeg-Net [23]        71.6  69.0  51.6  40.8  58.2          11.7  7.8   7.0   9.2   8.9
AdaOutput [61]         65.2  76.6  54.4  43.3  59.9          17.9  5.5   5.9   8.9   9.6
CycleGAN [72]          73.8  75.7  52.3  28.7  57.6          11.5  13.6  9.2   8.8   10.8
CyCADA [21]            72.9  77.0  62.4  45.3  64.4          9.6   8.0   9.6   10.5  9.4
SIFA [7]               81.3  79.5  73.8  61.6  74.1          7.9   6.2   5.5   8.5   7.0
Supervised
Source-Only
Ours
Table 1: Results for the Cardiac MR → CT adaptation task. We compare our results to the results reported in Table I of [7].
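The Dice scores in Tables 1 and 2 measure volume overlap between predicted and ground-truth masks. A minimal per-class implementation (our own illustrative sketch, not the authors' evaluation code) might look like:

```python
import numpy as np

def dice_score(pred, target, cls):
    """Dice coefficient for one class: 2|P ∩ G| / (|P| + |G|).

    pred, target: integer label maps of identical shape
    cls:          class index to evaluate
    Returns a value in [0, 1]; 1.0 by convention when both masks are empty.
    """
    p = (pred == cls)
    g = (target == cls)
    denom = p.sum() + g.sum()
    if denom == 0:
        return 1.0  # class absent in both prediction and ground truth
    return 2.0 * np.logical_and(p, g).sum() / denom
```

Averaging this quantity over the per-structure classes (AA, LAC, LVC, MYO for the cardiac task) gives the "Average" column.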
                       Dice                                          Average Symmetric Surface Distance
Method                 Liver  R. Kidney  L. Kidney  Spleen  Average  Liver  R. Kidney  L. Kidney  Spleen  Average
Source-Only [7]        73.1   47.3       57.3       55.1    58.2     2.9    5.6        7.7        7.4     5.9
SynSeg-Net [23]        85.0   82.1       72.7       81.0    80.2     2.2    1.3        2.1        2.0     1.9
AdaOutput [61]         85.4   79.7       79.7       81.7    81.6     1.7    1.2        1.8        1.6     1.6
CycleGAN [72]          83.4   79.3       79.4       77.3    79.9     1.8    1.3        1.2        1.9     1.6
CyCADA [21]            84.5   78.6       80.3       76.9    80.1     2.6    1.4        1.3        1.9     1.8
SIFA [7]               88.0   83.3       80.9       82.6    83.7     1.2    1.0        1.5        1.6     1.3
Supervised
Source-Only (ours)
Ours
Table 2: Results for the Abdominal MR → CT adaptation task. We compare our results to the results reported in Table II of [7].

Figure 5: Segmentation maps of CT samples from the two datasets used for evaluation. The first five columns correspond to cardiac images, while the last five columns correspond to abdominal images. From top to bottom: gray-scale CT images, source-only model predictions, post-adaptation model predictions, supervised predictions on the CT data, ground truth labels.

References

[1] Nicholas Ayache. Deep learning for medical image analysis. In S. Kevin Zhou, Hayit Greenspan, and Dinggang Shen, editors, Deep Learning for Medical Image Analysis, page xxiii. Academic Press, 2017.
[2] Luca Bertinetto, Jack Valmadre, João F. Henriques, Andrea Vedaldi, and Philip H. S. Torr. Fully-convolutional siamese networks for object tracking. In Gang Hua and Hervé Jégou, editors, Computer Vision – ECCV 2016 Workshops, pages 850–865, Cham, 2016. Springer International Publishing.
[3] François Bolley, Arnaud Guillin, and Cédric Villani. Quantitative concentration inequalities for empirical measures on non-compact spaces. Probability Theory and Related Fields, 137(3-4):541–593, 2007.
[4] Konstantinos Bousmalis, Nathan Silberman, David Dohan, Dumitru Erhan, and Dilip Krishnan. Unsupervised pixel-level domain adaptation with generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3722–3731, 2017.
[5] Cheng Chen, Qi Dou, Hao Chen, and Pheng-Ann Heng. Semantic-aware generative adversarial nets for unsupervised domain adaptation in chest x-ray segmentation. In International Workshop on Machine Learning in Medical Imaging, pages 143–151. Springer, 2018.
[6] Cheng Chen, Qi Dou, Hao Chen, Jing Qin, and Pheng-Ann Heng. Synergistic image and feature adaptation: Towards cross-modality domain adaptation for medical image segmentation. In Proceedings of The Thirty-Third Conference on Artificial Intelligence (AAAI), pages 865–872, 2019.
[7] C. Chen, Q. Dou, H. Chen, J. Qin, and P. A. Heng. Unsupervised bidirectional cross-modality adaptation via deeply synergistic image and feature alignment for medical image segmentation. IEEE Transactions on Medical Imaging, 39(7):2494–2505, 2020.
[8] Chaoqi Chen, Weiping Xie, Wenbing Huang, Yu Rong, Xinghao Ding, Yue Huang, Tingyang Xu, and Junzhou Huang. Progressive feature alignment for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 627–636, 2019.
[9] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, and Alan L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2017.
[10] Jaehoon Choi, Taekyung Kim, and Changick Kim. Self-ensembling with GAN-based data augmentation for domain adaptation in semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 6830–6840, 2019.
[11] Nicolas Courty, Rémi Flamary, Devis Tuia, and Alain Rakotomamonjy. Optimal transport for domain adaptation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 39(9):1853–1865, 2016.
[12] Qi Dou, Cheng Ouyang, Cheng Chen, Hao Chen, Ben Glocker, Xiahai Zhuang, and Pheng-Ann Heng. PnP-AdaNet: Plug-and-play adversarial domain adaptation network at unpaired cross-modality cardiac segmentation. IEEE Access, 7:99065–99076, 2019.
[13] Qi Dou, Cheng Ouyang, Cheng Chen, Hao Chen, and Pheng-Ann Heng. Unsupervised cross-modality domain adaptation of ConvNets for biomedical image segmentations with adversarial loss. arXiv:1804.10916 [cs], June 2018.
[14] Qi Dou, Cheng Ouyang, Cheng Chen, Hao Chen, and Pheng-Ann Heng. Unsupervised cross-modality domain adaptation of ConvNets for biomedical image segmentations with adversarial loss. In Proceedings of the 27th International Joint Conference on Artificial Intelligence (IJCAI), pages 691–697, 2018.
[15] Qi Dou, Lequan Yu, Hao Chen, Yueming Jin, Xin Yang, Jing Qin, and Pheng-Ann Heng. 3D deeply supervised network for automated segmentation of volumetric medical images. Medical Image Analysis, 41:40–54, 2017.
[16] K. Drossos, P. Magron, and T. Virtanen. Unsupervised adversarial domain adaptation based on the Wasserstein distance for acoustic scene classification. Pages 259–263, 2019.
[17] Charlie Frogner, Chiyuan Zhang, Hossein Mobahi, Mauricio Araya, and Tomaso A. Poggio. Learning with a Wasserstein loss. In Advances in Neural Information Processing Systems, pages 2053–2061, 2015.
[18] Alberto Garcia-Garcia, Sergio Orts-Escolano, Sergiu Oprea, Victor Villena-Martinez, and Jose Garcia-Rodriguez. A review on deep learning techniques applied to semantic segmentation. arXiv preprint arXiv:1704.06857, 2017.
[19] Muhammad Ghifary, W. Bastiaan Kleijn, Mengjie Zhang, David Balduzzi, and Wen Li. Deep reconstruction-classification networks for unsupervised domain adaptation. In European Conference on Computer Vision, pages 597–613. Springer, 2016.
[20] Simon Hecker, Dengxin Dai, and Luc Van Gool. End-to-end learning of driving models with surround-view cameras and route planners. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[21] Judy Hoffman, Eric Tzeng, Taesung Park, Jun-Yan Zhu, Phillip Isola, Kate Saenko, Alexei Efros, and Trevor Darrell. CyCADA: Cycle-consistent adversarial domain adaptation. In International Conference on Machine Learning, pages 1989–1998. PMLR, 2018.
[22] Yuankai Huo, Zhoubing Xu, Shunxing Bao, Albert Assad, Richard G. Abramson, and Bennett A. Landman. Adversarial synthesis learning enables segmentation without target modality ground truth. Pages 1217–1220. IEEE, 2018.
[23] Yuankai Huo, Zhoubing Xu, Hyeonsoo Moon, Shunxing Bao, Albert Assad, Tamara K. Moyo, Michael R. Savona, Richard G. Abramson, and Bennett A. Landman. SynSeg-Net: Synthetic segmentation without target modality ground truth. IEEE Transactions on Medical Imaging, 38(4):1016–1025, April 2019.
[24] Konstantinos Kamnitsas, Christian Baumgartner, Christian Ledig, Virginia Newcombe, Joanna Simpson, Andrew Kane, David Menon, Aditya Nori, Antonio Criminisi, Daniel Rueckert, et al. Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In International Conference on Information Processing in Medical Imaging, pages 597–609. Springer, 2017.
[25] Konstantinos Kamnitsas, Christian Baumgartner, Christian Ledig, Virginia Newcombe, Joanna Simpson, Andrew Kane, David Menon, Aditya Nori, Antonio Criminisi, Daniel Rueckert, and Ben Glocker. Unsupervised domain adaptation in brain lesion segmentation with adversarial networks. In Marc Niethammer, Martin Styner, Stephen Aylward, Hongtu Zhu, Ipek Oguz, Pew-Thian Yap, and Dinggang Shen, editors, Information Processing in Medical Imaging, pages 597–609, Cham, 2017. Springer International Publishing.
[26] Ali Emre Kavur, M. Alper Selver, Oguz Dicle, Mustafa Barış, and N. Sinem Gezer. CHAOS - Combined (CT-MR) healthy abdominal organ segmentation challenge data, April 2019.
[27] Salome Kazeminia, Christoph Baur, Arjan Kuijper, Bram van Ginneken, Nassir Navab, Shadi Albarqouni, and Anirban Mukhopadhyay. GANs for medical image analysis. Artificial Intelligence in Medicine, 109:101938, 2020.
[28] J. Ker, L. Wang, J. Rao, and T. Lim. Deep learning applications in medical image analysis. IEEE Access, 6:9375–9389, 2018.
[29] Jinkyu Kim and John Canny. Interpretable learning for self-driving cars by visualizing causal attention. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), October 2017.
[30] Soheil Kolouri, Kimia Nadjahi, Umut Simsekli, Roland Badeau, and Gustavo Rohde. Generalized sliced Wasserstein distances. In Advances in Neural Information Processing Systems, pages 261–272, 2019.
[31] Jogendra Nath Kundu, Naveen Venkat, R. Venkatesh Babu, et al. Universal source-free domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4544–4553, 2020.
[32] Bennett Landman, Z. Xu, J. E. Igelsias, M. Styner, T. R. Langerak, and A. Klein. Multi-atlas labeling beyond the cranial vault - workshop and challenge, 2015.
[33] Tien-Nam Le, Amaury Habrard, and Marc Sebban. Deep multi-Wasserstein unsupervised domain adaptation. Pattern Recognition Letters, 125:249–255, 2019.
[34] Chen-Yu Lee, Tanmay Batra, Mohammad Haris Baig, and Daniel Ulbricht. Sliced Wasserstein discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10285–10295, 2019.
[35] Guosheng Lin, Anton Milan, Chunhua Shen, and Ian Reid. RefineNet: Multi-path refinement networks for high-resolution semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1925–1934, 2017.
[36] Dingding Liu, Yingen Xiong, Kari Pulli, and Linda Shapiro. Estimating image segmentation difficulty. In Machine Learning and Data Mining in Pattern Recognition, volume 6871, pages 484–495, 2011.
[37] D. Liu, D. Zhang, Y. Song, F. Zhang, L. O'Donnell, H. Huang, M. Chen, and W. Cai. PDAM: A panoptic-level feature alignment framework for unsupervised domain adaptive instance segmentation in microscopy images. IEEE Transactions on Medical Imaging, pages 1–1, 2020.
[38] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3431–3440, 2015.
[39] Xinhong Ma, Tianzhu Zhang, and Changsheng Xu. GCAN: Graph convolutional adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.
[40] Leland McInnes, John Healy, and James Melville. UMAP: Uniform manifold approximation and projection for dimension reduction, 2020.
[41] Quentin Mérigot. A multiscale approach to optimal transport. In Computer Graphics Forum, volume 30, pages 1583–1592. Wiley Online Library, 2011.
[42] Saeid Motiian, Quinn Jones, Seyed Iranmanesh, and Gianfranco Doretto. Few-shot adversarial domain adaptation. In Advances in Neural Information Processing Systems, pages 6670–6680, 2017.
[43] Hyeonwoo Noh, Seunghoon Hong, and Bohyung Han. Learning deconvolution network for semantic segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pages 1520–1528, 2015.
[44] Yingwei Pan, Ting Yao, Yehao Li, Yu Wang, Chong-Wah Ngo, and Tao Mei. Transferrable prototypical networks for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2239–2247, 2019.
[45] Ievgen Redko, Amaury Habrard, and Marc Sebban. Theoretical analysis of domain adaptation with optimal transport. In Joint European Conference on Machine Learning and Knowledge Discovery in Databases, pages 737–753. Springer, 2017.
[46] Mohammad Rostami. Learning Transferable Knowledge Through Embedding Spaces. PhD thesis, University of Pennsylvania, 2019.
[47] Mohammad Rostami and Aram Galstyan. Sequential unsupervised domain adaptation through prototypical distributions. arXiv preprint arXiv:2007.00197, 2020.
[48] Mohammad Rostami, Soheil Kolouri, Eric Eaton, and Kyungnam Kim. Deep transfer learning for few-shot SAR image classification. Remote Sensing, 11(11):1374, 2019.
[49] Mohammad Rostami, Soheil Kolouri, and Praveen K. Pilly. Complementary learning for overcoming catastrophic forgetting using experience replay. In Proceedings of the 28th International Joint Conference on Artificial Intelligence, pages 3339–3345. AAAI Press, 2019.
[50] Mohammad Rostami, Soheil Kolouri, Praveen K. Pilly, and James McClelland. Generative continual concept learning. In AAAI, pages 5545–5552, 2020.
[51] Kuniaki Saito, Kohei Watanabe, Yoshitaka Ushiku, and Tatsuya Harada. Maximum classifier discrepancy for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3723–3732, 2018.
[52] Cristiano Saltori, Stéphane Lathuilière, Nicu Sebe, Elisa Ricci, and Fabio Galasso. SF-UDA3D: Source-free unsupervised domain adaptation for LiDAR-based 3D object detection. arXiv preprint arXiv:2010.08243, 2020.
[53] Swami Sankaranarayanan, Yogesh Balaji, Carlos D. Castillo, and Rama Chellappa. Generate to adapt: Aligning domains using generative adversarial networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 8503–8512, 2018.
[54] Dinggang Shen, Guorong Wu, and Heung-Il Suk. Deep learning in medical image analysis. Annual Review of Biomedical Engineering, 19(1):221–248, 2017.
[55] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition, 2015.
[56] Serban Stan and Mohammad Rostami. Unsupervised model adaptation for continual semantic segmentation. In Proceedings of The Thirty-Fifth Conference on Artificial Intelligence (AAAI), 2020.
[57] Mohammad Rostami, David Huber, and Tsai-Ching Lu. A crowdsourcing triage algorithm for geopolitical event forecasting. In Proceedings of the 12th ACM Conference on Recommender Systems, 2018.
[58] Baochen Sun, Jiashi Feng, and Kate Saenko. Correlation alignment for unsupervised domain adaptation. In Domain Adaptation in Computer Vision Applications, pages 153–171. Springer, 2017.
[59] Changjian Sun, Shuxu Guo, Huimao Zhang, Jing Li, Meimei Chen, Shuzhi Ma, Lanyi Jin, Xiaoming Liu, Xueyan Li, and Xiaohua Qian. Automatic segmentation of liver tumors from multiphase contrast-enhanced CT images based on FCNs. Artificial Intelligence in Medicine, 83:58–66, 2017.
[60] Marco Toldo, Andrea Maracani, Umberto Michieli, and Pietro Zanuttigh. Unsupervised domain adaptation in semantic segmentation: A review. Technologies, 8(2):35, June 2020.
[61] Y.-H. Tsai, W.-C. Hung, S. Schulter, K. Sohn, M.-H. Yang, and M. Chandraker. Learning to adapt structured output space for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.
[62] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167–7176, 2017.
[63] Hemanth Venkateswara, Jose Eusebio, Shayok Chakraborty, and Sethuraman Panchanathan. Deep hashing network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5018–5027, 2017.
[64] Alper Yilmaz, Omar Javed, and Mubarak Shah. Object tracking: A survey. ACM Computing Surveys, 38(4):13-es, December 2006.
[65] Junyi Zhang, Ziliang Chen, Junying Huang, Liang Lin, and Dongyu Zhang. Few-shot structured domain adaptation for virtual-to-real scene parsing. In Proceedings of the IEEE International Conference on Computer Vision Workshops, 2019.
[66] Weichen Zhang, Wanli Ouyang, Wen Li, and Dong Xu. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3801–3809, 2018.
[67] Weichen Zhang, Wanli Ouyang, Wen Li, and Dong Xu. Collaborative and adversarial network for unsupervised domain adaptation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2018.
[68] Yue Zhang, Shun Miao, Tommaso Mansi, and Rui Liao. Task driven generative modeling for unsupervised domain adaptation: Application to x-ray image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 599–607. Springer, 2018.
[69] Yu Zhao, Hongwei Li, Shaohua Wan, Anjany Sekuboyina, Xiaobin Hu, Giles Tetteh, Marie Piraud, and Bjoern Menze. Knowledge-aided convolutional neural network for small organ segmentation. IEEE Journal of Biomedical and Health Informatics, 23(4):1363–1373, 2019.
[70] Yuyin Zhou, Zhe Li, Song Bai, Chong Wang, Xinlei Chen, Mei Han, Elliot Fishman, and Alan Yuille. Prior-aware neural network for partially-supervised multi-organ segmentation, 2019.
[71] Ji Zhu, Hua Yang, Nian Liu, Minyoung Kim, Wenjun Zhang, and Ming-Hsuan Yang. Online multi-object tracking with dual matching attention networks. In Proceedings of the European Conference on Computer Vision (ECCV), September 2018.
[72] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A. Efros. Unpaired image-to-image translation using cycle-consistent adversarial networks, 2020.
[73] Xiahai Zhuang and Juan Shen. Multi-scale patch and multi-modality atlases for whole heart segmentation of MRI. Medical Image Analysis, 31:77–87, 2016.

Appendix

We use a result by Redko et al. [45], which was developed for domain adaptation based on joint training.
Theorem 2 (Redko et al. [45]): Consider that a model is trained on the source domain. Then for any $d' > d$ and $\zeta < \sqrt{2}$, there exists a constant number $N_0$ depending on $d'$ such that for any $\xi > 0$ and $\min(N, M) \ge N_0 \max(\xi^{-(d'+2)}, 1)$, with probability at least $1 - \xi$, the following holds:

$$e_T \le e_S + W(\hat{\mu}_T, \hat{\mu}_S) + e_C(w^*) + \sqrt{2\log(1/\xi)/\zeta}\Big(\sqrt{\tfrac{1}{N}} + \sqrt{\tfrac{1}{M}}\Big). \qquad (6)$$

Theorem 2 bounds the performance of a source-trained model on a target domain through an upper bound that depends on the distance between the source and the target domain distributions in terms of the WD distance. We use Theorem 2 to deduce Theorem 1 in the paper. Redko et al. [45] provide their analysis for the case of a binary classifier, but their analysis can be extended to the multiclass scenario.

Theorem 1: Consider that we generate a pseudo-dataset using the prototypical distribution and perform UDA according to Algorithm 1. Then:

$$e_T \le e_S + W(\hat{\mu}_S, \hat{\mu}_P) + W(\hat{\mu}_T, \hat{\mu}_P) + (1 - \rho) + e_{C'}(w^*) + \sqrt{2\log(1/\xi)/\zeta}\Big(\sqrt{\tfrac{1}{N}} + \sqrt{\tfrac{1}{M}} + 2\sqrt{\tfrac{1}{N_p}}\Big), \qquad (7)$$

where $W(\cdot, \cdot)$ denotes the WD distance and $\xi$ is a constant dependent on the loss function $\mathcal{L}(\cdot)$.

Proof:
Since we use the parameter ρ to estimate the prototypical distribution, the probability of predicting incorrect labels for the drawn pseudo-data points is at most $1 - \rho$. We can define the following difference:

$$|\mathcal{L}(\phi_w(z_i^p), y_i^p) - \mathcal{L}(\phi_w(z_i^p), \hat{y}_i^p)| = \begin{cases} 0, & \text{if } y_i^p = \hat{y}_i^p, \\ 1, & \text{otherwise.} \end{cases} \qquad (8)$$

We can apply Jensen's inequality after taking the expectation with respect to the target domain distribution in the embedding space, i.e., $\chi \circ \psi(p_t(X_T))$, on both sides of Eq. (8) and conclude:

$$|e_P - e_T| \le \mathbb{E}\big(|\mathcal{L}(\phi_w(z_i^p), y_i^p) - \mathcal{L}(\phi_w(z_i^p), \hat{y}_i^p)|\big) \le (1 - \rho). \qquad (9)$$

Now we use Eq. (9) to deduce:

$$e_S + e_T = e_S + e_T + e_P - e_P \le e_S + e_P + |e_T - e_P| \le e_S + e_P + (1 - \rho). \qquad (10)$$

Taking the infimum on both sides of Eq. (10) and employing the definition of the joint optimal model yields:

$$e_C(w^*) \le e_{C'}(w^*) + (1 - \rho). \qquad (11)$$

Now we consider Theorem 2 by Redko et al. [45] for the source and target domains in our problem and merge Eq. (11) into Eq. (6) to conclude:

$$e_T \le e_S + W(\hat{\mu}_T, \hat{\mu}_S) + e_{C'}(w^*) + (1 - \rho) + \sqrt{2\log(1/\xi)/\zeta}\Big(\sqrt{\tfrac{1}{N}} + \sqrt{\tfrac{1}{M}}\Big). \qquad (12)$$

In Eq. (12), $e_{C'}$ denotes the true error of the joint optimal model for the source domain and the pseudo-dataset as the second domain.

Now we apply the triangle inequality twice in Eq. (12) to deduce:

$$W(\hat{\mu}_T, \hat{\mu}_S) \le W(\hat{\mu}_T, \mu_P) + W(\hat{\mu}_S, \mu_P) \le W(\hat{\mu}_T, \hat{\mu}_P) + W(\hat{\mu}_S, \hat{\mu}_P) + 2W(\hat{\mu}_P, \mu_P). \qquad (13)$$

We now need Theorem 1.1 by Bolley et al. [3] to simplify the term $W(\hat{\mu}_P, \mu_P)$ in Eq. (13).

Theorem 3 (Theorem 1.1 by Bolley et al. [3]): Consider that $p(\cdot) \in \mathcal{P}(\mathcal{Z})$ and $\int_{\mathcal{Z}} \exp(\alpha \|x\|^2)\, dp(x) < \infty$ for some $\alpha > 0$. Let $\hat{p}(x) = \frac{1}{N}\sum_i \delta(x_i)$ denote the empirical distribution built from the samples $\{x_i\}_{i=1}^N$ drawn i.i.d. from $x_i \sim p(x)$.
Then for any $d' > d$ and $\zeta < \sqrt{2}$, there exists $N_0$ such that for any $\epsilon > 0$ and $N \ge N_0 \max(1, \epsilon^{-(d'+2)})$, we have:

$$P\big(W(p, \hat{p}) > \epsilon\big) \le \exp\Big(-\frac{\zeta}{2} N \epsilon^2\Big). \qquad (14)$$

This theorem provides a relation to measure the distance between the estimated empirical distribution and the true distribution when the distance is measured by the WD metric. We can use both Eq. (13) and Eq. (14) in Eq. (12) to conclude Theorem 1 as stated in the paper:

$$e_T \le e_S + W(\hat{\mu}_S, \hat{\mu}_P) + W(\hat{\mu}_T, \hat{\mu}_P) + (1 - \rho) + e_{C'}(w^*) + \sqrt{2\log(1/\xi)/\zeta}\Big(\sqrt{\tfrac{1}{N}} + \sqrt{\tfrac{1}{M}} + 2\sqrt{\tfrac{1}{N_p}}\Big). \qquad (15)$$

We use the same network architecture on both the cardiac and organ image segmentation UDA tasks: a DeepLabV3 feature extractor [9] with a VGG16 backbone [55], followed by a one-layer classifier.

For the MMWHS dataset we train the network on the supervised source samples with a fixed training schedule, repeated several times. The optimizer of choice is Adam with a small learning rate, ε, and weight decay. We use the standard pixel-wise cross-entropy loss. For the abdominal organ segmentation dataset, we observed better performance by using a weighted cross-entropy loss and dropout, and we repeat the training schedule more times.

We learn the empirical prototypical distribution using the confidence parameter ρ, and observed good separability in the latent distribution for sufficiently large values of ρ.

Finally, when performing adaptation, we again use the Adam optimizer with the same learning-rate settings. Due to GPU memory constraints, we approximate the target distribution via the batch label distribution when sampling from the learnt GMM. Experiments were done on an Nvidia Titan Xp GPU. Code is provided in the supplementary material section of this submission, and will be made freely available online at a later date.
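The adaptation objective aligns target embeddings with samples from the learnt GMM by minimizing the sliced Wasserstein distance (SWD), which replaces the intractable high-dimensional transport problem with many sorted one-dimensional projections. A minimal Monte-Carlo numpy approximation (our own illustrative sketch, not the released training code) is:

```python
import numpy as np

def sliced_wasserstein(x, y, n_proj=128, seed=0):
    """Approximate the sliced 2-Wasserstein distance between two
    d-dimensional point clouds with equal sample counts.

    x, y: (n, d) arrays of samples from the two distributions
    Projects both sets onto random unit directions, sorts the 1-D
    projections, and averages the squared differences.
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    theta = rng.normal(size=(n_proj, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)  # unit directions
    px = np.sort(x @ theta.T, axis=0)  # (n, n_proj): sorted 1-D projections
    py = np.sort(y @ theta.T, axis=0)
    return np.mean((px - py) ** 2)
```

In a training loop, this scalar would serve as the differentiable alignment loss between a batch of target embeddings and a batch of GMM samples; in practice a deep learning framework's automatic differentiation would be used instead of raw numpy.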
We further empirically analyze different components of our approach to demonstrate their effectiveness.
Fine-tuning the classifier.
As we discussed in the main body of the paper, after learning a prototypical distribution characterizing the source embeddings, we align the target embeddings to this distribution by minimizing the sliced Wasserstein distance. In addition, we further train the classifier on samples from this distribution to account for differences from the original source embedding distribution. We next discuss the benefit of fine-tuning the classifier, based on the results in Table 3.
Metric   Fine-Tuned Classifier   Source Domain Classifier*
Dice     73.8                    72.5
ASSD     16.2                    15.9
Table 3: Target performance on the MMWHS adaptation task of our method with and without fine-tuning the classifier on samples from the prototypical distribution. *Reported results are taken as the median over runs.

Given the learnt empirical means and covariances of the prototypical distribution, we compare the post-adaptation target performance of a model that fine-tunes the classifier against a model that does not update the classifier after source training. As expected, fine-tuning the classifier offers a prediction boost, even if the difference is not a significant one. The prototypical distribution is meant to encourage the target embeddings to share a latent space similar to that of the source embeddings, and fine-tuning the classifier accounts for the distribution shift between the source embeddings and the learnt prototypical distribution.

Impact of warm start on adaptation performance.
When performing adaptation, our architecture is initialized with the weights learnt on the source domain. If the weights were randomly initialized, the network would be unable to learn boundary features of organs during the adaptation step, and would only perform distribution matching to the GMM embedding. This is confirmed in practice, as initializing the network with random weights before adaptation yields a very low Dice value and a correspondingly poor ASSD.

We also investigate the information encoded in the convolutional filters before and after adaptation. Based on our results, we expect network filters to retain most of their structure from source training, and not to alter this structure significantly during distribution matching. We exemplify this in Figure 6. We record the visual characteristics of the network filters after the first two and the first four convolutional layers. We observe that the filters appear visually similar before and after adaptation, signifying that the image structural features learnt by the network do not undergo significant change, even though changes in filter values can be observed under the Difference columns.
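The classifier fine-tuning compared in Table 3 requires labeled latent samples drawn from the prototypical distribution. A hedged sketch of generating such a pseudo-dataset from per-class Gaussian estimates follows; all names and the dictionary layout are our own assumptions for illustration.

```python
import numpy as np

def sample_pseudo_dataset(gaussians, n_per_class, seed=0):
    """Draw labeled latent samples from per-class Gaussian estimates.

    gaussians:   {class: (mean (d,), cov (d, d))} learned on source embeddings
    n_per_class: number of latent samples to draw per class
    Returns (z, y): features (k*n, d) and integer labels (k*n,),
    suitable for fine-tuning the classifier head without source data.
    """
    rng = np.random.default_rng(seed)
    zs, ys = [], []
    for c, (mu, cov) in gaussians.items():
        zs.append(rng.multivariate_normal(mu, cov, size=n_per_class))
        ys.append(np.full(n_per_class, c))
    return np.vstack(zs), np.concatenate(ys)
```

Because the pseudo-dataset carries labels by construction, the classifier head can be trained on it with an ordinary cross-entropy objective, which is what makes the source-free fine-tuning step possible.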
Performance for different loss functions.
We investigate the performance of our model under two possible loss functions: cross-entropy (CE) loss and weighted cross-entropy (WCE) loss. CE loss is a standard loss function widely used for classification tasks, while WCE loss has been proposed as a variant of CE targeted at domains with significant class imbalance. In our experiments, our model performs best on the MMWHS task using CE loss, while on the CHAOS MR → Multi-Atlas CT task better performance corresponds to the use of WCE loss. We investigate the performance shift on the MMWHS dataset by replacing CE with WCE. We are aware that using a different loss function entails fine-tuning several training hyper-parameters or changing the training schedule for optimal performance; however, our intent is to show that the performance observed by our method is not too closely tied to the loss function employed. We observe a Dice performance of . and an ASSD performance of . post adaptation on the MMWHS task using WCE loss. While the results are inferior to the . Dice and .2