TricycleGAN: Unsupervised Image Synthesis and Segmentation Based on Shape Priors
Umaseh Sivanesan, Luis H. Braga, Ranil R. Sonnadara, Kiret Dhindsa
Department of Surgery at McMaster University, Hamilton, Ontario, Canada
Division of Paediatric Urology, McMaster Children's Hospital, Hamilton, Ontario, Canada
Research and High Performance Computing, McMaster University, Hamilton, Ontario, Canada
Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada

* Corresponding Author: [email protected]

February 5, 2021
Abstract
Medical image segmentation is routinely performed to isolate regions of interest, such as organs and lesions. Deep learning is currently the state of the art for automatic segmentation, but it is usually limited by the need for supervised training on large datasets that have been manually segmented by trained clinicians. The goal of semi-supervised and unsupervised image segmentation is to greatly reduce, or even eliminate, the need for training data and therefore to minimize the burden on clinicians when training segmentation models. To this end, we introduce TricycleGAN, a novel network architecture capable of unsupervised and semi-supervised image segmentation. This approach uses three generative models to learn translations between medical images and segmentation maps, using edge maps as an intermediate step. Distinct from other approaches based on generative networks, TricycleGAN relies on shape priors rather than colour and texture priors. As such, it is particularly well-suited for several domains of medical imaging, such as ultrasound imaging, where commonly used visual cues may be absent. We present experiments with TricycleGAN on a clinical dataset of kidney ultrasound images and the benchmark ISIC 2018 skin lesion dataset.
Keywords:
Biomedical imaging, machine learning, image segmentation, computer vision, data augmentation, generative models
1 Introduction

In vivo medical imaging is an important and commonly used technique in a wide variety of clinical applications. However, analysis is often made challenging by the poor quality of images obtained with current imaging technologies. In order to take advantage of these technologies, current medical practice relies heavily on clinicians who are highly specialized in interpreting medical images [1, 2].

Segmentation, or identifying and isolating regions of interest (ROIs) from an image, is often a key step in medical image analysis. For both human and machine readers, segmentation aids in the extraction of clinically important features by identifying relevant tissues and/or objects, such as an organ or a lesion, without interference from non-relevant tissues captured in the image [3]. Since manual segmentation, i.e., having appropriately trained clinicians segment images by hand, is time-consuming, expensive, and subjective, significant effort has been put into developing algorithms that can automatically provide accurate and reliable segmentations for downstream analysis [4]. This has led to the growth of the subfield of semantic segmentation, which is image segmentation via the association of pixels with classification labels (e.g., kidney vs. not kidney).

Here we present a novel approach to image segmentation that can be trained unsupervised or semi-supervised, thus greatly reducing, or even eliminating, the need for manually segmented images. Inspired by the way trained clinicians use anatomical knowledge of the shapes of ROIs and their internal structures when performing manual segmentation, we rely on shape priors to generate synthetic labelled data that can be used to train a segmentation model. Shape priors are built in by generating synthetic medical images from ROI shape templates, making it possible to learn mappings between the generated segmentation-medical image pairs. We expand upon the rationale and methodology for this approach in Section 4.

Our new architecture, TricycleGAN, is comprised of three Generative Adversarial Networks (GANs), specifically, three CycleGANs [5] that have been chained together. Each CycleGAN learns one of three mappings: a mapping between original medical images and edge maps derived from those medical images, a mapping between edge maps and segmentation maps for the regions of interest contained in those medical images, and a mapping between segmentation maps and medical images. As a whole, TricycleGAN unifies these three subnetworks using an adapted cycle consistency loss and thus learns to segment medical images by learning to generate realistic image-edge-segmentation triplets. As explained further in Section 4, the inclusion of edge maps provides an intermediate step that introduces sources of variance and complexity, allowing TricycleGAN to translate simplistic ROI templates into realistic medical images while retaining the shape information needed to predict a segmentation map.

To demonstrate the value of this novel approach, we evaluate our method using two distinct datasets: a dataset of ultrasound images for which the task is to segment the kidney, and the ISIC 2018 Skin Lesion Analysis competition dataset, for which the task is to segment skin lesions in dermoscopic images.
2 Related Work

Prior to deep learning approaches to automated image segmentation, techniques based on edge detection [6], region growing [7], contour modelling [8], and texture analysis [9] relied on detecting statistical differences among objects in an image using explicitly defined constraints and criteria. These methods often perform poorly in medical imaging applications because the typically used metrics, such as contrast gradients, are not robust enough to overcome the challenges imposed by noisy images (e.g., in ultrasound imaging) and artifacts (e.g., surrounding tissues).

A current strategy for image segmentation aims to overcome such limitations using supervised deep neural networks that learn to identify distinct objects and ROIs in images by discovering combinations of discriminative features given exposure to numerous labelled samples. Convolutional neural networks (CNNs) have been particularly successful for semantic segmentation and a wide variety of other computer vision applications [10, 11, 12, 13]. Notably, Fully Convolutional Networks (FCNs) omit the fully connected layers used to group pixels by class label in standard CNNs in favour of deconvolution layers used to estimate segmentation probability maps for entire images [14]. Encoder-decoder networks, first introduced by deconvolving a pretrained VGG16 CNN architecture [15], are based on a similar principle. CNNs have also been paired with Conditional Random Fields (CRFs) to better take advantage of the spatial correlations among pixels belonging to the same object [16].

Variations on these segmentation-oriented CNNs have also played an important role in clinical applications of medical imaging [17, 18, 19, 20, 21, 22]. The U-net [23], a type of FCN, has performed well in a variety of medical image segmentation tasks (e.g., [24]), including 3D segmentation tasks [25]. Various extensions have been developed, including incorporating an attention mechanism [26] and altering the initialization of the model [27].

While successful in many areas, CNN-based approaches can be prone to poor performance under conditions that are often observed in medical imaging applications. One issue is that obtaining spatially and semantically contiguous solutions usually requires significant post-processing of the resulting segmentation maps. In the presence of artifacts and distractors, such as multiple salient organs when only one is the desired ROI, such methods can produce large errors. Fortunately, recent work has begun to address this issue [28, 29, 30]. However, another significant limitation for real-world use is that deep learning approaches generally require large amounts of training data, which usually means hundreds or thousands of ground-truth segmentations that must be hand-drawn by radiologists outside of their clinical duties.
Although some success has been seen with unsupervised CNN approaches under some conditions, e.g., with W-net [31] (typically only successful for segmentation tasks with distinct non-overlapping objects) or DeepCo3 [32] (relying on feature similarity among multiple instances of same-class objects in an image), unsupervised solutions for some medical applications may require an altogether different approach.

The use of GANs has been proposed as one way to address the need for ground truth segmentation labels. GANs learn to perform image-to-image translation by training with pairs of images [33], and therefore can be appropriated for segmentation tasks by training them with image-segmentation map pairs in supervised [34, 35, 36, 37, 17] or semi-supervised [38] fashions. A useful function of GANs is their ability to augment training datasets by generating realistic synthetic data, thus requiring relatively smaller amounts of labelled training data [39, 40, 41]. To adapt GANs to learn segmentation in an unsupervised manner, recomposition approaches have been used. For example, SEIGAN can be used to segment foreground objects by exploiting distinct feature statistics between background and foreground objects and similar feature statistics between similar backgrounds [42]. Another recent approach, ReDO, uses scene decomposition to separate objects in an image based on the assumption that each object is distinct with respect to specific properties, such as colour and texture [43].

Unsupervised approaches for image segmentation make reasonable assumptions about ways in which the desired ROI is distinct from other objects in an image. However, the typically relied-upon features, such as colour, texture, or brightness, may not carry over well to many medical imaging domains. In the case of organ segmentation, as illustrated with the kidney ultrasound dataset used in this study, a clinician must rely primarily on a priori anatomical knowledge and experience in order to estimate the contours of the kidney when a clear boundary is not visible. This task is made especially difficult by the fact that the kidney may not even be the most salient object in the image, leading even clinicians from other domains to find the task very challenging.

To fill the gap left by previous methods, TricycleGAN is aimed at overcoming these challenges by incorporating a shape prior for the ROI to generate synthetic image-segmentation map pairs using only unlabelled medical images, simulating the a priori anatomical knowledge that a radiologist would use to perform similar segmentation tasks. TricycleGAN is an extension of CycleGAN [5] that uses three generators: the first translates between segmentation maps and edge maps, the second translates between edge maps and medical images, and the third uses the cycle consistency loss to translate medical images back into segmentation maps. An illustration of the TricycleGAN pipeline is given in Figure 1, and examples showing the output at every major step of the pipeline using the ISIC 2018 skin lesion dataset are shown in Figure 2.
3 Data

3.1 Kidney Ultrasound Dataset

We use a dataset of renal ultrasound images developed for prenatal hydronephrosis, a congenital kidney disorder marked by excessive and potentially dangerous fluid retention in the kidneys. The dataset consists of 2492 2D sagittal kidney ultrasound images from 773 patients across multiple hospital visits. The evaluation set consists of 438 images that have been manually segmented by a trained surgical urologist. After removing training images taken from the same patients represented in the evaluation set, 918 unlabelled images remain for training. During training, random samples of 20% of the images were used for validation. Each grade of hydronephrosis is represented approximately evenly in the evaluation set.

This is a difficult dataset for image segmentation due to poor image quality, unclear contours of the kidneys, and the large variation introduced by different degrees of hydronephrosis (see Supplementary Figure S1). In addition, a major challenge of this dataset is that the two most salient boundaries are the outer ultrasound cone inherent to ultrasound imaging with a probe, and the dark inner region of the kidney, which is caused by fluid retention in hydronephrosis. As neither of these is the desired ROI, both are misleading with respect to segmenting the kidney. Further details concerning this dataset can be found in [44, 45, 46].
3.2 ISIC 2018 Skin Lesion Dataset

We use the ISIC 2018 Lesion Boundary Segmentation Challenge dataset [47, 48] to more directly compare TricycleGAN with other approaches. The provided training set of 2075 images, 100 validation images, and 319 evaluation images are used for their intended purposes in this study. By showing that TricycleGAN is also successful on this benchmark dataset, we show that it is not limited to a single domain and imaging modality. In addition, we show one method by which TricycleGAN can be adapted to also take advantage of commonly used features, such as colour.
3.3 Preprocessing

We follow a methodology similar to that described in [44] for preprocessing the renal ultrasound images. We crop the images to remove white borders, despeckle them to remove speckle noise caused by interference with the ultrasound probe during imaging [49], and re-scale them to 256 × 256 pixels for consistency.
We remove text annotations made by clinicians using the pre-trained Efficient and Accurate Scene Text Detector (EAST) [50]. We then normalize the pixel intensity of each image to the range 0 to 1 after trimming pixel intensities to the 2nd through 98th percentiles of the original pixel intensity across the image. In addition, we enhance the contrast of each image using Contrast Limited Adaptive Histogram Equalization (CLAHE) with a clip limit of 0.03 [51]. Finally, we normalize the images by the mean and standard deviation of the training set during cross-validation.

We perform no preprocessing for the ISIC skin lesion images other than to resize them to 256 × 256 pixels.
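For concreteness, the following is a minimal sketch of this preprocessing pipeline using NumPy and scikit-image. The median filter is a simplified stand-in for the despeckling method of [49], border cropping and EAST-based text removal are omitted, and the function name is ours rather than from a released codebase.

```python
import numpy as np
from skimage import exposure, transform
from skimage.filters import median
from skimage.morphology import disk

def preprocess_ultrasound(image: np.ndarray) -> np.ndarray:
    """Preprocess one grayscale ultrasound image (uint8 array).

    Training-set standardization is applied separately during
    cross-validation and is not shown here.
    """
    # Simple despeckling stand-in; the paper cites a dedicated method [49].
    image = median(image, disk(3)).astype(np.float64)
    # Trim to the 2nd-98th intensity percentiles, then rescale to [0, 1].
    lo, hi = np.percentile(image, (2, 98))
    image = (np.clip(image, lo, hi) - lo) / (hi - lo + 1e-8)
    # Contrast enhancement with CLAHE, clip limit 0.03 [51].
    image = exposure.equalize_adapthist(image, clip_limit=0.03)
    # Re-scale to 256 x 256 for consistency.
    return transform.resize(image, (256, 256), anti_aliasing=True)
```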
4 Methods

4.1 Overview and Rationale

Analogously to how other image features are exploited for semantic segmentation, we assume that the shape of an ROI for a particular image segmentation task is statistically more similar for within-class objects than it is for background objects. In order to make use of this assumption, we additionally assume that the boundary of the ROI is sufficiently prominent in the image to appear in its corresponding edge map (though, as mentioned in Section 3.1, this does not require that the ROI boundary is the most prominent boundary in the image, and does not preclude partial occlusion). Based on these two assumptions, we make use of a basic template shape that is appropriate for representing the ROIs of a given class of objects. For example, both kidney and skin lesion ROIs are roughly elliptical. Since TricycleGAN adds the required complexity to make the resulting synthetic images realistic on its own, constructing a template can be as simple as drawing an ellipse with randomized shape, size, and location. Details of this part of the process are given in Section 4.2.

Note that by constructing simple ROIs explicitly and generating synthetic images from them, we also provide our model with a ground truth segmentation label for the synthetic images. It is for this reason that we do not generate the underlying ground truth with a generative model. Instead, the level of variance and complexity required to generate realistic images in the application domain is introduced by the downstream generators in TricycleGAN.

In addition to the potentially large variation in ROI morphologies within a class, a common challenge with medical imaging segmentation is the presence of artifacts and distractors (i.e., objects or anatomical features that also appear prominently in edge maps, but which are not part of the intended ROI). We overcome both of these challenges simultaneously by using a generative model to learn the statistics of edge maps extracted from real images, which can then be used to construct realistic edge maps from generated template ROIs (see Section 4.4). By then constructing synthetic images in the application domain from the synthetic edge maps, TricycleGAN generates segmentation map - edge map - medical image triplets that can be used for segmentation training.

The reason for using edge maps as an intermediary step is that edge maps can be easily extracted from real training data using known statistical methods, providing a training set of real image - edge map pairs that can be used to learn a mapping with supervision. Since segmentation maps cannot be extracted from the real images, a direct mapping cannot be learned. Instead, as described in Section 4.4, we introduce a patch-occlusion method for learning the statistics of real edge maps and use this to generate realistic edge maps from synthetic segmentation maps, thus completing the loop from synthetic segmentation map to synthetic edge map to synthetic image and back to a predicted segmentation map with the required complexity.

[Figure 1: The TricycleGAN training pipeline. Real edge maps are acquired by edge detection, patch occlusion, and reconstruction with G_pf; synthetic segmentation maps are then passed through G_s2e, G_pf, G_e2m, and G_m2s, trained with a cycle consistency loss to produce synthetic edge maps, enhanced edge maps, synthetic medical images, and predicted segmentations.]
4.2 Constructing ROI Templates

The primary concern when constructing ROI templates, or synthetic segmentation maps, to be used as ground truth segmentation maps is to capture the distribution of shapes and locations that can be expected in the real data. For many applications this is relatively simple to do.

To generate an ROI, we create a randomized ellipse with a random origin (horizontally and vertically offset from the center by up to 1/8 of the image size), rotation (any angle), and major and minor axes (major axis of length between 1/4 and 1/2 of the image length, and minor axis between 0.5 and 0.9 times the length of the major axis). The ellipse itself serves as the ground truth segmentation mask. A sketch of this procedure is given below.

For the ultrasound dataset we require the addition of the ultrasound cone, as its outline is even more prominent than the desired kidney ROI. The cone is also randomly generated by creating a randomized bottom curve (a partially complete ellipse) and a triangle extending beyond the upper boundary of the image (so that it is cut off at the top). All pixels of the generated kidney ROI not lying within the generated ultrasound cone are removed to simulate the common occurrence of only partially captured kidneys. Similar adjustments in constructing ROI templates can be made for other applications where prominent shapes may serve as distractors.
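A minimal NumPy/scikit-image sketch of the randomized ellipse template follows; the ultrasound cone construction and the intersection with the cone are omitted, and the helper name is ours.

```python
import numpy as np
from skimage.draw import ellipse

def random_roi_template(size: int = 256, rng=None) -> np.ndarray:
    """Generate one randomized elliptical ROI mask using the ranges above."""
    rng = rng or np.random.default_rng()
    mask = np.zeros((size, size), dtype=np.uint8)
    # Origin offset from the image centre by up to 1/8 of the image size.
    cy = size / 2 + rng.uniform(-size / 8, size / 8)
    cx = size / 2 + rng.uniform(-size / 8, size / 8)
    # Major axis between 1/4 and 1/2 of the image length (radii are half-axes);
    # minor axis between 0.5 and 0.9 times the major axis.
    major = rng.uniform(size / 4, size / 2) / 2
    minor = major * rng.uniform(0.5, 0.9)
    # Any rotation angle (a half-turn suffices by ellipse symmetry).
    rr, cc = ellipse(cy, cx, major, minor, shape=mask.shape,
                     rotation=rng.uniform(-np.pi / 2, np.pi / 2))
    mask[rr, cc] = 1
    return mask
```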
4.3 Extracting Edge Maps from Real Images

Edge maps extracted from real images are used to train the synthetic edge map generator. We extract edge maps from real medical images using the VGG16 model [15] with Richer Convolutional Features, pretrained as in Liu et al. [52]. As recommended for this approach, we further refine the output edge maps using non-maximum suppression with Structured Forests for edge thinning [53].
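For illustration, Structured Forests edge thinning [53] is available through OpenCV's ximgproc contrib module; the sketch below assumes opencv-contrib-python and a downloaded pretrained model file (the path is a placeholder). The RCF network [52] that produces our raw edge maps is not shown here.

```python
import cv2
import numpy as np

# Placeholder path to a pretrained Structured Forests model file.
detector = cv2.ximgproc.createStructuredEdgeDetection("model.yml.gz")

def thin_edges(image_bgr: np.ndarray) -> np.ndarray:
    """Detect edges and apply non-maximum suppression for edge thinning."""
    rgb = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2RGB).astype(np.float32) / 255.0
    edges = detector.detectEdges(rgb)                 # soft edge probabilities
    orientation = detector.computeOrientation(edges)  # per-pixel edge orientation
    return detector.edgesNms(edges, orientation)      # non-maximum suppression
```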
4.4 Generator 1: Segmentation Map to Edge Map Translation

To generate images with structural variation resembling that observed in the real data, we first convert the synthetic segmentation maps into realistic edge maps, complete with artifacts and natural irregularities, using the first generative model in TricycleGAN, G_s2e. As with all of the generators used in TricycleGAN, G_s2e is based on the pix2pix architecture defined in [33].

[Figure 2: Examples showing the output of each major step of the pipeline: synthetic segmentation map, synthetic edge map (Sobel filtering stage vs. G_s2e stage), synthetic image, and predicted segmentation.]

To train generator G_s2e to produce the appropriate variation seen in the real data, we frame this problem as an image completion problem (e.g., [54]) and make use of our previously extracted real edge maps. However, since G_s2e takes as input the very simplistic synthetic segmentation maps without any corresponding realistic edge maps (the edge maps extracted from real images do not have paired segmentation maps), on its own it learns to produce very simple edge maps. This is partially because the cycle consistency loss used to train TricycleGAN (see Section 4.7 below) encourages less complex edge maps. We therefore incorporate an additional generator based on the pix2pix architecture, denoted G_pf (a patch-filling generator), which enhances the complexity of the edge maps produced by G_s2e by adding artifacts and a greater degree of irregularity. Edge maps produced by G_s2e are therefore patch-occluded and then passed into G_pf for added complexity.

We pretrain G_pf on the edges2handbags dataset [55] before further training it on the edge maps extracted from the training set of real images after partial occlusion using randomly generated square masks. For each extracted edge map, up to 10 non-overlapping square masks with side length between 1/8 and 1/2 of the image length are used (a sketch of this patch-occlusion step is given below). Once this is done, the weights of G_pf are frozen, allowing it to produce the artifacts and complexity needed to emulate realistic edge maps, which G_s2e does not encounter, without loss of complexity due to the cycle consistency loss. Additionally, this allows us to use G_pf to help train G_s2e by serving as a regularizer. Specifically, the L1 loss used to regularize G_s2e is computed by taking the difference between the outputs generated by G_s2e and G_pf, thereby encouraging G_s2e to increase the complexity of its output to provide G_pf a better starting point for enhancement, and countering some of the simplicity encouraged by the cycle consistency loss.

A natural question is why TricycleGAN cannot simply use G_pf without G_s2e. In practice, G_pf requires its weights to be frozen in order to maintain its ability to generate artifacts. Therefore, it does not learn to contribute to minimizing the overall loss of the network, which breaks the unification of the overall model and ultimately causes TricycleGAN to perform suboptimally. If the weights of G_pf are allowed to change, the cycle consistency loss causes it to gradually simplify its edge maps until the desired complexity is no longer produced. Alternatively, using G_s2e without influence from G_pf results in overly simplistic edge maps that fail to provide enough challenge to the later generators of TricycleGAN. The required balance of complexity and optimal training for TricycleGAN is achieved by using G_pf both as an augmenter of the output of G_s2e and as a regularizer for its training.
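A minimal NumPy sketch of the patch-occlusion step; the skip-on-overlap policy is our simplification of "non-overlapping", and the function name is ours.

```python
import numpy as np

def patch_occlude(edge_map: np.ndarray, rng: np.random.Generator,
                  max_patches: int = 10) -> np.ndarray:
    """Occlude an edge map with up to `max_patches` non-overlapping random
    squares with side lengths between 1/8 and 1/2 of the image length."""
    size = edge_map.shape[0]
    occluded = edge_map.copy()
    taken = np.zeros_like(edge_map, dtype=bool)
    for _ in range(rng.integers(1, max_patches + 1)):
        side = int(rng.uniform(size / 8, size / 2))
        r = rng.integers(0, size - side)
        c = rng.integers(0, size - side)
        if taken[r:r + side, c:c + side].any():
            continue  # skip overlapping placements rather than retrying
        occluded[r:r + side, c:c + side] = 0
        taken[r:r + side, c:c + side] = True
    return occluded
```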
In practice, while G_s2e undergoes training with G_pf, we begin by using simple Sobel filtering to convert segmentation maps into edge maps so that meaningful outputs can be generated by the downstream generators. After 250 epochs of training TricycleGAN, the probability of using the output of G_s2e versus Sobel filtering is linearly ramped down until it reaches zero at epoch 500. From there, Sobel filtering is discontinued and G_pf takes only the output of G_s2e as input. In this way, the downstream generators of TricycleGAN can still learn to produce meaningful outputs while G_s2e learns to produce edge maps at all, and TricycleGAN as a whole can continue to improve as G_s2e learns to produce edge maps with greater complexity than provided by Sobel filtering.

Finally, G_s2e is ready to be used to construct realistic edge maps from the synthetic segmentation maps, with artifacts and added complexity contributed by G_pf. Figure 2 provides examples of generator outputs at each stage of the pipeline, comparing G_pf outputs during the early stages of training, when Sobel filtering is used, to later stages of training, when G_s2e is used.

4.5 Generator 2: Edge Map to Medical Image Translation

The second generator in TricycleGAN, G_e2m, translates edge maps into realistic images. As before, we use a pix2pix model here. G_e2m is trained using the extracted edge map and real image pairs obtained from our training data, while also training through the cycle consistency loss as part of TricycleGAN when synthetic images are used as input. Once trained, the generator is used to translate the edge maps previously constructed by G_s2e into realistic synthetic medical images.

4.6 Generator 3: Medical Image to Segmentation Map Translation

The final generator in TricycleGAN, G_m2s, performs image segmentation by translating a real or synthetic image into its corresponding segmentation map. Again, we use the pix2pix architecture here. The model is trained using the synthetic segmentation maps and their corresponding synthetic medical images created by G_e2m. Since no real image - segmentation map pairs are available for the training data, G_m2s is only trained using synthetic images. Thus, it undergoes considerably less training than the other generators. The specific loss functions used to train each generator are given below.

4.7 Training TricycleGAN

TricycleGAN is trained using a combination of unlabelled real images and labelled synthetic images. Since we extract edge maps from the real images, image - edge map pairs are available. Using the patch occlusion method, the real edge maps are used to pretrain G_pf, as described in Section 4.4. When training TricycleGAN with the weights of G_pf frozen, the output of G_pf is used to compute the L1 loss for G_s2e. The real image - edge map pairs are also used to train G_e2m. However, with no segmentation map available, no training signal is computable for G_m2s when TricycleGAN trains with real images. Instead, every 20th image is generated using a synthetic segmentation map that is passed through the full network, meaning that G_m2s trains at a slower rate compared to the other generators. A sketch of this sampling schedule, together with the Sobel ramp from Section 4.4, is given below.
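The following sketch illustrates the two sampling rules just described; the helper names are ours.

```python
import random

def use_sobel(epoch: int) -> bool:
    """Probability of bootstrapping with Sobel-filtered edge maps instead of
    G_s2e output: 1.0 until epoch 250, linearly ramped to 0.0 by epoch 500."""
    p = min(1.0, max(0.0, (500 - epoch) / 250))
    return random.random() < p

def use_synthetic(step: int) -> bool:
    """Every 20th training image is a synthetic segmentation map passed
    through the full network, providing the only training signal for G_m2s."""
    return step % 20 == 0
```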
To ensure that generator G_m2s learns to generate accurate segmentation maps, we constrain the output space of the network by incorporating the cycle consistency loss first introduced into adversarial learning with CycleGAN. This pushes G_m2s to translate an image back into the original image input into the network during training, i.e., the synthetic segmentation map [5]. A second motivating factor for using the cycle consistency loss is that it does not assume that the embedding spaces of the input and output are the same (in our case, segmentation maps are almost guaranteed to lie in a lower-dimensional embedding space than the medical images themselves), which provides some desired flexibility in TricycleGAN.

When training with synthetic images, the cycle consistency losses for each trainable generator are computed after the initial segmentation map is passed through the entire network, which provides a training signal for G_m2s. The predicted segmentation map generated by G_m2s is then looped back to the start of the network and passed through each generator again, and the difference between the outputs on the second pass and the outputs on the first pass is used to compute the cycle consistency loss (with the caveat that G_s2e uses the output of G_pf). This allows TricycleGAN to take as much advantage of the real data as possible, while using synthetic data to train the chain of generators to function as a cohesive network that serves a specific purpose. Since TricycleGAN uses three trainable generators instead of CycleGAN's two, we use an adapted cycle consistency loss that is split into three parts, one for each generator. The training pipeline is illustrated in Figure 1, and further details of how each generator is trained as part of the network and how the adapted cycle consistency loss is defined are provided below.

Loss for G_s2e

To train generator G_s2e, we start with the adversarial loss defined in [33]:

(1)    $\min_{\theta_G} \max_{\theta_D} \mathcal{L}_{GAN}(\theta_G, \theta_D) + \gamma \mathcal{L}_{L1}(\theta_G)$.

Here, θ_G and θ_D are parameters for a generator G and a discriminator D paired in a GAN,

(2)    $\mathcal{L}_{GAN}(\theta_G, \theta_D) = \mathbb{E}_{x \sim P_X}[\log(D(x))] + \mathbb{E}_{z \sim P_Z}[\log(1 - D(G(z)))]$

is the conventional adversarial loss [56], where x is a real image from the training set X with unknown distribution P_X and z is a random noise vector from a Gaussian distribution P_Z, and

(3)    $\mathcal{L}_{L1}(\theta_G) = \mathbb{E}_{x \sim P_X, z \sim P_Z} \| x - G(z) \|_1$

is simply the L1 loss between the generated image and the target image. When training with real images, the target image for G_s2e is the real extracted edge map, and the generated image is G_pf(G_s2e(x)), as described in Section 4.4. Full details of the construction of this loss function and the pix2pix network architecture can be found in [33].

To train the generators in TricycleGAN to work together as a cohesive network, we also include a cycle consistency loss for each generator. For an edge map x, cycle consistency is satisfied for G_s2e if

x → G_e2m(x) → G_m2s(G_e2m(x)) → G_pf(G_s2e(G_m2s(G_e2m(x)))) ≈ x.

The cycle consistency loss portion for G_s2e is defined as

(4)    $\mathcal{L}^{s2e}_{\mathrm{Cycle}}(G_{s2e}, G_{e2m}, G_{m2s}) = \mathbb{E}_{x \sim P_X} \| G_{pf}(G_{s2e}(G_{m2s}(G_{e2m}(x)))) - x \|_1$.

L^{s2e}_Cycle is computed by passing the edge map through the generators in TricycleGAN to produce an image with G_e2m, leading to a predicted segmentation map generated by G_m2s, which is then used as input to produce another edge map with G_s2e. The original real edge map and the newly generated edge map produced after a cycle through TricycleGAN are used to compute the cycle consistency loss for G_s2e.
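A PyTorch-style sketch of Eq. (4), treating the generators as callables on image tensors (the function name is ours):

```python
import torch.nn.functional as F

def cycle_loss_s2e(x_edge, G_s2e, G_pf, G_e2m, G_m2s):
    """Eq. (4): send a real edge map once around the cycle and penalize the
    L1 difference. G_pf is frozen during TricycleGAN training."""
    x_img = G_e2m(x_edge)        # edge map -> synthetic medical image
    x_seg = G_m2s(x_img)         # synthetic image -> predicted segmentation map
    x_back = G_pf(G_s2e(x_seg))  # segmentation map -> enhanced edge map
    return F.l1_loss(x_back, x_edge)
```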
Loss for G_e2m

Generator G_e2m is trained similarly to G_s2e. The adversarial loss and the L1 loss are defined as before, except the target domain for G_e2m is the medical image domain. As G_e2m starts at a different point in TricycleGAN's cycle, it contributes the second part of the total cycle consistency loss:

(5)    $\mathcal{L}^{e2m}_{\mathrm{Cycle}}(G_{s2e}, G_{e2m}, G_{m2s}) = \mathbb{E}_{x \sim P_X} \| G_{e2m}(G_{pf}(G_{s2e}(G_{m2s}(x)))) - x \|_1$,

where x is a real image from the application domain.

Loss for G_m2s

G_m2s can only undergo training when a ground truth segmentation map is available, and thus only receives a training signal when a synthetic segmentation map is passed through TricycleGAN. Importantly, G_m2s is not part of a GAN, and therefore has no adversarial loss. Instead, it acts as a segmentation model, and is trained using its portion of the cycle consistency loss, the binary cross-entropy loss, and the Tversky loss.

For G_m2s, we define the cycle consistency loss between a ground truth segmentation map x and its predicted segmentation as

(6)    $\mathcal{L}^{m2s}_{\mathrm{Cycle}}(G_{s2e}, G_{e2m}, G_{m2s}) = \mathbb{E}_{x \sim P_X} \| G_{m2s}(G_{e2m}(G_{pf}(G_{s2e}(x)))) - x \|_1$.

The binary cross-entropy loss, summed over all pixels i, is defined as

(7)    $\mathcal{L}_{BCE} = -\sum_i \left( y_i \log(\hat{y}_i) + (1 - y_i) \log(1 - \hat{y}_i) \right)$

for pixel-wise ground truth labels y_i and model predictions ŷ_i.

Finally, we add the Tversky loss, which is a generalization of the dice loss [57], to promote segmentation accuracy. The Tversky similarity index for a two-class problem is

(8)    $TI = \frac{\sum_i P_g(i) P_y(i) + \epsilon}{\sum_i P_g(i) P_y(i) + \alpha \sum_i Q_g(i) + \beta \sum_i Q_y(i) + \epsilon}$,

where

(9)    P_g(i) = 1 if pixel i is in the ground truth ROI and 0 otherwise, P_y(i) is the predicted probability that pixel i belongs to the ROI, Q_g(i) = (1 − P_g(i)) P_y(i), and Q_y(i) = P_g(i)(1 − P_y(i)).

The Tversky loss is simply L_T = 1 − TI. The parameters α and β allow for shifting emphasis towards minimizing false positives or false negatives, depending on the class imbalance exhibited in the data. Here we use α = β = 0.5.

Total Loss for Training TricycleGAN

To train generator G_s2e, we add together the cycle consistency losses for G_s2e (Eq. 4) and G_e2m (Eq. 5), along with its adversarial loss (Eq. 2) and its L1 loss (Eq. 3), to obtain

(10)    $\mathcal{L}_{s2e} = \mathcal{L}_{GAN}(\theta_{G_{s2e}}, \theta_{D_{s2e}}) + \lambda_1 \mathcal{L}_{L1}(\theta_{G_{s2e}}) + \lambda_2 \left( \mathcal{L}^{s2e}_{\mathrm{Cycle}} + \mathcal{L}^{e2m}_{\mathrm{Cycle}} \right)$.

Note that the sum of L^{s2e}_Cycle and L^{e2m}_Cycle is analogous to the sum of the forward and backward cycle consistency losses in CycleGAN.

Similarly, for G_e2m the total loss is

(11)    $\mathcal{L}_{e2m} = \mathcal{L}_{GAN}(\theta_{G_{e2m}}, \theta_{D_{e2m}}) + \lambda_1 \mathcal{L}_{L1}(\theta_{G_{e2m}}) + \lambda_2 \left( \mathcal{L}^{s2e}_{\mathrm{Cycle}} + \mathcal{L}^{e2m}_{\mathrm{Cycle}} \right)$.

To train G_m2s, we combine the cycle consistency losses for each trainable generator (Eq. 4, Eq. 5, and Eq. 6), the binary cross-entropy loss (Eq. 7), and the Tversky loss (Eq. 8) to obtain

(12)    $\mathcal{L}_{m2s} = \lambda_2 \left( \mathcal{L}^{s2e}_{\mathrm{Cycle}} + \mathcal{L}^{e2m}_{\mathrm{Cycle}} + \mathcal{L}^{m2s}_{\mathrm{Cycle}} \right) + \mathcal{L}_{BCE} + \mathcal{L}_T$.

For each of the above combined losses, λ_1 = 100, as recommended in [33], and λ_2 = 10. All generators were trained using the Adam optimizer with a learning rate of 2 × 10⁻⁴ and a batch size of 32. Training was considered complete when L_m2s did not improve for 20 consecutive epochs.
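A PyTorch sketch of the Tversky loss of Eqs. (8)-(9); with α = β = 0.5 it reduces to the dice loss. The function name is ours.

```python
import torch

def tversky_loss(y_pred: torch.Tensor, y_true: torch.Tensor,
                 alpha: float = 0.5, beta: float = 0.5,
                 eps: float = 1e-6) -> torch.Tensor:
    """L_T = 1 - TI. `y_pred` holds per-pixel ROI probabilities (P_y) and
    `y_true` the binary ground truth mask (P_g)."""
    tp = (y_true * y_pred).sum()        # intersection term
    fp = ((1 - y_true) * y_pred).sum()  # sum of Q_g: false positives
    fn = (y_true * (1 - y_pred)).sum()  # sum of Q_y: false negatives
    ti = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
    return 1 - ti
```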
4.8 Data Augmentation

During training, images are altered slightly to introduce additional variance, thereby allowing the resulting models to learn additional robustness and improve performance. The alterations for both datasets are as follows: a random translation of up to 30 pixels along both axes, and a horizontal flip with a probability of 0.5. The following additional alterations are applied to the ISIC 2018 training data: a random rotation of r ∈ {0, π/2, π, 3π/2}, up to a 20% change in brightness, up to a 50% change in contrast, up to a 5% change in hue, and up to a 50% change in saturation.

We also use style transfer to provide additional data augmentation for the ISIC 2018 dataset. This allows the generators to produce a greater variety of skin lesions, particularly to better capture the statistics of different skin colours and artifacts (see Figure S2). This allows TricycleGAN to take advantage of colour features in addition to the usual shape features upon which it relies. Rather than treating each style variation as a separate training image, styles were incorporated as an additional input channel (or dimension), enabling the networks to more effectively learn to produce the same segmentation for the various style variants of an image.

5 Experiments

We evaluate our model using five standard metrics computed using the withheld evaluation images: the F1 score (also called the dice coefficient), specificity, sensitivity, intersection over union (IoU; also called the Jaccard index), and pixel-wise classification accuracy (pACC).
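All five metrics follow directly from the pixel-wise confusion counts, as in the following NumPy sketch (function name ours):

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, truth: np.ndarray) -> dict:
    """F1/dice, specificity, sensitivity, IoU, and pixel accuracy for a pair
    of binary masks of the same shape."""
    pred, truth = pred.astype(bool), truth.astype(bool)
    tp = np.logical_and(pred, truth).sum()
    tn = np.logical_and(~pred, ~truth).sum()
    fp = np.logical_and(pred, ~truth).sum()
    fn = np.logical_and(~pred, truth).sum()
    return {
        "F1":          2 * tp / (2 * tp + fp + fn),
        "specificity": tn / (tn + fp),
        "sensitivity": tp / (tp + fn),
        "IoU":         tp / (tp + fp + fn),
        "pACC":        (tp + tn) / (tp + tn + fp + fn),
    }
```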
5.1 Semi-Supervised Segmentation

In addition to unsupervised segmentation, TricycleGAN can be used for semi-supervised segmentation. We implement semi-supervised learning by fine-tuning the unsupervised models with increasing amounts of labelled images. The results of this experiment are given in Figure 5.
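A minimal sketch of this fine-tuning procedure, assuming PyTorch; the epoch count and learning rate here are illustrative, not the exact values used, and the loss composition is assumed to mirror the one used to train G_m2s.

```python
import torch

def fine_tune(G_m2s, labelled_loader, epochs: int = 50, lr: float = 2e-4):
    """Fine-tune the unsupervised segmentation generator on a small pool of
    labelled image/mask pairs."""
    opt = torch.optim.Adam(G_m2s.parameters(), lr=lr)
    bce = torch.nn.BCELoss()
    for _ in range(epochs):
        for image, mask in labelled_loader:
            pred = G_m2s(image)
            # tversky_loss as sketched in Section 4.7.
            loss = bce(pred, mask) + tversky_loss(pred, mask)
            opt.zero_grad()
            loss.backward()
            opt.step()
```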
5.2 U-Net Comparison

We use the widely used and accessible U-net [23] to train a supervised segmentation model for the ultrasound dataset. As with the U-net embedded in TricycleGAN, we train the U-net using the sum of the pixel-wise binary cross-entropy and the dice coefficient as the loss function. We use Adam for optimization with a batch size of 1. Finally, we perform data augmentation with horizontal flips (50% probability) and horizontal and vertical translations of up to 26 pixels (10%).
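A torchvision-style sketch of this augmentation; note that in practice the same random transform must be applied jointly to the image and its mask, which this simple pipeline does not handle.

```python
from torchvision import transforms

# Horizontal flips with 50% probability and translations of up to 10%
# of the image size (~26 pixels for 256 x 256 images).
augment = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomAffine(degrees=0, translate=(0.1, 0.1)),
])
```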
5.3 Mask-RCNN Comparison

The Mask-RCNN [58] was the segmentation model that performed best during the ISIC 2018 competition [59], and it is therefore used here as a direct comparison. We use anchor sizes of 2^i, i ∈ { , , , , }, and 32 training ROIs per image; other hyperparameters were kept at recommended values. We perform data augmentation with both horizontal and vertical flips (50% probability), rotation of π/ or π/2, and a Gaussian blur of up to 5 standard deviations. Note that this is not the same as the winning ISIC 2018 model described in Section 5.5, which includes an additional encoder-decoder model for segmentation.
5.4 W-Net Comparison

To compare TricycleGAN with a popular unsupervised approach, we train W-net with the soft normalized cut term in the loss function [31]. In addition, we perform the recommended post-processing of the W-net-generated segmentation maps using a fully-connected CRF for edge recovery, and hierarchical image segmentation for contour grouping [60].
5.5 Winning ISIC 2018 Method

The winning method for the ISIC 2018 segmentation task [59] uses Mask-RCNN for supervised detection of the ROI. The ROI bounding box output by Mask-RCNN is then flipped and rotated by different angles to create four copies of the ROI. Each of these is then segmented with a custom encoder-decoder network, and the average segmentation map is taken as the final predicted mask.
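A PyTorch-style sketch of this flip/rotate-and-average procedure; the exact transforms used in [59] may differ from the four shown here.

```python
import torch

def tta_segment(model, roi: torch.Tensor) -> torch.Tensor:
    """Segment transformed copies of the ROI, undo each transform on the
    prediction, and average the resulting segmentation maps."""
    preds = [
        model(roi),
        torch.flip(model(torch.flip(roi, dims=[-1])), dims=[-1]),  # h-flip
        torch.flip(model(torch.flip(roi, dims=[-2])), dims=[-2]),  # v-flip
        torch.rot90(model(torch.rot90(roi, 1, dims=[-2, -1])),
                    -1, dims=[-2, -1]),                            # 90-degree turn
    ]
    return torch.stack(preds).mean(dim=0)
```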
5.6 Local Centre of Mass Comparison

We test a recent method for unsupervised medical image segmentation based on the local centre of mass (SegCM) on our kidney dataset [61]. We perform nested 5-fold cross-validation to tune the alpha and power hyperparameters (alpha ∈ { , , , }, power ∈ { , , }) with 240 training samples, 60 validation samples, and 138 test samples. Since this method provides a segmentation label for every discovered region of an image, we measure performance by taking the label that maximizes the dice score on a per-image basis, and accumulate these scores across all images in the test set.

6 Results

6.1 Kidney Ultrasound Segmentation

In Figure 3 we show the kidney segmentation masks produced by TricycleGAN compared with the clinician-provided ground truth for randomly selected images in the test set. In Table 1 we show the corresponding segmentation performance metrics. TricycleGAN trained in a semi-supervised manner using synthetic images and 45 real images, and results obtained by first training the model on synthetic data and then fine-tuning it on 45 real images, are listed under the semi-supervised results as TricycleGAN and TricycleGAN+10, respectively (with "+10" denoting 10% of the labelled data used for fine-tuning; the amount required before performance plateaued).

[Figure 3: Kidney segmentation masks comparing TricycleGAN (blue) to the clinician-provided ground truth labels (red) for a random subset of test images.]

Table 1: Performance metrics for ultrasound kidney segmentation. Values are means, with standard deviations in parentheses.

           Model          | F1          | Specificity | Sensitivity | IoU         | pACC
Unsup.     TricycleGAN    | 0.81 (0.10) | 0.93 (0.08) | 0.84 (0.14) | 0.69 (0.13) | 0.90 (0.06)
           SegCM          | 0.48        | 0.31        | 0.91        | 0.31        | 0.47
           W-net          | 0.46 (0.10) | 0.20 (0.05) | 0.98 (0.02) | 0.30 (0.12) | 0.41 (0.07)
Semi-Sup.  TricycleGAN    | 0.87 (0.11) | 0.97 (0.04) | 0.86 (0.13) | 0.78 (0.13) | 0.93 (0.05)
           TricycleGAN+10 | 0.88 (0.08) | 0.97 (0.03) | 0.88 (0.09) | 0.80 (0.11) | 0.94 (0.04)
Sup.       U-net          | 0.91 (0.09) | 0.97 (0.04) | 0.90 (0.10) | 0.84 (0.10) | 0.95 (0.03)

Table 2: Performance metrics for ISIC 2018 skin lesion boundary segmentation.

           Model          | th-IoU
Unsup.     TricycleGAN    | 0.691
Semi-Sup.  TricycleGAN+15 | 0.759
Sup.       Mask-RCNN      | 0.763
           Winner [59]    | 0.802
           Current Top    | 0.836
6.2 ISIC 2018 Skin Lesion Segmentation

Performance metrics on the ISIC 2018 dataset using TricycleGAN are shown in Table 2, along with results obtained by the competition winner [59] and the current top submission. Here we use the metrics given by the online submission system, which include a thresholded IoU (th-IoU). This metric sets all per-image IoU scores that are less than 0.65 to 0 before computing the mean IoU. Examples of the output masks on randomly selected test images are shown in Figure 4. As with the kidney dataset, TricycleGAN+15 denotes model performance after supervised fine-tuning of the unsupervised model with 15% of the labelled validation data. However, the submission page for model evaluation on the test set no longer provides metrics other than th-IoU, so other metrics are not presented for this model.
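The th-IoU metric is straightforward to compute, as in this NumPy sketch (function name ours):

```python
import numpy as np

def thresholded_iou(per_image_iou: np.ndarray, threshold: float = 0.65) -> float:
    """th-IoU as used by the ISIC 2018 submission system: per-image IoU
    scores below the threshold are set to 0 before averaging."""
    scores = np.where(per_image_iou >= threshold, per_image_iou, 0.0)
    return float(scores.mean())
```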
6.3 Effect of Fine-Tuning with Labelled Data

We evaluate the effect of fine-tuning TricycleGAN with increasing amounts of labelled data. The results presented in Figure 5 suggest that only 5-10% of the validation images are needed for optimal performance on the kidney ultrasound dataset, corresponding to 22-43 images. However, segmentation performance continues to improve on the ISIC 2018 dataset with additional data.
7 Discussion

We present TricycleGAN as a way of creating labelled synthetic medical images from unlabelled data to enable unsupervised image segmentation. TricycleGAN takes advantage of recent advances in image synthesis and generative modelling by making use of shape priors that can be used to construct templates of an object of interest. This is accomplished by extending the recently developed CycleGAN architecture to a network comprised of three generators that translate between images, their edge maps, and their segmentation maps.

TricycleGAN performs better than alternative unsupervised methods that do not require colour and texture features. For example, W-net performs poorly on the kidney segmentation task because it only identifies the ultrasound cone itself, rather than the kidney. We also show that our approach performs nearly as well as supervised methods for most images, though overall performance is diminished. Importantly, we show that with just a few training examples for supervised fine-tuning (here, only ~10% of the data used for the supervised models), TricycleGAN performs close to the level of supervised models.
References

[1] M. Ravi and R. S. Hegadi, "Pathological medical image segmentation: A quick review based on parametric techniques," Medical Imaging: Artificial Intelligence, Image Recognition, and Machine Learning Techniques, p. 207, 2019.
[2] N. Tajbakhsh, L. Jeyaseelan, Q. Li, J. Chiang, Z. Wu, and X. Ding, "Embracing imperfect datasets: A review of deep learning solutions for medical image segmentation," arXiv preprint arXiv:1908.10454, 2019.
[3] Y. Guo, Y. Liu, T. Georgiou, and M. S. Lew, "A review of semantic segmentation using deep neural networks," International Journal of Multimedia Information Retrieval, vol. 7, no. 2, pp. 87–93, 2018.
[4] C. L. Chowdhary and D. Acharjya, "Segmentation and feature extraction in medical imaging: a systematic review," Procedia Computer Science, vol. 167, pp. 26–36, 2020.
[5] J.-Y. Zhu, T. Park, P. Isola, and A. A. Efros, "Unpaired image-to-image translation using cycle-consistent adversarial networks," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2223–2232.
[6] J. Canny, "A computational approach to edge detection," in Readings in Computer Vision. Elsevier, 1987, pp. 184–203.
[7] R. Adams and L. Bischof, "Seeded region growing," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 16, no. 6, pp. 641–647, 1994.
[8] M. Kass, A. Witkin, and D. Terzopoulos, "Snakes: Active contour models," International Journal of Computer Vision, vol. 1, no. 4, pp. 321–331, 1988.
[9] B. Manjunath and R. Chellappa, "Unsupervised texture segmentation using Markov random field models," IEEE Transactions on Pattern Analysis & Machine Intelligence, vol. 5, pp. 478–482, 1991.
[10] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, "Deep learning for computer vision: A brief review," Computational Intelligence and Neuroscience, vol. 2018, 2018.
[11] H.-J. Yoo, "Deep convolution neural networks in computer vision," IEEE Transactions on Smart Processing & Computing, vol. 4, no. 1, pp. 35–43, 2015.
[12] A. Garcia-Garcia, S. Orts-Escolano, S. Oprea, V. Villena-Martinez, and J. Garcia-Rodriguez, "A review on deep learning techniques applied to semantic segmentation," arXiv preprint arXiv:1704.06857, 2017.
[13] L. Lu, Y. Zheng, G. Carneiro, and L. Yang, "Deep learning and convolutional neural networks for medical image computing," Advances in Computer Vision and Pattern Recognition. Springer: New York, NY, USA, 2017.
[14] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015, pp. 3431–3440.
[15] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," arXiv preprint arXiv:1409.1556, 2014.
[16] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid, "Efficient piecewise training of deep structured models for semantic segmentation," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 3194–3203.
[17] P. Moeskops, J. M. Wolterink, B. H. van der Velden, K. G. Gilhuijs, T. Leiner, M. A. Viergever, and I. Išgum, "Deep learning for multi-task medical image segmentation in multiple modalities," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 478–486.
[18] W. Zhang, R. Li, H. Deng, L. Wang, W. Lin, S. Ji, and D. Shen, "Deep convolutional neural networks for multi-modality isointense infant brain image segmentation," NeuroImage, vol. 108, pp. 214–224, 2015.
[19] J. Kleesiek, G. Urban, A. Hubert, D. Schwarz, K. Maier-Hein, M. Bendszus, and A. Biller, "Deep MRI brain extraction: a 3D convolutional neural network for skull stripping," NeuroImage, vol. 129, pp. 460–469, 2016.
[20] Q. Dou, H. Chen, L. Yu, L. Zhao, J. Qin, D. Wang, V. C. Mok, L. Shi, and P.-A. Heng, "Automatic detection of cerebral microbleeds from MR images via 3D convolutional neural networks," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1182–1195, 2016.
[21] T. Brosch, L. Y. Tang, Y. Yoo, D. K. Li, A. Traboulsee, and R. Tam, "Deep 3D convolutional encoder networks with shortcuts for multiscale feature integration applied to multiple sclerosis lesion segmentation," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1229–1239, 2016.
[22] D. C. Cireşan, A. Giusti, L. M. Gambardella, and J. Schmidhuber, "Mitosis detection in breast cancer histology images with deep neural networks," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2013, pp. 411–418.
[23] O. Ronneberger, P. Fischer, and T. Brox, "U-net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234–241.
[24] P. F. Christ, M. E. A. Elshaer, F. Ettlinger, S. Tatavarty, M. Bickel, P. Bilic, M. Rempfler, M. Armbruster, F. Hofmann, M. D'Anastasi et al., "Automatic liver and lesion segmentation in CT using cascaded fully convolutional neural networks and 3D conditional random fields," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 415–423.
[25] Ö. Çiçek, A. Abdulkadir, S. S. Lienkamp, T. Brox, and O. Ronneberger, "3D U-Net: learning dense volumetric segmentation from sparse annotation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2016, pp. 424–432.
[26] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz et al., "Attention U-net: learning where to look for the pancreas," arXiv preprint arXiv:1804.03999, 2018.
[27] V. Iglovikov and A. Shvets, "TernausNet: U-net with VGG11 encoder pre-trained on ImageNet for image segmentation," arXiv preprint arXiv:1801.05746, 2018.
[28] M. Havaei, A. Davy, D. Warde-Farley, A. Biard, A. Courville, Y. Bengio, C. Pal, P.-M. Jodoin, and H. Larochelle, "Brain tumor segmentation with deep neural networks," Medical Image Analysis, vol. 35, pp. 18–31, 2017.
[29] K. Kamnitsas, C. Ledig, V. F. Newcombe, J. P. Simpson, A. D. Kane, D. K. Menon, D. Rueckert, and B. Glocker, "Efficient multi-scale 3D CNN with fully connected CRF for accurate brain lesion segmentation," Medical Image Analysis, vol. 36, pp. 61–78, 2017.
[30] S. Pereira, A. Pinto, V. Alves, and C. A. Silva, "Brain tumor segmentation using convolutional neural networks in MRI images," IEEE Transactions on Medical Imaging, vol. 35, no. 5, pp. 1240–1251, 2016.
[31] X. Xia and B. Kulis, "W-net: A deep model for fully unsupervised image segmentation," arXiv preprint arXiv:1711.08506, 2017.
[32] K.-J. Hsu, Y.-Y. Lin, and Y.-Y. Chuang, "DeepCO3: Deep instance co-segmentation by co-peak search and co-saliency detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2019, pp. 8846–8855.
[33] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, "Image-to-image translation with conditional adversarial networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 1125–1134.
[34] P. Luc, C. Couprie, S. Chintala, and J. Verbeek, "Semantic segmentation using adversarial networks," arXiv preprint arXiv:1611.08408, 2016.
[35] Y. Xue, T. Xu, H. Zhang, L. R. Long, and X. Huang, "SegAN: Adversarial network with multi-scale L1 loss for medical image segmentation," Neuroinformatics, vol. 16, no. 3-4, pp. 383–392, 2018.
[36] J. Son, S. J. Park, and K.-H. Jung, "Retinal vessel segmentation in fundoscopic images with generative adversarial networks," arXiv preprint arXiv:1706.09318, 2017.
[37] D. Yang, D. Xu, S. K. Zhou, B. Georgescu, M. Chen, S. Grbic, D. Metaxas, and D. Comaniciu, "Automatic liver segmentation using an adversarial image-to-image network," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2017, pp. 507–515.
[38] S. Hong, H. Noh, and B. Han, "Decoupled deep neural network for semi-supervised semantic segmentation," in Advances in Neural Information Processing Systems, 2015, pp. 1495–1503.
[39] H.-C. Shin, N. A. Tenenholtz, J. K. Rogers, C. G. Schwarz, M. L. Senjem, J. L. Gunter, K. P. Andriole, and M. Michalski, "Medical image synthesis for data augmentation and anonymization using generative adversarial networks," in International Workshop on Simulation and Synthesis in Medical Imaging. Springer, 2018, pp. 1–11.
[40] N. Souly, C. Spampinato, and M. Shah, "Semi supervised semantic segmentation using generative adversarial network," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 5688–5696.
[41] J. T. Guibas, T. S. Virdi, and P. S. Li, "Synthetic medical images from dual generative adversarial networks," arXiv preprint arXiv:1709.01872, 2018.
[42] P. Ostyakov, R. Suvorov, E. Logacheva, O. Khomenko, and S. I. Nikolenko, "SEIGAN: Towards compositional image generation by simultaneously learning to segment, enhance, and inpaint," arXiv preprint arXiv:1811.07630, 2019.
[43] M. Chen, T. Artières, and L. Denoyer, "Unsupervised object segmentation by redrawing," arXiv preprint arXiv:1905.13539, 2019.
[44] K. Dhindsa, L. C. Smail, M. McGrath, L. H. Braga, S. Becker, and R. R. Sonnadara, "Grading prenatal hydronephrosis from ultrasound imaging using deep convolutional neural networks," in . IEEE, 2018, pp. 80–87.
[45] U. Sivanesan, L. H. Braga, R. R. Sonnadara, and K. Dhindsa, "Unsupervised medical image segmentation with adversarial networks: From edge diagrams to segmentation maps," arXiv preprint arXiv:1911.05140, 2019.
[46] L. C. Smail, K. Dhindsa, L. H. Braga, S. Becker, and R. R. Sonnadara, "Using deep learning algorithms to grade hydronephrosis severity: Toward a clinical adjunct," Frontiers in Pediatrics, vol. 8, 2020.
[47] N. Codella, V. Rotemberg, P. Tschandl, M. E. Celebi, S. Dusza, D. Gutman, B. Helba, A. Kalloo, K. Liopyris, M. Marchetti et al., "Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the International Skin Imaging Collaboration (ISIC)," arXiv preprint arXiv:1902.03368, 2019.
[48] P. Tschandl, C. Rosendahl, and H. Kittler, "The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions," Scientific Data, vol. 5, p. 180161, 2018.
[49] P. C. Tay, C. D. Garson, S. T. Acton, and J. A. Hossack, "Ultrasound despeckling for contrast enhancement," IEEE Transactions on Image Processing, vol. 19, no. 7, pp. 1847–1860, 2010.
[50] X. Zhou, C. Yao, H. Wen, Y. Wang, S. Zhou, W. He, and J. Liang, "EAST: an efficient and accurate scene text detector," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 5551–5560.
[51] S. M. Pizer, E. P. Amburn, J. D. Austin, R. Cromartie, A. Geselowitz, T. Greer, B. ter Haar Romeny, J. B. Zimmerman, and K. Zuiderveld, "Adaptive histogram equalization and its variations," Computer Vision, Graphics, and Image Processing, vol. 39, no. 3, pp. 355–368, 1987.
[52] Y. Liu, M.-M. Cheng, X. Hu, K. Wang, and X. Bai, "Richer convolutional features for edge detection," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3000–3009.
[53] P. Dollár and C. L. Zitnick, "Fast edge detection using structured forests," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 37, no. 8, pp. 1558–1570, 2014.
[54] P. Liu, X. Qi, P. He, Y. Li, M. R. Lyu, and I. King, "Semantically consistent image completion with fine-grained details," arXiv preprint arXiv:1711.09345, 2017.
[55] J.-Y. Zhu, P. Krähenbühl, E. Shechtman, and A. A. Efros, "Generative visual manipulation on the natural image manifold," in European Conference on Computer Vision. Springer, 2016, pp. 597–613.
[56] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio, "Generative adversarial nets," in Advances in Neural Information Processing Systems, 2014, pp. 2672–2680.
[57] N. Abraham and N. M. Khan, "A novel focal Tversky loss function with improved attention U-net for lesion segmentation," in . IEEE, 2019, pp. 683–687.
[58] K. He, G. Gkioxari, P. Dollár, and R. Girshick, "Mask R-CNN," in Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 2961–2969.
[59] C. Qian, T. Liu, H. Jiang, Z. Wang, P. Wang, M. Guan, and B. Sun, "A detection and segmentation architecture for skin lesion segmentation on dermoscopy images," arXiv preprint arXiv:1809.03917, 2018.
[60] P. Arbelaez, M. Maire, C. Fowlkes, and J. Malik, "Contour detection and hierarchical image segmentation," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 33, no. 5, pp. 898–916, May 2011.
[61] I. Aganj, M. G. Harisinghani, R. Weissleder, and B. Fischl, "Unsupervised medical image segmentation based on the local center of mass," Scientific Reports, vol. 8, no. 1, pp. 1–8, 2018.

Supplementary Data
[Figure S1: Examples from the kidney ultrasound dataset with different hydronephrosis severity grades, panels (a)-(d) showing grades 1 (low severity) to 4 (severe hydronephrosis).]

[Figure S2: Examples of generated skin lesions. From left to right: synthetic segmentations, generated edge maps, followed by 12 style variations generated by applying 12 different styles with TricycleGAN generator G_e2m.]