Cell image segmentation by Feature Random Enhancement Module
Takamasa Ando, Kazuhiro Hotta
Meijo University, 1-501 Shiogamaguchi, Tempaku-ku, Nagoya 468-8502, Japan
[email protected]

Keywords: Cell Image, Semantic Segmentation, U-Net, Feature Random Enhancement Module

Abstract: It is important to extract good features using an encoder to realize semantic segmentation with high accuracy. Although the loss function is optimized while training a deep neural network, layers far from the layer computing the loss are difficult to train. Skip connections are effective for this problem, but some layers still remain far from the loss function. In this paper, we propose the Feature Random Enhancement Module, which enhances features randomly during training only. By emphasizing the features at layers far from the loss function, we can train those layers well and improve the accuracy. In experiments, we evaluated the proposed module on two kinds of cell image datasets, and our module improved the segmentation accuracy without increasing the computational cost in the test phase.
1 INTRODUCTION

In recent years, the development of deep learning technology has been remarkable, and there is demand to use it in various situations. Since the segmentation of cell images obtained by microscopes is performed manually by human experts, the results tend to be subjective. Objective results produced by consistent criteria are required, and deep learning technology can provide them (Ciresan et al., 2012). However, the optimal network for segmentation using deep learning has not been established yet. Even when the accuracy is not very high, such methods are actually used in the field of cell biology to obtain objective results. Therefore, an automatic segmentation method with high accuracy is desired. U-Net is still widely used for segmentation of microscope images because it works well with a small number of training images and obtains high accuracy without adjusting hyperparameters. For this reason, many improvements of U-Net have been proposed for microscope images (Jiawei Zhang et al., 2018; Qiangguo Jin et al., 2019; Qiangguo Jin et al., 2018).

This study belongs to that line of work and improves the accuracy of U-Net. Although conventional improvements are achieved by deepening the network, the proposed method requires no additional computational resources during inference. It therefore retains the advantage of U-Net, its small demand for computing resources, which makes it a significant proposal for cell image segmentation, where lightweight and accurate models are in demand.

A neural network such as a CNN basically uses backpropagation for training. For this reason, layers near the layer computing the loss are updated more than layers far from the loss (Y. Bengio et al., 1994). To solve this problem, ResNet (Kaiming He et al., 2016) used skip connections and improved the accuracy. U-Net (Philipp Fischer et al., 2015) is a famous deep neural network for cell segmentation tasks.
U-Net also has skip connections between encoder and decoder, and they contribute to improving the accuracy. In general, it is well known that skip connections give the decoder the information about location and fine objects that was lost in the encoder. However, we consider that the same mechanism as in ResNet also makes skip connections improve the accuracy: through the skip connections, the loss is propagated well to the encoder, and the weights are updated successfully. This is also a reason why U-Net improved the accuracy in comparison with a standard encoder-decoder CNN.

Figure 1: U-Net and the problem.

Figure 1 shows the structure of U-Net. We see that skip connections are effective for propagating the loss to the encoder. However, the layers shown in yellow in Figure 1 are the farthest from the loss at the final layer. Therefore, in the case of U-Net, the yellow layer in the figure is the most difficult to train even though it holds semantic information. In this paper, we propose a new module to train that layer effectively. We consider how to enhance the feature maps at the yellow layer, which is the farthest layer from the loss function. In conventional methods, the yellow layer is difficult to learn, and the network learns to decrease the feature values at the yellow layer so that they do not affect the output. This phenomenon is shown in Section 3.1. Feature values at the yellow layer are smaller than those at the skip connection from the encoder, so the features at the yellow layer contribute little to the segmentation result. Consequently, the model without the yellow layer sometimes achieves higher accuracy than the model with it, as shown in Section 4.3. To train the yellow layer effectively, we select some feature maps randomly at the yellow layer and increase their absolute values by multiplying them by a large constant. Although the features at the yellow layer and the features from the encoder are concatenated, the features at the yellow layer are used mainly because they have larger values after the constant multiplication.
If the feature maps selected by our module are not effective for classification, the network suffers a large loss. Therefore, the selected feature maps are trained well by gradient descent.

In experiments, we evaluated our method on two kinds of cell image datasets. Intersection over Union (IoU) is used as the evaluation measure. The effectiveness of the proposed module was shown in comparison with the conventional U-Net without our module and U-Net with SuperVision, in which the loss function is also computed at the yellow layer.

The structure of this paper is as follows. Section 2 describes related works. Section 3 describes the details of the proposed method. Experimental results on two kinds of cell image datasets are shown in Section 4. Finally, we summarize our work and discuss future works in Section 5.
2 RELATED WORKS

U-Net is a kind of encoder-decoder CNN. Unlike PSPNet (Hengshuang Zhao et al., 2017), an encoder-decoder CNN does not use features in parallel; features are extracted in series. Thus, in an encoder-decoder CNN, layers far from the layer computing the loss are not updated well. ResNet and U-Net mitigated this problem with skip connections.

There is also a technique called deep supervision, proposed in Deeply-Supervised Nets (Chen-Yu Lee et al., 2015), to address the problem. In deep supervision, the loss is also computed at a middle layer, so layers far from the final layer are updated well by the extra supervision. U-Net++ (Zongwei Zhou et al., 2018) also used this technique. However, forcing the loss from the ground truth in the middle of U-Net may not yield an intermediate representation suited for better inference. In addition, U-Net++ has a structure in which the output image is restored by decoders from various parts of the encoder, and multiple decoders are connected to each other. This removes the advantage of U-Net, its small computational cost, because the multiple decoders bring a large number of parameters. In this paper, we propose a new method based on the merits and demerits of these previous studies.

There are also methods that we referred to when designing the new method. In the proposed method, feature enhancement is performed on some feature maps during training, and there are many techniques for weighting feature maps. SENet (Jie Hu et al., 2018) proposed to weight important channels. Attention, which was proposed in the field of natural language processing (Ashish Vaswani et al., 2017), is also used in the field of images. In recent years, many methods have been proposed that focus on channels (Yulun Zhang et al., 2018; Sanghyun Woo et al., 2018; Yanting Hu et al., 2018). Attention U-Net used attention on the skip connections (Ozan Oktay et al., 2018).

Figure 2: Comparison of feature values at the yellow layer and the skip connection.
Dropout (Nitish Srivastava et al., 2014) is also related to our approach. Dropout sets part of the feature map to 0 during training only, which prevents overfitting by randomly removing elements. Our method instead randomly enhances some feature maps at the farthest layer from the loss function: we do not set elements to 0 but enlarge some feature maps. When an element is set to 0, backpropagation through that element stops. In our method, features are enlarged randomly so that backpropagation works effectively for the farthest layer.
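The contrast with Dropout can be sketched numerically. The following NumPy toy example (synthetic tensors and a hypothetical choice of B and X, not the paper's code) shows that zeroing a channel removes its gradient signal entirely, while scaling a channel by a constant scales its gradient by the same constant through the chain rule:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy post-ReLU feature tensor: (channels, height, width).
features = np.abs(rng.normal(size=(8, 4, 4)))

B, X = 3, 100.0  # hypothetical: channels to modify, scale factor
selected = rng.choice(features.shape[0], size=B, replace=False)

# Dropout-style masking: the selected channels become 0, so the loss
# gradient through them is 0 as well -- they receive no update signal.
dropped = features.copy()
dropped[selected] = 0.0

# Random enhancement: the selected channels are scaled by a large
# constant, so the gradient through them is scaled by the same constant,
# giving far layers a stronger update signal instead of none.
enhanced = features.copy()
enhanced[selected] *= X
```

Here `dropped[selected]` is exactly zero while `enhanced[selected]` is the original activation amplified by X; the unselected channels are untouched in both cases.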
3 PROPOSED METHOD

This section describes the proposed method. Section 3.1 gives an overview of the proposed method. Section 3.2 describes the implementation details of our method.
3.1 Overview

When we obtain a segmentation result with U-Net, the magnitude of the features at the yellow layer in Fig. 1 is often smaller than that of the features at the skip connection from encoder to decoder. Fig. 2 shows this for U-Net trained on two different datasets. The two lines in each graph show the average feature values at the yellow layer in Fig. 1 and those at the skip connection from encoder to decoder. Note that both features are extracted after the ReLU function, so the two feature maps have positive values. The magnitude of these values in Fig. 2 indicates the magnitude of influence on the network's outputs, because a convolution is applied after the two feature maps in Fig. 2 are concatenated. The figure shows that the encoder's output features are obviously smaller than the features at the skip connection, so the features at the yellow layer are not used effectively. Batch renormalization is applied before the ReLU function in Fig. 2; without backpropagation, both feature maps would have similar average values because the features are normalized by batch renormalization. Therefore, it is the training of U-Net itself that makes the features at the yellow layer small.

Does this fact show that the yellow layer is not required? Our answer is "no". This phenomenon arises because layers near the layer computing the loss are updated well and far layers are not. The yellow layer in Fig. 1 is the farthest layer from the loss function, because the encoder is updated through the skip connections. Therefore, the network relies on the layers connected by skip connections rather than on the yellow layer, which is difficult to update.

An easy solution would be a new normalization of the yellow layer using the average and variance of the skip connection, which would prevent the yellow layer's values from becoming small. However, this creates a new problem: the network can no longer learn the appropriate ratio between the yellow layer and the skip connection values. Therefore, we propose the Feature Random Enhancement Module to solve the problem.
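The magnitude argument above can be illustrated with a small numerical sketch. The activations below are synthetic (not the authors' measurements): two post-ReLU tensors stand in for the yellow-layer path and the skip connection, and the channel means act as a rough proxy for each path's influence, since the next convolution mixes the concatenated channels linearly.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical post-ReLU activations at the concatenation point:
# the deepest ("yellow") path with a smaller pre-activation mean,
# and the encoder skip connection with a larger one.
yellow = np.maximum(rng.normal(loc=-0.5, size=(512, 16, 16)), 0.0)
skip = np.maximum(rng.normal(loc=0.5, size=(512, 16, 16)), 0.0)

# Both are non-negative (post-ReLU), but the yellow path's average
# activation -- and hence its influence on the mixed output -- is smaller,
# mirroring the pattern reported in Fig. 2.
yellow_mean = float(yellow.mean())
skip_mean = float(skip.mean())
```

Under this setup `yellow_mean` comes out well below `skip_mean`, which is the imbalance the proposed module is designed to correct.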
Our method is a soft constraint that emphasizes some feature maps, in contrast to normalization. The soft constraint means emphasizing "a part of the feature maps" at the yellow layer, whereas the normalization described above would increase the values of "all feature maps" at the yellow layer to match the skip connection. Specifically, we select some feature maps at the yellow layer randomly at each epoch and increase the absolute values of the features by multiplying them by a large constant. This allows the features at the yellow layer to be used effectively. If the features enlarged by our module are not effective, the network suffers a large loss; therefore, the selected feature maps are trained well by gradient descent. The reason why we select feature maps randomly is to prevent vanishing gradients.

Our method is not used in the test phase. This is because the network is able to learn the appropriate magnitude of values in the non-selected feature maps; the evidence comes from the experimental results in Section 4.4. Since our method is a soft constraint, it solves the problem that normalization at the concatenation of the two feature maps cannot solve. Thus, the accuracy is improved without changing the inference time or computational resources, because our method is not used in the test phase. This is an advantage of our method, whereas many other methods deepen the network to improve the accuracy.
3.2 Implementation Details

In the proposed method for U-Net, the encoder's output is enhanced by multiplying feature maps, selected randomly at each epoch, by X. The number of feature maps selected randomly is denoted as B. The feature maps are re-selected at each epoch while the network weights are updated during training.

The closest approach is Dropout. Similarly to our method, Dropout is used only during training, and some neurons are randomly set to 0. If an element is set to 0, backpropagation stops at that element; Dropout is a method to form ensembles. The proposed method differs from Dropout: we use an adjustable feature emphasis instead of setting values to 0. This is designed for the case where there is a difference in the ease of updating between near and far layers.

The proposed method can be implemented in addition to Dropout; it does not replace Dropout. Our method is difficult to apply to many layers because adequate parameters X and B must be determined at each layer. Implementing it at the farthest layer from the loss function solves the problem presented in this paper and is the most effective choice.

Figure 3 shows the detailed description of the proposed method. We multiply all values in the selected feature maps at the end of the encoder, shown as the yellow layer in Figure 1, by X. This operation is performed only in the training phase. The enhanced feature maps are selected randomly, so not all channels in the encoder's output are enhanced. We need to select the hyperparameters X and B appropriately. The hyperparameters were searched using Tree-structured Parzen Estimator (TPE) optimization (J. Bergstra et al., 2011), which uses Bayesian optimization.
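The mechanics described above can be sketched as a small module. This is a NumPy stand-in for the actual network code (which the paper does not provide); the class and method names are hypothetical, and the (channels, H, W) layout is an assumption:

```python
import numpy as np


class FeatureRandomEnhancement:
    """Sketch of the Feature Random Enhancement Module.

    B and X are the paper's hyperparameters: B channels are drawn
    uniformly at random (re-drawn each epoch) and multiplied by the
    constant X during training only.
    """

    def __init__(self, num_channels: int, B: int, X: float, seed: int = 0):
        self.num_channels = num_channels
        self.B = B  # number of feature maps to enhance
        self.X = X  # constant multiplier
        self.rng = np.random.default_rng(seed)
        self.resample()

    def resample(self) -> None:
        # Called once per epoch: pick B distinct channels at random.
        self.selected = self.rng.choice(
            self.num_channels, size=self.B, replace=False)

    def __call__(self, features: np.ndarray, training: bool) -> np.ndarray:
        # Identity at test time: no extra inference cost.
        if not training:
            return features
        out = features.copy()
        out[self.selected] *= self.X  # enhance the selected maps only
        return out


# Usage with the paper's mouse-dataset setting (512 channels, B=162, X=632):
module = FeatureRandomEnhancement(num_channels=512, B=162, X=632.0)
x = np.ones((512, 8, 8))
train_out = module(x, training=True)   # 162 channels scaled by 632
test_out = module(x, training=False)   # unchanged
```

Because the module is the identity at test time, it adds no parameters and no inference cost; in a real training loop `resample()` would be invoked at the start of each epoch.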
4 EXPERIMENTS

This section shows the experimental results of the proposed method. Sections 4.1 and 4.2 describe the datasets and the network used in the experiments. Experimental results are shown in Section 4.3. In Section 4.4, additional experiments are conducted for further analysis.
4.1 Datasets

In this paper, we conduct experiments on two kinds of cell image datasets. The first dataset includes only 50 fluorescence images of the liver of a transgenic mouse expressing a fluorescent marker on the cell membrane and nucleus (A. Imanishi et al., 2018). The size of the images is 256 × 256 pixels, and they consist of three classes: cell membrane, cell nucleus, and background. We use 35 images for training, 5 images for validation, and 10 images for test.

The second dataset includes 20 Drosophila feather images (Stephan Gerhard et al., 2013). The size of the images is 1024 × 1024 pixels.
4.2 Network

The proposed method introduces a module that randomly enhances the features at the final layer of the encoder during training only. We call it the "Feature Random Enhancement Module". Fig. 4 shows the U-Net used in this paper. As shown in Fig. 4, the proposed module was implemented on a standard U-Net with SE blocks. At training time, some feature maps are selected from the 512 feature maps at the farthest layer from the loss function, shown at the bottom right in Fig. 4, and the values in the selected feature maps are multiplied by X.
4.3 Experimental Results

In all experiments, we trained all methods for 2000 epochs, by which the learning converges sufficiently, and evaluation is done at the epoch with the highest mIoU on the validation images. We used the softmax cross-entropy loss. The hyperparameters B and X were searched 50 times using the Tree-structured Parzen Estimator (TPE) algorithm, which seems to be a sufficient number.

For comparison, U-Net with only SE blocks is evaluated as the baseline. U-Net with SE blocks but without the yellow layer in Fig. 1 (the yellow dotted square in Fig. 4) is evaluated to present the problem that deep layers are difficult to train; the problem is that the model without those deep layers achieves higher accuracy than the model with them.

We also evaluate U-Net with SE blocks using SuperVision instead of the proposed module, in order to show the effectiveness of the Feature Random Enhancement Module. SuperVision uses a 1×1 convolution to change the number of channels to the number of classes at the end of the encoder, resizes the result to the size of the input image, and computes the softmax cross-entropy loss there. When we use SuperVision, we must optimize two losses: the first is the standard softmax cross-entropy loss at the final layer, and the second is the supervision loss. In general, the balancing weight for the two losses should be optimized:

Loss = (1 - λ) * Loss_1 + λ * Loss_2    (1)

where λ is the balancing weight. This parameter is also optimized by TPE. The search was performed 15 times to find an appropriate value; since λ is a single parameter, the number of searches is smaller than that for the two parameters B and X in our method.

Figure 3: Feature Random Enhancement Module.
Figure 4: U-Net with SE block.

First, we show the experimental results on the mouse cell dataset. Table 1 shows the results when we use B = 162 and X = 632, which give the highest mean IoU on the validation set. Table 1 shows that the accuracy of our method is improved in all classes in comparison with the conventional models.
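The two-loss objective of the SuperVision baseline in Eq. (1) is a simple convex blend; a minimal sketch (function name is ours, not from the paper):

```python
def combined_supervision_loss(loss_final: float,
                              loss_supervision: float,
                              lam: float) -> float:
    """Eq. (1): blend the final-layer loss with the deep-supervision loss.

    lam is the balancing weight λ searched by TPE (the paper reports
    λ = 0.3257 for the mouse dataset and λ = 0.2781 for Drosophila).
    """
    return (1.0 - lam) * loss_final + lam * loss_supervision
```

With lam = 0 the objective reduces to the standard final-layer loss, and with lam = 1 only the intermediate supervision is trained, so λ interpolates between the two regimes.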
We confirmed a 2.12% improvement over the baseline. In addition, the accuracy of the proposed method is better than U-Net + SENet without deep layers. This result shows that our method can train deep layers effectively. Furthermore, the accuracy of U-Net + SENet without deep layers is better than that of U-Net + SENet, which shows that the deep layers of the U-Net + SENet used in the experiments are far from the loss and not updated well. The mean IoU is not improved by U-Net with SuperVision (λ = 0.3257, determined by TPE), even though its loss is computed at the end of the encoder, where our module is used. When we use Dropout with the same percentage as the proposed method (dropout rate = B / the number of filters = 162/512), it did not improve the accuracy either.

Table 1: IoU on the transgenic mouse cell dataset (membrane, nucleus, background, and mean IoU), comparing U-Net + SE block, U-Net + SE block without deep layers, U-Net + SE block + Supervision, Dropout with the same percentage as the proposed method, and the proposed method.

Figure 5 shows the qualitative results. In Fig. 5, (a) is the input image, (b) is the ground truth, (c), (d), and (e) are the results of the conventional U-Net with SE block, U-Net with SE block without deep layers, and U-Net with SE block + SuperVision, respectively, and (f) is the result of the proposed method. The cell images are blurred, and it is difficult for non-experts to assign class labels. This is because cells are killed by strong light, so the images are captured under low illuminance. In the conventional method (d), there are many missed and over-detected cell membranes and nuclei. In addition, in the conventional method (e), there are many undetected membranes. However, the proposed method (f) obtains more accurate segmentation results. This is because the proposed module enables the network to extract features from areas where training was not done successfully in the conventional methods. In addition, the method using SuperVision gave lower accuracy than the proposed method. We consider that the loss from the middle layer does not always yield an intermediate representation that leads to a good segmentation result.

Figure 5: Segmentation results on transgenic mouse cell images.

Next, we show the experimental results on the Drosophila dataset. Table 2 shows the results when we use B = 8 and X = 250, which give the highest mIoU on the validation set. Table 2 shows that the accuracy of the proposed method is higher than that of the U-Net with SE block, and the mean IoU was improved by 2.97% over the baseline. Furthermore, the accuracy was improved in comparison with the conventional models. In addition, the accuracy of U-Net + SENet without deep layers is better than that of U-Net + SENet. This result is the same as on the mouse cell dataset and reinforces the theory that the deep layers of the U-Net + SENet used in the experiments are not updated well. The results of U-Net with SuperVision (λ = 0.2781, determined by TPE) and Dropout with the same percentage as the proposed method (dropout rate = B / the number of filters = 8/512) reinforce the theory.

Table 2: IoU on the Drosophila dataset.

Method                                            membrane[%]  nucleus[%]  background[%]  synapse[%]  mIoU[%]
U-Net + SE block                                     91.80       76.87        76.89         50.46      73.98
U-Net + SE block without deep layers                 92.32       78.24        76.93         51.31      74.70
U-Net + SE block + Supervision                       92.39       77.77        78.24         52.21      75.15
Dropout (same percentage as the proposed method)     92.40       77.98        78.70         50.23      74.83
Proposed method                                      92.93       78.71        78.02         58.14      76.95

Figure 6 shows the qualitative results. In Fig. 6, (a) is the input image, (b) is the ground truth, (c), (d), and (e) are the results of the U-Net with SE block, U-Net with SE block without deep layers, and U-Net with SE block + SuperVision, and (f) is the result of the proposed method. In the Drosophila dataset, the images seem to contain enough information, but it is difficult for ordinary people to assign the correct class label to each pixel. Nevertheless, we confirmed that the proposed method (f) performs better segmentation of cell membranes and nuclei.

Figure 6: Segmentation results on the Drosophila dataset.

Figures 7 and 8 show the results of the hyperparameter search using the TPE algorithm. The vertical and horizontal axes show the hyperparameters B and X of the proposed module. Red points show high mean IoU on the validation set, and blue points show low accuracy. We see that the TPE algorithm concentrates its search on regions with high accuracy. Of course, the optimal B and X depend on the dataset, but we can find good hyperparameters by TPE.

Figure 7: TPE algorithm on the transgenic mouse cell dataset.
Figure 8: TPE algorithm on the Drosophila feather dataset.

Table 3: For the additional experiment, IoU on the transgenic mouse cell dataset (membrane, nucleus, background, and mean IoU) for U-Net + SE block and the proposed method (additional experiment).

Figure 9: Sum of the feature map with/without the Feature Random Enhancement Module.
4.4 Additional Experiments

The proposed method emphasizes some feature maps randomly at each epoch to prevent over-fitting. However, even when we apply 10,000-times enhancement to ten fixed filters during training, the IoU accuracy improved by about 1%, as shown in Table 3. This shows that the effect can be seen even if the emphasis is not performed randomly. We therefore observed the sum of the values in the feature maps after the convolution and ReLU function, because we want to confirm the behavior of the enhanced and unenhanced filters.

Figure 9 (a) shows the sum of the feature map without the proposed module, and (b) shows the sum of the feature map emphasized by the proposed module. In (c), we used the proposed module but computed the sum of a non-emphasized feature map for comparison with (b). In (a), the sum of the feature map gradually decreases, so the feature maps at the end of the encoder are not used effectively. On the other hand, in (b) and (c), the sum of the feature map increases through training. This means that the feature maps at the end of the encoder automatically obtain large values, and those features are used to obtain the segmentation results. Moreover, from the change of the values in (c) compared with (a), we see that the proposed module also has an effect on the feature maps that are not emphasized.
5 CONCLUSIONS

In this paper, we introduced the Feature Random Enhancement Module, which enhances feature maps randomly during training only. We succeeded in improving the accuracy of cell image segmentation while the amount of computation during inference does not change.

A future work is to establish a method for deriving the parameters of the proposed module. Although TPE seems to be effective for the parameter search, it requires training for each parameter setting until the accuracy converges. Therefore, inference is fast but training takes a long time. Thus, we would like to study whether those parameters can be determined faster, without waiting for convergence.
ACKNOWLEDGEMENT
This work is partially supported by MEXT/JSPS KAKENHI Grant Numbers 18K111382 and 20H05427.