Biomedical Image Segmentation by Retina-like Sequential Attention Mechanism Using Only A Few Training Images
Shohei Hayashi, Bisser Raytchev, Toru Tamaki, Kazufumi Kaneda
Department of Information Engineering, Hiroshima University, Japan
Abstract.
In this paper we propose a novel deep learning-based algorithm for biomedical image segmentation which uses a sequential attention mechanism able to shift the focus of attention across the image in a selective way, allowing subareas which are more difficult to classify to be processed at increased resolution. The spatial distribution of class information in each subarea is learned using a retina-like representation where resolution decreases with distance from the center of attention. The final segmentation is achieved by averaging class predictions over overlapping subareas, utilizing the power of ensemble learning to increase segmentation accuracy. Experimental results for a semantic segmentation task for which only a few training images are available show that a CNN using the proposed method outperforms both a patch-based classification CNN and a fully convolutional-based method.
Keywords:
Image segmentation · Attention · Retina · iPS cells

1 Introduction

Recently deep learning methods [2], which automatically extract hierarchical features capturing complex nonlinear relationships in the data, have managed to successfully replace most task-specific hand-crafted features. This has resulted in a significant improvement in performance on a variety of biomedical image analysis tasks, like object detection, recognition and segmentation (see e.g. [10] for a recent survey of the field and representative methods used in different applications), and currently Convolutional Neural Network (CNN) based methods define the state-of-the-art in this area.

In this paper we concentrate on biomedical image segmentation. For segmentation, where each pixel needs to be classified into its corresponding class, initially patch-wise training/classification was used [1]. In patch-based methods, local patches of pre-determined size are extracted from the images, typically using a CNN as pixel-wise classifier. During training, the patch is used as an input to the network and it is assigned as a label the class of the pixel at the center of the patch (available from ground-truth data provided by a human expert). During the test phase, a patch is fed into the trained net and the output layer of the net provides the probabilities for each class. More recently, Fully Convolutional Networks (FCN) [8], which replace the fully connected layers with convolutional ones, have replaced the patch-wise approach by providing a more efficient way to train CNNs end-to-end, pixels-to-pixels, and methods stemming from this approach presently define the state-of-the-art in biomedical image segmentation, exemplified by conv-deconv-based methods like U-Net [7].

Although fully convolutional methods have shown state-of-the-art performance on many segmentation tasks, they typically need to be trained on large datasets to achieve good accuracy. In many biomedical image segmentation tasks, however, only a few training images are available – either because data is simply not available, or because providing pixel-level ground truth by experts is too costly. Here we are motivated by a similar problem (section 3), where fewer than 50 images are available for training. On the other hand, patch-wise methods need only local patches, a huge number of which can be extracted even from a small number of training images. They however suffer from the following problem. While fully convolutional methods learn a map from pixel areas (multiple input image values) to pixel areas (multiple classes of all the pixels in the area), patch-wise methods learn a map from pixel areas (input image values) to a single pixel (the class of the pixel in the center of the patch). As illustrated in Fig. 1 [i], this wastes the rich information about the topology of the class structure inside the patch for many of the samples which contain more than a single class, and these would typically be the most interesting/difficult samples [4].
Instead of trying to represent in a suitable way and learn the complex class topology, it simply substitutes it with a single class (the class of the pixel in the center of the patch).

Regarding fully convolutional methods, they treat all locations in the images in the same way, which is in contrast with how human visual perception works. It is known that humans employ attentional mechanisms to focus selectively on subareas of interest and construct a global scene representation by combining the information from different local subareas [6].

Based on the above observations, we propose a new method, which takes a middle ground between fully convolutional and patch-wise learning and combines the benefits of both of these strategies. As shown in Fig. 1, as in the patch-wise approach we consider subareas of the whole image at a time, which provides us with a sufficient number of training samples, even if only a few ground-truth labeled images are available. However, as illustrated in Fig. 1 [ii], the class information is organized as in the retina [3]: the spatial resolution is highest in the central area (corresponding to the fovea), and it diminishes as we go to the periphery of the subarea. We propose a sequential attention mechanism which shifts the focus of attention in such a way that areas of the image which are difficult to classify (i.e. where the classification uncertainty is higher) are considered in much more detail than areas which are easy to classify. Since the focus of attention moves the subarea under investigation much slower over difficult areas (i.e. with a much smaller step), this results in many overlapping subareas in these regions. The final segmentation is achieved by averaging the class predictions over the overlapping subareas. In this way, the power of ensemble learning [5] is utilized by incorporating information from all overlapping subareas in the neighborhood (each of which provides a slightly different view of the scene) to further improve accuracy.

This is the basic idea of the method proposed in the paper, and details of how to implement it in a CNN will be given in the next section. Experimental results are reported in section 3, indicating that a significant improvement in segmentation accuracy can be achieved by the proposed method compared to both patch-based and fully convolutional-based methods.

Fig. 1. [i] Local patch containing pixels belonging to 3 classes, shown in different colors. On the right is a normalized histogram showing the empirical distribution of the classes inside the patch, which can be interpreted as probabilities. A standard patch-wise method learns only the class of the pixel at the center, completely ignoring class topology. [ii] The proposed method learns the structure of the spatial distribution of the class topology in a local subarea represented similarly to the retina – the resolution is highest in the center and decreases progressively towards the periphery. As attention shifts inside the image, the information of overlapping subareas is combined to produce the final segmented image (details explained in the text). Figure best viewed in color.
2 Proposed Method

We represent a subarea S extracted from an input image (centered at the current focus of attention) as a tensor of size d × d × c, where d is the size of the subarea in pixels and c stands for the color channels if color images are used (as typically done; c = 3 for RGB images). As shown in Fig. 1 [ii], we can represent the class information corresponding to this subarea as grids of different resolution, where each cell in a grid contains a sample histogram h^{(i)} calculated so that the k-th element of h^{(i)} gives the number of pixels from class k observed in the i-th cell of the grid. If each histogram h^{(i)} is normalized to sum to 1 by dividing each bin by the total number of pixels covered by the cell, the corresponding vector p^{(i)} can now be used to represent the probability mass function (pmf) for cell i: the k-th element of p^{(i)}, i.e. p^{(i)}_k, can be interpreted as the probability of observing class k in the area covered by the i-th cell of the grid.

Next, we show how retina-like grids of different resolution levels can be created. Let's start with a grid of size 4 × 4, as shown in the top row of Fig. 1 [ii]. This we will call Resolution Level r = 1. At resolution level r = 1 all the cells in the grid have the same resolution, i.e. the pmfs corresponding to each cell are calculated from areas of the same size. For example, if the size of the local subarea under consideration is d = 128 pixels (i.e. an image patch of size 128 × 128 pixels), each cell in the grid at resolution level r = 1 would correspond to an image area of size 32 × 32 pixels, from which a probability mass function p^{(i)} would be calculated as explained above. Next, we can create a grid at Resolution Level r = 2 by dividing in half the four cells in the center of the grid, so that they form an inner 4 × 4 grid of finer cells. The number of cells N in a grid obtained at resolution level r is N = 16 + 12(r − 1), e.g. N = 16 at r = 1 and N = 28 at r = 2. There is no special reason for choosing the grid at r = 1 to be of size 4 × 4, but choosing this number makes the process of creating different resolution levels especially simple, since in this case the innermost cells are always 4 (2 × 2) and can again be subdivided to obtain the next resolution level. The CNN is then trained to predict, for each cell of the grid, the pmfs p^{(i)}, which are used as target values.
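As a concrete illustration of how such per-cell pmf targets could be computed from a ground-truth label patch, consider the following sketch (our own code, not the authors'; the helper names, the d = 128 example size and the way the inner cells are indexed are assumptions):

```python
import numpy as np

def cell_pmf(labels, num_classes):
    """Normalized class histogram (pmf) of a rectangular block of ground-truth labels."""
    hist = np.bincount(labels.ravel(), minlength=num_classes).astype(np.float64)
    return hist / hist.sum()

def grid_pmfs_level1(label_patch, num_classes, grid=4):
    """Resolution level r = 1: a uniform grid x grid partition of the d x d label
    patch, one pmf per cell (grid * grid pmfs in total)."""
    d = label_patch.shape[0]
    s = d // grid
    return [cell_pmf(label_patch[y*s:(y+1)*s, x*s:(x+1)*s], num_classes)
            for y in range(grid) for x in range(grid)]

def grid_pmfs_level2(label_patch, num_classes, grid=4):
    """Resolution level r = 2: keep the 12 outer cells of the 4x4 grid and subdivide
    the central 2x2 block into an inner 4x4 grid of finer cells
    (12 + 16 = 28 pmfs, matching N = 16 + 12(r - 1) for r = 2)."""
    d = label_patch.shape[0]
    s = d // grid
    pmfs = []
    # outer ring of the coarse grid
    for y in range(grid):
        for x in range(grid):
            if 1 <= y <= 2 and 1 <= x <= 2:
                continue  # central cells are handled at finer resolution below
            pmfs.append(cell_pmf(label_patch[y*s:(y+1)*s, x*s:(x+1)*s], num_classes))
    # inner 4x4 grid over the central region, with cells of half the size
    s2 = s // 2
    off = s  # the central region starts one coarse cell in
    for y in range(4):
        for x in range(4):
            block = label_patch[off+y*s2:off+(y+1)*s2, off+x*s2:off+(x+1)*s2]
            pmfs.append(cell_pmf(block, num_classes))
    return pmfs

# Example: a 128x128 ground-truth patch with 3 classes (Good, Bad, Background)
labels = np.random.randint(0, 3, size=(128, 128))
targets_r1 = grid_pmfs_level1(labels, num_classes=3)   # 16 pmfs
targets_r2 = grid_pmfs_level2(labels, num_classes=3)   # 28 pmfs
```

Higher resolution levels repeat the same subdivision on the central 2 × 2 block of the previous level.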
We use the cross-entropy between the pmfs of the targets p^{(i)} and the corresponding output unit activations y^{(i)} as loss function L:

$$ L = -\sum_{n}\sum_{i}\sum_{k} p^{(i)}_{k,n} \log y^{(i)}_{k,n}(S_n; \mathbf{w}) \qquad (1) $$

where n indexes the training subarea image patches (S_n being the n-th training subarea image patch), i indexes the cells in the corresponding resolution grid, and k the classes. Here, w represents the weights of the network, to be found by minimizing the loss function. To form probabilities, the network output units corresponding to each cell are passed through the soft-max activation function.

Finally, we describe the sequential attention mechanism we utilize, whose purpose is to move the focus of attention across the image in such a way that those parts which are difficult to classify (i.e. where classification uncertainty is high) are observed at the highest possible resolution, and the retina-like grid of pmfs moves with smaller steps across such areas. To evaluate the classification uncertainty of the grid over the present subarea S, we use the following function,

$$ H(S) = -\frac{1}{N}\sum_{i \in S}\sum_{k} p^{(i)}_{k} \log p^{(i)}_{k} \qquad (2) $$

which represents the average entropy obtained from the posterior pmfs p^{(i)} for each cell (indexed by i) inside the grid, where k indexes the classes. Using H(S) as a measure of classification uncertainty, the position of the next location to move the focus of attention to (horizontal shift in pixels) is given by

$$ f(H(S)) = d \exp\{-(H(S))^{2}/\sigma\} \qquad (3) $$

The whole process is illustrated in Fig. 2. We start at the upper left corner of the input image, with a subarea of size d × d pixels (the yellow patch (a) in the figure). The classification uncertainty for that subarea is calculated using Eq. 2, and the step size in pixels to move in the horizontal direction is calculated by Eq. 3. As illustrated in the graph in the center of Fig. 2, since in this case the classification uncertainty is 0 (all pixels in the subarea belong to the same class), the focus of attention moves d pixels to the right, i.e. in this extreme case there is no overlap between the current and next subareas. For the subarea (b) shown in green, the situation is very different. In this case the classification uncertainty is very high, and the focus of attention would move only slightly to the right, allowing the image around this area to be assessed at the highest resolution. This would result in a very high level of overlap between neighboring subareas, as shown in the heat map on the right (where intensity is proportional to the level of overlap). This process is repeated until the right edge of the image is reached. Then the focus of attention is moved 10 pixels in the vertical direction to scan the next row, and everything is repeated until the whole image is processed.

Fig. 2. Overview of the sequential attention mechanism: (1) extract the subarea of interest from the image, (2) calculate its classification uncertainty, (3) calculate the location of the next subarea. The predictions over overlapping subareas yield the heat map and the final segmentation image (see text for details).

While the above attention mechanism moves the focus of attention across the image, the posterior class pmfs from the grids corresponding to each subarea are stored in a probability map of the same size as the image, i.e. to each pixel in the image is allocated a pmf equal to the pmf of the cell from the grid positioned above that pixel. In areas of the image where several subareas overlap, the probability map is computed by averaging for each pixel the pmfs of all cells which partially overlap over that pixel. Finally, the class of each pixel is obtained by taking the class with the highest probability from the final probability map, as shown in the upper right corner of Fig. 2 for the final segmented image.
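This scan-and-average procedure can be summarized by the following sketch (again our own illustration, not the authors' implementation: the helper names predict_pmfs and cell_regions are hypothetical, edge handling is simplified, and the 1-pixel minimum step is our addition to guarantee progress):

```python
import numpy as np

def average_entropy(pmfs, eps=1e-12):
    """Eq. (2): mean entropy of the per-cell posterior pmfs of one subarea."""
    pmfs = np.asarray(pmfs)
    return float(-(pmfs * np.log(pmfs + eps)).sum(axis=1).mean())

def step_size(H, d, sigma):
    """Eq. (3): horizontal shift in pixels; large when the subarea is easy
    (low entropy), small when it is uncertain."""
    return max(1, int(round(d * np.exp(-(H ** 2) / sigma))))

def segment_image(image, predict_pmfs, cell_regions, num_classes, d, sigma, v_step=10):
    """Scan the image with the retina-like grid and average overlapping predictions.

    predict_pmfs(subarea) -> list of per-cell pmfs (the trained CNN, assumed given).
    cell_regions(d)       -> list of (y0, y1, x0, x1) rectangles, one per grid cell,
                             in subarea coordinates (assumed given, matching the grid).
    """
    img_h, img_w = image.shape[:2]
    prob_sum = np.zeros((img_h, img_w, num_classes))
    counts = np.zeros((img_h, img_w, 1))

    y = 0
    while y + d <= img_h:
        x = 0
        while x + d <= img_w:
            sub = image[y:y + d, x:x + d]
            pmfs = predict_pmfs(sub)
            # paint each cell's pmf into the probability map
            for pmf, (y0, y1, x0, x1) in zip(pmfs, cell_regions(d)):
                prob_sum[y + y0:y + y1, x + x0:x + x1] += pmf
                counts[y + y0:y + y1, x + x0:x + x1] += 1
            x += step_size(average_entropy(pmfs), d, sigma)  # attention shift
        y += v_step  # fixed 10-pixel vertical step between rows
    prob_map = prob_sum / np.maximum(counts, 1)   # average over overlapping subareas
    return prob_map.argmax(axis=-1)               # final per-pixel class
```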
3 Experimental Results

Table 1. Experimental results for different values d of the size of the local subareas: each method (patch-center, the U-Net variants, and the proposed method at resolution levels ResLv-1 to ResLv-5) is scored by Jaccard Index, Dice, TPR, TNR and Accuracy (mean ± standard deviation).

In this section we evaluate the proposed method in comparison with a standard patch-wise classification-based CNN [1] and the fully convolutional-based U-Net [7] on the dataset described below. Additionally we implemented a U-Net version, called
UNet-patch, which applies U-Net to local patches rather than to a whole image. The original U-Net method which takes as input the whole image we will call UNet-image.

Dataset:
Our dataset consists of 59 images showing colonies of undifferentiated and differentiated iPS cells obtained through phase-contrast microscopy. Induced pluripotent stem (iPS) cells [9], for whose discovery S. Yamanaka received the Nobel Prize in Physiology or Medicine in 2012, hold great promise for regenerative medicine. Still, in order to fulfill their promise, a steady supply of iPS cells obtained through harvesting of individual cell colonies is needed, and automating the detection of abnormalities arising during the cultivation process is crucial. Thus, our task is to segment the input images into three categories: Good (undifferentiated), Bad (differentiated) and Background (BGD, the culture medium). Several representative images together with ground truth provided by experts can be seen in Fig. 3. All images in this dataset are of size 1600 × … pixels.

Fig. 3. Segmentation results for several images from the iPS dataset, obtained by the proposed method (3rd column), using ResLv-4 with subarea size of d = 192. The first column shows the original images and the second column the ground truth segmentation provided by an expert (red corresponds to class Good, green to Bad and blue to Background). The last column shows the corresponding heat map, where areas with high overlap between neighboring subareas are shown with high intensity values.

Network Architecture and Hyperparameters:
We used a network architecture based on the VGG-16 CNN, apart from the following differences. There are 13 convolutional layers in VGG-16, while we used 10 here. Also, in VGG-16 there are 3 fully-connected layers of which the first two consist of 4096 units, while those had 1024 units in our implementation. The learning rate was set to 0.0001 and for the optimization procedure we used ADAM. Batch size was 16, training for 20 epochs (UNet-patch for 15 epochs and UNet-image for 200 epochs). For the implementation of the CNNs we used TensorFlow and Keras. Four different sizes for the local subareas were tried, from d = 96 to d = 192, together with resolution levels from r = 1 to r = 5. The width of the Gaussian in Eq. 3 was empirically set to σ = 0.…
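For reference, a minimal Keras sketch of such a network and of the loss in Eq. 1 might look as follows (our own reconstruction, not the authors' code: the filter schedule, pooling positions and the flattened per-cell output layout are assumptions; only the VGG-16-style backbone with 10 convolutional layers, the two 1024-unit dense layers, the per-cell softmax and the Adam/learning-rate/batch-size settings come from the text):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

NUM_CLASSES = 3      # Good, Bad, Background
NUM_CELLS = 28       # N = 16 + 12(r - 1) grid cells, here for r = 2
SUBAREA = 128        # subarea size d (one of the sizes tried in the paper)

def build_model():
    """VGG-16-style backbone reduced to 10 convolutional layers, two 1024-unit
    dense layers, and one softmax over the classes for every grid cell."""
    inp = layers.Input(shape=(SUBAREA, SUBAREA, 3))
    x = inp
    # assumed filter schedule (2 convs per block, 5 blocks = 10 conv layers)
    for n_filters in (64, 128, 256, 512, 512):
        x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
        x = layers.Conv2D(n_filters, 3, padding="same", activation="relu")(x)
        x = layers.MaxPooling2D(2)(x)
    x = layers.Flatten()(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dense(1024, activation="relu")(x)
    x = layers.Dense(NUM_CELLS * NUM_CLASSES)(x)
    x = layers.Reshape((NUM_CELLS, NUM_CLASSES))(x)
    out = layers.Softmax(axis=-1)(x)      # one pmf per grid cell
    return models.Model(inp, out)

def grid_cross_entropy(p_true, y_pred):
    """Eq. (1): cross-entropy between target pmfs p and predicted pmfs y,
    summed over cells and classes and averaged over the batch."""
    y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0)
    return tf.reduce_mean(tf.reduce_sum(-p_true * tf.math.log(y_pred), axis=[1, 2]))

model = build_model()
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss=grid_cross_entropy)
# model.fit(subareas, target_pmfs, batch_size=16, epochs=20)   # settings from the paper
```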
Evaluation procedure and criteria:
The quality of the obtained segmentation results for each method was evaluated by 5-fold cross-validation using the following criteria: Jaccard index (the most challenging one), Dice coefficient, True Positive Rate (TPR), True Negative Rate (TNR) and Accuracy. For each score the average and standard deviation are reported.
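For completeness, these per-class scores can be computed as in the following sketch (one-vs-rest evaluation per class is our assumption; the paper does not state how the scores are aggregated over classes):

```python
import numpy as np

def scores_for_class(pred, gt, cls):
    """Jaccard, Dice, TPR, TNR and accuracy for one class, evaluated one-vs-rest."""
    p = (pred == cls)
    g = (gt == cls)
    tp = np.logical_and(p, g).sum()
    tn = np.logical_and(~p, ~g).sum()
    fp = np.logical_and(p, ~g).sum()
    fn = np.logical_and(~p, g).sum()
    jaccard = tp / (tp + fp + fn)
    dice = 2 * tp / (2 * tp + fp + fn)
    tpr = tp / (tp + fn)                  # sensitivity
    tnr = tn / (tn + fp)                  # specificity
    acc = (tp + tn) / (tp + tn + fp + fn)
    return jaccard, dice, tpr, tnr, acc

# toy example with the three classes of the iPS dataset: 0=Good, 1=Bad, 2=Background
pred = np.random.randint(0, 3, size=(256, 256))
gt = np.random.randint(0, 3, size=(256, 256))
for cls in range(3):
    print(cls, scores_for_class(pred, gt, cls))
```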
Results:
The results obtained on the iPS cell colonies dataset for each of the methods are given in Table 1, where patch-center stands for the patch-wise classification method, and results for resolution levels from r = 1 to r = 5 are given for the proposed method. As can be seen from the results, the proposed method outperforms both patch-wise classification and the U-Net-based methods. Fig. 3 gives some examples of segmentation on images from the iPS dataset, showing that very good segmentation accuracy can be achieved by the proposed method. The heat maps given in the last column demonstrate that the proposed attentional mechanism is able to focus the high-resolution parts of the retina-like grid on the boundaries between the classes, which seem to be the most difficult regions to classify, resulting in increased segmentation accuracy.

4 Conclusion

In this paper we have shown that the combined power of (1) a sequential attention mechanism controlling the shift of the focus of attention, (2) a local retina-like representation of the spatial distribution of class information and (3) ensemble learning can lead to increased segmentation accuracy in biomedical segmentation tasks. We expect that the proposed method can be especially useful for datasets for which only a few training images are available.
References
1. Cireşan, D.C., Giusti, A., Gambardella, L.M., Schmidhuber, J.: Deep neural networks segment neuronal membranes in electron microscopy images. In: NIPS, pp. 2843–2851 (2012)
2. Goodfellow, I., Bengio, Y., Courville, A.: Deep Learning. MIT Press (2016)
3. Kandel, E.R., Schwartz, J.H., Jessell, T.M., Siegelbaum, S.A., Hudspeth, A.J.: Principles of Neural Science, 5th ed. McGraw-Hill (2013)
4. Kontschieder, P., Bulo, S.R., Bischof, H., Pelillo, M.: Structured class-labels in random forests for semantic image labeling. In: Proc. ICCV, pp. 2190–2197 (2012)
5. Kuncheva, L.: Combining Pattern Classifiers, 2nd ed. Wiley (2014)
6. Rensink, R.A.: The dynamic representation of scenes. Visual Cognition 7(1–3), 17–42 (2000)
7. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)
8. Shelhamer, E., Long, J., Darrell, T.: Fully convolutional networks for semantic segmentation. IEEE Trans. Pattern Anal. Mach. Intell. 39(4), 640–651 (2017)
9. Takahashi, K., Tanabe, K., Ohnuki, M., Narita, M., Ichisaka, T., Tomoda, K., Yamanaka, S.: Induction of pluripotent stem cells from adult human fibroblasts by defined factors. Cell 131(5), 861–872 (2007)