Image Segmentation with Pseudo-marginal MCMC Sampling and Nonparametric Shape Priors
aa r X i v : . [ c s . C V ] S e p Image Segmentation with Pseudo-marginal MCMCSampling and Nonparametric Shape Priors
Ertunc Erdil , ∗ Sinan Yıldırım Tolga Tasdizen Müjdat Çetin , Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey ARM Ltd., 1 Summerpool Road, Loughborough, Leicester, UK Department of Electrical and Computer Engineering, University of Utah, Salt Lake City, UT, USA Department of Electrical and Computer Engineering, University of Rochester, Rochester, NY, USA {ertuncerdil, sinanyildirim, mcetin}@sabanciuniv.edu [email protected]
Abstract
In this paper, we propose an efficient pseudo-marginal Markov chain Monte Carlo(MCMC) sampling approach to draw samples from posterior shape distributionsfor image segmentation. The computation time of the proposed approach is inde-pendent from the size of the training set used to learn the shape prior distributionnonparametrically. Therefore, it scales well for very large data sets. Our approachis able to characterize the posterior probability density in the space of shapesthrough its samples, and to return multiple solutions, potentially from differentmodes of a multimodal probability density, which would be encountered, e.g., insegmenting objects from multiple shape classes. Experimental results demonstratethe potential of the proposed approach.
Incorporating prior shape density into the segmentation process has been widely studied in the lit-erature [20], [29], [25], [2], [30], [18]. These methods can usually handle Gaussian-like, unimodalshape prior densities. Kim et al. [21] and Cremers et al. [8] incorporate nonparametric density esti-mation based shape priors into the segmentation process using level sets. Therefore, these methodsand their variants can learn “multimodal" shape densities, which can be encountered in problemsinvolving shape densities containing multiple classes of shapes [14], [6], [31], [28], [9], [23], [11],[24]. These methods minimize an energy function and find a solution at a local optimum. This doesnot provide any measure of the degree of confidence/uncertainty in that result and any informationabout the characteristics of the posterior density.There are a limited number of Markov chain Monte Carlo (MCMC) based image segmentationmethods in the literature. Most of these methods generate samples from the posterior density byassuming the prior density is uniform [13], [4], [5]. The only sampling-based segmentation approachin the literature that uses a shape prior is the one proposed by Chen et al. [7]. However, the methodcannot handle topological changes in shapes.Our major contribution in this paper is a pseudo-marginal Markov chain Monte Carlo (MCMC)sampling-based image segmentation approach that exploits nonparametric shape priors. The pro-posed approach is able to characterize the posterior density through its samples. Our approach canlearn from very large data sets efficiently by using pseudo-marginal sampling. To the best of ourknowledge, this is the first approach that performs pseudo-marginal MCMC shape sampling-basedimage segmentation through an energy functional that uses nonparametric shape priors and level ∗ This work has been supported by the Scientific and Technological Research Council of Turkey (TUBITAK)under Grant 113E603.31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA. ets. Also, unlike other MCMC sampling-based segmentation approaches in the literature, the pro-posed approach perfectly satisfies the necessary conditions to implement MCMC sampling which isa crucial step for developing an MCMC sampler.A precursor to this work was presented in [12]. This paper advances that prior work in two ma-jor ways: (1) while [12] approximately satisfies the necessary conditions of MCMC sampling, theapproach presented in this paper perfectly satisfies these conditions; (2) unlike [12] we use pseudo-marginal sampling to be able to learn shape densities from very large data sets; the method in [12]becomes inefficient with large training data.
The image segmentation problem involves estimating an unknown segmenting curve for an objectgiven an observed image y ∈ Y M × N where Y is the set of the values that the pixels of y can take.We are particularly interested in problems in which the prior shape distribution has componentscorresponding to different object classes. We denote the class of the object by s ∈ { , . . . , n } where n ≥ is the total number of classes, which is known. For simplicity we assume that s has a uniformdistribution over { , . . . , n } so that p ( s ) = 1 /n, s = 1 , . . . , n .We ultimately aim to estimate a binary segmenting curve c ∈ { , } M × N where ’s indicate thebackground and ’s indicate the object. The conditional density of y given c is independent from s and is denoted by p Y | C ( y | c ) . We construct this density based on the piecewise-constant ver-sion of the Mumford-Shah functional [26], [3] which is a very common data fidelity term forimage segmentation. To present the curves, we use level sets, which we define as a mapping φ : { , } M × N → R MN . Our choice of level sets is to handle topological changes and use gra-dient flows effectively in our methodology. We denote the level set of c as x = φ ( c ) .We also have a training set of binary curves that are grouped into classes, C = {C , . . . , C n } , whereeach C i = { c i, , . . . , c i,m i } is the collection of m i ≥ segmented curves for class i . Let us alsodefine the level set representation of the training set as X = {X , . . . , X n } , where each X i = { x i, , . . . , x i,m i } with x i,j = φ ( c i,j ) , the level set representation of c i,j . Now we can define theprior distribution for x given the class s as p ( x | s ) = m s P m s i =1 N ( x ; x s,i , σ I ) , where N ( x ; µ, Σ) is a Gaussian density with mean µ and covariance matrix Σ . This prior corresponds to a mixtureof kernels with centers x s, , . . . , x s,m s with kernel size σ [21]. To determine the kernel size σ , weuse an ML kernel with leave-one-out [27]. Let us also define ¯ φ as the pseudo-inverse of φ suchthat ¯ φ ( φ ( c )) = c . We use ¯ φ to rewrite the conditional density of the data in terms of x as p ( y | x ) = p Y | C ( y | ¯ φ ( x )) . We can also write the joint density of s , x , and y as p ( s, x, y ) = p ( s ) p ( x | s ) p ( y | x ) .The Bayesian image segmentation problem can be formulated as finding the posterior distribution of x given y which is p ( x | y ) ∝ p ( y | x ) p ( x ) = p ( y | x ) P ns =1 p ( s ) p ( x | s ) . However, estimating p ( x | y ) can be difficult since the summation over classes makes the distribution hard to infer, e.g., usingMonte Carlo sampling methods. Alternatively, we aim for the joint posterior distribution of s and x given y , p ( s, x | y ) ∝ p ( s, x, y ) , whose marginal is still the desired posterior p ( x | y ) . We aim to sample from p ( s, x | y ) using Gibbs sampling by sampling from the conditional densities p ( s | y, x ) and p ( x | y, s ) in an alternating fashion [16], [15]. However, since the full conditional p ( x | s, y ) is hard to sample from, we update x by using a Metropolis-Hastings (MH) move, whichleads to the well known Metropolis-Hastings within Gibbs (MHwG) algorithm [17].Both conditional densities in MHwG involve p ( x | s ) which needs to be evaluated during MHupdates. This can be too costly when m s is large, which occurs when we have a big train-ing set. Therefore, towards a more computationally efficient MCMC algorithm that scales withthe training data size, we consider the following unbiased estimator of p ( x | s ) via subsampling: b p ( x | s ) = m s P ˆ m s j =1 N ( x ; x s,u j , σ I ) where { u , u , . . . , u ˆ m s } ⊂ { , , . . . , m s } is a set of sub- Note that φ is not invertible, so we define a pseudo-inverse. ˆ m s ≪ m s . This approximation of theprior leads to the approximation of the conditional posterior densities b p ( s | x, y ) ∝ p ( s ) b p ( x | s ) and b p ( x | s, y ) ∝ b p ( x | s ) p ( y | x ) . Using this approximation does not generally guarantee that the MarkovChain has an equilibrium distribution that is exactly p ( x, s | y ) . To achieve a correct MH algorithmwhich uses the approximation b p ( x | s ) and still targets p ( x, s | y ) , we adopt the pseudo-marginal MHalgorithm of [1]; the details are described in the following section. Assume that we have a non-negative random variable z such that given x and s , its conditionaldensity given s and x , g s,x ( z ) , satisfies R ∞ g s,x ( z ) zdz = p ( x | s ) . We choose z as an approximationto p ( x | s ) , in particular z = b p ( x | s ) where b p ( x | s ) is defined in Section 3.1, and its probability density g s,x ( z ) corresponds to the generation process of this approximation. (It will become clear that in factwe do not have to calculate g s,x ( z ) at all but we should be able to sample from it.) Note that z is arandom variable since we generate { u , . . . , u ˆ m s } randomly when computing b p ( x | s ) . We can definethe extended posterior density with the new variable z added as p ( x, s, z | y ) ∝ p ( s ) zg s,x ( z ) p ( y | x ) .When we integrate z out, we see that samples for s and x from p ( x, s, z | y ) will admit the desiredposterior p ( s, x | y ) : p ( s ) p ( y | x ) R zg s,x ( z ) dz = p ( s ) p ( x | s ) p ( y | x ) . Now, the problem of generatingsamples from p ( s, x | y ) can be replaced with the problem of generating samples from p ( x, s, z | y ) .We propose a pseudo-marginal MHwG sampling procedure to generate samples from p ( x, s, z | y ) .Note the important remark that this algorithm also targets p ( x, s | y ) , hence p ( x | y ) exactly. There aretwo major steps of the proposed approach which we explain in detail in the following: (1) conditionthe posterior on x and update s and z , (2) update x and z by conditioning the posterior on s . Update step for the class s and z : The distribution from which we sample s and z in Metropolis-Hastings is p ( s, z | y, x ) . We can write this distribution as p ( s, z | y, x ) = p ( s, z | x ) ∝ p ( s ) zg s,x ( z ) .Since we regard the proposal mechanism as a joint update of s and z , the proposal generates ( s ′ , z ′ ) from the density q ( s ′ | s ) g s ′ ,x ( z ) . Note that ( s ′ , z ′ ) denotes the candidate samples generated from theproposal distribution. We take q ( s ′ | s ) as a uniform distribution U{ , n } and z ′ = b p ( x | s ′ ) [1]. Once s ′ and z ′ are sampled, they are either accepted with probability min (cid:26) , p ( s ′ ) z ′ g s ′ ,x ( z ′ ) q ( s | s ′ ) g s,x ( z ) p ( s ) zg s,x ( z ) q ( s ′ | s ) g s ′ ,x ( z ′ ) (cid:27) = min (cid:26) , p ( s ′ ) z ′ q ( s | s ′ ) p ( s ) zq ( s ′ | s ) (cid:27) , (1)or the current values of s and z are kept. The probabilities in the MH ratio in (1) can be computedexactly which guarantees satisfying the necessary conditions to implement MCMC sampling. Update step for the level set x and z : The next step is to sample x and z from the conditionaldensity p ( x, z | y, s ) . To achieve this, we perform Metropolis-Hastings sampling as we use for sam-pling s ′ and z ′ . The conditional density p ( x, z | y, s ) can be written as p ( x, z | s, y ) ∝ zg s,x ( z ) p ( y | x ) .Also, for joint sampling of candidates ( x ′ , z ′ ) we can write the proposal density as q s,j ( x ′ , z ′ | y ) = q s,j ( x ′ | x, y ) g s,x ′ ( z ′ ) where j is sampled uniformly from { , m s } , and z ′ = b p ( x ′ | s ) is generatedusing subsampling. Then, the Metropolis-Hastings ratio can be computed as follows: min (cid:26) , z ′ g s ′ ,x ( z ′ ) p ( y | x ′ ) q s,j ( x | x ′ , y ) g s,x ( z ) zg s,x ( z ) p ( y | x ) q s,j ( x ′ | x, y ) g s ′ ,x ( z ′ ) (cid:27) = min (cid:26) , z ′ p ( y | x ′ ) q s,j ( x | x ′ , y ) zp ( y | x ) q s,j ( x ′ | x, y ) (cid:27) (2) Design of the proposal distribution:
The crucial part in (2) is designing the proposal distributionto generate a candidate curve x ′ from x . Let us define an energy function whose gradient drivesthe curve to the desired location using the data and the training images in a particular class s as E s ( x ) := log p ( x | s ) + log p ( y | x, s ) . When the training data set is too large, calculating log p ( x | s ) may be too expensive as discussed earlier. An unbiased estimator of E s ( x ) would be obtainedas E s,j ( x ) := log N ( x ; x s,j , σ I ) + log p ( y | x, s ) where j ∼ U (1 , . . . , m s ) . The proposal dis-tribution after sampling the j th training image in class s is then constructed as q s,j ( x ′ | x, y ) = N (cid:16) x ′ ; x − d ∇ E s,j ( x ) , Σ (cid:17) . Here, the shift term d ∇ E s,j ( x ) is an approximation to the gradient ∇ E s,j ( x ) w.r.t. x and is given by d ∇ E s,j ( x ) = σ ( x − x s,j ) + (cid:2) ( y − µ in ) ⊙ − ( y − µ out ) ⊙ (cid:3) where ( · ) ⊙ k is the element-wise power operation. Note that the term ( y − µ in ) ⊙ − ( y − µ out ) ⊙ is a discrete approximation w.r.t. the level set representation x [3], so that the whole expression is an approximation to ∇ E s,j ( x ) .
3n the design of q s,j ( x ′ | x, y ) , the most important part is selecting a covariance matrix, Σ , that gen-erates smooth perturbations since smoother curves are more likely [21], [13]. In the proposed ap-proach, we compute a positive semi-definite covariance matrix Σ such that it generates smoothrandom perturbations. Given an M × N image, we first generate an M N × M N matrix Z by draw-ing i.i.d. samples from a unit Gaussian distribution. Then, we construct another M N × M N matrix F where each column of F is a smoothed version of each row in Z . By assuming F is constructedby multiplying Z by a matrix b A , we can find the matrix b A as b A = Z − F since Z is generallyinvertible. Given b A , a covariance matrix b Σ can be computed by b Σ = b A b A T . However, generally b Σ isnot positive semi-definite since b A is not generally a full rank matrix. Therefore, we find the closestpositive semi-definite matrix to b Σ using the approach in [19] which we take it as Σ . We compare running time of the proposed pseudo-marginal sampling approach and conventionalsampling approach which uses all training examples to estimate the prior shape density. We performexperiments on the MNIST data set [22] which contains 60,000 training examples. We constructtraining sets with various sizes from K to K by randomly selecting equal number of samplesfrom each digit class. We generate samples using both the pseudo-marginal sampling and con-ventional sampling on a test image by using training sets with different sizes. The plot in Figure1 shows average running time as a function of training set size for both pseudo-marginal samplingand conventional MCMC sampling. The average single sample generation time of the proposedapproach does not change as the training set size increases since we choose ˆ m s = 10 in all experi-ments. We also measure the segmentation accuracy of all samples with the ground truth using Dicescore [10]. Average Dice score result for pseudo-marginal sampling is . whereas it is . for the conventional sampling. The very slight decrease in Dice results of pseudo-marginal samplingcan be acceptable in many applications when the huge gain in running time is considered. Training set size A v e r a g e t i m e p e r s a m p l e ( s e c . ) Pseudo-marginal MHwGConventional MHwG
Figure 1: Average running time for producing asingle sample as a function of training set size. (a) Testimage (b) Kimet al. [21] (c) The proposed approach
Figure 2: Marginal confidence bounds ob-tained by samples on a test image. Note thatin (c) red indicates the least and blue indicatesthe most confidence boundaries.We also run the proposed approach on the test image in Figure 2(a) and generate samples. Thealgorithm generates samples from digit classes , , and as shown in Figure 2(c). The optimization-based approach of Kim et al. [21] finds a single segmentation solution shown in Figure 2(b). Thisexperiment shows that our algorithm can produce samples from different modes. We propose a pseudo-marginal Markov chain Monte Carlo (MCMC) sampling-based image segmen-tation approach that exploits nonparametric shape priors. The proposed approach generates samplesfrom the posterior distribution p ( x | y ) to avoid shortcomings of the optimization-based approacheswhich include getting stuck at local optima and being unable to characterize the posterior density.The proposed MCMC sampling approach deals with all these problem while being computationallyefficient, unlike the conventional MCMC approaches, by using pseudo-marginal sampling princi-ples. Moreover, our pseudo-marginal shape sampler perfectly satisfies the necessary conditions toimplement MCMC sampling which is crucial for ensuring the generated samples come from thedesired distribution. Existing methods in the literature only approximately satisfy these conditions. Note that Dice score takes values in [0 , where indicates the perfect match with the ground truth. eferences [1] Andrieu, C., Roberts, G.O.: The pseudo-marginal approach for efficient Monte Carlo compu-tations. The Annals of Statistics pp. 697–725 (2009)[2] de Bruijne, M., van Ginneken, B., Viergever, M.A., Niessen, W.J.: Adapting active shape mod-els for 3D segmentation of tubular structures in medical images. In: Information Processing inMedical Imaging, pp. 136–147. Springer (2003)[3] Chan, T.F., Vese, L., et al.: Active contours without edges. IEEE transactions on Image pro-cessing (2), 266–277 (2001)[4] Chang, J., Fisher, J.: Efficient MCMC sampling with implicit shape representations. In:IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2081–2088. IEEE(2011)[5] Chang, J., Fisher III, J.W.: Efficient topology-controlled sampling of implicit shapes. In: Pro-ceedings of the IEEE International Conference on Image Processing (ICIP) (2012)[6] Chen, S., Radke, R.J.: Level set segmentation with both shape and intensity priors. In: Inter-national Conference on Computer Vision, pp. 763–770. IEEE (2009)[7] Chen, S., Radke, R.J.: Markov chain Monte Carlo shape sampling using level sets. In: IEEE12th International Conference on Computer Vision Workshops (ICCV Workshops), pp. 296–303. IEEE (2009)[8] Cremers, D., Osher, S.J., Soatto, S.: Kernel density estimation and intrinsic alignment forshape priors in level set segmentation. International Journal of Computer Vision (3), 335–351 (2006)[9] Cremers, D., Rousson, M., Deriche, R.: A review of statistical approaches to level set segmen-tation: integrating color, texture, motion and shape. International journal of computer vision (2), 195–215 (2007)[10] Dice, L.R.: Measures of the amount of ecologic association between species. Ecology (3),297–302 (1945)[11] Erdil, E., Ghani, M.U., Rada, L., Argunsah, A.O., Unay, D., Tasdizen, T., Cetin, M.: Nonpara-metric joint shape and feature priors for image segmentation. IEEE Transactions on ImageProcessing (2017)[12] Erdil, E., Yildirim, S., Cetin, M., Tasdizen, T.: MCMC shape sampling for image segmentationwith nonparametric shape priors. In: Proceedings of the IEEE Conference on Computer Visionand Pattern Recognition, pp. 411–419 (2016)[13] Fan, A.C., Fisher III, J.W., Wells III, W.M., Levitt, J.J., Willsky, A.S.: MCMC curve samplingfor image segmentation. In: Medical Image Computing and Computer-Assisted Intervention(MICCAI), pp. 477–485. Springer (2007)[14] Foulonneau, A., Charbonnier, P., Heitz, F.: Multi-reference shape priors for active contours.International journal of computer vision (1), 68–81 (2009)[15] Gelfand, A.E., Smith, A.F.: Sampling-based approaches to calculating marginal densities. Jour-nal of the American statistical association (410), 398–409 (1990)[16] Geman, S., Geman, D.: Stochastic relaxation, Gibbs distributions, and the Bayesian restorationof images. IEEE Transactions on pattern analysis and machine intelligence (6), 721–741 (1984)[17] Geweke, J., Tanizaki, H.: Bayesian estimation of state-space models using the Metropolis–Hastings algorithm within Gibbs sampling. Computational Statistics & Data Analysis (2),151–170 (2001)[18] Heimann, T., Meinzer, H.P.: Statistical shape models for 3D medical image segmentation: areview. Medical Image Analysis (4), 543–563 (2009)[19] Higham, N.J.: Computing a nearest symmetric positive semidefinite matrix. Linear Algebraand its Applications , 103 – 118 (1988)[20] Kass, M., Witkin, A., Terzopoulos, D.: Snakes: Active contour models. International Journalof Computer Vision (4), 321–331 (1988)[21] Kim, J., Çetin, M., Willsky, A.S.: Nonparametric shape priors for active contour-based imagesegmentation. Signal Processing (12), 3021–3044 (2007)522] LeCun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based learning applied to documentrecognition. Proceedings of the IEEE (11), 2278–2324 (1998)[23] Mesadi, F., Cetin, M., Tasdizen, T.: Disjunctive normal parametric level set with application toimage segmentation. IEEE Transactions on Image Processing (6), 2618–2631 (2017)[24] Mesadi, F., Erdil, E., Cetin, M., Tasdizen, T.: Image segmentation using disjunctive normalBayesian shape and appearance models. IEEE Transactions on Medical Imaging PP (99), 1–1(2017)[25] Milborrow, S., Nicolls, F.: Locating facial features with an extended active shape model. In:European Conference on Computer Vision (ECCV), pp. 504–513. Springer (2008)[26] Mumford, D., Shah, J.: Optimal approximations by piecewise smooth functions and associ-ated variational problems. Communications on pure and applied mathematics (5), 577–685(1989)[27] Silverman, B.W.: Density estimation for statistics and data analysis, vol. 26. CRC press (1986)[28] So˘ganlı, A., Uzunba¸s, M.G., Çetin, M.: Combining learning-based intensity distributions withnonparametric shape priors for image segmentation. Signal, Image and Video Processing (4),789–798 (2014)[29] Van Ginneken, B., Frangi, A.F., Staal, J.J., Romeny, B.M., Viergever, M.A.: Active shapemodel segmentation with optimal features. IEEE Transactions on Medical Imaging (8),924–933 (2002)[30] Wang, W., Shan, S., Gao, W., Cao, B., Yin, B.: An improved active shape model for facealignment. In: IEEE International Conference on Multimodal Interfaces, p. 523 (2002)[31] Yang, R., Mirmehdi, M., Xie, X., Hall, D.: Shape and appearance priors for level set-based leftventricle segmentation. IET Computer Vision7