Spectral Image Segmentation with Global Appearance Modeling
Jeova F. S. Rocha Neto
School of Engineering, Brown University, Providence, RI 02906 [email protected]
Pedro F. Felzenszwalb
School of Engineering, Brown University, Providence, RI 02906 [email protected]
Abstract
We introduce a new spectral method for image segmentation that incorporates long range relationships for global appearance modeling. The approach combines two different graphs: a sparse graph that captures spatial relationships between nearby pixels and a dense graph that captures pairwise similarity between all pairs of pixels. We extend the spectral method for Normalized Cuts to this setting by combining the transition matrices of Markov chains associated with each graph. We also derive an efficient method that uses importance sampling for sparsifying the dense graph of appearance relationships. This leads to a practical algorithm for segmenting high-resolution images. The resulting method can segment challenging images without any filtering or pre-processing.
Image segmentation is a fundamental problem in computer vision. Spectral clustering methods pioneered by the normalized cuts approach [1] provide simple and powerful algorithms based on fundamental graph-theoretic notions and computational linear algebra.

Spectral clustering methods are formulated using an objective function defined by a graph. The classical constructions used for image segmentation focus on pairwise similarity between nearby pixels. In this paper we introduce a new spectral method that incorporates long range relationships for global appearance modeling. The resulting method can segment challenging images without any filtering or pre-processing. Figure 1 shows several results obtained with the proposed method. Figure 2 shows how the new method significantly outperforms the original normalized cuts formulation.

We use a dense graph to capture the global appearance of regions. The normalized cut in this graph captures the distributions of pixel values in each region using a kernel density estimate. The measure penalizes the overlap between distributions in different regions.

To implement our image segmentation approach we extend the normalized cuts spectral algorithm to a setting where there are multiple graphs that encode different grouping cues. Our approach for image segmentation combines two graphs. We provide a natural interpretation for the normalized cut criterion on each of these graphs. One of the graphs is sparse and does not depend on the image data; it simply captures spatial relationships between pixels. The other graph is dense and captures pairwise similarity between all pairs of pixels.

The direct implementation of spectral methods to segment high resolution images is always difficult due to memory and computational requirements. We tackle this challenge using a graph sparsification approach that enables the efficient segmentation of high resolution images.
Preprint. Under review. arXiv [cs.CV].

Figure 1: Segmentation results using the proposed method.

We show experimental results with a variety of images and provide a quantitative evaluation using a dataset of synthetic images with Brodatz textures. Our approach achieves highly accurate results in this setting despite the complex appearance of the textures.

Let $G = (V, E, w)$ be an undirected weighted graph. A cut $(A, B)$ is a partition of $V$ into two disjoint sets. We consider the weight of a cut in different graphs with the same set of vertices. Let $w(i, j) = 0$ when $\{i, j\} \notin E$. The weight of a cut $(A, B)$ in $G$ is defined as

$$\mathrm{Cut}(A, B \mid G) = \sum_{i \in A,\, j \in B} w(i, j). \tag{1}$$

In the context of clustering and image segmentation it is typical to use large weights to indicate that elements are similar and should be grouped together. In this case we can look for the minimum cut to find an optimal partition of $V$. However, this strategy is heavily biased towards imbalanced cuts, such as having a single node on one side. This motivated the introduction of the celebrated normalized cut criterion and algorithm [1].

The normalized cut value is defined as

$$\mathrm{NCut}(A, B \mid G) = \frac{\mathrm{Cut}(A, B \mid G)}{\mathrm{Vol}(A \mid G)} + \frac{\mathrm{Cut}(A, B \mid G)}{\mathrm{Vol}(B \mid G)} = \frac{\mathrm{Vol}(V \mid G)\, \mathrm{Cut}(A, B \mid G)}{\mathrm{Vol}(A \mid G)\, \mathrm{Vol}(B \mid G)}. \tag{2}$$

Here $\mathrm{Vol}(A \mid G)$ is called the volume of $A$ and is defined as $\mathrm{Vol}(A \mid G) = \sum_{i \in A,\, j \in V} w(i, j)$.

The spectral algorithm introduced in [1] solves a continuous relaxation of the minimum NCut problem. Let $W$ be the weighted adjacency matrix of $G$. Let $D$ be the diagonal degree matrix with $D(i, i) = \sum_{j \in V} W(i, j)$. The matrix $L = D - W$ is the Laplacian of $G$. The NCut algorithm solves a generalized eigenvector problem,

$$Lx = \lambda D x. \tag{3}$$

The algorithm selects the eigenvector $x$ with the second smallest eigenvalue, and partitions $V$ by thresholding $x$.

In [2] the NCut criterion and algorithm are described in terms of a Markov chain. Let $P = D^{-1} W$. The matrix $P$ is the transition matrix of a Markov chain over the vertices $V$. The long term behavior of this Markov chain can be characterized by the solutions to the eigenvector problem

$$Px = \lambda x. \tag{4}$$

A solution $(\lambda, x)$ to the eigenvector problem in (4) leads to a solution $(1 - \lambda, x)$ to the generalized eigenvector problem in (3) and vice versa. Therefore the generalized eigenvector $x$ used in the NCut algorithm corresponds to the eigenvector of $P$ with the second largest eigenvalue.

The classical application of normalized cuts for image segmentation involves a graph $H$ where the vertices represent the image pixels and the weights reflect both the appearance similarity and the distance between pairs of pixels.

Let $H = (V, E, w)$ be a graph where the vertices $V$ are the pixels in an image and the edges $E$ connect every pair of pixels. We use $I(j)$ and $X(j)$ to denote respectively the appearance (such as the brightness or color) and spatial location of pixel $j$. Now define

$$w(i, j) = \exp\left(-\frac{\|I(i) - I(j)\|^2}{2\sigma_I^2}\right) \exp\left(-\frac{\|X(i) - X(j)\|^2}{2\sigma_X^2}\right). \tag{5}$$

The graph $H$ combines two grouping cues in a single weight.
Using the normalized cut criterion, pixels are encouraged to be grouped together if they have similar appearance and are close to each other. Note, however, that pixels that have similar appearance but are far away are not encouraged to be grouped because the corresponding weight is close to zero. Similarly, neighboring pixels that have very different appearance (such as in a textured region) are also not encouraged to be grouped.
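The spectral step reviewed above can be illustrated on a toy graph. The following is a minimal Python sketch, not the authors' implementation: the function names `ncut_value` and `spectral_bipartition`, the six-vertex example graph, and the choice of thresholding at zero are all assumptions made for illustration. It computes the eigenvector of $P = D^{-1} W$ with the second largest eigenvalue through the symmetric matrix $D^{-1/2} W D^{-1/2}$, which has the same eigenvalues as $P$.

```python
import numpy as np

def ncut_value(W, mask):
    """Normalized cut value of the partition (mask, ~mask), as in equation (2)."""
    cut = W[mask][:, ~mask].sum()
    vol_a = W[mask].sum()    # sum of w(i, j) over i in A, j in V
    vol_b = W[~mask].sum()   # sum of w(i, j) over i in B, j in V
    return cut / vol_a + cut / vol_b

def spectral_bipartition(W):
    """Threshold the eigenvector of P = D^{-1} W with second largest eigenvalue.

    If S v = mu v for S = D^{-1/2} W D^{-1/2}, then x = D^{-1/2} v satisfies
    P x = mu x, so the symmetric solver eigh can be used.
    """
    d = W.sum(axis=1)
    s = 1.0 / np.sqrt(d)
    S = W * s[:, None] * s[None, :]
    vals, vecs = np.linalg.eigh(S)   # eigenvalues in ascending order
    x = s * vecs[:, -2]              # eigenvector of P, second largest eigenvalue
    return x > 0                     # thresholding at zero is an assumption

# Hypothetical toy graph: two triangles joined by one weak edge.
W = np.zeros((6, 6))
for i, j in [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5)]:
    W[i, j] = W[j, i] = 1.0
W[2, 3] = W[3, 2] = 0.1

mask = spectral_bipartition(W)
print(mask, ncut_value(W, mask))
```

On this graph the thresholded eigenvector separates the two triangles, which is also the partition with the smallest normalized cut value.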
We combine two normalized cut values to obtain a new criterion for image segmentation. We break the grouping cues (spatial proximity and appearance similarity) into two separate graphs, $G_{\mathrm{grid}}$ and $G_{\mathrm{data}}$. Both graphs are defined over the same set of vertices, corresponding to the pixels in an image.

1. The graph $G_{\mathrm{grid}}$ is a grid over the image pixels, where each pixel is connected to its four neighboring pixels with an edge of weight 1. This graph encourages neighboring pixels to be grouped together, independent of their appearance.
2. The graph $G_{\mathrm{data}}$ is a fully connected graph that encourages pixels with similar appearance to be grouped together, independent of their location. The weights in $G_{\mathrm{data}}$ are based on appearance similarity of pixels, and do not depend on pixel locations,

$$w(i, j) = \exp\left(-\frac{\|I(i) - I(j)\|^2}{2\sigma^2}\right). \tag{6}$$

$G_{\mathrm{grid}}$

Let $(A, B)$ be a cut in the grid graph. The cut defines a segmentation of the image into two regions, with a boundary $\Gamma$ between them. The cut value, $\mathrm{Cut}(A, B \mid G_{\mathrm{grid}})$, counts the number of pairs of neighboring pixels that are in different regions. In general the cut value in the grid graph and similar graphs can be seen as a measure of the length of the boundary $\Gamma$ (see [3]).

Observation 1. $\mathrm{Cut}(A, B \mid G_{\mathrm{grid}}) \approx \mathrm{Len}(\Gamma)$.

This is a commonly used measure of spatial coherence in image segmentation problems (see, e.g., [4]). Although the criterion $\mathrm{Cut}(A, B \mid G_{\mathrm{grid}})$ leads to spatially coherent segmentations and is widely used in practice, it gives most preference to trivial solutions with a small (single pixel) region.

Using the previous observation and noting that $\mathrm{Vol}(S \mid G_{\mathrm{grid}}) \approx |S|$ we can derive an expression for the value of a normalized cut in the grid graph.

Observation 2. $\mathrm{NCut}(A, B \mid G_{\mathrm{grid}}) \approx \dfrac{|V|\, \mathrm{Len}(\Gamma)}{|A|\,|B|}$.

Minimizing this criterion encourages solutions where the boundary $\Gamma$ between the two regions is short (to minimize $\mathrm{Len}(\Gamma)$) and where the two regions have similar size (to maximize $|A||B|$).

$G_{\mathrm{data}}$

Now we consider the weight of cuts and normalized cuts in $G_{\mathrm{data}}$. For $S \subseteq V$ we use $g_S$ to denote a kernel density estimate defined by the pixel values in $S$,

$$g_S(c) = \frac{1}{|S|} \sum_{i \in S} K(I(i) - c). \tag{7}$$

Footnote on the graph $H$ of equation (5): the graph defined there differs slightly from the one used in [1] because in [1] the weight of an edge is set to 0 if the distance between $i$ and $j$ is above a threshold.

Proposition 1. $\mathrm{Cut}(A, B \mid G_{\mathrm{data}}) = (2\pi\sigma^2)^{d/2}\, |A|\,|B|\, \langle g_A, g_B \rangle$, where $d$ is the dimension of the pixel appearance vectors, $g_A$ and $g_B$ are density estimates defined using a Gaussian kernel, and $\langle \cdot, \cdot \rangle$ denotes the standard inner product of functions.

Proof.
We use the fact that the convolution of two Gaussians with equal variance is a Gaussian with twice the variance,

$$
\begin{aligned}
\sum_{i \in A,\, j \in B} w_{\mathrm{data}}(i, j)
&= \sum_{i \in A} \sum_{j \in B} \exp\left(-\frac{\|I(i) - I(j)\|^2}{2\sigma^2}\right) \\
&= (2\pi\sigma^2)^{d/2} \sum_{i \in A} \sum_{j \in B} \int_{-\infty}^{\infty} \left(\frac{1}{\pi\sigma^2}\right)^{d} \exp\left(-\frac{\|I(i) - c\|^2}{\sigma^2}\right) \exp\left(-\frac{\|I(j) - c\|^2}{\sigma^2}\right) \mathrm{d}c \\
&= (2\pi\sigma^2)^{d/2} \int_{-\infty}^{\infty} \Big(\sum_{i \in A} K_\sigma(I(i) - c)\Big) \Big(\sum_{j \in B} K_\sigma(I(j) - c)\Big)\, \mathrm{d}c \\
&= (2\pi\sigma^2)^{d/2}\, |A|\,|B| \int_{-\infty}^{\infty} g_A(c)\, g_B(c)\, \mathrm{d}c \\
&= (2\pi\sigma^2)^{d/2}\, |A|\,|B|\, \langle g_A, g_B \rangle.
\end{aligned} \tag{8}
$$

Here $K_\beta(\cdot)$ is a Parzen window of bandwidth $\beta$; matching the second line of the derivation, $K_\beta(x) = (\pi\beta^2)^{-d/2} \exp(-\|x\|^2/\beta^2)$.

The proposition above is related to the Laplacian PDF Distance in [5]. It is also related to the work in [6] where a different graph construction is used to define global appearance models.

The weight of a cut in $G_{\mathrm{data}}$ will be minimized when the pixel values in the two regions have complementary support. Although this intuitively makes sense, the measure encourages regions to be unbalanced in size due to the term $|A||B|$ multiplying $\langle g_A, g_B \rangle$.

In order to derive an expression for $\mathrm{NCut}(A, B \mid G_{\mathrm{data}})$, we first use reasoning similar to the proposition above to note that $\mathrm{Vol}(S \mid G_{\mathrm{data}}) = (2\pi\sigma^2)^{d/2}\, |S|\,|V|\, \langle g_S, g_V \rangle$. Substituting this expression and Proposition 1 into equation (2), the factors $(2\pi\sigma^2)^{d/2}$, $|A|$, $|B|$ and $|V|$ cancel, and we obtain the following result.

Proposition 2. $\mathrm{NCut}(A, B \mid G_{\mathrm{data}}) = \dfrac{\langle g_V, g_V \rangle\, \langle g_A, g_B \rangle}{\langle g_A, g_V \rangle\, \langle g_B, g_V \rangle}$.

This criterion is minimized when the distributions $g_A$ and $g_B$ have little overlap and both have significant overlap with $g_V$. In particular it penalizes solutions where one region does not represent a significant amount of the image data.

The normalized cut values in $G_{\mathrm{grid}}$ and $G_{\mathrm{data}}$ provide complementary measures for image segmentation. To combine the spatial and appearance cues we use a convex combination,

$$\mathrm{MixNCut}(A, B) = (1 - \lambda)\, \mathrm{NCut}(A, B \mid G_{\mathrm{data}}) + \lambda\, \mathrm{NCut}(A, B \mid G_{\mathrm{grid}}). \tag{9}$$

The parameter $\lambda \in [0, 1]$ controls the relative importance of the two normalized cut measures. We interpret $\mathrm{MixNCut}(A, B)$ as a mixture of an appearance and a spatial term,

$$\mathrm{MixNCut}(A, B) \approx (1 - \lambda) \left(\frac{\langle g_V, g_V \rangle\, \langle g_A, g_B \rangle}{\langle g_A, g_V \rangle\, \langle g_B, g_V \rangle}\right) + \lambda \left(\frac{|V|\, \mathrm{Len}(\Gamma)}{|A|\,|B|}\right). \tag{10}$$

The first term encourages a partition of the image into regions with dissimilar color distributions, while the second term encourages a spatially coherent partition. Both terms are normalized and avoid biases towards solutions with small regions. Note that each term is normalized in a particular way that is natural and has appropriate dimensions for the individual measures.

4 Segmentation Algorithm
Let $G_1$ and $G_2$ be two weighted graphs. Now we describe a spectral method for optimizing a convex combination of two normalized cut values,

$$\mathrm{MixNCut}(A, B \mid G_1, G_2) = (1 - \lambda)\, \mathrm{NCut}(A, B \mid G_1) + \lambda\, \mathrm{NCut}(A, B \mid G_2). \tag{11}$$

The approach is based on the Markov chain and conductance interpretation of normalized cuts ([1, 2]). Let $W_1$ and $W_2$ be the weighted adjacency matrices of the two graphs and let $D_1$ and $D_2$ be the corresponding diagonal degree matrices. Let

$$P_1 = D_1^{-1} W_1, \qquad P_2 = D_2^{-1} W_2, \qquad P = (1 - \lambda) P_1 + \lambda P_2. \tag{12}$$

The matrices $P_1$ and $P_2$ define two Markov chains on $V$. The matrix $P$ also defines a Markov chain on $V$ where in one step we follow $P_1$ with probability $(1 - \lambda)$ and $P_2$ with probability $\lambda$. We compute the eigenvector of $P$ with the second largest eigenvalue to find a cut $(A, B)$ with small conductance.

In our experiments, we use a Lanczos process to compute this eigenvector. We use $k$-means with $k = 2$ to cluster the entries of the eigenvector into 2 clusters.

When the matrix $P$ is sparse we can compute the required eigenvector much more quickly. The grid $G_{\mathrm{grid}}$ is sparse but $G_{\mathrm{data}}$ is dense. We sparsify the graph using a random sampling approach.

The approach described here is complementary to other methods that have been used to speed up the computation of eigenvectors for clustering. One such method is based on the Nystrom approximation [7]. Another approach involves power iteration [8].

Let $G$ be a weighted graph. To construct a sparse graph $G'$ we independently sample $m$ edges (with replacement) from $G$, with probabilities proportional to the edge weights. The weight of each sampled edge is set to 1 (adding up weights if there is repetition). With this approach the expected value of a cut $(A, B)$ in $G'$ equals the value of the cut in $G$ up to a scaling factor of $(m / \mathrm{Vol}(V \mid G))$. Moreover, if $m$ is sufficiently large then with high probability every cut in $G'$ has weight close to the cut value in $G$ (up to a scaling factor of $(m / \mathrm{Vol}(V \mid G))$) (see, e.g., [9]).

To implement this approach efficiently for $G_{\mathrm{data}}$ we need to sample edges with probability proportional to their weights $w(i, j)$ without enumerating all possible edges.
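For an image small enough that the dense appearance graph fits in memory, the spectral step of equation (12) can be tried directly, without any sparsification. The following Python sketch is an illustration under several assumptions and is not the authors' MATLAB implementation: the function names are hypothetical, pixel appearance is a single gray value, a dense eigensolver (`numpy.linalg.eig`) stands in for the Lanczos process, and a small one-dimensional 2-means routine stands in for a library $k$-means.

```python
import numpy as np

def grid_adjacency(h, w):
    """Unit-weight adjacency matrix of the 4-connected pixel grid (G_grid)."""
    idx = np.arange(h * w).reshape(h, w)
    W = np.zeros((h * w, h * w))
    W[idx[:, :-1].ravel(), idx[:, 1:].ravel()] = 1.0   # right neighbors
    W[idx[:-1, :].ravel(), idx[1:, :].ravel()] = 1.0   # bottom neighbors
    return W + W.T

def data_adjacency(values, sigma):
    """Dense appearance graph (G_data) for scalar pixel values, equation (6)."""
    diff = values[:, None] - values[None, :]
    return np.exp(-diff ** 2 / (2.0 * sigma ** 2))

def transition(W):
    """Row-normalize W to obtain the transition matrix D^{-1} W."""
    return W / W.sum(axis=1, keepdims=True)

def two_means_1d(x, iters=20):
    """Tiny 2-means on scalars, initialized at the min and max (assumption)."""
    centers = np.array([x.min(), x.max()])
    for _ in range(iters):
        labels = np.abs(x[:, None] - centers[None, :]).argmin(axis=1)
        centers = np.array([x[labels == k].mean() for k in (0, 1)])
    return labels

def mix_ncut_segment(image, lam, sigma):
    """Cluster the eigenvector of P = (1 - lam) P_data + lam P_grid."""
    h, w = image.shape
    P = (1.0 - lam) * transition(data_adjacency(image.ravel(), sigma)) \
        + lam * transition(grid_adjacency(h, w))
    vals, vecs = np.linalg.eig(P)           # P is not symmetric in general
    second = np.argsort(-vals.real)[1]      # second largest eigenvalue
    x = vecs[:, second].real
    return two_means_1d(x).reshape(h, w)

# Hypothetical 4 x 6 test image: dark left half, bright right half.
image = np.full((4, 6), 0.2)
image[:, 3:] = 0.8
labels = mix_ncut_segment(image, lam=0.5, sigma=0.1)
print(labels)
```

The dense matrix has $|V|^2$ entries, so this direct construction only works for very small images.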
We use an importance sampling method as a practical alternative.

First, partition $V$ into $L$ ($\approx$ in practice) sets $S_1, \ldots, S_L$ with low appearance variance. We do this greedily, starting with a single set and repeatedly partitioning the set with highest variance into two using the $k$-means algorithm. Let $m_i$ be the mean appearance of pixels in $S_i$ and

$$q(a, b) = |S_a|\,|S_b| \exp\left(-\frac{\|m_a - m_b\|^2}{2\sigma^2}\right). \tag{13}$$

To sample an edge for $G'$ first select a random pair $S_a$ and $S_b$ with probability proportional to $q(a, b)$. Then select $i \in S_a$ and $j \in S_b$ uniformly at random. Finally, add the edge $\{i, j\}$ to $G'$ with weight $w'(i, j) = |S_a|\,|S_b|\, w(i, j) / q(a, b)$.

For the experiments with
MixNCut we use the graph sparsification method described above, where the number of sampled edges used to sparsify $G_{\mathrm{data}}$ was set to $m = 2|V|$.

We compare our new segmentation method with the original normalized cut formulation NCut using the graph $H$ described in section 2.2. We sparsify this graph to scale the eigenvector computation to large images. Again, we accomplish this using an importance sampling approach.

Let $H$ be the graph with weights defined by equation (5). To sample one edge from $H$, first select a pixel $i$ uniformly at random. Then, draw a location $x$ from a Normal distribution centered at $X(i)$ with standard deviation $\sigma_X$ and select the pixel $j$ closest to that location. We add the edge $\{i, j\}$ to $G'$ with weight $w'(i, j) = \exp\left(-\|I(i) - I(j)\|^2 / 2\sigma_I^2\right)$. We repeat this process $m$ times. In the following experiments we used $m = 100|V|$ to sparsify $H$.

We tested our method on real images from a variety of datasets including the Berkeley Segmentation Dataset [10], the Plant Seedlings Dataset [11], the Grabcut dataset [12], the PASCAL VOC dataset [13] and a Scanning Electron Microscope (SEM) dataset [14]. Figure 2 shows some of the results we obtained, comparing the original normalized cuts formulation with our new approach. We can see in these examples how the new approach can segment challenging images in a variety of settings, often outperforming the original normalized cuts formulation.

Figure 3 illustrates segmentation results using
MixNCut to partition an image into 3 regions. In this case we follow the approach suggested in [15] and [2], using $k$-means with $k = 3$ to cluster the pixels using the second and third largest eigenvectors of the transition matrix $P$ in equation (12).

For each example in these figures, we ran the algorithms using different parameter values (specified in the next section), and show the best result among the different runs.

For a quantitative evaluation we used images with Brodatz textures [16]. To generate input images, we mixed pairs of textures using different ground-truth segmentation patterns and resized the result to × pixels. We compare MixNCut to NCut and a version of NCut with "texture features", where we use the magnitudes of the response of 12 Gabor filters (3 wavelengths and 4 orientations) to define appearance vectors for each pixel. Figure 4 shows some of the input images and segmentation results. The new MixNCut method defined directly in terms of "raw" pixel values finds near optimal segmentations in all of these examples, outperforming both baselines.

To measure the accuracy of a segmentation we use the Jaccard Index [17], $J(S, Q) = |S \cap Q| / |S \cup Q|$. Let $(S, Q)$ be a ground-truth segmentation. We define the accuracy of a segmentation $(A, B)$ as

$$\mathrm{Jaccard} = \max\left(\frac{J(S, A) + J(Q, B)}{2},\ \frac{J(S, B) + J(Q, A)}{2}\right). \tag{14}$$

We use all pairings of the 10 textures in Figure 5 with three different ground-truth segmentations shown in Table 1 to generate three sets of images. We compute the mean accuracy of each method on each set of images using several parameter combinations ($\sigma_I \in \{\ ,\ ,\ \ldots,\ \}$ and $\sigma_X \in \{\ ,\ ,\ \ldots,\ \}$ for NCut; $\lambda \in \{\ .\ ,\ .\ ,\ .\ \}$ and $\sigma \in \{\ .\ ,\ ,\ ,\ \}$ for MixNCut). Table 1 summarizes the best mean accuracy obtained with each method on each set of inputs. The table also shows the average running time of each method. We see the new
MixNCut approach obtains near perfect accuracy (Jaccard ≈ ) on all ground-truth patterns, significantly outperforming the other methods. The algorithms were implemented in MATLAB and run on a computer with an Intel i5-6200U CPU @ 2.30GHz using 8 GB of RAM running Linux.

Table 1: Evaluation of different segmentation methods on textured images. The table summarizes accuracy and running time of each method on images with different ground-truth segmentations.

Ground truth     Method          Jaccard   Time (s)
First pattern    NCut            0.91      10.37
First pattern    NCut + Gabor    0.94      12.78
First pattern    MixNCut         0.96      9.09
Second pattern   NCut            0.57      10.91
Second pattern   NCut + Gabor    0.77      12.95
Second pattern   MixNCut         0.93      6.61
Third pattern    NCut            0.55      10.37
Third pattern    NCut + Gabor    0.70      14.12
Third pattern    MixNCut         0.        7.15

(a) Original Image (b) NCut (c) MixNCut
Figure 2: Segmentation results comparing NCut and MixNCut on real images. Column (a) shows the input images. Column (b) shows the eigenvector found by the original NCut formulation on the left and the segmentation result on the right. Column (c) shows the eigenvector found by the new MixNCut formulation on the left and the segmentation result on the right.

Figure 3: Segmentation results using the proposed method for images with more than 2 regions.

(a) Image (b) NCut (c) NCut + Gabor (d) MixNCut

Figure 4: Comparing NCut, NCut + Gabor, and MixNCut on textured images. Column (a) shows the input images. Column (b) shows the eigenvector found by the original NCut formulation on the left and the segmentation result on the right. Column (c) shows the eigenvector found by NCut with Gabor features on the left and the segmentation result on the right. Column (d) shows the eigenvector found by the new MixNCut formulation on the left and the segmentation result on the right.

Figure 5: Brodatz patterns used in the synthetic experiments.
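The accuracy measure of equation (14) takes the better of the two possible matchings between predicted and ground-truth regions, so it does not depend on which region is labeled first. A minimal Python sketch, assuming each segmentation is given as a boolean mask whose complement is the second region and that both ground-truth regions are non-empty (the function names are illustrative and do not come from the original implementation):

```python
import numpy as np

def jaccard(s, q):
    """Jaccard index |S ∩ Q| / |S ∪ Q| of two boolean masks."""
    return np.logical_and(s, q).sum() / np.logical_or(s, q).sum()

def segmentation_accuracy(truth, pred):
    """Equation (14) for two-region segmentations (mask, ~mask)."""
    S, Q = truth, ~truth
    A, B = pred, ~pred
    direct = (jaccard(S, A) + jaccard(Q, B)) / 2
    swapped = (jaccard(S, B) + jaccard(Q, A)) / 2
    return max(direct, swapped)

# Hypothetical example: labels are swapped and one pixel is wrong.
truth = np.zeros((4, 4), dtype=bool)
truth[:, :2] = True          # ground truth: left half vs right half
pred = ~truth
pred[0, 0] = True            # a single misclassified pixel
accuracy = segmentation_accuracy(truth, pred)
print(accuracy)
```

In this example the swapped matching is selected, and the single wrong pixel lowers both Jaccard terms slightly below 1.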
We introduced a new spectral method for image segmentation that can segment challenging images while working directly with "raw" pixel values, without any pre-processing or filtering. The approach is based on a novel combination of appearance and spatial grouping cues using two different graphs. We use a dense graph to capture appearance grouping cues. This leads to non-parametric models of region appearance. We also describe a technique that can be used to sparsify the resulting graph to ease the computational burden of spectral segmentation. Our results show that long range interactions can capture the appearance of complex regions and significantly improve the performance of graph-based segmentation methods. The proposed method is practical and it can be applied to different types of images (natural, biomedical, textures, etc.).

Broader Impact
Image segmentation has a variety of applications that can benefit all of society. For example, segmentation methods may enable advances in biomedical image analysis (including for medical diagnosis and treatment), tele-conferencing technology, human-computer-interaction, remote sensing, and robotics. However, there are also potential uses with questionable ethics, including mass surveillance and military applications.
References

[1] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):888–905, 2000.
[2] Marina Meila and Jianbo Shi. Learning segmentation by random walks. In Advances in Neural Information Processing Systems, pages 873–879, 2001.
[3] Yuri Boykov and Vladimir Kolmogorov. Computing geodesics and minimal surfaces via graph cuts. In IEEE International Conference on Computer Vision, 2003.
[4] David Mumford and Jayant Shah. Optimal approximations by piecewise smooth functions and associated variational problems. Communications on Pure and Applied Mathematics, 42(5):577–685, 1989.
[5] Robert Jenssen, Deniz Erdogmus, Jose Principe, and Torbjorn Eltoft. The Laplacian PDF distance: A cost function for clustering in a kernel feature space. In Advances in Neural Information Processing Systems, pages 625–632, 2005.
[6] Meng Tang, Lena Gorelick, Olga Veksler, and Yuri Boykov. Grabcut in one cut. In IEEE International Conference on Computer Vision, pages 1769–1776, 2013.
[7] Charless Fowlkes, Serge Belongie, Fan Chung, and Jitendra Malik. Spectral grouping using the Nystrom method. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(2):214–225, 2004.
[8] Frank Lin and William W Cohen. Power iteration clustering. In International Conference on Machine Learning, 2010.
[9] David P Williamson and David B Shmoys. The design of approximation algorithms. Cambridge University Press, 2011.
[10] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its application to evaluating segmentation algorithms and measuring ecological statistics. In International Conference on Computer Vision, volume 2, pages 416–423, July 2001.
[11] Thomas Mosgaard Giselsson, Rasmus Nyholm Jørgensen, Peter Kryger Jensen, Mads Dyrmann, and Henrik Skov Midtiby. A public image database for benchmark of plant seedling classification algorithms. arXiv preprint arXiv:1711.05458, 2017.
[12] Carsten Rother, Vladimir Kolmogorov, and Andrew Blake. "Grabcut": interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics (TOG), 23(3):309–314, 2004.
[13] Mark Everingham, S.M. Ali Eslami, Luc Van Gool, Christopher K.I. Williams, John Winn, and Andrew Zisserman. The PASCAL visual object classes challenge: A retrospective. International Journal of Computer Vision, 111(1):98–136, 2015.
[14] Rossella Aversa, Mohammad Hadi Modarres, Stefano Cozzini, and Regina Ciancio. NFFA-EUROPE - SEM dataset, 2018. URL https://b2share.eudat.eu/records/80df8606fcdb4b2bae1656f0dc6db8ba.
[15] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.
[16] Phil Brodatz. Textures: a photographic album for artists and designers. Dover Pubns, 1966.
[17] Paul Jaccard. Étude comparative de la distribution florale dans une portion des alpes et des jura.