A Multi-Layer Approach to Superpixel-based Higher-order Conditional Random Field for Semantic Image Segmentation
Li Sulimowicz*, Ishfaq Ahmad*, Alexander Aved†

* Department of Computer Science and Engineering, University of Texas at Arlington, TX, USA
† Air Force Research Laboratory, Rome, NY, USA
{li.yin@mavs, iahmad@cse}.uta.edu*, [email protected]

ABSTRACT
Superpixel-based higher-order conditional random fields (SP-HO-CRFs) are known for their effectiveness in enforcing both short- and long-range spatial contiguity for pixelwise labelling in computer vision. However, their higher-order potentials are usually too complex to learn and often incur a high computational cost during inference. We propose a new approximation approach to SP-HO-CRFs that resolves these problems. Our approach is a multi-layer CRF framework that inherits the simplicity of pairwise CRFs by formulating both the higher-order and pairwise cues as the same pairwise potentials in the first layer. Essentially, this approach enhances accuracy on top of pairwise CRFs without training, by reusing their pre-trained parameters and/or weights. The proposed multi-layer approach performs especially well in delineating the boundary details (borders) of object categories such as "trees" and "bushes". Multiple sets of experiments conducted on the MSRC-21 and PASCAL VOC 2012 datasets validate the effectiveness and efficiency of the proposed methods.

Index Terms — Superpixel-based higher-order CRFs, multi-layer CRFs, semantic segmentation.
1. INTRODUCTION
Over the last few years, convolutional neural networks (CNNs) have been highly successful in a variety of computer vision tasks such as image recognition and semantic segmentation. However, due to the loss of resolution and position information, CNNs often produce "blobby" output with coarse boundaries. Thus, probabilistic graphical models such as conditional random fields (CRFs) and higher-order CRFs (HO-CRFs) have been widely used together with CNNs [1, 2, 3, 4, 5] for semantic image segmentation to enforce smoothness constraints that convert the output of CNNs from "blobby" to "sharp." Specifically, HO-CRFs [6, 7, 8, 4] have been demonstrated to be highly effective in enforcing long-range or even global connections; for instance, region-level appearance consistency, co-occurrence of objects, and global connectivity.
Fig. 1: One example of the comparative result of the proposed method with CRF-RNN, where the proposed method improves accuracy by getting rid of small spurious regions.

We observed two main problems with traditional region-based HO-CRFs (or superpixel-based HO-CRFs); a superpixel is a group of pixels with high appearance similarity, generated by segmentation algorithms [9, 10]. First, the higher-order terms, which are defined over cliques consisting of more than two nodes, are computationally complex in both the learning and inference processes; for instance, transformation-based methods, mean-field approximation, and dual decomposition. Second, the traditional higher-order terms [4, 6] need to be learned on a large training set, which requires a long training time before obtaining effective and workable higher-order potentials.

To resolve these two problems, we propose an approximation method with a multi-layer framework, which has two layers, each of which is itself a pairwise CRF. Essentially, the first layer is called Segmentation as Input (SaI), and it works by exploiting the color-sensitive feature of most pairwise potentials. SaI uses a segmented image, wherein each pixel takes the averaged RGB value of the superpixel it belongs to. SaI combines the superpixel-based cues into pairwise potentials and at the same time partially preserves the pixelwise regulation. As a result, SaI itself can work alone as an SP-HO-CRF. SaI provides the following two benefits: 1) it decreases the computational complexity to the same level as that of pairwise CRFs; 2) it reuses the parameters and/or weights from pre-trained pairwise CRFs, so that no training, or just a simple grid search, is required to boost the accuracy. However, SaI slightly suffers from pairwise potential information loss. The multi-layer framework, with a pairwise CRF as the second layer, compensates for this loss and further boosts the accuracy. We call this approximation approach the SaI-based Multi-layer CRF (SM-CRF).
SM-CRF preserves accurate boundaries for meshy objects such as trees and bushes. We applied SaI and SM-CRF to DenseCRF [11] and CRF-RNN [2], and evaluated them on the MSRC-21 and PASCAL VOC 2012 [12] datasets, respectively. The experimental results show that SaI decreases the error rate compared to the baseline models with only simple grid-search training. One example is shown in Fig. 1.

This paper is organized as follows: Section 2 provides preliminaries related to pairwise CRFs and SP-HO-CRFs. Section 3 gives details of our two main contributions, SaI and SM-CRF. Then, the experiments and related results are given in Section 4. Finally, we provide concluding remarks in Section 5.
2. PRELIMINARIES
We start with preliminaries about CRFs and higher-order CRFs and introduce the notation used in this paper.
Define two random fields D and X. D is a vector of random variables {D_1, ..., D_N}, wherein D_i is associated with pixel i in the observed image. X is a vector of random variables conditioned on D, which take values from a pre-defined label set L = {l_1, ..., l_L}. Let G = (V, E) be a graph over X. The conditional random field (D, X) is characterized by a Gibbs distribution

P(X | D) = (1 / Z(D)) exp(−E(X | D)),

where E(X | D) = Σ_{c ∈ C_g} φ_c(X_c | D) is called the Gibbs energy and Z(D) is the normalization constant known as the partition function. Here C_g is the set of cliques in G. The maximum a posteriori labeling of the random field is x* = argmin_{x ∈ L^N} E(X | D).

Conventionally, for superpixel-based higher-order CRFs [6, 4], the Gibbs energy is as follows:

E(X | D) = Σ_{i ∈ V} ψ^U_i(x_i) + Σ_{(i,j) ∈ E} ψ^P_{ij}(x_i, x_j) + Σ_{s ∈ S} ψ^{SP}_s(X_s),   (1)

where the Gibbs energy consists of sums of unary, pairwise, and higher-order potentials, and S refers to a set of image segments.

Gaussian Pairwise Potentials in DenseCRFs [11, 14, 2, 4, 3] take the form

ψ^P_{ij}(x_i, x_j) = μ(x_i, x_j) Σ_{m=1}^{K} ω^{(m)} κ^{(m)}(f_i, f_j),   (2)

where μ is a label compatibility function, with the Potts model μ(x_i, x_j) = 1[x_i ≠ x_j]; κ(f_i, f_j) = Σ_{m=1}^{K} ω^{(m)} κ^{(m)}(f_i, f_j) is a linear combination of Gaussian kernels; and f_i is the feature vector of pixel i. For multi-class image labelling, DenseCRF [11] uses contrast-sensitive two-kernel potentials defined in terms of the color and position vectors I and P:

κ(f_i, f_j) = ω^{(1)} exp(−|P_i − P_j|² / (2θ_α²) − |I_i − I_j|² / (2θ_β²))   [appearance kernel]
            + ω^{(2)} exp(−|P_i − P_j|² / (2θ_γ²)).   [smoothness kernel]   (3)

Region-based Higher-order Potentials.
Region-based potentials mainly use the P^N Potts model. In Dense+Potts [8] and H-CRF-RNN [4], each clique is a superpixel consisting of all the pixels inside it. The higher-order potentials with P^N-Potts-type energy [15] are shown in Eq. 4:

ψ^{SP}_s(X_s = x_s) = ω_Low(l) if x_i = l for all i ∈ s; ω_High otherwise,   (4)

where x_s corresponds to all the pixels within the s-th superpixel, and ω_Low(l) < ω_High for all l. The costs ω_Low(l) and ω_High are weights that need to be learned.

The Robust P^N model is defined as

ψ^{SP}_c(X_c = x_c) = min_{l ∈ L} min(γ^max_c, γ^l_c + k^l_c N^l_c),   (5)

where N^l_c = Σ_{i ∈ c} δ(x_i ≠ l) is the number of pixels in the clique c inconsistent with the label l.
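As a concrete illustration, the Robust P^N cost of Eq. 5 can be sketched in a few lines of Python; the parameters γ_l, k_l, and γ_max below are illustrative values chosen for the example, not the learned ones from any of the cited models.

```python
import numpy as np

# Sketch of the Robust P^N potential of Eq. 5: the clique cost grows with the
# number of pixels N_l that disagree with label l, truncated at gamma_max.
# gamma, k, and gamma_max are illustrative parameters, not learned values.
def robust_pn(labels, num_labels, gamma, k, gamma_max):
    labels = np.asarray(labels)
    costs = [gamma[l] + k[l] * np.sum(labels != l) for l in range(num_labels)]
    return float(min(gamma_max, min(costs)))

# A fully consistent clique pays only gamma[l]; a mixed clique pays more,
# capped at gamma_max.
print(robust_pn([1, 1, 1, 1], 2, gamma=[0.5, 0.5], k=[0.3, 0.3], gamma_max=2.0))  # 0.5
print(robust_pn([0, 1, 0, 1], 2, gamma=[0.5, 0.5], k=[0.3, 0.3], gamma_max=2.0))  # 1.1
```

The truncation at γ_max is what makes the model "robust": once enough pixels disagree, the clique no longer penalizes further inconsistency, so a minority of differently labeled pixels inside a superpixel is tolerated.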
3. THE PROPOSED METHODS
First, in Section 3.1, we present two important observations related to the higher-order and pairwise potentials. Then, in Section 3.2, we describe the first layer of our approximation approach, SaI. In Section 3.3, we explain how the multi-layer approximation approach SM-CRF works. Section 3.4 provides the details of the application of SaI and SM-CRF to DenseCRF [16] and CRF-RNN [2].
We make the following two important observations on the region-based higher-order potentials and the pairwise potentials:

(1) The higher-order potential shown in Eq. 5 can be reformulated as the minimization of a secondary pairwise CRF between an auxiliary variable y_c and the pixels within the higher-order clique, where y_c takes values from the extended label set L_E = L ∪ {l_F}, and the observed data are all the pixels in the higher-order clique:

ψ^{SP}_c(X_c) = min_{y_c ∈ L_E} ( φ_c(y_c) + Σ_{i ∈ c} φ_c(y_c, x_i) ).   (6)

This conversion of the higher-order potential suggests that the higher-order regulation eventually "flows" down onto pairwise cliques.

Fig. 2: Examples of the segmented image D_s from BGPS [9]: (a) original image; (b) D_s with BGPS [9].

(2) When color brightness is used as a feature vector in the pairwise potentials, the value of the potential is inversely proportional to the absolute color-brightness difference of the two pixels. This color-sensitive feature makes the pairwise potentials attain their maximum value at I_i = I_j, as shown in Eq. 7, which encourages pixels with high resemblance to take the same label, and vice versa:

(I_i, I_j)* = argmax_{I_i, I_j} ψ^P_{ij}(x_i, x_j; θ, F_{∖I}) = (I_i = I_j),   (7)

where F_{∖I} represents all feature vectors except I, e.g. P, and θ denotes all the hyperparameters and weights of this model.
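Observation (2) can be checked numerically with the two-kernel form of Eq. 3. The sketch below uses illustrative weight and bandwidth values, not the trained parameters of any model in this paper.

```python
import numpy as np

# Contrast-sensitive two-kernel pairwise term in the DenseCRF form of Eq. 3.
# The weights and bandwidths are illustrative values, not trained parameters.
def pairwise_kernel(p_i, p_j, I_i, I_j, w1=5.0, w2=3.0,
                    theta_alpha=80.0, theta_beta=13.0, theta_gamma=3.0):
    pos = np.sum((np.asarray(p_i, float) - np.asarray(p_j, float)) ** 2)
    col = np.sum((np.asarray(I_i, float) - np.asarray(I_j, float)) ** 2)
    appearance = w1 * np.exp(-pos / (2 * theta_alpha**2) - col / (2 * theta_beta**2))
    smoothness = w2 * np.exp(-pos / (2 * theta_gamma**2))
    return float(appearance + smoothness)

# For a fixed pixel distance, the kernel is maximal when the colors are equal
# (I_i = I_j) and decays as the color difference grows.
same = pairwise_kernel((0, 0), (1, 0), (120, 80, 40), (120, 80, 40))
diff = pairwise_kernel((0, 0), (1, 0), (120, 80, 40), (200, 80, 40))
print(same > diff)  # True
```

This is exactly the property SaI exploits in the next section: pixels that are given identical colors receive the maximal pairwise affinity.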
Inspired by observation (1), we propose the Segmentation as Input (SaI) method to form the higher-order cues directly on pairwise potentials, resolving the computational complexity faced by traditional higher-order CRFs in inference and learning. Suppose we form a pairwise graph on the segmentation of a given image. For a pairwise clique: if it lies inside one superpixel, we name the edge an intra-edge and the potential an intra-potential, ψ_intra(x_i, x_j); if it crosses two different superpixels, we name the edge an extra-edge and the potential an extra-potential, ψ_extra(x_i, x_j). Then it is reasonable to require

ψ_intra(x_i, x_j) ≥ ψ_extra(x_i, x_j), for x_i ≠ x_j,   (8)

so that label consistency within segments is enforced. SaI is proposed to satisfy this scheme and thereby approximate both the higher-order and pairwise potentials.

Segmentation as Input.
In SaI, we pre-process the original RGB images from the dataset with unsupervised segmentation; in this paper, we used mean-shift [18] and BGPS [9], as shown in Fig. 2. We use s_i and s_j as the segment indexes of pixels i and j, respectively. We then store a segmented image wherein each pixel i takes the average RGB value C_{s_i} of the superpixel it belongs to; we denote this segmented image as D_s. The SaI-based pairwise potential is denoted ψ^{SaI}_{ij}(x_i, x_j; D_s). For example, if we formulate our SaI potential as ψ^{SaI}_{ij}(x_i, x_j) = μ(x_i, x_j)(θ_p + θ_v exp(−θ_β |C_{s_i} − C_{s_j}|)), then the intra-potentials are ψ^{SaI}_intra(x_i, x_j) = μ(x_i, x_j)(θ_p + θ_v), for s_i = s_j. According to observation (2), the intra-potential is the maximum, which makes Eq. 8 satisfied.

Relation to the Robust P^N Potts Potential.
Inside the pairwise graph built on X and observed on D_s, pairwise edges can be divided into two groups, the intra-edges, denoted E_intra, and the extra-edges, denoted E_extra, with E = E_intra ∪ E_extra. The sum of all intra-potentials can be further decomposed as follows:

Σ_{(i,j) ∈ E_intra} ψ^{SaI}(x_i, x_j) = Σ_{c ∈ S} Σ_{i,j ∈ c} ψ^{SaI}_intra(x_i, x_j) = Σ_{c ∈ S} N_c(X_s)(θ_p + θ_v),   (9)

wherein N_c(X_s) is the number of pairwise edges in the clique c that take inconsistent labels. This equation shows the equivalence of the sum of intra-potentials in the pairwise graph to the Robust P^N Potts potential in Eq. 5. The sum of all extra-potentials functions as a regulator between segments.

The SaI-based pairwise CRF takes the following Gibbs energy function:

E(X | D_s) = Σ_{i ∈ V} ψ^U_i(x_i) + Σ_{(i,j) ∈ E} ψ^{SaI}_{ij}(x_i, x_j).   (10)

The benefit of this model is that its computational complexity is the same as that of the original pairwise CRFs in both the learning and inference processes. However, SaI does slightly suffer from pairwise regulation loss. This is because, when s_i ≠ s_j, we only have |C_{s_i} − C_{s_j}| ≈ |I_i − I_j| compared with D, and a gap between the two sides of this relation exists. This gap depends on the resulting segments: the smaller each superpixel is, the closer the two sides become. The proposed SaI-based Multi-layer CRF framework (SM-CRF) incorporates another pairwise layer after SaI to compensate for this drawback.

Fig. 3: SM-CRF framework. In the middle red box is the first SaI layer. We observe a performance improvement when comparing the output of SM-CRF, noted in the red box on the right side, with the output of the original CRF.

The framework of the SaI-based Multi-layer CRF (SM-CRF) is shown in Fig. 3.
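Returning to the SaI layer itself, the pre-processing that forms D_s (replacing each pixel by the mean color C_s of its superpixel) can be sketched as follows; the superpixel label map is assumed to come from an external unsupervised segmenter such as mean-shift or BGPS.

```python
import numpy as np

# Build the SaI observation D_s: every pixel takes the average RGB value C_s of
# the superpixel it belongs to. `superpixels` is an integer label map assumed
# to come from an unsupervised segmenter (mean-shift, BGPS, ...).
def segmentation_as_input(image, superpixels):
    image = np.asarray(image, dtype=float)
    d_s = np.empty_like(image)
    for s in np.unique(superpixels):
        mask = superpixels == s
        d_s[mask] = image[mask].mean(axis=0)  # C_s, the segment's mean color
    return d_s

# Toy 2x2 image with two vertical superpixels (left column 0, right column 1):
# the left column's red channel averages to 20, the right column's to 150.
img = np.array([[[10, 0, 0], [100, 0, 0]],
                [[30, 0, 0], [200, 0, 0]]])
sp = np.array([[0, 1], [0, 1]])
d_s = segmentation_as_input(img, sp)
```

Because all pixels inside a superpixel now share one color, any color-sensitive pairwise potential evaluated on D_s automatically assigns its maximal affinity to intra-edges, which is how Eq. 8 is satisfied without any new potential terms.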
SM-CRF is a two-layer pairwise CRF wherein the first layer is SaI, which can be looked upon as a higher-order potential term, and the second layer is a pairwise CRF that instead takes the original RGB image as input. The Gibbs energy functions of the two layers are shown in Eq. 10 and Eq. 11, respectively. The initial unary probability label map is U(x), the output unary map from the SaI layer is denoted U^{(1)}_SaI(x), and the unary map input to the second layer is denoted U^{(1)}(x). We set U^{(1)}(x) = α U^{(1)}_SaI(x) + β U(x), with α + β = 1; this is trained with grid search.

E(X | D)^{(2)} = Σ_{i ∈ V} ψ^{U^{(1)}}(x_i) + Σ_{(i,j) ∈ E} ψ^P(x_i, x_j).   (11)

This multi-layer structure helps the first SaI layer incorporate more effective pairwise regulations. The result is higher accuracy and better performance at preserving the details of the boundaries.
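The unary-map blending U^{(1)} = α U^{(1)}_SaI + β U (with β = 1 − α) and the grid search over α can be sketched as below; the `score` callback standing in for validation accuracy is a hypothetical helper, not part of the paper's pipeline.

```python
import numpy as np

# Blend the initial unary map with the SaI layer's output unary map,
# U1 = alpha * U_SaI + (1 - alpha) * U, as used by the second SM-CRF layer.
def blend_unaries(u_init, u_sai, alpha):
    return alpha * u_sai + (1.0 - alpha) * u_init

# Pick alpha by exhaustive grid search; `score` is a hypothetical callback
# returning validation accuracy for a blended unary map.
def grid_search_alpha(u_init, u_sai, score, step=0.1):
    candidates = np.arange(0.0, 1.0 + 1e-9, step)
    return float(max(candidates, key=lambda a: score(blend_unaries(u_init, u_sai, a))))

# Toy check: with a score peaked at alpha = 0.3, the search recovers 0.3.
u0, u1 = np.zeros(4), np.ones(4)
best = grid_search_alpha(u0, u1, lambda u: -np.abs(u - 0.3).sum())
```

Since α is a single scalar constrained by α + β = 1, this one-dimensional search is cheap compared with retraining any CRF parameters, which is the point of the layer-wise design.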
We focus on two basic models: DenseCRF [16, 19] and CRF-RNN [2]. The former represents a simple Potts model, with μ(x_i, x_j) = 1[x_i ≠ x_j], used as a post-processing tool after a classifier; the latter is a more advanced model wherein the CRF is formulated as a stack of recurrent neural networks trained end-to-end with the classifier CNN. In DenseCRF, we use the setting ω^{(m)}, μ ∈ R, while in CRF-RNN ω^{(m)}, μ ∈ R^{L×L} and are learned on a large dataset. To update these two basic models to incorporate higher-order cues, we set our SaI-based pairwise potentials to the same formulation as DenseCRF, shown in Eqs. 2 and 3. In the case that we do not have a pre-trained basic model: (1) the learning and inference of SaI are the same as in the basic models, and (2) SM-CRF is learned in a layer-by-layer manner: first train SaI, then fix the parameters of the first layer to their trained optimum values and train the parameters introduced by adding the second layer. However, if there is already a pre-trained pairwise model, one can perform a simple upgrade to obtain a superpixel-based higher-order CRF.

Upgrade a Pre-trained Pairwise Model to the SaI Model.
As shown in DenseCRF [16, 19], ω^{(1)} and θ_α are the most important hyperparameters or weights. Thus, to keep the new model simple but effective, the SaI pairwise potentials reuse all the other hyperparameters or weights except ω^{(1)} and θ_α; these include the compatibility function μ, ω^{(2)}, θ_β, and θ_γ.

For DenseCRF.
If there is a trained pairwise model, to upgrade it to incorporate superpixel-level cues, we just need to do a simple grid search for the two most important parameters, ω^{(1)} and θ_α, in the SaI model.

For CRF-RNN.
Because in CRF-RNN ω^{(1)} is an L × L matrix learned on a large training dataset, we propose the relation ω^{(1)}_SaI = r ω^{(1)}, r ∈ (0, 1], to effectively reuse this parameter. Therefore, we can still use a simple grid search over r and θ_α to upgrade this pairwise model to the SaI model. With the upgrade, the new SaI-based pairwise potential takes the following form:

ψ^{SaI}_{i,j}(x_i, x_j) = μ(x_i, x_j) { r ω^{(1)} exp(−|P_i − P_j|² / (2θ_{α,SaI}²) − |C_{s_i} − C_{s_j}|² / (2θ_β²)) + ω^{(2)} exp(−|P_i − P_j|² / (2θ_γ²)) }.   (12)

Upgrade a Pre-trained Pairwise Model to the SM-CRF Model.
We need to train SM-CRF layer by layer. First, we perform a round of grid search to obtain ω^{(1)} and θ_α for the first SaI layer, with α and β fixed. Next, we obtain one more hyperparameter, α (with β = 1 − α), which connects the initial unary map and the unary output map of the SaI layer.
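The per-layer grid search described above can be sketched generically; the `score` callback, again a hypothetical stand-in for validation accuracy, would internally run CRF inference with the candidate parameters.

```python
import itertools

# Exhaustive grid search over two CRF hyperparameters (e.g. omega1 and
# theta_alpha for the SaI upgrade of DenseCRF, or r and theta_alpha for
# CRF-RNN). `score` is a hypothetical validation-accuracy callback.
def grid_search_2d(score, grid_a, grid_b):
    return max(itertools.product(grid_a, grid_b), key=lambda p: score(*p))

# Toy check with a score peaked at (0.4, 96): the search recovers that pair.
best = grid_search_2d(lambda r, t: -((r - 0.4) ** 2 + ((t - 96) / 100.0) ** 2),
                      [0.2, 0.4, 0.6, 0.8, 1.0], [80, 96, 112])
```

With only two free parameters per layer, the search cost is a handful of inference runs, which is what allows the upgrade to skip full retraining of the pairwise model.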
4. EXPERIMENTS

4.1. Experimental Setup
In our experiments, we evaluated our methods on two public datasets: MSRC-21 and PASCAL VOC 2012 [12]. MSRC-21 contains 591 RGB images with coarsely labeled ground truth over 21 object classes [20], which is called the Standard Ground Truth (SGT). Such coarse ground truth, with unlabeled regions around the object boundaries, makes it difficult to quantitatively evaluate an algorithm's performance, since the algorithms strive for pixel-level accuracy. Therefore, we used a subset of MSRC-21 images with pixel-level accurate labelling, denoted the Accurate Ground Truth (AGT). PASCAL VOC 2012 provides training and validation images covering 20 object classes. In our experiments, we evaluated performance on a reduced validation set (RVS) as used in [2, 4]. Three metrics are used for evaluating the proposed methods: global pixel accuracy (Global), averaged pixel accuracy (Average), and mean intersection over union (MeanIoU).

First, we segmented both SGT and AGT with mean-shift segmentation, with the spatial bandwidth hyper-parameter set to h_s = 7. The output of the segmentation is denoted D_s. Then we split AGT into halves, one as the training set and one as the testing set. All the models were tested on the testing set of AGT. We used the same unary potentials as the implementation of DenseCRF [11], which were derived from TextonBoost [20, 17]. Next, we generated two models based on DenseCRF: 1) SaI, which is the same as DenseCRF except that its observation is D_s instead of D; and 2) SM-CRF, which consists of SaI followed by a second layer, a pairwise CRF observing D. Here we set the comparatively less important hyperparameters to θ_β = 13, θ_γ = 3, and ω^{(2)} = 3.

The Effectiveness of Simple Grid Search.
We performed grid search for DenseCRF and SaI with two rounds of training. First, we set θ_α = 80 for both models to train ω^{(1)} (which was 6 for DenseCRF). Then, we fixed ω^{(1)} and trained to find θ_α for DenseCRF and SaI, respectively. For SM-CRF, an additional round of grid search is required for α and β; the training process is shown in Fig. 4. For the comparative model Dense+Potts [8], the higher-order potential needs to be trained on a large dataset to become workable. Here, we used SGT minus the testing set of AGT as its training set. After the higher-order potential was trained, we did a grid search to train its two hyperparameters. The testing result of each model is shown in Table 1. Our models' performance on the testing set demonstrates the effectiveness of learning with a simple grid search, without the additional round of training for the higher-order potentials needed in the Dense+Potts model.
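The three metrics reported in the tables can be computed from a class confusion matrix. The sketch below follows the standard definitions of these metrics; it is not the paper's evaluation code.

```python
import numpy as np

# Global pixel accuracy, class-averaged pixel accuracy, and mean IoU from a
# confusion matrix (rows = ground truth, columns = prediction). Assumes every
# class appears in the ground truth.
def segmentation_metrics(gt, pred, num_classes):
    conf = np.zeros((num_classes, num_classes))
    for g, p in zip(np.ravel(gt), np.ravel(pred)):
        conf[g, p] += 1
    tp = np.diag(conf)
    glob = tp.sum() / conf.sum()
    average = (tp / conf.sum(axis=1)).mean()
    mean_iou = (tp / (conf.sum(axis=1) + conf.sum(axis=0) - tp)).mean()
    return float(glob), float(average), float(mean_iou)

# Toy check on four pixels with one mislabeled ground-truth-0 pixel.
g, a, iou = segmentation_metrics([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)
print(g, a)  # 0.75 0.75
```

MeanIoU penalizes both missed pixels and false positives per class, which is why it is the most commonly reported of the three for class-imbalanced datasets such as PASCAL VOC.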
Table 1: Quantitative results on the MSRC-21 dataset (Global, Average, and IOU accuracy on the Accurate Ground Truth, and inference time) for Unary, DenseCRF [16], Dense+Potts [8], SaI, and SM-CRF.

From Table 1, SaI and SM-CRF both improve accuracy over DenseCRF on Global, Average, and MeanIoU. The model SaI consumed the same amount of time as DenseCRF, and for SM-CRF, the time is slightly more than double that of DenseCRF. Meanwhile, Dense+Potts gained an accuracy only equivalent to DenseCRF. This may be due to its trained higher-order term, which can recommend a new label based on its training; this can be both helpful and harmful. When the predicted label is right, it can be an advantage: in row two of Fig. 5, the "grass" marked in purple is more complete than in any of the other models. However, the new label might be wrong; especially when the initial unary prediction is already highly accurate, this can in turn make the accuracy worse. Rows three and five in Fig. 5 demonstrate such a defect, where we observe that the "cat" has totally disappeared. Also, we noticed that Dense+Potts performed worse than our models in preserving sharp object boundaries.
First, we segmented our dataset RVS with BGPS segmentation and saved the segmented images as D_s. The weights of both the CNN and CRF-RNN were tuned end-to-end in the released implementation (https://github.com/torrvision/crfasrnn). We evaluated two models based on CRF-RNN [2]: SaI-RNN, where we just changed the input from D to D_s, and SM-CRF-RNN, where we have two layers whose inputs are D_s and D, respectively.

Fig. 4: Grid search of parameter α for SM-CRF on MSRC-21, with β = 1 − α. When α = 1, SM-CRF degrades into a single layer of DenseCRF.

Fig. 5: Examples of qualitative results on the MSRC-21 dataset. From left to right: DenseCRF, SaI-CRF, SM-CRF, Dense+Potts, and GT. Rows one to five are successful examples and the last two rows are failure cases.

Table 2: Performance comparison (MeanIoU) of SaI-RNN with CRF-RNN [2] under different parameter settings: CRF-RNN uses μ_t, ω^{(m)}_t; SaI-1 uses untuned μ, ω^{(m)}; SaI-2 uses μ, ω^{(m)}_t; SaI-3 uses μ_t, ω^{(m)}_t.

Table 3: Performance comparison (Global, Average, MeanIoU, and retraining requirement) of our methods with CRF-RNN [2] and H-CRF-RNN [4]. H-CRF-RNN requires retraining, while SaI-RNN and SM-CRF-RNN require only grid search.

The Effectiveness of Parameter Reuse.
In SaI-RNN, we have three different settings to demonstrate the effectiveness of reusing the parameters trained by the baseline CRF-RNN [2], including μ and ω^{(m)} (ω^{(1)} and ω^{(2)}), which are L × L matrices. Here, we use μ_t and ω^{(m)}_t to denote the end-to-end tuned parameters from CRF-RNN. In the first setting, named SaI-1, we did not reuse any end-to-end tuned parameters; in the second, named SaI-2, we reused only ω^{(m)}_t; and in the third, named SaI-3, we reused both μ_t and ω^{(m)}_t. In all these settings, r was simply set to 1.

From the results compared with CRF-RNN in Table 2, we note that by reusing both μ_t and ω^{(m)}_t we gained a MeanIoU improvement even without any training process, and a smaller improvement when only ω^{(m)}_t was reused. However, we obtained lower accuracy when not reusing any trained parameters and/or weights. This demonstrates the effectiveness and efficiency of SaI, which boosts accuracy without either training or additional inference time, given a pre-trained pairwise CRF.

The Effectiveness of Simple Grid Search.
In this set of experiments, both SaI-RNN and SM-CRF-RNN used μ_t and ω^{(m)}_t in the model. We used a subset of images from the validation set of PASCAL VOC 2012 to conduct the grid search. First, we trained the parameters θ_α and r for the first layer, SaI-RNN, obtaining θ_α = 96. Next, we did another round of grid search for α (with β = 1 − α).

The quantitative results are shown in Table 3, which indicate a MeanIoU accuracy improvement for SM-CRF-RNN and an Average accuracy improvement for SaI-RNN. The accuracy boost is equivalent to that of H-CRF-RNN [4]. The accuracy improvement achieved by our models came under the condition that they were either not trained at all or trained only with a simple grid search over two additional hyperparameters (r and α). By contrast, H-CRF-RNN requires fine-tuning on a large dataset consisting of the PASCAL VOC and Microsoft COCO datasets [21]. The visual results are shown in Fig. 6, which shows that our models perform well at "filling in" incompletely predicted objects. Also, the proposed scheme preserves sharp boundaries as well as removing small spurious regions, as shown in Fig. 1.
5. CONCLUSIONS
We presented SM-CRF, a two-layer CRF framework that approximates traditional superpixel-based (region-based) higher-order CRFs. The first layer of SM-CRF, SaI, is itself a pairwise CRF, and it successfully integrates the superpixel-based higher-order cues into the color-sensitive pairwise potentials by feeding in a segmented image that encompasses the segmentation information. The second pairwise layer, which takes the original RGB image as input, compensates for the pairwise regulation loss of SaI and further boosts the accuracy. Our approximation approach achieves an outstanding accuracy gain compared with the baselines, DenseCRF and CRF-RNN, without training or with just a simple grid search, while consuming an amount of time equivalent to that of the base models. Compared with other higher-order CRFs, Dense+Potts and H-CRF-RNN, we achieved either better or equivalent results with less time cost.

Fig. 6: Comparison of segmentation results on the PASCAL VOC 2012 dataset. From left to right: original image, CRF-RNN, SaI-RNN, SM-CRF-RNN, and ground truth. Rows one to four are successful examples, and the last two rows show failure cases.
6. REFERENCES

[1] A. Arnab, S. Zheng, S. Jayasumana, et al., "Conditional random fields meet deep neural networks for semantic segmentation."
[2] S. Zheng, S. Jayasumana, B. Romera-Paredes, et al., "Conditional random fields as recurrent neural networks," in ICCV, 2015.
[3] L.-C. Chen, G. Papandreou, I. Kokkinos, et al., "DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs," IEEE TPAMI, vol. 1, 2016.
[4] A. Arnab, S. Jayasumana, S. Zheng, et al., "Higher order conditional random fields in deep neural networks," in ECCV, 2016.
[5] S. Chandra and I. Kokkinos, "Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs," in European Conference on Computer Vision. Springer, 2016, pp. 402–418.
[6] P. Kohli, P. H. Torr, et al., "Robust higher order potentials for enforcing label consistency," International Journal of Computer Vision, vol. 82, no. 3, 2009.
[7] N. Komodakis and N. Paragios, "Beyond pairwise energies: Efficient optimization for higher-order MRFs," in CVPR, 2009, pp. 2985–2992.
[8] V. Vineet, J. Warrell, and P. H. Torr, "Filter-based mean-field inference for random fields with higher-order terms and product label-spaces," IJCV, vol. 110, no. 3, 2014.
[9] Z. Li, X.-M. Wu, and S.-F. Chang, "Segmentation using superpixels: A bipartite graph partitioning approach," in CVPR, 2012.
[10] L. Sulimowicz and I. Ahmad, "Rapid regions-of-interest detection in big histopathological images," in Multimedia and Expo (ICME), 2017 IEEE International Conference on. IEEE, 2017, pp. 595–600.
[11] P. Krähenbühl and V. Koltun, "Efficient inference in fully connected CRFs with Gaussian edge potentials," in NIPS, 2011.
[12] B. Hariharan, P. Arbeláez, R. Girshick, et al., "Simultaneous detection and segmentation," in European Conference on Computer Vision. Springer, 2014, pp. 297–312.
[13] J. Lafferty, A. McCallum, and F. C. Pereira, "Conditional random fields: Probabilistic models for segmenting and labeling sequence data," in ICML, 2001.
[14] L.-C. Chen, G. Papandreou, I. Kokkinos, et al., "Semantic image segmentation with deep convolutional nets and fully connected CRFs," in International Conference on Learning Representations, 2014.
[15] P. Kohli, M. P. Kumar, and P. H. Torr, "P3 & beyond: Solving energies with higher order cliques," in CVPR, 2007.
[16] C. Russell, P. Kohli, P. H. Torr, et al., "Exact and approximate inference in associative hierarchical networks using graph cuts," arXiv preprint arXiv:1203.3512, 2012.
[17] L. Ladicky, C. Russell, P. Kohli, et al., "Associative hierarchical CRFs for object class image segmentation," in ICCV, 2009.
[18] D. Comaniciu and P. Meer, "Mean shift: A robust approach toward feature space analysis," IEEE TPAMI, vol. 24, no. 5, pp. 603–619, 2002.
[19] P. Krähenbühl and V. Koltun, "Parameter learning and convergent inference for dense random fields," in ICML, 2013, pp. 513–521.
[20] J. Shotton, J. Winn, C. Rother, et al., "TextonBoost for image understanding: Multi-class object recognition and segmentation by jointly modeling texture, layout, and context," IJCV, vol. 81, no. 1, pp. 2–23, 2009.
[21] T.-Y. Lin, M. Maire, S. Belongie, et al., "Microsoft COCO: Common objects in context," in ECCV, 2014.