Superpixel-enhanced Pairwise Conditional Random Field for Semantic Segmentation
Li Sulimowicz⋆   Ishfaq Ahmad⋆   Alexander Aved†
⋆Department of Computer Science and Engineering, University of Texas at Arlington, TX, USA
†Air Force Research Laboratory, Rome, NY, USA
{li.yin@mavs, iahmad@cse}.uta.edu⋆, [email protected]†

ABSTRACT
Superpixel-based higher-order Conditional Random Fields (CRFs) are effective in enforcing long-range consistency in pixel-wise labeling problems such as semantic segmentation. However, their major shortcoming is the considerably longer time needed to learn the higher-order potentials and the extra hyperparameters and/or weights, compared with pairwise models. This paper proposes a superpixel-enhanced pairwise CRF framework that consists of the conventional pairwise potentials as well as our proposed superpixel-enhanced pairwise (SP-Pairwise) potentials. SP-Pairwise potentials incorporate the superpixel-based higher-order cues by conditioning on a segment filtered image and share the same set of parameters as the conventional pairwise potentials. Therefore, the proposed superpixel-enhanced pairwise CRF has a lower time complexity in parameter learning and at the same time outperforms the higher-order CRF in terms of inference accuracy. Moreover, the new scheme takes advantage of pre-trained pairwise models by reusing their parameters and/or weights, which provides a significant accuracy boost on the basis of CRF-RNN [25] even without training. Experiments on the MSRC-21 and PASCAL VOC 2012 datasets confirm the effectiveness of our method.
Index Terms — Superpixel-enhanced Pairwise CRFs, Semantic Segmentation, and Higher-order CRFs.
1. INTRODUCTION
Semantic segmentation is a low-level visual scene understanding problem that involves labelling the category of every pixel within an image. Semantic segmentation has numerous promising applications, such as autonomous driving, robotic navigation, computer-aided medical diagnosis, and image editing [21, 11]. In recent years, the application of convolutional neural networks (CNNs) to computer vision problems has often delivered outstanding performance [19, 20, 6]. However, CNNs lack the ability to model pixel-level correlation, which can lead to "blobby" object boundaries. CRFs [13, 9, 7, 25], which capture the correlation between pixels by modeling a conditional distribution between the observed and the target variables, are one of the most effective and commonly used probabilistic graphical models. When CNNs are combined with CRFs [4, 25, 14, 1], they generate output with sharper and more accurate boundaries. Specifically, CRFs that are trained end-to-end with CNNs achieve state-of-the-art accuracy [25, 14, 1, 3, 17].

Normally, pairwise CRFs are not expressive enough to model higher-level consistency such as region-level appearance consistency, co-occurrence of objects, or detector-based cues, wherein each clique consists of more than two pixels. Higher-order CRFs (HO-CRFs) [7, 8, 15, 24, 1] are then used to incorporate these higher-order cues and have recently been shown to be successful in semantic segmentation [1, 18]. Conventionally, the region-level or segment-level semantic cues are formulated as higher-order potentials of two categories: (1) region-based higher-order potentials (the P^N Potts and robust P^N Potts models [12, 7]), which are normally used as appearance consistency regularizers, and (2) pattern-based higher-order potentials [24], which can provide independent label suggestions. In this paper, we focus on the first category and refer to higher-order potentials as the region-based ones.

In a higher-order CRF, the pairwise and the higher-order terms have separate sets of hyperparameters and/or parameters, both of which need to be learned. This results in a higher cost of parameter learning compared with pairwise models. Moreover, given a pre-trained pairwise model, updating it to incorporate the superpixel cues, as done in H-CRF-RNN [1], requires us to fine-tune the parameters of the original pairwise term and to train the new set of parameters of the higher-order term.

To reduce the learning time of region-based higher-order CRFs and make the above update easier, we propose an alternative approach, called the superpixel-enhanced pairwise CRF, which is composed of conventional pairwise potentials and our proposed superpixel-enhanced pairwise (SP-Pairwise) potentials. Our SP-Pairwise potentials instead use a segment filtered image as observed data to enforce the segmentation-level cues through pairwise potentials. Fig. 1 illustrates the Gibbs energy structure and the segment filtered image of our superpixel-enhanced CRF. Furthermore, the SP-Pairwise potential is isomorphic to the pairwise potential. This special relation leads to another important benefit, namely the reusability of the pre-trained tunable parameters from pairwise CRFs. Our experiments conducted on the MSRC-21 [23] and PASCAL VOC 2012 [5] datasets confirm its effectiveness in terms of accuracy and speed. The proposed method also helps reduce the amount of small spurious regions, as reported in [1]. Fig. 1 shows the visual comparison of the outputs.

Fig. 1: Comparison between conventional pairwise CRFs and the proposed model. The blue and red dashed boxes denote the structure of the conventional CRFs and our proposed model, respectively.
2. PRELIMINARIES

Conditional Random Fields for Object Segmentation. X is a set of random variables, where each X_i corresponds to the pixel at location i in the image and takes a value from a pre-defined set of labels L = {l_1, ..., l_L}. D is the observed data sequence. Let G = (V, E) be a graph with V = {X_i}_{i=1,...,N}, where N is the total number of pixels in the given image. The conditional random field (D, X) is characterized by a Gibbs distribution, for which the maximum a posteriori (MAP) labelling is

x* = argmin_{x ∈ L^N} E(X|D),   with   E(X|D) = ∑_{c ∈ C_g} ψ_c(X_c|D),

where C_g is the set of cliques in the graph G. Pairwise CRFs.
The Gibbs energy of pairwise CRFs can be written as the sum of the unary and pairwise potentials,

E(X|D) = ∑_{i∈V} ψ_i^U(x_i) + ∑_{(i,j)∈E} ψ_{ij}^P(x_i, x_j),

where (i,j) ∈ E for all i, j ∈ V with i < j in the fully connected DenseCRF [9]. The unary potentials are ψ_i^U(x_i) = −log(P(X_i = x_i)). In this model, C_g is the set of all unary and pairwise cliques. The pairwise potentials take the form of Eq. 1:

ψ_{ij}^P(x_i, x_j) = µ(x_i, x_j) ∑_{m=1}^{K} ω^{(m)} κ^{(m)}(f_i, f_j),   (1)

where µ is a label compatibility function, e.g. the Potts model µ(x_i, x_j) = [x_i ≠ x_j], κ^{(m)}(f_i, f_j) is the m-th Gaussian kernel, and f_i and f_j are the feature vectors of pixels i and j. DenseCRF [9] uses contrast-sensitive two-kernel potentials, defined in terms of the color vectors I and position vectors P, shown in Eq. 2:

κ(f_i, f_j) = ω^{(1)} exp(−|P_i − P_j|²/(2θ_α²) − |I_i − I_j|²/(2θ_β²)) + ω^{(2)} exp(−|P_i − P_j|²/(2θ_γ²)),   (2)

where the first term is the appearance kernel, the second term is the smoothness kernel, and κ(f_i, f_j) denotes ∑_{m=1}^{K} ω^{(m)} κ^{(m)}(f_i, f_j). The appearance kernel (bilateral kernel) forces nearby pixels with similar colors to take the same label; the degrees of nearness and appearance similarity are controlled by the parameters θ_α and θ_β. The smoothness kernel (spatial kernel) helps remove small isolated regions.
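To make Eq. 1 and Eq. 2 concrete, the following minimal NumPy sketch (ours, not part of the original implementation of [9]) evaluates the contrast-sensitive two-kernel potential for a single pixel pair; the function names and the parameter values in the usage line are illustrative placeholders.

```python
import numpy as np

def dense_crf_kernel(pos_i, pos_j, col_i, col_j, w1, w2, theta_alpha, theta_beta, theta_gamma):
    """Contrast-sensitive two-kernel potential kappa(f_i, f_j) of Eq. 2."""
    d_pos = np.sum((np.asarray(pos_i, float) - np.asarray(pos_j, float)) ** 2)
    d_col = np.sum((np.asarray(col_i, float) - np.asarray(col_j, float)) ** 2)
    appearance = w1 * np.exp(-d_pos / (2 * theta_alpha ** 2) - d_col / (2 * theta_beta ** 2))
    smoothness = w2 * np.exp(-d_pos / (2 * theta_gamma ** 2))
    return appearance + smoothness

def pairwise_potential(x_i, x_j, kappa):
    """Potts pairwise potential of Eq. 1: a penalty is paid only when the labels differ."""
    return float(x_i != x_j) * kappa

# Nearby pixels with similar colors give a large kappa, i.e. a strong penalty for disagreeing labels.
kappa = dense_crf_kernel((10, 10), (12, 11), (120, 80, 60), (118, 82, 61),
                         w1=5.0, w2=3.0, theta_alpha=60.0, theta_beta=10.0, theta_gamma=3.0)
penalty = pairwise_potential(0, 1, kappa)
```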
3. METHOD FORMULATION
In Sec. 3.1, we present our pairwise model that incorporates superpixel cues into the pairwise potentials. Then we demonstrate the details of applying this method to fully connected CRFs in Sec. 3.2.
Conventionally, the higher-order cues are formulated into higher-order potentials, where each clique consists of more than two pixels. To compute such a higher-order potential, it is reformulated as the minimization of a sum of pairwise potentials between the pixels inside the higher-order clique and an auxiliary random variable [12]. Moreover, the higher-order term and the pairwise term each have their own set of hyperparameters and/or weights, which potentially results in a longer learning time compared with pairwise models. Our superpixel-enhanced pairwise CRF model aims at resolving these two problems and provides an alternative to HO-CRFs for incorporating superpixel cues.
Formulating Superpixel Cues onto Pairwise Potentials.
Different from conventional HO-CRFs, we enforce superpixel cues on the pairwise terms directly. Assume we place the pairwise graph on a segmented image. We would then expect that 1) a pairwise potential gives a higher penalty if the pairwise edge lies inside one segment while the two labels differ, which we denote as the intra-potential ψ_in(x_i, x_j), and 2) a lower penalty if the edge crosses two different segments, which we denote as ψ_ex(x_i, x_j).

In order to obtain this behavior, we pre-process the original RGB images with unsupervised segmentation and use s_i to denote the segment index of pixel i. Next, we store a segmented image wherein each pixel i takes the average RGB value C_{s_i} of the superpixel it belongs to. We denote such a segmented image as D^s; one example is shown in Fig. 1. Suppose we take the contrast-sensitive pairwise potential from [2]; then our SP-Pairwise potential is ψ_{ij}^{SP}(x_i, x_j) = µ(x_i, x_j)(θ_p + θ_v exp(−θ_β |C_{s_i} − C_{s_j}|)). Here, if s_i = s_j (which leads to |C_{s_i} − C_{s_j}| = 0), then ψ_in(x_i, x_j; s_i = s_j, Ω) > ψ_ex(x_i, x_j; s_i ≠ s_j, Ω), where Ω = {θ, f_i^s, f_j^s}, f^s is the feature taken from D^s, and θ = {θ_p, θ_v, θ_β}. Therefore, with D^s as observed data, a contrast-sensitive pairwise potential can successfully carry the segmentation-level cues.

Following this, we define the Gibbs energy of the proposed superpixel-enhanced pairwise CRF model as Eq. 3:

E(X | D, D^{s_1}, ..., D^{s_H}) = ∑_{i∈V} ψ_i^U(x_i) + ∑_{(i,j)∈E} ψ_{ij}^P(x_i, x_j; D) + ∑_{h=1}^{H} ∑_{(i,j)∈E} ψ_{ij}^{SP}(x_i, x_j; D^{s_h}).   (3)

Clearly, our Gibbs energy function has multiple pairwise terms: ψ_{ij}^P(x_i, x_j; D) is the potential based on the original image, and ψ_{ij}^{SP}(x_i, x_j; D^{s_h}) is our SP-Pairwise potential based on the segment filtered image D^{s_h}. Each D^{s_h} can be viewed as a denoised version of D.
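To make the construction of the segment filtered image concrete, the sketch below averages the RGB values inside each superpixel. It is a minimal illustration assuming a NumPy RGB image and a precomputed integer superpixel label map (produced by any unsupervised segmenter such as mean-shift); the helper name is ours, not from the paper.

```python
import numpy as np

def build_segment_filtered_image(image, superpixels):
    """Replace every pixel by the mean RGB color of its superpixel (the image D^s).

    image:       (H, W, 3) float or uint8 RGB array.
    superpixels: (H, W) integer array of segment indices s_i.
    """
    img = image.astype(np.float64)
    labels = superpixels.ravel()
    flat = img.reshape(-1, 3)
    n_segments = labels.max() + 1
    # Per-segment color sums and pixel counts.
    sums = np.zeros((n_segments, 3))
    np.add.at(sums, labels, flat)
    counts = np.bincount(labels, minlength=n_segments).reshape(-1, 1)
    means = sums / np.maximum(counts, 1)
    # Every pixel i takes the average color C_{s_i} of its segment.
    return means[labels].reshape(img.shape)
```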
Equivalency to Robust P^N Potts Model [7]. We classify edges into intra and extra edges, E = E_in ∪ E_ex, and rewrite the SP-Pairwise term as ∑_{(i,j)∈E} ψ_{ij}^{SP}(x_i, x_j) = ∑_{(i,j)∈E_in} ψ_in(x_i, x_j) + ∑_{(i,j)∈E_ex} ψ_ex(x_i, x_j). Further, we decompose ∑_{(i,j)∈E_in} ψ_in(x_i, x_j) into the form of superpixel-based higher-order potentials:

∑_{(i,j)∈E_in} ψ_in(x_i, x_j) = ∑_{c∈S} ∑_{(i,j)∈c} ψ_in(x_i, x_j),   (4)

where c is a superpixel clique and S is the set of all superpixels. Inside a superpixel clique, |C_{s_i} − C_{s_j}| = 0. We use N_i(X_c) to denote the number of edges in the clique c whose two labels differ, which gives Eq. 5. Let γ_max = |c|(θ_p + θ_v), where |c| denotes the cardinality of the clique c, which in our case is the total number of edges in the clique. We can then rewrite Eq. 5 as Eq. 6:

∑_{(i,j)∈c} ψ_in(x_i, x_j) = N_i(X_c)(θ_p + θ_v),   (5)

∑_{(i,j)∈c} ψ_in(x_i, x_j) = (N_i(X_c)/|c|) γ_max   if N_i(X_c) < |c|,   and   γ_max   otherwise.   (6)

This equation proves that the sum of SP-Pairwise potentials inside each clique is equivalent to a robust superpixel-based P^n model [7]. Moreover, the sum of all extra-potentials helps enforce consistency between segments.
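As a small sanity check on the equivalence between Eq. 5 and Eq. 6, the sketch below evaluates both forms on a toy clique; it is illustrative only, with θ_p, θ_v and the label assignment chosen arbitrarily.

```python
import itertools
import numpy as np

def intra_clique_penalties(labels, theta_p, theta_v):
    """Compare the direct sum of intra-potentials (Eq. 5) with the robust P^N form (Eq. 6)."""
    # All pairwise edges inside a fully connected superpixel clique.
    edges = list(itertools.combinations(range(len(labels)), 2))
    num_edges = len(edges)                                   # |c| in the paper's notation
    # Inside a clique |C_{s_i} - C_{s_j}| = 0, so each discordant edge costs theta_p + theta_v.
    n_diff = sum(labels[i] != labels[j] for i, j in edges)   # N_i(X_c)
    direct = n_diff * (theta_p + theta_v)                    # Eq. 5
    gamma_max = num_edges * (theta_p + theta_v)
    robust = min(n_diff / num_edges * gamma_max, gamma_max)  # Eq. 6
    return direct, robust

direct, robust = intra_clique_penalties(labels=[0, 0, 1, 2, 0], theta_p=0.5, theta_v=1.0)
assert np.isclose(direct, robust)
```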
Parameters and/or Weights Sharing. Each pixel can be viewed as a special case of a superpixel. This isomorphism leads to the assumption that the additional superpixel-enhanced pairwise terms and the original pairwise terms can potentially share the same parameters. Thus, by formulating ψ_{ij}^{SP}(x_i, x_j; D^{s_h}) with a structure similar to ψ_{ij}^P(x_i, x_j; D), we avoid introducing more parameters or even weights and hence save time in learning these parameters. Moreover, if there exists a pairwise CRF pre-trained on D, it is possible to reuse the learned parameters or weights directly.

For the DenseCRF [22] that takes Eq. 2 as its potential function, we define our fully connected SP-Pairwise potential as follows in order to reuse parameters:

ψ_{ij}^{SP}(x_i, x_j; D^s) = µ^s(x_i, x_j) ω_s^{(1)} exp(−|P_i − P_j|²/(2(θ_α^s)²) − |C_i^s − C_j^s|²/(2(θ_β^s)²)).   (7)

The spatial term |P_i − P_j|²/(2(θ_α^s)²) acts as a weight on the color-sensitive potential, so θ_α^s controls the degree of nearness. For the SP-Pairwise potential, we set θ_β^s = θ_β and µ^s(x_i, x_j) = µ(x_i, x_j). When multiple such potential terms exist in Eq. 3, each SP-Pairwise term shares the same θ_α^s and ω_s^{(1)}. Therefore, to incorporate different levels of segment cues we only introduce one weight and one hyperparameter, which makes the full set of hyperparameters {θ_α, θ_β, θ_γ, θ_α^s} and the full set of weights {ω^{(1)}, ω^{(2)}, ω_s^{(1)}, µ(x_i, x_j)}. We use k^s(c_i^s, c_j^s) to denote this kernel.

For DenseCRF [22], we use the Potts model µ(x_i, x_j) = [x_i ≠ x_j]; because each weight and hyperparameter is a one-dimensional real value, we obtain them with a simple grid search.

For CRF-RNN [25], our SP-Pairwise potential shares the same label compatibility parameter µ (a 21 × 21 matrix) with the pairwise potential because of the isomorphism between superpixels and pixels. To update a pre-trained CRF-RNN, we set ω_s^{(1)} = r·ω^{(1)}, where r is a scalar ratio and ω^{(1)} is a 21 × 21 matrix. Thus, we only introduce two additional hyperparameters, θ_α^s and r, which can easily be trained with grid search.
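The parameter reuse can be summarized in a few lines. The sketch below builds the SP-Pairwise Gaussian kernel of Eq. 7 from pre-trained pairwise parameters, reusing θ_β and scaling the appearance weight by r; the function and argument names are ours, the kernel form assumes the same squared-exponential structure as Eq. 2, and the scalar weight corresponds to the DenseCRF case (for CRF-RNN the weight would be a 21 × 21 matrix applied in the compatibility transform).

```python
import numpy as np

def sp_pairwise_kernel(pos_i, pos_j, c_i, c_j, theta_s_alpha, theta_beta, w1_s):
    """SP-Pairwise kernel k^s(c_i^s, c_j^s) of Eq. 7 for one pixel pair.

    pos_i, pos_j : 2-D pixel positions P_i, P_j.
    c_i, c_j     : mean superpixel colors C_i^s, C_j^s taken from D^s.
    theta_s_alpha: new spatial bandwidth theta_alpha^s introduced by the SP term.
    theta_beta   : color bandwidth, reused from the pre-trained pairwise model.
    w1_s         : appearance weight omega_s^(1), e.g. r * omega^(1) of a pre-trained model.
    """
    d_pos = np.sum((np.asarray(pos_i, float) - np.asarray(pos_j, float)) ** 2)
    d_col = np.sum((np.asarray(c_i, float) - np.asarray(c_j, float)) ** 2)
    return w1_s * np.exp(-d_pos / (2 * theta_s_alpha ** 2) - d_col / (2 * theta_beta ** 2))

# Reusing pre-trained weights: only r and theta_s_alpha are new (all values here are placeholders).
w1_pretrained, r = 3.0, 0.5
k = sp_pairwise_kernel((10, 12), (11, 15), (200, 180, 90), (198, 182, 95),
                       theta_s_alpha=60.0, theta_beta=10.0, w1_s=r * w1_pretrained)
```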
Inference and Learning. As done in [9, 22, 24], mean-field approximation can be used for inference, in which the key step is the following iterative message-passing update over the different potential terms [10]:

Q_i(x_i = l) = (1/Z_i) exp{ −ψ_u(x_i) − ∑_{l'∈L} µ(l, l') ∑_{j≠i} k(f_i, f_j) Q_j(l') }.   (8)

For our SP-enhanced CRF, a simple substitution of k(f_i, f_j) with (k(f_i, f_j) + k^s(c_i^s, c_j^s)) allows both inference and learning to be implemented with low complexity. The time complexity is linear in the total number of pairwise terms.
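A naive O(N²) sketch of this mean-field update is given below for small images, to make the substitution concrete: the superpixel kernel k^s is simply added to the pairwise kernel before message passing. It assumes precomputed dense kernel matrices and is our own illustration; real implementations such as DenseCRF use permutohedral-lattice filtering instead of explicit kernel matrices.

```python
import numpy as np

def mean_field(unary, kernel_pairwise, kernel_sp, mu, n_iters=5):
    """Naive mean-field inference for the SP-enhanced CRF (Eq. 8).

    unary           : (N, L) negative log-probabilities psi_u.
    kernel_pairwise : (N, N) summed Gaussian kernels k(f_i, f_j) on the original image D.
    kernel_sp       : (N, N) SP-Pairwise kernel k^s(c_i^s, c_j^s) on the segment image D^s.
    mu              : (L, L) label compatibility matrix.
    """
    k_total = kernel_pairwise + kernel_sp          # the only change vs. a plain pairwise CRF
    np.fill_diagonal(k_total, 0.0)                 # exclude j = i from the message passing
    q = np.exp(-unary)
    q /= q.sum(axis=1, keepdims=True)              # initialize Q with the unary softmax
    for _ in range(n_iters):
        messages = k_total @ q                     # sum_j k(f_i, f_j) Q_j(l') for every l'
        pairwise = messages @ mu.T                 # sum_{l'} mu(l, l') * message(l')
        q = np.exp(-unary - pairwise)
        q /= q.sum(axis=1, keepdims=True)          # normalize by Z_i
    return q
```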
4. EXPERIMENTS
We evaluated our approach on two benchmarks, MSRC-21 [23] and PASCAL VOC 2012 [5], along with the baseline models DenseCRF [9] and its recurrent neural network version CRF-RNN [25]. In our experiments, we used the accurate labelling of a subset of 92 images in the MSRC-21 dataset, denoted as Accurate Ground Truth (AGT). We evaluated the performance on a reduced validation set of the PASCAL VOC 2012 dataset, which includes 346 images, as used in [1, 25]. We used three evaluation metrics: pixel accuracy (Global), mean accuracy (Average), and mean IOU (MeanIOU), as used in [19].
We first segmented the whole AGT set with mean-shift segmentation under two (h_s, h_r) settings, denoted as D^{s_1} and D^{s_2}, respectively. Then, we split this dataset into half as the training set and half as the testing set. Second, we used the same unary potentials as in the implementation of the baseline DenseCRF [9]. Define n_ij as the number of pixels of class i classified as class j, n_cl as the number of classes in the ground truth, t_i = ∑_j n_ij as the total number of pixels belonging to class i, and t as the number of all pixels in an image. Then Global = ∑_i n_ii / t, Average = (1/n_cl) ∑_i n_ii / t_i, and MeanIOU = (1/n_cl) ∑_i n_ii / (t_i + ∑_j n_ji − n_ii).

We generated three models based on DenseCRF. 1) SP-CRF: we used the same energy function as DenseCRF, except that our observation is D^{s_1} instead of the original image D. 2) DenseHO: to DenseCRF, which takes D as input, we add an SP-Pairwise term conditioned on D^{s_1}. 3) DenseHO2: on the basis of DenseHO, we used an additional SP-Pairwise term with the coarser segments D^{s_2}. For the higher-order benchmark model, we implemented Dense+Potts as given in [24]. We use Dense+Potts to match DenseHO, and Dense+Potts3 has one additional set of segments compared with DenseHO2.

Tab. 1 shows the experimental results. Our model DenseHO2 obtained a 3.68% MeanIOU improvement over DenseCRF with under a second of additional running time, which reduces the error rate by nearly 14%. Even our simplest model, SP-CRF, gained around a 1.7% IOU accuracy boost, which demonstrates that the segment filtered image can not only provide segment-level cues but also partly preserve pixel-level cues. Both DenseHO and DenseHO2 outperform Dense+Potts on every evaluation metric using less than half of the inference time consumed by Dense+Potts. This accuracy boost is likely due to our model's ability to enforce appearance consistency between segments with the extra-potentials, and the faster speed comes from using pairwise potentials, so the proposed method benefits from the speedup of filter-based mean-field inference. Fig. 2 shows examples of the visual results of these models.
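The three metrics above can be computed directly from a class confusion matrix; the following is a minimal sketch (our own helper, not from the paper).

```python
import numpy as np

def segmentation_metrics(conf):
    """Global, Average and MeanIOU from an (n_cl, n_cl) confusion matrix.

    conf[i, j] = n_ij = number of pixels of ground-truth class i predicted as class j.
    """
    conf = np.asarray(conf, dtype=np.float64)
    n_ii = np.diag(conf)            # correctly classified pixels per class
    t_i = conf.sum(axis=1)          # ground-truth pixels per class
    pred_i = conf.sum(axis=0)       # predicted pixels per class (sum_j n_ji)
    global_acc = n_ii.sum() / conf.sum()
    average_acc = np.mean(n_ii / t_i)
    mean_iou = np.mean(n_ii / (t_i + pred_i - n_ii))
    return global_acc, average_acc, mean_iou
```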
Table 1: Quantitative results evaluated on the MSRC-21 dataset. For the unary potentials alone, the Global, Average and IOU accuracies are 82.33, 83.30 and 63.18, respectively.

                      Accurate Ground Truth              Time
                    Global    Average    IOU
DenseCRF [22]       86.63     86.29      72.93           0.– s
Dense+Potts [24]    87.43     85.64      74.81           0.– s
Dense+Potts3 [24]   87.88     86.61      75.30           0.– s
SP-CRF              –.57      87.25      74.62           0.– s
DenseHO             –.87      87.82      76.57           0.– s
DenseHO2            88.20     88.37      76.–            –

Table 2: Performance comparison of CRF-RNN-HO with CRF-RNN [25] and H-CRF-RNN [1].
                 Global   Average   MeanIOU   Retrain
CRF-RNN [25]     –        –         –         Yes
CRF-RNN-HO       –        –         –         –

Fig. 2: Examples of qualitative results on the MSRC-21 dataset. Columns from left to right: Unary, DenseCRF, SP-CRF, DenseHO2, Dense+Potts3, and Ground Truth.

Fig. 3: Visual results on the PASCAL 2010 dataset. (a) Original, (b) CRF-RNN, (c) the proposed, (d) Ground Truth.

We segmented the reduced validation set with the BGPS segmentation algorithm [16] at scale 15, denoted as D^{s_1}, and with mean-shift segmentation (h_s = 7), denoted as D^{s_2}. The original image set is denoted as D. For the CRF-RNN model, we used the implementation from [25], with θ_γ = 3; µ, ω^{(1)}, and ω^{(2)} are 21 × 21 matrices trained end-to-end together with the CNN. Here, we constructed the model CRF-RNN-HO, which includes one pairwise term and two SP-Pairwise terms with input observations D, D^{s_1}, and D^{s_2}, respectively. We conducted a simple grid search on a subset of the PASCAL VOC 2012 training set to obtain θ_α^s and r, which gives r = 0.5. The quantitative results shown in Table 2 indicate that our method gains 1.5% higher accuracy in Average and 1.1% in MeanIOU compared with CRF-RNN. Compared with H-CRF-RNN [1], we obtained an equivalent performance boost. Importantly, the SP-enhanced pairwise CRF achieved this accuracy improvement without the large amount of retraining on thousands of images that is required by H-CRF-RNN. The visual results are shown in Fig. 3.
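Because only θ_α^s and r are new, tuning reduces to a two-dimensional grid search. A minimal sketch follows; the candidate grids and the evaluate_mean_iou validation helper are hypothetical placeholders, not values or code from the paper.

```python
import itertools

def grid_search(evaluate_mean_iou, theta_s_alpha_grid, r_grid):
    """Pick the (theta_alpha^s, r) pair maximizing MeanIOU on a validation subset."""
    best = None
    for theta_s_alpha, r in itertools.product(theta_s_alpha_grid, r_grid):
        score = evaluate_mean_iou(theta_s_alpha=theta_s_alpha, r=r)
        if best is None or score > best[0]:
            best = (score, theta_s_alpha, r)
    return best

# Example usage with made-up grids; evaluate_mean_iou would run CRF-RNN-HO
# inference on a validation subset and return the MeanIOU.
# best = grid_search(evaluate_mean_iou,
#                    theta_s_alpha_grid=[20, 40, 60, 80],
#                    r_grid=[0.25, 0.5, 0.75, 1.0])
```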
5. CONCLUSIONS
In this paper, we presented a novel superpixel-enhanced pairwise CRF framework which, to our knowledge, is the first pairwise CRF capable of incorporating segment-based cues in a pixel-by-pixel, data-driven manner. We also introduced the SP-Pairwise potentials for the fully connected CRF family. The results on semantic segmentation demonstrate that our approach improves accuracy, is easy to train, and is efficient at inference. We believe this framework generalizes to many other image labelling problems.

6. REFERENCES

[1] A. Arnab, S. Jayasumana, S. Zheng, and P. H. Torr. Higher order conditional random fields in deep neural networks. In ECCV, 2016.
[2] Y. Y. Boykov and M.-P. Jolly. Interactive graph cuts for optimal boundary & region segmentation of objects in N-D images. In ICCV, volume 1, pages 105–112. IEEE, 2001.
[3] S. Chandra and I. Kokkinos. Fast, exact and multi-scale inference for semantic image segmentation with deep Gaussian CRFs. In ECCV, pages 402–418. Springer, 2016.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE TPAMI, 1:1–16, 2016.
[5] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.
[6] G. Ghiasi and C. C. Fowlkes. Laplacian pyramid reconstruction and refinement for semantic segmentation. In ECCV, pages 519–534. Springer, 2016.
[7] P. Kohli, P. H. Torr, et al. Robust higher order potentials for enforcing label consistency. International Journal of Computer Vision, 82(3):302–324, 2009.
[8] N. Komodakis and N. Paragios. Beyond pairwise energies: Efficient optimization for higher-order MRFs. In CVPR, pages 2985–2992, 2009.
[9] P. Krähenbühl and V. Koltun. Efficient inference in fully connected CRFs with Gaussian edge potentials. In NIPS, 2011.
[10] P. Krähenbühl and V. Koltun. Parameter learning and convergent inference for dense random fields. In ICML, pages 513–521, 2013.
[11] A. Kundu, Y. Li, F. Dellaert, F. Li, and J. M. Rehg. Joint semantic segmentation and 3D reconstruction from monocular video. In ECCV, pages 703–718. Springer, 2014.
[12] L. Ladicky, C. Russell, P. Kohli, and P. H. S. Torr. Associative hierarchical CRFs for object class image segmentation. In ICCV, pages 739–746, 2009.
[13] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML, 2001.
[14] M. Larsson, F. Kahl, S. Zheng, A. Arnab, P. Torr, and R. Hartley. Learning arbitrary potentials in CRFs with gradient descent. arXiv preprint arXiv:1701.06805, 2017.
[15] X. Li and H. Sahbi. Superpixel-based object class segmentation using conditional random fields. In ICASSP, pages 1101–1104. IEEE, 2011.
[16] Z. Li, X.-M. Wu, and S.-F. Chang. Segmentation using superpixels: A bipartite graph partitioning approach. In CVPR, 2012.
[17] G. Lin, C. Shen, A. Van Den Hengel, and I. Reid. Exploring context with deep structured models for semantic segmentation. IEEE TPAMI, 2017.
[18] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang. Deep learning Markov random field for semantic segmentation. IEEE TPAMI, 1:1–16, 2017.
[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In CVPR, pages 3431–3440, 2015.
[20] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In ICCV, pages 1520–1528, 2015.
[21] G. Ros, S. Ramos, M. Granados, A. Bakhtiary, D. Vazquez, and A. M. Lopez. Vision-based offline-online perception paradigm for autonomous driving. In WACV, pages 231–238. IEEE, 2015.
[22] C. Russell, P. Kohli, P. H. Torr, et al. Exact and approximate inference in associative hierarchical networks using graph cuts. arXiv preprint arXiv:1203.3512, 2012.
[23] J. Shotton, J. Winn, C. Rother, and A. Criminisi. TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation. In ECCV, pages 1–15, 2006.
[24] V. Vineet, J. Warrell, and P. H. Torr. Filter-based mean-field inference for random fields with higher-order terms and product label-spaces. IJCV, 110(3):290–307, 2014.
[25] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In ICCV, 2015.