Hierarchical Opacity Propagation for Image Matting
Yaoyi Li, Qingyao Xu, and Hongtao Lu
Department of Computer Science and Engineering, Shanghai Jiao Tong University, China
Abstract.
Natural image matting is a fundamental problem in computational photography and computer vision. Deep neural networks have seen a surge of successful methods in natural image matting in recent years. In contrast to traditional propagation-based matting methods, some top-tier deep image matting approaches tend to perform propagation in the neural network implicitly. A novel structure for more direct alpha matte propagation between pixels is in demand. To this end, this paper presents a hierarchical opacity propagation (HOP) matting method, where the opacity information is propagated in the neighborhood of each point at different semantic levels. The hierarchical structure is based on one global and multiple local propagation blocks. With the HOP structure, every feature-point pair in high-resolution feature maps will be connected based on the appearance of the input image. We further propose a scale-insensitive positional encoding tailored for image matting to deal with the unfixed size of input images, and introduce a random interpolation augmentation into image matting. Extensive experiments and an ablation study show that HOP matting is capable of outperforming state-of-the-art matting methods.
Keywords:
Image Matting; Hierarchical Propagation; Hierarchical Transformer
1 Introduction

Natural image matting is the problem of separating the foreground object from the background. Digital image matting treats the input image as a composition and aims to estimate the opacity of the foreground image. The predicted results are alpha mattes, which indicate the transition between foreground and background at each pixel [41].

Formally, the observed RGB image $I$ is modeled as a convex combination of the foreground image $F$ and the background $B$ [8,41]:

$$I_p = \alpha_p F_p + (1 - \alpha_p) B_p, \qquad \alpha_p \in [0, 1], \tag{1}$$

where $\alpha_p$ denotes the alpha matte value to be estimated at position $p$. In its original definition, image matting is an ill-posed problem. Therefore, in most matting tasks, trimap images (Figure 1(b)) are provided as a coarse annotation, indicating the known foreground and background regions as well as the unknown region to be predicted.

Fig. 1.
Demonstration on a real-world hops image (photograph by Lisa Baird from Pixabay): (a) Image, (b) Trimap, (c) Deep Matting, (d) IndexNet Matting, (e) Context-aware, (f) Ours.

Typically, conventional propagation-based matting algorithms [24,14,23,4,1] generate the alpha matte by transmitting opacity or transparency between pixels according to appearance similarity in the input image. This inductive bias is leveraged implicitly by some deep learning based matting approaches to benefit their results [37,3]. SampleNet [37] performs propagation through the inpainting network [46] adopted for foreground and background estimation, rather than for opacity prediction. Moreover, the propagation in the inpainting part is only carried out on semantically strong features rather than high-resolution features. In AdaMatting [3], the authors introduced a convolutional long short-term memory (LSTM) network [43] as the last stage of their model for propagation, where the propagation is performed through the convolution and the memory kept in the ConvLSTM cell. The difference between ConvLSTM and direct propagation is analogous to the distinction between LSTM [17] and transformer [38].

In this paper, we present a novel hierarchical opacity propagation structure in which opacity messages are transmitted across different semantic feature levels. The proposed structure is constructed from two propagation blocks, namely the global HOP block and the local HOP block. The HOP blocks perform information transmission by the mechanism of attention and aggregation [2], as a transformer does [38]. Notwithstanding, both global and local HOP blocks are designed as two-source transformers. More concretely, the relation between nodes in the attention graph is computed from appearance features, while the information to be propagated is the opacity feature; this contrasts with self-attention [38,32] and conventional attention [2,44]. With the help of the HOP structure, our network learns to predict the opacity over context on the low-resolution but semantically strong features via the global HOP block, and to refine blurring artifacts on the high-resolution features via the local HOP blocks. A demonstration of our method as well as three state-of-the-art approaches on a real-world image is shown in Figure 1. In addition to the HOP structure, we present a scale-insensitive positional encoding, based on relative positional encoding [9,32], to handle the variable size of input images. A random interpolation augmentation is also introduced into our method to further boost the performance.

Specifically, our proposed method differs from previous deep image matting approaches in the following aspects:

1. We propose a novel hierarchical propagation architecture for image matting based on a series of global and local HOP blocks, leveraging opacity and appearance information at different semantic levels for information propagation.

2. We introduce the random interpolation augmentation into the training of deep image matting, and empirical evaluation shows that this augmentation yields a remarkable gain in performance.

3. Experiments on both the Composition-1k testing set and the alphamatting.com dataset demonstrate that the proposed approach is competitive with the other state-of-the-art methods.
2 Related Work

Most natural image matting approaches can be broadly categorized as propagation-based methods [24,14,23,4,1], sampling-based methods [40,11,13,10] and learning-based methods [45,30,29,37,18,3]. In this section, we review some deep learning based methods that are highly related to our work.

General deep learning based methods utilize a deep network to directly predict the alpha matte from the given image and trimap. Cho et al. [7] used an end-to-end CNN which took mattes from closed-form matting [24] and KNN matting [4], together with normalized RGB images, as the input to better utilize local and nonlocal structure information. Deep Matting was proposed in [45] as a two-stage network to predict the alpha matte: in the first stage, a deep convolutional encoder-decoder network is fed with images and corresponding trimaps; in the second stage, the coarse predicted alpha matte is refined by a shallow network. Lutz et al. [30] introduced GANs into the image matting problem and used dilated convolutions to capture global context information. Tang et al. [36] proposed a network called VDRN involving a deep residual encoder and a sophisticated decoder. AdaMatting [3] divided the matting problem into two tasks, trimap adaptation and alpha estimation, and used a deep CNN with two distinct decoder branches to handle the tasks in a multitasking manner. IndexNet Matting [29] considered the indices in pooling as a function of the feature map and proposed an index-guided encoder-decoder network whose pooling and upsampling are guided by learned indices. Hou and Liu [18] proposed a context-aware network to predict both the foreground and the alpha matte; Context-aware Matting employs two encoders, one for local features and the other for context information.
Fig. 2.
Diagrams of our proposed HOP structure. For ease of presentation, only a 4-scale-level decoder and 2 local HOP blocks are shown; in our implementation, we have a 5-level decoder and 3 local HOP blocks. (a) The architecture of our network; the appearance encoder branch only takes the RGB image as input. (b) The schematic diagram of hierarchical opacity propagation on feature maps from different semantic levels; orange lines indicate the feature propagation.
GCA Matting [25] introduced a guided contextual attention mechanism that performs an analogue of image inpainting within matting.

Instead of predicting the alpha matte directly, some methods show that changing the degrees of freedom of the prediction influences the performance of the network. Tang et al. [37] proposed SampleNet to estimate the foreground, the background and the alpha matte step by step, rather than predicting them simultaneously. Zhang et al. [47] used two decoder branches to estimate the foreground and background respectively, and then used a fusion branch to obtain the final result.

All the methods mentioned above rely heavily on the trimap as an input to reduce the solution space. Recently, however, for some specific practical problems, methods have been proposed that leverage semantic information in the image and work well without any trimap. Shen et al. [35] combined an end-to-end CNN with closed-form matting [24] to automatically generate the trimap of a portrait image and then obtain the desired alpha matte. Chen et al. [5] proposed Semantic Human Matting, which integrates a semantic segmentation network with a matting network to automatically extract the alpha mattes of humans.
3 Method

To achieve direct opacity information propagation over the entire input image at a high resolution, we propose the novel structure of hierarchical opacity propagation, in which the neural network can be regarded as a multi-layer graph convolutional network [22] with different graphs, so that opacity can be propagated between every two pixels. In this section, we first introduce our hierarchical opacity propagation blocks, then propose the scale-insensitive positional encoding for our HOP blocks. Afterwards, the implementation details are described.

Fig. 3.
The detailed structures of the local HOP block, the local self-attention block, the global HOP block and the global self-attention block. For the two local blocks, we only show one query data point in the diagrams for ease of presentation.
Hierarchical Opacity Propagation Blocks

Typically, a non-local block [42] or a transformer [38] is capable of carrying out information propagation globally through its self-attention mechanism [27]. However, there are two flaws in adopting the original non-local block or image transformer directly in a natural image matting method. On the one hand, the non-local block is computationally expensive. Although some modifications have been proposed to reduce the computation and memory consumption [49], it is still infeasible to propagate opacity information on a high-resolution feature map, which is required by image matting. On the other hand, both the non-local block and the transformer build a complete graph on the input feature map, and the edge weights are generated from the input features of the nodes. In image matting, the feature of each node is the opacity information. It is straightforward to propagate semantic features based on the relationships between feature nodes in semantic tasks like video classification [42], semantic segmentation [49] or image inpainting [46], whereas propagation in natural image matting requires non-semantic appearance information more than semantic features.

Inspired by the recent success of sparse transformers [6,19], local self-attention [32] and guided contextual attention [25], our proposed hierarchical opacity propagation structure involves two different propagation blocks, namely the global HOP block and the local HOP block, in which appearance and opacity predictions are leveraged together for propagation.

The network architecture of our proposed HOP matting is shown in Figure 2(a). In our method, there are two encoder branches, one serving as the opacity information source and the other as the image appearance source.
Assuming the feature maps from the opacity encoder and the appearance encoder are denoted by $F^O \in \mathbb{R}^{HW \times C}$ and $F^A \in \mathbb{R}^{HW \times C}$ respectively, with feature points at position $(i, j)$ denoted $f^O_{(i,j)} \in \mathbb{R}^C$ and $f^A_{(i,j)} \in \mathbb{R}^C$, the global HOP block is defined as follows:

$$q_{(i,j)} = W_{QK} f^A_{(i,j)}, \qquad k_{(x,y)} = W_{QK} f^A_{(x,y)},$$
$$a_{(i,j),(x,y)} = \mathrm{softmax}_{(x,y)}\left(\frac{q^T_{(i,j)} k_{(x,y)}}{\|q_{(i,j)}\| \|k_{(x,y)}\|}\right),$$
$$g_{(i,j)} = W_{out}\left(\sum_{(x,y)} a_{(i,j),(x,y)} f^O_{(x,y)}\right) + f^O_{(i,j)}, \tag{2}$$

where $W_{QK}$ is the linear transformation shared by key and query, $W_{out}$ is the transformation that aligns the propagated information with the input feature map $F^O$, and the softmax is taken along the $(x, y)$ dimension. Additionally, $F^O$ serves as the value term in this attention mechanism without any transformation; equivalently, we can regard $v_{(x,y)} = W_{out} f^O_{(x,y)}$ as the value term. Figure 3(c) shows the detailed structure of the global HOP block.

In the global HOP block, the key and the value have different sources, as distinct from self-attention [27,38] and conventional attention mechanisms [2,44]. In self-attention, the query, key and value are all computed from the same feature, and in conventional attention the key and value come from the same place. In HOP blocks, however, the query and key share the same original appearance feature, while the value has a distinct source, the opacity feature.

Similarly, we can formulate the local HOP block, which only attends to the local neighborhood of each feature point:

$$a_{(i,j),(x,y)} = \mathrm{softmax}_{(x,y) \in \mathcal{N}((i,j),s)}\left(\frac{q^T_{(i,j)} k_{(x,y)}}{\|q_{(i,j)}\| \|k_{(x,y)}\|}\right),$$
$$g_{(i,j)} = W_{out}\left(\sum_{(x,y) \in \mathcal{N}((i,j),s)} a_{(i,j),(x,y)} f^O_{(x,y)}\right) + f^O_{(i,j)}, \tag{3}$$

where $\mathcal{N}((i, j), s)$ is the neighborhood of position $(i, j)$ with a window size of $s$.

In our propagation graph, each node has two different features, opacity and appearance. The appearance feature is only utilized to generate the edge weights of the graph, while the opacity feature is the de facto information being propagated. The difference between the HOP blocks and self-attention can also be seen in Figure 3. We compare the performance of our HOP blocks with global and local self-attention in the ablation study.

With the aforementioned HOP blocks, we build the hierarchical opacity propagation structure for alpha matte estimation as depicted in Figure 2(b). A HOP structure involves one global HOP block and multiple local HOP blocks. In the schematic diagram, we omit the deconvolution blocks between HOP blocks to display how the opacity information is propagated hierarchically. The bottom global HOP block performs a global opacity broadcast on the feature maps from the bottleneck, where the feature map contains more semantic but less textural information; it is intuitive to transmit semantic features globally to leverage information from the whole image. Subsequently, local HOP blocks are incorporated into the network between deconvolution stages, where more textural information is represented in the high-resolution feature maps. It is therefore well motivated to make each local HOP block attend only to the neighborhood of each query point for textural information extraction. With our HOP structure, the opacity is propagated across different feature levels, from semantic to textural features and from low-resolution to high-resolution ones.

Moreover, the proposed HOP structure can be considered a 4-layer graph convolutional network [22] with a different graph in each layer, where the number of nodes varies across the stages of the network. The graph in the global HOP block is a complete graph, and the graphs in the local HOP blocks are sparse. All edge weights are computed by an attention mechanism, as in graph attention networks [39].
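To make the two-source attention concrete, the following is a minimal sketch of the global HOP block of Eq. (2), assuming PyTorch; the class and variable names (GlobalHOPBlock, w_qk, w_out) are our own illustrative choices, not the authors' released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalHOPBlock(nn.Module):
    """Propagates opacity features with appearance-derived attention (Eq. 2)."""

    def __init__(self, channels):
        super().__init__()
        # A single transformation W_QK is shared by query and key (Eq. 2).
        self.w_qk = nn.Conv2d(channels, channels, kernel_size=1, bias=False)
        # W_out aligns the aggregated opacity with the input feature map F_O.
        self.w_out = nn.Conv2d(channels, channels, kernel_size=1, bias=False)

    def forward(self, f_a, f_o):
        # f_a: appearance features, f_o: opacity features, both (B, C, H, W).
        b, c, h, w = f_o.shape
        qk = F.normalize(self.w_qk(f_a).flatten(2), dim=1)      # (B, C, HW), unit norm
        # Cosine similarity q^T k / (||q|| ||k||), softmax over the keys (x, y).
        attn = torch.softmax(qk.transpose(1, 2) @ qk, dim=-1)   # (B, HW, HW)
        v = f_o.flatten(2)                                      # opacity is the value
        out = (v @ attn.transpose(1, 2)).view(b, c, h, w)       # aggregation
        return self.w_out(out) + f_o                            # residual (+ f_O)
```

The local HOP block of Eq. (3) follows the same pattern but restricts the softmax to the neighborhood N((i, j), s) of each query, e.g. by unfolding keys and values into local windows, which keeps the cost roughly linear rather than quadratic in the number of pixels.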
Fig. 4.
The illustration of different positional encoding methods. The blue rectangle indicates the position of the query. The ordered pair in each feature point is (row offset, column offset). SI-PE: scale-insensitive positional encoding; R-PE: relative positional encoding; A-PE: absolute positional encoding; LR-PE: local relative positional encoding.
Positional Encoding

Positional encoding has consistently yielded gains with self-attention mechanisms in previous work [38,9,32]. We now describe the positional encodings incorporated in our approach. We employ two different methods: a scale-insensitive positional encoding for the global HOP block and a local relative positional encoding for the local HOP blocks. Illustrations of the different positional encoding methods are shown in Figure 4.
Scale-insensitive Position Encoding
Vaswani et al. [38] introduced positional encoding into the transformer to improve performance on natural language processing tasks. Transformer-XL [9] further extended the absolute positional encoding to a relative positional encoding. Intuitively, we can divide the embedding into row and column encodings to extend the relative or absolute positional encoding to a 2-dimensional encoding for image matting. However, a fatal obstacle with previous positional encodings is that the neighborhood size of the attention must be fixed: once the input image is larger than the training images, new positional embeddings appear that were never seen in training. In image matting, it is extremely common for testing images to be larger than the training image patches. To address this issue, we propose a scale-insensitive positional encoding for the global HOP block.

In our scale-insensitive positional encoding, we define a radius $s$ of the neighborhood. Any point located beyond the radius shares the same positional encoding, while for points within radius $s$ we use the relative positional encoding (see the illustration in Figure 4(a)). The global HOP block can then be written as:

$$e_d = \begin{cases} W_{PE}\, r_d & d \le s; \\ W_{PE}\, r_s & \text{otherwise}, \end{cases}$$
$$a_{(i,j),(x,y)} = \mathrm{softmax}_{(x,y)}\left(\frac{q^T_{(i,j)} k_{(x,y)}}{\|q_{(i,j)}\| \|k_{(x,y)}\|} + \frac{q^T_{(i,j)}}{\|q_{(i,j)}\|}\left(e_{|i-x|} + e_{|j-y|}\right)\right),$$
$$g_{(i,j)} = W_{out}\left(\sum_{(x,y)} a_{(i,j),(x,y)} f^O_{(x,y)}\right) + f^O_{(i,j)}, \tag{4}$$

where we employ the sinusoidal encoding $r_d$ following [38,9] for simplicity and select $s = 7$ in our implementation. With the scale-insensitive positional encoding, our HOP matting can handle input images of any shape and size. In addition to the positional embedding, we also design a trimap embedding to learn whether the foreground, background and unknown areas should have different weights in attention. Hence the term $(e_{|i-x|} + e_{|j-y|})$ is modified to $(e_{|i-x|} + e_{|j-y|} + W_T t_{(x,y)})$, where $t_{(x,y)}$ is the data point at position $(x, y)$ in the resized trimap.
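A sketch of how the clamped positional bias of Eq. (4) can be computed, assuming PyTorch; here pe_table stands for the $s + 1$ embeddings $W_{PE} r_0, \dots, W_{PE} r_s$, and the function name is our own:

```python
import torch

def si_pe_bias(q_hat, pe_table, h, w, s=7):
    # q_hat:    (B, HW, C) unit-normalized queries q / ||q||.
    # pe_table: (s + 1, C) embeddings W_PE r_d for d = 0 .. s.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pos = torch.stack([ys.flatten(), xs.flatten()], dim=-1)      # (HW, 2)
    # Row/column offsets |i - x| and |j - y|; every offset beyond the radius
    # s is clamped to s, so all far-away points share the same encoding e_s.
    d = (pos[:, None, :] - pos[None, :, :]).abs().clamp(max=s)   # (HW, HW, 2)
    e = pe_table[d[..., 0]] + pe_table[d[..., 1]]                # e_|i-x| + e_|j-y|
    # Additive attention bias (q^T / ||q||)(e_|i-x| + e_|j-y|) per pair.
    return torch.einsum("bnc,nmc->bnm", q_hat, e)                # (B, HW, HW)
```

The trimap embedding of the modified term would be added to e in the same way before the final contraction with the queries.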
Local Relative Position Encoding

For the local HOP block, the neighborhood size is a constant throughout the network, which makes the scale-insensitive positional encoding unnecessary. We extend the local relative positional encoding proposed in [32] to a direction-invariant version without introducing a new embedding method. The encoding used in the local HOP block is depicted in Figure 4(d).

In contrast to previous work on positional encoding [38,9,32], both positional encodings adopted in our image matting method are direction-invariant, meaning the embedding depends only on the absolute distance along the row or column between the positions of query and key. This design is motivated by the fact that natural image matting is largely a low-level vision problem with little semantics, and should be rotation-invariant.
Table 1.
Evaluation results on the resized Composition-1k testing set with different testing interpolations. RI stands for the random interpolation augmentation.

Methods                  Test Interp.  SAD    MSE (10⁻³)  Grad
ground-truth             nearest       21.66  6.6         13.76
                         bilinear      5.8    0.4         0.49
                         cubic         1.1    0.02        0.02
IndexNet Matting [29]    nearest       62.43  23.8        42.91
                         bilinear      46.35  14.1        24.41
                         cubic         45.65  13.1        25.47
HOP-5x5                  nearest       63.79  25.4        39.85
                         bilinear      49.71  19.2        27.41
                         cubic         38.45  11.2        18.87
HOP-5x5 + RI             nearest       53.99  19.5        40.81
                         bilinear      30.34  6.5         12.87
                         cubic         -      -           -

Implementation Details

Our network is trained solely with the alpha matte reconstruction loss, formulated as the averaged absolute difference between the estimated and ground-truth alpha mattes:

$$\mathcal{L} = \frac{1}{|\mathcal{T}_u|} \sum_{i \in \mathcal{T}_u} |\alpha_i - \alpha^{gt}_i|, \tag{5}$$

where $\alpha_i$ is the predicted alpha matte at position $i$, $\alpha^{gt}$ is the ground-truth alpha matte and $\mathcal{T}_u$ is the set of unknown pixels in the trimap.

We select the first 11 blocks of ResNet-34 [15] pretrained on ImageNet [34] as the backbone of our opacity encoder. For the appearance encoder, we opt for a stack of strided convolutional layers to extract more low-level information. The network is trained on the foreground images from the Adobe Image Matting dataset [45] and background images from the MS COCO dataset [26]. We follow the basic data augmentation proposed in [25]. We normalize training with both batch normalization [20] and spectral normalization [31]. Optimization is performed with the Adam optimizer [21]. The model is trained in FP16 precision following [16]. Warmup [12] and cosine decay [28] are applied during training.

4 Experiments

We conduct extensive experiments and an ablation study to demonstrate the effectiveness of our proposed HOP matting. We report empirical results on two widely used datasets, the Composition-1k testing set [45] and the alphamatting.com dataset [33]. The results are evaluated under the mean squared error (MSE), the sum of absolute differences (SAD), the gradient error (Grad) and the connectivity error (Conn), as suggested in [33]. We also visualize the attention in our HOP structure for a better understanding.
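For reference, a minimal sketch of the reconstruction loss of Eq. (5) above, assuming PyTorch tensors and the common 0/128/255 trimap convention (the paper does not state the exact encoding):

```python
def matting_loss(alpha_pred, alpha_gt, trimap):
    # T_u: unknown pixels, assuming the usual 0/128/255 trimap encoding.
    unknown = trimap == 128
    # Mean absolute difference over the unknown region only (Eq. 5).
    return (alpha_pred - alpha_gt)[unknown].abs().mean()
```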
Fig. 5.
The visual comparison results on the Adobe Composition-1k testing set [45].
Random Interpolation Augmentation

Empirically, the performance of deep image matting methods is sensitive to image resizing. This is because typical natural image matting approaches attend to detailed texture information in images, and the resize operation may blur edges and high-frequency information, leading to a deterioration in performance. Therefore, most matting methods are evaluated on the original images without any resize operation. In Context-aware Matting [18], the authors observed that different image formats for foreground and background introduce subtle artifacts into composited training images, which can help the network distinguish foreground from background. Analogously, in this section we present a new observation, that deep neural network based matting methods are sensitive to the interpolation algorithm, and introduce the random interpolation augmentation used in our method.

We conduct an empirical experiment on the Composition-1k testing set [45] to support this observation. First, we upsample the RGB images by a factor of 1.5 with a selected interpolation algorithm and then downsample the images to their original size with the same algorithm. More concretely, supposing the RGB image is 800 x 800 and the selected testing interpolation is bilinear, we resize the image to 1200 x 1200 by bilinear interpolation and then resize the new image back to 800 x 800, also by bilinear interpolation. Finally, the resized RGB image is fed forward to the network, and we evaluate the error between the prediction and the original ground truth. It is worth noting that we do not begin with downsampling followed by upsampling, because downsampling first would lead to more information loss.
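A sketch of this resize round trip, assuming OpenCV; the interpolation flag is whichever algorithm is being tested:

```python
import cv2

def resize_round_trip(img, interp, factor=1.5):
    # Upsample by the factor, then downsample back with the SAME algorithm,
    # e.g. interp = cv2.INTER_LINEAR for the bilinear row of Table 1.
    h, w = img.shape[:2]
    up = cv2.resize(img, (int(w * factor), int(h * factor)), interpolation=interp)
    return cv2.resize(up, (w, h), interpolation=interp)
```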
Table 2.
The quantitative results on the Composition-1k testing set. PosE stands for positional encoding, TriE for the trimap embedding in HOP blocks, and RI for the random interpolation augmentation. The variants of our approach are emphasized in italics. Best results are in boldface.

Methods                        SAD     MSE (10⁻³)  Grad   Conn
DCNN Matting [7]               161.4   87          115.1  161.9
Learning Based Matting [48]    113.9   48          91.6   122.2
Information-flow Matting [1]   75.4    66          63.0   -
Deep Matting [45]              50.4    14          31.0   50.8
IndexNet Matting [29]          45.8    13          25.9   43.7
AdaMatting [3]                 41.7    10          16.8   -
SampleNet Matting [37]         40.35   9.9         -      -
GCA Matting [25]               35.28   9.1         16.9   32.5
Context-aware Matting [18]     35.8    8.2         17.3   33.2
HOP-5x5
HOP-9x9
w/o HOP + RI
HOP-5x5 + PosE + TriE + RI
HOP-9x9 + PosE + TriE + RI
Table 3.
Our scores on the alphamatting.com benchmark. S, L and U denote the three trimap types, small, large and user, included in the benchmark.

                  Average Rank                SAD                 MSE                 Gradient Error
Methods           Overall   S     L     U    Overall  S  L  U    Overall  S  L  U    Overall  S  L  U
Ours
AdaMatting [3]    7         6.1   6.1   8.8  8
Evaluation results are reported in Table 1. We provide the method ground-truth as a reference: here we directly resize the ground-truth alpha matte image without any estimation and then calculate the error between the resized and unresized ground-truth images. These results reveal the error introduced by the interpolation itself, which can be seen as an ideal lower bound for the evaluation results.
HOP-5x5 stands for our baseline HOP model with a 5x5 neighborhood in the local HOP block and without positional encoding or trimap embedding in the HOP blocks.
Table 4.
Parameter numbers and efficiency comparison on the Composition-1k testing set, on a single NVIDIA RTX 2080 Ti with 11 GB of memory. (Input images for Deep Matting and Context-aware Matting are downsampled by a factor of 0.8.)

Methods
HOP-1x1
HOP-5x5
HOP-9x9

From Table 1, we can notice that most of the result gaps between different testing interpolations are larger than the ground-truth lower bound. In other words, different interpolation algorithms can introduce more error into inference than the interpolation itself. We can also see that the gap between bilinear and cubic for our HOP-5x5 model is larger than for IndexNet Matting [29]. Our explanation is that, in the data augmentation of the training set, we resize the background image to the same size as the foreground by cubic interpolation, following Deep Matting [45]. This fixed interpolation augmentation makes our model fit cubic interpolation much better than the others.

Based on the empirical observations above, we introduce the random interpolation augmentation into our method. In the data preprocessing of the training phase, we randomly select an interpolation algorithm with equal probability for every resize operation. The composited images in a training mini-batch may therefore be generated by different interpolations. Furthermore, the foreground, background and alpha matte images can be resized with different algorithms before composition. As Table 1 shows, training with random interpolation not only improves the performance but also mitigates the error gap between bilinear and cubic interpolations.
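A minimal sketch of the random interpolation augmentation, assuming OpenCV; the paper states only that an algorithm is drawn with equal probability, so the particular set of interpolations below is our assumption:

```python
import random
import cv2

# Candidate interpolation algorithms (an assumed set).
INTERPS = [cv2.INTER_NEAREST, cv2.INTER_LINEAR, cv2.INTER_CUBIC,
           cv2.INTER_AREA, cv2.INTER_LANCZOS4]

def random_resize(img, size):
    # Every resize draws its own algorithm, so foreground, background and
    # alpha matte may be resized differently before composition.
    return cv2.resize(img, size, interpolation=random.choice(INTERPS))
```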
Results on the Composition-1k Testing Set

The Composition-1k testing set [45] contains 1000 images composited from 50 distinct foregrounds. We compare our method quantitatively with other top-tier natural image matting approaches in Table 2. The variant w/o HOP + RI denotes the backbone network without any HOP blocks, trained with the random interpolation augmentation. All our variants outperform the state-of-the-art methods. Some qualitative results on the Composition-1k testing set are displayed in Figure 5; the results of Deep Matting [45] are generated from the source code and pretrained model provided by IndexNet Matting [29]. Furthermore, we compare the number of parameters and the model efficiency with some of the state-of-the-art methods in Table 4, evaluating the mean inference time per image of the Composition-1k testing set on a single NVIDIA RTX 2080 Ti GPU. Notably, Context-aware Matting [18] and Deep Matting [45] require more than 11 GB of GPU memory to estimate the alpha mattes of the high-resolution images in the Composition-1k testing set; we therefore downsample the input images by a factor of 0.8 for these two methods.
Table 5.
Ablation study on different HOP blocks on the Composition-1k testing set. HOP-Local-k indicates the k-th local HOP block in the decoder.

HOP-Global  HOP-Local-1  HOP-Local-2  HOP-Local-3    SAD    MSE (10⁻³)
                                                     37.89  10.05
✓
✓           ✓
✓           ✓            ✓
✓           ✓            ✓            ✓
Global & Local Self-attention                        35.97  9.24
Table 6.
Ablation study on the positional encoding, trimap embedding and random interpolation augmentation on the Composition-1k testing set.

Method    PosE  TriE  RI    SAD    MSE (10⁻³)  Grad   Conn
HOP-5x5                     34.82  9.0         16.01  32.04

Results on the alphamatting.com Benchmark

The alphamatting.com dataset contains eight test images for online benchmark evaluation. Each test image has three different trimaps (i.e. "small", "large" and "user"). We report the average ranks of our proposed approach on the alphamatting.com benchmark in Table 3. The Overall rank is the average rank over all three trimap types for each evaluation metric. As the rankings in Table 3 show, our HOP matting outperforms the other state-of-the-art approaches under the different evaluation metrics.
Ablation Study

To validate the efficacy of each component of HOP matting, we conduct three different experiments on the Composition-1k testing set. We first evaluate the HOP-5x5 model while removing different HOP blocks. From the results reported in Table 5, we can see that the hierarchical opacity propagation structure is capable of improving the performance of networks in image matting. Global & Local Self-attention denotes the variant that replaces the global HOP block with global self-attention and the local HOP blocks with local self-attention. In the second ablation study, we examine the effect of introducing the positional encoding, trimap embedding and random interpolation augmentation into our method; the quantitative results on the Composition-1k testing set [45] are provided in Table 6. We also evaluate different neighborhood window sizes in the supplementary material.
Fig. 6.
The visualization of attention in our HOP structure on input images.
Visualization

Visualizing the attention of the HOP structure is a convenient way to understand how opacity information is propagated hierarchically in our approach. To this end, we visualize where our model attends by computing a gradient map on the input image. We randomly select a pixel in the unknown region of the predicted alpha matte. A large loss is then assigned to this single pixel, while all other pixels of the prediction are assumed to be perfectly correct, without any loss. Afterwards, back-propagation is executed and the gradient is propagated back to the input image. The gradient map reveals how each pixel of the input image is related to the selected alpha matte pixel in the prediction. We show the gradient map of an image from the Composition-1k testing set [45] in Figure 6. The results without HOP blocks are generated from the model trained for the ablation study reported in Table 5. As we can observe from Figure 6, the model with HOP blocks is able to aggregate information over the whole input image and pays more attention to areas with similar appearance, whereas the model without HOP blocks focuses on a local region around the selected prediction point.
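This procedure can be sketched as follows, assuming a PyTorch model mapping (image, trimap) to an alpha matte; the function and argument names are illustrative:

```python
import torch

def attention_gradient_map(model, image, trimap, pixel_yx):
    # image: (1, 3, H, W); we need its gradient.
    image = image.clone().requires_grad_(True)
    alpha = model(image, trimap)                  # (1, 1, H, W) prediction
    y, x = pixel_yx                               # a random unknown-region pixel
    # Back-propagate from the selected pixel only; every other output pixel
    # is treated as perfectly correct, i.e. contributes no loss.
    alpha[0, 0, y, x].backward()
    # Per-pixel relevance of the input to the chosen prediction.
    return image.grad.abs().sum(dim=1)            # (1, H, W)
```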
5 Conclusions

In this paper, we proposed the HOP matting network for image matting. Our method utilizes local and global HOP blocks to achieve hierarchical opacity propagation across feature maps at different semantic levels. The experimental results demonstrate the superiority of the proposed HOP matting. Furthermore, the effectiveness of our positional encoding and random interpolation augmentation is verified by the ablation study. Considering the success of fully self-attentional networks [32], a promising direction for future work is to investigate how fully HOP-block networks perform in matting. Another interesting direction is a hybrid network of stacked local HOP blocks and fully convolutional blocks.
References
1. Aksoy, Y., Ozan Aydin, T., Pollefeys, M.: Designing effective inter-pixel information flow for natural image matting. In: CVPR (2017)
2. Bahdanau, D., Cho, K., Bengio, Y.: Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473 (2014)
3. Cai, S., Zhang, X., Fan, H., Huang, H., Liu, J., Liu, J., Liu, J., Wang, J., Sun, J.: Disentangled image matting. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 8819–8828 (2019)
4. Chen, Q., Li, D., Tang, C.K.: KNN matting. IEEE TPAMI (2013)
5. Chen, Q., Ge, T., Xu, Y., Zhang, Z., Yang, X., Gai, K.: Semantic human matting. In: ACM MM (2018)
6. Child, R., Gray, S., Radford, A., Sutskever, I.: Generating long sequences with sparse transformers. arXiv preprint arXiv:1904.10509 (2019)
7. Cho, D., Tai, Y.W., Kweon, I.S.: Deep convolutional neural network for natural image matting using initial alpha mattes. IEEE TIP (3), 1054–1067 (2019)
8. Chuang, Y.Y., Curless, B., Salesin, D.H., Szeliski, R.: A Bayesian approach to digital matting. In: CVPR (2). pp. 264–271 (2001)
9. Dai, Z., Yang, Z., Yang, Y., Cohen, W.W., Carbonell, J., Le, Q.V., Salakhutdinov, R.: Transformer-XL: Attentive language models beyond a fixed-length context. arXiv preprint arXiv:1901.02860 (2019)
10. Feng, X., Liang, X., Zhang, Z.: A cluster sampling method for image matting via sparse coding. In: European Conference on Computer Vision. pp. 204–219. Springer (2016)
11. Gastal, E.S., Oliveira, M.M.: Shared sampling for real-time alpha matting. In: Computer Graphics Forum. vol. 29, pp. 575–584. Wiley Online Library (2010)
12. Goyal, P., Dollár, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., He, K.: Accurate, large minibatch SGD: Training ImageNet in 1 hour. arXiv preprint arXiv:1706.02677 (2017)
13. He, K., Rhemann, C., Rother, C., Tang, X., Sun, J.: A global sampling method for alpha matting. In: CVPR 2011. pp. 2049–2056. IEEE (2011)
14. He, K., Sun, J., Tang, X.: Fast matting using large kernel matting Laplacian matrices. In: 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition. pp. 2165–2172. IEEE (2010)
15. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: CVPR (2016)
16. He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., Li, M.: Bag of tricks for image classification with convolutional neural networks. In: CVPR (2019)
17. Hochreiter, S., Schmidhuber, J.: Long short-term memory. Neural Computation 9(8), 1735–1780 (1997)
18. Hou, Q., Liu, F.: Context-aware image matting for simultaneous foreground and alpha estimation. In: Proceedings of the IEEE International Conference on Computer Vision. pp. 4130–4139 (2019)
19. Huang, L., Yuan, Y., Guo, J., Zhang, C., Chen, X., Wang, J.: Interlaced sparse self-attention for semantic segmentation. arXiv preprint arXiv:1907.12273 (2019)
20. Ioffe, S., Szegedy, C.: Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167 (2015)
21. Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014)
22. Kipf, T.N., Welling, M.: Semi-supervised classification with graph convolutional networks. arXiv preprint arXiv:1609.02907 (2016)
23. Lee, P., Wu, Y.: Nonlocal matting. In: CVPR 2011. pp. 2193–2200. IEEE (2011)
24. Levin, A., Lischinski, D., Weiss, Y.: A closed-form solution to natural image matting. IEEE TPAMI (2008)
25. Li, Y., Lu, H.: Natural image matting via guided contextual attention. arXiv preprint arXiv:2001.04069 (2020)
26. Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: Common objects in context. In: ECCV. pp. 740–755. Springer (2014)
27. Lin, Z., Feng, M., Santos, C.N.d., Yu, M., Xiang, B., Zhou, B., Bengio, Y.: A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130 (2017)
28. Loshchilov, I., Hutter, F.: SGDR: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983 (2016)
29. Lu, H., Dai, Y., Shen, C., Xu, S.: Indices matter: Learning to index for deep image matting. In: ICCV (2019)
30. Lutz, S., Amplianitis, K., Smolic, A.: AlphaGAN: Generative adversarial networks for natural image matting. In: BMVC (2018)
31. Miyato, T., Kataoka, T., Koyama, M., Yoshida, Y.: Spectral normalization for generative adversarial networks. arXiv preprint arXiv:1802.05957 (2018)
32. Ramachandran, P., Parmar, N., Vaswani, A., Bello, I., Levskaya, A., Shlens, J.: Stand-alone self-attention in vision models. arXiv preprint arXiv:1906.05909 (2019)
33. Rhemann, C., Rother, C., Wang, J., Gelautz, M., Kohli, P., Rott, P.: A perceptually motivated online benchmark for image matting. In: CVPR (2009)
34. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al.: ImageNet large scale visual recognition challenge. IJCV 115(3), 211–252 (2015)
35. Shen, X., Tao, X., Gao, H., Zhou, C., Jia, J.: Deep automatic portrait matting. In: ECCV (2016)
36. Tang, H., Huang, Y., Fan, Y., Zeng, X., et al.: Very deep residual network for image matting. In: 2019 IEEE International Conference on Image Processing (ICIP). pp. 4255–4259. IEEE (2019)
37. Tang, J., Aksoy, Y., Öztireli, C., Gross, M., Aydın, T.O.: Learning-based sampling for natural image matting. In: Proc. CVPR (2019)
38. Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. In: NIPS (2017)
39. Veličković, P., Cucurull, G., Casanova, A., Romero, A., Liò, P., Bengio, Y.: Graph attention networks. arXiv preprint arXiv:1710.10903 (2017)
40. Wang, J., Cohen, M.F.: Optimized color sampling for robust matting. In: 2007 IEEE Conference on Computer Vision and Pattern Recognition. pp. 1–8. IEEE (2007)
41. Wang, J., Cohen, M.F., et al.: Image and video matting: a survey. Foundations and Trends® in Computer Graphics and Vision 3