Towards Real-Time Automatic Portrait Matting on Mobile Devices
Seokjun Seo, Seungwoo Choi, Martin Kersner, Beomjun Shin, Hyungsuk Yoon, Hyeongmin Byun, Sungjoo Ha
Seokjun Seo∗, Seungwoo Choi∗, Martin Kersner∗, Beomjun Shin∗, Hyungsuk Yoon†, Hyeongmin Byun, and Sungjoo Ha
Hyperconnect, Seoul, South Korea; † Unaffiliated, Seoul, South Korea
{seokjun.seo, seungwoo.choi, martin.kersner}@hpcnt.com, {hyeongmin.byun, shurain}@hpcnt.com

Abstract
We tackle the problem of automatic portrait matting on mobile devices. The proposed model is aimed at attaining real-time inference on mobile devices with minimal degradation of model performance. Our model MMNet, based on multi-branch dilated convolution with linear bottleneck blocks, outperforms the state-of-the-art model and is orders of magnitude faster. The model can be accelerated four times to attain 30 FPS on a Xiaomi Mi 5 device with a moderate increase in the gradient error. Under the same conditions, our model has an order of magnitude fewer parameters and is faster than Mobile DeepLabv3 while maintaining comparable performance. The accompanying implementation can be found at https://github.com/hyperconnect/MMNet.
1. Introduction
Image matting, the task of predicting the alpha value of the foreground at every pixel, has been studied extensively [8, 12-15]. An image matting system opens up a wide range of applications in computer vision such as color transformation, stylization, and background editing. It is well known, however, that image matting is an ill-posed problem [19], since seven unknown values (three for the foreground RGB, three for the background RGB, and one for alpha) must be inferred from three known RGB values. The most widely used way to alleviate the difficulty of the matting problem is to utilize an additional input that roughly separates an image, such as a trimap [12, 30] or scribbles [19].

∗ Equal contributions. † Work done while at Hyperconnect.
Figure 1. The trade-off between gradient error and latency on a mobile device. Latency is measured using a Qualcomm Snapdragon 820 MSM8996 CPU. The size of each circle is proportional to the logarithm of the number of parameters used by the model. Different circles of Mobile DeepLabv3 are created by varying the output stride and width multiplier. The circles are marked with their width multiplier. Results using the smaller input resolution are marked with ∗; otherwise, the larger input resolution is used. Notice that MMNet outperforms all other models, forming a Pareto front. The number of parameters for LDN+FB is not reported in their paper. Best viewed in color.

A trimap splits an image into three parts: definite foreground, definite background, and ambiguous blended regions. Scribbles, on the other hand, indicate the foreground and the background with a few strokes. Even though some of the traditional methods [19, 27, 30, 34] work well if additional inputs are provided, it is hard to extend these methods to various image and video matting applications that require real-time performance, due to their high computational complexity as well as their dependency on user-interactive inputs. Other approaches have been studied to automate matting by specifying the object that has to be selected as the foreground [9, 29, 36], for example, portrait matting. Automatic portrait matting showed even better results than other methods using trimaps [29], but the latency is far too high to be used in a real-time application. Zhu et al. [39] released a lightweight model which can perform automatic matting relatively fast on mobile devices, attaining a latency of 62 ms per image on a Xiaomi Mi 5. However, the gradient error of the lightweight model was more than two times worse than that of the state of the art, which makes it less attractive in real-world applications.

Figure 2. The overall structure of the proposed model. A standard encoder-decoder architecture is adopted. Successively applied encoder blocks summarize spatial information and capture higher semantic information. The decoding phase upsamples the image with decoder blocks and improves the result with enhancement blocks. Information from skip connections is concatenated with the upsampled information. Images are resized to the target size before going through the network. The resulting alpha matte is converted back to its original resolution.

In this paper, we propose a compact neural network model for automatic portrait matting which is fast enough to run on mobile devices. The proposed model adopts an encoder-decoder network structure [4] and focuses on devising efficient components of the network. We apply depthwise convolution [31] as the basic convolution operation to extract and downsample features. The depthwise convolution is considerably cheaper than other convolutions, even if we take efficient convolutions such as the 1 × 1 convolution [16] into account as well. The linear bottleneck structure [26] benefits from the efficiency of depthwise convolutions, boosting the performance while maintaining the latency. Building upon these observations, the encoder block of the proposed model consists of multi-branch dilated convolution with linear bottleneck blocks, which reduces the model size with the linear bottleneck structure while aggregating multi-scale information with multi-branch dilated convolutions.
We introduce the width multiplier, a global hyperparameter that enlarges or shrinks the number of channels of every convolution, to control the trade-off between the size and the latency of the model. We incorporate multiple losses into our loss function, including a gradient loss which we propose.

The proposed model shows better performance than the state-of-the-art method while achieving 30 FPS on iPhone 8 without GPU acceleration. We also evaluate the trade-offs between performance, the latency on mobile devices, and the size of the model. Our model can achieve 30 FPS on Google Pixel 1 and Xiaomi Mi 5 using a single core while suffering roughly 10% degradation of gradient error compared to the state of the art.

Our contributions are as follows:
• We propose a compact network architecture for the automatic portrait matting task which achieves real-time latency on mobile devices.
• We explore multiple combinations of input resolution and width multiplier, which can beat strong baselines for automatic portrait matting on mobile devices.
• We demonstrate the capability of each component of the model, including the multi-branch dilated convolution with linear bottleneck blocks, the skip connection refinement block, and the enhancement block, through ablation studies.
2. Methods
Image matting takes an input image I, which is a mixture of a foreground image F and a background image B. Each pixel at the i-th position is computed as follows:

I_i = α_i F_i + (1 − α_i) B_i,    (1)

where the foreground opacity determines α_i. Since all the quantities on the right-hand side of Equation 1 are unknown, the problem is ill-posed. However, we add the assumption that F_i and B_i are identical to I_i in order to reduce the complexity of the problem. Even though the assumption may decrease the performance substantially, the empirical results of our experiments show this assumption is reasonable considering the latency gain we get.

Our model follows a standard encoder-decoder architecture that is widely used in semantic segmentation tasks [4, 20, 24]. The encoder successively reduces the size of the input by downsampling and summarizes the spatial information while capturing higher semantic information. The decoder, in turn, upsamples the image to recover the detailed spatial information and restores the original input resolution. The whole network structure of our model, the mobile matting network (MMNet), is depicted in Figure 2.

Many modern neural network architectures replace a regular convolution with a combination of several cheaper convolutions [11, 32, 35]. Depthwise separable convolution [11, 16] is one such example, which consists of a depthwise convolution, applying a single convolutional filter per input channel, and a pointwise convolution (1 × 1 convolution) that accumulates the results. We not only use depthwise separable convolution for some blocks but also adopt the concept of depthwise separable convolution when designing our encoder block; depthwise convolution in particular is used extensively. All convolution operations are followed by batch normalization and a ReLU6 non-linearity, except the linear projection operation that is placed at the end of the encoder block [26]. Due to the linear bottleneck structure, the information flow from one encoder block to another is projected to a low-dimensional representation.

In the encoder block, the information flowing from the lower layers is expanded by the first 1 × 1 multi-branch convolutions. The linear bottleneck compresses the processed image. The data upsampled by the decoder block is concatenated with the refined knowledge through a skip connection. The number of channels for each path is maintained to have the same value. Table 1 details how much each component expands and compresses the information flow.

To control the trade-off between model size and model performance, we adopt a width multiplier [16]. The width multiplier α is a global hyperparameter that is multiplied to the number of input and output channels to make the layers thinner or thicker depending on the computational budget.

Figure 3. The encoder block. It employs a multi-branched dilated convolution with a linear bottleneck. The linear bottleneck compresses the information to a low-dimensional representation before handing it over to the next encoder block.
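To make the basic building blocks concrete, the snippet below sketches a depthwise separable convolution with the ReLU6 non-linearity and a width-multiplier helper, written with the Keras layers of TensorFlow (the framework used in the paper). The rounding rule in apply_width_multiplier and all layer hyperparameters are illustrative assumptions, not values taken from the paper.

```python
import tensorflow as tf
from tensorflow.keras import layers

def apply_width_multiplier(channels, alpha):
    """Scale a channel count by the width multiplier (rounding rule is assumed)."""
    return max(8, int(channels * alpha))

def depthwise_separable_conv(x, out_channels, alpha=1.0, stride=1):
    """Depthwise 3x3 convolution followed by a pointwise 1x1 convolution.

    Every convolution is followed by batch normalization and ReLU6,
    as described for the non-projection layers of MMNet.
    """
    out_channels = apply_width_multiplier(out_channels, alpha)
    x = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(6.0)(x)
    x = layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(6.0)(x)
    return x
```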
Name          | Component Details | Output Channels
Initial Block | Conv              | 32
Encoder 1     | DR [1, 2, 4, 8]   | 16
Encoder 2     | DR [1, 2, 4, 8]   | 24
Encoder 3     | DR [1, 2, 4, 8]   | 24
Encoder 4     | DR [1, 2, 4, 8]   | 24
Encoder 5     | DR [1, 2, 4]      | 40
Encoder 6     | DR [1, 2, 4]      | 40
Encoder 7     | DR [1, 2, 4]      | 40
Encoder 8     | DR [1, 2, 4]      | 40
Encoder 9     | DR [1, 2]         | 80
Encoder 10    | DR [1, 2]         | 80
Decoder 1     | Upsample (Skip 5) | 128
Decoder 2     | Upsample (Skip 1) | 80
Enhancement 1 | DR [1, 2, 4]      | 40
Enhancement 2 | DR [1, 2, 4]      | 40
Decoder 3     | Upsample          | 16
Final Block   | Conv, Softmax     | 2

Table 1. The model architecture of MMNet. We assume that the width multiplier is set to 1.0.

The MMNet encoder block has a multi-branched dilated convolution structure with a linear bottleneck. The input flows into multiple branches, each of which undergoes channel expansion followed by a strided convolution and a dilated convolution. The dilation rates differ across branches, with the n-th branch using a rate of 2^(n−1). The multi-branch dilated convolution amounts to sampling spatial information at different scales. The outputs of the different branches are concatenated to form a tensor containing multi-scale information. Applying encoder blocks in succession allows the network to capture increasingly higher-level information. As the encoder blocks are consecutively applied, we decrease the number of branches in an encoder block, slowly changing the dilation rates from [1, 2, 4, 8] to [1, 2].

A linear bottleneck structure is imposed on the encoder block: the output of the encoder block is thinner than the intermediate representations. The final convolution after combining the multi-branch information projects the input to a low-dimensional compressed representation. The linear bottleneck is a decomposition of a regular convolution that connects two encoder blocks into two cheaper convolutions with reduced channels. The encoder block is illustrated in Figure 3.

Figure 4. The decoder block (a) upsamples bilinearly, which can be repeated multiple times to upsample by a larger factor. The refinement block (b) is added to each skip connection, where the direct information from a lower level is refined before merging with the higher-level information from a decoder block.
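As a concrete illustration of the encoder block in Figure 3, the sketch below builds a multi-branch dilated convolution with a linear bottleneck using Keras layers. The branch composition (1 × 1 expansion, strided depthwise convolution, dilated depthwise convolution) follows the description above; the expansion factor and the exact placement of batch normalization are our own assumptions.

```python
import tensorflow as tf
from tensorflow.keras import layers

def bn_relu6(x):
    x = layers.BatchNormalization()(x)
    return layers.ReLU(6.0)(x)

def encoder_block(x, out_channels, dilation_rates=(1, 2, 4, 8), stride=2, expansion=4):
    """Multi-branch dilated convolution with a linear bottleneck (cf. Figure 3).

    Each branch expands the channels with a 1x1 convolution, downsamples with a
    strided 3x3 depthwise convolution, and then applies a dilated 3x3 depthwise
    convolution with its own dilation rate. The concatenated branches are
    projected back to a thin representation by a linear 1x1 convolution.
    """
    branch_channels = out_channels * expansion // len(dilation_rates)  # assumption
    branches = []
    for rate in dilation_rates:
        b = layers.Conv2D(branch_channels, 1, padding="same", use_bias=False)(x)
        b = bn_relu6(b)
        b = layers.DepthwiseConv2D(3, strides=stride, padding="same", use_bias=False)(b)
        b = bn_relu6(b)
        b = layers.DepthwiseConv2D(3, dilation_rate=rate, padding="same", use_bias=False)(b)
        b = bn_relu6(b)
        branches.append(b)
    x = layers.Concatenate()(branches)
    # Linear bottleneck: 1x1 projection with no non-linearity after it.
    x = layers.Conv2D(out_channels, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return x
```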
The decoder performs multiple upsampling steps to restore the initial resolution of the input image. To help the decoder with the restoration of low-level features from compressed spatial information, skip connections are employed to directly connect the output of a lower-layer encoder to its corresponding decoder [24]. Instead of using the information provided by the corresponding encoder blocks without any modification, we refine the information by performing a 3 × 3 depthwise separable convolution. The resulting refined information is concatenated with the upsampled information. This refinement technique is reminiscent of the refinement module proposed in SharpMask [22, 33]. A decoder block with a refinement block is illustrated in Figure 4. In this work, we connect the feature map of an earlier encoder block to the decoder with a single larger upsampling step instead of the usual ×2 upsampling, to shorten the decoding pipeline.

As the decoder block keeps upsampling the feature map, there is no way to enhance the predictions of neighboring values. To tackle this problem, we insert two enhancement blocks in the middle of the decoding phase. Rather than designing a new block, we share the same architecture with the encoder block. The only difference between the enhancement block and the encoder block is that the depthwise convolution with stride two is removed, because the enhancement block should sustain the resolution of the feature map. In the ablation study, we show the effectiveness of the enhancement block.
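The following sketch, again using Keras layers, shows one possible reading of the decoder and refinement blocks in Figure 4: bilinear ×2 upsampling (optionally repeated), a 1 × 1 convolution, and a skip connection refined by a 3 × 3 depthwise separable convolution before concatenation. Channel counts and the exact composition are assumptions for illustration.

```python
import tensorflow as tf
from tensorflow.keras import layers

def refinement_block(skip, channels):
    """Refine a skip connection with a 3x3 depthwise separable convolution."""
    x = layers.DepthwiseConv2D(3, padding="same", use_bias=False)(skip)
    x = layers.Conv2D(channels, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU(6.0)(x)

def decoder_block(x, skip, channels, repeats=1):
    """Bilinear x2 upsampling (repeatable), a 1x1 convolution, then concatenation
    with the refined skip connection, as sketched in Figure 4."""
    for _ in range(repeats):
        x = layers.UpSampling2D(size=2, interpolation="bilinear")(x)
    x = layers.Conv2D(channels, 1, padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU(6.0)(x)
    if skip is not None:
        x = layers.Concatenate()([x, refinement_block(skip, channels)])
    return x
```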
The alpha loss and the compositional loss are frequently used in matting tasks. The alpha loss L_α measures the mean absolute difference between the ground truth mask and the mask predicted by the model. The compositional loss L_c measures the mean absolute difference between the ground truth RGB foreground pixels and the model-predicted RGB foreground pixels. The compositional loss penalizes the model when it incorrectly predicts a pixel with a high value.

L_α = (1/K) Σ_{i=1}^{K} |α_i − α_i^{gt}|    (2)
L_c = (1/3K) Σ_{i=1}^{K} Σ_{j=1}^{3} |α_{ij} I_{ij} − α_{ij}^{gt} I_{ij}|,    (3)

where K is equal to the width times height, W × H, and α is a vectorized alpha matte where each pixel value is indexed by the subscript i. The gt superscript denotes that the alpha matte is the ground truth.

We also use the KL divergence between the ground truth A^{gt} ∈ R^{W×H} and the model prediction A ∈ R^{W×H}. The KL divergence is defined to be

L_KL = −p(A^{gt}) log [ p(A) / p(A^{gt}) ]    (4)
     = −p(A^{gt}) log p(A) + p(A^{gt}) log p(A^{gt}).    (5)

The second term is the entropy of the ground truth alpha matte, which is constant with respect to the model prediction A. Removing the second term leads to optimization of the following loss:

˜L_KL = −(1/K) Σ_{i=1}^{K} ( α_i^{gt} log α_i + (1 − α_i^{gt}) log(1 − α_i) ).    (6)

Two additional loss terms are included in the loss function. An auxiliary loss [31] L_aux helps with the gradient flow by adding a KL divergence loss between the downsampled ground truth mask and the output of the encoder. A gradient loss L_grad guides the model to capture fine-grained details in the edges. We use a Sobel-like filter S,

S = [[−1, 0, 1], [−2, 0, 2], [−1, 0, 1]],    (7)

to create a concatenation of two image derivatives G(A) = [S ∗ A, S^T ∗ A], where ∗ is a convolution. The resulting G(·) yields a two-channel output that contains the gradient information along the x-axis and the y-axis. We apply G(·) to both the ground truth mask and the model-predicted mask to compute the mean absolute differences. The gradient loss is computed as follows:

L_grad = (1/2K) Σ_i |G(A)_i − G(A^{gt})_i|    (8)
       = (1/2K) Σ_{i=1}^{K} ( |(S ∗ A − S ∗ A^{gt})_i| + |(S^T ∗ A − S^T ∗ A^{gt})_i| ),    (9)

where the sum in Equation 8 runs over both channels of G(·). Equation 10 gives the loss function of our proposed network:

L = β_1 L_α + β_2 L_c + β_3 ˜L_KL + β_4 L_grad + β_5 L_aux,    (10)

where the β values control the influence of each loss term. We set them all to one in the following experiments.
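As a sanity check of the loss definitions above, here is one possible TensorFlow implementation sketch. The clipping constant, the handling of the auxiliary output, and the treatment of image channels are our own assumptions; the β weights are all set to one as in the paper.

```python
import tensorflow as tf

# Sobel-like filter from Equation 7 and its transpose, stacked as a
# [3, 3, in_channels=1, out_channels=2] kernel implementing G(.) in the text.
_S = tf.constant([[-1., 0., 1.], [-2., 0., 2.], [-1., 0., 1.]])
_SOBEL = tf.stack([_S, tf.transpose(_S)], axis=-1)[:, :, tf.newaxis, :]

def image_gradients(alpha):
    """G(A) = [S * A, S^T * A] for an alpha matte of shape [B, H, W, 1]."""
    return tf.nn.conv2d(alpha, _SOBEL, strides=[1, 1, 1, 1], padding="SAME")

def matting_loss(alpha_pred, alpha_gt, image, aux_pred=None, aux_gt=None, eps=1e-6):
    """Weighted sum of the losses in Equation 10 with all beta values set to 1."""
    l_alpha = tf.reduce_mean(tf.abs(alpha_pred - alpha_gt))                  # Eq. 2
    l_comp = tf.reduce_mean(tf.abs(alpha_pred * image - alpha_gt * image))   # Eq. 3
    p = tf.clip_by_value(alpha_pred, eps, 1.0 - eps)
    l_kl = -tf.reduce_mean(alpha_gt * tf.math.log(p)
                           + (1.0 - alpha_gt) * tf.math.log(1.0 - p))        # Eq. 6
    l_grad = tf.reduce_mean(tf.abs(image_gradients(alpha_pred)
                                   - image_gradients(alpha_gt)))             # Eq. 8
    loss = l_alpha + l_comp + l_kl + l_grad
    if aux_pred is not None:
        q = tf.clip_by_value(aux_pred, eps, 1.0 - eps)
        loss += -tf.reduce_mean(aux_gt * tf.math.log(q)
                                + (1.0 - aux_gt) * tf.math.log(1.0 - q))     # L_aux
    return loss
```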
3. Experiments
Automatic portrait matting takes an input image with a portrait and labels each pixel with a linear mixture of the foreground and the background.

We use the data provided by Shen et al. [29], which consists of 2,000 images, of which 1,700 and 300 images are split into the training and testing sets respectively. To overcome the lack of training data, we augment images with scaling, rotation, and left-right flips. First, an image is rescaled to the input size of the model and a random scaling factor is selected from a fixed range; the image is then scaled by the selected factor. Rotation by a random angle within a fixed symmetric range is applied with a probability of 0.5, which means that half of the augmented images are not rotated. Additional cropping is performed to make the size of the image match the input size of the model. Finally, a left-right flip is also applied with a fixed probability.

To train our model, we optimize it with respect to the loss function in Equation 10 using the Adam optimizer with a batch size of 32 and a fixed learning rate. Input images are resized to one of two input resolutions; the model trained on the smaller images is faster but produces worse alpha mattes than the model trained on the larger images. Weight decay is also applied. All experiments are conducted using TensorFlow [3] and trained on a single Titan V GPU.

Following the work of Zhu et al. [39], we use the gradient error to evaluate our model on the portrait matting problem. The gradient error as a metric, which is different from the gradient loss, is defined as

(1/K) Σ_{i=1}^{K} ‖∇α_i − ∇α_i^{gt}‖,    (11)

where α is the alpha matte predicted by the model, α^{gt} is the corresponding ground truth, and K is equal to width × height. ∇ denotes the differential operator, computed by convolving the alpha matte with first-order Gaussian derivative filters [23].

Another metric we use to evaluate our model is the mean absolute difference (MAD), defined as

(1/K) Σ_{i=1}^{K} |α_i − α_i^{gt}|.    (12)

For a fair comparison with previous methods, we scale the predicted alpha matte back to the original resolution of the input images and calculate the evaluation metrics.

We compare our model to DAPM [29], LDN+FB [39], and Mobile DeepLabv3 [26]. Mobile DeepLabv3 exploits MobileNetV2 as its feature extractor and has its atrous spatial pyramid pooling (ASPP) module removed, as suggested by Sandler et al. [26]. We use Equation 10 to optimize Mobile DeepLabv3 on an equal footing with MMNet, but remove the auxiliary loss since it requires a modification to the network architecture.
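A possible NumPy/SciPy sketch of the two evaluation metrics is shown below. The Gaussian-derivative sigma is an assumed default, and the per-pixel normalization follows Equations 11 and 12.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def gradient_error(alpha_pred, alpha_gt, sigma=1.4):
    """Mean norm of the difference of first-order Gaussian derivatives (Eq. 11).

    `sigma` is an assumed value; the paper follows the benchmark of [23].
    """
    def grad(a):
        gx = gaussian_filter(a, sigma, order=(0, 1))  # derivative along x
        gy = gaussian_filter(a, sigma, order=(1, 0))  # derivative along y
        return np.stack([gx, gy], axis=-1)
    diff = grad(alpha_pred.astype(np.float64)) - grad(alpha_gt.astype(np.float64))
    return np.mean(np.linalg.norm(diff, axis=-1))

def mean_absolute_difference(alpha_pred, alpha_gt):
    """Mean absolute difference between predicted and ground-truth mattes (Eq. 12)."""
    return np.mean(np.abs(alpha_pred.astype(np.float64) - alpha_gt.astype(np.float64)))
```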
4. Results
Table 2 compares the results of DAPM [29], LDN+FB [39], Mobile DeepLabv3 [26], and the proposed method. Input images were scaled to one of the two input resolutions, depending on the hyper-parameter. When smaller images are fed into the network, the latency drops considerably at the expense of the quality of the alpha matte. Input images were rescaled back to their original resolutions before evaluation. The gradient error and the latency for DAPM and LDN+FB are those reported by Zhu et al. [39]. For a fair comparison, we compute the latency of the models on a Xiaomi Mi 5 device (Qualcomm Snapdragon 820 MSM8996 CPU), as suggested by Zhu et al. [39]. Since Zhu et al. [39] did not report how much CPU resources they used, we measure the latency by restricting the use to a single core. Specifically, we use the TensorFlow Lite [2] benchmark tool to compute the latency of Mobile DeepLabv3 and MMNet by averaging 100 runs of model inference on a Xiaomi Mi 5 device while restricting the models to use a single thread.

Zhu et al. [39] report that DAPM takes 6 seconds on a computer with an E5-2600 @ 2.60GHz CPU. MMNet-1.0 outperforms DAPM while running orders of magnitude faster on a mobile CPU. When the input image is resized to the smaller resolution for faster inference, our model attains real-time inference, surpassing the rate of 30 frames per second. The real-time version of MMNet is still competitive against DAPM, with a moderate increase in its gradient error.

Figure 5. Visual comparison of different models: (a) input image, (b) ground truth, (c) Graph Cut, (d) MD16-0.75, (e) MMNet-1.00, (f) MMNet-1.00*. Graph Cut [25] results were obtained using the OpenCV library [1]. The column marked with ∗ displays the result using the smaller inputs. MMNet is better able to construct delicate details compared to other models. Note that MMNet with the smaller input still outputs a reasonable alpha matte despite its reduced capacity.

The visual comparison of alpha mattes in Figure 5 illustrates the qualitative differences between models. MMNet is better able to construct the finer details compared to other models. Even the real-time version of MMNet produces a reasonable alpha matte despite its reduced capacity.

Method                    | Time (ms) | Gradient Error (×10⁻³)
Graph-cut trimap †        | -         | 4.93
Trimap by [28] †          | -         | 4.61
Trimap by FCN [20] †      | -         | 4.14
Trimap by DeepLab [6] †   | -         | 3.91
Trimap by CRFasRNN [38] † | -         | 3.56
DAPM [29]                 | -         | 3.03
LDN+FB [39]               | 140       | 7.40
MD16-0.75                 | 146       | 3.23
MD16-1.0                  | 203       | 3.22
MD16-0.75 ∗               | 38        | 3.71
MMNet-1.0                 | 129       | 2.93
MMNet-1.4                 | 213       |
MMNet-1.0 ∗               |           |

Table 2. Model comparisons on the test split. Time is computed on a Xiaomi Mi 5 phone. Mobile DeepLabv3 used an output stride of 16. Floating point numbers in the method name indicate the width multiplier. Rows marked with ∗ display the result using the smaller inputs. Our model outperforms other models while processing images at a faster rate. The entries marked with † are copied from Shen et al. [29].

To examine the trade-off between execution time and model performance, we explore the model space by varying the width multiplier values and the input resolution. We compare our model with Mobile DeepLabv3 as suggested by Sandler et al. [26]. Table 3 details the result of the experiment. The results are sorted by latency, and models with comparable execution time are clustered using horizontal dividers. We see that our proposed model dominates Mobile DeepLabv3 in all clusters in terms of gradient error. Also, note that the number of parameters differs by an order of magnitude. Requiring a small number of parameters is especially appealing if we target a mobile device, since end-users do not have to download a bulky model whenever there is an update of the model.

Figure 1 plots the trade-off between gradient error and latency on a mobile device. Note that MMNet forms a Pareto front in this space and outperforms other models. Latency comparisons on Pixel 1 and iPhone 8 are included in the supplementary material.
Our proposed network owes its performance to several building blocks utilized in its model architecture. We analyze the impact of each design choice by performing ablation experiments.

Method       | Time (ms) | Gradient (×10⁻³) | MAD (×10⁻³) | Params (M)
MD16-0.75    | 146       | 3.25             |             |
MMNet-1.00   | 129       | 2.93             |             |
             | 113       | 3.53             | 2.61        | 1.327
MMNet-0.75   | 90        |                  |             |
             | 66        | 3.61             | 2.85        | 0.713
MMNet-0.50   |           |                  |             |
             | 53        | 3.68             | 2.88        | 2.142
MD8-0.35 ∗   |           |                  |             |
             | 44        | 3.72             | 3.07        | 0.454
MD16-0.75 ∗  |           |                  |             |
             | 38        | 3.77             | 2.96        | 1.327
MMNet-1.00 ∗ |           |                  |             |

Table 3. Comparison of MMNet against Mobile DeepLabv3. Floating point numbers in the method name indicate the width multiplier. Rows marked with ∗ display the result using the smaller inputs. Output strides of 8 and 16 were tested for Mobile DeepLabv3. Note that the proposed model dominates Mobile DeepLabv3 when the latency is less than 60 ms. In the slower regime, MMNet still outperforms Mobile DeepLabv3 in gradient error but is sometimes worse in MAD. The quantized model is included in the last row.

Dilation Rates
We study the effect of different dilation rates in the encoder block. The proposed model contains multi-branch dilated convolutions in the encoder block. We analyze the impact of this decision by fixing the dilation rates to one.
Refinement Block
Whenever there is a skip connection, we have included a refinement block to improve the decoding quality. The refinement block enhances the result of the encoder block by performing a 3 × 3 depthwise separable convolution followed by batch normalization and a ReLU6 non-linearity. We remove the refinement block and study its impact on the final result.

Enhancement Block
The enhancement blocks are intended to give the network a layer to improve the final result before its resolution is fully recovered. We study the effect of the enhancement block by removing it entirely from the network.

Table 4 shows the results when different components of the model architecture are modified. We see that all the components contribute to the final performance of the proposed model. When the dilation rate is fixed to one, the network has a hard time generalizing due to its limited effective receptive field. Enhancement and refinement in the decoding phase also boost the network performance.

Method                           | Gradient Error (×10⁻³)
No dilation                      | 3.25
No enhancement in decoding       | 3.04
No refinement in skip connection | 3.07
Proposed model                   | 2.93

Table 4. Ablation study on the test split of the matting dataset. All experiments are performed using MMNet with a width multiplier of 1.0.
We demonstrate the full pipeline for training a real-time portrait matting model targeting a mobile platform by incorporating quantization of our model. Quantization of model parameters and activations reduces the bit-width required by the model. The reduction of bit-width allows one to exploit integer arithmetic to boost network inference speed.

The target model undergoes a quantization-aware training phase via fake quantization [17]. While maintaining full-precision weights, tensors are downcast to fewer bits during the forward pass. On the backward pass, the full-precision weights are updated instead of the downcast tensors from which the gradients are computed. Once training is complete, quantized models are executed using the TensorFlow Lite framework [2].

Table 3 contains the result of the 8-bit quantized model. The model enjoys a 25% decrease in latency and a better gradient error. The details of quantization are included in the supplementary material.
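For illustration, fake quantization of an activation tensor can be expressed with TensorFlow's built-in op as sketched below. The quantization range here is an assumed example; in practice the ranges are learned or collected during quantization-aware training using the tooling described in the supplementary material.

```python
import tensorflow as tf

def fake_quantize_activation(x, min_val=-6.0, max_val=6.0):
    """Simulate 8-bit quantization in the forward pass.

    The op rounds values to the 8-bit grid defined by [min_val, max_val] but
    passes gradients through to the full-precision tensor, so the weights are
    still updated in full precision during training. The range is only an
    illustrative assumption.
    """
    return tf.quantization.fake_quant_with_min_max_args(
        x, min=min_val, max=max_val, num_bits=8)
```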
5. Related Work
The image matting task has mostly been approached using sampling-based [12, 13, 15, 27, 34] or propagation-based [8, 14, 19, 30] ideas. Recently, with the success of convolutional neural networks (CNNs) in computer vision tasks, there has been a growing number of works utilizing CNNs. Cho et al. [10] proposed an end-to-end network which relies on the outputs of other matting algorithms, such as closed-form matting [19] and KNN matting [8], to produce the final alpha matte. Shen et al. [29] proposed an automatic image matting method leveraging a CNN to create a trimap which is fed to closed-form matting [19], backpropagating the matting error to the trimap network. Xu et al. [36] take the approach further by directly learning the alpha matte. Chen et al. [9] combine trimap generation and alpha matte generation using a fusion module.

Many works on image matting are mainly focused on achieving higher accuracy rather than real-time inference. Recently, however, researchers have been shifting the focus to networks that accommodate real-time inference [39]. Zhu et al. [39] studied real-time portrait matting on mobile devices, which is directly comparable to our result.

Since the work of Long et al. [20], fully convolutional networks (FCNs) have been widely used in various segmentation tasks [18, 37]. Many semantic segmentation networks adopt an encoder-decoder structure [4]. The proposed model uses skip connections to concatenate the output of an encoder block to a decoder block, which has been known to improve the results of semantic pixel-wise segmentation tasks [24].

Chen et al. [6] proposed the DeepLab [5, 7] architecture, which extensively uses the ASPP module. The ASPP module aims to solve the problem of efficient upsampling and handling objects at multiple scales. Our model adopts a multi-branch structure from the Inception network [31], together with dilated convolutions of different dilation rates, which resembles the ASPP module.

One of the most prominent lightweight neural networks is MobileNet and its variants [16, 26]. Depthwise separable convolution was shown to be extremely effective in creating a lightweight network while keeping the accuracy drop to a tolerable level.

ENet, an efficient neural network architecture designed with the intention of tackling semantic segmentation, was proposed by Paszke et al. [21]. Our work is inspired by the design choices detailed in their work for creating an efficient neural network.
6. Conclusions
In this work, we have proposed an efficient model for performing the automatic portrait matting task on mobile devices. We were able to accelerate the model four times to achieve 30 FPS on a Xiaomi Mi 5 device with only a 15% increase in the gradient error. Comparison against Mobile DeepLabv3 showed that our model is not only faster when the performance is comparable, but also requires an order of magnitude fewer parameters. Through ablation studies, we have shown that our choice of the multi-branch dilated convolution with a linear bottleneck is essential in maintaining high performance. We also make our implementation available at https://github.com/hyperconnect/MMNet.

A natural extension of our work is to handle the general image matting problem, such as automatic saliency matting. Since we can already achieve real-time inference, it is natural to extend the work further by tackling the video matting problem as well. Pushing for real-time inference on mobile devices requires a carefully prepared pipeline for it to work in a real-world setting. Distillation to guide the mobile-friendly model in training and even lower-bit quantization for added speedup are highly desired.
References

[1] OpenCV. https://opencv.org/. Accessed: 2018-10-23.
[2] TensorFlow Lite. Accessed: 2018-10-23.
[3] M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving, M. Isard, et al. TensorFlow: A system for large-scale machine learning. In Proceedings of the USENIX Symposium on Operating Systems Design and Implementation, 2016.
[4] V. Badrinarayanan, A. Kendall, and R. Cipolla. SegNet: A deep convolutional encoder-decoder architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.
[5] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam. Rethinking atrous convolution for semantic image segmentation. arXiv preprint arXiv:1706.05587, 2017.
[6] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834-848, 2018.
[7] L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous separable convolution for semantic image segmentation. In Proceedings of the European Conference on Computer Vision, 2018.
[8] Q. Chen, D. Li, and C.-K. Tang. KNN matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(9):2175-2188, Sept 2013.
[9] Q. Chen, T. Ge, Y. Xu, Z. Zhang, X. Yang, and K. Gai. Semantic human matting. In Proceedings of the ACM Multimedia Conference, pages 618-626, 2018.
[10] D. Cho, Y.-W. Tai, and I. Kweon. Natural image matting using deep convolutional neural networks. In Proceedings of the European Conference on Computer Vision, 2016.
[11] F. Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2017.
[12] Y.-Y. Chuang, B. Curless, D. H. Salesin, and R. Szeliski. A Bayesian approach to digital matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2001.
[13] E. S. L. Gastal and M. M. Oliveira. Shared sampling for real-time alpha matting. Computer Graphics Forum, 29(2):575-584, May 2010. Proceedings of Eurographics.
[14] L. Grady, T. Schiwietz, S. Aharon, and R. Westermann. Random walks for interactive alpha-matting. In Proceedings of Visualization, Imaging, and Image Processing, 2005.
[15] K. He, C. Rhemann, C. Rother, X. Tang, and J. Sun. A global sampling method for alpha matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2011.
[16] A. G. Howard, M. Zhu, B. Chen, D. Kalenichenko, W. Wang, T. Weyand, M. Andreetto, and H. Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861, 2017.
[17] B. Jacob, S. Kligys, B. Chen, M. Zhu, M. Tang, A. Howard, H. Adam, and D. Kalenichenko. Quantization and training of neural networks for efficient integer-arithmetic-only inference. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[18] S. Jégou, M. Drozdzal, D. Vazquez, A. Romero, and Y. Bengio. The one hundred layers tiramisu: Fully convolutional DenseNets for semantic segmentation. In IEEE Conference on Computer Vision and Pattern Recognition Workshops, 2017.
[19] A. Levin, D. Lischinski, and Y. Weiss. A closed-form solution to natural image matting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 30(2):228-242, 2008.
[20] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[21] A. Paszke, A. Chaurasia, S. Kim, and E. Culurciello. ENet: A deep neural network architecture for real-time semantic segmentation. arXiv preprint arXiv:1606.02147, 2016.
[22] P. O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In Proceedings of the European Conference on Computer Vision, 2016.
[23] C. Rhemann, C. Rother, J. Wang, M. Gelautz, P. Kohli, and P. Rott. A perceptually motivated online benchmark for image matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2009.
[24] O. Ronneberger, P. Fischer, and T. Brox. U-Net: Convolutional networks for biomedical image segmentation. In Proceedings of the International Conference on Medical Image Computing and Computer Assisted Intervention, 2015.
[25] C. Rother, V. Kolmogorov, and A. Blake. "GrabCut": Interactive foreground extraction using iterated graph cuts. ACM Transactions on Graphics, 23(3):309-314, August 2004.
[26] M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, and L.-C. Chen. MobileNetV2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018.
[27] E. Shahrian, D. Rajan, B. Price, and S. Cohen. Improving image matting using comprehensive sampling sets. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2013.
[28] X. Shen, A. Hertzmann, J. Jia, S. Paris, B. Price, E. Shechtman, and I. Sachs. Automatic portrait segmentation for image stylization. In Computer Graphics Forum, volume 35, pages 93-102, 2016.
[29] X. Shen, X. Tao, H. Gao, C. Zhou, and J. Jia. Deep automatic portrait matting. In Proceedings of the European Conference on Computer Vision, 2016.
[30] J. Sun, J. Jia, C.-K. Tang, and H.-Y. Shum. Poisson matting. In ACM Transactions on Graphics, volume 23, pages 315-321, 2004.
[31] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2015.
[32] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the Inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818-2826, 2016.
[33] M. Treml, J. Arjona-Medina, T. Unterthiner, R. Durgesh, F. Friedmann, P. Schuberth, A. Mayr, M. Heusel, M. Hofmarcher, M. Widrich, B. Nessler, and S. Hochreiter. Speeding up semantic segmentation for autonomous driving. In Advances in Neural Information Processing Systems, 2016.
[34] J. Wang and M. F. Cohen. Optimized color sampling for robust matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2007.
[35] M. Wang, B. Liu, and H. Foroosh. Factorized convolutional neural networks. In ICCV Workshops, pages 545-553, 2017.
[36] N. Xu, B. L. Price, S. Cohen, and T. S. Huang. Deep image matting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017.
[37] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the International Conference on Computer Vision, 2015.
[38] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr. Conditional random fields as recurrent neural networks. In Proceedings of the International Conference on Computer Vision, 2015.
[39] B. Zhu, Y. Chen, J. Wang, S. Liu, B. Zhang, and M. Tang. Fast deep matting for portrait animation on mobile phone. In Proceedings of the ACM Multimedia Conference, 2017.
Supplemental Materials

A. Quantization
We used tensorflow.contrib.quantize to quantize our model. A custom implementation of the resize_bilinear operation, optimized using SIMD instructions, was deployed. Since we are using fake quantization [17] for quantization-aware training, an additional fake quantization node was inserted after a resize_bilinear operation.

The quantized version of softmax provided by TensorFlow Lite is slow for our use case since it is optimized for classification tasks. Our formulation allows us to make the assumption that the output has only two channels. Quantizing the values to 8 bits means that there are only 65,536 valid logit pairs. Instead of explicitly computing the softmax, we precompute the values and substitute the calculation with a table lookup.
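A minimal sketch of the lookup-table idea follows: for a two-channel 8-bit output, the softmax of a logit pair reduces to a sigmoid of the dequantized difference, so all 256 × 256 = 65,536 results can be tabulated ahead of time. The scale and zero-point values below are illustrative assumptions.

```python
import numpy as np

def build_softmax_table(scale=0.1, zero_point=128):
    """Precompute softmax(foreground) for every pair of 8-bit quantized logits.

    For two classes, softmax([a, b])[0] == 1 / (1 + exp(b - a)), so a single
    256 x 256 table indexed by the two quantized logits replaces the softmax.
    The quantization scale and zero-point here are placeholder assumptions.
    """
    q = np.arange(256, dtype=np.float32)
    logits = (q - zero_point) * scale            # dequantize all 8-bit values
    diff = logits[None, :] - logits[:, None]     # diff[a, b] = logit_b - logit_a
    return 1.0 / (1.0 + np.exp(diff))            # table[a, b] = P(foreground)

# Usage: probability = table[quantized_foreground_logit, quantized_background_logit]
table = build_softmax_table()
```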
B. Latency
Table 5 lists the latency of different models measured on Pixel 1, Xiaomi Mi 5, and iPhone 8. All measurements are performed with the TensorFlow Lite [2] benchmark tool on a mobile device while restricting the models to use a single thread. The mean and the standard deviation obtained from 100 runs are included in the table. The measurements were separated apart in time to give the device enough time to cool down. A demo video is available at https://github.com/hyperconnect/MMNet.

Method       | Pixel 1 | Mi 5 | iPhone 8
MD16-0.75    |         |      |
MMNet-1.00   |         |      |
MD8-0.75 ∗   |         |      |
MMNet-0.75   |         |      |
MD16-0.50    |         |      |
MD8-0.50 ∗   |         |      |
MMNet-0.50   |         |      |
MMNet-1.40 ∗ |         |      |
MD16-1.00 ∗  |         |      |
MD8-0.35 ∗   |         |      |
MD16-0.75 ∗  |         |      |
MMNet-1.00 ∗ |         |      |
MD16-0.75    |         |      |
MMNet-1.00Q  |         |      |

Table 5. Latency of models on different mobile devices. All numbers are in milliseconds (mean ± standard deviation). Rows marked with ∗ display the result using the smaller inputs. The quantized model is included in the last row.

C. Detailed Architectures
Table 6 illustrates the number of channels used in each component of MMNet. The initial block outputs a 32-channel feature map, as described in the first row. The numbers in the encoder/enhancement columns represent the number of channels returned by the multi-branch 1 × 1 convolutions and the final output of the encoder/enhancement block after concatenation. For example, an encoder block takes its input channels, expands them in each branch with a 1 × 1 convolution, concatenates the branch outputs, and compresses the number of channels back down with a final 1 × 1 convolution. Whenever there is a skip connection, the output of a decoder block is concatenated with the output of a refinement block; their respective numbers of channels are delineated in the decoder rows. The final block returns a two-channel output, one each for foreground and background.

Name          | First | Encoder/Enhancement | Decoder | Refinement | Final
Initial Block |       |                     |         |            |
Encoder 1     |       |                     |         |            |
Encoder 2     |       |                     |         |            |
Encoder 3     |       |                     |         |            |
Encoder 4     |       |                     |         |            |
Encoder 5     |       |                     |         |            |
Encoder 6     |       |                     |         |            |
Encoder 7     |       |                     |         |            |
Encoder 8     |       |                     |         |            |
Encoder 9     |       |                     |         |            |
Encoder 10    |       |                     |         |            |
Decoder 1     |       |                     | 64      | 64         |
Decoder 2     |       |                     | 40      | 40         |
Enhancement 1 |       |                     |         |            |
Enhancement 2 |       |                     |         |            |
Decoder 3     |       |                     |         |            |
Final Block   |       |                     |         |            | 2

Table 6. Output channels of the 1 × 1 convolutions in each component of MMNet.