Compact retail shelf segmentation for mobile deployment
Pratyush Kumar, Muktabh Mayank Srivastava
ParallelDots, Inc.
{pratyush,muktabh}@paralleldots.com

Abstract.
The recent surge of automation in the retail industry has rapidly increased the demand for applying deep learning models on mobile devices. To make deep learning models run in real time on-device, a compact, efficient network becomes inevitable. In this paper, we work on one such common problem in the retail industry: shelf segmentation. Shelf segmentation can be interpreted as a pixel-wise classification problem, i.e., each pixel is classified as to whether it belongs to a visible shelf edge or not. The aim is not just to segment shelf edges, but also to deploy the model on mobile devices. As there is no standard solution for such a dense classification problem on mobile devices, we look at semantic segmentation architectures which can be deployed on the edge, and we modify low-footprint semantic segmentation architectures to perform shelf segmentation. In addressing this problem, we modify the well-known U-Net architecture in certain aspects to make it fit for on-device use, without a significant drop in accuracy and with 15X fewer parameters. We propose the Light Weight Segmentation Network (LWSNet), a small, compact model able to run fast on devices with limited memory, which can be trained with a small amount (∼) of data.

Introduction

In the last few years, deep convolutional neural networks (CNNs) [1] have outperformed the state of the art in many visual recognition tasks. CNNs have been responsible for phenomenal advancements in tasks like object classification [2] and object localization [3], and continuous improvements to CNN architectures are bringing further progress [4,5]. In many visual tasks, especially in the retail industry, the desired output should include localization (or semantic segmentation), i.e., a class label is assigned to each pixel: pixel-wise classification. The development of recent deep CNNs has brought remarkable progress in semantic segmentation [6,7,8]. The effectiveness of these networks largely depends on sophisticated model design regarding depth and width, which involves many operations and parameters. CNN-based semantic segmentation has been widely used in different applications with great accuracy.
Fig. 1. A planogram (left side) visual representation. Retail execution (right side) showing products arranged systematically, ensuring planogram compliance.

Recent interest in the automation of the retail and consumer industries has created a strong demand for deploying shelf segmentation models on mobile devices. A few such applications are: 1) determining whether a photo is being taken from the front or a corner while doing shelf eye-tracking studies. Eye-tracking devices are used to measure where a person is looking, also known as the point of gaze; these measurements are carried out by an eye tracker that records the position of the eyes and the movements they make. 2) Measuring retail execution or in-store execution. Retail execution aims to put the right products in the right place on the retail shelf with the right amount of space between them. 3) Measuring planogram compliance. A planogram is a visual representation that shows how and where specific retail products should be placed on retail shelves. Planogram compliance is part of retail execution and ensures stores follow the rules which capture optimal consumer attention. 4) Separating shelves to do an out-of-stock analysis. There are no standard architectures or real baselines for solving these problems on a mobile device, so we begin with standard deep architectures for dense pixel-wise classification on shelf images and adapt them for accuracy and deployment on mobile devices. Some architectures proposed for low-resource semantic segmentation are used as baselines. It is challenging to design a model with both a small memory footprint and high accuracy. All the current state-of-the-art approaches exploit deep learning architectures and are typically based on fully convolutional networks [9] with an encoder-decoder architecture. Deep neural networks achieve impressive performance; however, with millions of parameters the model size is huge, which makes their deployment on mobile very inefficient.
So, efficient neural networks are becoming ubiquitous in mobile applications, enabling entirely new on-device experiences. CNNs have shown great results for various semantic segmentation tasks in recent years. There has also been research into low-resource semantic segmentation architectures, though less so for dense pixel-wise classification. As there is no standard solution for shelf segmentation on phones, we use semantic segmentation approaches as a baseline for the task. U-Net [10] is a well-known architecture for dense classification. U-Net comprises an encoder and a decoder network, and the corresponding layers of the encoder and decoder are connected by skip connections, taken prior to a pooling operation and subsequent to a deconvolution operation, respectively. U-Net has shown impressive potential in segmenting images even with a scarce amount of labeled training data. The proposed LWSNet aims to overcome the limitations of mobile deployment: it is a small, compact model able to run fast on devices with limited memory.
Related Work

In this section, we introduce related work on small segmentation models. Different classes of deep learning based dense classification architectures have been proposed. Most models follow the fully convolutional network approach with an encoder-decoder type of architecture. In the encoder part of these networks, the feature extractors are powerful backbones such as ResNet [11] and ResNeXt [12]. The most famous encoder-decoder network is U-Net, which gives good accuracy with little training data.

Designing deep neural network architectures for the optimal trade-off between accuracy and efficiency has been an active research area in recent years. Small semantic segmentation models require making a good trade-off between accuracy and model parameters. SqueezeNet [13] extensively uses 1x1 convolutions with squeeze and expand modules, primarily focusing on reducing the number of parameters. More recent works that focus on real-time efficient segmentation are ERFNet [14], ENet [15], and ICNet [16]. However, most of them follow the design principles of image classification, which gives them poor segmentation accuracy.

Quantization [17,18,19] is another important complementary effort to improve network efficiency through reduced-precision arithmetic. Post-training quantization reduces model size while improving CPU and hardware-accelerator latency, with little degradation in model accuracy. There are many quantization techniques, e.g., dynamic range quantization, full integer quantization, and float16 quantization.
The main focus of our work is to use an encoder-decoder architecture with skip connections and efficient CNN modules while keeping the macro structure the same. We adopt the U-Net macro architecture and experiment with the convolution module, model compression, and the design space.
Fig. 2. The broad schema of the Inception-SE module.
The network architecture is symmetric, with an encoder that extracts spatial features from the image and a decoder that constructs the segmentation map from the encoded features. The encoder follows the typical formation of a convolutional network: a sequence of convolution operations followed by a max pooling operation. The decoder up-samples the feature map using a transposed convolution operation and reduces the feature channels, followed by a sequence of convolution operations. The most ingenious aspect of the U-Net architecture is the introduction of skip connections: in every layer, the output of the convolutional layer of the encoder is transferred to the decoder.
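As a minimal sketch of the skip-connection mechanism (NumPy only, with illustrative shapes of our own choosing; the real model uses learned transposed convolutions rather than nearest-neighbour upsampling), the decoder upsamples its feature map and concatenates it with the encoder feature map of matching spatial size along the channel axis:

```python
import numpy as np

def upsample_nearest(x):
    """Nearest-neighbour 2x upsampling of a (C, H, W) feature map."""
    return x.repeat(2, axis=1).repeat(2, axis=2)

# Hypothetical shapes: encoder output at some level and the decoder
# feature map coming up from the level below.
encoder_feat = np.random.rand(64, 32, 32)   # (channels, H, W)
decoder_feat = np.random.rand(64, 16, 16)   # half the spatial size

up = upsample_nearest(decoder_feat)                # (64, 32, 32)
skip = np.concatenate([encoder_feat, up], axis=0)  # (128, 32, 32)
print(skip.shape)
```

The concatenation doubles the channel count, which is why the decoder's subsequent convolutions must reduce channels again.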
Inception [5] modules are used in convolutional neural networks to allow for more efficient computation and deeper networks through dimensionality reduction with stacked 1x1 convolutions. The modules were designed to address computational expense, as well as overfitting, among other issues. The solution, in short, is to take multiple kernel filter sizes within the CNN and, rather than stacking them sequentially, order them to operate on the same level. An Inception layer is a combination of those layers (namely, a 1x1 convolutional layer, a 3x3 convolutional layer, and a 5x5 convolutional layer) with their output filter banks concatenated into a single output vector forming the input of the next stage.

Fig. 3. The U-Net architecture with the Inception-SE module.
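The channel bookkeeping of such a concatenation can be sketched in a few lines (the branch widths below are illustrative assumptions, not the paper's exact configuration):

```python
def inception_output_channels(branch_channels):
    """Parallel branches are concatenated along the channel axis,
    so the output width is simply the sum of the branch widths."""
    return sum(branch_channels)

# Hypothetical branch widths for the 1x1, 3x3, and 5x5 convolutions
# plus a pooling projection branch.
branches = {"1x1": 64, "3x3": 128, "5x5": 32, "pool_proj": 32}
print(inception_output_channels(branches.values()))  # 256
```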
The main goal of the SE [20] module is to make a good trade-off between improved performance and increased model complexity. The SE module is a small network that tries to detect a useful pattern in the global average of features, and then excites or suppresses those features in a way that helps classification. While the SE modules themselves add depth, they do so in an extremely computationally efficient manner and yield good returns even at the point at which extending the depth of the base architecture achieves diminishing returns. The gains are consistent across a range of network depths, suggesting that the improvements induced by SE modules may be complementary to those obtained by simply increasing the depth of the base architecture.
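A minimal NumPy sketch of squeeze-and-excitation, assuming random placeholder weights and a channel count and reduction ratio of our own choosing (the trained model learns these weights):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def squeeze_excite(x, w1, w2):
    """x: (C, H, W) feature map. Squeeze: global average pool to (C,).
    Excite: two small FC layers produce per-channel gates in (0, 1),
    which rescale the original channels."""
    s = x.mean(axis=(1, 2))            # squeeze -> (C,)
    z = np.maximum(w1 @ s, 0.0)        # reduction FC + ReLU -> (C//r,)
    gates = sigmoid(w2 @ z)            # expansion FC + sigmoid -> (C,)
    return x * gates[:, None, None]    # channel-wise rescaling

C, r = 16, 4                           # channels, reduction ratio (assumed)
rng = np.random.default_rng(0)
x = rng.random((C, 8, 8))
w1 = rng.standard_normal((C // r, C)) * 0.1
w2 = rng.standard_normal((C, C // r)) * 0.1
y = squeeze_excite(x, w1, w2)
print(y.shape)  # (16, 8, 8)
```

The extra cost is only two tiny matrix-vector products per module, which is why SE blocks fit a low-footprint design.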
The network architecture is illustrated in Fig. 3. It consists of a contracting path or encoder (left side) and an expansive path or decoder (right side). The encoder consists of the repeated application of an Inception layer, each followed by an SE layer and a 2x2 max pooling operation with stride 2 for downsampling. At each downsampling step we double the number of feature channels. Every step in the decoder path consists of an upsampling of the feature map followed by a 2x2 up-convolution that halves the number of feature channels, a concatenation with the corresponding cropped feature from the encoder path, and an Inception layer. At the final layer a 1x1 convolution is used to map each final feature vector to the desired number of classes.
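The channel doubling and halving described above can be traced with a few lines of arithmetic (the starting width of 64 and depth of 4 are assumptions for illustration; the paper's exact widths are given in Table 1):

```python
def channel_schedule(base=64, depth=4):
    """Encoder doubles channels at each downsampling step;
    the decoder halves them back on the way up."""
    encoder = [base * 2**i for i in range(depth + 1)]
    decoder = list(reversed(encoder[:-1]))
    return encoder, decoder

enc, dec = channel_schedule()
print(enc)  # [64, 128, 256, 512, 1024]
print(dec)  # [512, 256, 128, 64]
```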
Table 1. LWSNet architecture.

Input name/type  Output size
We compare the number of parameters for each layer in Table 1. In the contracting path, there are 5 Inception layers and 4 SE layers, each followed by a max pooling layer. In the expansion path, each step has a deconvolution layer whose output is concatenated with the corresponding feature map from the contracting path, followed by an Inception layer; there are 4 deconvolution layers and 4 Inception layers. At the end, a 1x1 convolutional layer maps to the final number of classes. As Table 2 shows, LWSNet uses around ∼1.8 million parameters, compared to ∼2 million and ∼22 million in ERFNet and ICNet respectively: a significant reduction. We used post-training quantization when converting the LWSNet model to TensorFlow Lite format. This further helped to reduce the model size by 3X, with a final model size of ∼.

Table 2.
Number of parameters for different architectures
Model        Number of parameters  Model size (MB)
LWSNet       1,818,838             22.0
ERFNet [14]  2,063,086             24.9
ICNet [16]   22,726,752            140.7
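The rough size arithmetic behind post-training quantization can be sketched as a back-of-the-envelope estimate (a weights-only lower bound, not the exact TFLite numbers; serialized models also carry graph structure and metadata, so real files are larger):

```python
def weights_size_mb(num_params, bytes_per_weight):
    """Approximate weight storage: parameter count x bytes per weight."""
    return num_params * bytes_per_weight / 1e6

params = 1_818_838                     # LWSNet parameter count (Table 2)
fp32 = weights_size_mb(params, 4)      # float32 weights
int8 = weights_size_mb(params, 1)      # 8-bit quantized weights
print(round(fp32, 1), round(int8, 1))  # 7.3 vs 1.8 MB for weights alone
print(round(fp32 / int8, 1))           # 4.0x reduction on weights
```

Quantizing weights from 32-bit floats to 8 bits gives up to a 4X reduction on the weight tensors themselves, which is consistent with the ∼3X overall file-size reduction observed once non-weight overhead is included.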
Dataset

We used our in-house retail dataset for shelf segmentation. The training set has 98 images of shelves and products in stores, all coming from one domain. We used 100 test images from 4 domains (test sets 1, 2, 3, and 4), each containing 25 images. The training set and test set 1 have images similar to the one shown in Fig. 4: shelf images taken up close, each containing one complete block of a shelf. In test set 2, shelf images are taken from a distance. In test set 3, there is more than one complete block of shelf, and test set 4 has both distant images and more than one complete block of shelf. The training set contains images of different scales and sizes, ranging from 400 pixels to 3000 pixels. We cropped high-resolution images into multiple windows so that the model can learn low-level information more efficiently. After LWSNet model prediction, we post-process the output to complete the segmented shelf edges, as shown in Fig. 4.
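The cropping step can be sketched as a simple sliding window (the window size and stride below are illustrative assumptions, not the paper's values):

```python
def crop_windows(height, width, win=512, stride=512):
    """Return top-left corners of fixed-size crops tiling an image.
    The last window along each axis is shifted back so it stays inside."""
    ys = list(range(0, max(height - win, 0) + 1, stride))
    xs = list(range(0, max(width - win, 0) + 1, stride))
    if height > win and ys[-1] != height - win:
        ys.append(height - win)
    if width > win and xs[-1] != width - win:
        xs.append(width - win)
    return [(y, x) for y in ys for x in xs]

print(len(crop_windows(3000, 2000)))  # 24 crops for a 3000x2000 image
print(crop_windows(400, 400))         # [(0, 0)]: small images stay whole
```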
Results

We evaluated the model on the 4 test sets of 25 images each, each set belonging to a different domain. We calculated the Jaccard index (IoU) of LWSNet and of two similar state-of-the-art on-device dense classification models, ERFNet and ICNet. Our main target was a good trade-off between the number of parameters and accuracy. On our in-house dataset, our approach gives improved accuracy with a large reduction in model size.

Fig. 4. Original image (left) and prediction through the model (right).
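The reported metric can be computed on binary shelf-edge masks as a minimal NumPy sketch:

```python
import numpy as np

def jaccard_index(pred, target):
    """IoU between two binary masks: |intersection| / |union|."""
    pred = pred.astype(bool)
    target = target.astype(bool)
    union = np.logical_or(pred, target).sum()
    if union == 0:
        return 1.0  # both masks empty: treat as perfect agreement
    return np.logical_and(pred, target).sum() / union

pred = np.array([[1, 1, 0], [0, 1, 0]])
target = np.array([[1, 0, 0], [0, 1, 1]])
print(jaccard_index(pred, target))  # 2 / 4 = 0.5
```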
Table 3. Accuracies of different models on multiple test sets.
Model        Test set 1  Test set 2  Test set 3  Test set 4
LWSNet
ICNet [16]   0.448       0.441       0.445       0.433
ERFNet [14]  0.439       0.432       0.429       0.424
Conclusion

LWSNet performs better in terms of accuracy and with fewer parameters: we beat state-of-the-art networks of a similar type on our in-house dataset. Shelf segmentation is a common problem in the retail industry, with broad scope for solving it with deep neural networks. In future work, more research will be required to solve it more efficiently, but an encoder-decoder architecture can indeed be taken as the baseline.
Acknowledgments

The authors would like to thank Srikrishna Varadarajan, Sonaal Kant, and Harshita Seth for their contributions in gathering datasets and helping with the post-processing stage.