Fine-grained Apparel Classification and Retrieval without rich annotations

Abstract

The ability to correctly classify and retrieve apparel images has a variety of applications important to e-commerce, online advertising, internet search, and the visual surveillance industry. In this work, we propose a robust framework for fine-grained apparel classification, in-shop retrieval and cross-domain retrieval that eliminates the need for rich annotations such as bounding boxes, human joints or clothing landmarks, and for training a bounding-box or key-landmark detector. Factors such as subtle appearance differences, variations in human pose, different shooting angles, apparel deformations, and self-occlusion add to the challenges of classifying and retrieving apparel items. Cross-domain retrieval is even harder because of the large variation between online shopping images, usually taken with ideal lighting and pose, from a positive angle and against a clean background, and street photos captured by users in complicated conditions with poor lighting and cluttered scenes. Our framework uses a compact bilinear CNN [11] with the Tensor Sketch algorithm to generate embeddings that capture local pairwise feature interactions in a translationally invariant manner. For apparel classification, we pass the feature embeddings through a softmax classifier, while the in-shop and cross-domain retrieval pipelines deploy three compact BCNNs and use a triplet-loss based optimization approach with a ranking loss, such that the squared Euclidean distance between embeddings measures the dissimilarity between the images. Unlike previous works that relied on bounding box, key clothing landmark or human joint detectors to assist the final deep classifier, the proposed framework can be trained directly on the provided category labels or on generated triplets for triplet loss optimization. Lastly, experimental results on the DeepFashion [25] fine-grained categorization, in-shop and consumer-to-shop retrieval datasets provide a comparative analysis with previous work in the domain.

Aniket Bhatnagar · Sanchit Aggarwal
Squadrun Solutions Private Limited

Keywords

Apparel classification · In-shop apparel retrieval · Cross-domain apparel retrieval · Compact bilinear CNN

Many methods have been proposed for apparel recognition [1, 17, 25, 38], clothes parsing [10, 13, 16, 30, 34, 36, 37, 39], apparel attribute detection and description [5, 6, 9, 18, 26, 31], and clothing item retrieval and recommendation [8, 12, 14, 15, 20, 21, 23, 24, 33, 35], owing to their tremendous impact on various industries.

Online retail stores alone have huge opportunities, ranging from a rich user discovery experience to quality control operations such as product identification, tagging, moderation, enrichment, and contextual advertisement. Consequently, algorithms for automatic categorization and retrieval of visually similar fashion products would have significant benefits.

However, clothes categorization and retrieval is still an open problem, especially because of the large number of fine-grained categories with very subtle visual differences in style, texture, and cut compared with other common object categories. Classifying apparel images is made even more challenging by factors such as appearance, variations in human pose, and different shooting angles. Another challenge is the susceptibility of apparel to deformation and self-occlusion. Moreover, apparel retrieval is often confronted with difficulties due to large variations between online shopping images and selfies. Usually, online shopping images have ideal lighting, pose and clean backgrounds and are captured from a positive angle, while street photos captured by users have poor lighting, cluttered scenes, and complex backgrounds.

The last few years have seen growing interest in fine-grained categorization [3, 22, 33, 41].
Since the discriminating parts of apparel categories tend to be subtle differences in shape, style and texture, categorizing a clothing item or identifying its attributes, such as sleeve length or pattern, can be posed as a fine-grained classification problem.

Fine-grained classification, compared with general-purpose visual categorization, focuses on the characteristic challenge of making subtle distinctions despite high intra-class variance due to factors such as pose, viewpoint or location of the object. A common approach to fine-grained classification is to first localize various parts of the object and then model the appearance conditioned on the detected locations.

Parts are often defined manually, and a part detector is learned as part of the overall pipeline. Branson et al. [3] approached fine-grained visual categorization of bird species by estimating the object's pose and computing features with deep convolutional nets applied to image patches that are located and normalized by the pose. Zhang et al. [41] leveraged deep convolutional features computed on bottom-up region proposals. These methods generally increase the cost of labeling, since annotating tags is easier than annotating bounding box coordinates for part detectors.

Recently, bilinear convolutional neural nets (BCNNs), an easier method for fine-grained classification that does not require bounding boxes or landmark locations, have been introduced. Lin et al. [22] proposed a framework that places a bilinear pooling layer just before the last fully connected layer, which helps achieve remarkable performance on fine-grained classification datasets.

Bilinear pooling collects second-order statistics of local features over the whole image and then performs a global pooling operation across each channel.
The second-order statistics capture pairwise correlations between the feature channels, and global pooling introduces invariance to deformations. However, the representational power of bilinear features comes at the cost of very high-dimensional feature maps. To reduce the model size and allow end-to-end optimization of the visual recognition system, Gao et al. [11] proposed compact bilinear CNNs using the Tensor Sketch or Random Maclaurin algorithm. Gao et al. used a kernelized representation to show that the bilinear descriptor compares each local descriptor in the first image with each local descriptor in the second image, and that the comparison operator is a second-order polynomial kernel. It follows that a compact version of bilinear pooling is possible using any low-dimensional approximation of the second-order polynomial kernel. Further, compact bilinear CNNs exhibit nearly equal, and at times better, performance compared with full bilinear CNNs.

Thus, in this work, we propose an efficient and reliable framework based on compact bilinear CNNs for fine-grained apparel classification, in-shop retrieval and cross-domain retrieval that eliminates the requirement for bounding boxes or key landmarks. To the best of our knowledge, this is the first attempt at fashion product (apparel item) classification and retrieval that uses neither bounding boxes nor key landmarks and employs a compact bilinear CNN to do so.
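The kernel view of bilinear pooling summarised above can be checked numerically. The following is a minimal NumPy sketch, not the authors' code: it assumes each image is represented by a matrix of local descriptors (one row per location), and the function name `bilinear_descriptor` and the random inputs are purely illustrative.

```python
import numpy as np

def bilinear_descriptor(X):
    """Full bilinear pooling: sum of outer products x_s x_s^T over all locations.

    X has shape (num_locations, channels); the result is flattened to channels**2.
    """
    return (X.T @ X).reshape(-1)

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))  # hypothetical image 1: 6 locations, 4 channels
Y = rng.standard_normal((5, 4))  # hypothetical image 2: 5 locations, 4 channels

# Inner product of the two bilinear descriptors ...
lhs = bilinear_descriptor(X) @ bilinear_descriptor(Y)
# ... equals the sum of squared inner products over all pairs of local
# descriptors, i.e. a second-order polynomial kernel applied pairwise.
rhs = np.sum((X @ Y.T) ** 2)
```

Because the comparison reduces to the kernel `<x, y>**2`, any low-dimensional feature map that approximates this kernel can stand in for the full `channels**2` descriptor, which is the idea behind the compact variant.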

We now take a closer look at the more recent work on apparel classification and retrieval [1, 9, 12, 14, 17, 20, 23–25, 33]. These methods are quite efficient, but they rely on strong requirements: manually labelled clothing landmarks [25], object detectors to predict bounding boxes [9, 12, 14, 20], human pose estimation [17, 24], or body part detectors [1, 23].

Liu et al. [25] proposed FashionNet, which simultaneously models local attribute-level, general category-level, and clothing image similarity-level representations, with a dependence on clothing attributes and landmarks. Apart from this, FashionNet also requires bounding box annotations around the clothing item, or around the human model wearing it, for learning classifiers. Obtaining these massive attribute annotations along with clothing landmarks for apparel items is a tedious and costly task. It is not always possible for online marketplaces that maintain huge catalogues of clothing items to create such hand-annotated datasets.

Dong et al. [9] construct a deep model capable of recognizing fine-grained clothing attributes in images in the wild using multi-task curriculum transfer learning. They collected a large clothing dataset, with meta-labels as attributes, from different online shopping websites. They learned a pre-processor to detect clothing images using Faster R-CNN, and then employed an object detector that was trained on PASCAL VOC2007 and fine-tuned on an assembled bounding-box-annotated clothing dataset consisting of 8,000 street/shop photos. Their model was then trained using the obtained bounding boxes and rich annotations.

Hadi Kiapour et al. [12] use AlexNet [19] activations from the fully connected layer FC6 to identify exactly matching clothes from the street to the shop domain. They collected street and shop photos and obtained bounding box annotations using the Mechanical Turk service. Huang et al. [14] proposed a dual attribute-aware ranking network, which jointly optimizes an attribute classification loss and an image triplet quantization loss for cross-domain image retrieval. They use a two-stream CNN to handle in-shop and street images respectively, with a dependence on bounding boxes generated using Faster R-CNN [28]. They curated 381,975 online-offline image pairs of different categories from customer review pages, then manually pruned the noisy labels, merged similar labels based on human perception using crowd-sourced annotators, and obtained fine-grained clothing attributes using image descriptors.

Bossard et al. [1] introduce a recognition and classification pipeline that includes building blocks such as upper body detectors, feature channels and a multi-class learner based on random forests. They crawled the dataset from the web, defined 15 clothing classes, and used a bounding box detector to label the images.
Liang et al. [20] developed an integrated system for clothes co-parsing using a multi-image graphical model. They constructed a clothes dataset consisting of 2098 high-resolution street fashion images, and asked annotators to clean the dataset and to extract semantic attribute labels from the text tags of the images for their joint label formulation. This dependency on rich annotations means that the above fine-grained apparel classification and retrieval approaches require huge time and cost to curate a dataset.

The main contribution of our work is a robust framework that can be used to categorize apparel images and identify similar clothing items for both in-shop and cross-domain image retrieval, without the overhead of training a network to identify bounding boxes around clothing items or landmark locations in an apparel image. For both image retrieval tasks (in-shop and cross-domain), we avoid the daunting task of manually labelling bounding boxes or landmark locations and yet attain robustness through the effective use of compact bilinear CNNs together with a triplet loss. Rather than curating a new fashion dataset, we show the performance of our framework on existing benchmark datasets for fashion apparel categorization and image retrieval. In Section III, we give an overview of our framework and describe the dataset and features used for the various models. Section IV describes the evaluation metrics, followed by results and analysis of experiments in Section IV-B. We conclude the paper in Section V.

Fig. 1

Overview of the categorisation pipeline. Images belonging to different categories (e.g. Blazers, Tees, Tanks, Cardigans) pass through VGGNet, a compact bilinear layer, a signed square root layer, an L2 normalisation layer and a fully connected softmax layer. We utilize compact BCNNs [11] with the Tensor Sketch algorithm to generate embeddings that capture local pairwise feature interactions, followed by a fully connected softmax layer. (Best viewed in color)

An overview of our frameworks for apparel categorization and retrieval pipelines is illustrated in Figure 1 and Figure 2 respectively. We aim to solve categorization and retrieval of apparel items using two different frameworks based on compact bilinear CNNs. This helps to solve the problem with better performance and without the overhead of training a network to identify bounding boxes around clothing items or landmark locations in a clothing image. The three problems that we tackle using the proposed framework are:

– Fashion apparel categorization: The goal here is to assign each apparel item a unique category among the fifty fine-grained yet mutually exclusive categories.
– Fashion apparel in-shop image retrieval: Given a clothing image, the aim here is to identify whether two apparel images belong to the same item or not. This can be helpful when customers encounter a shop image on a particular e-commerce website and want to know more details about that product or similar products on other e-commerce sites.
– Fashion apparel cross-domain (street-to-shop) image retrieval: Given a street image of a clothing item in an unconstrained domain, i.e. a consumer-clicked photograph, the target here is to match it with its shop counterparts, where images are taken in a constrained environment, i.e. by professional photographers under apt light and brightness levels. This can be useful for a person who wants to buy the same apparel as that seen in a picture of a friend or a celebrity.

Fig. 2

Overview of the in-shop and cross-domain retrieval pipeline. We use a triplet-based approach and deploy three compact BCNNs [11], each built on VGGNet conv5_3 features, to generate feature embeddings for the triplet (q, p, n), followed by a hinge loss layer to minimize the distance between similar images and maximize the distance between dissimilar images.

In doing so, we also highlight that compact BCNNs can be used to generate embeddings that capture the notion of visual similarity; we later use these embeddings to find the apparel items nearest to a query fashion image.

We use the same architecture for the compact BCNN as described in [11]. Further, we use the Tensor Sketch algorithm instead of the Random Maclaurin algorithm as it is faster and more memory efficient. We deploy a symmetric compact bilinear CNN that uses the VGG16 network [32] up to the last convolution layer with ReLU activation to obtain feature matrices. These feature matrices are followed by a compact bilinear layer, which computes Tensor Sketch projections of the embedding matrices and performs sum pooling over all regions in the resulting projection matrices. The feature vectors obtained from VGGNet are of size 1 × 512, and we set the output dimension of the compact bilinear layer (the Tensor Sketch projection dimension) to 8192, as it has been noted in [11] that 8192-dimensional Tensor Sketch features have the same performance as full-scale bilinear features. The resulting vector is passed through a signed square root step followed by L2 normalization, motivated by [27].

We used VGGNet instead of ResNet or Inception as the base model for the compact bilinear CNN pipeline because the primary purpose of a bilinear or compact bilinear CNN is to increase the representational power of the CNN without having to drastically increase the number of layers in the base network. Additionally, training a network like ResNet in place of the VGG-based compact BCNN would drastically increase the number of weights to learn and the time taken for end-to-end optimization of the model.

Finally, for categorization, the L2-normalized feature vector is passed to a fully connected softmax layer. The image retrieval model, for both in-shop and consumer-to-shop retrieval, uses a triplet-based approach with a ranking loss [33] to learn embeddings such that the squared Euclidean distance between embeddings measures the (dis)similarity between the images.
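The compact bilinear layer, signed square root and L2 normalization described above can be sketched in NumPy as follows. This is an illustrative sketch under stated assumptions, not the Caffe implementation used in this work: the Count Sketch hashes are stored as dense projection matrices for readability, the random matrix `X` merely stands in for VGG16 feature maps, and all function names are hypothetical.

```python
import numpy as np

def count_sketch_matrix(channels, out_dim, rng):
    """Random Count Sketch projection written as a dense (channels, out_dim) matrix."""
    h = rng.integers(0, out_dim, size=channels)   # random hash bucket per channel
    s = rng.choice([-1.0, 1.0], size=channels)    # random sign per channel
    P = np.zeros((channels, out_dim))
    P[np.arange(channels), h] = s
    return P

def compact_bilinear(X, P1, P2):
    """Tensor Sketch of every local descriptor, followed by sum pooling.

    X has shape (num_locations, channels). The circular convolution of the two
    Count Sketches is computed as an element-wise product in the FFT domain.
    """
    out_dim = P1.shape[1]
    f1 = np.fft.rfft(X @ P1, axis=1)
    f2 = np.fft.rfft(X @ P2, axis=1)
    per_location = np.fft.irfft(f1 * f2, n=out_dim, axis=1)
    return per_location.sum(axis=0)

def signed_sqrt_l2(v, eps=1e-12):
    """Signed square root followed by L2 normalization."""
    v = np.sign(v) * np.sqrt(np.abs(v))
    return v / (np.linalg.norm(v) + eps)

rng = np.random.default_rng(0)
channels, out_dim = 512, 8192                     # sizes quoted in the text
P1 = count_sketch_matrix(channels, out_dim, rng)
P2 = count_sketch_matrix(channels, out_dim, rng)

# Stand-in for non-negative (post-ReLU) features: 16 locations, 512 channels.
X = np.maximum(rng.standard_normal((16, channels)), 0.0)
embedding = signed_sqrt_l2(compact_bilinear(X, P1, P2))
```

The two projection matrices are drawn once and kept fixed, so that embeddings of different images are comparable; an efficient implementation would apply the hashes directly rather than multiply by mostly-zero matrices.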

The compact bilinear CNN based pipeline for each use case was implemented in Caffe and trained on an NVIDIA K80 GPU.

3.1 Pipeline and Model Training

We used the compact bilinear CNN with the weights of the layers before the compact bilinear layer initialized from ImageNet [7] pre-trained VGG weights. Similar to [5], we trained the compact BCNN using a two-step procedure: we initially train only the last fully connected layer at a large learning rate of 1.0 and a small weight decay constant of 5 × − , and then fine-tune the entire model for several iterations at a small learning rate of 0.001 and a larger weight decay constant of 5 × − . Both training steps were carried out using a momentum-based SGD optimizer with momentum 0.9. For image preprocessing, we resize all images to 512 × 512 and crop each image from the center to 448 × 448. We then subtract the ImageNet mean from each color channel of each image.

The image retrieval model was trained using triplets of images provided as input to a triplet of compact bilinear CNN models with shared weights. Each triplet of the form <q, p, n> is pushed to the respective subnets, where q, p, and n represent the query image, a matching image and a dissimilar image respectively. We take the output of the L2 normalization layer of each C-BCNN as the embedding of each image in the triplet. The three subnets share the same weights throughout the training process.

These embeddings are then fed to a hinge loss function that optimizes the network to differentiate between similar and dissimilar images. We use the following loss function:

$$L = \max\left(0,\; g + D(p_{vec}, q_{vec}) - D(q_{vec}, n_{vec})\right) \quad (1)$$

where g is the margin and D(x, y) represents the squared Euclidean distance between two embedding vectors. Note that, because the embedding vectors are L2 normalized, the squared Euclidean distance is equal to twice the cosine distance between the embedding vectors. The number of output dimensions of the compact bilinear layer (the dimension of the Tensor Sketch projections) remains 8192, so each image embedding has length 8192.
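As an illustration of Equation (1), the loss for a single triplet can be written in a few lines of NumPy. This is a sketch with random stand-in embeddings rather than the training code; the function names are hypothetical, and the margin default of 1.0 is an assumption made for the example.

```python
import numpy as np

def l2_normalize(v, eps=1e-12):
    """Scale a vector (or each row of a matrix) to unit L2 norm."""
    return v / (np.linalg.norm(v, axis=-1, keepdims=True) + eps)

def sq_euclidean(a, b):
    """Squared Euclidean distance D(a, b)."""
    return np.sum((a - b) ** 2, axis=-1)

def triplet_hinge_loss(q, p, n, g=1.0):
    """Equation (1): max(0, g + D(p, q) - D(q, n))."""
    return np.maximum(0.0, g + sq_euclidean(p, q) - sq_euclidean(q, n))

rng = np.random.default_rng(0)
# Random stand-ins for the 8192-dimensional query, positive and negative embeddings.
q, p, n = (l2_normalize(rng.standard_normal(8192)) for _ in range(3))

loss = triplet_hinge_loss(q, p, n)

# For unit-norm vectors, ||q - p||^2 = 2 - 2 * cos(q, p) = 2 * (cosine distance).
cosine_distance = 1.0 - q @ p
```

The loss is zero once the negative is farther from the query than the positive by at least the margin g, so only triplets that violate the margin contribute gradient.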
The hinge loss optimization helps pull $q_{vec}$ and $p_{vec}$ closer while pushing the embedding vectors $q_{vec}$ and $n_{vec}$ farther apart.

We used the triplet loss paradigm to train the model for the concept of similarity between apparel item images, rather than a siamese network [4], because on each iteration the triplet loss trains the model to minimise the distance between similar inputs and maximise the distance between dissimilar inputs simultaneously, whereas a siamese network looks at only a pair of images on a given iteration. Thus, models trained with a triplet loss should be better able to learn the context of "why" item "q" is closer to item "p" and not to item "n".

For this network, the same image pre-processing was applied as for the fine-grained categorization model. The model weights for each of the subnets were initialized from the weights of the fine-tuned C-BCNN used for categorization. The complete network was then trained at a small learning rate of 0.001, with a weight decay constant of 5 × − , keeping g = 1 in the triplet loss function. Here too, a momentum-based SGD optimizer is used, with momentum 0.9.

In fine-grained category classification, the compact bilinear CNN outperforms FashionNet [25], DARN [14] and WTBI [5] without using either bounding box annotations or clothing landmarks. It thus provides a performance boost with a much less expensive labeling technique. Table 1 gives a quantitative comparison of our apparel categorization framework, while Figure 3 displays qualitative results of the pipeline.

This performance, even in the absence of a bounding box or landmark detector, can be attributed to the Tensor Sketch projections, which can represent pairwise correlations between the feature channels in a concise manner.
This helps the model focus on the features that uniquely identify the fine-grained category of the object. For example, visualisations from the Conv5-3 layer of the categorisation pipeline (Figure 4), obtained using the concept of class activation mapping [29], show that the compact bilinear layer infuses the principle of attention into the model without the explicit use of an attention layer, helping it centre attention on key landmarks without having to be trained to do so.

Table 1

Top-3 and Top-5 accuracy for category classification. The proposed method is significantly better than other state-of-the-art techniques. We achieved these results without depending on part-based models or the rich annotations required for labelling datasets.

Method                      Top-3    Top-5
WTBI [5]                    43.73    66.26
DARN [14]                   59.48    79.58
FashionNet [25]             82.58    90.17
Categorisation Framework

Table 2

Top-20 accuracy for in-shop retrieval. We achieved these results without depending on part-based models or the rich annotations required for labelling datasets.

Method                         Top-20
WTBI [5]                       50.6
DARN [14]                      67.5
FashionNet [25]                76.4
In-Shop Retrieval Framework    76.26

In-shop image retrieval is difficult because the same clothing item must be detected across different poses and arrangements. Still, our framework based on a compact BCNN trained using the triplet approach with hinge loss is able to gauge the (dis)similarity between apparel items without any additional requirement of finding bounding boxes, human joints [40], poselets [2] or clothing landmarks [25]. In fact, our framework is very close to the state-of-the-art FashionNet [25], which is trained using fashion landmarks and clothing attribute information. The compact bilinear CNN trained without any bounding box annotation achieves a top-20 accuracy of 76.26, while FashionNet achieves an overall accuracy of 76.4. Table 2 gives a quantitative comparison of our in-shop retrieval framework, while Figure 5 displays qualitative results of the pipeline.
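To make the top-20 retrieval accuracy concrete, the sketch below ranks gallery embeddings by squared Euclidean distance to each query and counts a query as a hit if any of its k nearest gallery images shares its item id. This hit definition is an assumption made for illustration (the exact evaluation protocol is not spelled out here), and the function name, argument names and synthetic data are hypothetical.

```python
import numpy as np

def top_k_accuracy(query_emb, query_ids, gallery_emb, gallery_ids, k=20):
    """Fraction of queries whose k nearest gallery items include the same item id."""
    # Pairwise squared Euclidean distances via ||a||^2 + ||b||^2 - 2 a.b
    d = (np.sum(query_emb ** 2, axis=1)[:, None]
         + np.sum(gallery_emb ** 2, axis=1)[None, :]
         - 2.0 * query_emb @ gallery_emb.T)
    k = min(k, gallery_emb.shape[0])
    nearest = np.argsort(d, axis=1)[:, :k]            # indices of the k closest items
    hits = (gallery_ids[nearest] == query_ids[:, None]).any(axis=1)
    return hits.mean()

# Synthetic example: 100 gallery items and 10 queries that are noisy copies
# of the first 10 gallery embeddings.
rng = np.random.default_rng(0)
gallery_emb = rng.standard_normal((100, 64))
gallery_ids = np.arange(100)
query_emb = gallery_emb[:10] + 0.01 * rng.standard_normal((10, 64))
query_ids = np.arange(10)

accuracy = top_k_accuracy(query_emb, query_ids, gallery_emb, gallery_ids, k=20)
```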

Cross-domain image retrieval is even more difficult than in-shop image retrieval because of the high variance between the images of the two domains, introduced by factors such as bad lighting, a smaller visible proportion of the apparel item, and clutter from surrounding items and background. Our framework outperforms WTBI and DARN, achieving a top-20 retrieval accuracy of 17.19, and is comparable to FashionNet [25], which is trained using bounding boxes and fashion landmarks jointly with clothing attributes. Table 3 gives a quantitative comparison of our cross-domain retrieval framework, while Figure 6 displays qualitative results of the pipeline.

Fig. 3

Results using our categorisation pipeline. Rows 1-11 are Blazers, Blouses, Cardigans, Dresses, Jackets, Jeans, Leggings, Rompers, Sweatpants, Tanks and Tees respectively. Columns (a-h) are correctly classified images, while columns (i-j) are incorrectly categorised images. (Best viewed in color)

Fig. 4

Columns (a, c and e) show actual images, while columns (b, d and f) depict the Conv5-3 layer visualization of the same images using Grad-CAM [29] from the proposed apparel categorization pipeline. (Best viewed in color)

Fig. 5

Query and Top-5 retrieved examples using the in-shop retrieval pipeline. Column (a) is the query image and columns (b-f) are the Top-5 retrieved images. (Best viewed in color)

Table 3

Top-20 accuracy for cross-domain retrieval. We achieved these results without depending on part-based models or the rich annotations required for labelling datasets.

Method                              Top-20 Retrieval Accuracy
WTBI [5]                            6.3
DARN [14]                           11.1
FashionNet [25]                     18.8
Cross-Domain Retrieval Framework    17.19

Fig. 6

Query image and Top-5 retrieved examples using our proposed cross-domain retrieval pipeline. Column (a) is the query image and columns (b-f) are the Top-5 retrieved images. (Best viewed in color)

In this paper, we proposed a framework for apparel categorization and for in-shop and cross-domain apparel retrieval. We have shown that compact bilinear CNNs can be effectively used for fine-grained clothing categorization and similar apparel retrieval. Our experimental results show that our framework outperforms or is comparable with state-of-the-art methods without depending on bounding boxes around the apparel item.

Further, it eliminates the requirement to train human joint, poselet or clothing landmark detectors to combat the high intra-class variance in clothing datasets caused by different poses, styles and non-rigid deformations of the apparel items, which has been a major prerequisite in previous fashion classification and retrieval research. As future work, it would be interesting to see whether the concept of attention can be infused with bilinear models to push the state of the art further in this domain.

References

1. Bossard, L., Dantone, M., Leistner, C., Wengert, C., Quack, T., Van Gool, L.: Apparel classification with style. In: Asian Conference on Computer Vision, pp. 321–335. Springer (2012)
2. Bourdev, L., Malik, J.: Poselets: Body part detectors trained using 3D human pose annotations. In: International Conference on Computer Vision, pp. 1365–1372. IEEE (2009)
3. Branson, S., Van Horn, G., Belongie, S., Perona, P.: Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952 (2014)
4. Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., Shah, R.: Signature verification using a "siamese" time delay neural network. In: Advances in Neural Information Processing Systems, pp. 737–744 (1994)
5. Chen, H., Gallagher, A., Girod, B.: Describing clothing by semantic attributes. In: European Conference on Computer Vision, pp. 609–623 (2012)
6. Chen, Q., Huang, J., Feris, R., Brown, L.M., Dong, J., Yan, S.: Deep domain adaptation for describing people based on fine-grained clothing attributes. In: Computer Vision and Pattern Recognition, pp. 5315–5324 (2015)
7. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: Computer Vision and Pattern Recognition, pp. 248–255. IEEE (2009)
8. Di, W., Wah, C., Bhardwaj, A., Piramuthu, R., Sundaresan, N.: Style finder: Fine-grained clothing style detection and retrieval. In: Computer Vision and Pattern Recognition Workshops, pp. 8–13 (2013)
9. Dong, Q., Gong, S., Zhu, X.: Multi-task curriculum transfer deep learning of clothing attributes. In: Applications of Computer Vision (WACV), pp. 520–529. IEEE (2017)
10. Gallagher, A.C., Chen, T.: Clothing cosegmentation for recognizing people. In: Computer Vision and Pattern Recognition, pp. 1–8. IEEE (2008)
11. Gao, Y., Beijbom, O., Zhang, N., Darrell, T.: Compact bilinear pooling. In: Computer Vision and Pattern Recognition, pp. 317–326 (2016)
12. Hadi Kiapour, M., Han, X., Lazebnik, S., Berg, A.C., Berg, T.L.: Where to buy it: Matching street clothing photos in online shops. In: International Conference on Computer Vision, pp. 3343–3351 (2015)
13. Hasan, B., Hogg, D.C.: Segmentation using deformable spatial priors with application to clothing. In: BMVC, pp. 1–11 (2010)
14. Huang, J., Feris, R.S., Chen, Q., Yan, S.: Cross-domain image retrieval with a dual attribute-aware ranking network. In: International Conference on Computer Vision, pp. 1062–1070 (2015)
15. Jagadeesh, V., Piramuthu, R., Bhardwaj, A., Di, W., Sundaresan, N.: Large scale visual recommendations from street fashion images. In: SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 1925–1934. ACM (2014)
16. Jammalamadaka, N., Minocha, A., Singh, D., Jawahar, C.: Parsing clothes in unrestricted images. In: BMVC, vol. 1, p. 2 (2013)
17. Kalantidis, Y., Kennedy, L., Li, L.J.: Getting the look: clothing recognition and segmentation for automatic product suggestions in everyday photos. In: International Conference on Multimedia Retrieval, pp. 105–112. ACM (2013)
18. Kiapour, M.H., Yamaguchi, K., Berg, A.C., Berg, T.L.: Hipster wars: Discovering elements of fashion styles. In: European Conference on Computer Vision, pp. 472–488. Springer (2014)
19. Krizhevsky, A., Sutskever, I., Hinton, G.E.: ImageNet classification with deep convolutional neural networks. In: Advances in Neural Information Processing Systems, pp. 1097–1105 (2012)
20. Liang, X., Lin, L., Yang, W., Luo, P., Huang, J., Yan, S.: Clothes co-parsing via joint image segmentation and labeling with application to clothing retrieval. IEEE Transactions on Multimedia 18