Item Popularity Prediction in E-commerce Using Image Quality Feature Vectors
Stephen Zakrewsky
Drexel University
Kamelia Aryafar
Ali Shokoufandeh
Drexel University
Abstract—Online retail is a visual experience: shoppers often use images as first-order information to decide if an item matches their personal style. Image characteristics such as color, simplicity, scene composition, texture, style, aesthetics and overall quality play a crucial role in making a purchase decision, clicking on or liking a product listing. In this paper we use a set of image features that indicate quality to predict product listing popularity on a major e-commerce website, Etsy. We first define listing popularity through search clicks, favoriting and purchase activity. Next, we infer listing quality from the pixel-level information of listed images as quality features. We then compare our findings to text-only models for popularity prediction. Our initial results indicate that a combined image and text modeling of product listings outperforms text-only models in popularity prediction.

I. INTRODUCTION
The informative presentation of product listings through text and images is the foundation of modern e-commerce. Shoppers often have a specific style or visual preference for many of the available items such as jewelry, clothing, home decor, etc. Images provide the first-order information for product listings. Users often use images in combination with other data modalities, such as textual description, price and ratings, to decide if an item is a suitable match for what they need and have in mind. The selection of proper high-quality images is then an important step in listing a successful product. In this paper we examine the role of image quality in listing popularity on a major e-commerce website, Etsy.

Etsy is an online marketplace for artisans selling unique handcrafted goods and vintage wares that could not be found elsewhere. Etsy caters to the long tail of online retail [1], [2]. With more than one million sellers, millions of unique product listings and nearly a hundred million images, Etsy is uniquely positioned to answer some interesting questions about the role of images as a rich visual experience in e-commerce settings. Each Etsy listing is composed of text information, such as title, tags, item description, shop and seller name, and complementary images. For a product listing to stand out, high-quality images describing the content of the product listing are a necessity [3], [4]. Figure 1 illustrates some Etsy images with different scene composition, lighting and image quality as featured on the website.

Fig. 1: Sample Etsy listing images are shown with different lighting, scene composition, and quality.

Early work in the literature has defined image popularity as quality [5] or aesthetics [6] and used data from photography-rating websites, where users who have an interest in photography upload their photos and rate others'. Popularity has also been defined as memorability [7], and interestingness [8], [9]. More recent work has directly tackled popularity.
In [10], popularity is defined as the number of views on Flickr, and [2] uses favorited listings on Etsy. In this paper we introduce a mechanism for product listing popularity prediction from the images representing those listings. We then explore the correlation between image quality and user interaction with what is for sale. Because sales are rare in comparison to the number of items available on a large site such as Etsy, we look into a combination of mechanisms for interaction, including the number of favorites, purchases and clicks on items, to define item popularity. Favorites indicate an interest in an item and are similar to liking mechanisms on other websites, such as thumbs-up on Facebook.

Popularity tends to be predicted using typical classifiers such as SVMs or regression [6], [10], [11], [12]. Datta et al. [6] use a two-class SVM classifier with a forward selection algorithm to find suitable feature vectors indicating popularity. By using elastic net to rank feature relevance to aesthetics, and a best-first algorithm to find feature sets that minimize the RMSE cross-validation error, [12] are able to achieve a 30.1% improvement compared to [11]. A few works have explored other machine learning techniques. In [5] a naive Bayes classifier is used, and Aryafar et al. [2] studied the significance of color in favorited listings on Etsy using logistic regression, perceptron, passive-aggressive and margin-infused relaxed algorithms.

The features used in popularity prediction model the same qualities professional photographers use, such as light, color, rule of thirds, texture, smoothness, blurriness, depth of field, and scene composition [5], [6], [11], [12]. Most of these features are unsupervised, but some, such as the spatial edge distribution and color distribution features of [5], require all of the labeled training data. Some recent work has looked at semantic object features.
[10] used the popular ImageNet CNN to detect the presence of 1000 different object categories in the image. The presence or absence of these categories is used as the feature. In this paper we propose a combination of simplicity, blur, depth of field, rule of thirds and texture features as the image quality representation. We also combine the image representation with text features as a multimodal embedding of items for sale. State-of-the-art studies have often shown that multimodal embeddings of items can outperform single-modality representations for multiple prediction, ranking and classification problems [13], [14], [15].

The remainder of the paper is organized as follows: Section II describes the image quality feature vectors. We examine the performance of image quality features in predicting listing popularity in Section III. Finally, we conclude this paper in Section IV and propose future research directions.

II. FEATURES
The quality features extracted from images are composed of a set of hand-crafted features including simplicity, blur, depth of field, rule of thirds, experimental and texture features. In this section, we explain the details of each subset of features. The implementation of these features is made publicly available. The final image quality feature vector is a concatenation of these features. Table I shows the dimensionality of each feature. The dimensionality of the final quality feature vector (image representation) is the sum of all these features.

A. Simplicity
High-quality photos are typically simpler than others. They often have one subject placed deliberately in the frame. Sometimes the background is out of focus to emphasize the subject. Poor-quality photographs tend to have cluttered backgrounds, and it may be difficult to distinguish the subject of the scene. We used the four measures of simplicity from [5]: spatial edge distribution, hue count, contrast and lightness, and blur.
1) Spatial Edge Distribution: Spatial edge distribution measures how spread out sharp edges are in the image. A single subject is expected to have a small distribution, while an image with a cluttered background would have a large distribution. Edges are detected by applying a 3 × 3 Laplacian filter and taking the absolute value. The filter is applied to each RGB channel independently, and the final image is computed as the mean across all three channels. The Laplacian image is resized to 100 × 100 and normalized to sum to 1. Then, the edges are projected onto the x and y axes independently. Let w_x and w_y be the widths of the areas holding the bulk (98% in [5]) of the projected edge energy along each axis. The image quality feature f = 1 − w_x w_y is the percentage of area outside the majority of the edges. Figure 2 shows the edges detected from two different images and their respective feature values.
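This feature can be illustrated with a numpy-only sketch. It approximates the procedure from [5]; the edge-padded convolution, the 98% mass interval, and the omission of the fixed resize step are simplifying assumptions for brevity.

```python
import numpy as np

def projected_width(proj, frac=0.98):
    """Width (as a fraction of the axis) of the interval holding
    `frac` of the projected edge energy."""
    c = np.cumsum(proj)
    lo = np.searchsorted(c, (1.0 - frac) / 2.0)
    hi = np.searchsorted(c, 1.0 - (1.0 - frac) / 2.0)
    return (hi - lo + 1) / len(proj)

def spatial_edge_distribution(img):
    """f = 1 - w_x * w_y for an HxWx3 RGB image with values in [0, 1]."""
    k = np.array([[0., 1., 0.], [1., -4., 1.], [0., 1., 0.]])  # 3x3 Laplacian
    h, w = img.shape[:2]
    pad = np.pad(img, ((1, 1), (1, 1), (0, 0)), mode='edge')
    lap = np.zeros((h, w))
    for c in range(3):                        # per-channel |Laplacian|
        acc = np.zeros((h, w))
        for dy in range(3):
            for dx in range(3):
                acc += k[dy, dx] * pad[dy:dy + h, dx:dx + w, c]
        lap += np.abs(acc)
    lap /= 3.0                                # mean over the three channels
    lap /= lap.sum() + 1e-12                  # normalize to sum to 1
    w_x = projected_width(lap.sum(axis=0))    # projection onto the x axis
    w_y = projected_width(lap.sum(axis=1))    # projection onto the y axis
    return 1.0 - w_x * w_y
```

A single centered subject yields a value near 1, while a cluttered scene with edges everywhere yields a value near 0.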
2) Hue Count: Professional photographs look more colorful and vibrant, but actually tend to have fewer distinct hues, because cluttered scenes contain many heterogeneous objects. We use a hue count feature by filtering an image in the HSV color space such that V is in the range [0.15, 0.95] and S is greater than 0.2. A 20-bin histogram is computed on the remaining H values. Let m be the maximum value of the histogram and let N = { i | H(i) > αm } be the set of bins with values greater than αm. The quality feature f = 20 − ||N|| is small when there are many different hues and grows larger as the number of distinct hues in the image goes down. We used α = 0.05 as shown in the literature [5].
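The hue-count computation is short enough to sketch directly. This is an illustrative numpy version, assuming H, S and V arrays already scaled to [0, 1]; the thresholds follow the values cited above from [5].

```python
import numpy as np

def hue_count_feature(h, s, v, alpha=0.05):
    """f = 20 - ||N|| from a 20-bin hue histogram over 'good' pixels.
    h, s, v: same-shape arrays in [0, 1]."""
    good = (v >= 0.15) & (v <= 0.95) & (s > 0.2)   # bright, saturated pixels
    hist, _ = np.histogram(h[good], bins=20, range=(0.0, 1.0))
    m = hist.max()
    if m == 0:                                     # no usable pixels at all
        return 20.0
    n = int((hist > alpha * m).sum())              # bins with significant mass
    return 20.0 - n
```

An image dominated by a single hue scores near 19; a scene with hues spread across all bins scores near 0.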
3) Contrast and Lightness: Brightness is a well-known variable that professional photographers are trained to understand and adjust. We use the average brightness feature [5], [11] computed from the L channel of the Lab color space. Contrast is similar, and is the ratio of the maximum and minimum pixel intensities. We sum the RGB level histograms and normalize the result to sum to 1. We use the width of the center mass of the histogram [5].

We make our feature extraction pipeline for image quality features available at https://github.com/szakrewsky/quality-feature-extraction

Fig. 2: The Laplacian images used for computing spatial edge distribution for two images, shown with their respective feature values.

B. Blur
Blurry images are almost always considered to be of poor quality. We use the common blur features in the literature [5], [16]. In [5] blur is modeled as I_b = G_σ ∗ I, where I_b is the result of convolving a Gaussian filter with an image. The larger the σ, the more high frequencies are removed from the image. Assuming the frequency distribution of all I is approximately the same, the maximum frequency ||C|| can be estimated via C = { (u, v) | ||FFT(I_b)(u, v)|| > Θ }. The feature is f = ||C|| ∼ 1/σ, after normalizing by the image size.

In [16], blur estimation is done based on changes in the edge structures. The blur operation causes gradual edges to lose their sharpness. Assuming that most images have gradual edges that are sharp enough, the blur is measured as the ratio of gradual edges that have lost their sharpness.
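The FFT-based measure from [5] can be sketched in a few lines of numpy. The threshold Θ = 5 and the box-filter used to simulate blurring in the demo are assumptions for illustration, not values from the paper.

```python
import numpy as np

def blur_feature(gray, theta=5.0):
    """f = ||C|| / (image size): fraction of frequency coefficients whose
    magnitude exceeds theta. Smaller values indicate stronger blur."""
    F = np.fft.fft2(gray)
    return float((np.abs(F) > theta).sum()) / gray.size

def box_blur(gray, r=2):
    """Simple (2r+1)x(2r+1) mean filter, used only to demo the feature."""
    pad = np.pad(gray, r, mode='edge')
    out = np.zeros_like(gray, dtype=float)
    k = 2 * r + 1
    for dy in range(k):
        for dx in range(k):
            out += pad[dy:dy + gray.shape[0], dx:dx + gray.shape[1]]
    return out / (k * k)
```

Blurring an image attenuates its high-frequency coefficients, so the feature value drops for the blurred copy.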
C. Rule of Thirds

The rule of thirds is an important composition technique. Thirds lines are the horizontal and vertical lines that divide an image into a 3 × 3 grid of equal-sized cells. The rule of thirds states that subjects placed along these lines are aesthetically more pleasing and more natural than subjects centered in the photograph. In order to segment the subject of the image from the background, we use the Spectral Residual saliency detection algorithm [17]. The feature is a 5 × 5 map where each cell is the average saliency value [18]. Let w_p be the saliency value of pixel p and A(W_i) the area of cell W_i; then the value of each cell is

    w_i = \frac{\sum_{p \in W_i} w_p}{A(W_i)}.   (1)

To compute the feature, the image is divided into a 5 × 5 grid with emphasis on the thirds lines; the horizontal and vertical regions centered on the thirds lines are 1/6 of the image size. Figure 3 shows the saliency detection with the 5 × 5 grid overlay, and the thirds map feature for an image.

Fig. 3: Example of the Rule of Thirds feature. Figure (b) shows the SR saliency detection, and (c) shows the thirds map feature.
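Given a saliency map from any detector (e.g. [17]) normalized to [0, 1], the 25-dimensional thirds map can be sketched as follows. The exact cell boundaries are an assumption chosen so that the bands around the thirds lines are 1/6 of the image, as described above.

```python
import numpy as np

def thirds_map(sal):
    """25-dim thirds map: average saliency in a 5x5 grid whose second and
    fourth rows/columns are 1/6-wide bands centered on the thirds lines."""
    def cuts(n):
        # Boundaries at 0, 1/4, 5/12, 7/12, 3/4, 1 of the axis, so the
        # bands around the thirds lines (1/3 and 2/3) are each 1/6 wide.
        return [int(round(n * b / 12.0)) for b in (0, 3, 5, 7, 9, 12)]
    ys, xs = cuts(sal.shape[0]), cuts(sal.shape[1])
    out = np.empty((5, 5))
    for i in range(5):
        for j in range(5):
            out[i, j] = sal[ys[i]:ys[i + 1], xs[j]:xs[j + 1]].mean()
    return out.ravel()
```

A salient subject sitting on a thirds intersection lights up the corresponding interior cell of the map.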
D. Texture

A smooth image may indicate blur or an out-of-focus shot, while a lack of smoothness may indicate poor film or too high an ISO setting. In contrast, texture in the scene is an important composition skill of a photographer. Smoothness may indicate the lack of texture. Texture and smoothness are some of the most statistically correlated features for quality/popularity [12], [10]. We use three smoothness/texture features from these works.

A three-level wavelet transform is applied to the L channel of the Lab color space. We only use the bottom level of the pyramid. The result is squared to indicate power. Let b ∈ {HH, HL, LH} index the bottom-level bands of the wavelet transform; the extracted feature is then

    f = \frac{1}{3MN} \sum_{m=1}^{M} \sum_{n=1}^{N} \sum_{b} w_b(m, n),   (2)

where w is the square of the wavelet value. Because the Laplacian is often used as a pyramid of different scales, another feature,

    f = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} l(m, n),   (3)

is also used. This time l is the second level from the bottom of a Laplacian pyramid.

Another texture feature is computed using local binary patterns (LBP); a pyramid of histograms is then computed as in [19]. Figure 4 shows the similarities of the LBP features and the three channels of the Daubechies db1 wavelet.
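Eq. (2) can be illustrated with a single-level Haar decomposition in plain numpy. This is a sketch: the text above uses a three-level transform on the Lab L channel, and the Haar filters here stand in for the Daubechies family.

```python
import numpy as np

def haar2d(x):
    """One level of a 2-D Haar wavelet transform; x needs even dimensions."""
    a = (x[0::2, :] + x[1::2, :]) / 2.0   # vertical average
    d = (x[0::2, :] - x[1::2, :]) / 2.0   # vertical difference
    ll = (a[:, 0::2] + a[:, 1::2]) / 2.0
    lh = (a[:, 0::2] - a[:, 1::2]) / 2.0
    hl = (d[:, 0::2] + d[:, 1::2]) / 2.0
    hh = (d[:, 0::2] - d[:, 1::2]) / 2.0
    return ll, lh, hl, hh

def wavelet_smoothness(gray):
    """Eq. (2): mean squared energy over the HH, HL and LH detail bands.
    Low values indicate a smooth (possibly blurred) image."""
    _, lh, hl, hh = haar2d(gray)
    return float((np.stack([lh, hl, hh]) ** 2).mean())
```

A gentle gradient produces near-zero detail energy, while a textured patch produces a much larger value.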
E. Depth of Field

Depth of field is the distance between the nearest and farthest objects that appear in sharp focus. A technique of professional photographers is to use a low depth of field to focus on the photographic subject while blurring the background. We used the feature of [6], the ratio of high-frequency detail in the center regions of the image compared to the entire image. Let w be the bottom level of a wavelet transform; the feature can be described as

    f = \frac{\sum_{(x, y) \in M_6 \cup M_7 \cup M_{10} \cup M_{11}} w(x, y)}{\sum_{i=1}^{16} \sum_{(x, y) \in M_i} w(x, y)},   (4)

where M_i, 1 ≤ i ≤ 16, are the cells of a 4 × 4 grid. The same feature is also reapplied using the Laplacian pyramid l instead of w [12]. These features only look at the center region of the image. A third feature [12] looks at the spatial distribution of high-frequency details. Let l be the bottom layer of a Laplacian pyramid and c_row, c_col the center of mass; the feature is obtained as

    f = \frac{1}{MN} \sum_{m=1}^{M} \sum_{n=1}^{N} l(m, n) \sqrt{(m - c_{row})^2 + (n - c_{col})^2}.   (5)

Figure 5 visualizes how these features are computed for a sample image.
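Eqs. (4) and (5) operate on any detail/energy map (e.g. squared wavelet or Laplacian coefficients), so they can be sketched independently of the transform. A numpy-only illustration, assuming an equal 4 × 4 partition:

```python
import numpy as np

def low_dof(detail):
    """Eq. (4): fraction of detail energy in the center four cells
    (M6, M7, M10, M11) of a 4x4 grid over the whole image."""
    h, w = detail.shape
    center = detail[h // 4: 3 * h // 4, w // 4: 3 * w // 4].sum()
    return float(center / (detail.sum() + 1e-12))

def detail_spread(detail):
    """Eq. (5): mean detail energy weighted by distance to its
    center of mass."""
    h, w = detail.shape
    m, n = np.mgrid[0:h, 0:w]
    total = detail.sum() + 1e-12
    c_row = (m * detail).sum() / total
    c_col = (n * detail).sum() / total
    dist = np.sqrt((m - c_row) ** 2 + (n - c_col) ** 2)
    return float((detail * dist).mean())
```

Detail energy concentrated at the center (a low-depth-of-field shot) drives Eq. (4) toward 1 and Eq. (5) toward 0; uniformly spread detail gives 4/16 = 0.25 for Eq. (4).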
F. Experimental

Maximally Stable Extremal Regions (MSER) [20] can be used to detect text, because characters are typically single solid colors with sharp edges that stand out from the background [21]. Additionally, texture patterns, like bricks on a wall, are also often detected by MSER. In this paper, we used the count of the number of MSER regions as the experimental feature. In the future, we would like to continue this experiment into other features based on text in images.

III. POPULARITY PREDICTION
We collect a set of images from Etsy through Etsy's API for popularity prediction. Our dataset consists of Etsy listing images. Each Etsy listing has at least one photo and can have up to five photos to show different angles and details. In our experiments we only extract the first (main) listing image, which shows up in search results and is featured as the main image on the listing page. We denote the number of favorites for each listing L with main image I as F(L_I), the number of purchases as P(L_I), and the number of clicks as C(L_I). We associate each listing image with its popularity score as

    Popularity(L_I) = F(L_I) + C(L_I) + P(L_I).

We extract the quality feature vectors as described in Section II for each listing image and denote them q(L_I) for listing L and image I. Table I shows the dimensionality of each feature that is used to build the quality feature vector.

Fig. 4: Smoothness and texture features are illustrated. Figure (b) shows the Local Binary Pattern (LBP) feature image, and (c) shows the three channels of the DB1 wavelet transform on the sample image.

TABLE I: Image quality feature dimensions are shown by feature.

    Feature                                    Dimension
    'Ke06-qa': spatial edge distribution       1
    'Ke06-qh': hue count                       1
    'Ke06-qf': blur                            1
    'Ke06-tong': blur (Tong et al.)            1
    'Ke06-qct': contrast                       1
    'Ke06-qb': brightness                      1
    '-mser count': mser count                  1
    'Mai11-thirds map': thirds map             25
    'Wang15-f1': avg lightness                 1
    'Wang15-f14': wavelet smoothness           1
    'Wang15-f18': laplacian smoothness         1
    'Wang15-f21': wavelet low dof              1
    'Wang15-f22': laplacian low dof            1
    'Wang15-f26': laplacian low dof swd        1
    'Khosla14-texture': texture                5120

Once the dataset has been tagged with these quality features, we extract textual information from the listing as t(L_I). These textual features consist of the tokenized listing title unigrams and bigrams and the tokenized listing tag unigrams, and serve as the single-modality listing representation. The multimodal feature vector representation MM(L_I) is obtained by concatenating quality and textual features as a single feature vector, i.e., MM(L_I) = ⟨q(L_I), t(L_I)⟩.

We then use a logistic regression against popularity scores Popularity(L_I) and report the accuracy lift using image and multimodal feature vectors relative to the baseline text-only model. Table II shows these results. We can observe that the quality features in combination with textual features can increase the prediction accuracy on the collected dataset.

TABLE II: Lift in accuracy rate (AUC) using a logistic regression, relative to the text-only baseline (%), on the sample dataset, in image-only and multimodal settings.

    Modality               Image      Image+Text (MM)
    Relative lift in AUC   +1.07%     + . %
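The pipeline above can be sketched end to end with synthetic stand-ins. Everything here is an illustrative assumption, not the paper's production setup: the data is random, the feature dimensions are arbitrary, and a plain-numpy logistic regression stands in for whatever library the authors used.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def fit_logreg(X, y, lr=0.5, iters=3000):
    """Minimal batch-gradient logistic regression (illustrative sketch)."""
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(iters):
        g = sigmoid(X @ w + b) - y          # dL/dz of the log loss
        w -= lr * (X.T @ g) / len(y)
        b -= lr * g.mean()
    return w, b

# Synthetic stand-ins for q(L_I), t(L_I) and the popularity labels.
rng = np.random.default_rng(0)
n = 500
q = rng.random((n, 4))                      # image-quality features q(L_I)
t = rng.random((n, 8))                      # text features t(L_I)
pop = q[:, 0] + t[:, 0]                     # stand-in Popularity(L_I) score
y = (pop > 1.0).astype(float)               # popular vs. not popular
mm = np.hstack([q, t])                      # MM(L_I) = <q(L_I), t(L_I)>
w, b = fit_logreg(mm, y)
acc = float(((sigmoid(mm @ w + b) > 0.5) == (y > 0.5)).mean())
```

In the paper the same classifier is fit separately on text-only, image-only and concatenated features, and the lift is reported relative to the text-only baseline.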
Fig. 5: Figure (b) shows the Low Depth of Field features in the center grid region for the Laplacian image. Figure (c) shows the same image with its center of mass.

IV. CONCLUSION

This work presents an initial study on understanding how image quality can impact the popularity of items in e-commerce settings, thereby providing better user understanding and a better overall shopping experience. To facilitate this understanding, this work proposed an empirical method to estimate the image quality features representing product listings on Etsy. These feature vectors were combined with traditional textual features to serve as the multimodal item representation. We compared the efficiency of single-modality (text-only and image-only) features to multimodal feature vectors in popularity prediction. Our initial results indicate that quality features in combination with text information can increase the prediction accuracy for a sample dataset.

REFERENCES

[1] C. Anderson, The Long Tail: Why the Future of Business Is Selling Less of More. Hyperion, 2006.
[2] K. Aryafar, C. Lynch, and J. Attenberg, "Exploring user behaviour on etsy through dominant colors." IEEE, 2014, pp. 1437–1442.
[3] Y. J. Wang, M. S. Minor, and J. Wei, "Aesthetics and the online shopping environment: Understanding consumer responses," Journal of Retailing, vol. 87, no. 1, pp. 46–58, 2011.
[4] P. Obrador, X. Anguera, R. de Oliveira, and N. Oliver, "The role of tags and image aesthetics in social image search," in Proceedings of the First SIGMM Workshop on Social Media. ACM, 2009, pp. 65–72.
[5] Y. Ke, X. Tang, and F. Jing, "The design of high-level features for photo quality assessment," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 1. IEEE, 2006, pp. 419–426.
[6] R. Datta, D. Joshi, J. Li, and J. Z. Wang, "Studying aesthetics in photographic images using a computational approach," in Computer Vision–ECCV 2006. Springer, 2006, pp. 288–301.
[7] P. Isola, J. Xiao, A. Torralba, and A. Oliva, "What makes an image memorable?" in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 145–152.
[8] S. Dhar, V. Ordonez, and T. L. Berg, "High level describable attributes for predicting aesthetics and interestingness," in Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on. IEEE, 2011, pp. 1657–1664.
[9] M. Gygli, H. Grabner, H. Riemenschneider, F. Nater, and L. Van Gool, "The interestingness of images," 2013.
[10] A. Khosla, A. Das Sarma, and R. Hamid, "What makes an image popular?" in Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014, pp. 867–876.
[11] M. Chen and J. Allebach, "Aesthetic quality inference for online fashion shopping," in IS&T/SPIE Electronic Imaging. International Society for Optics and Photonics, 2014, pp. 902703–902703.
[12] J. Wang and J. Allebach, "Automatic assessment of online fashion shopping photo aesthetic quality," in Image Processing (ICIP), 2015 IEEE International Conference on. IEEE, 2015, pp. 2915–2919.
[13] C. Lynch, K. Aryafar, and J. Attenberg, "Images don't lie: Transferring deep visual semantic features to large-scale multimodal learning to rank," arXiv preprint arXiv:1511.06746, 2015.
[14] J. Yu, Y. Rui, and D. Tao, "Click prediction for web image reranking using multimodal sparse coding," Image Processing, IEEE Transactions on, vol. 23, no. 5, pp. 2019–2032, 2014.
[15] J. Yu, D. Tao, M. Wang, and Y. Rui, "Learning to rank using user clicks and visual features for image retrieval," Cybernetics, IEEE Transactions on, vol. 45, no. 4, pp. 767–779, 2015.
[16] H. Tong, M. Li, H. Zhang, and C. Zhang, "Blur detection for digital images using wavelet transform," in Multimedia and Expo, 2004. ICME'04. 2004 IEEE International Conference on, vol. 1. IEEE, 2004, pp. 17–20.
[17] X. Hou and L. Zhang, "Saliency detection: A spectral residual approach," in Computer Vision and Pattern Recognition, 2007. CVPR'07. IEEE Conference on. IEEE, 2007, pp. 1–8.
[18] L. Mai, H. Le, Y. Niu, and F. Liu, "Rule of thirds detection from photograph," in Multimedia (ISM), 2011 IEEE International Symposium on. IEEE, 2011, pp. 91–96.
[19] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Computer Vision and Pattern Recognition, 2006 IEEE Computer Society Conference on, vol. 2. IEEE, 2006, pp. 2169–2178.
[20] J. Matas, O. Chum, M. Urban, and T. Pajdla, "Robust wide-baseline stereo from maximally stable extremal regions," Image and Vision Computing, vol. 22, no. 10, pp. 761–767, 2004.
[21] H. Chen, S. S. Tsai, G. Schroth, D. M. Chen, R. Grzeszczuk, and B. Girod, "Robust text detection in natural images with edge-enhanced maximally stable extremal regions," in Image Processing (ICIP), 2011 18th IEEE International Conference on. IEEE, 2011, pp. 2609–2612.