e-Commerce product classification: our participation at cDiscount 2015 challenge
Ioannis Partalas
Viseo R&D, France [email protected]
Georgios Balikas
University of Grenoble Alpes, France [email protected]
Abstract
This report describes our participation in the cDiscount 2015 challenge, where the goal was to classify product items in a predefined taxonomy of products. Our best submission yielded an accuracy score of 64.20% in the private part of the leaderboard and we were ranked 10th out of 175 participating teams. We followed a text classification approach employing mainly linear models. The final solution was a weighted voting system which combined a variety of trained models.
1 Introduction

In this report we present our participation in the cDiscount 2015 product classification challenge, which was organised on an online challenge platform. The organisers provided a large collection of product items, each described mainly by its textual description. We followed a text classification approach using mainly linear models (such as SVMs) as our base models. Our final solution consisted of a weighted voting system which combined a variety of base models. In Section 2 we provide a brief description of the cDiscount task and data. Section 3 describes our implementations and the results we obtained. Finally, Section 4 concludes with a discussion and the lessons learnt.
2 The cDiscount task and data

Item categorization is fundamental to many aspects of an item's life cycle for e-commerce sites, such as search, recommendation, catalog building etc. It can be formulated as a supervised classification problem where the categories are the target classes and the features are the words composing some textual description of the items. In the context of the cDiscount 2015 challenge the organisers provided the descriptions of e-commerce items and the goal was to develop a system that would perform automatic classification of those products. Table 1 presents some training instances. There were 15,786,886 product descriptions in the training set and 35,066 instances to be classified in the test set. The target categories (classes) were organised in a hierarchy that comprised 3 levels: the top level had 52 nodes, the middle 536 and the lowest level 5,789. The goal of the challenge was to predict the lowest-level category for each test instance. It is to be noted that most of the classes were represented in the training set with only a few examples and there were only a few classes with many examples. For instance, 40% of the training instances belong to the 10 most common classes and around 1,500 classes contain from 1 to 30 product items.

Categorie3 | Description | Libelle | Marque
1000015309 | De Collectif aux éditions SOLESMES | Benedictions de l eglise |
1000015309 | De Collectif aux éditions SOLESMES | Notice de st benoit lot de 10 |
1000010100 | or 750, poids : 3.45gr, diamants : 0.26carats | Bague or et diamants | AUCUNE
1000003407 | Champagne Brut - Champagne - Vendu à l'unité - 1 x 75cl | Mumm Brut | AUCUNE

Table 1: Part of the training data. We only present the fields "Description", "Libelle" and "Marque" that we used in our implementations. The "Categorie3" value was the class to be predicted.

3 Our implementation and results

We followed a text classification approach working with the textual information provided in the training data. In this context x ∈ R^d represents a document in a vector space and y ∈ Y = {1, . . . , K} its associated class label, where |Y| > 2. The large number of classes in the problem treated, as well as the scarcity of data for the minority classes, is a typical situation in large-scale systems. Similar challenges, where in contrast the available text per instance is much larger, include LSHTC (Partalas et al., 2015) and BioASQ (Balikas et al., 2014).
Data cleaning. Since we used only the textual information that was provided for each training/test instance by the organisers, our first task was to clean the data. The cleaning pipeline included: removal of non-ASCII characters, removal of non-printable characters, removal of HTML tags, accent removal, punctuation removal and lower-casing. We also split words that consisted of a text part and a numerical part into two distinct parts; for instance, "12cm" would become "12" and "cm". We did not perform stemming, lemmatization or stop-word removal, because the text spans were small and such operations would result in loss of information (Shen et al., 2012). Finally, we tokenized the remaining text into words using the white space as delimiter.
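For illustration, the following is a minimal sketch of such a cleaning pipeline in Python; the function name, regular expressions and the example string are our own and not the exact code we used.

import re
import string
import unicodedata

def clean_text(text):
    """Clean a raw product description and return a list of word tokens."""
    text = re.sub(r"<[^>]+>", " ", text)                      # drop HTML tags
    text = unicodedata.normalize("NFKD", text)                 # split accented characters
    text = text.encode("ascii", "ignore").decode("ascii")      # drop accents / non-ASCII
    text = "".join(c for c in text if c.isprintable())         # drop non-printable characters
    text = re.sub(r"(\d+)([a-zA-Z]+)", r"\1 \2", text)         # "12cm" -> "12 cm"
    text = re.sub(r"([a-zA-Z]+)(\d+)", r"\1 \2", text)
    table = str.maketrans(string.punctuation, " " * len(string.punctuation))
    text = text.translate(table).lower()                       # drop punctuation, lower-case
    return text.split()                                        # whitespace tokenization

print(clean_text("Bague <b>or 750</b>, poids : 3.45gr, diamants : 0.26carats"))
# ['bague', 'or', '750', 'poids', '3', '45', 'gr', 'diamants', '0', '26', 'carats']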
Vectorization.
Having cleaned the text, we generated the vectors using a one-hot-encoding approach. We experimented with binary one-hot-encoded vectors (i.e. whether a word exists or not), with term frequency vectors (how many times each word occurred in the instance) and with the tf-idf (term frequency, inverse document frequency) weighting scheme. In the early stages of the challenge we found that the latter performed best in our experiments and we used it exclusively. To calculate the tf-idf vectors we smoothed the idf and applied the sublinear tf scaling 1 + log(tf). Finally, each vector was normalized to a unit vector.

Classification of short documents can benefit from successful feature selection of n-grams. The size of the vocabulary of our cleaned dataset, combined with the large number of training instances, was prohibitively large for using all the n-grams with n = 1, 2, 3. On the other hand, selecting too many features may lead to over-fitting. As a result, selecting a representative part of those n-grams required careful tuning.

Apart from the "Description" and "Libelle" fields, which we concatenated, we also used the "Marque" field. We examined two ways of integrating this information in our pipeline: by concatenating its value with the already existing text of "Description" and "Libelle", and by generating binary flags for each of the values of the field seen in the training set. Either way benefited our models compared to feature generation using only "Description" and "Libelle".

After generating the vectors we applied the α-power normalization, where each vector x = (x_1, x_2, . . . , x_d) is transformed to x_power = (x_1^α, x_2^α, . . . , x_d^α). This normalisation has been used in computer vision tasks (Jegou et al., 2012). The main intuition is that it reduces the effect of the most common words. After the transformation the vector is normalized to the unit vector. We found that taking the square root of the feature values (α = 0.5) consistently benefited the performance.
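As an illustration, a minimal sketch of this vectorization and α-power step with scikit-learn follows; the toy corpus and the exact parameter values (e.g. the feature cap) are only indicative and do not correspond to a particular submission.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

docs = ["bague or et diamants or 750",
        "champagne brut mumm brut 75 cl",
        "notice de st benoit lot de 10"]

# Smoothed idf, sublinear tf (1 + log(tf)), unigrams + bigrams, keeping only the
# most frequent terms; rows come out L2-normalized.
vectorizer = TfidfVectorizer(sublinear_tf=True, smooth_idf=True,
                             ngram_range=(1, 2), max_features=200000)
X = vectorizer.fit_transform(docs)

# Alpha-power normalization: raise every component to alpha (0.5 worked best)
# and rescale each row back to unit L2 norm.
alpha = 0.5
X_power = normalize(X.power(alpha), norm="l2")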
Subsampling. As the training dataset was highly imbalanced, the learned models would be biased towards the big classes. For this reason, and also in order to speed up the vectorization process as well as the training of the systems, we randomly sampled the data by downsampling the majority classes. This procedure helped to improve the performance of all the single systems (around +2.5% for our best single system) and also reduced the training time of the base models. The size of the vocabularies for the several sub-samples ranged from around 1 million unigrams for around half of the data to 1.6 million unigrams for the full dataset.
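A rough sketch of such class-wise downsampling is given below; the cap of 5,000 examples per class is an arbitrary value chosen for illustration, not the threshold we actually used.

import random
from collections import defaultdict

def downsample(instances, labels, max_per_class=5000, seed=0):
    """Keep at most max_per_class randomly chosen examples for each class."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for x, y in zip(instances, labels):
        by_class[y].append(x)
    kept_x, kept_y = [], []
    for y, xs in by_class.items():
        if len(xs) > max_per_class:
            xs = rng.sample(xs, max_per_class)   # downsample the majority classes
        kept_x.extend(xs)
        kept_y.extend([y] * len(xs))
    return kept_x, kept_y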
Tuning and Validation Strategy
Tuning the hyper-parameters of our models was an important aspect of our approach. Performing k-fold cross-validation on 15 million training instances with thousands of features (corresponding to the vocabulary size of the training set) was prohibitive given our CPU resources. We overcame this problem by performing hyper-parameter tuning locally, on a subsample of the training set. The subsample consisted of the instances of 1,500 randomly selected classes. This approach allowed us to accelerate the calculations. Note that we also validated the applicability of our tuning strategy using the public part of the leaderboard; the decisions that improved the accuracy of our models locally had the same effect on our scores on the public leaderboard. A trick that we found useful was to use the current best submission on the public board as a gold standard in order to validate the trained models. This helped us to avoid unnecessary submissions.

Learning models. We relied on linear models as our base learning choice due to their efficiency on high-dimensional tasks like text classification. Support Vector Machines (SVMs) are well known for achieving state-of-the-art performance in such tasks. For learning the base models we used the Liblinear library, which supports learning linear models on high-dimensional datasets (Fan et al., 2008).

We tried two main strategies: a) flat classifiers, which ignore the hierarchical structure among the classes, and b) hierarchical top-down classification. For flat classification we followed the One-Versus-All (OVA) approach, with a complexity O(K) in the number of classes K. For top-down classification we trained a multi-class classifier for each parent node in the hierarchy of products; during prediction we started from the root, selected the best class according to the current multi-class classifier and proceeded iteratively until reaching a leaf node (a short sketch of this prediction loop is given below). Note that top-down hierarchical classification has a complexity logarithmic in the number of classes, which accelerates the training and prediction processes significantly.

Trying to explore different learning methods that would also help to diversify the ensemble, we experimented with k-Nearest Neighbors classifiers, Rocchio classifiers (also known as Nearest-Centroid classifiers) and online stochastic gradient-descent methods. Although these methods are widely studied, we found that they did not give satisfactory results. For instance, our 3-Nearest Neighbors run using 200 thousand unigram features with the tf-idf representation achieved an accuracy score of 41.62% on the public leaderboard. We also experimented with text embeddings using the word2vec tool (Mikolov et al., 2013); generating text representations in a low-dimensional space (200 dimensions) and using 3-NN as a category prediction approach improved over using the tf-idf representation but was still far from performing competitively. For the above reasons we only report results for SVMs in the rest of the report. In the majority of the models we used L2-regularized L2-loss SVMs in the dual form and we set C = 1.0 with a bias term (Liblinear's -B option).
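Returning to the top-down strategy described above, here is a minimal sketch of the prediction loop; the data structures (a children map and one fitted classifier per parent node) are illustrative assumptions rather than our actual implementation.

def predict_top_down(x, root, children, classifiers):
    """Predict the leaf category of one item by descending the class hierarchy.

    x           -- feature vector of a single product, shape (1, d)
    root        -- identifier of the root node of the hierarchy
    children    -- dict mapping a node to the list of its child nodes ([] for leaves)
    classifiers -- dict mapping a parent node to a fitted multi-class classifier
                   whose classes are that parent's children
    """
    node = root
    while children.get(node):                   # stop once we reach a leaf
        node = classifiers[node].predict(x)[0]  # pick the best child at this level
    return node                                 # lowest-level ("Categorie3") class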
Ensembles. Our final solutions were based on averaging the models in an ensemble which contained 103 models. We experimented with simple voting as well as with weighted voting schemes. Simple voting always improved performance when it was calculated over a fraction of the whole ensemble containing only the best-performing models. A simple approach was to order the models according to their performance (in terms of accuracy) with respect to the current best model on the leaderboard. Then simple majority voting was applied on around 20%-30% of the ordered classifiers. This procedure creates a homogeneous sub-ensemble that is likely to reduce the variance. We also employed a weighted voting scheme that gave more weight to a few of our best single models. Our final best submission (64.20% on the private board) was a weighted voting ensemble giving bigger weights to the two best models. Weighted voting consistently improved accuracy by about 1.2%-1.8%.
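A minimal sketch of such weighted voting follows; the example labels and weights are arbitrary.

from collections import defaultdict

def weighted_vote(predictions, weights):
    """predictions: one predicted label per base model; weights: one weight per model."""
    scores = defaultdict(float)
    for label, w in zip(predictions, weights):
        scores[label] += w                     # each model votes with its weight
    return max(scores, key=scores.get)         # class with the largest total weight

# e.g. the two best models get weight 2, the remaining models weight 1
print(weighted_vote(["A", "B", "A", "C"], [2, 2, 1, 1]))   # -> A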
During the challenge we used scikit-learn (Pedregosa et al., 2011) as well as our own scripts to pre-process the raw data. For training the models we experimented with Liblinear, scikit-learn and Vowpal Wabbit. We had full access to a machine with 4 cores at 2.4 GHz and 16 GB of RAM, and limited access to a shared machine with 24 cores at 3.3 GHz and 128 GB of RAM. On the first machine we ran the experiments with up to 270 thousand features, and on the second the experiments with more features that required more memory.
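For reference, here is a sketch of how a flat one-versus-all linear SVM with the hyper-parameters mentioned earlier could be configured through scikit-learn's Liblinear wrapper; this is not our exact training script.

from sklearn.svm import LinearSVC

clf = LinearSVC(penalty="l2",           # L2 regularization
                loss="squared_hinge",   # L2 loss
                dual=True,              # solve the dual problem
                C=1.0,
                fit_intercept=True)     # roughly Liblinear's bias option (-B)
# clf.fit(X_power, y)   # X_power: tf-idf features, y: Categorie3 labels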
Results. We provide in Table 2 a subset of our submissions along with the public and private scores we obtained. Comparing the pairs of submissions (1), (2) and (9), (10), it is clear that the α-power transformation helps the performance with respect to accuracy. From submissions (1), (3) and (4) one can see that the performance increases as the number of unigram features increases. However, at around 300 thousand features the improvements become negligible. The pairs of submissions (3), (5) and (2), (6) demonstrate the advantage of adding bigrams to the feature set on top of unigrams. Going further and adding trigrams as well improves the performance, as indicated by comparing submissions (8), (9) and (10) with the rest of the table. Note that in each of the above cases the features were selected according to their frequency. For instance, in submission (1) we selected the 200K most common unigrams, and in submission (9) the most common features among all unigrams, bigrams and trigrams.

We also tested several hierarchical models using 200,000 unigrams for each parent node in the hierarchy. The best such submission was one with the first level pruned, achieving 60.61% on the private board, while the fully hierarchical model got only 58.88%. Note that in the case of the pruned hierarchy we remove a step during prediction and thus reduce the propagated errors. Several other hierarchical models were used to increase the variability of the ensemble.

Table 3 presents the best single system, which was trained on approximately half of the data and where the "Marque" field was concatenated directly into the description. We also present the results of the two best weighted voting systems, which used a total of 103 base models. Additionally, we report the coverage of each system, which shows how many classes were detected during the prediction phase.

# | Description | Public | Private
1 | 200K unigrams | 61.09 | 60.77
2 | 200K unigrams, α=0.5 | 61.60 | 61.25
3 | 250K unigrams | 61.142 | 60.77
4 | 300K unigrams | 61.148 | 60.87
5 | 250K unigrams, 250K bigrams | 61.79 | 61.37
6 | 200K unigrams, 400K bigrams, α=0.5 | 62.09 | 61.76
7 | 200K unigrams, 400K bigrams, α=0.5, "Marque" as binary feature | 62.64 | 62.15
8 | 1.2M unigrams, bigrams, trigrams, α=0.5 | 62.28 | 61.99
9 | 2M unigrams, bigrams, trigrams | 62.35 | 61.83
10 | 2M unigrams, bigrams, trigrams, α=0.5 | 63.30 | 62.99

Table 2: A subset of our base model submissions with their scores in the public and private leaderboard.

# | Description | Public | Private | Coverage
1 | 270K unigrams, α=0.5, half data | 63.56 | 63.11 | 3,208
2 | Weighted voting 1 | 64.55 | 64.20 | 3,128
3 | Weighted voting 2 | 64.57 | 64.14 | 3,116

Table 3: Our best submissions with their scores in the public and private leaderboard.

4 Discussion and lessons learnt

Participating in the cDiscount 2015 challenge was a nice and interesting experience with several lessons learnt. We shortly discuss the most important ones below:

• Data preparation. Although short-text classification for e-Commerce products is not a new task, the cDiscount problem had two particularities: the huge number of training instances and the big vocabulary, even after carefully cleaning the dataset. We found that feature selection and feature engineering, performed by studying the statistics of the dataset, the predictions of base models and by trying to identify patterns for the classes, played an important role. As an example we would like to highlight the "Marque" field: we found in the late stages of the challenge that by using this information our models could benefit significantly.

• Learning tools. From the early stages of the challenge we decided to use SVMs, which are known to perform well in such problems. It is the case, however, that the more tools one can use efficiently the better, since there is no strategy that works in every setting. Unfortunately, we did not perform an exhaustive hyper-parameter tuning of the Vowpal Wabbit online learning system, which had the potential to provide a high-performing system.

• Ensemble methods. Base models such as SVMs can obtain satisfactory performance in classification tasks. In the framework of a challenge, however, ensemble methods have the potential of deciding the winner. Ensemble methods (stacking, feature-weighted linear stacking etc.) with a pool of strong and diversified models can improve performance by several units of accuracy. Here we relied on weighted voting processes, but we believe that using more sophisticated techniques would have helped us.
Acknowledgements
We would like to thank the AMA team of the University of Grenoble-Alpes for providing us the machines where we ran our algorithms.

References

[Balikas et al.2014] George Balikas, Ioannis Partalas, Axel-Cyrille Ngonga Ngomo, Anastasia Krithara, Eric Gaussier, and George Paliouras. 2014. Results of the BioASQ track of the question answering lab at CLEF 2014, pages 1181–1193.

[Fan et al.2008] Rong-En Fan, Kai-Wei Chang, Cho-Jui Hsieh, Xiang-Rui Wang, and Chih-Jen Lin. 2008. LIBLINEAR: A library for large linear classification. The Journal of Machine Learning Research, 9:1871–1874.

[Jegou et al.2012] H. Jegou, F. Perronnin, M. Douze, J. Sanchez, P. Perez, and C. Schmid. 2012. Aggregating local image descriptors into compact codes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(9):1704–1716.

[Mikolov et al.2013] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781.

[Partalas et al.2015] Ioannis Partalas, Aris Kosmopoulos, Nicolas Baskiotis, Thierry Artieres, George Paliouras, Eric Gaussier, Ion Androutsopoulos, Massih-Reza Amini, and Patrick Galinari. 2015. LSHTC: A benchmark for large-scale text classification. arXiv preprint arXiv:1503.08581.

[Pedregosa et al.2011] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel, P. Prettenhofer, R. Weiss, V. Dubourg, J. Vanderplas, A. Passos, D. Cournapeau, M. Brucher, M. Perrot, and E. Duchesnay. 2011. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830.

[Shen et al.2012] Dan Shen, Jean-David Ruvini, and Badrul Sarwar. 2012. Large-scale item categorization for e-commerce. In Proceedings of the 21st ACM International Conference on Information and Knowledge Management, pages 595–604. ACM.