Search Intelligence: Deep Learning For Dominant Category Prediction
eBay Inc. Email: zmalik, mkobrosli, [email protected]
Abstract—Deep neural networks, and specifically fully-connected convolutional neural networks, are achieving remarkable results across a wide variety of domains. They have been trained to achieve state-of-the-art performance when applied to problems such as speech recognition, image classification, natural language processing and bioinformatics. Most of these deep learning models, when applied to classification, employ the softmax activation function for prediction and aim to minimize cross-entropy loss. In this paper, we propose a supervised model for dominant category prediction to improve search recall across all eBay Classifieds platforms. The dominant category label for each query over the last 90 days is first calculated by summing the total number of collaborative clicks among all categories; the category with the highest number of collaborative clicks for a given query is considered its dominant category. Second, each query is transformed into a numeric vector by mapping each unique word in the query document to a unique integer value; all vectors are padded to equal length based on the maximum document length within the pre-defined vocabulary size. A fully-connected deep convolutional neural network (CNN) is then applied for classification. The proposed model achieves very high classification accuracy compared to other state-of-the-art machine learning techniques.
I. INTRODUCTION
In recent years Deep Belief Networks have achieved remarkable results in natural language processing [1], computer vision [2][3] and speech recognition [2] tasks. Specifically, within natural language processing, modeling information in search queries and documents has been a long-standing research topic [4][5]. Most of the work with deep learning has involved learning word vector representations through neural language models [6][7][8] and performing composition over the learned word vectors for classification [9].

The optimal transformation in our case was to map each query document to a single numeric vector by assigning a single numeric value to each unique word across all query documents. A second phase was then employed by mapping the numerically transformed query vectors to a random embedding space with a uniform distribution between -1 and 1. This helped reduce the distance between queries sharing similar words while pushing queries with more dissimilar words further apart in the data space. Another suitable approach applicable to our problem was proposed by Johnson and Zhang [10] in 2014, who describe a similar model but swap in high-dimensional 'one-hot' vector representations of words as CNN inputs.

Convolutional Neural Networks (CNN) are biologically-inspired variants of Multi-Layer Perceptrons (MLP). They utilize layers with convolving filters that are applied to local features [11], and were originally invented for computer vision. Convolutional neural networks have also been shown to be highly effective for natural language processing and have achieved excellent results in information retrieval [12], semantic parsing [13], sentence modeling [14] and other traditional natural language processing tasks [9].

Before going into the details of our model architecture and results, we first describe the work we did to prepare our query data for modelling. A minimal sketch of the query transformation is given below.
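The sketch below illustrates the transformation described above: each unique word is assigned an integer id, queries are zero-padded to a fixed length, and the ids index into a randomly initialized embedding table drawn uniformly from [-1, 1]. This is a minimal illustration under stated assumptions, not the production pipeline; helper names such as build_vocab and queries_to_vectors are our own.

```python
import numpy as np

def build_vocab(queries, vocab_size=15000):
    """Assign a unique integer to each word across all query documents.
    Index 0 is reserved for padding."""
    words = {}
    for q in queries:
        for w in q.split():
            if w not in words and len(words) < vocab_size - 1:
                words[w] = len(words) + 1  # 0 is the pad index
    return words

def queries_to_vectors(queries, vocab, max_len=10):
    """Map each query to a fixed-length integer vector, zero-padded."""
    out = np.zeros((len(queries), max_len), dtype=np.int64)
    for i, q in enumerate(queries):
        ids = [vocab.get(w, 0) for w in q.split()][:max_len]
        out[i, :len(ids)] = ids
    return out

# Random embedding table, uniform in [-1, 1] as described above.
vocab = build_vocab(["giving away free free", "2007 civic"])
embedding = np.random.uniform(-1.0, 1.0, size=(15000, 128))
X = queries_to_vectors(["giving away free free"], vocab)
vectors = embedding[X]  # shape: (1, 10, 128)
```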
II. QUERY DATA PREPARATION
The advertisements on eBay's classifieds platforms are classified according to a pre-defined hierarchy. The first level (L1) of this hierarchy categorizes advertisements into general groupings like 'buy & sell', 'cars & vehicles', 'real estate', 'pets', 'jobs', 'services', 'vacation rentals' and 'community'. The second level (L2) further classifies each L1 category into many subclasses with more specificity, the third level (L3) classifies further, and so on. Most platforms terminate the hierarchy at a depth of three or four levels. In this paper we only demonstrate the results of our work related to L1-category query classification.

For each keyword search initiated within a user session at the all-advertisement level (meaning a search across all inventory with no category restrictions employed), the chain of actions on that search is analysed. When that sequence of actions results in a view of an advertisement within a specific category, that category is scored with a dominance point for the given query. There are many noisy factors that must be accounted for when applying this technique, including bots, redundant query actions, filtering out conversions to categories that no longer exist, and filtering out queries without enough conversions.

The dominant category for each query document over the last 90 days is computed on the basis of the maximum number of collaborative clicks for each L1 category. The category with the highest number of clicks is considered the dominant category for that query. This also enabled us to produce the first, second and third highest dominant categories and their respective conversion rates for each query. The conversion rate per query is calculated by counting the total number of clicks for each category and dividing by the total number of clicks for that query.

Finally, all query documents for the last 90 days are standardized by transforming them to lower case and removing duplicate queries, extra spaces, punctuation and all other noise factors. A single pattern from each L1 category of the final preprocessed data, ready to be used for learning, is shown in Table I. In Table I the CategoryID feature is used as the label for supervised learning with a deep convolutional neural network. The total number of distinct query patterns for most of the categories over the last 90 days ranges between 5000 and 7000. A small sketch of the aggregation is given below.
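The dominance computation reduces to a grouped count over the click log. The following pandas sketch shows one way to express it, assuming a hypothetical clicks table with query and category columns; the actual log schema and the bot/noise filtering described above are not shown.

```python
import pandas as pd

# Hypothetical click log: one row per (query, category) ad view within a session.
clicks = pd.DataFrame({
    "query":    ["cash jobs", "cash jobs", "cash jobs", "mortgage"],
    "category": [45, 45, 72, 34],
})

# Clicks per (query, category) pair over the 90-day window.
counts = clicks.groupby(["query", "category"]).size().rename("clicks").reset_index()

# Conversion rate: clicks for a category divided by total clicks for the query.
counts["conversion_rate"] = counts["clicks"] / counts.groupby("query")["clicks"].transform("sum")

# Dominant category: the category with the most clicks for each query.
dominant = counts.sort_values("clicks", ascending=False).groupby("query").head(1)
print(dominant)
```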
III. MODEL ARCHITECTURE

The model architecture, shown in Figure 1, follows [15] and [1]. Let $x_i \in \mathbb{R}^k$ be the $k$-dimensional transformed numeric vector for each query document, mapping each word in the query document to an integer within the defined vocabulary size.

Suppose we have a query document $D = (w_1, w_2, \ldots, w_N)$ with vocabulary $V$. A CNN requires as input a vector representation of the data that uniquely preserves internal locations (word order in this case). The most straightforward suitable representation is to treat each word as a pixel, treat $D$ as if it were an image of $|D| \times 1$ pixels, and represent each pixel (i.e. each word) with a unique numeric value. As a running example, suppose the query document is $D = \{$"giving", "away", "free", "free"$\}$ and we associate each word with a unique numeric value. We then have the document vector

$$x = [1235, \ldots] \qquad (1)$$

All query document vectors are padded to equal length based on the maximum document size in the last ninety days of the query corpus, and a document is represented as

$$x_{1:n} = x_1 \oplus x_2 \oplus \ldots \oplus x_n \qquad (2)$$

where $\oplus$ is the concatenation operator. Let $x_{i:i+j}$ refer to the concatenation of words $x_i, x_{i+1}, \ldots, x_{i+j}$ of a single query document, with the unique numeric conversion applied to each word. A filter $w \in \mathbb{R}^{hk}$ defines a convolution operation, which is applied to a window of $h$ words to produce a new feature. A feature $c_i$ is generated from a window of words $x_{i:i+h-1}$ by

$$c_i = f(w \cdot x_{i:i+h-1} + b) \qquad (3)$$

where $b \in \mathbb{R}$ is a bias term and $f$ is a non-linear activation function such as the hyperbolic tangent. The filter is applied to each possible window of words in $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$ to produce a feature map

$$c = [c_1, c_2, \ldots, c_{n-h+1}] \qquad (4)$$

with $c \in \mathbb{R}^{n-h+1}$. The feature map is followed by a rectified linear unit, which zeros out negative values and produces sparse activations. Next comes the max-pooling layer, which captures the most significant feature, the one with the highest value in each feature map.

Above, we explained the process by which one feature is extracted from one filter. The proposed model uses multiple filters with varying window sizes to obtain multiple features. These extracted significant features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over 8 labels.

We employ dropout for regularization on the penultimate layer [2]. Dropout helps prevent co-adaptation of hidden units by randomly dropping units with a certain probability. Given the penultimate layer $z = [\hat{c}_1, \ldots, \hat{c}_m]$, dropout uses

$$y = w \cdot (z \circ r) + b \qquad (5)$$

where $\circ$ is the element-wise multiplication operator and $r \in \mathbb{R}^m$ is a 'masking' vector of Bernoulli random variables with probability $p$ of being 1. In this way the dropout mechanism stochastically disables a fraction of the penultimate layer's neurons, which prevents neurons from co-adapting and forces them to learn individually useful features. The fraction of neurons to keep enabled is defined by the dropout keep probability input to the network. Equations (3)-(5) are sketched in code below.
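As a concrete illustration of equations (3)-(5), the following NumPy sketch applies one filter over every window of h word vectors, rectifies and max-pools the feature map, and applies a Bernoulli dropout mask. It is a toy restatement of the math under our own variable names, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_feature_map(x, w, b, h):
    """Eq. (3)-(4): slide a filter w over every window of h word vectors.
    x: (n, k) matrix of word vectors; w: (h*k,) filter; b: scalar bias."""
    n, k = x.shape
    c = np.array([np.tanh(w @ x[i:i + h].ravel() + b) for i in range(n - h + 1)])
    return np.maximum(c, 0.0)  # rectified linear unit on the feature map

def max_pool(c):
    """Keep the single most significant feature per map."""
    return c.max()

def dropout(z, keep_prob=0.5):
    """Eq. (5): Bernoulli masking vector r with P(r_j = 1) = keep_prob."""
    r = rng.binomial(1, keep_prob, size=z.shape)
    return z * r

# Toy run: a 10-word query embedded in k = 128 dimensions, window size h = 3.
x = rng.uniform(-1.0, 1.0, size=(10, 128))
w = rng.normal(size=3 * 128)
pooled = max_pool(conv_feature_map(x, w, b=0.1, h=3))
z = np.array([pooled] * 384)  # stand-in for the 3 x 128 pooled features
y = dropout(z, keep_prob=0.5)
```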
Table II summarizes the configuration details of the employed deep convolutional neural network, which solved the dominant category prediction problem across several eBay Classifieds platforms. The first column gives the size of the embedding layer, which maps the input to an embedding space. The filter size gives the number of words considered in each convolutional filter; the total number of filters for each window of size 1, 2 and 3 is 128. The batch size and number of epochs for training are set to 64 and 100. The maximum length of a query sequence in our case is 10 and the total number of L1-category classes is 8. The training time of the algorithm for the 90 days of data is approximately 50 minutes on a 2.8 GHz Intel Core i7.

The summary statistics of our pre-computed dominant category prediction dataset are shown in Table III, which lists the total number of classes, average sentence length, vocabulary size, and training and testing set sizes. A plausible realization of this configuration in code is sketched below.
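For concreteness, here is one plausible Keras realization of the configuration in Table II (embedding dimension 128, parallel filter sizes 1, 2 and 3 with 128 filters each, dropout keep probability 0.5, softmax over 8 classes). The paper does not publish its implementation, so this is a sketch under those stated hyperparameters rather than the authors' code; the optimizer choice is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(vocab_size=12812, seq_len=10, embed_dim=128,
                filter_sizes=(1, 2, 3), num_filters=128,
                dropout_keep_prob=0.5, num_classes=8):
    inputs = layers.Input(shape=(seq_len,), dtype="int32")
    # Random uniform embedding in [-1, 1], as described in Section I.
    emb = layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=tf.keras.initializers.RandomUniform(-1.0, 1.0),
    )(inputs)
    pooled = []
    for h in filter_sizes:
        c = layers.Conv1D(num_filters, h, activation="relu")(emb)
        pooled.append(layers.GlobalMaxPooling1D()(c))
    z = layers.Concatenate()(pooled)                # merged pooled outputs
    z = layers.Dropout(1.0 - dropout_keep_prob)(z)  # Keras takes a drop rate
    outputs = layers.Dense(num_classes, activation="softmax")(z)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",                 # optimizer is our assumption
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training per Table II: model.fit(X_train, y_train, batch_size=64, epochs=100)
```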
IV. RESULTS & DISCUSSION
Results of the proposed model for the dominant category prediction problem, compared to other state-of-the-art methods, are listed in Table IV. The proposed well-tuned deep convolutional neural network outperformed its variations and the other models. We tested predictive accuracy in two ways. First, we used testing data offset from the training data by a few days, shown in the first row of Table IV for every model type; the CNN model produced very high training and testing accuracies of 99.9% and 98.5%. Second, we tested with data from a completely different date range than training, with the outcomes shown in the second row of Table IV for every model type. This is our worst-case scenario: even with completely different testing data for dominant category prediction, the CNN model still produced a very high testing accuracy of 95.8%. A sketch of this evaluation protocol follows.
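The two-way evaluation just described amounts to scoring the same trained model on two held-out query sets drawn from different date ranges. A minimal sketch, assuming arrays prepared as in Section II; the names X_test_near/X_test_far are ours, not the paper's.

```python
import numpy as np

def accuracy(model, X, y):
    """Fraction of queries whose predicted dominant category matches the label."""
    predictions = model.predict(X).argmax(axis=1)  # softmax -> class index
    return float((predictions == np.asarray(y)).mean())

# Score the overlapping and the fully disjoint date ranges (names hypothetical):
# acc_near = accuracy(model, X_test_near, y_test_near)  # first row of Table IV
# acc_far  = accuracy(model, X_test_far,  y_test_far)   # second row of Table IV
```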
TABLE I: A Single Unique Pre-processed Pattern From Each L1 Category

Category Name     | CategoryID | Query             | Conversion Rate | Total Patterns
cars & vehicles   | 27         | 2007 civic        | 0.9857 (98%)    | 5000 - 7000
jobs              | 45         | cash jobs         | 0.7051 (70%)    | 5000 - 7000
services          | 72         | makeup artist     | 0.8911 (89%)    | 5000 - 7000
buy & sell        | 10         | air conditioner   | 0.9783 (97%)    | 5000 - 7000
vacation rentals  | 800        | sherkston shore   | 0.4694 (46%)    | 2000 - 3000
pets              | 112        | western saddle    | 0.8268 (82%)    | 5000 - 7000
real estate       | 34         | mortgage          | 0.4782 (47%)    | 5000 - 7000
community         | 1          | christmas markets | 1.0000 (100%)   | 2000 - 3000
[Fig. 1: Model Architecture. Sliding-window convolutions of region sizes 1, 2 and 3 (total number of filters: 128), an activation function on each feature map followed by a rectified linear unit, max-pooling (1 x 128 each), merging of the pooled outputs, and a fully connected 128 x 8 layer with softmax and regularization, producing outputs C1-C8: buy & sell, cars & vehicles, real estate, pets, jobs, services, vacation rentals, community.]

TABLE II: Configuration of Deep Convolutional Neural Network for L1 Dominant Category Prediction
Embedding Layer Dim. | Filter Sizes | Number of Filters | Dropout Keep Probability | Batch Size | No. of Epochs | Sequence Length | No. of Classes
128                  | 1, 2, 3      | 128               | 0.5                      | 64         | 100           | 10              | 8

The major advantage of the CNN compared to other state-of-the-art approaches is its added capability to learn invariant features. This capability of the CNN to make the convolution process invariant to translation, rotation and shifting helps it map to the same class even when there is a slight change in the input query document.

The step-by-step training accuracy and loss of our convolutional neural network model are shown in Figures 2a and 2b. Initially the accuracy was very low, but it gradually improved at each training step and almost reached one by the end, as shown in Figure 2a. Similarly, the loss was very high in the beginning but fell to almost zero by the end, as shown in Figure 2b. This clearly shows the convergence of the proposed well-tuned deep convolutional neural network.

The multi-layer perceptron model, empirically evaluated with one and two hidden layers of size 200, did not perform well, producing predictive accuracies of 55.91% and 54.98% on the two testing sets. We also tried increasing the number of hidden layers to explicitly add further non-linearity, but the predictive accuracy remained more or less constant.
[Fig. 2: Training Accuracy & Loss of CNN: (a) training accuracy, (b) training loss.]

TABLE III: Summary Statistics of the Dataset
Data              | Number of Classes | Average Sentence Length | Vocabulary Size | Training Size | Testing Size
Dominant-Category | 8                 | 2                       | 12812           | 32088         | 32087
TABLE IV: Results of the proposed well-tuned CNN model against other methods
Model Type                   | Number of Days | Training Date Range      | Testing Date Range       | Training Accuracy  | Testing Accuracy
CNN (Proposed)               | Past 90 Days   | 28-06-2016 to 28-09-2016 | 07-06-2016 to 07-09-2016 | 99.9 %             | 98.5 %
CNN (Proposed)               | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-02-2016 to 28-05-2016 | 99.9 %             | 95.8 %
CNN-static [1]               | Past 90 Days   | 28-06-2016 to 28-09-2016 | 07-06-2016 to 07-09-2016 | -                  | -
CNN-static [1]               | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-02-2016 to 28-05-2016 | -                  | -
CNN-non-static [1]           | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-02-2016 to 28-05-2016 | -                  | -
CNN-non-static [1]           | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-02-2016 to 28-05-2016 | -                  | -
MLP with two hidden layers   | Past 90 Days   | 28-06-2016 to 28-09-2016 | 07-06-2016 to 07-09-2016 | 0.563486 (56.35 %) | 0.559056 (55.91 %)
MLP with two hidden layers   | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-02-2016 to 28-05-2016 | 0.563486 (56.35 %) | 0.549894 (54.98 %)
MLP with single hidden layer | Past 90 Days   | 28-06-2016 to 28-09-2016 | 07-06-2016 to 07-09-2016 | 0.483046 (48.31 %) | 0.479556 (47.95 %)
MLP with single hidden layer | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-02-2016 to 28-05-2016 | 0.483046 (48.31 %) | 0.483915 (48.39 %)
LSTM RNN Network             | Past 90 Days   | 28-06-2016 to 28-09-2016 | 07-06-2016 to 07-09-2016 | 0.658262 (65.83 %) | 0.651895 (65.19 %)
LSTM RNN Network             | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-07-2016 to 28-04-2016 | 0.658262 (65.82 %) | 0.630651 (63.06 %)
LSTM Bi-RNN Network          | Past 90 Days   | 28-06-2016 to 28-09-2016 | 07-06-2016 to 07-09-2016 | 0.536496 (53.65 %) | 0.529887 (52.98 %)
LSTM Bi-RNN Network          | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-07-2016 to 28-04-2016 | 0.536496 (53.65 %) | 0.505335 (50.05 %)

Furthermore, we tried running Long Short-Term Memory (LSTM) recurrent neural networks, which have been shown to outperform other recurrent neural network algorithms, specifically for language modelling [16]. However, since in our case there is no sequence-to-sequence connection between the current and previous activations of the sequential query patterns, the maximum predictive accuracies the LSTM recurrent neural network could produce were 63.06% and 65.19% on the two testing datasets. The bi-directional recurrent neural network performed slightly worse than the LSTM network, producing predictive accuracies of 52.98% and 50.05% on the two testing datasets.
V. CONCLUSION
In the present work we have described a tuned, fully-connected CNN that outperformed its variants and other state-of-the-art machine learning techniques, specifically for query-to-category classification across several eBay Classifieds platforms. Our results add to the evidence that mapping numeric vectors into a random, uniformly distributed embedding space is more suitable, both computationally and performance-wise, than word2vec, specifically for datasets with a limited vocabulary corpus (between 10,000 and 15,000 words) and few words (two to three) in each query document.
VI. ACKNOWLEDGEMENT
The first and second authors are grateful to Johann Schweyer for his contribution to query normalization and aggregation. We are also extremely thankful to Brent Mclean, VP and CTO of eBay Classifieds, for his kind support and encouragement throughout this dominant category prediction project.
REFERENCES

[1] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.
[2] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
[3] Z. K. Malik, A. Hussain, and Q. J. Wu, "Multilayered echo state machine: A novel architecture and algorithm," vol. PP, no. 99, 2016.
[4] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, p. 391, 1990.
[5] J. Gao, J.-Y. Nie, G. Wu, and G. Cao, "Dependence language model for information retrieval," in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2004, pp. 170-177.
[6] Q. V. Le and T. Mikolov, "Distributed representations of sentences and documents," in ICML, vol. 14, 2014, pp. 1188-1196.
[7] Z. K. Malik, A. Hussain, and J. Wu, "Novel biologically inspired approaches to extracting online information from temporal data," Cognitive Computation, vol. 6, no. 3, pp. 595-607, 2014.
[8] Z. K. Malik, A. Hussain, and Q. J. Wu, "An online generalized eigenvalue version of Laplacian eigenmaps for visual big data," Neurocomputing, vol. 173, pp. 127-136, 2016.
[9] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," Journal of Machine Learning Research, vol. 12, no. Aug, pp. 2493-2537, 2011.
[10] R. Johnson and T. Zhang, "Effective use of word order for text categorization with convolutional neural networks," arXiv preprint arXiv:1412.1058, 2014.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[12] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil, "Learning semantic representations using convolutional neural networks for web search," in Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014, pp. 373-374.
[13] W.-t. Yih, K. Toutanova, J. C. Platt, and C. Meek, "Learning discriminative projections for text similarity measures," in Proceedings of the Fifteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2011, pp. 247-256.
[14] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," arXiv preprint arXiv:1404.2188, 2014.
[15] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," arXiv preprint arXiv:1510.03820, 2015.
[16] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," in Interspeech, 2012.