Search Intelligence: Deep Learning For Dominant Category Prediction
eBay Inc. Email: zmalik, mkobrosli, [email protected]
Abstract—Deep neural networks, and specifically fully-connected convolutional neural networks, are achieving remarkable results across a wide variety of domains. They have been trained to achieve state-of-the-art performance when applied to problems such as speech recognition, image classification, natural language processing and bioinformatics. Most of these deep learning models, when applied to classification, employ the softmax activation function for prediction and aim to minimize cross-entropy loss. In this paper, we propose a supervised model for dominant category prediction to improve search recall across all eBay Classifieds platforms. The dominant category label for each query over the last 90 days is first calculated by summing the total number of collaborative clicks among all categories; the category with the highest number of collaborative clicks for a given query is considered its dominant category. Second, each query is transformed into a numeric vector by mapping each unique word in the query document to a unique integer value; all vectors are padded to equal length based on the maximum document length within the pre-defined vocabulary size. A fully-connected deep convolutional neural network (CNN) is then applied for classification. The proposed model achieves very high classification accuracy compared to other state-of-the-art machine learning techniques.
I. INTRODUCTION
In recent years Deep Belief Networks have achieved remarkable results in natural language processing [1], computer vision [2][3] and speech recognition [2] tasks. Specifically, within natural language processing, modeling information in search queries and documents has been a long-standing research topic [4][5]. Most of the work with deep learning has involved learning word vector representations through neural language models [6][7][8] and performing composition over the learned word vectors for classification [9].

The optimal transformation in our case was to map each query document to a single numeric vector by assigning a single numeric value to each unique word across all query documents. A second phase was then employed by mapping the numerically transformed query vectors to a random embedding space with a uniform distribution between -1 and 1. This helped reduce the distance between queries sharing similar words while pushing queries with more dissimilar words further apart in the data space. Another suitable approach applicable to our problem was proposed by Johnson and Zhang [10] in 2014, who describe a similar model but swap in high-dimensional 'one-hot' vector representations of words as CNN inputs.

Convolutional Neural Networks (CNN) are biologically-inspired variants of Multi-Layer Perceptrons (MLP). They utilize layers with convolving filters that are applied to local features [11], and were originally invented for computer vision. Convolutional neural networks have also been shown to be highly effective for natural language processing and have achieved excellent results in information retrieval [12], semantic parsing [13], sentence modeling [14] and other traditional natural language processing tasks [9].

Before going into the details of our model architecture and results, we first describe the work we did to prepare our query data for modelling. A minimal sketch of the query transformation is given below.
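The sketch below illustrates the transformation described above: each unique word is assigned an integer id, queries are zero-padded to a fixed length, and the ids index into a randomly initialized embedding table drawn uniformly from [-1, 1]. This is a minimal illustration under stated assumptions, not the production pipeline; helper names such as build_vocab and queries_to_vectors are our own.

```python
import numpy as np

def build_vocab(queries, vocab_size=15000):
    """Assign a unique integer to each word across all query documents.
    Index 0 is reserved for padding."""
    words = {}
    for q in queries:
        for w in q.split():
            if w not in words and len(words) < vocab_size - 1:
                words[w] = len(words) + 1  # 0 is the pad index
    return words

def queries_to_vectors(queries, vocab, max_len=10):
    """Map each query to a fixed-length integer vector, zero-padded."""
    out = np.zeros((len(queries), max_len), dtype=np.int64)
    for i, q in enumerate(queries):
        ids = [vocab.get(w, 0) for w in q.split()][:max_len]
        out[i, :len(ids)] = ids
    return out

# Random embedding table, uniform in [-1, 1] as described above.
vocab = build_vocab(["giving away free free", "2007 civic"])
embedding = np.random.uniform(-1.0, 1.0, size=(15000, 128))
X = queries_to_vectors(["giving away free free"], vocab)
vectors = embedding[X]  # shape: (1, 10, 128)
```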
II. QUERY DATA PREPARATION
The advertisements on eBay's classifieds platforms are classified according to a pre-defined hierarchy. The first level (L1) of this hierarchy categorizes advertisements into general groupings like 'buy & sell', 'cars & vehicles', 'real estate', 'pets', 'jobs', 'services', 'vacation rentals' and 'community'. The second level (L2) further classifies each L1 category into many subclasses with more specificity, the third level (L3) classifies further, and so on. Most platforms terminate the hierarchy at a depth of three or four levels. In this paper we only demonstrate the results of our work related to L1-category query classification.

For each keyword search initiated within a user session at the all-advertisement level (meaning a search across all inventory with no category restrictions employed), the chain of actions on that search is analysed. When that sequence of actions results in a view of an advertisement within a specific category, that category is scored with a dominance point for the given query. There are many noisy factors that must be accounted for when applying this technique, including bots, redundant query actions, filtering out conversions to categories that no longer exist, and filtering out queries without enough conversions.

The dominant category for each query document over the last 90 days is computed on the basis of the maximum number of collaborative clicks for each L1 category. The category with the highest number of clicks is considered the dominant category for that query. This also enabled us to produce the first, second and third highest dominant categories and their respective conversion rates for each query. The conversion rate per query is calculated by counting the total number of clicks for each category and dividing by the total number of clicks for that query.

Finally, all query documents for the last 90 days are standardized by transforming them to lower case and removing duplicate queries, extra spaces, punctuation and all other noise factors. A single pattern from each L1 category of the final preprocessed data, ready to be used for learning, is shown in Table I. In Table I the CategoryID feature is used as the label for supervised learning with a deep convolutional neural network. The total number of distinct query patterns for most of the categories over the last 90 days ranges between 5000 and 7000. A small sketch of the aggregation is given below.
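The dominance computation reduces to a grouped count over the click log. The following pandas sketch shows one way to express it, assuming a hypothetical clicks table with query and category columns; the actual log schema and the bot/noise filtering described above are not shown.

```python
import pandas as pd

# Hypothetical click log: one row per (query, category) ad view within a session.
clicks = pd.DataFrame({
    "query":    ["cash jobs", "cash jobs", "cash jobs", "mortgage"],
    "category": [45, 45, 72, 34],
})

# Clicks per (query, category) pair over the 90-day window.
counts = clicks.groupby(["query", "category"]).size().rename("clicks").reset_index()

# Conversion rate: clicks for a category divided by total clicks for the query.
counts["conversion_rate"] = counts["clicks"] / counts.groupby("query")["clicks"].transform("sum")

# Dominant category: the category with the most clicks for each query.
dominant = counts.sort_values("clicks", ascending=False).groupby("query").head(1)
print(dominant)
```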
III. MODEL ARCHITECTURE

The model architecture, shown in Figure 1, follows [15] and [1]. Let $x_i \in \mathbb{R}^k$ be the $k$-dimensional transformed numeric vector for each query document, mapping each word in the query document to an integer within the defined vocabulary size.

Suppose we have a query document $D = (w_1, w_2, \ldots, w_N)$ with vocabulary $V$. A CNN requires as input a vector representation of the data that uniquely preserves internal locations (word order in this case). The most straightforward suitable representation is to treat each word as a pixel, treat $D$ as if it were an image of $|D| \times 1$ pixels, and represent each pixel (i.e. each word) with a unique numeric value. As a running example, suppose the query document is $D = \{$"giving", "away", "free", "free"$\}$ and we associate each word with a unique numeric value. We then have the document vector

$$x = [1235, \ldots] \qquad (1)$$

All query document vectors are padded to equal length based on the maximum document size in the last ninety days of the query corpus, and a document is represented as

$$x_{1:n} = x_1 \oplus x_2 \oplus \ldots \oplus x_n \qquad (2)$$

where $\oplus$ is the concatenation operator. Let $x_{i:i+j}$ refer to the concatenation of words $x_i, x_{i+1}, \ldots, x_{i+j}$ of a single query document, with the unique numeric conversion applied to each word. A filter $w \in \mathbb{R}^{hk}$ defines a convolution operation, which is applied to a window of $h$ words to produce a new feature. A feature $c_i$ is generated from a window of words $x_{i:i+h-1}$ by

$$c_i = f(w \cdot x_{i:i+h-1} + b) \qquad (3)$$

where $b \in \mathbb{R}$ is a bias term and $f$ is a non-linear activation function such as the hyperbolic tangent. The filter is applied to each possible window of words in $\{x_{1:h}, x_{2:h+1}, \ldots, x_{n-h+1:n}\}$ to produce a feature map

$$c = [c_1, c_2, \ldots, c_{n-h+1}] \qquad (4)$$

with $c \in \mathbb{R}^{n-h+1}$. The feature map is followed by a rectified linear unit, which zeros out negative values and produces sparse activations. Next comes the max-pooling layer, which captures the most significant feature, the one with the highest value in each feature map.

Above, we explained the process by which one feature is extracted from one filter. The proposed model uses multiple filters with varying window sizes to obtain multiple features. These extracted significant features form the penultimate layer and are passed to a fully connected softmax layer whose output is the probability distribution over 8 labels.

We employ dropout for regularization on the penultimate layer [2]. Dropout helps prevent co-adaptation of hidden units by randomly dropping units with a certain probability. Given the penultimate layer $z = [\hat{c}_1, \ldots, \hat{c}_m]$, dropout uses

$$y = w \cdot (z \circ r) + b \qquad (5)$$

where $\circ$ is the element-wise multiplication operator and $r \in \mathbb{R}^m$ is a 'masking' vector of Bernoulli random variables with probability $p$ of being 1. In this way the dropout mechanism stochastically disables a fraction of the penultimate layer's neurons, which prevents neurons from co-adapting and forces them to learn individually useful features. The fraction of neurons to keep enabled is defined by the dropout keep probability input to the network. Equations (3)-(5) are sketched in code below.
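As a concrete illustration of equations (3)-(5), the following NumPy sketch applies one filter over every window of h word vectors, rectifies and max-pools the feature map, and applies a Bernoulli dropout mask. It is a toy restatement of the math under our own variable names, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv_feature_map(x, w, b, h):
    """Eq. (3)-(4): slide a filter w over every window of h word vectors.
    x: (n, k) matrix of word vectors; w: (h*k,) filter; b: scalar bias."""
    n, k = x.shape
    c = np.array([np.tanh(w @ x[i:i + h].ravel() + b) for i in range(n - h + 1)])
    return np.maximum(c, 0.0)  # rectified linear unit on the feature map

def max_pool(c):
    """Keep the single most significant feature per map."""
    return c.max()

def dropout(z, keep_prob=0.5):
    """Eq. (5): Bernoulli masking vector r with P(r_j = 1) = keep_prob."""
    r = rng.binomial(1, keep_prob, size=z.shape)
    return z * r

# Toy run: a 10-word query embedded in k = 128 dimensions, window size h = 3.
x = rng.uniform(-1.0, 1.0, size=(10, 128))
w = rng.normal(size=3 * 128)
pooled = max_pool(conv_feature_map(x, w, b=0.1, h=3))
z = np.array([pooled] * 384)  # stand-in for the 3 x 128 pooled features
y = dropout(z, keep_prob=0.5)
```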
Table II summarizes the configuration details of the employed deep convolutional neural network, which solved the dominant category prediction problem across several eBay Classifieds platforms. The first column gives the size of the embedding layer, which maps the input to an embedding space. The filter size gives the number of words considered in each convolutional filter; the total number of filters for each window of size 1, 2 and 3 is 128. The batch size and number of epochs for training are set to 64 and 100. The maximum length of a query sequence in our case is 10 and the total number of L1-category classes is 8. The training time of the algorithm for the 90 days of data is approximately 50 minutes on a 2.8 GHz Intel Core i7.

The summary statistics of our pre-computed dominant category prediction dataset are shown in Table III, which lists the total number of classes, average sentence length, vocabulary size, and training and testing set sizes. A plausible realization of this configuration in code is sketched below.
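For concreteness, here is one plausible Keras realization of the configuration in Table II (embedding dimension 128, parallel filter sizes 1, 2 and 3 with 128 filters each, dropout keep probability 0.5, softmax over 8 classes). The paper does not publish its implementation, so this is a sketch under those stated hyperparameters rather than the authors' code; the optimizer choice is an assumption.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_model(vocab_size=12812, seq_len=10, embed_dim=128,
                filter_sizes=(1, 2, 3), num_filters=128,
                dropout_keep_prob=0.5, num_classes=8):
    inputs = layers.Input(shape=(seq_len,), dtype="int32")
    # Random uniform embedding in [-1, 1], as described in Section I.
    emb = layers.Embedding(
        vocab_size, embed_dim,
        embeddings_initializer=tf.keras.initializers.RandomUniform(-1.0, 1.0),
    )(inputs)
    pooled = []
    for h in filter_sizes:
        c = layers.Conv1D(num_filters, h, activation="relu")(emb)
        pooled.append(layers.GlobalMaxPooling1D()(c))
    z = layers.Concatenate()(pooled)                # merged pooled outputs
    z = layers.Dropout(1.0 - dropout_keep_prob)(z)  # Keras takes a drop rate
    outputs = layers.Dense(num_classes, activation="softmax")(z)
    model = tf.keras.Model(inputs, outputs)
    model.compile(optimizer="adam",                 # optimizer is our assumption
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

# Training per Table II: model.fit(X_train, y_train, batch_size=64, epochs=100)
```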
IV. RESULTS & DISCUSSION
Results of the proposed model for the dominant category prediction problem, compared to other state-of-the-art methods, are listed in Table IV. The proposed well-tuned deep convolutional neural network outperformed its variations and the other models. We tested predictive accuracy in two ways. First, we used testing data offset from the training data by a few days, shown in the first row of Table IV for every model type; the CNN model produced very high training and testing accuracies of 99.9% and 98.5%. Second, we tested with data from a completely different date range than training, with the outcomes shown in the second row of Table IV for every model type. This is our worst-case scenario: even with completely different testing data for dominant category prediction, the CNN model still produced a very high testing accuracy of 95.8%. A sketch of this evaluation protocol follows.
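The two-way evaluation just described amounts to scoring the same trained model on two held-out query sets drawn from different date ranges. A minimal sketch, assuming arrays prepared as in Section II; the names X_test_near/X_test_far are ours, not the paper's.

```python
import numpy as np

def accuracy(model, X, y):
    """Fraction of queries whose predicted dominant category matches the label."""
    predictions = model.predict(X).argmax(axis=1)  # softmax -> class index
    return float((predictions == np.asarray(y)).mean())

# Score the overlapping and the fully disjoint date ranges (names hypothetical):
# acc_near = accuracy(model, X_test_near, y_test_near)  # first row of Table IV
# acc_far  = accuracy(model, X_test_far,  y_test_far)   # second row of Table IV
```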
TABLE I: A Single Unique Pre-processed Pattern From Each L1 Category

Category Name     | CategoryID | Query             | Conversion Rate | Total Patterns
cars & vehicles   | 27         | 2007 civic        | 0.9857 (98%)    | 5000 - 7000
jobs              | 45         | cash jobs         | 0.7051 (70%)    | 5000 - 7000
services          | 72         | makeup artist     | 0.8911 (89%)    | 5000 - 7000
buy & sell        | 10         | air conditioner   | 0.9783 (97%)    | 5000 - 7000
vacation rentals  | 800        | sherkston shore   | 0.4694 (46%)    | 2000 - 3000
pets              | 112        | western saddle    | 0.8268 (82%)    | 5000 - 7000
real estate       | 34         | mortgage          | 0.4782 (47%)    | 5000 - 7000
community         | 1          | christmas markets | 1.0000 (100%)   | 2000 - 3000
[Fig. 1: Model Architecture. Sliding-window convolutions of region sizes 1, 2 and 3 (total number of filters: 128), an activation function on each feature map followed by a rectified linear unit, max-pooling (1 x 128 each), merging of the pooled outputs, and a fully connected 128 x 8 layer with softmax and regularization, producing outputs C1-C8: buy & sell, cars & vehicles, real estate, pets, jobs, services, vacation rentals, community.]

TABLE II: Configuration of Deep Convolutional Neural Network for L1 Dominant Category Prediction
Embedding Layer Dim. | Filter Sizes | Number of Filters | Dropout Keep Probability | Batch Size | No. of Epochs | Sequence Length | No. of Classes
128                  | 1, 2, 3      | 128               | 0.5                      | 64         | 100           | 10              | 8

The major advantage of the CNN compared to other state-of-the-art approaches is its added capability to learn invariant features. This capability of the CNN to make the convolution process invariant to translation, rotation and shifting helps it map to the same class even when there is a slight change in the input query document.

The step-by-step training accuracy and loss of our convolutional neural network model are shown in Figures 2a and 2b. Initially the accuracy was very low, but it gradually improved at each training step and almost reached one by the end, as shown in Figure 2a. Similarly, the loss was very high in the beginning but fell to almost zero by the end, as shown in Figure 2b. This clearly shows the convergence of the proposed well-tuned deep convolutional neural network.

The multi-layer perceptron model, empirically evaluated with one and two hidden layers of size 200, did not perform well, producing predictive accuracies of 55.91% and 54.98% on the two testing sets. We also tried increasing the number of hidden layers to explicitly add further non-linearity, but the predictive accuracy remained more or less constant.
[Fig. 2: Training Accuracy & Loss of CNN: (a) training accuracy, (b) training loss.]

TABLE III: Summary Statistics of the Dataset
Data              | Number of Classes | Average Sentence Length | Vocabulary Size | Training Size | Testing Size
Dominant-Category | 8                 | 2                       | 12812           | 32088         | 32087
TABLE IV: Results of the proposed well-tuned CNN model against other methods
Model Type                   | Number of Days | Training Date Range      | Testing Date Range       | Training Accuracy  | Testing Accuracy
CNN (Proposed)               | Past 90 Days   | 28-06-2016 to 28-09-2016 | 07-06-2016 to 07-09-2016 | 99.9 %             | 98.5 %
CNN (Proposed)               | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-02-2016 to 28-05-2016 | 99.9 %             | 95.8 %
CNN-static [1]               | Past 90 Days   | 28-06-2016 to 28-09-2016 | 07-06-2016 to 07-09-2016 | -                  | -
CNN-static [1]               | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-02-2016 to 28-05-2016 | -                  | -
CNN-non-static [1]           | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-02-2016 to 28-05-2016 | -                  | -
CNN-non-static [1]           | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-02-2016 to 28-05-2016 | -                  | -
MLP with two hidden layers   | Past 90 Days   | 28-06-2016 to 28-09-2016 | 07-06-2016 to 07-09-2016 | 0.563486 (56.35 %) | 0.559056 (55.91 %)
MLP with two hidden layers   | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-02-2016 to 28-05-2016 | 0.563486 (56.35 %) | 0.549894 (54.98 %)
MLP with single hidden layer | Past 90 Days   | 28-06-2016 to 28-09-2016 | 07-06-2016 to 07-09-2016 | 0.483046 (48.31 %) | 0.479556 (47.95 %)
MLP with single hidden layer | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-02-2016 to 28-05-2016 | 0.483046 (48.31 %) | 0.483915 (48.39 %)
LSTM RNN Network             | Past 90 Days   | 28-06-2016 to 28-09-2016 | 07-06-2016 to 07-09-2016 | 0.658262 (65.83 %) | 0.651895 (65.19 %)
LSTM RNN Network             | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-07-2016 to 28-04-2016 | 0.658262 (65.82 %) | 0.630651 (63.06 %)
LSTM Bi-RNN Network          | Past 90 Days   | 28-06-2016 to 28-09-2016 | 07-06-2016 to 07-09-2016 | 0.536496 (53.65 %) | 0.529887 (52.98 %)
LSTM Bi-RNN Network          | Past 90 Days   | 28-06-2016 to 28-09-2016 | 28-07-2016 to 28-04-2016 | 0.536496 (53.65 %) | 0.505335 (50.05 %)

Furthermore, we tried running Long Short-Term Memory (LSTM) recurrent neural networks, which have been shown to outperform other recurrent neural network algorithms, specifically for language modelling [16]. However, since in our case there is no sequence-to-sequence connection between the current and previous activations of the sequential query patterns, the maximum predictive accuracies the LSTM recurrent neural network could produce were 63.06% and 65.19% on the two testing datasets. The bi-directional recurrent neural network performed slightly worse than the LSTM network, producing predictive accuracies of 52.98% and 50.05% on the two testing datasets.
V. CONCLUSION
In the present work we have described a tuned, fully-connected CNN that outperformed its variants and other state-of-the-art machine learning techniques, specifically for query-to-category classification across several eBay Classifieds platforms. Our results add to the evidence that mapping numeric vectors into a random, uniformly distributed embedding space is more suitable, both computationally and performance-wise, than word2vec, specifically for datasets with a limited vocabulary corpus (between 10,000 and 15,000 words) and few words (two to three) in each query document.
VI. ACKNOWLEDGEMENT
The first and second authors are grateful to Johann Schweyer for his contribution to query normalization and aggregation. We are also extremely thankful to Brent Mclean, VP and CTO of eBay Classifieds, for his kind support and encouragement throughout this dominant category prediction project.
REFERENCES

[1] Y. Kim, "Convolutional neural networks for sentence classification," arXiv preprint arXiv:1408.5882, 2014.
[2] G. E. Hinton, N. Srivastava, A. Krizhevsky, I. Sutskever, and R. R. Salakhutdinov, "Improving neural networks by preventing co-adaptation of feature detectors," arXiv preprint arXiv:1207.0580, 2012.
[3] Z. K. Malik, A. Hussain, and Q. J. Wu, "Multilayered echo state machine: A novel architecture and algorithm," vol. PP, no. 99, 2016.
[4] S. Deerwester, S. T. Dumais, G. W. Furnas, T. K. Landauer, and R. Harshman, "Indexing by latent semantic analysis," Journal of the American Society for Information Science, vol. 41, no. 6, p. 391, 1990.
[5] J. Gao, J.-Y. Nie, G. Wu, and G. Cao, "Dependence language model for information retrieval," in Proceedings of the 27th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 2004, pp. 170-177.
[6] Q. V. Le and T. Mikolov, "Distributed representations of sentences and documents," in ICML, vol. 14, 2014, pp. 1188-1196.
[7] Z. K. Malik, A. Hussain, and J. Wu, "Novel biologically inspired approaches to extracting online information from temporal data," Cognitive Computation, vol. 6, no. 3, pp. 595-607, 2014.
[8] Z. K. Malik, A. Hussain, and Q. J. Wu, "An online generalized eigenvalue version of Laplacian eigenmaps for visual big data," Neurocomputing, vol. 173, pp. 127-136, 2016.
[9] R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, and P. Kuksa, "Natural language processing (almost) from scratch," Journal of Machine Learning Research, vol. 12, no. Aug, pp. 2493-2537, 2011.
[10] R. Johnson and T. Zhang, "Effective use of word order for text categorization with convolutional neural networks," arXiv preprint arXiv:1412.1058, 2014.
[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[12] Y. Shen, X. He, J. Gao, L. Deng, and G. Mesnil, "Learning semantic representations using convolutional neural networks for web search," in Proceedings of the 23rd International Conference on World Wide Web. ACM, 2014, pp. 373-374.
[13] W.-t. Yih, K. Toutanova, J. C. Platt, and C. Meek, "Learning discriminative projections for text similarity measures," in Proceedings of the Fifteenth Conference on Computational Natural Language Learning. Association for Computational Linguistics, 2011, pp. 247-256.
[14] N. Kalchbrenner, E. Grefenstette, and P. Blunsom, "A convolutional neural network for modelling sentences," arXiv preprint arXiv:1404.2188, 2014.
[15] Y. Zhang and B. Wallace, "A sensitivity analysis of (and practitioners' guide to) convolutional neural networks for sentence classification," arXiv preprint arXiv:1510.03820, 2015.
[16] M. Sundermeyer, R. Schlüter, and H. Ney, "LSTM neural networks for language modeling," in Interspeech, 2012.