EENMF: An End-to-End Neural Matching Framework for E-Commerce Sponsored Search
Wenjin Wu, Guojun Liu, Hui Ye, Chenshuang Zhang, Tianshu Wu, Daorui Xiao, Wei Lin, Xiaoyu Zhu
Alibaba Group
{kevin.wwj, guojun.liugj, yehui.yh, chenshuang.zcs, shuke.wts, daorui.xdr, yangkun.lw, benjamin.zxy}@alibaba-inc.com

ABSTRACT
E-commerce sponsored search contributes an important part of revenue for an e-commerce company. In consideration of effectiveness and efficiency, a large-scale sponsored search system commonly adopts a multi-stage architecture. We name these stages ad retrieval, ad pre-ranking and ad ranking. Ad retrieval and ad pre-ranking are collectively referred to as ad matching in this paper. In the ad matching stage, two important problems need to be addressed. First, under the keyword-based mechanism of traditional sponsored search, it is a great challenge for advertisers to identify and collect enough relevant bid keywords for their ads. Due to improper keyword bidding, advertisers cannot get their desired ad impressions; meanwhile, sometimes no ads are displayed to users for long-tail queries. These issues lead to inefficiency. Second, deep models with personalized features have been successfully employed for click prediction in the ranking stage. However, because of their computational complexity, deep models with personalized features have not been effectively and efficiently applied in the ad matching stage. To address these two problems, we propose an end-to-end neural matching framework (EENMF) that models two tasks: vector-based ad retrieval and neural-network-based ad pre-ranking. Under this deep matching framework, vector-based ad retrieval harnesses the user's recent behavior sequence to retrieve relevant ad candidates without the constraint of keyword bidding. Simultaneously, the deep model performs a global pre-ranking of ad candidates from multiple retrieval paths effectively and efficiently. Besides, the proposed model optimizes a pointwise cross-entropy loss, which is consistent with the objective of the prediction models in the ranking stage. We conduct extensive evaluation to validate the performance of the proposed framework. On the real traffic of a large-scale e-commerce sponsored search, the proposed approach significantly outperforms the baseline.
KEYWORDS
Sponsored search, Ad retrieval, Ad matching, Ad pre-ranking, Deep learning
INTRODUCTION

When users search in the search engine, a sponsored search platform enables advertisers to target advertisements (ads) to users' search requests. (In the remainder, ad(s) is used to refer to advertisement(s).) Along with organic results, the search engine presents sponsored results to users in response to their search requests. Precisely, in e-commerce sponsored search, the organic results are the products named "items" on the Taobao platform, while the ads are also a special kind of item.

In consideration of effectiveness and efficiency, large-scale search or recommendation systems often adopt a multi-stage search architecture [21]. In our e-commerce sponsored search system, a three-stage architecture has been adopted over the past few years. Sequentially, we name these three stages ad retrieval, ad pre-ranking and ad ranking. In this paper, we refer to the ad retrieval and pre-ranking stages together as matching. We mainly focus on proposing an efficient and effective neural matching model for the ad matching stage in sponsored search.

In our e-commerce sponsored search, two types of problems exist in the ad matching stage. First, in the traditional sponsored search system, the keyword-based mechanism provides a simple ad retrieval method in which the whole burden is on advertisers, making it a big challenge for advertisers to optimize bids. It is quite impossible for advertisers to identify and collect all the relevant bid keywords to target their ads. Due to improper keyword bidding, advertisers may not get the desired ad impressions; meanwhile, no ads are displayed in the search result pages for some long-tail queries. To alleviate this problem, search engines often provide an advanced matching service to advertisers, which rewrites the user's original query into many related bid keywords, enriching the connections between the user's query and the bid keywords of ads. The query rewriting approach is limited to matching only against predefined bid keywords of ads. Thus, the keyword-based mechanism is still unable to achieve a good match between user query requests and advertisements. Second, various types of personalized information, such as the user profile, long-term click behaviors and real-time click behaviors, have been proved effective for click prediction models in the ranking stage [7, 32, 38].
Figure 1: An E-commerce Sponsored Search System Overview. This paper focuses on two parts, vector-based ad retrieval and ad pre-ranking, which are highlighted in red. Ad retrieval and ad pre-ranking are collectively referred to as ad matching in this paper.

However, because of computational complexity, deep models exploiting personalized information, such as user recent click behaviors, have not been utilized in the matching stage.

To address the inefficiency of the keyword-based mechanism and the lack of uniform modeling of user personalized information in the matching stage, inspired by recent work on multi-task learning [3, 22], we propose a practical neural matching model to fulfill two tasks: vector-based ad retrieval and neural-network-based ad pre-ranking.
Vector-based ad retrieval exploits the user's recent behavior sequence to select relevant ad candidates without the constraint of keyword bidding. For the vector-retrieved ads without bidding information, a bid optimizing strategy called Optimized Cost Per Click (OCPC) [34] is applied to determine how much the advertisers will be charged if their ads are clicked. Simultaneously, there often exist various ad retrieval paths in a sponsored search system, such as keyword-based retrieval and vector-based retrieval. We adapt the model to perform the global pre-ranking of ad candidates. The model is trained to optimize the cross-entropy loss under the guidance of search impression logs, which makes the optimized objectives of the matching and ranking stages consistent. Finally, we conduct online evaluation in the real-world operational environment of our large-scale e-commerce sponsored search.

The main contributions of this paper are summarized as follows:

• We propose a novel multi-task neural matching framework for a large-scale e-commerce sponsored search, which is trained on user search sessions. The proposed matching model optimizes a cross-entropy loss function that is consistent with the objective of the prediction models in the ranking stage.
• Under the framework, we implement vector-based ad retrieval to overcome the shortcomings of keyword-based ad retrieval and provide advertisers a keyword-free way to advertise their products.
• The proposed approach is deployed in a large-scale e-commerce sponsored search platform. The online evaluation reveals that the proposed method retrieves and selects more relevant ads than the baseline methods.
In order to better understand the proposed neural matching framework, Figure 1 shows the overall architecture and data flow of the sponsored search system.

NEURAL MATCHING MODEL

Figure 2: Neural Matching Model. An attentive GRU-RNN is adapted to model the user behavior sequence, and the model fulfills two tasks: vector-based retrieval and deep ad pre-ranking.
The proposed model architecture is shown in Figure 2. Horizontally, the architecture consists of two parallel sub neural networks: one network for user requests (which we term Qu-Net) on the left side, and the other for advertisements (which we term Ad-Net) on the right side. User request features and advertisement features are fed to Qu-Net and Ad-Net as inputs respectively, and the output is whether the ad is clicked (labeled 1) or not clicked (labeled 0). Vertically, the underlying model architecture can be divided into four parts from bottom to top: the input and embedding layers, the encoding layers, the shared layers and the task-specific layers. We detail these layers in the following.
The input instance of the proposed model consists of four types of features: the query feature, the user profile feature, the user's previous behavior feature and the target ad item feature. User behavior is a behavior sequence X = \{q, i_1, i_2, \ldots, i_m\}, where q is the user's current query and i_k indicates the k-th behavior item that the user acted on before this search request. X is ordered by the time of user behavior. Each behavior item is represented by ID features, including the item ID, shop ID, brand ID, the term IDs of the item's title and the corresponding search query feature. In Ad-Net, an ad item is also represented by ID features like a behavior item, except for the search query feature.

Each unique ID space has a separately learned embedding matrix. Very large cardinality ID spaces (e.g., product item IDs and search query terms) are truncated by keeping the top entries after sorting by their frequency in the search logs. Out-of-vocabulary values are set to zero and mapped to the embedding of zero. Multivalued ID feature embeddings, such as the word IDs of an item title, are summed before being fed to the next layer. Remarkably, sparse features in the same ID space share the same underlying embedding matrix. Since an ad item is also a product, ad item features in Ad-Net share all embedding matrices with behavior item features in Qu-Net. Sharing embeddings is important for improving generalization, speeding up model training and reducing model parameters.
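As a concrete illustration, the sketch below (TensorFlow; the vocabulary size, dimensions and field shapes are our assumptions, not the production configuration) shows one shared embedding table with index 0 reserved for out-of-vocabulary values, and sum-pooling of a multivalued feature:

```python
import tensorflow as tf

VOCAB_SIZE = 1_000_000   # top-frequency IDs kept per ID space (hypothetical size)
EMB_DIM = 64

# One table per ID space, shared by Qu-Net behavior items and Ad-Net ad items.
# Index 0 is reserved for padding / out-of-vocabulary values.
table = tf.Variable(tf.random.uniform([VOCAB_SIZE + 1, EMB_DIM], -0.05, 0.05))

def embed(ids):
    """ids: int tensor; 0 means OOV/padding and is mapped to the zero vector."""
    vecs = tf.nn.embedding_lookup(table, ids)
    mask = tf.cast(ids > 0, tf.float32)[..., None]   # zero out OOV/padding rows
    return vecs * mask

def embed_title(term_ids):                            # term_ids: [batch, num_terms]
    # Multivalued ID features (e.g. title term IDs) are sum-pooled before the next layer.
    return tf.reduce_sum(embed(term_ids), axis=1)
```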
When a user searches in the e-commerce mobile app, she browses and clicks product items in a streaming fashion. For instance, when a user wants to buy shoes and searches for "shoes", she usually browses the result list, clicks the shoes she likes and compares them before adding to cart. Intuitively, items in the same behavior sequence are correlated. In other words, the user's previous behavior items are predictive of the next behavior item. This type of relation has been proved effective in recommendation systems [12, 39].

In the encoding layer, we use a recurrent neural network (RNN) to encode the user behavior sequence. We consider the latest m previous behaviors, padding with a default symbol to the fixed size m if the number of previous behaviors is less than m. We adopt GRUs, since GRUs have fewer parameters and competitive performance compared to LSTMs [10].

In e-commerce sponsored search, the previous behavior sequence may contain items unrelated to the current search query. For example, the user's current search query is "red dress", while her previous behavior sequence contains dress items searched by "dress" and shoe items searched by "shoes". Obviously, these two categories of product items are of different relevance to the current search query "red dress". Thus, we adopt a query-based attention network to address this problem. Vector h_j is the GRU hidden output at step j. We take h_j as the representation of the j-th behavior item, and represent the behavior sequence as a weighted sum of the vector representations of all behavior items. The attention weights make it possible to assign proper credit to items according to their importance to the current query request. Mathematically, it is formulated as follows:

\bar{h} = \sum_{t=1}^{m} w_t h_t    (1)

w_t = \frac{\exp(net(h_t, Q; \theta))}{\sum_{i=1}^{m} \exp(net(h_i, Q; \theta))}    (2)

where w_t is the weight for h_t, and net(h_t, Q; \theta) is a two-layer attention network parameterized by \theta, with the hidden state h_t and the query feature embedding Q as inputs. \bar{h} is the vector representation of the user's previous behavior items. As for Ad-Net, we directly use one fully connected layer to map the ad item embeddings to a vector with the same dimension as the encoding vector \bar{h} of the user's previous behavior items.

After the encoding layers, we get a user query request vector and an ad vector of the same dimension, which may not fit well in the same vector space. Inspired by DSSM models [18], we stack two shared nonlinear fully connected layers over the encoding layers of Qu-Net and Ad-Net to bind these two types of vectors. Besides, hard parameter sharing greatly reduces the risk of overfitting [31]. Furthermore, let h_l denote the output of the l-th hidden layer:

h_l = f(h_{l-1})    (3)

where f(·) is a non-linear activation function. The outputs of the shared layers are the query request's representation and the ad's, both with dimension d (e.g., 128), which are fed to the next multi-task-specific layers.
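The following is a minimal TensorFlow sketch of the encoding and shared layers (Equations 1-3). The layer sizes, the tanh activation inside the attention net, and the omission of the user profile features are our assumptions for brevity, not the paper's exact configuration:

```python
import tensorflow as tf

m, EMB, HID, D = 6, 64, 128, 128   # window size, embedding dim, GRU dim, output dim d

behaviors = tf.keras.Input(shape=(m, EMB))   # embedded behavior items i_1..i_m
query     = tf.keras.Input(shape=(EMB,))     # embedded current query Q
ad        = tf.keras.Input(shape=(EMB,))     # embedded ad item features

# Qu-Net encoding: GRU over the behavior sequence; h_j is the hidden output at step j.
h = tf.keras.layers.GRU(HID, return_sequences=True)(behaviors)         # [B, m, HID]

# Two-layer attention net net(h_t, Q; theta), scored against the query (Eq. 2).
q_rep = tf.keras.layers.RepeatVector(m)(query)                         # [B, m, EMB]
score = tf.keras.layers.Dense(64, activation="tanh")(
    tf.keras.layers.Concatenate()([h, q_rep]))
score = tf.keras.layers.Dense(1)(score)                                # [B, m, 1]
w = tf.keras.layers.Softmax(axis=1)(score)
h_bar = tf.reduce_sum(w * h, axis=1)                                   # Eq. 1: weighted sum

# Ad-Net encoding: one fully connected layer to the same dimension as h_bar.
ad_enc = tf.keras.layers.Dense(HID, activation="relu")(ad)

# Shared layers (Eq. 3): the SAME two Dense layers applied to both towers.
shared1 = tf.keras.layers.Dense(256, activation="relu")
shared2 = tf.keras.layers.Dense(D, activation="relu")
v_qu = shared2(shared1(h_bar))    # query request representation V_qu
v_a  = shared2(shared1(ad_enc))   # ad representation V_a

encoder = tf.keras.Model([behaviors, query, ad], [v_qu, v_a])
```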
Through the previous layers, we obtain representations of both the query request and the ad. Our model has two tasks to fulfill: vector-based ad retrieval and ad pre-ranking. For these two different tasks, we apply specific network layers and optimization objectives.

Vector-based ad retrieval. For the vector-based ad retrieval task, with V_{qu} and V_a as inputs, the relevance score is computed by cosine similarity:

cosine(V_{qu}, V_a) = \frac{V_{qu} \cdot V_a}{\|V_{qu}\| \cdot \|V_a\|}    (4)

The larger the cosine value, the more similar V_{qu} and V_a are. We use the cross-entropy loss as the objective to train the model:

C_v(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log P(V_{qu,i}, V_{a,i}) + (1 - y_i) \log(1 - P(V_{qu,i}, V_{a,i})) \right]    (5)

P(V_{qu,i}, V_{a,i}) = \frac{\exp(\gamma \, cosine(V_{qu,i}, V_{a,i}))}{1 + \exp(\gamma \, cosine(V_{qu,i}, V_{a,i}))}    (6)

where \gamma is a tuning factor determined on the validation set, and y_i \in \{0, 1\} is the target: if the user clicked the current ad, the instance is positive, otherwise negative. The loss is summed over all samples in a mini-batch (128 samples in our experiments). At serving time, ad retrieval reduces to a nearest neighbor search problem. Product quantization is used to implement K-nearest-neighbor search, which is efficiently supported by the Faiss library [19]. In detail, we apply a forward inference of the current query request through Qu-Net to obtain the normalized V_{qu} vector, and then use V_{qu} to search the ads' Faiss index to obtain relevant ads.

Ad pre-ranking. In our scenario, thousands of ads are recalled through bidword-based ad retrieval and vector-based retrieval. The ad pre-ranking stage needs to score and select the top N (e.g., 200) ad candidates for the ranking stage. Different from the baseline approach, which uses a static score built into the ads' inverted index to select the top N ads, we employ a deep model with personalized features to score and select the top N ads. Here we still model it as a click-through rate (CTR) prediction problem, and use the cross-entropy loss as the objective, which is consistent with the objective of the CTR prediction model in the ranking stage. In order to model the interaction between query request features and ad features, we add a nonlinear fully connected layer over them. The loss function is described in Equation 7:

C_r(\theta) = -\frac{1}{N} \sum_{i=1}^{N} \left[ y_i \log P'(V_{qu,i}, V_{a,i}) + (1 - y_i) \log(1 - P'(V_{qu,i}, V_{a,i})) \right]    (7)

P'(V_{qu}, V_a) = FC(V_{qu}, V_a; \theta)    (8)

where the lightweight network FC(V_{qu}, V_a; \theta) consists of one fully connected layer and a logistic regression layer, with the concatenation of V_{qu} and V_a as input. The reasons for choosing FC(V_{qu}, V_a; \theta) are as follows. First, most of the matrix computation (V_{qu}, V_a)W in the fully connected layer can be calculated offline: as described in Equation 9, where W is the parameter matrix of this layer, (V_{qu}, \vec{0})W is computed only once per query request, and (\vec{0}, V_a)W can be computed offline for ads in advance. Second, the lightweight network FC(V_{qu}, V_a; \theta) is flexible enough to incorporate other effective features, such as ID features or statistical features, similar to the wide part of Heng-Tze Cheng et al.'s wide & deep model [8].

(V_{qu}, V_a) W = (V_{qu}, \vec{0}) W + (\vec{0}, V_a) W    (9)

Finally, our model is trained jointly with the objective:

C_{joint}(\theta) = \alpha C_v(\theta) + (1 - \alpha) C_r(\theta)    (10)

where \alpha is a hyperparameter that balances the effects of the two tasks.
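A quick numpy check of the Equation 9 decomposition (variable names are illustrative, not the production code):

```python
import numpy as np

d = 128
rng = np.random.default_rng(0)
W = rng.standard_normal((2 * d, d))   # parameter matrix of the interaction FC layer
v_qu = rng.standard_normal(d)         # computed once per query request
v_a = rng.standard_normal(d)          # precomputable offline for every ad

full = np.concatenate([v_qu, v_a]) @ W   # (V_qu, V_a) W
split = v_qu @ W[:d] + v_a @ W[d:]       # (V_qu, 0) W + (0, V_a) W

assert np.allclose(full, split)          # Equation 9 holds exactly
```

Only the bias, the nonlinearity and the final logistic layer then remain to be computed per (request, ad) pair at serving time.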
EXPERIMENTS

We use the search logs from both sponsored search and organic search to reorder each user's historical behaviors along the timeline, and then construct the training dataset and the test dataset. An instance records the complete information about an ad impression, including the user profile, the query, the user's recent behaviors and the corresponding behavior on the current ad (click or non-click). The window size of user recent behaviors is 6. In our experiments, if the user clicked the current ad then the instance is positive, otherwise negative. About 3 × … instances from the search logs are sampled per day as the data set. We use the samples of every three consecutive days for training and test on the samples of the next day. We divide the search logs of 12 consecutive days into three groups of training and testing data sets for training and evaluation. We employ the distributed TensorFlow machine learning platform deployed on a large-scale computing cluster to train our neural networks.
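As an illustration of this instance construction, a minimal sketch follows; the event schema and field names are hypothetical, not the production log format:

```python
from collections import deque

WINDOW = 6   # window size of user recent behaviors, as stated above

def build_instances(user_events):
    """user_events: one user's sponsored + organic search events, time-ordered.
    Each event is assumed to look like
    {"kind": "ad_impression" | "item_click", "query": ..., "item": ..., "clicked": bool}."""
    recent = deque(maxlen=WINDOW)   # the latest WINDOW behavior items
    instances = []
    for ev in user_events:
        if ev["kind"] == "ad_impression":
            instances.append({
                "query": ev["query"],
                "behaviors": list(recent),    # padded to WINDOW downstream
                "ad": ev["item"],
                "label": int(ev["clicked"]),  # click => positive instance
            })
        if ev["clicked"]:                     # clicked items extend the behavior sequence
            recent.append((ev["query"], ev["item"]))
    return instances
```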
Comparison of different network architectures. In order to investigate the effectiveness of the proposed model, we compare five network architectures and the baseline model DSSM [18]. These models are described as follows:

• DNN: it employs mean pooling to represent the user's previous behaviors, ignoring the order information and treating all behavior items equally.
• GRU-RNN: it applies a GRU-cell-based RNN to model the user's previous behavior sequence.
• Attention-DNN: it adds a query-based attention network over DNN, distinguishing the different role that each behavior item plays when predicting the current interest.
• Attention-GRU-RNN: similarly, it adds the query-based attention network over GRU-RNN, considering both order and importance information.
• Concatenate-DNN: it directly concatenates the embeddings of the user's previous behaviors, letting the raw information feed into later layers.
• DSSM [18]: in DSSM, a query is paired with the titles of the ad documents clicked for that query. We extracted the query-title pairs as positive samples for model training from ad click logs using a procedure similar to [18], and randomly sampled four negative documents per positive sample. Since a user's search query is often short, we enrich the query with the titles of recent behavior items. We also use the terms in the query and the ad's title as input features.
• Search2Vec [14]: the state-of-the-art approach for broad match in sponsored search. Since our offline evaluation is based on search session logs and user requests are sparse, we do not conduct offline evaluation for Search2Vec. However, we trained the Search2Vec model and conducted the online evaluation with real search traffic, as described in Subsection 3.3.

To measure the overall performance of each model, we employ the Area Under the ROC Curve (AUC) as the evaluation metric, which is widely used in industry. AUC measures whether the clicked instances are ranked higher than the non-clicked ones. For a fair model comparison, we tune model parameters on a validation dataset (5% of the samples from the training dataset, not used for training) to ensure that each model achieves its best performance.

Table 1: Comparison of Different Models for User Behavior Sequence on Task1 (vector-based ads retrieval) and Task2 (ads pre-ranking)

model type          Task1 AUC   Task2 AUC
DNN                 0.6657      0.6655
GRU-RNN             0.6760      0.6758
Attention-DNN       0.6762      0.6760
Attention-GRU-RNN
Concatenate-DNN     0.6795      0.6796
DSSM [18]           0.6200      -

Table 1 reports the overall AUC of all models on the test dataset. These results demonstrate that RNN-based models outperform DNN-based models, and that models with the attention mechanism outperform the ones without, on both tasks. Specifically, RNN brings about 0.01 AUC improvement compared with DNN. The attention mechanism brings about 0.01 AUC improvement for DNN and 0.012 for RNN. Concatenate-DNN outperforms DNN with about 0.01 AUC improvement. These results confirm the hypothesis that the user's previous behavior item sequence is predictive of the next behavior item, but that previous behavior items are not equally important. The overall evaluation results show the effectiveness of the GRU-RNN with attention model. Besides, the result of the original DSSM [18] method indicates that it does not fit ad retrieval very well. There are two reasons for that. First, DSSM [18], solely employing term features, cannot distinguish between positive and negative samples very well; we find that in the training dataset both the positive and negative samples seem to be relevant to their corresponding query in textual content. Second, the DSSM [18] method employs a negative-sampling-based pairwise loss, which does not directly aim at optimizing CTR prediction.

In the following, we conduct a more detailed analysis of our model in order to analyze the individual effect of different components or parameters on the performance. In each experiment, we only change one component or parameter, while the rest are fixed.
Influence of joint training. We compare our jointly trained model with the singly trained models. As described in Section 2.5, our model is trained to fulfill two tasks: vector-based ads retrieval and ads pre-ranking. We conduct this comparison based on the GRU-RNN with attention model.

Table 2: Comparison of joint training and single training on Task1 (vector-based ads retrieval) and Task2 (ads pre-ranking)

                        Task1 AUC   Task2 AUC
single training task1   0.6765      -
single training task2   -           0.6798
joint training          0.6885      0.6886

As shown in Table 2, the jointly trained model achieves better performance than the singly trained ones. For the vector-based ads retrieval task, the joint model leads to about 0.0120 AUC improvement. As for the ads pre-ranking task, the joint model leads to about 0.0088 AUC improvement. Besides, when the two tasks are trained individually, Task2's AUC is larger than Task1's, which is consistent with the empirical idea that a lightweight network FC(V_{qu}, V_a; \theta) is more powerful than the cosine interaction between V_{qu} and V_a. One possible reason is that joint training tends to learn more expressive representations of user requests and ads. However, we find that the two tasks have almost no significant difference in the AUC metric when they are jointly trained.

Importantly, when the sponsored search system serves online, these two tasks (vector-based ads retrieval and ads pre-ranking) need to be served together. Considering online serving's efficiency and convenience, one joint model is a better choice than two single models.

Influence of shared layers. In order to analyze the effect of the shared layers described in Section 2.4, we carry out a comparison experiment; the result is shown in Table 3. It can be observed that shared layers are consistently better than non-shared ones on both tasks. In one sense, sharing layers is a type of interaction between the query request and ads. Besides, hard parameter sharing greatly reduces the risk of overfitting [31].
Table 3: Comparison of shared vs. non-shared layers for Task1 (vector-based ads retrieval) and Task2 (ads pre-ranking)

            Task1 AUC   Task2 AUC
share       0.6820      0.6821
non-share   0.6761      0.6794

Influence of hyperparameter γ. For the vector-based ad retrieval task, the relevance score between V_{qu} and V_a is computed by cosine similarity. As shown in Equation 6 of Section 2.5, P(V_{qu,i}, V_{a,i}) is the predicted value. We train the models using different γ values; the evaluation results are shown in Table 4.

Table 4: Comparison of different parameter γ for the vector-based ad retrieval task

γ   (mean, var, [min, max]) of predicted value   AUC
1   (0.2690, 0.0000, [0.2690, 0.2710])           0.4999
3   (0.0680, 0.0260, [0.0475, 0.2300])           0.6642
6   (0.0570, 0.0443, [0.0060, 0.3000])           0.6866

The performance is strongly affected by γ. When γ is 6, the model achieves the best performance. The mean of the final predicted value is 0.0570, and the minimum and maximum values are 0.0060 and 0.3000 respectively. We find that the mean predicted value 0.0570 is the closest to the ratio of positive instances in the training dataset.
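To see why a small γ collapses the predictions, note that Equation 6 is just a logistic squashing of the cosine score; the following bound is our own derivation, not a statement from the experiments:

```latex
P(V_{qu}, V_a) = \sigma\!\left(\gamma \cdot \mathrm{cosine}(V_{qu}, V_a)\right),
\quad \mathrm{cosine} \in [-1, 1]
\;\Longrightarrow\;
P \in \left[\sigma(-\gamma),\; \sigma(\gamma)\right].
```

For γ = 1 this confines P to [σ(-1), σ(1)] ≈ [0.269, 0.731], and the observed values in Table 4 pile up at the lower bound σ(-1) ≈ 0.2690 with near-zero variance, so the model cannot separate clicks from non-clicks and AUC stays near 0.5. A larger γ spreads the predictions and lets the mean approach the positive-instance ratio, as noted above.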
Online evaluation. In this subsection, we conduct online A/B testing on the e-commerce sponsored search platform with 1% of the overall search traffic, lasting three days. Four metrics are used to evaluate the performance of the proposed approach:

• CTR = AdClickCount / AdPresentCount
• PR = AdPresentCount / RequestCount
• CPC = AdCostAmount / AdClickCount
• RPM = CTR × PPC
We deploy the proposed EENMF model for the two tasks, vector-based ads retrieval and ads pre-ranking, in the system.
Evaluation of vector-based ads retrieval. In our system, there exist two retrieval methods: the first is a graph-covering-based query rewriting method similar to [23], used to retrieve keyword-based ads; the second is BKR, a variant of the method proposed in [34] for broad match. Search2Vec [14] is the state-of-the-art approach for broad match in sponsored search. We therefore conduct four groups of experiments: (1) Search2Vec; (2) BKR; (3) EENMF; and (4) EENMF combined with BKR. For a fair comparison, all these methods employ the same user behavior information, including queries and recently clicked items, to recall ads. In the online A/B test, users, together with their queries, are randomly and evenly distributed to four buckets, and each experimental group is deployed in one bucket.

Table 5: Comparison of Average Online Metric Lift Rates for Task1 (vector-based ads retrieval)

Methods            PR     CTR    CPC    RPM
Search2Vec [14]    2.2%   0.5%   1.9%   2.4%
BKR [34]           2.1%   2.0%   2.1%   4.1%
EENMF              2.5%   1.9%   3.2%   5.1%
EENMF + BKR [34]   3.5%   2.7%   4.0%   6.7%

As shown in Table 5, the improvement in the PR and CTR metrics demonstrates that all these keyword-free, vector-based ad retrieval methods can recall more and better ad candidates in the ad matching stage. The lifts also illustrate that advertisers are able to receive more valuable traffic even if they choose keyword-free bidding advertising. Meanwhile, the lift in the RPM metric indicates the improvement of the sponsored search platform's monetization ability. Specifically, EENMF outperforms Search2Vec significantly. EENMF's performance is a little better than BKR's. The reason may be that the EENMF model is deeper and more expressive than both Search2Vec and BKR, and that the optimized objective of EENMF is more consistent with the objective of the ranking stage. Besides, the combination of EENMF and BKR achieves higher PR and RPM metrics, which is valuable to the ad platform.
Table 6: Average Online Metric Lift Rates for Task2 (ads pre-ranking)

Metric   Lift Rate
PR       0.0%
CTR      3.1%
CPC      -3.3%
RPM      -0.2%
Evaluation of ads pre-ranking. As for the ads pre-ranking task, the baseline is a heuristic method based on static ad-level quality scores from the indexes and the Jaccard similarity between the query and the ad in categories and properties. As shown in Table 6, the CTR increases by 3.1% and the CPC decreases by 3.3%, which proves the effectiveness of the proposed pre-ranking deep model. Overall, these online evaluation results demonstrate the significant effectiveness of the proposed approach.
EFFICIENCY OF ONLINE INFERENCE

In this section, we discuss the efficiency of online inference. In a large-scale e-commerce sponsored search system, it is critical to respond to user query requests in a timely manner. Usually, given a query request, it is very time-consuming to perform ad ranking in a deep-learning-based sponsored search system. Since there are quite a lot of ad candidates in the matching stage, the proposed deep models need to be served online efficiently. Reviewing the model architecture described in Section 2, we find that it is divided into two sub-networks: Qu-Net and Ad-Net. When the model serves an online request, for the user query request side, we just need one forward inference per query request through Qu-Net to obtain the V_{qu} vector. For the ad side, we conduct forward inference offline for all ads in the advertising repository using the trained network model, and these inferred ad vectors V_a are built into the index. When a new ad arrives, its vector V_a is inferred and written into the index by the updating system.
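A sketch of this serving scheme with the Faiss library follows; the index type, dimensions, file name and the qu_net_forward helper are our assumptions (the paper only states that product quantization is used):

```python
import faiss
import numpy as np

d = 128                                    # dimension of V_qu / V_a

# Offline: run Ad-Net over the ad repository, normalize, and build a PQ index.
ad_vecs = np.load("ad_vectors.npy").astype("float32")   # hypothetical vector dump
faiss.normalize_L2(ad_vecs)                # normalized vectors: inner product = cosine
index = faiss.index_factory(d, "IVF4096,PQ32", faiss.METRIC_INNER_PRODUCT)
index.train(ad_vecs)
index.add(ad_vecs)

# Online: one Qu-Net forward pass per request, then K-nearest-neighbor search.
v_qu = qu_net_forward(request).astype("float32").reshape(1, d)  # hypothetical helper
faiss.normalize_L2(v_qu)
scores, ad_ids = index.search(v_qu, 200)   # top-K candidate ads for pre-ranking
```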
RELATED WORK

Query rewriting has been well studied in the literature. Works on query rewriting fall into two categories: one is based on the relevance matching between queries and ads [5, 6], and the other is based on mining the co-occurrence of queries and ads from historical ad click logs [4, 36]. However, these methods cannot overcome the limitation of keyword-based ad retrieval. To address these problems, the most related work was conducted by Mihajlo Grbovic et al. [14], who designed a query-ad semantic matching approach based on embeddings of queries and ads, namely Search2Vec. The embeddings were learned by the skip-gram model [24] on user search session data in an unsupervised manner. Different from their work, we propose a deep-learning-based matching framework that realizes vector-based ad retrieval and global ad pre-ranking in an end-to-end manner, which is more compatible with new and long-tail ads and queries.

Another line of related work leverages deep learning techniques for the semantic matching problem in information retrieval and recommendation systems. In the context of information retrieval, many representation-focused approaches based on the Siamese architecture have been explored, especially for short text matching, such as DSSM [18]. The DeepCrossing model [32] is one of the methods that learn query and document text embeddings to predict click-through rate. Zhang et al. [38] proposed a framework that directly models the dependency on a user's sequential behaviors in the click prediction process through a recurrent network. Interaction-focused neural models [16] learn query-document matching patterns from word-level interactions. Wasi Ahmad et al. [1] proposed a joint framework trained on search sessions to predict the next query and rank the corresponding documents. Our work follows the representation-focused approach to learn user request and document representations, and jointly models the two tasks of retrieving ads and pre-ranking ads in sponsored search. In the context of recommendation, deep-learning-based models learn better representations of users' demands, items' characteristics and the historical interactions between them. Cheng et al. [8] proposed an app recommender system for Google Play with a wide & deep model. Covington et al. [12] posed recommendation as extreme multiclass classification and presented a deep-neural-network-based recommendation algorithm for video recommendation on YouTube. The DeepIntent model proposed in Zhai et al. [37] comprises a bidirectional recurrent neural network (BRNN) combined with an attention module. Among these previous works, the DeepIntent model [37] and DSSM [18] are the most similar to ours. However, there are two important differences. First, our model minimizes a pointwise cross-entropy loss, which is consistent with the objective of the CTR prediction model in the ranking stage. Second, we focus on employing an attention-based GRU-RNN model to learn the user search request, which consists of the user query and the recent behavior sequence.
CONCLUSION

This paper contributes an efficient and effective ad matching framework based on neural networks for the ad matching phase of large-scale e-commerce sponsored search. The optimized objective of the proposed matching model is consistent with the prediction model in the ranking phase, which improves the performance of the multi-stage architecture. The neural network model introduces personalized and real-time features into the ad matching stage, and we jointly fulfill the vector-based ad retrieval task and the global ad pre-ranking task in e-commerce sponsored search. Compared with the baseline methods, experimental results show that our ad matching framework achieves better performance. In the near future, we will introduce more features to the model, such as image features, and explore external memory networks [35] to model user behaviors.

ACKNOWLEDGMENT
The authors would like to thank their colleagues for their valuable support: Yan Zhang, Genbao Chen, Yuping Jiang, Hao Wan, Sheng Xu, Zhenkui Huang, Qing Ye, Tao Ma, Hang Xiang, Di Zhang, Hongbin Zhao, Jinhui Li, and Bo Wu.
REFERENCES

[1] Wasi Ahmad, Kai-Wei Chang, and Hongning Wang. 2018. Multi-Task Learning for Document Ranking and Query Suggestion. In ICLR.
[2] Ioannis Antonellis, Hector Garcia Molina, and Chi Chao Chang. 2008. Simrank++: Query Rewriting Through Link Analysis of the Click Graph. In VLDB, Vol. 1. VLDB Endowment, 408–421.
[3] Jing Bai, Ke Zhou, Guirong Xue, Hongyuan Zha, Gordon Sun, Belle Tseng, Zhaohui Zheng, and Yi Chang. 2009. Multi-task Learning for Learning to Rank in Web Search. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM '09). ACM, New York, NY, USA, 1549–1552.
[4] Paolo Boldi, Francesco Bonchi, Carlos Castillo, Debora Donato, Aristides Gionis, and Sebastiano Vigna. 2008. The query-flow graph: model and applications. In CIKM '08: Proceedings of the 17th ACM Conference on Information and Knowledge Management. ACM, New York, NY, USA, 609–618.
[5] Andrei Broder, Peter Ciccolo, Evgeniy Gabrilovich, Vanja Josifovski, Donald Metzler, Lance Riedel, and Jeffrey Yuan. 2009. Online expansion of rare queries for sponsored search. In WWW '09: Proceedings of the 18th International Conference on World Wide Web. ACM, New York, NY, USA, 511–520.
[6] Andrei Z. Broder, Peter Ciccolo, Marcus Fontoura, Evgeniy Gabrilovich, Vanja Josifovski, and Lance Riedel. 2008. Search advertising using web relevance feedback. In CIKM, James G. Shanahan, Sihem Amer-Yahia, Ioana Manolescu, Yi Zhang, David A. Evans, Aleksander Kolcz, Key-Sun Choi, and Abdur Chowdhury (Eds.). ACM, 1013–1022.
[7] Haibin Cheng and Erick Cantú-Paz. 2010. Personalized click prediction in sponsored search. In WSDM, Brian D. Davison, Torsten Suel, Nick Craswell, and Bing Liu (Eds.). ACM, 351–360.
[8] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, Rohan Anil, Zakaria Haque, Lichan Hong, Vihan Jain, Xiaobing Liu, and Hemal Shah. 2016. Wide & Deep Learning for Recommender Systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems (DLRS 2016). ACM, New York, NY, USA, 7–10.
[9] Yejin Choi, Marcus Fontoura, Evgeniy Gabrilovich, Vanja Josifovski, Maurício R. Mediano, and Bo Pang. 2010. Using landing pages for sponsored search ad selection. In WWW, Michael Rappa, Paul Jones, Juliana Freire, and Soumen Chakrabarti (Eds.). 251–260.
[10] Junyoung Chung, Caglar Gulcehre, Kyunghyun Cho, and Yoshua Bengio. 2014. Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling.
[11] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. 2015. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs). CoRR abs/1511.07289. arXiv:1511.07289
[12] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep Neural Networks for YouTube Recommendations. In RecSys, Shilad Sen, Werner Geyer, Jill Freyne, and Pablo Castells (Eds.). ACM, 191–198.
[13] John Duchi, Elad Hazan, and Yoram Singer. 2011. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res. 12 (July 2011), 2121–2159.
[14] Mihajlo Grbovic, Nemanja Djuric, Vladan Radosavljevic, Fabrizio Silvestri, Ricardo Baeza-Yates, Andrew Feng, Erik Ordentlich, Lee Yang, and Gavin Owens. 2016. Scalable Semantic Matching of Queries to Ads in Sponsored Search Advertising. In SIGIR. ACM, New York, NY, USA, 375–384.
[15] Mihajlo Grbovic, Nemanja Djuric, Vladan Radosavljevic, Fabrizio Silvestri, and Narayan Bhamidipati. 2015. Context- and Content-aware Embeddings for Query Rewriting in Sponsored Search. In SIGIR. ACM, 383–392.
[16] Jiafeng Guo, Yixing Fan, Qingyao Ai, and W. Bruce Croft. 2016. Semantic Matching by Non-Linear Word Transportation for Information Retrieval. In Proceedings of the 25th ACM International Conference on Information and Knowledge Management (CIKM '16). ACM, New York, NY, USA, 701–710.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2015. Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. CoRR abs/1502.01852. arXiv:1502.01852
[18] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry P. Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In CIKM. ACM, 2333–2338.
[19] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734.
[20] Günter Klambauer, Thomas Unterthiner, Andreas Mayr, and Sepp Hochreiter. 2017. Self-Normalizing Neural Networks. CoRR abs/1706.02515.
[21] Shichen Liu, Fei Xiao, Wenwu Ou, and Luo Si. 2017. Cascade Ranking for Operational E-commerce Search. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, New York, NY, USA, 1557–1565.
[22] Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation Learning Using Multi-Task Deep Neural Networks for Semantic Classification and Information Retrieval. In NAACL HLT 2015, The 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Denver, Colorado, USA, May 31 - June 5, 2015. 912–921.
[23] Azarakhsh Malekian, Chi-Chao Chang, Ravi Kumar, and Grant Wang. 2008. Optimizing Query Rewrites for Keyword-based Advertising. In Proceedings of the 9th ACM Conference on Electronic Commerce (EC '08). ACM, New York, NY, USA, 10–19.
[24] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. 2013. Distributed Representations of Words and Phrases and their Compositionality. In NIPS. Curran Associates, Inc., 3111–3119.
[25] Bhaskar Mitra, Fernando Diaz, and Nick Craswell. 2017. Learning to Match Using Local and Distributed Representations of Text for Web Search. In Proceedings of the 26th International Conference on World Wide Web (WWW '17). 1291–1299.
[26] Vinod Nair and Geoffrey E. Hinton. 2010. Rectified Linear Units Improve Restricted Boltzmann Machines. In ICML, Johannes Fürnkranz and Thorsten Joachims (Eds.). Omnipress, 807–814.
[27] Shumpei Okura, Yukihiro Tagami, Shingo Ono, and Akira Tajima. 2017. Embedding-based News Recommendation for Millions of Users. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '17). ACM, New York, NY, USA, 1933–1942.
[28] Hamid Palangi, Li Deng, Yelong Shen, Jianfeng Gao, Xiaodong He, Jianshu Chen, Xinying Song, and Rabab K. Ward. 2015. Deep Sentence Embedding Using the Long Short Term Memory Network: Analysis and Application to Information Retrieval. CoRR abs/1502.06922.
[29] Liang Pang, Yanyan Lan, Jiafeng Guo, Jun Xu, Shengxian Wan, and Xueqi Cheng. 2016. Text Matching As Image Recognition. In Proceedings of the Thirtieth AAAI Conference on Artificial Intelligence (AAAI'16). AAAI Press, 2793–2799.
[30] Prajit Ramachandran, Barret Zoph, and Quoc V. Le. 2017. Searching for Activation Functions. CoRR abs/1710.05941. arXiv:1710.05941
[31] Sebastian Ruder. 2017. An Overview of Multi-Task Learning in Deep Neural Networks. CoRR abs/1706.05098.
[32] Ying Shan, T. Ryan Hoens, Jian Jiao, Haijing Wang, Dong Yu, and JC Mao. 2016. Deep Crossing: Web-Scale Modeling Without Manually Crafted Combinatorial Features. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 255–262.
[33] Yelong Shen, Xiaodong He, Jianfeng Gao, Li Deng, and Grégoire Mesnil. 2014. Learning semantic representations using convolutional neural networks for web search. In WWW (Companion Volume), Chin-Wan Chung, Andrei Z. Broder, Kyuseok Shim, and Torsten Suel (Eds.). ACM, 373–374.
[34] Yan Su, Lin Wei, Wu Tianshu, Xiao Daorui, Wu Bo, and Liu Kaipeng. 2018. Beyond Keywords and Relevance: A Personalized Ad Retrieval Framework in E-Commerce Sponsored Search. https://arxiv.org/abs/1712.10110
[35] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. 2015. End-To-End Memory Networks. In NIPS. 2440–2448.
[36] David Sussillo. 2014. Random Walks: Training Very Deep Nonlinear Feed-Forward Networks with Smart Initialization. CoRR abs/1412.6558.
[37] Shuangfei Zhai, Keng-hao Chang, Ruofei Zhang, and Zhongfei Mark Zhang. 2016. DeepIntent: Learning Attentions for Online Advertising with Recurrent Neural Networks. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD '16). ACM, New York, NY, USA, 1295–1304.
[38] Yuyu Zhang, Hanjun Dai, Chang Xu, Jun Feng, Taifeng Wang, Jiang Bian, Bin Wang, and Tie-Yan Liu. 2014. Sequential Click Prediction for Sponsored Search with Recurrent Neural Networks. In AAAI. AAAI Press, 1369–1375.
[39] Guorui Zhou, Chengru Song, Xiaoqiang Zhu, Xiao Ma, Yanghui Yan, Xingya Dai, Han Zhu, Junqi Jin, Han Li, and Kun Gai. 2017. Deep Interest Network for Click-Through Rate Prediction. https://arxiv.org/abs/1706.06978