Personalized Embedding-based e-Commerce Recommendations at eBay
Tian Wang, [email protected], eBay Inc.
Yuri M. Brovman, [email protected], eBay Inc.
Sriganesh Madhvanath, [email protected], eBay Inc.
ABSTRACT
Recommender systems are an essential component of e-commerce marketplaces, helping consumers navigate massive amounts of inventory and find what they need or love. In this paper, we present an approach for generating personalized item recommendations in an e-commerce marketplace by learning to embed items and users in the same vector space. In order to alleviate the considerable cold-start problem present in large marketplaces, item and user embeddings are computed using content features and multi-modal onsite user activity respectively. Data ablation is incorporated into the offline model training process to improve the robustness of the production system. In offline evaluation using a dataset collected from eBay traffic, our approach was able to improve the Recall@k metric over the Recently-Viewed-Item (RVI) method. This approach to generating personalized recommendations has been launched to serve production traffic, and the corresponding scalable engineering architecture is also presented. Initial A/B test results show that compared to the current personalized recommendation module in production, the proposed method increases the surface rate by ∼

CCS CONCEPTS
• Computing methodologies → Learning from implicit feedback; Neural networks; •
Information systems → Personalization; Information retrieval; Recommender systems.

KEYWORDS
deep learning, personalization, recommender systems, e-commerce, cold-start
ACM Reference Format:
Tian Wang, Yuri M. Brovman, and Sriganesh Madhvanath. 2021. Personalized Embedding-based e-Commerce Recommendations at eBay. In Proceedings of ACM Conference (Conference'17). ACM, New York, NY, USA, 9 pages. https://doi.org/10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
Generating product recommendations for users is commonplace in e-commerce marketplaces. The eBay marketplace, with over 1.6 billion live items and over 183 million users, presents a unique set of challenges when it comes to generating recommendations. Traditional collaborative filtering and matrix factorization methods [1] produce poor results given the scale and extreme sparsity of eBay's user-item matrix [12]. With millions of new items listed daily, the cold start problem affects a substantial fraction of the inventory. Furthermore, over half of the live listings are single quantity, that is, they can be purchased by at most one buyer. After being purchased, items are removed from the site and are no longer accessible to users. Consequently, implicit user feedback signals such as clicks and purchases are extremely sparse. In this paper, we describe how we attempt to address these unique challenges to build an effective recommender system.

Generally speaking, e-commerce recommendations may be driven purely by the shopping context, or they may be personalized for a user based on a user profile. On an item listing page in an e-commerce marketplace, the seed item provides a strong indication of a user's shopping mission that may be used to guide the generation of recommendations. Indeed, there are several recommender systems based on the seed item context that are deployed at eBay [3, 4, 12]. However, on other landing pages such as the homepage, such seed item context is missing. There may be other occasions as well where we need to provide personalized recommendations for the user in the absence of a seed item, and the signal for generating recommendations is primarily available user information. Such "personalized recommendations" are our primary concern in this paper.

Figure 1: Screenshot of an eBay recommendations module where the user has been previously looking at games.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].
Conference'17, July 2017, Washington, DC, USA
© 2021 Association for Computing Machinery. ACM ISBN 978-x-xxxx-xxxx-x/YY/MM...$15.00. https://doi.org/10.1145/nnnnnnn.nnnnnnn
Figure 1 depicts a screenshot of an eBay recommendations module ("sponsored items based on your recent views") where the input is primarily taken from a user's activity on the marketplace. Information about a user to generate a profile may be captured explicitly, by asking the user to fill out a survey as part of onsite registration, or implicitly, e.g. by parsing the user's shopping history. Although explicit methods directly capture the user's interests, there are several limitations with this approach: an exhaustive set of potential interests is difficult to curate, user participation tends to be low, the input can be highly incomplete, and long-term interests may not capture specific short-term shopping missions. Due to these limitations, generating a user profile of interests is commonly performed using implicit user interaction data.

In this paper, we propose to model users as embeddings based on implicitly observed user shopping behavior. Using a two-tower deep learning model architecture [8], one tower for items and one for users, users and items are represented as points in the same vector space. In order to address the data sparsity and cold-start challenges mentioned above, (i) items are represented using content features only, and (ii) we expand the set of implicit user signals to incorporate multi-modal user onsite behaviors such as item clicking and query searching. Once trained, a k-nearest neighbor (KNN) search using a user embedding is used to generate a set of item recommendations for the user that reflect his or her implicit shopping behavior. At runtime, an additional Learning-To-Rank (LTR) model may be applied to this candidate item set in order to improve conversion, as was done in the work by Brovman et al. [4]. However, this paper primarily focuses on the method for generating personalized recommendation candidate items.
Since deploying a deep learning based recommendation model to a large scale dynamic industrial marketplace environment involves non-trivial engineering challenges, we also discuss details of our production engineering architecture. In summary, we contribute methods and techniques for:
(i) generating content-based item embeddings to address the cold-start problem
(ii) generating multi-modal user embeddings from various onsite events, such as item views and search queries
(iii) selectively dropping out training data to increase production model robustness
(iv) utilizing a cluster-based KNN algorithm to increase recommended item diversity
(v) deployment of the model and end-to-end recommender system to eBay's large scale industrial production setting

This paper is organized in the following manner. Section 2 summarizes related work from academia as well as industry. We describe the proposed core model architecture in Section 3. The dataset as well as the offline experiments to evaluate the model are presented in Section 4. To analyze model robustness in a production environment, we conduct a user data ablation analysis and propose solutions to improve model performance. We then turn our attention to the model prediction stage in Section 5, covering retrieval as well as the production engineering architecture, and discuss empirical A/B test results. Finally, we present a summary of this work and discuss future directions in Section 6.

2 RELATED WORK
The generation of personalized recommendations is a well studied problem in both academia and industry. Among the most popular techniques are matrix factorization models (e.g. [18, 22, 27]), which decompose a user-item matrix into user and item matrices and treat recommendation as a matrix imputation problem.
Despite seeing success in the Netflix competition for movie recommendation [22], traditional matrix factorization models require unique user and item identifiers, and do not perform as well in a dynamic e-commerce marketplace where existing items sell out and new items come in continuously. Utilizing content features such as the item title text becomes essential for tackling data sparsity and cold-start issues, and various methods have been proposed to address this within the matrix factorization framework. For example, content-boosted collaborative filtering [23] uses a content-based model to create pseudo user-item ratings. Factorization machines [26] and SVDFeature [5] directly incorporate user and item features into the model.

More recently, neural networks have been used to model more complex content features and combine them in a non-linear fashion. Covington et al. [8] proposed two-tower neural networks to embed users and items separately, and applied them to the task of generating video recommendations. He et al. [16] explored the use of a non-linear affinity function to replace the dot product between the user and item embedding layers for improved model capacity. Zhu et al. [33] and Gao et al. [13] further extended the idea by using graph structures for candidate recall and scaling the non-linear affinity function for an industrial setting, for e-commerce and video recommendations respectively. Our work takes inspiration from these efforts and the practical challenges and limitations posed by the eBay marketplace.

There is a different but related line of work focusing on using neural networks for LTR, such as Wide & Deep [6] and DIN [32]. However, our work is aimed at tackling the core candidate recall retrieval problem in an industrial setting, with the primary goal of efficiently selecting small groups of relevant items from a very large pool of candidate items. As mentioned earlier, an LTR model may be applied to this candidate item set to improve user engagement and conversion.
3 MODEL ARCHITECTURE
Our proposed approach for personalized recommendations is based on training a two-tower deep learning model to generate user embeddings and item embeddings at the same time. The architecture of the model is shown in Fig. 2 and described in detail below. We also mention the impact of adding specific model features on our primary offline model performance metric, Recall@K, described in detail in Section 4.3.

Following the work by Covington et al. [8], we model generating recommendations as a classification problem with the softmax probability:

    P(s_i | U) = exp(γ(v_i, u)) / Σ_{j ∈ V} exp(γ(v_j, u)),    (1)

where u ∈ R^D is a D-dimensional vector for the embedding of user U, v_i ∈ R^D is a D-dimensional vector for the embedding of item s_i, γ is the affinity function between user and item, and V is the set of all items available on eBay. As V could contain billions of items, it is infeasible to perform a full-size softmax operation. Negative sampling has to be used to limit the size of V, and we will discuss this further in Sec. 4.2. The whole model is trained to minimize the negative log-likelihood (NLL) of observed user clicks in the dataset. Next, we discuss the details of how eBay items are encoded by the model.

Figure 2: Model architecture with recurrent user representation.
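The classification objective above can be sketched in a few lines of NumPy. This is an illustrative snippet only: the function names are ours, V here stands for the sampled item set rather than the full inventory, and the temperature value follows Table 1.

```python
import numpy as np

def affinity(u, V, tau=0.1):
    # Temperature-scaled dot-product affinity (Eq. 8) between a user
    # embedding u of shape (D,) and item embeddings V of shape (n, D).
    return V @ u / tau

def click_nll(u, V, pos_idx, tau=0.1):
    # Negative log-likelihood of the clicked item under the softmax
    # over the sampled item set V, as in Eq. (1).
    logits = affinity(u, V, tau)
    logits = logits - logits.max()                 # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum())
    return -log_probs[pos_idx]
```

Summed over training impressions, `click_nll` is the NLL loss the model minimizes; Sec. 4.2 describes how the sampled item set is chosen.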
3.1 Item Tower
In the eBay marketplace, an item corresponds to a listing (or offer) of something for sale from a seller. In order to address the cold-start problem, an item in our model is represented not as a unique identifier (item id), but solely by its content-based features such as item title, category (e.g. mobile phone), and structured aspects (e.g. brand: Apple, network: Verizon, etc.). We chose not to incorporate historical item-behavior features (e.g. historical Click-Through-Rate, Purchase-Through-Rate) in our model. These features are not applicable to cold-start items and are constantly changing by their very nature, creating additional engineering complexity for storage and retrieval when building a large-scale production system.

For title and aspect features, we tokenize and convert raw text into token embeddings with embedding size D_text, and use the Continuous-Bag-of-Words (CBOW) [24] approach to generate title and aspect feature representations. The vocabulary for the title feature consists of approximately 400K tokens and is gathered from eBay item titles as opposed to a generic English language corpus. This allows us to better capture the distribution of item title tokens in the eBay marketplace, which is drastically different from traditional English language usage, as is demonstrated in the work by Wang and Fu [30]. Tokenization consists of replacing any character that is not a letter or a number with whitespace, and splitting by whitespace. The vocabulary for aspect features comes from the existing production database and contains around 100K aspect tokens. For the item category feature, we index the category values and map them into an embedding space of size D_category using a lookup table.
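The tokenization rule above can be sketched as follows. This is a sketch, not the production tokenizer; lowercasing is our assumption, since the paper does not specify casing.

```python
import re

def tokenize(text: str) -> list:
    # Replace any character that is not a letter or a digit with
    # whitespace, then split on whitespace.
    return re.sub(r"[^A-Za-z0-9]+", " ", text).lower().split()

# tokenize("Apple iPhone 12 Pro Max - 128GB (Unlocked)")
# -> ['apple', 'iphone', '12', 'pro', 'max', '128gb', 'unlocked']
```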
All of the embedding tables are trained from scratch with random initialization from the standard normal distribution N(0, 1).

After mapping all item features into a continuous space, the item feature embeddings z_i are concatenated and passed through an MLP with L hidden layers, H hidden dimensions, and Rectified Linear Units (ReLU) [14] as the non-linear activation function, to generate a D-dimensional item embedding v_i:

    z_i = concat(z_i^title, z_i^aspect, z_i^category),
    ṽ_i = MLP(z_i),
    v_i = ṽ_i / ||ṽ_i||.    (2)

The item embedding v_i is normalized to unit length. We now turn our attention to the user tower part of the model.

3.2 User Tower
A user's activity on an e-commerce marketplace is not limited to only viewing items. A user may also perform actions such as making a search query, adding an item to their shopping cart, adding an item to their watch list, and so on. These actions provide valuable signals for the generation of personalized recommendations. In this work, we have attempted to create a generic framework to incorporate such "multi-modal" user activity into the model. We have chosen to start with item viewing and search query user actions as representatives of item-based events and query-based events respectively, since these are the quintessential online shopping activities.

Item views/clicks are the most common form of implicit user feedback for an e-commerce marketplace, and generate large volumes of training data. For an item-based event z_i, we first map the corresponding item s_{z_i} to the corresponding embedding v_{z_i} as described in Sec. 3.1, and then concatenate it with a 4-dimensional vector e_{z_i} representing its event type.

User searches are a valuable signal for a recommender system as they are strong indications of explicit user interest or shopping mission.
In order to encode this user action into our framework, we model each search query as a "pseudo-item", with the actual query text taking the place of the item title, the "dominant" query category (predicted using a separate model) taking the place of the item category, and the aspects left empty. The event type embedding is concatenated to the item-based embedding. Adding this search query signal to the model resulted in a ∼4% improvement in our offline validation metric, Recall@20.

We denote for each user event z_i its corresponding vector representation E(z_i) as:

    E(z_i) = concat(v_{z_i}, e_{z_i}).    (3)

We explored different methods of generating a user embedding for a given user U with onsite activity Z = {z_1, ..., z_n}. The first approach is to bag all the event embeddings into a single vector by averaging over all embeddings. After combining all events into a single vector, we use an MLP with L layers, H hidden dimensions, and ReLU non-linear activation functions to generate a D-dimensional user embedding u:

    ũ = MLP((1/n) Σ_{i=1}^{n} E(z_i)),
    u = ũ / ||ũ||.    (4)

This Continuous Bag-of-Events is the simplest representation of user activity; however, in this approach the ordering of the events does not affect the outcome. In order to integrate the ordering information of user historical events, we also experimented with using a recurrent neural network to process the sequence of event embeddings. We start with gated recurrent units [GRU, 7], which have the update rule h_t = φ(x_t, h_{t-1}) defined by:

    r_t = σ(W_r x_t + U_r h_{t-1}),
    u_t = σ(W_u x_t + U_u (r_t ⊙ h_{t-1})),
    h̃_t = tanh(W x_t + U (r_t ⊙ h_{t-1})),
    h_t = (1 − u_t) ⊙ h_{t-1} + u_t ⊙ h̃_t,    (5)

where σ is the sigmoid function, x_t is the input at the t-th timestep, and ⊙ is element-wise multiplication.

We initialize the GRU recurrent hidden state l_0 as 0. For each event z_t in the user history Z, we feed the corresponding event embedding into the GRU cell in sequence as input at each timestep:

    l_t = φ(E(z_t), l_{t-1}),    l_0 = 0.    (6)

The D-dimensional user embedding u is generated by taking the average over the output vectors from all GRU steps:

    ũ = (Σ_{i=1}^{n} l_i) / n,
    u = ũ / ||ũ||.    (7)

Compared to the Continuous Bag-of-Events user representation, the recurrent user representation has access to the order of user activity, and in principle can better relate user relevance feedback to the user's interaction history. In our experiments, using this recurrent user representation in our model resulted in a ∼5% gain in our offline Recall@20 metric.
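The recurrent user representation can be sketched in PyTorch, the framework used for the model, as below. Note that `nn.GRU` implements the standard GRU update, which differs slightly from the variant written in Eq. (5) (where the reset gate also enters the update gate); the class and dimension names are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecurrentUserTower(nn.Module):
    """Average the GRU outputs over the event sequence to produce a
    unit-length user embedding (Eqs. 6-7)."""
    def __init__(self, d_item=64, d_event=4, d_user=64):
        super().__init__()
        # Input per step: the item embedding concatenated with a
        # 4-dimensional event-type vector, i.e. E(z_i) from Eq. (3).
        self.gru = nn.GRU(d_item + d_event, d_user, batch_first=True)

    def forward(self, events):            # events: (B, n, d_item + d_event)
        outputs, _ = self.gru(events)     # l_1 ... l_n; l_0 = 0 by default
        u = outputs.mean(dim=1)           # average over all GRU steps
        return F.normalize(u, dim=-1)     # unit-length user embedding u
```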
3.3 Affinity Function
The affinity function γ(v_i, u) between user U and item s_i is constructed as the dot product between the user and item embeddings. As user and item embeddings are normalized to have unit length (||u|| = ||v_i|| = 1), we add a temperature τ [31] term to our affinity function described in Eq. 1 as follows:

    γ(v_i, u) = (v_i · u) / τ.    (8)

The temperature hyperparameter was tuned to maximize the retrieval metric, Recall@k. In our experiments, we found that τ has a large impact on the performance of the trained model. By tuning τ on the validation set, we are able to increase Recall@20 by ∼

4 OFFLINE MODEL TRAINING
In this section, we describe the dataset we created to train our model, the importance of negative sampling during the training process, as well as the offline experiments performed to evaluate the effectiveness of the model.
4.1 Dataset
Since we treat the recommendation task as a classification problem (Eq. 1), in order to train our model we require positive and negative samples of items, where positive samples represent items relevant to the user and their shopping journey at impression time. eBay's e-commerce site (and mobile apps) features millions of listing pages corresponding to active items, and each page contains recommendations for other items, organized into horizontal modules representing "similar items", "related items", "seller's other items", "items based on recent views" and so on. These recommendations are typically powered by module-specific recall and ranking stages. Each module presents multiple items (up to 12 on desktop web), and there may be as many as 6 such modules on each listing page, distributed along the length of the page.

In order to collect positive and negative data samples, we looked at implicit user interactions with these merchandising recommendation modules on eBay's listing pages, captured in the form of offline log data. Only those listing page impressions that had a recorded click event on a recommendation module were considered for positive and negative data samples. Recommended items across recommendation modules that were clicked on by a user were selected as positive examples for the model target. Click events were chosen due to the volume of available data; however, other signals such as purchases may also be used. Recommended items that were not clicked on were treated as negative examples. Since clicking on a recommended item causes a new listing page to be loaded, each listing page impression typically resulted in one positive and multiple negative samples.
As we shall discuss in the next section, the sampling strategy used for negative examples is critical to achieving good model training performance.

The data needed for the user tower was gathered over a 30 day period going back from a given page impression. All of the positive and negative items were enriched with the necessary metadata about category, title, and aspects using offline tables. A typical training run would consist of around 10 million page impressions gathered from 8 days of data. A validation set with approximately 110K page impressions was collected following the end of the training data time frame, in order to avoid information leakage across the training and validation sets. In order to avoid biasing the outcome towards a few users with high engagement, a given user was only allowed to contribute one page impression to the training data and validation data. Therefore, we had 10 million unique users and 110K unique users in our training and validation datasets respectively. In order to better capture the distribution of users and their diverse shopping patterns, data was collected from logs from all of eBay's platform experiences: desktop web, mobile web, and the iOS and Android native apps.
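The labeling scheme above can be sketched as a filter over impression logs. The record layout here is hypothetical, not eBay's actual log schema.

```python
def label_impression(impression: dict):
    # Split one listing-page impression into positive (clicked) and
    # negative (impressed-but-unclicked) recommended items.
    clicked = [r for r in impression["recommended"] if r["clicked"]]
    unclicked = [r for r in impression["recommended"] if not r["clicked"]]
    if not clicked:
        # Impressions without a recorded click on a recommendation
        # module are discarded entirely.
        return None
    return {"positives": clicked, "negatives": unclicked}
```

Applied to a clicked impression, this typically yields one positive and multiple negatives, matching the description above.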
4.2 Negative Sampling
As previously mentioned, the number of available items |V| on the eBay marketplace is on the scale of one billion; therefore it is infeasible to perform a full-size softmax operation as defined in Eq. 1. We experimented with two approaches for sampling negative examples.

The first approach uses impressed but un-clicked items as negatives. In this approach, on the listing page, we take the item(s) clicked on as positive, and a subset of the items that were impressed but not clicked on as our negatives. Specifically, each positive item is paired with 8 un-clicked negative items. This approach failed in our initial model training, resulting in overfitted models that were unable to generalize. The main reason for this is that on the listing page, all of the impressed items from the current recommendation modules are very similar to the seed item, and this leads to the effect that the model is unable to distinguish positive from negative examples utilizing content-based item features.
We then experimented with using random items as negatives. Rather than randomly sampling items from the whole item pool (billions of items), we use in-batch negative sampling [17], taking the impressed but un-clicked items from other training examples within the same batch as negatives. This gives us a less complex and more efficient sampling strategy. This approach has some similarities with a popularity-based sampling approach, as the likelihood of an item serving as a negative sample is proportional to the number of times the item is presented to users.
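In-batch negative sampling can be sketched as below. For simplicity, this sketch scores each user against the clicked items of the other users in the batch (the diagonal holding the positives), whereas the approach described above draws negatives from the impressed-but-unclicked items of other examples; the loss has the same shape in both cases.

```python
import torch
import torch.nn.functional as F

def in_batch_softmax_loss(user_embs, item_embs, tau=0.1):
    # user_embs, item_embs: (B, D), unit-normalized; item_embs[i] is the
    # item clicked by user i. Every other row in the batch serves as a
    # negative for user i, approximating the full softmax of Eq. (1).
    logits = user_embs @ item_embs.T / tau     # (B, B) affinity matrix
    labels = torch.arange(user_embs.size(0))   # positives on the diagonal
    return F.cross_entropy(logits, labels)
```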
4.3 Evaluation Metric
We experimented with multiple evaluation metrics to measure model performance using our offline validation dataset. Given the similarity of our problem to the ranking problem in the information retrieval setting, we considered several metrics commonly used for ranking problems such as Normalized Discounted Cumulative Gain (NDCG), Recall@k, Precision@k, and Mean Reciprocal Rank
Symbol     | Hyperparameter Description        | Value
D          | Item/User embedding dimension     | 64
D_text     | Text-based feature dimension      | 64
D_category | Category feature dimension        | 64
L          | Number of hidden layers in MLP    | 3
H          | Hidden dimension in MLP           | 64
τ          | Temperature in affinity function  | 0.1
Table 1: Model hyperparameter settings.

(MRR) [11]. As mentioned in the previous section, we typically have only 1 positive in each page impression, therefore it becomes important to measure whether or not the positive recommendation is in the top k results. We therefore ultimately chose to use Recall@k as our primary evaluation metric, computed over the P page impressions in the validation set as

    Recall@k = (1/P) Σ_{i=1}^{P} 1[positive item i is in the top k],    (9)

for several values of k.

4.4 Implementation Details
We used the PyTorch [25] deep learning framework to implement the core model. Additionally, we utilized the PyTorch-Lightning [10] framework, which shortened development iterations and standardized the training loop so that it was seamless to transition the model between different CPU and GPU training environments. We chose the Adam optimizer [21] with a 0.01 learning rate. The gradient clipping parameter, set to 0.001, was essential for stabilizing the gradient in the recurrent part of the network, which spanned several hundred steps. We chose to sample 3000 negatives for each positive item, and used 600 as our batch size to maximize GPU utilization. Finally, we trained the model for 10 epochs over our data to reach convergence of the evaluation metrics. Model hyperparameters were selected considering production storage constraints and model performance on the validation set. The chosen settings are reported in Table 1.
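With a single positive per impression, Recall@k reduces to the fraction of impressions whose positive item ranks in the top k. A NumPy sketch (array names are ours):

```python
import numpy as np

def recall_at_k(user_embs, item_embs, pos_idx, k=20):
    # Fraction of the P impressions whose single positive item ranks in
    # the top k candidates by user-item affinity, as in Eq. (9).
    scores = user_embs @ item_embs.T                     # (P, n_items)
    pos_scores = scores[np.arange(len(pos_idx)), pos_idx]
    ranks = (scores > pos_scores[:, None]).sum(axis=1)   # 0-based rank
    return float((ranks < k).mean())
```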
4.5 Offline Experiment Results
As an offline baseline recommendation method for comparison, we used Recently Viewed Items (RVI), which recommends items that a user has recently viewed, ranked by the viewed item's recency. Although this method is simple and does not use a collaborative filtering (CF) based approach, RVI is widely used as a way of generating personalized recommendations in production systems. It is also a difficult baseline to beat in terms of generating user engagement, given that these are items the user has engaged with recently. The works of Song et al. [28] and Wang et al. [29] show approaches similar to RVI to be strong baseline methods, outperforming CF-based methods.

We evaluated our best model, which used a recurrent user representation based on item views and search query events, and the hyperparameters shown in Table 1. This evaluation was performed on a separate test set, which consists of 7K unique users and 10 million candidate items. The number of candidate items used for this evaluation is similar in scale to the number of candidate items typically used at prediction time in production.

Recall@k | RVI  | Proposed Model
40       | 0.16 |

Table 2: Offline Recall@k results for RVI and the proposed model.

Experimental results are reported in Table 2. Our method outperforms the RVI method in several of the Recall@k metrics that were measured. This shows that our model is able to generate appropriate personalized recommendations based on the user's current shopping mission, and potentially inspire new ones given the right user history. The RVI method, in contrast, only serves a re-targeting purpose, wherein a user is shown items they have already browsed in order to encourage re-engagement with previous shopping missions. In our approach, the multi-modal user embedding framework allows the model to incorporate various user activities seamlessly in a machine-learned manner to maximize user engagement.
4.6 User Data Ablation
Stability of the model and robustness of its predictions are essential in a production environment, but are rarely studied for the machine learning models that power recommender systems. We conducted an ablation analysis for our model with respect to user history data in order to study model performance under conditions where part of the user history is missing.

To understand the possible impact of missing the most recent user history at prediction time, we performed several experiments, the results of which are shown in Figure 3. First, we trained a model with the full user history present (dashed blue curve in Fig. 3), and computed predictions on a validation set while dropping different lengths of the most recent user activity (horizontal axis in Fig. 3). As can be seen from the dashed blue curve in Fig. 3, when the most recent 5 minutes of user activity were missing, the Recall@20 metric decreased by more than 30%, from 0.9 to 0.62. The metric degrades by as much as 50%, to 0.45, when user activity within the most recent 60 minutes is missing. This creates a significant performance risk for production deployment, since the model may not always have up-to-date real-time user onsite history at prediction time. To counteract this effect, we chose to train the model while dropping the most recent user activities, not random ones from the user history, in order to better align with the scenario in the production system where there may be a gap in time between the batch model prediction output and a user impression. A user can simply be browsing the site, potentially with a new shopping mission, for some time after a batch update.
Figure 3: Model prediction performance with missing user history.
We experimented with training models by dropping some of the user onsite activity data before impression time, ranging from the most recent 10 minutes (green curve) to 1 day (purple curve). As we can see in Fig. 3, models trained using skipped user history performed better than the original model when part of the user history was missing at prediction time. For example, with 60 minutes of user activity missing at prediction time, all "skipped" models were able to achieve a Recall@20 of 0.56, compared to 0.45 for the original model (dashed blue curve). In addition, training with more skipped history leads to a "flatter" curve, suggesting a more robust model under conditions of variable missing history. However, we also observe that model performance decreases when training with more dropped-out user history, especially across the range of 0 minutes (no skipping) to 30 minutes. In order to find a balance between performance and consistency, the production model was selected as the one with the largest area under the curve in Fig. 3 amongst those trained with 10 minutes of skipped user activity (green curve). This sort of user history dropout is important to consider when training a model with robust prediction expectations for a production setting.
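The training-time history dropout can be sketched as a filter on event timestamps (a sketch under the assumption that timestamps are in seconds; the function name is ours):

```python
import numpy as np

def drop_recent_events(event_times, impression_time, skip_minutes=10):
    # Keep only events that occurred at least `skip_minutes` before the
    # impression, simulating the gap between a batch prediction run and
    # the user's page impression.
    event_times = np.asarray(event_times, dtype=float)
    return event_times <= impression_time - skip_minutes * 60
```

With `skip_minutes=10` this mirrors the skipped-history setting chosen for the production model (the green curve in Fig. 3).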
5 MODEL PREDICTION
The previous sections focused on describing the model and offline validation testing. In this section, we turn our attention to the model prediction stage, including the multitude of engineering considerations and trade-offs involved in building a large scale production recommender system.
5.1 Clustering-based KNN Retrieval
During the prediction stage, given a user embedding and a pool of candidate item embeddings, retrieval is conducted by the KNN search algorithm. We used the KNN implementation from FAISS [19]. For a marketplace with an enormous item-based inventory, similar item listings are common on the site (at the time of writing, searching "iphone 11" on eBay would return 3047 results). As our model only consumes content-based features (title, category, and aspects) for items, all of those content-similar items would have similar embeddings from our model. Using the traditional KNN approach, given a user embedding, the retrieved items would overlap heavily in the embedding space.
Figure 4: Sample clusters demonstrating model quality.
To address this diversity problem, we use K-means clustering to group all candidate items into K clusters, each with a centroid c_i. At retrieval time, we try to find N candidate recall items given a user embedding u. We first find the nearest M clusters, and in each cluster conduct a KNN search to retrieve m_i items:

    m_i = ⌈ (exp(γ(c_i, u)) / Σ_{j=1}^{M} exp(γ(c_j, u))) · N ⌉,    (10)

where γ(c_j, u) is the same affinity function defined in Eq. 1. Figure 4 demonstrates the quality of the item clusters generated by the model, with each row representing one cluster.

Figure 5: Generated personalized item recommendations (a) without clustering (M = K = 100,000) and (b) with clustering, where the user has been previously looking at Adidas Yeezy shoes.
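The slot allocation of Eq. (10) can be sketched as follows; a per-cluster KNN search (e.g. with FAISS) then retrieves m_i items from each selected cluster. Function and argument names are illustrative.

```python
import numpy as np

def allocate_per_cluster(user_emb, centroids, n_items, m_clusters, tau=0.1):
    # Select the M nearest centroids by temperature-scaled affinity and
    # allocate the N recall slots with a softmax over those affinities,
    # rounding up as in Eq. (10). Returns (cluster_ids, slots).
    aff = centroids @ user_emb / tau
    top = np.argsort(-aff)[:m_clusters]           # M nearest clusters
    w = np.exp(aff[top] - aff[top].max())         # stable softmax weights
    slots = np.ceil(w / w.sum() * n_items).astype(int)
    return top, slots
```

With `m_clusters=1` this reduces to a plain KNN search over the single nearest cluster, matching the degenerate case discussed below.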
Since the inventory on eBay is highly diverse, the existing product catalog does not cover all eBay items consistently. The clustering step essentially creates a "pseudo catalog" covering all items, in which content-similar items are organized into static entities. In our experiments, we found that K = 100,000 provides the right level of granularity in clustering items.

The number of clusters used for retrieval, M, is chosen by balancing item diversity against retrieval metrics. With a larger M, the retrieved items cover more potential user interests, but with less concentration in any specific direction. With M =
1, this approachdegenerates to the traditional KNN method. In our experiment, ascan been shown from an example in Fig. 5, the clustering-based
Figure 6: Production engineering architecture for model pre-diction. method generates a more diverse set of item recommendations,without losing relevancy. This technique enables controlling recom-mendation diversity with a simple hyperparameter, 𝑀 , and can betuned during the prediction stage separately from the training stage.In production, we found that using 𝑀 =
10 and 𝐾 = 100,000 providesthe optimal amount of diversity of eBay item recommendations. In this section, we describe the engineering architecture used formodel prediction to serve the personalized recommended items toeBay users. The production engineering architecture is depictedin Figure 6. Since a user’s browsing history is constantly beingupdated, we recalculate the user embedding as well as the KNNresults on a daily basis. Note that our model is based on contenttext based features, which are mostly static for any given item, sowe found that we do not need to retrain the full model on a dailybasis.The prediction process is performed offline in batch mode. First,two Spark extract, transform, load (ETL) jobs generate the candidateitems and the up to date user histories for all users on eBay thathad activity in the last 30 days, along with the necessary metadataaggregated in Hadoop. The user behaviour data at eBay is aggre-gated once per day in Hadoop, so this is the reason the ETL jobs runwith this daily frequency. In order to control for item popularity,we limit the candidate item set to items that have had 2 or moreclicks in the past 4 days.Next, we utilize eBay’s GPU cluster, Krylov [20], to run a forwardpass on the item and user towers in the trained model, in orderto generate the item and user embeddings, respectively. Now, forevery user embedding, a KNN search is performed on all of the item onference’17, July 2017, Washington, DC, USA Tian Wang, Yuri M. Brovman, and Sriganesh Madhvanath embeddings, to generate the KNN results and write them back toHadoop. The KNN results are then loaded to a Couchbase database,which is utilized as a fast (with a latency of a few milliseconds)run-time cache, with the user id as the key, using a batch loadingapplication. 
This caching approach is scalable to hundreds of millions of users, and is limited only by the capacity of the Couchbase cluster being used.

The eBay run-time web serving application stack is based on the Java Virtual Machine (JVM). The backend application serving recommendations is written in Scala and runs on the JVM for fast run-time performance. One of the reasons this caching architecture was chosen is the latency requirement of the run-time application serving personalized recommendations to eBay users. Traditional approaches, such as in-memory matrix factorization, would simply not be computationally feasible at eBay's data scale. In addition to enriching the recommended items with the necessary metadata, the backend application can also apply a separately trained learning-to-rank (LTR) model [4] in order to optimize conversion.
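The popularity filter applied during the candidate-item ETL can be sketched as below. The `candidate_items` helper and the toy click log are hypothetical; the production version runs as a Spark job over behavior data aggregated in Hadoop:

```python
from collections import Counter
from datetime import datetime, timedelta

def candidate_items(click_events, now, min_clicks=2, window_days=4):
    """Popularity filter from the candidate-item ETL step: keep only
    items with at least `min_clicks` clicks inside the trailing
    `window_days` window. Plain-Python sketch of Spark-job logic."""
    cutoff = now - timedelta(days=window_days)
    recent = Counter(item for item, ts in click_events if ts >= cutoff)
    return {item for item, n in recent.items() if n >= min_clicks}

# Hypothetical click log: (item id, click timestamp) pairs.
now = datetime(2021, 1, 10)
events = [
    ("item_a", datetime(2021, 1, 9)),
    ("item_a", datetime(2021, 1, 8)),
    ("item_b", datetime(2021, 1, 9)),   # only one click in the window
    ("item_c", datetime(2021, 1, 1)),   # clicks fall outside the window
    ("item_c", datetime(2021, 1, 2)),
]
candidates = candidate_items(events, now)  # -> {"item_a"}
```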
The best trained model was deployed to our production environment and evaluated in an online A/B test. Compared to the current personalized recommendation module in production, which mainly focuses on resurfacing items a user has viewed, our model demonstrated an increase in surface rate of ∼6%, generating recommendations for 90% of listing page impressions. This is mainly achieved by addressing the cold-start and data-sparsity problems through content-based item and user embeddings. The production RVI-based algorithm is limited to a user's item activity only, whereas our model uses search activity as well. Additionally, in the production algorithm, specific items can be filtered out due to item expiration or business logic. The embedding-based approach generates a much larger candidate set, which results in a higher surface rate overall. The current system, refreshed daily to capture new user activity, serves 50 million US users and will be scaled to cover all global eBay users in the near future.
In this paper, we presented an approach for generating personalized item recommendations in a large-scale e-commerce marketplace. A two-tower deep learning model is used to learn embeddings of items and users in a shared vector space. Item and user embeddings are learned from content features and multi-modal onsite user activity, respectively, to tackle the cold-start problem. To better address the instability of the online production environment, user data ablation is incorporated into the offline training process to produce a more robust model. Offline and online experiments have validated our approach and shown significant improvements over baseline approaches. A personalized recommender system based on our approach has been launched to production and is now serving recommendations at scale to eBay buyers.

We are actively working to enhance the quality of the model. More types of implicit user feedback, such as "add to cart" and "watch item", should be incorporated into the model to better capture a user's shopping mission. Additionally, having only a single vector to represent a user is potentially limiting, since the user may have diverse interests and multiple ongoing shopping missions. We are working on an improved user representation that encodes parallel shopping missions and is capable of separating long-term user interests from short-term shopping missions. From the item modeling perspective, a pre-trained language understanding model (e.g., BERT [9]) could be leveraged to better understand item titles. We are also working on incorporating an additional LTR model at runtime to improve conversion metrics. This LTR model would be trained specifically for our context, in which the input is the user history rather than a seed item.

One of the shortcomings of our current engineering implementation is the frequency of the recommendation update, which is currently limited by the need for daily batch processing.
We are working on moving to a near real-time (NRT) system for generating personalized recommendations using user and item embeddings. This entails several infrastructural improvements, including i) a real-time KNN service based on algorithms such as FAISS [19] and ScaNN [15] to compute distances between embedding vectors, ii) a real-time model prediction service for generating item and user embeddings (potentially using the Open Neural Network Exchange (ONNX) format [2] for transferring the trained model from the offline training environment to the real-time prediction environment), and iii) an event stream processing service for capturing up-to-date item and user actions on the eBay site. We are actively developing this infrastructure in the form of more general services that can be utilized for a variety of deep learning embedding-based algorithms, in addition to personalized recommendations.
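The three NRT components above can be sketched end to end as a toy flow. The `NearRealTimeRecommender` class and its mean-of-history user tower are hypothetical stand-ins; the real system would use the trained two-tower model, a dedicated KNN service, and an event stream platform:

```python
import numpy as np

class NearRealTimeRecommender:
    """Toy sketch of the proposed NRT flow: an event stream updates a
    user's history, a user tower re-embeds it, and a KNN query over the
    item embeddings refreshes the recommendations."""

    def __init__(self, item_embs, k=5):
        # Pre-normalized item embeddings, as produced by the item tower.
        self.item_embs = item_embs / np.linalg.norm(
            item_embs, axis=1, keepdims=True)
        self.histories = {}  # user id -> list of viewed item indices
        self.k = k

    def on_view_event(self, user_id, item_idx):
        # iii) event stream processing: capture the action as it happens.
        self.histories.setdefault(user_id, []).append(item_idx)

    def recommend(self, user_id):
        # ii) model prediction: re-embed the up-to-date history. A mean
        # of viewed-item embeddings stands in for the real user tower.
        user_emb = self.item_embs[self.histories[user_id]].mean(axis=0)
        user_emb /= np.linalg.norm(user_emb)
        # i) real-time KNN over the item embeddings.
        scores = self.item_embs @ user_emb
        return np.argsort(-scores)[: self.k].tolist()

rng = np.random.default_rng(2)
rec = NearRealTimeRecommender(rng.standard_normal((50, 16)))
rec.on_view_event("u1", 3)
rec.on_view_event("u1", 7)
top = rec.recommend("u1")
```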
ACKNOWLEDGMENTS
We would like to thank Jesse Lute for product guidance and A/B testing. We also appreciate useful feedback on the manuscript from Zhe Wu, Hongliang Yu, and Menghan Wang.
REFERENCES
[1] Charu C. Aggarwal. 2016. Recommender Systems: The Textbook (1st ed.). Springer Publishing Company, Incorporated.
[2] Junjie Bai, Fang Lu, Ke Zhang, et al. 2019. ONNX: Open Neural Network Exchange. https://github.com/onnx/onnx
[3] Y. M. Brovman. 2019. Complementary Item Recommendations at eBay Scale. https://tech.ebayinc.com/engineering/complementary-item-recommendations-at-ebay-scale/
[4] Yuri M. Brovman, Marie Jacob, Natraj Srinivasan, Stephen Neola, Daniel Galron, Ryan Snyder, and Paul Wang. 2016. Optimizing Similar Item Recommendations in a Semi-Structured Marketplace to Maximize Conversion. In Proceedings of the 10th ACM Conference on Recommender Systems (RecSys '16). Association for Computing Machinery, New York, NY, USA, 199–202. https://doi.org/10.1145/2959100.2959166
[5] Tianqi Chen, Weinan Zhang, Qiuxia Lu, Kailong Chen, Zhao Zheng, and Yong Yu. 2012. SVDFeature: a toolkit for feature-based collaborative filtering. The Journal of Machine Learning Research 13, 1 (2012), 3619–3622.
[6] Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye, Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. 2016. Wide & deep learning for recommender systems. In Proceedings of the 1st Workshop on Deep Learning for Recommender Systems. 7–10.
[7] Kyunghyun Cho, Bart Van Merriënboer, Caglar Gulcehre, Dzmitry Bahdanau, Fethi Bougares, Holger Schwenk, and Yoshua Bengio. 2014. Learning phrase representations using RNN encoder-decoder for statistical machine translation. arXiv preprint arXiv:1406.1078 (2014).
[8] Paul Covington, Jay Adams, and Emre Sargin. 2016. Deep neural networks for YouTube recommendations. In Proceedings of the 10th ACM Conference on Recommender Systems. 191–198.
[9] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018).
[10] WA Falcon. 2019. PyTorch Lightning. GitHub. https://github.com/PyTorchLightning/pytorch-lightning
[11] In SIGKDD (Sydney, NSW, Australia). 10. https://doi.org/10.1145/2783258.2788579
[12] Daniel A Galron, Yuri M Brovman, Jin Chung, Michal Wieja, and Paul Wang. 2018. Deep Item-based Collaborative Filtering for Sparse Implicit Feedback. arXiv preprint arXiv:1812.10546 (2018).
[13] Weihao Gao, Xiangjun Fan, Jiankai Sun, Kai Jia, Wenzhi Xiao, Chong Wang, and Xiaobing Liu. 2020. Deep Retrieval: An End-to-End Learnable Structure Model for Large-Scale Recommendations. arXiv preprint arXiv:2007.07203 (2020).
[14] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. 2011. Deep sparse rectifier neural networks. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics. 315–323.
[15] Ruiqi Guo, Philip Sun, Erik Lindgren, Quan Geng, David Simcha, Felix Chern, and Sanjiv Kumar. 2020. Accelerating Large-Scale Inference with Anisotropic Vector Quantization. In International Conference on Machine Learning. https://arxiv.org/abs/1908.10396
[16] Xiangnan He, Lizi Liao, Hanwang Zhang, Liqiang Nie, Xia Hu, and Tat-Seng Chua. 2017. Neural collaborative filtering. In Proceedings of the 26th International Conference on World Wide Web. 173–182.
[17] Balázs Hidasi, Alexandros Karatzoglou, Linas Baltrunas, and Domonkos Tikk. 2015. Session-based recommendations with recurrent neural networks. arXiv preprint arXiv:1511.06939 (2015).
[18] Yifan Hu, Yehuda Koren, and Chris Volinsky. 2008. Collaborative filtering for implicit feedback datasets. In 2008 Eighth IEEE International Conference on Data Mining (ICDM '08). IEEE, 263–272.
[19] Jeff Johnson, Matthijs Douze, and Hervé Jégou. 2017. Billion-scale similarity search with GPUs. arXiv preprint arXiv:1702.08734 (2017).
[20] S. Katariya and A. Ramani. 2019. eBay's Transformation to a Modern AI Platform. https://tech.ebayinc.com/engineering/ebays-transformation-to-a-modern-ai-platform/
[21] Diederik P Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014).
[22] Yehuda Koren, Robert Bell, and Chris Volinsky. 2009. Matrix factorization techniques for recommender systems. Computer 42, 8 (2009).
[23] Prem Melville, Raymond J Mooney, and Ramadass Nagarajan. 2002. Content-boosted collaborative filtering for improved recommendations. In AAAI/IAAI.
[24] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781 (2013).
[25] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy, Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. 2019. PyTorch: An Imperative Style, High-Performance Deep Learning Library. In Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett (Eds.). Curran Associates, Inc., 8024–8035. http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf
[26] Steffen Rendle. 2010. Factorization machines. In 2010 IEEE International Conference on Data Mining. IEEE, 995–1000.
[27] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, and Lars Schmidt-Thieme. 2009. BPR: Bayesian personalized ranking from implicit feedback. In Proceedings of the Twenty-Fifth Conference on Uncertainty in Artificial Intelligence. AUAI Press, 452–461.
[28] Yang Song, Ali Mamdouh Elkahky, and Xiaodong He. 2016. Multi-rate deep learning for temporal recommendation. In Proceedings of the 39th International ACM SIGIR Conference on Research and Development in Information Retrieval. 909–912.
[29] Tian Wang, Kyunghyun Cho, and Musen Wen. 2019. Attention-based mixture density recurrent networks for history-based recommendation. In Proceedings of the 1st International Workshop on Deep Learning Practice for High-Dimensional Sparse Data. 1–9.
[30] Tian Wang and Yuyangzi Fu. 2020. Item-based Collaborative Filtering with BERT. In Proceedings of The 3rd Workshop on e-Commerce and NLP. 54–58.
[31] Xinyang Yi, Ji Yang, Lichan Hong, Derek Zhiyuan Cheng, Lukasz Heldt, Aditee Kumthekar, Zhe Zhao, Li Wei, and Ed Chi. 2019. Sampling-bias-corrected neural modeling for large corpus item recommendations. In Proceedings of the 13th ACM Conference on Recommender Systems. 269–277.
[32] Guorui Zhou, Xiaoqiang Zhu, Chenru Song, Ying Fan, Han Zhu, Xiao Ma, Yanghui Yan, Junqi Jin, Han Li, and Kun Gai. 2018. Deep interest network for click-through rate prediction. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 1059–1068.
[33] Han Zhu, Xiang Li, Pengye Zhang, Guozheng Li, Jie He, Han Li, and Kun Gai. 2018. Learning tree-based deep model for recommender systems. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining.