Leveraging User Behavior History for Personalized Email Search
Keping Bi, Pavel Metrikov, Chunyuan Li, Byungki Byun
University of Massachusetts Amherst; Microsoft; Microsoft Research; Institute for Information Transmission Problems, Russian Academy of Sciences
[email protected], {pametrik, chunyl, bybyun}@microsoft.com
ABSTRACT
An effective email search engine can facilitate users' search tasks and improve their communication efficiency. Users could have varied preferences on various ranking signals of an email, such as relevance and recency, based on their tasks at hand and even their jobs. Thus a uniform matching pattern is not optimal for all users. Instead, an effective email ranker should conduct personalized ranking by taking users' characteristics into account. Existing studies have explored user characteristics from various angles to personalize email search results. However, little attention has been given to users' search history for characterizing users. Although users' historical behaviors have been shown to be beneficial as context in Web search, their effect in email search has not been studied and remains unknown. Given these observations, we propose to leverage user search history as query context to characterize users and build a context-aware ranking model for email search. In contrast to previous context-dependent ranking techniques that are based on raw texts, we use ranking features in the search history. This frees us from potential privacy leakage while giving better generalization power to unseen users. Accordingly, we propose a context-dependent neural ranking model (CNRM) that encodes the ranking features in users' search history as query context, and show that it can significantly outperform the baseline neural model that does not use the context. We also investigate the benefit of the query context vectors obtained from CNRM on the state-of-the-art learning-to-rank model LambdaMart by clustering the vectors and incorporating the cluster information. Experimental results show that significantly better results can be achieved with LambdaMart as well, indicating that the query clusters can characterize different users and effectively make the ranking model personalized.
CCS CONCEPTS
• Information systems → Environment-specific retrieval; Retrieval models and ranking.

KEYWORDS
Email Search, Query Context, Personalized Search
ACM Reference Format:
Keping Bi, Pavel Metrikov, Chunyuan Li, Byungki Byun. 2021. Leveraging User Behavior History for Personalized Email Search. In Proceedings of the Web Conference 2021 (WWW '21), April 19–23, 2021, Ljubljana, Slovenia. ACM, New York, NY, USA, 11 pages. https://doi.org/10.1145/3442381.3450110

This paper is published under the Creative Commons Attribution 4.0 International (CC-BY 4.0) license. Authors reserve their rights to disseminate the work on their personal and corporate Web sites with the appropriate attribution.

WWW '21, April 19–23, 2021, Ljubljana, Slovenia
© 2021 IW3C2 (International World Wide Web Conference Committee), published under Creative Commons CC-BY 4.0 License.
ACM ISBN 978-1-4503-8312-7/21/04.
https://doi.org/10.1145/3442381.3450110
Email has long been an indispensable way for both business and personal communication in our daily life. As email storage quotas become large due to cheap data storage, users do not feel the need to delete emails and let their mailboxes keep growing [9, 20]. Therefore, an effective email search engine is quite important to support users' search needs and facilitate their daily life.

In email search, users could have varied preferences on an email's various ranking signals, such as relevance, recency, and other email properties, due to the primary purposes of their accounts and their occupations. For instance, commercial users (the employees of a company that uses the email service) and consumer users (individual email service users) could prefer significantly different email properties, such as the length of the email threads and the number of recipients and attachments. While it can be known explicitly whether a user is commercial or consumer, other implicit user cohorts also exist for whom different aspects of emails matter more for ranking. For example, in contrast to engineers, it is more likely that the emails salespeople want to re-find have them as senders rather than as recipients. Users who organize their mailboxes more often could prefer recency over relevance, since they can search inside a specific folder that already satisfies some filtering conditions. Hence, identifying diversified implicit user cohorts as query context and conducting personalized ranking accordingly could improve search quality.

Existing work has explored various types of user information to provide personalized email search results. Zamani et al. [38] leverage the situational context of a query, such as geographical (e.g., country, language) and temporal (e.g., weekday, hour) features, to characterize users. Weerkamp et al. [34] used users' email threads and mailing lists as contextual information to expand queries. Kuzi et al.
[15] trained local word embeddings with the content of each user's mailbox and used the nearest neighbors of the query terms in the embedding space as the query context. Shen et al. [27] collected the frequent n-grams of the results retrieved from the user's mailbox with a decent initial ranker, based on which the query is clustered, and the cluster information can be considered as context for conducting adaptive ranking.

Despite the extensive studies on exploring user characteristics for personalized email search, little attention has been paid to characterizing users with their search history. Leveraging users' historical click-through data as context has been shown to be beneficial in Web search [6, 13, 17, 21, 28, 31, 35, 36]. However, the effect of incorporating users' historical behaviors in email search has not been studied and remains unknown. There are substantial differences between email search and Web search. To name a few: first, email search can only be conducted on a user's personal corpus instead of a shared public corpus, and the target is usually to re-find a known item; second, email content cannot be exposed to third parties due to privacy concerns, which makes it hard to investigate and analyze ranking models; third, in email search logs, cross interactions with the same item from different users do not exist, and search queries may be hard to generalize across users since they are personal and pertain to particular email content.
Due to the differences mentioned above, it is valuable to investigate the benefit of using user search history as context for email search, and it is also challenging to incorporate this information effectively.

In this paper, we study the effect of characterizing users with their search history as query context in personalized email search. As far as we know, this is the first work in this direction. We construct the context from user history grounded on ranking features instead of raw texts, as previous context-aware ranking models do. These numerical features free us from the potential privacy leakage present in a neural model trained with raw texts, while having better generalization capabilities to unseen users. They can also capture users' characteristics from a different perspective compared with the semantic intents indicated by the terms or n-grams in their click-through data. Based on both neural models and the prevailing state-of-the-art learning-to-rank model, LambdaMart [8], we investigate the query context obtained from the ranking features in the user search history. Specifically, we propose a baseline neural ranking model based on the numerical ranking features and a context-dependent neural ranking model (CNRM) that encodes the ranking features of users' historical queries and their associated clicked documents as query context. We further cluster the query context vectors learned from CNRM to reveal hidden user cohorts that are captured in users' search behaviors. To examine the effectiveness of the learned user cohorts, we then incorporate the cluster information into the LambdaMart model.
Experimental results show that the query context helps improve the ranking performance of both neural models and LambdaMart models.

Our contributions in this paper can be summarized as follows: 1) we propose a context-dependent neural ranking model (CNRM) that incorporates query context encoded from numerical features extracted from users' search history to characterize potentially different matching patterns for diversified user groups; 2) we conduct a thorough evaluation of CNRM on data constructed from the search logs of one of the world's largest email search engines and show that CNRM can significantly outperform the baseline neural model without context; 3) we cluster the query context vectors, incorporate the query cluster information into LambdaMart, and produce significantly better ranking performance; 4) we analyze the information carried in the query context vectors, the feature weight distribution of the query clusters in the LambdaMart model, and the distribution of users in terms of their query clusters.

Our paper is directly related to two threads of work: email search and context-aware ranking in Web search.
Email Search.
There have been limited existing studies on email search, probably due to the lack of publicly available email data, which is too private and sensitive to expose. The TREC Enterprise track dataset [11, 29] is an exception. The dataset contains about 200k email messages crawled from the W3C public mailing list archive and 150
Context-aware ranking has been explored extensively in Web search, usually based on users' long-term or short-term search history [6, 13, 17, 18, 21, 28, 31, 35, 36]. Short-term search history includes the queries issued in the current search session and their associated clicked results. Shen et al. [28] proposed a context-aware language modeling approach that can extract expansion terms based on short-term search history and showed the effectiveness of the model on TREC collections. Afterward, some studies also focused on leveraging users' short-term history as context, especially their clicked documents [17, 31, 35, 36]. Long-term search history is not limited to the search behaviors in the current session and can include more of a user's historical information. It is often used for search personalization [6, 13, 21].

Although we also leverage users' search history to characterize users, we focus on differentiating the importance of various ranking signals for different users rather than refining the semantic intent of a user query with long-term or short-term context as in Web search. More importantly, we aim to study the effectiveness of search history in the context of email search, which is significantly different from Web search in terms of privacy concerns and the properties of search logs.
Table 1: Representative features in personal email search. Note that no word or n-gram information is available.
Feature Group    Type        Notation   Examples
query-level      discrete    F^D_Q      query language, user type
                 continuous  F^C_Q      IDF of query terms
document-level   discrete    F^D_D      flagged, read, meta-field length
                 continuous  F^C_D      recency, email length
q-d matching     continuous  F^C_QD     BM25, language model scores
In this section, we first describe the problem formulation and the available ranking features. Then we propose two neural models: a) a vanilla neural ranking model that does not use context (NRM) and b) a context-dependent neural ranking model (CNRM) that encodes users' historical behaviors as query context. Afterward, we introduce how we optimize these neural models. At last, we illustrate how to incorporate the query context information into LambdaMart models.
Let q be a query issued by user u_q, and D be the set of candidate documents for q, which can be obtained from the retrieval model at an early stage. D consists of user u_q's clicked documents D^+ and the other competing documents D^-, i.e., D = D^+ ∪ D^-. The objective of email ranking is to promote D^+ to the top so that users can find their target documents as soon as possible.

Email Ranking Features.
For any document d ∈ D, three groups of features are extracted from a specifically designed privacy-preserving system. Table 1 shows some representative features of each group. (i) Query-level features F_Q(q) indicate properties of a user's query and the user's mailbox, such as a user type representing whether u_q is a consumer (general public) or a commercial user (employee of a commercial company to which the email service is provided). F_Q can be further divided into discrete features F^D_Q and continuous features F^C_Q, i.e., F_Q(q) = F^D_Q(q) ∪ F^C_Q(q). (ii) Document-level features F_D(d) characterize properties of d, such as its recency and size, independent of q. F_D also consists of discrete features F^D_D and continuous features F^C_D, i.e., F_D(d) = F^D_D(d) ∪ F^C_D(d). (iii) Query-document (q-d) matching features F_QD(q, d) measure how d matches the query q overall and in each field of an email, such as the "to" or "cc" list and the subject. The q-d matching features F_QD(q, d) are all considered continuous features, i.e., F_QD(q, d) = F^C_QD(q, d). The score of d given q can be computed based on F_Q(q), F_D(d), and F_QD(q, d). To protect user privacy, the raw text of queries and documents is not available during our investigation.

User Search History.
When we further consider user behavior history for the current query q issued by u_q, more information is available to rank documents in the candidate set D. Let (q_k, q_{k-1}, ..., q_1) be the most recent k queries issued by u_q before q, and (D^+_k, D^+_{k-1}, ..., D^+_1) be their associated relevant documents. We will use "document" and "email" interchangeably in this paper.
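To make the setup above concrete, the following is a hypothetical sketch of the per-query data available to the ranker: the candidate set D, the three feature groups, and the user's recent history. All field names and values are invented for illustration; they are not the paper's actual schema.

```python
# Hypothetical per-query record: candidate set D = D+ ∪ D-, the three
# feature groups of Table 1, and the history (q_k, ..., q_1) with clicks.
def make_example_query():
    return {
        "query": {            # F_Q(q): discrete + continuous
            "discrete": {"user_type": "commercial", "query_language": "en"},
            "continuous": {"avg_term_idf": 7.3},
        },
        "candidates": [       # one entry per document d in D
            {
                "clicked": True,                     # d belongs to D+
                "doc": {                             # F_D(d)
                    "discrete": {"flagged": 0, "read": 1},
                    "continuous": {"recency_days": 2.5, "email_length": 340.0},
                },
                "match": {"bm25f": 11.2, "lm_score": -4.1},  # F_QD(q, d)
            },
        ],
        # Ranking features of the k most recent queries and their clicks.
        "history": [{"query": {}, "clicked_docs": []}],
    }

example = make_example_query()
```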
[Figure 1: The architecture of CNRM, which outputs the score CNRM(q, d) for a candidate document d given the current query q.]
Feature Encoding.
The three groups of email features have different scales; thus it is important to normalize and map them to the same latent space to maximize the information the model can learn. Discrete features, such as user type, query language, and whether the email has been read, are projected to hidden vectors by referring to the corresponding embedding lookup tables. Continuous features, such as email length and recency, are used as numerical input vectors directly. Specifically, taking query-level features for instance, discrete features F^D_Q(q) are mapped to

g^D_Q(q) = [emb(f_1)^T; ...; emb(f_{|F^D_Q(q)|})^T]^T    (1)

where [·;·;·] denotes the concatenation of a list of vectors, |·| indicates the dimension of a vector, and emb(f) ∈ R^{e×1} is the embedding of feature f with dimension e. Then the concatenated vector is mapped to h^D_Q ∈ R^{m×1} with matrix W^D_Q ∈ R^{m×|g^D_Q(q)|}, where m is the dimension of the hidden vector h^D_Q, according to

h^D_Q(q) = tanh(W^D_Q g^D_Q(q)).    (2)

Similarly, the continuous query-level features F^C_Q(q) are mapped to h^C_Q ∈ R^{m×1} with matrix W^C_Q ∈ R^{m×|F^C_Q(q)|}, i.e.,

h^C_Q(q) = tanh(W^C_Q F^C_Q(q)).    (3)

Then the two hidden vectors are concatenated as the representation of the query-level features of q, denoted as h_Q(q), namely,

h_Q(q) = [h^D_Q(q)^T; h^C_Q(q)^T]^T.    (4)

Likewise, the hidden vector corresponding to the document-level features F_D(d) can be obtained as h_D(d) = [h^D_D(d)^T; h^C_D(d)^T]^T, following the same encoding paradigm but with different lookup tables for discrete features and different mapping matrices in the above equations.

Since the q-d matching features F_QD(q, d) are all continuous, they are mapped to

h_QD(q, d) = h^C_QD(q, d) = tanh(W^C_QD F_QD(q, d))    (5)

where W^C_QD ∈ R^{m×|F_QD(q,d)|}. Note that the bias terms in all the mapping functions throughout the paper are omitted for simplicity.
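Equations (1)–(4) can be sketched as follows. This is a minimal NumPy stand-in (the paper's implementation is in PyTorch); the vocabularies, feature counts, and toy dimensions e = 4, m = 8 are assumptions for illustration only (the paper uses m = 64).

```python
import numpy as np

rng = np.random.default_rng(0)
e, m = 4, 8          # toy embedding / hidden sizes (the paper uses m = 64)

# Hypothetical lookup tables for two discrete query-level features.
emb_tables = {
    "user_type": rng.normal(size=(2, e)),        # consumer / commercial
    "query_language": rng.normal(size=(5, e)),
}

def encode_query(discrete_ids, f_c, W_D, W_C):
    """Equations (1)-(4): embed and concatenate discrete features,
    project both groups with tanh, then concatenate the hidden vectors."""
    g_D = np.concatenate([emb_tables[name][i]            # Eq. (1)
                          for name, i in discrete_ids.items()])
    h_D = np.tanh(W_D @ g_D)                             # Eq. (2)
    h_C = np.tanh(W_C @ f_c)                             # Eq. (3)
    return np.concatenate([h_D, h_C])                    # Eq. (4)

W_D = rng.normal(size=(m, 2 * e))    # m x |g_D| for the two discrete features
W_C = rng.normal(size=(m, 3))        # 3 continuous features assumed
h_Q = encode_query({"user_type": 1, "query_language": 0},
                   np.array([7.3, 0.2, 1.0]), W_D, W_C)
```

Document-level features h_D(d) would be encoded the same way with their own lookup tables and matrices.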
Feature Aggregation with Self-attention. To better balance the importance of each group of features in the ranking process, the hidden vectors corresponding to the query, document, and q-d matching features are aggregated with a self-attention mechanism as follows. They are first mapped to another vector with a shared matrix W_a:

o_Q(q) = tanh(W_a h_Q(q)); o_D(d) = tanh(W_a h_D(d)); o_QD(q, d) = tanh(W_a h_QD(q, d))    (6)

o_Q(q), o_D(d), and o_QD(q, d) are all in R^{m×1}. Then the attention weight of each component is computed with a softmax function over its dot products with itself and the other two components. The new vector of each component is the weighted combination of the input vectors, and average pooling is applied to the new vectors to obtain the final vector. Concretely, the aggregated vector o(q, d) ∈ R^{m×1} is computed according to

o_c = [o_Q(q); o_D(d); o_QD(q, d)]
W_attn = softmax(o_c^T o_c)
o(q, d) = AvgPool(o_c W_attn)    (7)

(We also tried aggregating the vectors of the document and q-d matching features using their attention weights with respect to the vector of the query features. However, this was not better than using self-attention, so we do not include it in the paper.)

Scoring Function.
The score of document d given q is finally computed as:

NRM(q, d) = W_{s2} tanh(W_{s1} o(q, d))    (8)

where W_{s1} ∈ R^{m×m} and W_{s2} ∈ R^{1×m}.

A user's historical search behaviors could help characterize the user and provide query-level context for the current query. We propose a context-dependent neural ranking model (CNRM) that extracts a context vector based on the user's search history and scores a candidate document according to both its features given the current query and the query context. We first introduce how CNRM encodes a candidate document's features and then show how CNRM obtains the context vector and scores the candidate document. The model architecture is shown in Figure 1.

Given the current query q, for each d in the candidate set D, CNRM obtains the final hidden vector o(q, d) from its initial three groups of features following the same paradigm as in Section 3.2.

Context Encoding.
As shown in Figure 1, for each historical query q_i (1 ≤ i ≤ k), we obtain the hidden vector of its query-level features, i.e., h_Q(q_i), using the same encoding process as for the current query q according to Equations (1), (2), and (3). Similarly, for each document d_i in the relevant document set D^+_i of q_i, we encode its document-level features and q-d matching features into the latent space using the same mapping functions and parameters that are used for the candidate document d given q, denoted as h_D(d_i) and h_QD(q_i, d_i) respectively. Then the aggregated document-level and q-d matching vectors corresponding to D^+_i are computed as the projected average embeddings of the documents in D^+_i:

h_D(D^+_i) = tanh(W_d Avg({h_D(d_i) | d_i ∈ D^+_i}))
h_QD(q_i, D^+_i) = tanh(W_qd Avg({h_QD(q_i, d_i) | d_i ∈ D^+_i}))    (9)

where W_d ∈ R^{m×m} and W_qd ∈ R^{m×m}. The overall vector corresponding to q_i, i.e., o(q_i, D^+_i), is computed according to Equations (6) and (7), with h_D(d) and h_QD(q, d) replaced by h_D(D^+_i) and h_QD(q_i, D^+_i) respectively.

To produce a context vector for q, we encode the sequence of the overall vectors corresponding to each historical query and its associated positive documents, together with the hidden vector of the query-level features of q, with a Transformer [33] architecture. Specifically,

c(q) = Transformer(o(q_k, D^+_k) + posEmb(k), ..., o(q_1, D^+_1) + posEmb(1), h_Q(q) + posEmb(0))    (10)

As in Figure 1, the output of the Transformer is the vector corresponding to q in the final transformer layer, which acts as the final context vector. As such, the interaction between the current query-level features and the historical behaviors can be captured and used to better balance the importance of each search behavior in the history.
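The aggregation of Equations (6)–(7), the context encoding of Equation (10), and the bilinear score of Equation (11) can be sketched together. This is a toy NumPy sketch under stated assumptions: a single-head, single-layer attention block stands in for the full Transformer of [33], the softmax over the 3×3 weight matrix normalizes over the input components, and the sizes m = 8, k = 3 are illustrative only (the paper uses m = 64, k = 10).

```python
import numpy as np

rng = np.random.default_rng(1)
m, k = 8, 3          # toy hidden size and history length (paper: m=64, k=10)

def softmax(x, axis=-1):
    z = np.exp(x - x.max(axis=axis, keepdims=True))
    return z / z.sum(axis=axis, keepdims=True)

def aggregate(h_q, h_d, h_qd, W_a):
    """Equations (6)-(7): shared projection, 3x3 self-attention weights,
    weighted combination, then average pooling over the three outputs."""
    o_c = np.tanh(W_a @ np.stack([h_q, h_d, h_qd], axis=1))   # m x 3
    W_attn = softmax(o_c.T @ o_c, axis=0)                     # columns sum to 1
    return (o_c @ W_attn).mean(axis=1)

def encode_context(history, h_q_cur, pos_emb, Wq, Wk, Wv):
    """Single-head stand-in for the Transformer in Equation (10); the
    output at the current query's slot is the context vector c(q)."""
    X = np.stack(history + [h_q_cur]) + pos_emb               # (k+1) x m
    A = softmax((X @ Wq) @ (X @ Wk).T / np.sqrt(m), axis=-1)
    return (A @ (X @ Wv))[-1]

W_a, Wq, Wk, Wv, W_b = (rng.normal(size=(m, m)) for _ in range(5))
history = [aggregate(*rng.normal(size=(3, m)), W_a) for _ in range(k)]
c_q = encode_context(history, rng.normal(size=m), rng.normal(size=(k + 1, m)),
                     Wq, Wk, Wv)
o_qd = aggregate(*rng.normal(size=(3, m)), W_a)
score = o_qd @ W_b @ c_q          # Equation (11): bilinear CNRM score
```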
Scoring Function.
To allow each dimension of the encoded features of a candidate d and the context vector to interact sufficiently, the final score of d given q is computed by bilinear matching between the aggregated vector of d and the encoded context:

CNRM(q, d) = o(q, d)^T W_b c(q)    (11)

where W_b ∈ R^{m×m}, and o(q, d) and c(q) are computed according to Equations (7) and (10) respectively.

We use the softmax cross-entropy list-wise loss [3] to train NRM and CNRM. Specifically, the loss is the cross-entropy between the list of document labels y(d) and the list of probabilities obtained by applying the softmax function to the documents' ranking scores. For a query q with candidate set D, the ranking loss L_r can be computed as:

L_r = - Σ_{d∈D} y(d) log [ exp(s_r(d)) / Σ_{d'∈D} exp(s_r(d')) ]    (12)

where s_r is the scoring function: NRM in Equation (8) or
CNRM in Equation (11). However, the document label y(d) is extracted based on user clicks and is thus biased towards top documents, since users usually examine the results from top to bottom and then decide whether to click. To counteract the effect of position bias, better balance the feature weights, and let the context take proper effect, we train NRM and CNRM through unbiased learning. In contrast to Web search, where there is one panel of organic results, email search usually has two panels, with several results ranked by relevance followed by results ranked by time. As stated in [10], allowing duplicates in the two panels leads to better user satisfaction. So each document d can have two positions, in the relevance and time panels respectively, denoted as pos_r(d) and pos_t(d). We adapt the unbiased learning algorithm proposed by Ai et al. [4] to our email search scenario. An examination model is built to estimate the propensity of different positions and adjust each document's weight in the ranking loss with inverse propensity weighting. Meanwhile, the ranking scores can adjust the document weights in the examination loss by inverse relevance weighting as well [4]. We will introduce the examination model and the collaborative model optimization next.

Examination Model.
Two lookup tables are created for relevance and time positions respectively. Let emb(pos_r(d)) ∈ R^{m×1} and emb(pos_t(d)) ∈ R^{m×1} be the embeddings of the relevance and time positions, and W_{p1} and W_{p2} ∈ R^{1×m} be the mapping matrices. The score of d, i.e., s_e(d), output by the examination model is computed as:

h_p(d) = [emb(pos_r(d))^T; emb(pos_t(d))^T]^T
s_e(d) = W_{p2} tanh(W_{p1} h_p(d))    (13)

(We do not introduce how Transformers work due to space limitations. Please refer to [33] for details.)
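Equation (13) amounts to a small tanh network over the two position embeddings. A NumPy sketch follows; the table sizes and the shape of W_{p1} (chosen here to fit the concatenated 2m-dimensional vector) are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)
m, n_rel_pos, n_time_pos = 8, 10, 10   # toy sizes; all values hypothetical

# Lookup tables for relevance-panel and time-panel positions.
rel_pos_emb = rng.normal(size=(n_rel_pos, m))
time_pos_emb = rng.normal(size=(n_time_pos, m))
W_p1 = rng.normal(size=(m, 2 * m))     # fits the concatenated 2m vector
W_p2 = rng.normal(size=m)              # final projection to a scalar

def examination_score(pos_r, pos_t):
    """Equation (13): embed the document's two positions, concatenate,
    and score with a small tanh network -> propensity score s_e(d)."""
    h_p = np.concatenate([rel_pos_emb[pos_r], time_pos_emb[pos_t]])
    return float(W_p2 @ np.tanh(W_p1 @ h_p))

s_e = examination_score(pos_r=0, pos_t=3)
```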
Collaborative Optimization. We follow the dual learning algorithm in [4] and use the softmax cross-entropy list-wise loss [3] to train both the examination model and the ranking model. Let d_r be the document ranked at the first position in the relevance panel, namely, pos_r(d_r) = 1. Inverse propensity weighting based on the relative propensity with respect to this first result is applied to each document to adjust the final loss. Then Equation (12) can be refined as:

g_e(d) = exp(s_e(d)) / Σ_{d'∈D} exp(s_e(d'))
L_r = - Σ_{d∈D} y(d) [g_e(d_r) / g_e(d)] log [ exp(s_r(d)) / Σ_{d'∈D} exp(s_r(d')) ]    (14)

Regularization terms in Equation (14) have been omitted for simplicity. Similarly, the loss of the examination model can be derived by swapping s_r and s_e in Equation (14). The training of the examination and ranking models depends on each other, and both are refined step by step until the models converge.

It is hard to directly incorporate user behavior history into some learning-to-rank (LTR) models, such as the state-of-the-art method LambdaMart [8], since they are grounded on various ready-to-use features and do not support learning features automatically during training. To study whether user search history can characterize the context of a query q and benefit search quality for LambdaMart, we investigate a method to incorporate the context vector encoded from user search history, i.e., c(q) in Equation (10), into the LambdaMart model.

We first cluster the context vectors into n clusters with k-means and obtain a one-hot vector of length n, denoted as cluster(q) ∈ R^{n×1}. The dimension corresponding to the cluster id of the context has value 1 and the remaining dimensions are all 0.
To let the context take more effect, instead of adding cluster(q) as features to a LambdaMart model directly, we bundle the document and q-d matching features with cluster(q). Specifically, we extract the l most important features in the feature set [F_D(d); F_QD(q, d)], denoted as F_l(q, d) ∈ R^{l×1}, compute F_l(q, d) · cluster(q)^T, flatten the resulting matrix to a vector F_fc of length l * n, and add F_fc to LambdaMart as additional features. In our experiments, l is set to 3, and F_l(q, d) includes recency, email length, and the overall BM25 score computed with multiple fields (BM25f) [26].

We also tried other methods of incorporating the context into LambdaMart, such as adding the context vector as m features directly, adding the cluster id as a single feature, and adding n features either as a one-hot vector according to the cluster id or as the probabilities of the context belonging to each cluster. These approaches performed worse than the method we propose, so we do not include the corresponding experimental results in the paper. (Other clustering methods can be used as well; we leave this as future work.)

We use search logs obtained from one of the world's largest email search engines to collect the data. To protect user privacy, the raw text of users' queries and documents cannot be accessed. Only features extracted from specifically designed privacy-preserving pipelines are available, as mentioned in Section 3. Each candidate document has over 300 associated features. We randomly sampled users from two weeks of search logs and kept only the users who issued more than 10 queries during this period. We observe that over 10% of users satisfy this condition. To reduce the effect of the uncommon users with too many queries, we filtered out 90% of the users who issued more than 20 queries.
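The feature-bundling step is an outer product of the l selected features with the one-hot cluster vector, flattened to l * n extra features. A small sketch (feature values hypothetical):

```python
import numpy as np

def bundle_features(f_l, cluster_id, n):
    """Bundle the l most important features with the one-hot query-cluster
    vector: flatten(F_l(q,d) . cluster(q)^T) gives l * n extra features."""
    one_hot = np.zeros(n)
    one_hot[cluster_id] = 1.0
    return np.outer(f_l, one_hot).ravel()

# l = 3 as in the paper: recency, email length, BM25f (values hypothetical).
f_fc = bundle_features(np.array([2.5, 340.0, 11.2]), cluster_id=4, n=10)
```

Only the columns belonging to the query's cluster are non-zero, so LambdaMart can learn cluster-specific splits on these features.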
In this way, we obtained hundreds of thousands of queries associated with these users in total. There are about 30 candidate documents on average, and at most 50 documents, for each query, out of which about 3 documents have positive labels on average. The labels are extracted based on users' interactions with the documents, such as click, reply, forward, long-dwell, etc., and there are five levels in total: bad, fair, good, excellent, and perfect. At each level, the average number of documents for a query is 26.91, 1.87, 0.03, 0.55, and 0.64, respectively.

Note that the dataset we produce has no user content, and it is also anonymized and delinked from the end user. Under the End-User License Agreement (EULA), the dataset can be used to improve the service quality.
Data Partition.
To obtain data partitions consistent with real scenarios, we ordered the queries by their search time and divided the data into training/validation/test sets with ratio 0.8:0.1:0.1 in chronological order. Experiments on both the neural models and the LambdaMart models are based on this data.

To see whether CNRM can generalize well to unseen users and achieve better search quality based on the numerical ranking features in their search history, we also partition the data according to users. Specifically, we randomly split all the users into training/validation/test sets with ratio 0.8:0.1:0.1 and put their associated queries into the corresponding partitions. The users in each partition do not overlap.
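The chronological partition above can be sketched as a simple sort-and-cut (field names are hypothetical):

```python
def chronological_split(queries, ratios=(0.8, 0.1, 0.1)):
    """Order queries by search time and cut into train/validation/test,
    so models are always evaluated on queries issued after training ones."""
    ordered = sorted(queries, key=lambda q: q["time"])
    n_train = int(len(ordered) * ratios[0])
    n_valid = int(len(ordered) * ratios[1])
    return (ordered[:n_train],
            ordered[n_train:n_train + n_valid],
            ordered[n_train + n_valid:])

queries = [{"time": t, "qid": t} for t in (5, 1, 9, 3, 7, 2, 8, 4, 6, 0)]
train, valid, test = chronological_split(queries)
```

The user-based partition would instead shuffle user ids, split them 0.8:0.1:0.1, and assign each user's queries to that user's partition.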
We conduct two types of evaluation to see whether users' historical behaviors benefit search quality. Specifically, we compare NRM to CNRM, and the vanilla LambdaMart model to the LambdaMart model with context added as additional features (as stated in Section 3.5). These comparisons are based on the data partitioned in chronological order, which is consistent with real scenarios. Also, we compare CNRM with NRM on the data partitioned according to users to show the ability of CNRM to generalize to unseen users. When comparing neural models with and without the context, we also include several other baselines:

• BM25f: BM25f measures the relevance of an email. It computes the overall BM25 score of an email based on word matching in its multiple fields [12, 26].
• Recency: Recency ranks emails according to the time they were created, which is the ranking criterion for the recency panel of most email search engines.
• CLSTM: The Context-aware Long Short-Term Memory model (CLSTM) encodes the context of user behavior features with an LSTM [14] instead of Transformers [33]. In other words, CLSTM replaces the Transformers in CNRM with an LSTM.
Since top results matter more when users examine the emails, and our data has multi-grade labels, we use Normalized Discounted Cumulative Gain (NDCG) at cutoffs 3, 5, and 10 as evaluation metrics. Two-tailed Student's t-tests with p < 0.05 are conducted to check statistical significance.
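For reference, NDCG@k for one query can be computed as below. This sketch uses graded labels directly as gains; the production metric may instead use exponential (2^label - 1) gains, which the source does not specify.

```python
import math

def dcg_at_k(gains, k):
    # Standard log2 position discount over the top-k ranked gains.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))

def ndcg_at_k(ranked_gains, k):
    """NDCG@k for one query: DCG of the produced ranking divided by the
    DCG of the ideal (descending-gain) ranking."""
    idcg = dcg_at_k(sorted(ranked_gains, reverse=True), k)
    return dcg_at_k(ranked_gains, k) / idcg if idcg > 0 else 0.0
```

The reported metric averages this per-query value over all test queries.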
We implemented the neural models with PyTorch (https://pytorch.org/) and trained each neural model on a single NVIDIA Tesla K80 GPU. All the neural models were trained with a batch size of 128 for 10 epochs, within which they can converge well. We set the embedding size of discrete features e (in Equation (1)) and the hidden dimension m in Section 3 to 64. The performance does not improve with larger embedding sizes, so we only report results with m = 64. For the Transformers in CNRM, we sweep the number of attention heads over {1, 2, 4, 8}, the number of transformer layers from 1 to 3, and the dimension of the feed-forward sub-layer over {128, 256, 512}. For the CLSTM model, we tried layer numbers from 1 to 3. We set the history length k to 10. We use Adam to optimize the neural models. The learning rate is initially set to 0.002 and then warmed up over the first {2,000, 3,000, 4,000} steps, following the paradigm in [33]. The coefficient for the regularization terms in Equation (14) is selected from {1e-5, 5e-5, 1e-6}.

We train the LambdaMart models using LightGBM (https://lightgbm.readthedocs.io/en/latest/). We set the number of trees to 500, the number of leaves in each tree to 150, the shrinkage rate to 0.3, the objective to LambdaRank, and the number of early-stopping rounds based on the validation performance to 30. We use default values for the other parameters. The number of clusters n is set to 10.

In this section, we first show the experimental results of the neural models, conduct ablation studies, and analyze what information the context vector captures. Then we show the results of the LambdaMart models with the context information added and give some analysis accordingly.
Overall Performance.
Table 2 shows that CNRM and CLSTM outperform NRM on both versions of the data, i.e., partitioned according to time or to users. This indicates that encoding the ranking features corresponding to users' historical queries and positive documents can benefit the neural ranking model. This information improves search quality not only for the future queries of the same user but also for an unseen user's queries given his/her previous queries. CNRM performs better than CLSTM, showing the superiority of Transformers over recurrent neural networks, similar to the observation in [33].

BM25f and Recency both show significantly degraded performance compared to NRM. This is not surprising, since they are used together with other features in the neural models. Also, Recency outperforms BM25f by a large margin. In contrast to Web search, where relevance usually matters the most, recency plays a more crucial role than relevance in email search. This observation is consistent with the fact that recency has been the criterion for ranking emails in traditional email search engines and for the recency panel in most current email search engines.

NRM has worse NDCG@3,5,10 but better NDCG@1 than the vanilla LambdaMart model on both versions of the data. We have not included the performance of vanilla LambdaMart in Table 2, since Table 2 compares heuristic methods and neural models. NRM has 6.65% better NDCG@1, but 7.99%, 8.90%, and 7.78% worse NDCG@3, 5, and 10, than LambdaMart on the data partitioned according to time. On the data partitioned according to users, NRM has 7.40% better NDCG@1 but 5.02%, 5.42%, and 4.53% worse NDCG@3,5,10 than vanilla LambdaMart. With the same feature set, the advantages of LambdaMart, such as the gradient boosting mechanism, the NDCG-aware loss function, and the ability to handle continuous features, make it hard to beat with a neural model.

Table 2: Comparisons between CNRM and the baselines on the data partitioned according to time and users. The reported numbers are the improvements over NRM. '*' and '†' indicate statistically significant improvements over NRM and CLSTM respectively.

Partition  Method   NDCG@3    NDCG@5    NDCG@10
Time       BM25f    -51.55%   -46.95%   -41.37%
           Recency  -38.32%   -32.27%   -26.69%
           NRM      +0.00%    +0.00%    +0.00%
           CLSTM    +1.29%*   +1.35%*   +1.51%*
           CNRM     +2.52%*†  +2.89%*†  +2.88%*†
Users      BM25f    -51.17%   -46.61%   -41.37%
           Recency  -37.23%   -31.58%   -25.94%
           NRM      +0.00%    +0.00%    +0.00%
           CLSTM    +1.79%*   +2.15%*   +2.28%*
           CNRM     +2.22%*†  +2.60%*†  +2.75%*†

Table 3: Performance improvements of the variants of CNRM over NRM. F_Q, F_D, and F_QD indicate the CNRM with only query, document, and q-d matching features encoded. '*' indicates statistically significant differences from NRM.

Model            NDCG@3   NDCG@5   NDCG@10
Full CNRM        +2.52%*  +2.89%*  +2.88%*
with F_Q         +1.12%*  +1.19%*  +1.11%*
with F_D         +1.71%*  +2.03%*  +2.17%*
with F_QD        +1.95%*  +2.25%*  +2.25%*
with F_D & F_QD  +2.14%*  +2.49%*  +2.56%*
w/o posEmb       +0.72%*  +0.99%*  +1.15%*

Ablation Study.
WWW '21, April 19–23, 2021, Ljubljana, Slovenia

Figure 2: 2-D visualization of context vectors in CNRM by t-SNE [32] with respect to user_type and cluster ID. (a) Context vectors encoded with 𝐹_𝑄𝐷, w.r.t. user type; (b) context vectors encoded with 𝐹_𝐷 and 𝐹_𝑄𝐷, w.r.t. user type; (c) context vectors encoded with 𝐹_𝐷, w.r.t. user type; (d) context vectors encoded with 𝐹_𝐷, w.r.t. cluster ID.

We conduct ablation studies on the data partitioned according to time by keeping only some groups of features, and removing the others, when encoding the context, in order to study their contributions. In other words, during context encoding we only use the query features 𝐹_𝑄, the document features 𝐹_𝐷, the q-d matching features 𝐹_𝑄𝐷, or both 𝐹_𝐷 and 𝐹_𝑄𝐷, corresponding to the historical queries and their associated positive documents. The current query features are also excluded during encoding (see Figure 1) so that the differences between CNRM and NRM only come from the encoding of certain groups of features. As shown in Table 3, each group of features alone makes a contribution, among which the q-d matching features contribute the most and the query features the least. This is consistent with our intuition: query features indicate properties of the query or the user, while document and q-d matching features carry the most critical information for determining whether a document is a target or not, such as recency and relevance. Incorporating both 𝐹_𝐷 and 𝐹_𝑄𝐷 leads to better performance, though still worse than the full CNRM with all the components. We also remove the position embeddings before feeding the sequence to the transformer layers to see whether the ordering of the sequence matters.
Results show that the performance drops significantly without position embeddings, indicating that the ordering of users' historical behaviors is important during encoding. In addition, without unbiased learning, the performance of both NRM and CNRM degrades by about 15% relative to the models using unbiased learning. The improvement percentages of CNRM over NRM remain similar to the numbers shown in Table 3.

Context Vector Analysis.
To see whether the context vector 𝑐(𝑞) (in Figure 1) captures information that can differentiate users, we visualize 𝑐(𝑞) of each query in terms of user type. As mentioned in Section 1, consumer and commercial users have significantly different behavioral patterns. We want to check whether these differences can be revealed when CNRM only encodes document features (𝐹_𝐷), q-d matching features (𝐹_𝑄𝐷), or both of them. Meanwhile, the query features of the current query 𝑞 in Figure 1 are replaced by a trainable vector of dimension 𝑚. This ensures that all query features, which include the user type (consumer or commercial), are excluded during encoding, so 𝑐(𝑞) has not seen the user type in advance.

Based on the data partitioned according to time, we use t-SNE [32], which preserves both the local and global structure of high-dimensional data well when mapping to a low-dimensional space, to map 𝑐(𝑞) to the 2-D plane. We randomly sample 10,000 queries and obtain their corresponding context vectors with the same CNRMs that are evaluated in Table 3. Figures 2a, 2b, and 2c show the 2-D visualization of the context vectors encoded with 𝐹_𝑄𝐷, with both 𝐹_𝑄𝐷 and 𝐹_𝐷, and with 𝐹_𝐷, respectively, in terms of user type.

We observe that by encoding 𝐹_𝑄𝐷 or 𝐹_𝐷 alone, 𝑐(𝑞) can differentiate commercial and consumer users to some extent, putting most consumer users in the right/upper-left part of Figure 2a/2c. With both 𝐹_𝐷 and 𝐹_𝑄𝐷 encoded, 𝑐(𝑞) further pushes consumer users to the right part of Figure 2b. These observations indicate that the context vectors of CNRM can differentiate consumer and commercial users by encoding the document and q-d matching features of their positive historical documents.
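The projection step described above can be sketched with scikit-learn's t-SNE implementation. The vectors below are random stand-ins for the 𝑚 = 64-dimensional context vectors; in the paper they come from the trained CNRM variants, and the original implementation and t-SNE settings are not specified here.

```python
import numpy as np
from sklearn.manifold import TSNE

rng = np.random.default_rng(0)
# Stand-ins for context vectors c(q) of sampled queries (real ones come
# from a trained CNRM, not from a random generator).
context_vectors = rng.normal(size=(200, 64))

# Map to 2-D for plotting; perplexity must stay below the sample count.
points_2d = TSNE(n_components=2, perplexity=30,
                 random_state=0).fit_transform(context_vectors)
print(points_2d.shape)  # (200, 2)
```

Each 2-D point can then be colored by the user type (or cluster ID) of its query to produce plots like Figure 2.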
The two groups of features each have a decent individual effect in capturing these different behavioral patterns, and they are more discriminative when used together. It is worth mentioning that the context vectors capture information beyond whether the user is a commercial or consumer user. This can be verified by the fact that the user type of the current query (𝑞) is already used in the candidate document features for ranking, and the learned context still leads to significantly better performance, as shown in Section 5.1.

In common practice, LambdaMart has been widely used to aggregate multiple features to serve online query requests due to its state-of-the-art performance and high efficiency. So it is important to find out whether the context information learned by CNRM can benefit the LambdaMart model. The baseline LambdaMart model was trained with the same feature set mentioned in Section 3. For each version of CNRM to be evaluated, the context vector of each query was extracted and assigned to one of the 10 clusters; the top-3 features (Recency, EmailLength, and BM25f) are bundled with each cluster and added to LambdaMart as additional features, as stated in Section 3.5. Then a new LambdaMart model is trained and evaluated against the baseline version.
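The cluster-assignment and feature-bundling pipeline can be sketched as follows. This is one plausible realization, not the exact procedure of Section 3.5 (which is not shown here): k-means stands in for the clustering method, and "bundling" is realized as cluster-specific copies of the top-3 feature values that are zero for all other clusters, so the trees can learn per-cluster weights.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
n_clusters = 10

# Stand-ins: context vectors for 500 queries and, per query, the values of
# the top-3 features (Recency, EmailLength, BM25f) of one candidate document.
context_vectors = rng.normal(size=(500, 64))
top3 = rng.normal(size=(500, 3))

# Assign each query's context vector to one of the 10 clusters.
cluster_ids = KMeans(n_clusters=n_clusters, n_init=10,
                     random_state=0).fit_predict(context_vectors)

# Bundling: 3 * n_clusters extra feature slots, where a query's top-3 feature
# values occupy the slots of its own cluster and are zero elsewhere.
extra = np.zeros((500, 3 * n_clusters))
for i, c in enumerate(cluster_ids):
    extra[i, 3 * c: 3 * c + 3] = top3[i]
print(extra.shape)  # (500, 30)
```

The `extra` columns would be appended to the original LambdaMart feature matrix before retraining.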
Overall Performance.
Based on the data partitioned according to time (see Section 4.1), the context information from four versions of CNRM is introduced to the LambdaMart model: CNRM that encodes only document features (𝐹_𝐷), only q-d matching features (𝐹_𝑄𝐷), both 𝐹_𝐷 and 𝐹_𝑄𝐷, and the full information in the context, the same as the settings in our ablation study in Section 5.1. We did not include the context encoded with query features 𝐹_𝑄 alone, since user-level information such as user type is the same for all of a user's queries and contains less extra information than 𝐹_𝐷 and 𝐹_𝑄𝐷.

Since the clustering process forces the representations to degrade from dense to discrete, much information could be lost. In addition, query-level features usually have less effect than document-level or q-d matching features, since they are the same for all the candidate documents under a query. Thus, it is challenging to make the query context work effectively in the LambdaMart model. Nevertheless, as shown in Table 4, the context learned from the CNRM that encodes 𝐹_𝐷 alone leads to significantly better performance than the baseline LambdaMart model, which is also the best performance among all variants.

From Table 4, we observe that the improvements each method obtains in the LambdaMart model do not follow the same trends as their improvements over the baseline NRM shown in Table 3. This is not surprising, since the clustering process could lose some information, and the condensed query cluster information may have uncertain overlap with existing user properties, such as user type and mailbox locale, which already take effect in LambdaMart.

Table 4: Performance improvements of the LambdaMart model with the context cluster information over the baseline version without context, on the data partitioned according to time. '*' indicates statistically significant differences.

Model                      NDCG@3   NDCG@5   NDCG@10
+ Context with 𝐹_𝐷         +0.60%*  +0.49%*  +0.56%*
+ Context with 𝐹_𝑄𝐷        +0.34%   +0.35%   +0.28%
+ Context with 𝐹_𝐷 & 𝐹_𝑄𝐷  +0.33%   +0.41%   +0.30%
+ Context from Full CNRM   +0.37%   +0.20%   +0.18%

Figure 3: Feature weights of the added features (Recency, EmailLength, and BM25f) in the LambdaMart model that uses context encoded with 𝐹_𝐷, in terms of cluster ID.

Feature Weight Analysis. In the LambdaMart models, the top-3 features with the largest feature weights are Recency, BM25f, and EmailLength, which occupy about 24%, 5%, and 4% of the total feature weight, respectively. These features could play different roles for different query clusters. To compare the weights of these 3 features when bundled with each cluster, we plot in Figure 3 the feature weights of the LambdaMart model that has significant improvements over the baseline, i.e., the one incorporating the context encoded with 𝐹_𝐷 alone. Cluster 6 has the largest overall feature weight and a significantly different feature weight distribution than the other clusters. Queries in Cluster 6 emphasize email length the most and recency the least, which indicates that this cluster requires matching patterns different from the global distribution. As shown in Figure 2d, Cluster 6 is located in the lower-right corner of the figure, with a small gap from the other points. This may also indicate that Cluster 6 is quite different from the other clusters.

User Distribution w.r.t. Their Query Clusters.
In this part, we study how users are distributed over the clusters their queries belong to. We aim to answer the following question: do the query contexts associated with the same user concentrate or scatter in the latent space? We use two criteria to analyze the corresponding user distribution: (1) the number of unique clusters a user's queries belong to, and (2) the entropy of the cluster distribution of the queries associated with a user. The entropy criterion differentiates query counts in each cluster, while the cluster-count criterion does not.
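Both criteria can be computed directly from a user's query-to-cluster assignments. A minimal sketch of the entropy criterion (a hypothetical helper of our own, not the authors' code):

```python
import math
from collections import Counter

def cluster_entropy(cluster_ids):
    """Shannon entropy (in bits) of the cluster distribution of one user's queries."""
    counts = Counter(cluster_ids)
    total = len(cluster_ids)
    # sum_c p_c * log2(1 / p_c), with p_c = count_c / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())

# A user whose queries concentrate in one cluster has entropy 0;
# queries spread evenly over 4 clusters give log2(4) = 2 bits.
print(cluster_entropy([2, 2, 2, 2]))  # 0.0
print(cluster_entropy([0, 1, 2, 3]))  # 2.0
```

The cluster-count criterion is simply `len(set(cluster_ids))`, which ignores how many queries fall in each cluster.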
Figure 4: The distribution of users when we put each user into 10 bins according to the number of unique clusters the user's queries belong to and the entropy of the user's query cluster distribution.
Then we divide the possible values of cluster count and entropy into 10 bins and put users into the bins to see the distribution. Figure 4 shows the user distribution in percentage regarding the two criteria, using the CNRM that only encodes 𝐹_𝐷 in the context. We have not observed any pronounced patterns in the user distribution, and the two resulting distributions are similar. Most users have 4 or 5 query clusters (Bin 3 or 4) and fall into Bin 5 or 6 in terms of entropy. Most users do not have concentrated query clusters, indicating that user queries can have dynamic context over time. This makes sense, since the context learned in CNRM is related not only to the users' static properties but also to their issued queries, which can be diversified. We leave the study of extracting users' static properties and the effect of user cluster information as future work.

This paper presents the first comprehensive work on improving email search quality based on ranking features in the user search history. We investigate this problem with both neural models and the state-of-the-art learning-to-rank model, LambdaMart [8]. We leverage the ranking features of users' historical queries and the corresponding positive emails to characterize users, and propose a context-dependent neural ranking model (CNRM) that encodes the sequence of ranking features with a transformer architecture. For the LambdaMart model, we cluster the query context vectors obtained from CNRM and incorporate the cluster information into LambdaMart.
Based on the dataset constructed from the search log of one of the world's largest email search engines, we show that for both the neural models and the LambdaMart model, incorporating this query context can improve search quality significantly.

As a next step, we would like to investigate adding another optimization goal to the current ranking objective to enhance the neural ranker: differentiating queries of the same user from those of other users, given the query context encoded from a user's search history. We are also interested in studying how to incorporate the query context into LambdaMart more effectively, such as investigating other clustering methods, including neural end-to-end clustering approaches, and incorporating the context information differently. As mentioned earlier, extracting users' static properties and using them as query context is another interesting direction.
ACKNOWLEDGMENTS
This work was supported in part by the Center for Intelligent Information Retrieval. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect those of the sponsor.
REFERENCES
[1] Samir AbdelRahman, Basma Hassan, and Reem Bahgat. 2010. A new email retrieval ranking approach. arXiv preprint arXiv:1011.0502 (2010).
[2] Douglas Aberdeen, Ondrey Pacovsky, and Andrew Slater. 2010. The learning behind gmail priority inbox. (2010).
[3] Qingyao Ai, Keping Bi, Jiafeng Guo, and W Bruce Croft. 2018. Learning a Deep Listwise Context Model for Ranking Refinement. arXiv preprint arXiv:1804.05936 (2018), 135–144.
[4] Qingyao Ai, Keping Bi, Cheng Luo, Jiafeng Guo, and W Bruce Croft. 2018. Unbiased Learning to Rank with Unbiased Propensity Estimation. arXiv preprint arXiv:1804.05938 (2018).
[5] Michael Bendersky, Xuanhui Wang, Donald Metzler, and Marc Najork. 2017. Learning from user interactions in personal search via attribute parameterization. In Proceedings of the Tenth ACM International Conference on Web Search and Data Mining. 791–799.
[6] Paul N Bennett, Ryen W White, Wei Chu, Susan T Dumais, Peter Bailey, Fedor Borisyuk, and Xiaoyuan Cui. 2012. Modeling the impact of short- and long-term behavior on search personalization. In Proceedings of the 35th international ACM SIGIR conference on Research and development in information retrieval. 185–194.
[7] Abhijit Bhole and Raghavendra Udupa. 2015. On correcting misspelled queries in email search. In Proceedings of the Twenty-Ninth AAAI Conference on Artificial Intelligence. 4266–4267.
[8] Christopher JC Burges. 2010. From ranknet to lambdarank to lambdamart: An overview. Learning 11, 23-581 (2010), 81.
[9] David Carmel, Guy Halawi, Liane Lewin-Eytan, Yoelle Maarek, and Ariel Raviv. 2015. Rank by time or by relevance? Revisiting email search. In Proceedings of the 24th ACM International on Conference on Information and Knowledge Management. 283–292.
[10] David Carmel, Liane Lewin-Eytan, Alex Libov, Yoelle Maarek, and Ariel Raviv. 2017. Promoting relevant results in time-ranked mail search. In Proceedings of the 26th International Conference on World Wide Web. 1551–1559.
[11] Nick Craswell, Arjen P De Vries, and Ian Soboroff. 2005. Overview of the TREC 2005 Enterprise Track. In Trec, Vol. 5. 1–7.
[12] Nick Craswell, Hugo Zaragoza, and Stephen Robertson. 2005. Microsoft cambridge at trec-14: Enterprise track. (2005).
[13] W Bruce Croft and Xing Wei. 2005. Context-based topic models for query modification. Technical Report. CIIR Technical Report, University of Massachusetts.
[14] Sepp Hochreiter and Jürgen Schmidhuber. 1997. Long short-term memory. Neural computation 9, 8 (1997), 1735–1780.
[15] Saar Kuzi, David Carmel, Alex Libov, and Ariel Raviv. 2017. Query expansion for email search. In Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval. 849–852.
[16] Cheng Li, Mingyang Zhang, Michael Bendersky, Hongbo Deng, Donald Metzler, and Marc Najork. 2019. Multi-view Embedding-based Synonyms for Email Search. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 575–584.
[17] Zhen Liao, Daxin Jiang, Jian Pei, Yalou Huang, Enhong Chen, Huanhuan Cao, and Hang Li. 2013. A vlHMM approach to context-aware search. ACM Transactions on the Web (TWEB) 7, 4 (2013), 1–38.
[18] Sam Lobel, Chunyuan Li, Jianfeng Gao, and Lawrence Carin. 2019. RaCT: Toward Amortized Ranking-Critical Training For Collaborative Filtering. In International Conference on Learning Representations.
[19] Craig Macdonald and Iadh Ounis. 2006. Combining fields in known-item email search. In Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval. 675–676.
[20] Joel Mackenzie, Kshitiz Gupta, Fang Qiao, Ahmed Hassan Awadallah, and Milad Shokouhi. 2019. Exploring user behavior in email re-finding tasks. In The World Wide Web Conference. 1245–1255.
[21] Nicolaas Matthijs and Filip Radlinski. 2011. Personalizing web search using long term browsing history. In Proceedings of the fourth ACM international conference on Web search and data mining. 25–34.
[22] Yu Meng, Maryam Karimzadehgan, Honglei Zhuang, and Donald Metzler. 2020. Separate and Attend in Personal Email Search. In Proceedings of the 13th International Conference on Web Search and Data Mining. 429–437.
[23] Paul Ogilvie and Jamie Callan. 2005. Experiments with Language Models for Known-Item Finding of E-mail Messages. In TREC.
[24] Zhen Qin, Zhongliang Li, Michael Bendersky, and Donald Metzler. 2020. Matching Cross Network for Learning to Rank in Personal Search. In Proceedings of The Web Conference 2020. 2835–2841.
[25] Pranav Ramarao, Suresh Iyengar, Pushkar Chitnis, Raghavendra Udupa, and Balasubramanyan Ashok. 2016. Inlook: Revisiting email search experience. In Proceedings of the 39th International ACM SIGIR conference on Research and Development in Information Retrieval. 1117–1120.
[26] Stephen Robertson, Hugo Zaragoza, and Michael Taylor. 2004. Simple BM25 extension to multiple weighted fields. In Proceedings of the thirteenth ACM international conference on Information and knowledge management. 42–49.
[27] Jiaming Shen, Maryam Karimzadehgan, Michael Bendersky, Zhen Qin, and Donald Metzler. 2018. Multi-task learning for email search ranking with auxiliary query clustering. In Proceedings of the 27th ACM International Conference on Information and Knowledge Management. 2127–2135.
[28] Xuehua Shen, Bin Tan, and ChengXiang Zhai. 2005. Implicit user modeling for personalized search. In Proceedings of the 14th ACM international conference on Information and knowledge management. 824–831.
[29] Ian Soboroff, Arjen P de Vries, and Nick Craswell. 2006. Overview of the TREC 2006 Enterprise Track. In Trec, Vol. 6. 1–20.
[30] Brandon Tran, Maryam Karimzadehgan, Rama Kumar Pasumarthi, Michael Bendersky, and Donald Metzler. 2019. Domain Adaptation for Enterprise Email Search. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval. 25–34.
[31] Yury Ustinovskiy and Pavel Serdyukov. 2013. Personalization of web-search using short-term browsing context. In Proceedings of the 22nd ACM international conference on Information & Knowledge Management. 1979–1988.
[32] Laurens van der Maaten and Geoffrey Hinton. 2008. Visualizing data using t-SNE.
[33] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in neural information processing systems. 5998–6008.
[34] Wouter Weerkamp, Krisztian Balog, and Maarten De Rijke. 2009. Using contextual information to improve search in email archives. In European Conference on Information Retrieval. Springer, 400–411.
[35] Ryen W White, Paul N Bennett, and Susan T Dumais. 2010. Predicting short-term interests using activity-based search context. In Proceedings of the 19th ACM international conference on Information and knowledge management. 1009–1018.
[36] Biao Xiang, Daxin Jiang, Jian Pei, Xiaohui Sun, Enhong Chen, and Hang Li. 2010. Context-aware ranking in web search. In Proceedings of the 33rd international ACM SIGIR conference on Research and development in information retrieval. 451–458.
[37] Sirvan Yahyaei and Christof Monz. 2008. Applying maximum entropy to known-item email retrieval. In European Conference on Information Retrieval. Springer, 406–413.
[38] Hamed Zamani, Michael Bendersky, Xuanhui Wang, and Mingyang Zhang. 2017. Situational context for ranking in personal search. In