Listwise Learning to Rank with Deep Q-Networks

Abhishek Sharma
University of California, Berkeley

February 19, 2020

Abstract
Learning to Rank is the problem of ordering a sequence of documents by their relevance to a given query. Deep Q-Learning has been shown to be a useful method for training an agent in sequential decision making [6]. In this paper, we show that DeepQRank, our deep q-learning to rank agent, demonstrates performance that can be considered state-of-the-art. Though less computationally efficient than a supervised learning approach such as linear regression, our agent has fewer limitations in terms of which formats of data it can use for training and evaluation. We run our algorithm against Microsoft's LETOR listwise dataset [7] and achieve an NDCG@1 (ranking accuracy in the range [0, 1]) of 0.5075, narrowly beating out the leading supervised learning model, SVMRank (0.4958).

1 Introduction

In the Learning to Rank (LTR) problem, we are given a set of queries. Each query is usually accompanied by many (often hundreds of) "documents", items with varying degrees of relevance to the query. These document-query pairs are common in search engine settings such as Google Search, as well as in recommendation engines. The goal of LTR is to return a list of these pairs in which the documents are intelligently ranked by relevance. Most models approximate a relevance function f(X) → Y, where Y is the "relevance score" for document X. This is known as pointwise learning to rank. These models require the dataset to include a predetermined score accompanying each document-query pair. In listwise learning to rank (the focus of this project), the document-query pairs have no target value to predict, but rather a correct order in which the documents are pre-ranked. The model's job is to reconstruct these rankings for future document-query pairs.

Figure 1: Example workflow for RankNet, a neural network for LTR used in Microsoft's Bing search engine
In this work, we propose DeepQRank, a deep q-learning approach to this problem. Related work has shown the effectiveness of representing LTR as a Markov Decision Process (MDP). We build upon this research by applying a new learning algorithm to the application of ranking. In order to apply deep q-learning, we must first represent Learning to Rank as an MDP. Essentially, the agent begins with a randomly ordered list of documents. It selects documents (the action) from this list one by one based on which has the maximum estimated reward. Equipped with this formulation, our agent is ready to rank documents.

In related research [1], teams that formulated ranking as an MDP have outperformed state-of-the-art baselines using policy gradients. Deep Q-Learning can benefit from this MDP representation because the problem of generating a ranking is sequential in nature. Most current methods try to optimize a certain metric over the output list. Known as pointwise methods, these programs generate a list sorted by a predicted value for each element. They require the metric being optimized to be continuous, whereas using a ranking evaluation measure for the reward function allows us to optimize the reward without worrying about differentiability.

With our MDP representation, we propose DeepQRank, a deep q-learning agent which learns to rank. Before we apply deep q-learning, we make a few modifications to the classic algorithm. Our algorithm randomly samples a mini-batch from a given buffer. Next, it computes a target value for the batch based on the actual reward of the action in the batch and the next state. Finally, it updates the target Q-network using this target value. Instead of updating after an arbitrary number of iterations, we introduce a running average (similar to Polyak averaging) to update the target weights after every iteration.
After building and training the agent, we test it against a special benchmarking dataset.
2 Related Work

This project was inspired by the work of a team at the Institute of Computing Technology, Chinese Academy of Sciences [1], as well as a team from Alibaba [3]. Their work on framing learning to rank as a reinforcement learning problem was useful in setting up this experiment. In the first paper [1], the authors express the ranking task as an MDP: each state consists of the set of unranked, remaining documents, while actions involve selecting a single document to add to a ranked list. The second group takes this representation one step further by building a recommendation engine with MDP ranking. Another related project, from Boston University [4], applies a deep neural network to ranking items in a list. This neural network learns a metric function to generate a sortable value for ranking documents in O(n log n) time.

While both the MDP representations and the deep learning approaches performed well on LTR, each has its limitations. The MDP approach isn't able to learn a complex function to represent a document's rank, since it isn't paired with a neural network. Meanwhile, the deep learning approach requires each document to have a "relevance" label in a discrete range. For example, some datasets assign a relevance strength in the range [0, 4] to each document-query pair. This makes the dataset suitable for supervised learning approaches such as the neural network called "FastAP" [4]. We hope to build a method that is advantageous compared to solely using reinforcement learning or deep learning: a target neural network can learn a more complicated relevance function than a method such as a Support Vector Machine.

3 MDP Formulation

In order to apply Deep Q-Learning, we need to express Learning to Rank in the form of an MDP. Here are the definitions for the state, action, transition, reward, and policy [1]:

1. State: A state consists of 3 elements: a timestep t, a set of ranked documents Y (initially empty), and a set of unordered documents X from which we must create our final ranked list. A terminal state is reached when X is empty. An initial state has an empty Y and timestep t = 0.

2. Action:
Our agent can take an action a_t at timestep t. The action consists of selecting a document d from our unranked set X. When the agent performs an action, d is removed from X and added to Y, and the timestep t is incremented by one. In many traditional MDP formulations, the action set is state-independent. For example, in Pac-Man, the action set is consistently [up, down, left, right]. In LTR, this is not the case: the action set depends on which documents remain in the state's unranked list.

3. Transition:
Our transition T maps a state s and an action a to a new state s′. These three elements (s, a, s′) compose a transition.
4. Reward:
Every state-action pair has a corresponding reward. We developed the following formula based on Discounted Cumulative Gain:

r_{s,a} = rank(doc_a) / log₂(t_s + 2)

In this equation, rank(doc_a) gives the rank value (between 1 and |query_s|) of the document selected in action a (higher-ranking documents have stronger relevance to the query), and t_s is the timestep from state s. The timestep is offset inside the denominator's log so that the denominator is nonzero for initial states, which have a timestep of zero. Here, we penalize the selection of high-ranking documents late in the ranking process: in order to maximize reward, the agent has to select the highest-ranking documents as early as possible. This reward function is one area that warrants further exploration. Since there are many other metrics for measuring ranking accuracy, such as Kendall's Tau and the Spearman rank correlation, it may be of future interest to experiment with varying reward functions to see how they affect the agent's accuracy.

5. Policy:
Our policy P : S → A maps any given state to an action. The agent runs its neural network on each document in the state's remaining list, then returns the document with the highest reward as estimated by the network.

Ideas for this representation are heavily influenced by the work of Wei et al. [1]. Some modifications were made to include the ranked list in the state representation, as well as a larger scale of rank values in the reward function. Additionally, from an engineering standpoint, the state representation was modified to include the query id, which is useful when training the agent on multiple queries at once.

Under the MDP formulation, here's how a trained RL agent would rank a set of documents X based on their relevance to a query q.

Algorithm 1: GetRanking Function for DeepQRank
Input: Trained agent A, unranked list X
Result: A ranked list Y
  set Y = [];
  set timestep t = 0;
  set current state S = State(t, Y, X);
  while length of X > 0 do
    run the forward pass of A's model on the current state and every document in X;
    remove the document X_i with the highest output from X and add it to Y;
    t ← t + 1;
    set current state S = State(t, Y, X);
  end
  return Y;

The main change to the deep q-learning algorithm is the use of Polyak averaging. While the classic deep q-learning algorithm updates the target network parameters by copying them from the current iteration every N steps, our algorithm updates them slightly at every step with the following formula:

φ′ ← τφ′ + (1 − τ)φ,

where φ′ is the target network's parameters, τ is a chosen constant (0.999 works well in practice), and φ is the network's parameters at the end of the current step.
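The greedy ranking loop of Algorithm 1 and the Polyak update can be sketched in Python. The `score` callable stands in for the agent's Q-network forward pass and is a hypothetical placeholder, as is the flat-list parameter representation in `polyak_update`; this is a sketch of the control flow, not the paper's implementation.

```python
def get_ranking(score, X):
    """Greedy ranking loop from Algorithm 1.

    score(t, ranked, doc) is a placeholder for the agent's Q-network
    forward pass: it returns an estimated reward for selecting `doc`
    at timestep t given the partial ranking `ranked`.
    """
    X = list(X)   # remaining unranked documents
    Y = []        # ranked list being built
    t = 0
    while X:
        # Pick the document with the highest estimated reward.
        best = max(X, key=lambda doc: score(t, Y, doc))
        X.remove(best)
        Y.append(best)
        t += 1
    return Y


def polyak_update(phi_target, phi, tau=0.999):
    """Soft target update: phi' <- tau * phi' + (1 - tau) * phi,
    applied elementwise to a flat list of parameters."""
    return [tau * pt + (1 - tau) * p for pt, p in zip(phi_target, phi)]
```

With a toy scorer that just returns the document's own value, `get_ranking` sorts documents in descending order of estimated reward, one selection per step.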
The previously stated modifications to the agent and underlying model result in the following modified learningalgorithm for our Deep Q-Network:
Algorithm 2: Deep Q-Learning to Rank Algorithm
Input: Number of steps S, buffer B
Result: A trained Q-network with parameters φ′
  set E ← EpisodeSample() (Algorithm 3);
  initialize φ′;
  int i = 0;
  while i < S do
    sample minibatch mb_i from B uniformly;
    compute y_i ← r_i + γ max_{a′_i} Q_{φ′}(s′_i, a′_i);
    φ ← φ − α Σ_i (dQ_φ/dφ)(s_i, a_i) (Q_φ(s_i, a_i) − y_i);
    φ′ ← τφ′ + (1 − τ)φ;
    i ← i + 1;
  end
  return φ′;

We implemented this algorithm in Python (see appendix) and observed promising results with the LETOR dataset. Here are the specs for the DeepQRank agent:

• Target network architecture: our neural network is fully connected with the following layer sizes: 47 (input layer), 32 (hidden layer), 16 (hidden layer), 1 (output layer).

Figure 2: Feedforward Neural Network Architecture for our Agent

• Learning rate α: ∗ −
• Discount factor γ: 0.99
• Polyak averaging factor τ: 0.999

With most supervised learning methods for LTR, the dataset includes a "relevance" label which the model tries to predict. For example, a subset of the LETOR dataset includes labels in the range [0, 4]. A classifier can be trained to classify a document into one of these 5 classes using the document's features. Unfortunately, classifiers like these are not compatible with datasets that present documents in order of relevance (without specific relevance labels on the [0, 4] scale). This makes DeepQRank more suitable for listwise learning to rank, in which the agent learns from a ranked list rather than a set of target relevance labels.
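The inner loop of Algorithm 2 can be sketched as follows. To keep the update rule visible, this sketch substitutes a tabular Q-function for the paper's 47-32-16-1 neural network; the transition-tuple format, default hyperparameters, and `done` flag are illustrative assumptions.

```python
import random

def train_step(q, q_target, buffer, actions,
               gamma=0.99, alpha=0.1, tau=0.999, batch_size=4):
    """One iteration of Algorithm 2 with a tabular Q in place of the
    paper's neural network.

    q, q_target: dicts mapping (state, action) -> estimated value.
    buffer: list of (s, a, s_next, r, done) transition tuples.
    actions: candidate actions available in s_next.
    """
    minibatch = random.sample(buffer, min(batch_size, len(buffer)))
    for s, a, s_next, r, done in minibatch:
        # Target value: y_i = r_i + gamma * max_a' Q_target(s', a')
        y = r if done else r + gamma * max(
            q_target.get((s_next, a2), 0.0) for a2 in actions)
        # Gradient step on (Q(s, a) - y)^2; for a table the "gradient"
        # is just the TD error on the visited entry.
        q[(s, a)] = q.get((s, a), 0.0) - alpha * (q.get((s, a), 0.0) - y)
    # Polyak averaging: phi' <- tau * phi' + (1 - tau) * phi
    for k, v in q.items():
        q_target[k] = tau * q_target.get(k, 0.0) + (1 - tau) * v
    return q, q_target
```

A single terminal transition with reward 1 nudges Q(s, a) toward 1 by a factor of α, while the target table moves only by (1 − τ) of that change, illustrating why the target network lags behind the online network.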
While the neural network approach in "Deep Metric Learning to Rank" can rank a list in O(n log n) time, our algorithm takes at least O(n²) time. This is because that neural network can compute the "metric" measure for every document in a single forward pass and then simply sort the documents by this metric with an algorithm such as MergeSort, whereas DeepQRank computes a forward pass on the entire batch of remaining documents each time it picks the next document to add to its running ranked list. Both runtime analyses assume roughly equivalent neural network forward-pass times, so that element is simplified to O(1).

We used the LETOR listwise ranking dataset for this project. In this dataset, each row represents a query-document pair. The headers for the dataset consist of the following: query id, rank label, feature_1, feature_2, ..., feature_46. Query ID identifies which query was requested for a given pair. The rank label corresponds to a document's relevance for the particular query in its pair; a higher rank label signifies stronger relevance. The maximum value of the rank label depends on the size of the query. Therefore, the challenge here is in predicting relative relevance values for ranking, rather than turning this into a regression problem for predicting relevance.

One advantage this dataset has over traditional learning to rank datasets is that it is "fully ranked": for a given query, the dataset provides an exact ranking of all of the corresponding documents. Most datasets for supervised learning to rank generalize the ranking with provided "target values" that come in small ranges. This modification caters to approaches such as linear regression. For example, a larger dataset in the LETOR collection includes query-document pairs but substitutes a "relevance label" for a rank. This relevance label is in the range [0, 4]. For a query with many documents, this means there are multiple "correct" rankings, as two documents a, b with the same relevance label can be ordered either way and both be considered correct. While these datasets have been produced to accommodate supervised learning approaches to ranking such as multivariate regression and decision trees, they don't paint the full picture of a definitive ranking the way the LETOR listwise dataset does.

Before running and evaluating our algorithm, we needed to set up the buffer. Buffer collection requires an algorithm of its own because of the nature of the dataset we are working with.
Algorithm 3: LETOR Episode Sample for a Single Query Q
Result: E, an "episode" consisting of M tuples (s_t, a_t, s_{t+1}, r_{t+1}) for a length-M query
  set E ← [];
  dataset D (contains rows with document-query pair information);
  string Q ← random query id from D;
  set X ← all rows of D with query id = Q;
  set Y ← [];
  int t ← 0;
  set initial state S ← (X, Y, t);
  while X is not empty do
    save oldState ← S;
    row R ← pop a random row from X;
    action a_t ← document id of row R;
    append R to Y;
    t ← t + 1;
    reward r_t ← r_{S,a_t} (defined in section 3);
    update S ← (X, Y, t);
    append (oldState, a_t, S, r_t) to E;
  end
  return E;
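Algorithm 3 can be sketched in Python as follows. The `(doc_id, rank_label)` row format is an illustrative assumption rather than the exact LETOR schema, and the reward discounts the rank label by the log of the selection time; the exact offset inside the log is an assumption.

```python
import math
import random

def sample_episode(rows):
    """Build one training episode from the rows of a single query
    (Algorithm 3). Each row is assumed to be a (doc_id, rank_label)
    pair; states are (remaining, ranked, t) snapshots.
    """
    X = list(rows)   # unranked rows for this query
    Y = []           # ranked rows selected so far
    t = 0
    episode = []
    state = (tuple(X), tuple(Y), t)
    while X:
        old_state = state
        row = X.pop(random.randrange(len(X)))   # pop a random row
        doc_id, rank_label = row
        Y.append(row)
        t += 1
        # Reward: rank value discounted by when the document was picked.
        # The old state's timestep is t - 1, shifted so the log is nonzero.
        reward = rank_label / math.log2((t - 1) + 2)
        state = (tuple(X), tuple(Y), t)
        episode.append((old_state, doc_id, state, reward))
    return episode
```

Because rows are popped at random rather than by policy, one call yields one of the M! possible selection orders, which is what gives the buffer its variety of state-action-reward tuples.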
We tracked 2 performance-related variables over time: the loss of our agent's target network and the validation accuracy, measured using the NDCG@1 metric on an isolated validation set.

1.
Model Loss: mean squared error loss for the agent's neural network. This is the first measure we observed over time to ensure our model was improving.

Figure 3: Raw mean squared error over time for our target network.

At first glance, this plot paints the picture of a stagnated model that isn't learning. After manipulating the data, we're able to better visualize the model's behavior by observing a moving average of log loss (the log is applied to smooth out the large range of values) instead of plain MSE loss.

Figure 4: A moving average of log loss over time.

After initially jumping upward, the network settles into a stable learning pattern that slowly minimizes loss. We believe the initial jump is due to an outlier which skews the moving average within the first 10 iterations of training. The validation loss and training loss both improve at similar rates, which is evidence that the model is not overfitting on the training data. One possible explanation for training loss being higher than validation loss is an outlier in the training set which spikes the training-loss moving average upward and thereby affects every moving average after it. The important attribute to notice here is that both steadily decrease over time, providing evidence that the model is learning from the samples in the training buffer.
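The smoothing used for Figure 4 can be sketched as a trailing moving average over log-transformed losses; the window size here is an assumption, not a value stated in the paper.

```python
import math

def moving_avg_log_loss(losses, window=10):
    """Log-transform raw MSE losses, then smooth them with a trailing
    moving average, as described for the Figure 4 curve."""
    logs = [math.log(l) for l in losses]
    out = []
    for i in range(len(logs)):
        start = max(0, i - window + 1)   # trailing window, clipped at 0
        chunk = logs[start:i + 1]
        out.append(sum(chunk) / len(chunk))
    return out
```

Because the window is trailing, a single early outlier stays inside the average for `window` steps, which is consistent with the outlier explanation above.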
2. NDCG@k: Normalized Discounted Cumulative Gain at position k. This was the main metric used to evaluate the agent's ranking performance, and it was referenced in our definition of the reward. With the following definitions for DCG_k and IDCG_k:

DCG_k = Σ_{i=1}^{k} (2^{rel_i} − 1) / log₂(i + 1),   IDCG_k = Σ_{i=1}^{|REL_k|} (2^{rel_i} − 1) / log₂(i + 1)

NDCG@k is calculated as follows:

NDCG_k = DCG_k / IDCG_k

A ranking with perfect accuracy scores an NDCG@k of 1, while a very weak ranking scores close to 0; as the comparison table below shows, well-trained models tend to score roughly between 0.4 and 0.5 on NDCG@1. For benchmarking purposes, we specifically computed NDCG@1.

Here's how average NDCG@1 on our validation set improved over time:

Figure 5: Our primary ranking accuracy metric, NDCG@1, over time.

As the graph shows, the first few iterations sparked a drastic improvement. After about 50 iterations (x = 10 in the plot), we see stabilized improvement over time. By the end of 150 iterations, the agent scored an NDCG@1 that beat out many state-of-the-art techniques.

At the end of our training period, we ran a final evaluation of the DeepQRank agent on our test set. This returned a mean NDCG@1 value of 0.5075.

NDCG@1 for various ranking models:
RankSVM        0.4958
ListNet        0.4002
AdaRank-MAP    0.3821
AdaRank-NDCG   0.3876
SVMMAP         0.3853
RankNet        0.4790
MDPRank        0.4061
DeepQRank      0.5075
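The NDCG@k definition above can be computed directly from a list of relevance labels in predicted order; a minimal sketch:

```python
import math

def dcg(rels, k):
    """DCG_k = sum_{i=1}^{k} (2^{rel_i} - 1) / log2(i + 1)."""
    return sum((2 ** r - 1) / math.log2(i + 1)
               for i, r in enumerate(rels[:k], start=1))

def ndcg(rels, k):
    """NDCG@k: DCG of the predicted order divided by the DCG of the
    ideal (descending-relevance) order of the same labels."""
    ideal = dcg(sorted(rels, reverse=True), k)
    return dcg(rels, k) / ideal if ideal > 0 else 0.0
```

A list already sorted by descending relevance scores exactly 1.0, and NDCG@1 in particular only rewards placing a maximally relevant document first.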
While all other models score below 0.5 on the NDCG@1 measure, DeepQRank is the only one which scores above 0.5. Though this is a preliminary experiment, these results suggest that DeepQRank is a promising method which should be investigated further to verify its effectiveness.
We speculate that RankSVM is limited in its ranking ability because it cannot approximate the relevance function as well as a deep neural network can. In its linear form, it is even more limited, since it cannot approximate a nonlinear relevance function.
In summation, we modified deep q-learning by customizing the reward function and neural network architecture, and introduced a Polyak-averaging-like update in the training phase. Our experiment saw successful results when measured with the Normalized Discounted Cumulative Gain metric. Based on the outcome of this investigation, we are enthusiastic about further researching and improving this learning to rank method.

For future work, there are quite a few potential avenues for replicating and strengthening the results of this investigation. First, it may be useful to incorporate more datasets and evaluate performance against official benchmarks recorded for competing ranking methods. For instance, the Yahoo Webscope dataset, though it only provides relevance labels in a small discrete range, contains 700 features per document-query pair, which may improve the performance of our deep q-network. We may also want to change the architecture of the target network itself; adding more neurons or hidden layers, or introducing modern features such as dropout, might improve the neural network's loss. Lastly, many more deep reinforcement learning methods have yet to be applied to the MDP representation of this problem; to our knowledge, policy gradients and deep q-learning are the only such methods tested so far.

It's likely that the formulas in this paper can be tuned to improve performance. For example, the reward function, based on Discounted Cumulative Gain, may be modified to yield better results. The log denominator is one variable that could be modified to change the distribution of the observed reward.

References

[1] Wei, Z., Xu, J., Lan, Y., Guo, J., & Cheng, X. (2017, August). Reinforcement learning to rank with Markov decision process. In
Proceedings of the 40th International ACM SIGIR Conference on Research and Development in Information Retrieval (pp. 945-948). ACM.

[2] Pang, L., Lan, Y., Guo, J., Xu, J., Xu, J., & Cheng, X. (2017, November). DeepRank: A new deep architecture for relevance ranking in information retrieval. In Proceedings of the 2017 ACM on Conference on Information and Knowledge Management (pp. 257-266). ACM.

[3] Hu, Y., Da, Q., Zeng, A., Yu, Y., & Xu, Y. (2018, July). Reinforcement learning to rank in e-commerce search engine: Formalization, analysis, and application. In Proceedings of the 24th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining (pp. 368-377). ACM.

[4] Cakir, F., He, K., Xia, X., Kulis, B., & Sclaroff, S. (2019). Deep metric learning to rank. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 1861-1870).

[5] Chapelle, O., & Chang, Y. (2011, January). Yahoo! Learning to Rank Challenge overview. In Proceedings of the Learning to Rank Challenge (pp. 1-24).

[6] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., ... & Petersen, S. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540), 529.

[7] Qin, T., Liu, T. Y., Xu, J., & Li, H. (2010). LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4), 346-374.

[8] Liu, T. Y. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3), 225-331.

[9] Image from Catarina Moreira's website. http://web.ist.utl.pt/ ∼∼