Neural Feature Selection for Learning to Rank

Alberto Purpura*, Karolina Buchner, Gianmaria Silvello**, and Gian Antonio Susto

University of Padua, {purpuraa, silvello, sustogia}@dei.unipd.it
Apple, [email protected]

* Work done as part of an Apple internship.
** Work supported by the ExaMode project, as part of the European Union Horizon 2020 program under Grant Agreement no. 825292.
Abstract.
LEarning TO Rank (LETOR) is a research area in the field of Information Retrieval (IR) where machine learning models are employed to rank a set of items. In the past few years, neural LETOR approaches have become a competitive alternative to traditional ones like LambdaMART. However, neural architectures' performance grew proportionally to their complexity and size. This can be an obstacle for their adoption in large-scale search systems, where model size impacts latency and update time. For this reason, we propose an architecture-agnostic approach, based on a neural LETOR model, to reduce the size of its input by up to 60% without affecting the system performance. This approach also allows us to reduce a LETOR model's complexity and, therefore, its training and inference time by up to 50%.
Keywords:
Learning to Rank · Feature Selection · Deep Learning
Introduction

LEarning TO Rank (LETOR) is a research area in the field of Information Retrieval (IR) where machine learning techniques are applied to the task of ranking a set of items [10]. The input to a LETOR system is a set of real-valued vectors representing the items to be ranked – in decreasing order of relevance – in response to a certain user query. The output of such systems is usually a set of relevance scores – one for each item in input – which estimate the relevance of each item and are used to rank them. In recent years, the attention on neural approaches for this task has grown proportionally to their performance, starting from [2], where the authors propose to employ a recurrent neural layer to model list-wise document interactions, up to [12], where the now popular self-attention transformer architecture is used. Moreover, the performance of neural models [12, 20] recently became competitive with approaches such as LambdaMART [4], which is often one of the first choices for LETOR tasks. However, neural models' performance grew at the expense of their complexity, and this hampers their application in large-scale search systems. Indeed, in such contexts, model latency and update time are as important as model performance. Reducing the input size
can help decrease model architectural complexity, the number of parameters, and consequently training and inference time. Also, previous works [5, 6, 8] showed that the document representations used for LETOR can sometimes be redundant and can often be reduced [6] without impacting the ranking performance.

Existing feature selection approaches can be organized into three main groups: filter, embedded, and wrapper methods [6]. Filter methods, such as the Greedy Search Algorithm (GAS) [5], compute one score for each feature – independently from the LETOR model that is going to be used afterwards – and select the top ones according to it. In GAS, the authors minimize feature similarity (Kendall Tau) and maximize feature importance: they rank the input items using only one of the features at a time and take the resulting MAP or nDCG@k value as the importance score. Embedded approaches, such as the one presented in [15], incorporate the feature selection process in the model. In [15], the authors propose to apply different types of regularization – such as L1 norm regularization – to the weights of a neural LETOR model to reduce redundancy in the hidden representations of the model and improve its performance. Finally, wrapper methods, such as the ones presented in [6] and the proposed approach, rely on a LETOR model to estimate feature importance and then perform a selection.

We reimplemented the two best-performing approaches proposed in [6] and consider them as our baselines: the eXtended naive Greedy search Algorithm for feature Selection (XGAS) – which relies on LambdaMART to estimate feature relevance – and the Hierarchical agglomerative Clustering Algorithm for feature Selection (HCAS) employing single linkage [7] – which relies on Spearman's correlation coefficient between feature pairs as a proxy for feature importance. To the best of our knowledge, our approach is the first feature selection technique for LETOR specifically targeted to neural models. The main contributions of this paper are the following:

– we propose an architecture-agnostic Neural Feature Selection (NFS) approach which uses a neural LETOR model to estimate feature importance;
– we evaluate the quality of our approach on two public LETOR collections;
– we confirm the robustness of the extracted feature set by evaluating the performance of the proposed neural reranker and of a LambdaMART model using subsets of features of different sizes computed with the proposed approach.

Our experimental results show that the document representations used for LETOR can sometimes be redundant and reduced to as little as 40% of the total [6] without impacting the ranking performance. We purposely omit a comparison with other dimensionality reduction approaches such as PCA, since these methods often compute a combination of the features to reduce the representation size, which is beyond the scope of this paper.

Proposed Approach

The proposed Neural Feature Selection (NFS) approach is organized in the following three steps. First, we train a neural model for the LETOR task, i.e. to compute a relevance score for each item in the input set, to be used to rank it. Second, we use the trained model to extract the most significant feature groups considered by the model when ranking each item. Finally, we perform feature selection using the previously computed feature information.
Neural Model Training.
The NFS model architecture is composed of n self-attention layers [19], followed by two fully-connected layers. We train this model using the ApproxNDCG loss [3]. Before feeding the document vectors to the self-attention layer, we apply the same feature transformation strategy described in [20]: three different transformations are applied to each feature in the input data and then combined through a weighted sum, where the weights for each transformation are learned by the model, so that the best transformation strategy for each feature can be used each time. The model architecture is depicted in Figure 1. We also apply batch normalization to the input of each feed-forward layer and dropout on the output of each hidden layer. Note that, since our approach to feature selection is architecture-agnostic, we can easily make changes to this neural architecture without impacting the following steps for feature selection.

Fig. 1. Architecture of the neural model employed in our evaluation: a feature transformation layer applied to the input items, followed by a self-attention layer, a feed-forward hidden layer, and the output layer producing one score per item.
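To make the architecture concrete, the following is a minimal PyTorch sketch of such a ranker. The three per-feature transformations (identity, log, and reciprocal below) and the default hyper-parameter values are illustrative assumptions rather than the exact choices of [20], and the ApproxNDCG training loss is not shown.

    import torch
    import torch.nn as nn


    class NFSRanker(nn.Module):
        """Sketch of the listwise neural ranker: per-feature transformations,
        self-attention over the candidate list, and two fully-connected layers."""

        def __init__(self, n_features, n_heads=4, hidden=128, n_layers=1, dropout=0.1):
            super().__init__()
            # learned mixing weights over the three per-feature transformations
            self.mix = nn.Parameter(torch.zeros(n_features, 3))
            # n_features must be divisible by n_heads
            layer = nn.TransformerEncoderLayer(
                d_model=n_features, nhead=n_heads, dim_feedforward=hidden,
                dropout=dropout, batch_first=True)
            self.attn = nn.TransformerEncoder(layer, num_layers=n_layers)
            self.bn1 = nn.BatchNorm1d(n_features)   # batch norm on the input of each FF layer
            self.ff1 = nn.Linear(n_features, hidden)
            self.bn2 = nn.BatchNorm1d(hidden)
            self.out = nn.Linear(hidden, 1)
            self.drop = nn.Dropout(dropout)         # dropout on the output of each hidden layer

        def transform(self, x):
            # weighted sum of three transformations of every feature (illustrative choices)
            t = torch.stack([x, torch.log1p(x.clamp(min=0.0)), 1.0 / (1.0 + x.abs())], dim=-1)
            return (t * torch.softmax(self.mix, dim=-1)).sum(dim=-1)

        def forward(self, x):
            # x: (n_queries, list_len, n_features) -> one relevance score per item
            h = self.drop(self.attn(self.transform(x)))
            n_queries, list_len, n_features = h.shape
            h = self.drop(torch.relu(self.ff1(self.bn1(h.reshape(-1, n_features)))))
            scores = self.out(self.bn2(h))
            return scores.reshape(n_queries, list_len)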
Feature Groups Mining.
At this step, we use the model trained in the previous step to select the most important features used to rank each item in our training data. To do so, we compute the saliency map – a popular approach in the computer vision field to understand model predictions [1, 16, 17] – i.e. the gradient w.r.t. each input item feature, corresponding to each item in the training dataset. We then apply min-max normalization on each saliency map M_i to map the values in each vector to the same range [0, 1]. Afterwards, we select from each saliency map the groups of features g which have a saliency score higher than a threshold t. The set of feature groups G extracted at this step are the most significant feature sets that our neural model learned to rely on to compute the relevance score of each item. These features, however, might not be the same for any possible input instance and – as also pointed out in [1] – saliency maps can often be noisy and not always represent the behavior of a neural model. For this reason, we propose to apply a further selection step to prune less reliable feature groups, similarly to what is proposed in [18], where the authors compute the statistical significance of groups of items by comparing their frequency of occurrence in real data to the one in randomly generated datasets.
We compute K random sets of saliency maps, each with the same cardinality as the experimental dataset employed. For example, if a dataset contains N queries, each with R documents to be ranked, then we generate K random datasets, each containing N × R saliency maps. Then, we apply the same feature group extraction process on the random saliency maps and compute K different sets of feature groups. The random saliency maps are computed by sampling values from a uniform distribution with support [0, 1]. According to this modeling strategy, each feature can be considered as salient in the current random saliency map with probability 1 − t, where t is the threshold we used in the previous step to select salient features. Once we have computed these K sets of random feature groups Ĝ_k, we use their frequency to prune the original ones. In particular, we consider the frequency f_{g_i} of group g_i ∈ G and compare it to its frequency f_{g_i,k} in each of the K random datasets – the frequency f_{g_i,k} might also be 0 if the feature group g_i does not appear in the random dataset k. If f_{g_i} ≤ f_{g_i,k} in more than 2% of the randomly generated feature groups Ĝ_k, we discard feature group g_i, considering it as noise.
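A compact sketch of this mining and pruning procedure, assuming a trained PyTorch model like the one above and NumPy, could look as follows. The gradient of the summed scores is used as a simplification of per-item score gradients, and the default values of t and K follow the experimental setup reported later; the loop over K random datasets is kept naive for clarity.

    import numpy as np
    import torch
    from collections import Counter


    def saliency_maps(model, X):
        """|d score / d feature| for every item, min-max normalised per item.

        X: float tensor of shape (n_queries, list_len, n_features).
        Returns an array of shape (n_queries * list_len, n_features) with values in [0, 1].
        """
        model.eval()
        X = X.clone().requires_grad_(True)
        model(X).sum().backward()                    # summed scores: a simplification
        g = X.grad.abs().reshape(-1, X.shape[-1])
        lo = g.min(dim=1, keepdim=True).values
        hi = g.max(dim=1, keepdim=True).values
        return ((g - lo) / (hi - lo + 1e-12)).cpu().numpy()


    def mine_groups(maps, t=0.95):
        """One feature group per saliency map: the features whose saliency exceeds t."""
        return Counter(g for g in (frozenset(np.flatnonzero(m > t)) for m in maps) if g)


    def prune_groups(real_counts, n_items, n_features, t=0.95, K=5000, tol=0.02, seed=0):
        """Discard a group whose real-data frequency does not exceed its frequency
        in more than `tol` of K random datasets of uniform [0, 1] saliency maps."""
        rng = np.random.default_rng(seed)
        losses = Counter()
        for _ in range(K):
            rand_counts = mine_groups(rng.random((n_items, n_features)), t)
            for g, f in real_counts.items():
                if f <= rand_counts.get(g, 0):
                    losses[g] += 1
        return Counter({g: f for g, f in real_counts.items() if losses[g] <= tol * K})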
Feature Selection.

In this final step, we rely on the feature groups extracted in the previous step and their frequency in the saliency maps to compute a feature similarity matrix, which we then use to perform feature selection. Each feature pair similarity value is computed by counting the times the two features appear in the same feature group and normalizing that score by the total number of groups where that feature appears. Finally, we rely on this similarity matrix to perform hierarchical clustering as done in [6], using the number of clusters as the stopping criterion for the single-linkage hierarchical clustering algorithm. The final set of features to keep is obtained by selecting, from each feature cluster, the feature occurring most frequently in the previously computed feature groups.
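The clustering and selection step could be sketched as follows with SciPy's single-linkage implementation. Here group_counts is assumed to be the pruned Counter from the previous sketch, and the per-pair normalisation (by the larger of the two features' group counts) is one plausible reading of the description above.

    import numpy as np
    from scipy.cluster.hierarchy import linkage, fcluster


    def select_features(group_counts, n_features, n_keep):
        """Pick `n_keep` features from the pruned feature groups.

        group_counts: Counter mapping frozenset(feature indices) -> frequency.
        """
        occ = np.zeros(n_features)                  # how often each feature appears in a group
        co = np.zeros((n_features, n_features))     # how often each pair co-occurs in a group
        for group, freq in group_counts.items():
            idx = sorted(group)
            for i in idx:
                occ[i] += freq
            for a in range(len(idx)):
                for b in range(a + 1, len(idx)):
                    co[idx[a], idx[b]] += freq
                    co[idx[b], idx[a]] += freq
        # pairwise similarity: co-occurrences normalised by per-feature group counts
        sim = co / np.maximum(np.maximum.outer(occ, occ), 1.0)
        dist = 1.0 - sim
        np.fill_diagonal(dist, 0.0)
        # single-linkage hierarchical clustering, stopped at n_keep clusters
        condensed = dist[np.triu_indices(n_features, k=1)]
        labels = fcluster(linkage(condensed, method="single"), t=n_keep, criterion="maxclust")
        # keep the most frequently occurring feature of each cluster
        selected = []
        for c in np.unique(labels):
            members = np.flatnonzero(labels == c)
            selected.append(int(members[np.argmax(occ[members])]))
        return sorted(selected)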
Experimental Setup

We evaluate our approach on the first fold of the MSLR-WEB30K [13] dataset and on the whole OHSUMED [14] dataset, where the items to rank are represented by 136 and 45 features, respectively. We use the LambdaMART implementation available in the LightGBM library [9] (https://github.com/microsoft/LightGBM) and train and test the proposed neural model considering only the top 128 results returned by LambdaMART; this value was set empirically to yield a reasonable number of feature groups for the following feature extraction step. We tuned the LightGBM model parameters on the validation sets of both datasets, optimizing the ndcg@3 metric: we set the learning rate to 0.05, the number of leaves to 200, and the number of trees to 1000 (500) on the MSLR-WEB30K (OHSUMED) collection. The proposed neural reranking model is trained for 500 epochs – 100 epochs on the OHSUMED dataset – with batch size 128, using the Adam optimizer and a learning rate of 0.0005. We use a feature embedding size of 128 in the feature transformation layer on the MSLR-WEB30K dataset – while we remove this layer in the experiments on the OHSUMED collection, whose much smaller size and number of features limited its benefits – 4 self-attention heads on MSLR-WEB30K and 1 on OHSUMED, and a hidden size of 128 for the hidden feed-forward layer. Since each attention head has an output size equal to the total number of features divided by the number of attention heads, to compute the results reported in Table 1 we reduce the number of attention heads to 1 when using 5% and 10% of all the available features (6 and 13 features, respectively), we use 4 attention heads when considering 30% (27 features), and 3 when using 40% (54 features). The batch normalization momentum we use is 0.4, and we apply dropout on the output of each hidden layer as described above. In the feature groups mining step, we generate 5000 random datasets, and the threshold t used to extract the feature groups is empirically set to 0.95. For the evaluation of the approach we consider the nDCG@3 measure; similar results are obtained with nDCG at different cutoffs.
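For reference, the LambdaMART configuration described above can be set up with the LightGBM scikit-learn interface roughly as follows; the function name and data layout are illustrative.

    import lightgbm as lgb


    def train_lambdamart(X_train, y_train, group_train, X_valid, y_valid, group_valid):
        """LambdaMART with the hyper-parameters reported above (MSLR-WEB30K setting;
        on OHSUMED the number of trees drops to 500). The group_* arrays follow the
        usual LightGBM ranking layout: the number of documents of each query."""
        ranker = lgb.LGBMRanker(
            objective="lambdarank",
            learning_rate=0.05,
            num_leaves=200,
            n_estimators=1000,
        )
        ranker.fit(
            X_train, y_train, group=group_train,
            eval_set=[(X_valid, y_valid)], eval_group=[group_valid],
            eval_at=[3],  # models were tuned on ndcg@3
        )
        return ranker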
Results

In Table 1, we report the results of our experiments on the MSLR-WEB30K dataset. We trained both a LambdaMART model and the proposed neural reranking one on different subsets of features of increasing size. From these experiments, we observe that the proposed Neural Feature Selection (NFS) approach always outperforms all the other baselines when the selected features are used to train a LambdaMART model, and in most of the cases when used with the proposed neural model.

Table 1. Evaluation of the proposed Neural Feature Selection (NFS) approach on the MSLR-WEB30K dataset. We report the ndcg@3 values obtained by LambdaMART and the proposed Neural Reranking model employing different subsets of features.

Features Perc. | LambdaMART: XGAS | HCAS (single) | NFS | Neural Reranker: XGAS | HCAS (single) | NFS
5%             | 0.3580           | 0.3589        | ... | ...                   | ...           | ...
20%            | 0.3781           | ...           | ... | ...                   | ...           | ...

The evaluation results on the OHSUMED dataset reported in Table 2 are computed as in the previous case. Here, we consider 60%, 70%, 80%, and 90% of the total features in the collection, since the total number of features is much smaller than in the previous dataset. In our evaluation, NFS outperforms HCAS in the majority of the cases, even though the latter approach is slightly more competitive than before.

The main advantage of using a subset of features to represent the inputs to a neural model is that we can reduce the model complexity. We observe this effect mainly when our data is represented by a large number of features, as in
the MSLR-WEB30K collection. For example, when using 40% of the features of the dataset, the number of attention heads in our model was reduced from 4 to 3 and, since we were considering only 54 out of 136 features, the number of parameters of the self-attention heads – the first layer of our model – was also reduced. As a consequence, training time was halved and inference time also decreased.
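As a rough illustration of this effect, under the sketch architecture shown earlier (where the attention width equals the number of input features), the projection matrices of the self-attention layer alone shrink to roughly 16% of their original size when the input goes from 136 to 54 features. The calculation below refers to that assumed layout, not to the exact parameter counts of the model used in the experiments.

    def attention_projection_params(n_features: int) -> int:
        # weight matrices of the query, key, value and output projections of one
        # self-attention layer, assuming the attention width equals the number of
        # input features (biases ignored)
        return 4 * n_features * n_features

    full = attention_projection_params(136)     # all MSLR-WEB30K features: 73,984 weights
    reduced = attention_projection_params(54)   # 40% of the features: 11,664 weights
    print(f"kept fraction: {reduced / full:.2%}")  # ~15.77%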
Table 2. Evaluation of the proposed Neural Feature Selection (NFS) approach on the OHSUMED dataset. We report the ndcg@3 values obtained by LambdaMART and the proposed Neural Reranking model employing different subsets of features.

Features Perc. | LambdaMART: XGAS | HCAS (single) | NFS | Neural Reranker: XGAS | HCAS (single) | NFS
60%            | 0.3669           | 0.3781        | ... | ...                   | ...           | ...
80%            | 0.3669           | 0.3993        | ... | ...                   | ...           | ...
It is also interesting to observe the differences between the features selected by the proposed NFS approach and the other baselines. We focus on the top 3 features selected from the OHSUMED collection by each of the considered feature selection algorithms over the 5 different dataset folds, and refer the reader to [14] for a more detailed description of each feature. NFS most frequently selected features computed with popular retrieval models such as BM25 or QLM [11] (features 4, 12 and 28), based on the document abstract or title. On the other hand, HCAS selected simpler features derived from raw frequency counts of the query terms in each document's title and abstract (features 23, 40 and 36). Finally, XGAS selected a mix of features computed with traditional retrieval approaches, such as QL, and simpler frequency counts (features 2, 44 and 13). We conclude that the advantage of NFS is likely due to its ability to recognize and select the most sophisticated and useful matching scores thanks to the information learned during training.
Conclusions

In recent years, neural models have become a competitive alternative to traditional Learning to Rank (LETOR) approaches. Their performance, however, grew at the expense of their efficiency and complexity. In this paper, we propose an approach for feature selection for LETOR based on a neural ranker. Our approach is specifically designed to optimize the performance of neural LETOR models without the need to change their architecture. In our experiments, the proposed approach improved the efficiency of a sample neural LETOR model and decreased its training time without impacting its performance. We also validated the robustness of the selected features by testing them with a different – non-neural – model such as LambdaMART. We performed our evaluation on two popular LETOR datasets – i.e. MSLR-WEB30K and OHSUMED – comparing our approach to three state-of-the-art techniques from [6]. The proposed approach outperformed the selected baselines in the majority of the experiments on both datasets.
References
1. Adebayo, J., Gilmer, J., Muelly, M., Goodfellow, I., Hardt, M., Kim, B.: Sanity checks for saliency maps. In: Advances in Neural Information Processing Systems. pp. 9505–9515 (2018)
2. Ai, Q., Bi, K., Guo, J., Croft, W.: Learning a deep listwise context model for ranking refinement. In: Proc. of SIGIR 2018. pp. 135–144 (2018)
3. Bruch, S., Zoghi, M., Bendersky, M., Najork, M.: Revisiting approximate metric optimization in the age of deep neural networks. In: Proc. of SIGIR 2019. pp. 1241–1244 (2019)
4. Burges, C.J.: From RankNet to LambdaRank to LambdaMART: An overview. Learning (23-581), 81 (2010)
5. Geng, X., Liu, T., Qin, T., Li, H.: Feature selection for ranking. In: Proc. of SIGIR 2007. pp. 407–414 (2007)
6. Gigli, A., Lucchese, C., Nardini, F., Perego, R.: Fast feature selection for learning to rank. In: Proc. of ICTIR 2016. pp. 167–170 (2016)
7. Gower, J.C., Ross, G.: Minimum spanning trees and single linkage cluster analysis. Journal of the Royal Statistical Society: Series C (Applied Statistics) (1), 54–64 (1969)
8. Han, X., Lei, S.: Feature selection and model comparison on Microsoft learning-to-rank data sets. arXiv preprint arXiv:1803.05127 (2018)
9. Ke, G., Meng, Q., Finley, T., Wang, T., Chen, W., Ma, W., Ye, Q., Liu, T.: LightGBM: A highly efficient gradient boosting decision tree. In: Advances in Neural Information Processing Systems. pp. 3146–3154 (2017)
10. Liu, T.: Learning to rank for information retrieval. Springer Science & Business Media (2011)
11. Manning, C., Schütze, H., Raghavan, P.: Introduction to Information Retrieval. Cambridge University Press (2008)
12. Pobrotyn, P., Bartczak, T., Synowiec, M., Białobrzeski, R., Bojar, J.: Context-aware learning to rank with self-attention. arXiv preprint arXiv:2005.10084 (2020)
13. Qin, T., Liu, T.: Introducing LETOR 4.0 datasets. arXiv preprint arXiv:1306.2597 (2013)
14. Qin, T., Liu, T.Y., Xu, J., Li, H.: LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval (4), 346–374 (2010)
15. Rahangdale, A., Raut, S.: Deep neural network regularization for feature selection in learning-to-rank. IEEE Access 7