Feature Selection and Model Comparison on Microsoft Learning-to-Rank Data Sets
Sen LEI, Xinzhi HAN
Submitted for the PSTAT 231 (Fall 2017) Final Project. University of California, Santa Barbara. December 2017.
Abstract
With the rapid advance of the Internet, search engines (e.g., Google, Bing, Yahoo!) are used by billions of users every day. The main function of a search engine is to locate the most relevant webpages corresponding to what the user requests. This report focuses on the core problem of information retrieval: how to learn the relevance between a document (very often a webpage) and a query given by the user. Our analysis consists of two parts: 1) we use standard statistical methods to select important features among the 136 candidates given by information retrieval researchers from Microsoft; we find that not all the features are useful, and give interpretations of the top-selected features; 2) we give baselines on prediction over the real-world dataset MSLR-WEB using various learning algorithms. We find that models based on boosting trees and random forests in general achieve the best prediction performance. This agrees with the mainstream opinion in the information retrieval community that tree-based algorithms outperform the other candidates for this problem.
The DATA SET that we use is available on Microsoft Learning to Rank Datasets. The CODE we wrote is available on GitHub.

Chapter 1
Introduction
In this paper we present our experiment results on the Microsoft Learning to Rank dataset MSLR-WEB [20]. Our contributions include:
• Selecting important features for learning algorithms among the 136 features given by Microsoft.
• Giving baseline evaluation results and comparing the performances of several machine learning models.
To our knowledge we are the first to select features on the dataset MSLR-WEB, and we give baseline results on models that are not covered by existing works. To make sure our results are reproducible, all of our scripts are available online, and detailed experiment procedures are given.
Search engines, or information retrieval at large, play an important role in the modern Internet. Given any query from the user, an ideal search engine should match the related web pages and rank them by relevance to the query. Since very often the user only looks at the top-ranked results, it is crucial to locate the most relevant web pages. Hence it is interesting to learn the relevance between the query and a web page with data mining algorithms. A line of work called Learning to Rank (LetoR) [13, 27, 24, 4] has focused on this learning problem, and several algorithms have been proposed, e.g., RankSVM [13], RankBoost [24], AdaRank [27], LambdaMART [4], etc. Some of these algorithms are applied in commercial search engines such as Google, Bing and Yahoo [16]. To support LetoR research, Microsoft Research and Yahoo have provided large scale datasets: LetoR 3.0 [22], LetoR 4.0 [21], MSLR-WEB [20], and the Yahoo Challenge dataset [6]. Unlike the Yahoo Challenge dataset, Microsoft Research has given descriptions of how the features of all their datasets are generated. Hence the Microsoft datasets [22, 21, 20] are more useful in research. However, there are two challenges with the MSLR-WEB datasets:
• Insufficient baselines reported. For LetoR 3.0 and LetoR 4.0, baseline results are posted on the website [22, 21], and there are extensive research works presenting experiment results on them [19, 28, 6, 18]. For MSLR-WEB, the authors did not give baselines. To our knowledge there are only two existing works reporting experiment results on MSLR-WEB [2, 25]. However, in [2, 25] only limited models and evaluation metrics are reported. Some competitive learning algorithms, e.g., Generalized Linear Models, Logistic Regression, SVM,
Random Forests (very promising based on a recent report [12]), and Boosting Regression Trees, are not reported. In this paper we fill this void, and report both Precision and Normalized Discounted Cumulative Gain (NDCG) following the dataset authors' baselines [22]. Hence our results should be comparable with the existing works [2, 25].
• Too many features. MSLR-WEB has 136 features, whereas each of the LetoR datasets has fewer than 50. It is interesting to investigate the significance of each of the 136 features, and this paper attempts such a feature selection. Another issue is that some features in MSLR-WEB are developed and privately owned by Microsoft (e.g., the features on user click data, the boolean model and the language models). This makes the dataset not fully reproducible. By feature selection, we can evaluate the importance of these private features. If some of them do not significantly influence the learning performance, we can simply discard them in future research, when we need to generate features for new query and document datasets such as TREC Robust05 [26], TREC Enterprise05 [8], Clueweb09 [5], etc.
Tables 1.1, 1.2 and 1.3 are an adaptation of the feature list given on the MSLR-WEB website [20]. All the features are numeric. The original datasets contain the matched documents for at most 30,000 queries, separated into five folds for five-fold cross validation. However, in this report we only use the Fold 1 data, in order to keep the training time at an appropriate scale (e.g., training only a RankNet or Coordinate Ascent model on Fold 1 of MSLR-WEB30K, without cross validation over the five folds, took more than 2 days). The results in this report therefore only serve as baselines for future comparison.
There are two main contributions:
• We use Lasso on logistic regression, Lasso on ordinal regression, SVM, random forests, and generalized boosting models to select significant features among the 136 features of the dataset MSLR-WEB. We list the top-selected features and give interpretations of their importance.
• We report the metrics Precision and NDCG for the models including Lasso on logistic regression, Random Forest, Generalized Boosted Regression, SVM, and the Continuation Ratio model. We also give baselines for other state-of-the-art LetoR algorithms.
Table 1.1: Description of the MSLR-WEB dataset (Part 1).
covered query term number (features 1-5): How many terms in the user query are covered by the text. The text can be body, anchor, title, URL, or whole document (for features 1-5 respectively; similarly below).
covered query term ratio (features 6-10): Covered query term number divided by the number of query terms.
stream length (features 11-15): Text length.
IDF, inverse document frequency (features 16-20): 1 divided by the number of documents containing the query terms.
sum of term frequency (features 21-25): Sum of counts of each query term in the document.
min of term frequency (features 26-30): Minimum of counts of each query term in the document.
max of term frequency (features 31-35): Maximum of counts of each query term in the document.
mean of term frequency (features 36-40): Average of counts of each query term in the document.
variance of term frequency (features 41-45): Variance of counts of each query term in the document.
Table 1.2: Description of the MSLR-WEB dataset (Part 2).
normalized sum of stream length (features 46-50): Sum of term counts divided by text length.
normalized min of stream length (features 51-55): Minimum of term counts divided by text length.
normalized max of stream length (features 56-60): Maximum of term counts divided by text length.
normalized mean of stream length (features 61-65): Average of term counts divided by text length.
normalized variance of stream length (features 66-70): Variance of term counts divided by text length.
sum of tf*idf (features 71-75): Sum of the product between term count and IDF for each query term.
min of tf*idf (features 76-80): Minimum of the product between term count and IDF for each query term.
max of tf*idf (features 81-85): Maximum of the product between term count and IDF for each query term.
mean of tf*idf (features 86-90): Average of the product between term count and IDF for each query term.
variance of tf*idf (features 91-95): Variance of the product between term count and IDF for each query term.
boolean model (features 96-100): Unclear. Privately owned by Microsoft.
vector space model (features 101-105): Dot product between the vectors representing the query and the document. The vectors are privately owned by Microsoft.
BM25 (features 106-110): Okapi BM25.
LMIR.ABS (features 111-115): Language model approach for information retrieval (IR) with absolute discounting smoothing [17].
LMIR.DIR (features 116-120): Language model approach for IR with Bayesian smoothing using Dirichlet priors [3].
LMIR.JM (features 121-125): Language model approach for IR with Jelinek-Mercer smoothing [17].
Table 1.3: Description of the MSLR-WEB dataset (Part 3).
number of slashes in URL (feature 126): e.g., "ucsb.edu/pstat/people" has 2 slashes.
length of URL (feature 127): The number of characters in the URL.
Inlink number (feature 128): The number of web pages that cite this web page.
Outlink number (feature 129): How many web pages this web page cites.
PageRank (feature 130): Evaluates the centrality of this web page based on web links over the Internet. This underlies the success of Google.
SiteRank (feature 131): Site-level PageRank. E.g., "ucsb.edu/pstat" and "ucsb.edu/math" share the same SiteRank.
QualityScore (feature 132): The quality score of a web page, output by a web page quality classifier. Privately owned by Microsoft.
QualityScore2 (feature 133): The quality score of a web page, output by a web page quality classifier which measures the badness of a web page. Privately owned by Microsoft.
Query-url click count (feature 134): The click count of a query-url pair at a search engine in a period. Collected and privately owned by Microsoft.
url click count (feature 135): The click count of a url aggregated from user browsing data in a period. Collected and privately owned by Microsoft.
url dwell time (feature 136): The average dwell time of a url aggregated from user browsing data in a period. Collected and privately owned by Microsoft.

Chapter 2
Data Processing and Models
Our training data contains 137 numerical variables on 10,000 observations, i.e., a 10,000 by 137 data frame. Because the computation is extremely expensive, we randomly sampled these observations out of the more than 220,000 original observations included in Fold 1 of MSLR-WEB30K, as referred to previously. Among the 137 variables, the first one is the response variable named rel, standing for web page relevance. The remaining 136 variables, denoted X1, X2, ..., X136, are independent variables. The first 125 of them come from 25 larger feature types, each computed over five perspectives: body, anchor, title, URL, and the whole document. Some of these features are public and well known, such as covered query term number, the number of terms in a user's query covered by the text; stream length, describing the text length; different aggregations of term frequency, i.e., aggregations of the counts of each query term in a document; and tf-idf, short for term frequency-inverse document frequency. Apart from these, there are 11 independent variables not organized by the five perspectives. Although some of them are privately designed by Microsoft and remain unclear, others are not mysterious, e.g.,
PageRank, a ranking criterion used by the Google search engine; length of URL, namely the number of characters in a URL; and Outlink number, a web page quality measure counting the number of other web pages cited by this web page. Our test data, of the same schema as the training data, includes about 240,000 observations collected in 1,961 different files. We will conduct predictions using these test files. Our goal is to compare several trained models based on predicted measurements of accuracy.
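As a concrete illustration, the following R sketch shows one way to load a fold of the MSLR-WEB data, assuming the standard SVMlight-style line format used by the MSLR releases (rel qid:<id> 1:<v> ... 136:<v>), and to draw a 10,000-observation training sample. The file path and helper name are hypothetical, not part of our released scripts.

```r
# Minimal sketch; the line format is assumed, and "Fold1/train.txt" is a hypothetical path.
read_mslr <- function(path) {
  lines <- readLines(path)
  parse_line <- function(ln) {
    tok <- strsplit(ln, " ")[[1]]
    tok <- tok[tok != ""]
    rel <- as.numeric(tok[1])                      # relevance label 0-4
    qid <- as.numeric(sub("qid:", "", tok[2]))     # query id
    kv  <- strsplit(tok[-(1:2)], ":")              # "index:value" pairs
    x   <- numeric(136)
    for (p in kv) x[as.integer(p[1])] <- as.numeric(p[2])
    c(rel = rel, qid = qid, x)
  }
  out <- t(vapply(lines, parse_line, numeric(138), USE.NAMES = FALSE))
  colnames(out) <- c("rel", "qid", paste0("X", 1:136))
  as.data.frame(out)
}

set.seed(231)
fold1 <- read_mslr("Fold1/train.txt")
train <- fold1[sample(nrow(fold1), 10000), ]       # the 10,000-observation sample
```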
Taking a closer look at the response variable rel, we notice that even though it has 5 relevance levels ranging from 0 to 4, observations at relevance level 3 are insufficient, only about 1.5% of the 10,000 total observations. What is worse, those at level 4 are less than 1%, which may lead to potential over-fitting and hence unpersuasive results. In that case, we realize it is better to combine levels with fewer observations. For the binary classification models below we follow the same rule used later for evaluation: levels 0 and 1 are treated as irrelevant (label 0), and levels 2, 3 and 4 as relevant (label 1).

Intuitively, logistic regression does a good job in binary classification, so we decide it to be our first model. Interestingly, after fitting a logistic model, singularity occurs, indicating that some of the variables have perfect collinearity, that is to say, some of the variables are exact linear combinations of others. However, in this project we are more interested in the predictive ability of our model than in individual coefficients; aliased variables which do not contribute to the model will not affect our model accuracy. Carrying on from that, we still think 136 features are too inefficient and expensive for computation, and we do not expect each of these features to be relevant for later prediction. The motivation here is therefore to find a penalized model and apply shrinkage to those features, telling us which variables are important and should be kept, and which contribute little to the model and can be thrown away. Lasso, short for least absolute shrinkage and selection operator, which minimizes the objective function
$$\sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p}|\beta_j|,$$
is exactly what we are looking for. It shrinks the estimated coefficients to exact zeros as the tuning parameter $\lambda$ increases, but we nonetheless have to decide when to stop penalizing and keep the remaining non-zero coefficients. The idea behind this is the bias-variance trade-off. To achieve that, we use the R function cv.glmnet [10], which performs cross-validation on the training set to obtain the best $\lambda$ (lambda.1se in the result from cv.glmnet), and then use this $\lambda$ to select the best subset of features, which contains 12 variables. Later, instead of predicting the measurements of accuracy using these 12 variables, we reset the tuning parameter from lambda.1se to lambda.min to include more features, in order to increase prediction accuracy without worrying about risks from collinearity. We used the test sets to conduct predictions on the response scale and obtained 1,961 probability vectors of length n, each specifying the probability that each of the n observations is assigned to label 1. We will use these vectors to calculate the measurements of accuracy, Precision and NDCG. We discuss these measurements, as well as the selected features, in detail in Chapter 3.
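The feature-selection step described above can be expressed compactly with glmnet. The sketch below is our reading of that procedure, not a verbatim excerpt from the project scripts; the object names (train, x_test) are hypothetical.

```r
library(glmnet)

x <- as.matrix(train[, paste0("X", 1:136)])
y <- as.integer(train$rel >= 2)        # levels 2,3,4 -> relevant (1); 0,1 -> irrelevant (0)

set.seed(231)
cv_fit <- cv.glmnet(x, y, family = "binomial", alpha = 1)   # alpha = 1 gives the Lasso penalty

# Features kept at the more conservative lambda.1se (12 variables in the run described above)
coef_1se <- as.matrix(coef(cv_fit, s = "lambda.1se"))
selected <- setdiff(rownames(coef_1se)[coef_1se[, 1] != 0], "(Intercept)")
selected

# For prediction we switch to lambda.min, which keeps more features
# x_test: feature matrix of one test file (hypothetical object)
prob_test <- predict(cv_fit, newx = x_test, s = "lambda.min", type = "response")
```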
Support Vector Machine on binary classification

The support vector machine is a reasonable method for binary classification problems. It solves an optimization problem of the form [7]
$$\begin{aligned}
&\max_{\beta_0,\beta_1,\ldots,\beta_p,\;\epsilon^{(1)},\ldots,\epsilon^{(n)}} \; M \\
&\text{s.t.}\quad \sum_{j=1}^{p}\beta_j^2 = 1, \\
&\qquad y^{(i)}\Big(\beta_0 + \beta_1 x^{(i)}_1 + \cdots + \beta_p x^{(i)}_p\Big) \ge M\big(1-\epsilon^{(i)}\big), \\
&\qquad \epsilon^{(i)} \ge 0, \qquad \sum_{i=1}^{n}\epsilon^{(i)} \le C,
\end{aligned}$$
where the $\beta_j$ are coefficients, $M$ is the maximum margin distance, the $\epsilon^{(i)}$ are slack variables, and $C \ge 0$ is the "budget": a tuning parameter which controls the amount of slack. We are curious whether it is a good model for our predictions, or at least better than the previous binary Lasso model. Due to time constraints, we are not able to compare the performance of each kernel through parameter optimization experiments. We stick to the linear kernel, which only involves a cost parameter (an inverse version of the budget parameter C) to optimize; this allows us to focus more on our classification models. To save time when tuning the cost parameter, instead of using the whole data we use a random sub-sample of size 500, and we fix the tuning range to a small grid of candidate cost values; cost = 0.001 turns out to be the best over that range. Knowing that later we need to calculate the model precision, which involves sorting probabilities, we would rather have an explicit probability for the class labels; therefore, we create 50 bootstrap replicates of the training data to estimate class probabilities within each test document. After that, we acquire 1,961 n-by-2 probability matrices, in each of which the 2 column names are the class labels "0" and "1" and the n row names are the indices of the n observations in that particular file. To get an idea of the important features, we apply the R function rfe from the package caret [14], which implements a recursive feature elimination algorithm, to select the best 10 features out of 136 using the method "svmLinear" and 20-fold cross validation.
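A rough sketch of the SVM step, under the same assumptions as above; the object names are hypothetical and the candidate cost grid shown is only illustrative.

```r
library(e1071)
library(caret)

feat <- paste0("X", 1:136)
train$rel_bin <- factor(as.integer(train$rel >= 2))     # binarized response, as above

# Tune the cost of a linear-kernel SVM on a random sub-sample of 500 observations
set.seed(231)
sub <- train[sample(nrow(train), 500), ]
tuned <- tune(svm, train.x = sub[, feat], train.y = sub$rel_bin,
              kernel = "linear",
              ranges = list(cost = c(0.001, 0.01, 0.1, 1, 10)))   # illustrative grid
best_cost <- tuned$best.parameters$cost

# Fit on the full training sample with probability estimates enabled
fit <- svm(x = train[, feat], y = train$rel_bin, kernel = "linear",
           cost = best_cost, probability = TRUE)

# Recursive feature elimination with a linear SVM, keeping the best 10 of 136 features
ctrl <- rfeControl(functions = caretFuncs, method = "cv", number = 20)
rfe_fit <- rfe(x = train[, feat], y = train$rel_bin,
               sizes = 10, rfeControl = ctrl, method = "svmLinear")
predictors(rfe_fit)
```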
One problem with binary classification is that we lose certain information among labels: how different is different? Is the difference between levels 3 and 4 the same as that between levels 1 and 4? Taking this into consideration, we build some multiclass classification models and compare their accuracy with the binary classification models.

Considering the fact that the response variable is ordinal and that there are many covariates, we decided to try a penalized Continuation Ratio Model. A variety of statistical modeling procedures, namely proportional odds, adjacent category, stereotype logit, and continuation ratio models, can be used to predict an ordinal response. In this paper, we focus on the continuation ratio model because its likelihood can easily be re-expressed such that existing software can be readily adapted and used for model fitting [1].
Statistical Background
Suppose for each observation, $i = 1, \ldots, n$, the response $Y_i$ belongs to one of the ordinal classes $k = 1, \ldots, K$ and $x_i$ represents a $p$-length vector of covariates. The backward formulation of the continuation ratio models the logit as
$$\mathrm{logit}\big[P(Y = k \mid Y \le k, X = x)\big] = \alpha_k + \beta_k^{T} x.$$
For high-dimensional covariate spaces, the best subset procedure is computationally prohibitive. However, penalized methods, which place a penalty on a function of the coefficient estimates, permit a model fit even for high-dimensional data [23]. A generalization of these penalized models can be expressed as
$$\tilde{\beta} = \arg\min_{\beta}\; \sum_{i=1}^{n}\Big(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\Big)^2 + \lambda \sum_{j=1}^{p} |\beta_j|^{q},$$
where $q \ge 0$. When $q = 1$ we have the $L_1$ (Lasso) penalized model; when $q = 2$ we have ridge regression. Values of $q \in (1, 2)$ provide a compromise between the $L_1$ and ridge penalized models. Because for $q > 1$ the penalty no longer shrinks coefficients exactly to zero, the elastic net penalty
$$\lambda \sum_{j=1}^{p}\big(\alpha \beta_j^{2} + (1-\alpha)\,|\beta_j|\big)$$
is often used instead as a compromise between the ridge and Lasso penalties.

Model Building
We separate the original data randomly into a 10,000-observation training set and a 2,000-observation test set, and we use the package glmnetcr in R to fit the model. Figure 2.1 can be used to identify a more parsimonious model having a BIC close to the minimum BIC; we finally chose the model at step = 32.

Figure 2.1: Plot of the Bayesian Information Criterion across the regularization path for the fitted glmnetcr object using the training subset of the LetoR data set.

Then we use the remaining 2,000-observation test set to make the prediction. The evaluation methods are discussed in Chapter 3, and the evaluation results are shown in Tables 3.6 and 3.7.
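A minimal sketch of this step, assuming the glmnetcr interface described in [1]; the object names are ours, and the BIC-based step selection mirrors the choice made from Figure 2.1.

```r
library(glmnetcr)

x <- as.matrix(train[, paste0("X", 1:136)])
y <- ordered(train$rel)                      # ordinal response with levels 0-4

cr_fit <- glmnetcr(x, y)                     # penalized continuation ratio model

# Pick the step along the regularization path with the smallest BIC
best_step <- select.glmnetcr(cr_fit, which = "BIC")

fitted_tr <- fitted(cr_fit, s = best_step)   # fitted class probabilities and classes
nonzero.glmnetcr(cr_fit, s = best_step)      # non-zero coefficients at that step
```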
Our motivation for using the Random Forest method is that, unlike an ordinary bagged decision tree model, it chooses the split variable from a random subset of the predictors, in which case collinearity issues will not be caused by the highly correlated variables we presume to exist. From an error-reduction perspective, it achieves the bias-variance trade-off by reducing the variance of large complex models. First of all, we use the R function rfcv in the package randomForest to perform sequential variable-importance pruning via a nested cross-validation procedure. We set the fraction of variables retained at each iteration to 0.7.

Table 2.1: Part of the cross-validation error in Random Forest for feature selection.
Number of features:  136      95       67       47       33       23       16       11       8        5
CV error:            0.45075  0.45063  0.45738  0.45475  0.46375  0.46600  0.47288  0.47325  0.47950  0.48925
From Table 2.1, we notice that the CV error increases as the number of predictors is reduced, and the error difference between using 136 features and 95 features is very small, which suggests the 136-feature model is as good as the 95-feature model; we therefore decide to use the original 136 features to fit our random forest model. In the R function, we set the argument ntree to a reasonable size of 501, set importance to TRUE since we want the model to assess the importance of features, and keep the other arguments at their default settings, in which case the number of variables randomly sampled at each split is $\sqrt{p} \approx 12$. After fitting the model, we predict on our test sets on the probability scale and obtain 1,961 n-by-5 probability matrices. The 5 column names in each of these matrices are the class labels "0", "1", "2", "3", and "4"; the n row names are the indices of the n observations in that particular file.

On the other hand, unlike random forests, boosting achieves the bias-variance trade-off by reducing the bias of low-variance models, which inspires us to see how this model works. The brief idea behind boosting is that it fits trees multiple times sequentially, uses information (the fraction of mis-classifications) from previously grown trees as weak learners to update classifier weights after each iteration, and then takes a weighted average of the weak learners' classifications as the final prediction [11]. Before fitting a boosting model with the designed parameters, we need to figure out a good value for each of them. Here, we are interested in finding appropriate values for the parameters n.trees and interaction.depth of the R function gbm. Fortunately, there is a package in R called caret that does this job. This package is for classification and regression training: we decide appropriate ranges for the parameters we want to train, and the function caret::train [14] tunes the best parameters (those with the smallest cross-validation error) for us. We predetermined our tree sizes to range from 600 to 1500 with increment 125, and the maximum depth of variable interactions to be 2, 4, or 6. We fix the shrinkage (also known as the learning rate) to a small constant; the tuning then selects n.trees = 1250 and interaction.depth = 4. Using these parameters to fit our boosting model after setting the distribution family to "multinomial", a summary of the fitted model gives us the ranked importance of each feature, presented as a plot as well as a table.
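A sketch of the two tree-based fits as we understand them; the object names are hypothetical, and the cross-validation settings and shrinkage value shown are placeholders, since the report fixes the shrinkage to an unspecified small constant.

```r
library(randomForest)
library(caret)
library(gbm)

feat <- paste0("X", 1:136)
x <- train[, feat]
y <- factor(train$rel)                          # 5-level response

# Random forest: 501 trees, default mtry (about sqrt(136) variables per split)
set.seed(231)
rf_fit <- randomForest(x, y, ntree = 501, importance = TRUE)
varImpPlot(rf_fit, n.var = 10)                  # importance plot (cf. Figure 3.1)
rf_prob <- predict(rf_fit, newdata = x_test, type = "prob")   # n-by-5 matrix; x_test is hypothetical

# Boosting: tune n.trees and interaction.depth with caret::train
grid <- expand.grid(n.trees = seq(600, 1500, by = 125),
                    interaction.depth = c(2, 4, 6),
                    shrinkage = 0.01,           # placeholder learning rate
                    n.minobsinnode = 10)
gbm_tuned <- train(x, y, method = "gbm", tuneGrid = grid,
                   trControl = trainControl(method = "cv", number = 5),
                   verbose = FALSE)
gbm_tuned$bestTune                              # e.g., n.trees = 1250, interaction.depth = 4

# Refit gbm with the selected parameters and a multinomial loss
gbm_fit <- gbm(rel ~ ., data = data.frame(rel = y, x),
               distribution = "multinomial",
               n.trees = gbm_tuned$bestTune$n.trees,
               interaction.depth = gbm_tuned$bestTune$interaction.depth,
               shrinkage = 0.01)
summary(gbm_fit)                                # ranked feature importance
```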
Chapter 3

Evaluation Results
We here present our experiment setup and baseline results. To make sure our results are reproducible, we make all of our experiment scripts available [15].
We implemented our algorithms in R, and we also used the state-of-the-art algorithms provided by Ranklib [9]. We leave all of the baselines generated by Ranklib to the Appendix. We use Precision and NDCG as the measurements of accuracy, as suggested by the dataset authors [20]. These two measurements are widely used in the existing works [22, 20].
• Given any query, precision P@k is defined as R_k / k, where R_k is the number of truly relevant documents among the top k documents selected by the learning algorithm. That is, given any query, the learning algorithm gives a ranking of documents based on the predicted relevance, and we want to see how many of the top k documents are truly relevant according to the ground truth. Note that the ground truth relevance in MSLR-WEB has five levels: 0, 1, 2, 3, and 4. As suggested by the dataset authors [20], we regard 0 and 1 as irrelevant (i.e., 0) and 2, 3, 4 as relevant (i.e., 1) when we evaluate precision.
• Given any query, NDCG is defined as
$$NDCG@k = \frac{DCG@k}{IDCG@k},$$
where DCG is defined as
$$DCG@k = \sum_{i=1}^{k} \frac{2^{rel_i} - 1}{\log_2(i+1)},$$
with $rel_i$ denoting the true relevance of the $i$-th ranked document as suggested by the learning algorithm, and IDCG is defined as
$$IDCG@k = \sum_{i=1}^{k} \frac{2^{ideal_i} - 1}{\log_2(i+1)},$$
where $ideal_i$ denotes the true relevance of the $i$-th ranked document when the matched documents are ranked by their ground truth relevance. Note that NDCG has several variants; here we use the version from the MSLR-WEB evaluation script [20].

Here we argue why we use information retrieval measurements such as Precision and NDCG, rather than traditional statistical errors such as the Mean Squared Error (MSE) or the Area Under the Curve (AUC) of the ROC. There are basically two reasons.
1. The relevance scores are only qualitative and very subjective. The relevance levels from 0 to 4 are given by human experts, so they by no means can be quantitatively accurate, i.e., they only roughly represent how people feel about the relevance between a query and a document. For example, relevance level 4 does not imply that its relevance is twice that of relevance level 2. However, most traditional statistical measurements assume the targets are quantitatively accurate.
2. Only the top-ranked documents are considered for evaluation. This is typically how people use a search engine: send a query, and only look at the very first ranked documents. Thus an appropriate evaluation measurement should put overwhelming weight on the top-ranked documents, as Precision and NDCG do. However, both AUC and MSE give equal weight to each document in the test dataset.
The dataset MSLR-WEB10K has 10,000 queries and MSLR-WEB30K has 30,000 queries. On average, for each query, there are 100 - 200 matched documents that have relevance levels evaluated by human experts. We average the measurements over all the involved queries. We ran our experiments on the UCSB Center for Scientific Computing, Cluster Knot's 93rd node, which is a DL580 node with 4 Intel X7550 eight-core processors and 1 TB of RAM.
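The two measurements are straightforward to compute from a vector of predicted scores and the ground-truth labels. The helper functions below are a sketch following the definitions above; they are not the official MSLR evaluation script.

```r
# Precision@k: fraction of truly relevant documents (label >= 2) among the top k ranked ones
precision_at_k <- function(scores, rel, k) {
  top <- order(scores, decreasing = TRUE)[seq_len(k)]
  mean(rel[top] >= 2)
}

# NDCG@k with gain (2^rel - 1) and discount log2(i + 1)
ndcg_at_k <- function(scores, rel, k) {
  dcg <- function(r) sum((2^r - 1) / log2(seq_along(r) + 1))
  top   <- order(scores, decreasing = TRUE)[seq_len(k)]
  ideal <- sort(rel, decreasing = TRUE)[seq_len(k)]
  idcg  <- dcg(ideal)
  if (idcg == 0) return(0)     # convention for queries with no relevant documents
  dcg(rel[top]) / idcg
}
```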
We now report Precision and NDCG for each of our models.

As stated above, setting the tuning parameter to lambda.1se, which is acquired in the cross-validated Lasso, gives us the 12 non-zero coefficient variables listed below; we consider these 12 features to be the most important ones. Note that there is no ordering of the importance of these 12 features.

Table 3.1: Selected features under Binomial Lasso Regression
X28: min of term frequency in title
X30: min of term frequency in whole document
X64: mean of stream length normalized term frequency in URL
X65: mean of stream length normalized term frequency in whole document
X98: boolean model in title
X108: BM25 in title
X109: BM25 in URL
X123: LMIR.JM in title
X126: number of slashes in URL
X127: length of URL
X129: Outlink number
X130: PageRank

From the table, we find that all the selected features are related to term frequency, stream length, the boolean model, BM25, LMIR.JM, URL properties, and PageRank.

To calculate the Precision, we already have 1,961 probability vectors of length n, where n depends on the number of observations in each test file. For each of these vectors, we sort the observations in decreasing order of their corresponding probabilities and fetch the first k indices. Going back to the original test data, we look at the class labels of the observations associated with these k indices and calculate the proportion of observations assigned to label 1. Roughly speaking, we would like to know the actual percentage of class label 1 among the selected k observations, given that those k observations are predicted to be in label 1. Finally, we take the average over all the test files as our Precision. The reason for choosing a relatively small number k of observations is that in real life a user is usually only interested in the first k query results and disregards the rest. The calculation of NDCG is largely based on the formula given above. The results of both measurements are presented in Tables 3.6 and 3.7.
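Using the helpers defined above, the averaging over the 1,961 test files can be sketched as follows; the file-handling details are ours, and prob_list and rel_list are assumed to hold, per test file, the predicted probability of label 1 and the original 0-4 relevance labels.

```r
k <- 10

# Average Precision@k and NDCG@k over all test files
p_avg    <- mean(mapply(function(p, r) precision_at_k(p, r, k), prob_list, rel_list))
ndcg_avg <- mean(mapply(function(p, r) ndcg_at_k(p, r, k), prob_list, rel_list))
c(precision = p_avg, ndcg = ndcg_avg)
```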
Selected features are listed in the table below (order matters).

Table 3.2: Selected features under Support Vector Machine
X55: min of stream length normalized term frequency in whole document
X78: min of tf*idf in title
X80: min of tf*idf in whole document
X65: mean of stream length normalized term frequency in whole document
X50: sum of stream length normalized term frequency in whole document
X51: min of stream length normalized term frequency in body
X76: min of tf*idf in body
X88: mean of tf*idf in title
X60: max of stream length normalized term frequency in whole document
X30: min of term frequency in whole document

Features selected in the SVM model are mainly within the scope of stream length, term frequency, and tf-idf. Precision and NDCG are calculated similarly to the Lasso binomial regression model, and the results are listed in Tables 3.6 and 3.7.

As stated above, using the BIC, R gives us the 26 non-zero coefficient variables listed in Table 3.3; we consider these 26 features to be the most important ones. Note that there is no ordering of the importance of these 26 features. From the table, one can find that the features selected under the continuation ratio model are within the categories query term, stream length, term frequency, BM25, LMIR, PageRank, and URL. Precision and NDCG can be calculated similarly to the random forest model, and the results are given in Tables 3.6 and 3.7.
Table 3.3: Selected features under the Continuation Ratio Model
X3: covered query term number in title
X4: covered query term number in URL
X6: covered query term ratio in body
X11: stream length in body
X15: stream length in whole document
X22: sum of term frequency in anchor
X26: min of term frequency in body
X30: min of term frequency in whole document
X41: variance of term frequency in body
X45: variance of term frequency in whole document
X49: sum of stream length normalized term frequency in URL
X70: variance of stream length normalized term frequency in whole document
X72: sum of tf*idf in anchor
X98: boolean model in title
X107: BM25 in anchor
X108: BM25 in title
X109: BM25 in URL
X110: BM25 in whole document
X115: LMIR.ABS in whole document
X123: LMIR.JM in title
X126: number of slashes in URL
X129: Outlink number
X130: PageRank
X133: QualityScore2
X134: Query-url click count
X136: url dwell time
Random forest shows us the 10 most important features in an importance plot (Figure 3.1). Taking both plots into consideration, we made Table 3.4, containing the selected features under the random forest model (order matters).

Figure 3.1: Importance plot under the random forest model.

Table 3.4: Selected features under the Random Forest model
X130: PageRank
X15: stream length in whole document
X11: stream length in body
X131: SiteRank
X127: length of URL
X115: LMIR.ABS in whole document
X120: LMIR.DIR in whole document
X111: LMIR.ABS in body

Selected features under the random forest model are mostly related to PageRank, stream length, SiteRank, URL, and LMIR.

Figuring out Precision for multi-class classification involves a concept called "expected rank". Each observation has an expected rank, calculated by summing the products of each class label and its corresponding probability. Then, as before, we sort the observations in decreasing order of their expected ranks, fetch the first k indices, find the proportion of actual class label 1 in each test set, and take the average. Note that after calculating and sorting the expected ranks, we obtain this proportion by converting the labels in the test files to "0" and "1" using the same rule as in the binary classification.
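The expected-rank computation described above is a one-liner given an n-by-5 probability matrix; the sketch below reuses the hypothetical per-file objects introduced earlier (rf_prob, rel_test_file).

```r
# Expected rank: probability-weighted average of the class labels 0-4
expected_rank <- function(prob_mat) {
  labels <- as.numeric(colnames(prob_mat))      # c(0, 1, 2, 3, 4)
  as.vector(prob_mat %*% labels)
}

# Precision@k for one test file, using expected ranks as the ranking scores
er <- expected_rank(rf_prob)                    # rf_prob: n-by-5 matrix for this file
precision_at_k(er, rel_test_file, k = 10)       # rel_test_file: true 0-4 labels
```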
We summarize the boosting model and directly obtain the feature importance shown in Table 3.5.

Table 3.5: Selected features under GBM
X55: min of stream length normalized term frequency in whole document
X88: mean of tf*idf in title
X53: mean of stream length normalized term frequency in title
X15: stream length in whole document
X51: min of stream length normalized term frequency in body
X123: LMIR.JM in title
X115: LMIR.ABS in whole document
X103: vector space model in title
X134: Query-url click count
X11: stream length in body
Selected features under the boosting model are around stream length, tf-idf, LMIR, the vector space model, and Query-url click count. Precision and NDCG are calculated similarly to the random forest model, and the results are given in Tables 3.6 and 3.7.
Tables 3.6 and 3.7 give the results for our algorithms implemented in R. We observe that Random Forest has the best performance. Here we give some insights and interpretations of our results:
• We use the SVM model to do a binary classification, i.e., for each instance the target can only be relevant (1) or irrelevant (0). We use the criteria above to transform the original 5-level relevance into the binary version when training. By this transformation we lose significant information on the strength and weakness of each relevance score. This should significantly decrease the prediction performance, especially when we are evaluating NDCG, which rewards ranking highly relevant documents at the top places.
• All the methods we use here are only pointwise [12]. So far we have not given the baselines of pairwise and listwise methods (see the Appendix). Our loss functions are not yet directly related to Precision or NDCG. Thus the results here should not be competitive with beyond-pointwise methods.
• Tree-based models (Random Forest, Boosting Trees) outperform the others. This agrees with recent findings in LetoR [16, 4, 12].

Table 3.6: Precision results for MSLR-WEB10K Fold 1.
Model           @1     @2     @3     @4     @5     @6     @7     @8     @9     @10
Lasso           0.283  0.272  0.268  0.265  0.261  0.257  0.256  0.253  0.252  0.249
Random Forest   0.594  0.520  0.478  0.449  0.428  0.415  0.402  0.390  0.379  0.368
SVM             0.308  0.292  0.279  0.271  0.263  0.256  0.250  0.245  0.240  0.236
Ordinal         0.374  0.344  0.326  0.317  0.312  0.304  0.299  0.289  0.284  0.280
Boosting Trees  0.473  0.438  0.413  0.391  0.377  0.364  0.353  0.343  0.333  0.326

Table 3.7: NDCG results for MSLR-WEB10K Fold 1.
Model           @1     @2     @3     @4     @5     @6     @7     @8     @9     @10
Lasso           0.227  0.242  0.253  0.264  0.272  0.280  0.289  0.296  0.303  0.309
Random Forest   0.456  0.427  0.420  0.418  0.420  0.423  0.426  0.429  0.432  0.435
SVM             0.251  0.251  0.255  0.261  0.266  0.272  0.277  0.282  0.287  0.291
Ordinal         0.284  0.291  0.298  0.307  0.317  0.325  0.333  0.338  0.344  0.349
Boosting Trees  0.377  0.371  0.373  0.376  0.380  0.385  0.389  0.394  0.398  0.402
Chapter 4
Discussions
We find the following feature categories are important among the 136 candidates of MSLR-WEB:
• Term frequency based features. Typical examples include TF-IDF, BM25, cover ratio of the query, LMIR smoothing, etc. These features are significant in nature since people would like to see the pages containing the words requested. Furthermore, the term frequencies in the body and title weigh more than those from other parts of the web page, based on our results.
• Link based features. Typical such features include PageRank, SiteRank, In/Out link number, etc. The intuition is also clear: important web pages tend to be cited much more than ordinary web pages (also called hubs). If one puts all the web pages and the web links between them into one graph, naturally the important web pages should occupy central places. PageRank (the core of Google's search engine) proves to capture such centrality tightly. Note that these features are only document specific, i.e., they do not change given different queries. This implies that a certain portion of a successful relevance evaluation should focus only on the document itself, regardless of the query.
• Click based features. Features 134, 135 and 136 are in this category. The intuition is that users tend to click the most interesting web pages and dwell for a long enough time on the relevant pages. Unfortunately, very often these features gathered from real users are private to the search engine companies.
• URL lengths. Important and popular web pages are likely to have short URLs, which are easy to remember. Also the number of slashes in a popular URL should not be too large.
• Lengths of web pages or titles. This is also known as stream length (features 11 - 15). The intuition is that longer pages are more likely to contain more useful information, which should attract users.
Our results also suggest that some features in MSLR-WEB are not very useful:
• Variance features. Typical such features include the variance of TF-IDF and of (normalized) term frequencies. We agree that a low variance of TF means the document is unlikely to have a huge bias toward certain terms in the query. (For example, suppose the query is "international organized crime"; a news webpage talking about local organized crime is irrelevant, even though it has a high frequency on part of the query, "organized crime".) However, the importance of this tendency is difficult to argue and lacks enough experimental results to back it up.
• Inverse Document Frequency (IDF) based features. We agree that the intuition behind this feature category is that the web page should contain novel information rather than copies from other sources. A low IDF implies the content in the web page is unique and can rarely be found elsewhere. However, it is difficult to see the connection between this feature and relevance given a query. Note that these features are document specific like PageRank, but they seem not able to capture the web page quality well enough.
We use R to generate the baselines of Precision and NDCG for the standard statistical models, and we give more baselines in the Appendix for the state-of-the-art LetoR algorithms. Since our models are somewhat classical, i.e., pointwise, compared to the state of the art, which is mostly pairwise and listwise, in nature our baselines cannot outperform the ones given by MART, LambdaMART, etc. For Random Forest, we achieve similar performance compared to the results given by Ranklib (see the Appendix).
The biggest limitation of this report is that the training set size is too small (no more than 10k instances for each training run). This is mostly because the models we used in R are not scalable, consuming too much time when the number of instances exceeds 10k. Hence we sampled 10k instances from the entire dataset, which contains nearly 3.8 million instances. In the future we may resort to scalable implementations (e.g., the ones from Ranklib, TensorFlow or scikit-learn) to better capture the whole dataset.

The feature selection problem also remains open. We cannot rule out the possibility that the features selected in this report are specific to the MSLR-WEB dataset. For example, the significance of the variance and IDF features is not clear on this dataset, but that does not imply these features are not useful on other datasets. We will evaluate the feature importance on other LetoR datasets to draw more solid conclusions.

Chapter 5
Acknowledgement
We are indebted to Shiyu Ji ([email protected]), who shared with us his suggestions and comments on this paper.

Appendix A
More Baselines on the State-of-the-Art LetoR Algorithms
For a better comparison with our results, we give more baselines of the existing LetoR algorithms generated by RankLib 2.5 [9]. We trained and tested the algorithms on Fold 1 only. We did not modify any model parameters (number of trees or leaves, bagging size, learning rate, etc.): only the RankLib defaults are used.
• Tables A.1 and A.2 give the baselines of the LetoR algorithms on the dataset MSLR-WEB10K, which is sampled from MSLR-WEB30K [20]. Gradient Boosting Regression Trees (GBRT), Coordinate Ascent and Random Forests have the best performance.
• Tables A.3 and A.4 give the baselines of the LetoR algorithms on the dataset MSLR-WEB30K. GBRT has the best performance.
Note that since we did not do any cross validation or parameter tuning, these results only serve as baselines for future comparisons.

Table A.1: Precision results for MSLR-WEB10K Fold 1.
Model              @1     @2     @3     @4     @5     @6     @7     @8     @9     @10
GBRT

Table A.2: NDCG results for MSLR-WEB10K Fold 1.
Model              @1     @2     @3     @4     @5     @6     @7     @8     @9     @10
GBRT               0.401  0.400  0.404
RankNet            0.116  0.130  0.138  0.145  0.151  0.158  0.164  0.170  0.175  0.180
RankBoost          0.277  0.284  0.290  0.297  0.306  0.312  0.318  0.324  0.330  0.335
AdaRank            0.340  0.333  0.334  0.335  0.337  0.340  0.344  0.347  0.349  0.353
Coordinate Ascent
Table A.3: Precision results for MSLR-WEB30K Fold 1.
Model              @1     @2     @3     @4     @5     @6     @7     @8     @9     @10
GBRT
RankNet            0.125  0.128  0.130  0.129  0.130  0.130  0.132  0.132  0.132  0.133
RankBoost          0.367  0.341  0.324  0.311  0.302  0.294  0.286  0.280  0.274  0.269
AdaRank            0.251  0.262  0.262  0.259  0.257  0.254  0.250  0.248  0.245  0.242
Coordinate Ascent  0.519  0.464  0.427  0.401  0.380  0.364  0.351  0.340  0.331  0.323
LambdaMART         0.449  0.423  0.402  0.382  0.370  0.356  0.345  0.337  0.329  0.322
ListNet            0.121  0.125  0.128  0.130  0.131  0.133  0.134  0.137  0.138  0.138
Random Forests     0.462  0.429  0.406  0.387  0.376  0.366  0.356  0.349  0.342  0.336
Table A.4: NDCG results for MSLR-WEB30K Fold 1.
Model              @1     @2     @3     @4     @5     @6     @7     @8     @9     @10
GBRT
RankNet            0.126  0.136  0.144  0.151  0.158  0.164  0.171  0.176  0.182  0.187
RankBoost          0.276  0.280  0.288  0.296  0.304  0.310  0.317  0.323  0.329  0.334
AdaRank            0.215  0.230  0.242  0.253  0.263  0.270  0.278  0.285  0.292  0.298
Coordinate Ascent  0.424  0.401  0.397  0.397  0.399  0.401  0.404  0.408  0.412  0.416
LambdaMART         0.355  0.358  0.364  0.369  0.376  0.381  0.387  0.392  0.398  0.402
ListNet            0.121  0.130  0.139  0.147  0.154  0.160  0.167  0.173  0.179  0.184
Random Forests     0.373  0.366  0.371  0.375  0.382  0.389  0.394  0.401  0.407  0.412

Bibliography
[1] Archer, K. J. (unknown). An R package for ordinal response prediction in high-dimensional data settings. The Ohio State University.
[2] Balasubramanian, N. and Allan, J. (2011). Modeling relative effectiveness to leverage multiple ranking algorithms.
[3] Blei, D. M., Ng, A. Y., and Jordan, M. I. (2003). Latent dirichlet allocation. Journal of Machine Learning Research, 3(Jan):993-1022.
[4] Burges, C. J. (2010). From RankNet to LambdaRank to LambdaMART: An overview. Learning, 11(23-581):81.
[5] Callan, J., Hoy, M., Yoo, C., and Zhao, L. (2009). Clueweb09 data set.
[6] Chapelle, O. and Chang, Y. (2011). Yahoo! learning to rank challenge overview. In Proceedings of the Learning to Rank Challenge, pages 1-24.
[7] Cortes, C. and Vapnik, V. (1995). Support vector machine. Machine Learning, 20(3):273-297.
[8] Craswell, N., de Vries, A. P., and Soboroff, I. (2005). Overview of the TREC 2005 enterprise track. In TREC, volume 5, pages 199-205.
[9] Dang, V. (2013). Ranklib.
[10] Friedman, J., Hastie, T., Simon, N., and Tibshirani, R. (2016). Package glmnet: lasso and elastic-net regularized generalized linear models, version 2.0.
[11] Friedman, J. H. (2001). Greedy function approximation: a gradient boosting machine. Annals of Statistics, pages 1189-1232.
[12] Ibrahim, M. and Carman, M. (2016). Comparing pointwise and listwise objective functions for random-forest-based learning-to-rank. ACM Transactions on Information Systems (TOIS), 34(4):20.
[13] Joachims, T. (2002). Optimizing search engines using clickthrough data. In Proceedings of the eighth ACM SIGKDD international conference on Knowledge discovery and data mining, pages 133-142. ACM.
[14] Kuhn, M. (2008). Caret package. Journal of Statistical Software, 28(5):1-26.
[15] Lei, S. and Han, X. (2017). https://github.com/shiyujiucsb/mslr-web-rerank.
[16] Liu, T.-Y. et al. (2009). Learning to rank for information retrieval. Foundations and Trends in Information Retrieval, 3(3):225-331.
[17] MacCartney, B. (2005). NLP lunch tutorial: Smoothing.
[18] Macdonald, C., Santos, R. L., and Ounis, I. (2013). The whens and hows of learning to rank for web search. Information Retrieval, 16(5):584-628.
[19] Minka, T. and Robertson, S. (2008). Selection bias in the LETOR datasets. In Proceedings of the SIGIR 2008 workshop on learning to rank for information retrieval.
[20] Qin, T. and Liu, T. (2010). Microsoft learning to rank datasets.
[21] Qin, T. and Liu, T. (2013). Introducing LETOR 4.0 datasets. CoRR, abs/1306.2597.
[22] Qin, T., Liu, T.-Y., Xu, J., and Li, H. (2010). LETOR: A benchmark collection for research on learning to rank for information retrieval. Information Retrieval, 13(4):346-374.
[23] Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society B, 58:267-288.
[24] Rudin, C. and Schapire, R. E. (2009). Margin-based ranking and an equivalence between AdaBoost and RankBoost. Journal of Machine Learning Research, 10(Oct):2193-2232.
[25] Suhara, Y., Suzuki, J., and Kataoka, R. (2013). Robust online learning to rank via selective pairwise approach based on evaluation measures. Information and Media Technologies, 8(1):118-129.
[26] Voorhees, E. M. (2005). The TREC robust retrieval track. In ACM SIGIR Forum, volume 39, pages 11-20. ACM.
[27] Xu, J. and Li, H. (2007). AdaRank: a boosting algorithm for information retrieval. In Proceedings of the 30th annual international ACM SIGIR conference on Research and development in information retrieval, pages 391-398. ACM.
[28] Zhang, M., Kuang, D., Hua, G., Liu, Y., and Ma, S. (2009). Is learning to rank effective for web search? In