RecSys Challenge 2016: job recommendations based on preselection of offers and gradient boosting
Andrzej Pacuk, Piotr Sankowski, Karol Węgrzycki, Adam Witkowski, Piotr Wygocki
[apacuk,sank,k.wegrzycki,a.witkowski,wygos]@mimuw.edu.pl
Institute of Informatics, University of Warsaw, Poland

Abstract
We present the Mim-Solution's approach to the RecSys Challenge 2016, which ranked 2nd. The goal of the competition was to prepare job recommendations for the users of the website Xing.com. Our two-phase algorithm consists of candidate selection followed by candidate ranking. We ranked the candidates by the predicted probability that the user will positively interact with the job offer. We used Gradient Boosting Decision Trees as the regression tool.
Introduction

The RecSys Challenge is an annual competition of recommender systems. The 2016 edition (http://2016.recsyschallenge.com/) was based on data provided by xing.com, a platform for business networking. On XING, users search for job offers that could fit them. Each user has a profile containing information such as place of living, industry branch, and experience (in years). Job offers are described by a related set of properties. There are various ways in which a user can interact with a job offer (called an item): by clicking it, bookmarking interesting ones, replying to offers, and finally by deleting a recommendation. The task in the challenge was to predict, for a given XING user, 30 items that this user will positively interact with (click, bookmark, or reply to).

Dataset
The dataset consisted of properties of users and items, interactions of the users (which items a user clicked, bookmarked, replied to, or deleted), and impressions (items shown to users by the XING recommender system). The interactions and impressions were gathered over a 3-month period. All data was anonymized by changing all properties to numerical values and adding an unknown number of artificial users, items, interactions, and impressions.

We were also given a set of both user and item properties. Common attributes of users and items were: career_level, discipline_id, industry_id, country, and region. Besides that, users had the following attributes: jobroles, experience_n_entries_class, experience_years_experience, experience_years_in_current, edu_degree, edu_fieldofstudies. Items had the attributes: title, latitude, longitude, employment, created_at, and active_during_test. Values of those attributes we will denote by, e.g., career_level(i) for a given item i.

Each impression is a tuple containing: user ID, item ID, and the number of the week in which the impression occurred. Each interaction contains: user ID, item ID, interaction type (click, bookmark, reply, delete), and the timestamp of the interaction. We will denote the positive interactions of a user u by Int_u, the negative interactions by Del_u, and the impressions by Imp_u.

10% of users were target users: users for whom we needed to compute the predictions. The predicted items had to come from a subset (24%) of all items (the job offers open during the test period). Exact dataset sizes are presented in Table 1. A more detailed specification of the dataset can be found at https://github.com/recsyschallenge/2016/blob/master/TrainingDataset.md.
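To make the notation concrete, the following is a minimal sketch of building Int_u, Del_u, and Imp_u from the logs. The file names, delimiter, column names, and the numeric interaction-type coding are assumptions for illustration, not the authors' code.

```python
# Sketch: building Int_u, Del_u and Imp_u from the interaction/impression logs.
# File names, delimiter and the numeric interaction-type coding are assumptions.
import csv
from collections import defaultdict

Int = defaultdict(set)   # positive interactions (click, bookmark, reply)
Del = defaultdict(set)   # deleted recommendations
Imp = defaultdict(set)   # impressions shown by the XING recommender

with open("interactions.csv") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        u, i = int(row["user_id"]), int(row["item_id"])
        if int(row["interaction_type"]) == 4:   # assumed: 4 = delete
            Del[u].add(i)
        else:                                   # assumed: 1/2/3 = click/bookmark/reply
            Int[u].add(i)

with open("impressions.csv") as f:
    for row in csv.DictReader(f, delimiter="\t"):
        # assumed layout: one impressed item per row
        Imp[int(row["user_id"])].add(int(row["item_id"]))
```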
Evaluation measure

The ground truth is a mapping that assigns to each user the set of items he interacted positively with during the test week. Let T be the set of target users, I the set of all items, and G : T → 2^I the ground truth. We will also denote the sequence of items predicted for a user u by pred(u). The quality of recommendations was measured by the function

    score(pred, G) = ∑_{u ∈ T} userScore(pred(u), G(u)),    (1)

where
    userScore(pred(u), G(u)) = 20 · [ p(pred(u), G(u), 2) + p(pred(u), G(u), 4) + us(pred(u), G(u)) + r(pred(u), G(u)) ]
                             + 10 · [ p(pred(u), G(u), 6) + p(pred(u), G(u), 20) ],

and, for a sequence ā = a_1, a_2, …, a_30, a set B, and a natural number k:

    p(ā, B, k) = |{a_1, …, a_k} ∩ B| / k
    r(ā, B) = |{a_1, …, a_30} ∩ B| / min(30, |B|)
    us(ā, B) = min(1, |{a_1, …, a_30} ∩ B|).

Note that p(ā, B, k) is the precision for the first k elements of ā, r(ā, B) is the recall, and us(ā, B) is the user success (1 if we predicted at least one item for this user correctly, 0 otherwise).
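As a concrete reading of the formulas above, here is a minimal sketch of the measure, assuming pred is an ordered list of at most 30 item IDs and truth is the ground-truth set G(u); this is our reconstruction, not official evaluation code.

```python
# Minimal sketch of the evaluation measure defined above (Eq. 1).
def p_at(pred, truth, k):
    return len(set(pred[:k]) & truth) / k               # precision at k

def recall(pred, truth):
    return len(set(pred[:30]) & truth) / min(30, len(truth))

def user_success(pred, truth):
    return min(1, len(set(pred[:30]) & truth))          # at least one hit?

def user_score(pred, truth):
    return (20 * (p_at(pred, truth, 2) + p_at(pred, truth, 4)
                  + user_success(pred, truth) + recall(pred, truth))
            + 10 * (p_at(pred, truth, 6) + p_at(pred, truth, 20)))

def score(predictions, ground_truth):                   # sum over target users
    return sum(user_score(predictions[u], ground_truth[u]) for u in ground_truth)
```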
Solutions were evaluated by the submission system, and contestants received instant feedback with the score value calculated on a sample of the target users.

Our solution

Our solution consists of two parts:
1. for each user, we calculate a set of candidate items, much smaller than the whole set I (only those candidates were considered when creating a submission),
2. for each (user, candidate) pair (u, i), we learn the probability P[i ∈ G(u)] that the user will interact with this item.

We submitted, for each user, the 30 items with the highest predicted probability of interaction, excluding items that the user deleted. Limiting the number of considered items per user allowed us to substantially decrease the time required to train a model and prepare a submission.
Examining all possible user-item pairs (150,000 × 300,000 pairs) was infeasible considering our resources. To learn the probabilities P[i ∈ G(u)], we used Gradient Boosting Decision Trees (GBDT) [1], optimizing the logloss measure. We learned the probabilities instead of directly optimizing the score function, since the score function does not give results for a single user-item pair. The schema of our solution is presented in Figure 1.

Our solution was ranked 2nd in the competition, scoring 675985.03 and 2035964.16 points on the public and private leaderboards, respectively. To put this into perspective, submitting for each user u the items from Int_u sorted from most recent, but without the items from Del_u (adding Imp_u if there were fewer than 30 unique interactions), achieved a score of 495k on the public leaderboard. All of the computations were performed on a 12-core (24-thread), 64 GB RAM Linux server.

Training set

In this problem, there was no clearly defined training set, and the first challenge was to create it. Since our task was to predict users' interactions in the week following the end of the available data, we trained our model on all the data except the last available week and then used this last week's data to compute the training ground truth G used to evaluate the model. This way, we could calculate the score and determine whether we were making progress without sending an official submission (every team was allowed at most 5 submissions per day). There was an overlap between the training data and the test data (the full dataset). Both candidates and features were computed separately for the training set and the full dataset.

Candidate selection

Since there were 150k target users and more than 300k items, making a prediction for each user-item pair would take too long. To address this issue, we chose for each user u a set of promising items, which we called candidates.

To define candidates, we will need a few notions of similarity. The Jaccard coefficient between two sets A and B is J(A, B) ≡ |A ∩ B| / |A ∪ B|. The interaction (impression) similarity between two users u, u′, denoted Int-sim(u, u′) (Imp-sim(u, u′)), is the Jaccard coefficient between the sets of items from their positive interactions (impressions). For example, Int-sim(u, u′) = J(Int_u, Int_{u′}). For items i, i′, we will denote common-tags(i, i′) ≡ |tags(i) ∩ tags(i′)| and common-title(i, i′) ≡ |title(i) ∩ title(i′)|.

For a user u, the candidates were:
1. Int_u, sorted by the week of occurrence (most recent first) and the number of interactions,
2. Imp_u, sorted the same way as in 1,
3. Int_{u′} for users u′ with large Int-sim(u, u′),
4. Imp_{u′} for users u′ with large Imp-sim(u, u′),
5. items i with large max_{i′ ∈ Int_u} common-tags(i, i′); similarly for max_{i′ ∈ Int_u} common-title(i, i′), max |title(i′) ∩ tags(i)|, and max |title(i) ∩ tags(i′)|,
6. same as 5, just with the max taken over i′ ∈ Imp_u,
7. items i with large |jobroles(u) ∩ tags(i)|,
8. items i with large |jobroles(u) ∩ title(i)|,
9. the most popular items (globally; this list was the same for all users).

The popularity of an item was measured by the number of interactions of all users with this item. From each category, we took the 60 best candidates for each user. This approach gave us 43M (user, item) pairs, on average just short of 300 candidates per user. On the training set, candidates chosen this way covered 37% of the training ground truth. One of these candidate sources is sketched below.
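The following sketch illustrates candidate family 3 (items positively interacted with by the most Int-similar users). The similarity weighting and the exact cutoffs are our illustrative choices, not the authors' procedure.

```python
# Illustrative sketch of candidate family 3: items from users most similar to u.
def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def candidates_from_similar_users(u, Int, top_users=60, top_items=60):
    # rank other users by interaction similarity Int-sim(u, u')
    sims = sorted(((jaccard(Int[u], Int[v]), v) for v in Int if v != u),
                  reverse=True)[:top_users]
    # score each unseen item by the summed similarity of the users who clicked it
    scores = {}
    for s, v in sims:
        for i in Int[v] - Int[u]:
            scores[i] = scores.get(i, 0.0) + s
    return sorted(scores, key=scores.get, reverse=True)[:top_items]
```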
Of course, we could have chosen different notions of similarity between users/items. In particular, we considered similarity based on user and item properties such as region, industry, etc. Candidates based on those measures of similarity did not improve the score sufficiently, probably due to the anonymization of the data.

The model

We wanted to construct a model that, given a (u, i) pair and the values of the features for this pair, computes the probability P[i ∈ G(u)]. In order to estimate the probabilities, we used XGBoost (https://github.com/dmlc/xgboost), a machine learning library implementing GBDT. For each user in the training set, the training examples were:

• all the training candidate items which occurred in the training ground truth (positive candidates),
• and up to 5 training candidate items which did not occur in the training ground truth (negative candidates).

Just after the deadline for submitting solutions, we observed that training a model on all users, with all positive and a sampled fraction of the negative candidates, would have improved our score. Note that the difference between 1st and 2nd place was 5.7k points, so extending the training file earlier could have resulted in our team winning the competition.
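A minimal XGBoost sketch of this setup, training binary logloss on (user, candidate) rows: the feature matrices and file names are placeholders, and the parameters here are library defaults rather than the tuned values discussed below.

```python
# Sketch: training GBDT on (user, candidate) pairs with XGBoost.
import numpy as np
import xgboost as xgb

X_train = np.load("train_features.npy")   # one row of features per (u, i) pair
y_train = np.load("train_labels.npy")     # 1 iff i is in the training ground truth G(u)

dtrain = xgb.DMatrix(X_train, label=y_train)
params = {"objective": "binary:logistic", "eval_metric": "logloss"}
model = xgb.train(params, dtrain, num_boost_round=1000)

# predicted P[i in G(u)] for the full-dataset candidates
prob = model.predict(xgb.DMatrix(np.load("test_features.npy")))
```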
Evaluation

We used maximum likelihood as the objective function optimized by GBDT. Additionally, we verified models by computing the score function on the validation part of the training ground truth. Major improvements in this validation score translated to comparable improvements in the score achieved via the submission system. We measured on the training ground truth that our way of ordering the previously selected candidate items achieved a score which was 77.5% of the best possible result based only on those candidates.
Parameter tuning
We found that the optimal XGBoost parameters for our task were:

• maximum depth of a tree (max_depth) in range [4, …],
• minimum weight of a node to be split (min_child_weight) in range [4, …],
• learning rate (eta) = 0.…,
• minimum loss reduction to make a node partition (gamma) = 1.…,
• number of rounds (num_round) = 1000 (the validation logloss did not improve after 1000 rounds).
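Expressed as an XGBoost parameter dictionary, the setup looks as follows. The concrete numeric values below are placeholders, since the tuned values are only partially reported above.

```python
# Hypothetical parameter dictionary mirroring the list above; the numeric
# values marked "placeholder" stand in for the tuned values.
params = {
    "objective": "binary:logistic",
    "max_depth": 6,            # tuned in a range starting at 4 (placeholder)
    "min_child_weight": 6,     # tuned in a range starting at 4 (placeholder)
    "eta": 0.05,               # learning rate (placeholder)
    "gamma": 1.0,              # minimum loss reduction for a split (placeholder)
}
num_round = 1000               # validation logloss plateaued at 1000 rounds
```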
Features

Each feature is a function that, for a user u and an item i, maps the pair (u, i) to some real number. We ended up with 273 features. Many of these features were highly correlated; however, we observed that redundant features do not reduce the quality of the model. Many features differ only in:

• the event source used: impressions instead of interactions, only positive/negative interactions, only interactions of one type,
• using the size of the set intersection |A ∩ B| instead of the Jaccard coefficient, and vice versa,
• the aggregate function used: maximum, minimum, sum, average, count, or unique count (count without duplicate entries).

Because of this, we will only describe the important groups of features. There are 12 such groups; Table 2 summarizes the importance of the features we used. Ideas for features were inspired by papers from previous RecSys Challenges [2, 3] and by similar competitions hosted on the Kaggle platform.

Table 2: Importance of the feature groups (selected sub-features indented).

    event based                                      5499
        tags + title                                  946
    item global popularity                           2913
        trend                                        1392
        weekday                                       599
    cf most similar                                  1410
        item clicked by user                          790
        user who clicked item                         620
    user total events                                1066
        in last week                                  494
    seconds from last user activity                   882
    max common tags with clicked item                 527
    position on the candidates list                   456
    user-item events count in last week               288
    item properties                                   375
        created_at                                    166
        longitude                                     125
        latitude                                       84
    content based user-item similarity                 83
        career level difference                        47
        common user job roles and item titles          15
    distance to the closest clicked item               55
    items cluster                                       7

Event based features are percentages of items from Int_u that had some property (e.g., the item's career_level) equal to item i's corresponding property. Dually, we also used the percentages of users from Int-users(i) (i.e., the users who interacted with a given item i) that had some attribute equal to user u's attribute. From this group of features, the best were those based on the item tags and item title fields.

Item global popularity is the number of times item i was clicked by any user. Additionally, we computed the trend of popularity: the click count in the last week divided by the click count in the week before. Another way to observe the weekly trend was to compute the trend based on events from the last and previous Mondays, Tuesdays, …, Sundays.

Collaborative filtering most similar are features that measure the similarity between the item i and the items from Int_u using I-sim (the dual of Int-sim: the Jaccard coefficient between the sets of users who positively interacted with each item), and dually between the user u and the users who interacted with i, using Int-sim. Formally, these are given by the formulas:

• max_{i′ ∈ Int_u} I-sim(i, i′),
• max_{u′ ∈ Int-users(i)} Int-sim(u, u′).

User total events is just |Int_u|, both with repetitions of items and without. We also used this feature limited to the user's last week of activity.

Seconds from last user activity are the differences, in seconds, between:
• the last time u clicked i and the maximal timestamp in the data,
• the last time u clicked i and the last time u clicked any item,
• the last time u clicked any item and the maximal timestamp in the data.
Analogous features based on impressions were measured in weeks.

Max common tags with clicked item are features that return the maximum number of common tags between item i and the items from Int_u: max_{i′ ∈ Int_u} common-tags(i, i′). We also used these features with common-title(i, i′) instead of common-tags(i, i′), and with impressions instead of interactions.

Position on the candidates list is the position of item i on user u's candidates list. There was a separate feature for each candidate algorithm (see the candidate selection above).

User-item events count in last week is the count of how many times user u clicked item i in the last week of this user's activity, or in the last week of the dataset.

Item properties are the values of item i's attributes.

Content based user-item similarity is a group of features based only on the properties of u and i. The features were:
• career_level(i) − career_level(u),
• |jobroles(u) ∩ title(i)|,
• |jobroles(u) ∩ tags(i)|,
• 1 if attr(u) = attr(i), else 0, for the rest of the matching attributes.

Distance to the closest clicked item is the Euclidean distance between the location of item i and the location of the closest item from Int_u.
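As an illustration of two of the feature groups above, here is a sketch computing the popularity trend and the "seconds from last user activity" features. Here, clicks is an assumed list of (user, item, timestamp) tuples, and the week arithmetic is simplified.

```python
# Illustrative computation of two feature groups described above.
from collections import Counter

WEEK = 7 * 24 * 3600  # one week in seconds (simplified)

def popularity_trend(clicks, t_max):
    last, prev = Counter(), Counter()
    for _, item, ts in clicks:
        if ts > t_max - WEEK:
            last[item] += 1
        elif ts > t_max - 2 * WEEK:
            prev[item] += 1
    # clicks in the last week divided by clicks in the week before
    return {i: last[i] / prev[i] for i in last if prev[i] > 0}

def seconds_from_last_activity(clicks, t_max):
    last_ts = {}
    for user, _, ts in clicks:
        last_ts[user] = max(last_ts.get(user, 0), ts)
    # difference between the user's last click and the maximal timestamp in data
    return {u: t_max - ts for u, ts in last_ts.items()}
```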
Item cluster is a boolean feature, true if the item i is in the cluster of some item from Int_u, where i ∈ cluster(i′) if i ≠ i′ and there exists a user u that clicked both i and i′ within 10 minutes. This feature is another variation of the similarity between items: this time, we say that two items are similar if a user clicked both of them at a similar time. The motivation is simple: if both items were interesting to some user, then they probably have something in common that appeals to this user.

Some of the features (especially time-related ones) had to be calculated separately, and with care, for the training dataset and the full dataset. For example, we had a feature "timestamp of the last interaction". On a training instance, this feature should be shifted in order to cover the same range of values as on the full dataset.

Merging models

In the final step of the construction of our solution, we merged our best models. Since all of them were similar and XGBoost-based, we took, for each (u, i) pair, the arithmetic mean of the probabilities P[i ∈ G(u)] calculated by those models. Finally, for each target user, we sorted the candidate items by these averaged probabilities and selected the top 30.

Conclusions

We have presented Mim-Solution's approach to the 2016 RecSys Challenge. We used XGBoost to predict the probabilities that a user will be interested in a job offer, but only for preselected offers. This allowed us to efficiently train and evaluate the model, as well as to use complex and robust features. Even though our score was good (28.5% of the best possible score measured on our training data), there was definitely a lot of room to improve. One easy improvement, mentioned already in the paper, is increasing the size of the training set. There was also some room for improvement in the ordering of candidate items, but we suspect that the room for improvement was twice as large in expanding the set of candidates (we achieved 77.5% and 37% of the best possible results in the layers of sorting and selecting candidates, respectively).

Acknowledgments

This work is supported by ERC project PAAl-POC 680912.
References

[1] T. Chen and C. Guestrin. XGBoost: A scalable tree boosting system. In B. Krishnapuram, M. Shah, A. J. Smola, C. Aggarwal, D. Shen, and R. Rastogi, editors, Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, San Francisco, CA, USA, August 13-17, 2016, pages 785-794. ACM, 2016.

[2] P. Romov and E. Sokolov. RecSys Challenge 2015: Ensemble learning with categorical features. In Proceedings of the 2015 International ACM Recommender Systems Challenge, RecSys '15 Challenge, pages 1:1-1:4, New York, NY, USA, 2015. ACM.

[3] M. Volkovs. Two-stage approach to item recommendation from user sessions. In Proceedings of the 2015 International ACM Recommender Systems Challenge, RecSys '15 Challenge, New York, NY, USA, 2015. ACM.