Metric Learning for Session-based Recommendations
A PREPRINT
Bartłomiej Twardowski
Computer Vision Center, Universitat Autónoma de Barcelona
Warsaw University of Technology, Institute of Computer Science
[email protected]
Paweł Zawistowski
Warsaw University of Technology,Institute of Computer Science [email protected]
Szymon Zaborowski
Sales Intelligence

January 8, 2021

ABSTRACT
Session-based recommenders, used for making predictions out of users' uninterrupted sequences of actions, are attractive for many applications. Here, we propose using metric learning for this task: a common embedding space for sessions and items is created, and distance measures the dissimilarity between the provided sequence of users' events and the next action. We discuss and compare metric learning approaches to commonly used learning-to-rank methods, with which some synergies exist. We propose a simple architecture for problem analysis and demonstrate that neither extensively big nor deep architectures are necessary to outperform existing methods. Experimental results against strong baselines on four datasets are provided, together with an ablation study.

Keywords: session-based recommendations · deep metric learning · learning to rank

We consider the session-based recommendation problem, which is set up as follows: a user interacts with a given system (e.g., an e-commerce website) and produces a sequence of events (each described by a set of attributes). Such a continuous sequence is called a session; thus we denote $s_k = \{e_{k,1}, e_{k,2}, \ldots, e_{k,t}\}$ as the $k$-th session in our dataset, where $e_{k,j}$ is the $j$-th event in that session. The events are usually interactions with items (e.g., products) within the system's domain. In comparison to other recommendation scenarios, in session-based recommendations information about the user across sessions is not available (in contrast to session-aware recommendations).
Also, the browsing sessions originate from a single site (which is different from task-based recommendations). The sequential nature of session-based recommendations means that the task shares similarities with tasks found in natural language processing (NLP), where sequences of characters, words, sentences, or paragraphs are analyzed. This connection leads to a situation where many methods that are successful in NLP are later applied to the field of recommendations. One such example is connected with recurrent neural networks (RNNs), which have led to a variety of approaches applied to recommender systems [1, 2, 3]. Another is connected with the transformer model [4] applied to model users' behavior [5].

Despite the apparent steady progress connected with neural methods, there are indications that properly applied classical methods may very well beat these approaches [6]. Therefore, in this paper we propose combining the classical KNN algorithm with a neural embedding function, based on an efficient neighborhood selection of top-n recommendations. The method learns embeddings of sessions and items in the same metric space, where a given distance function measures dissimilarity between the user's current session and the next items. For this task, a metric learning loss function and data sampling are used to train the model. During evaluation, the nearest neighbors are found for the embedded session. This makes the method attractive for real-life applications, as existing tools and methods for neighborhood selection can be used.
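To make the inference step concrete, here is a minimal NumPy sketch with hypothetical names: the session is embedded (max-pooling over its items' embeddings is used here as one simple encoder choice), and the nearest items in the shared space become the recommendations. In practice, an approximate nearest-neighbor index would replace the brute-force search.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x)

def recommend_top_n(session_item_embs, all_item_embs, n=20):
    # Embed the session by dimension-wise max-pooling over its items' embeddings,
    # then rank all items by cosine distance (1 - dot product on unit vectors).
    session_emb = l2_normalize(session_item_embs.max(axis=0))
    dists = 1.0 - all_item_embs @ session_emb
    return np.argsort(dists)[:n]  # indices of the n nearest items
```

This is the step where off-the-shelf neighborhood-selection tooling can be plugged in, as noted above.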
The main contributions of this paper are as follows:
• we verify selected metric learning tools for session-based recommendations,
• we present a comparison of the metric learning approach and learning to rank, where some potential future directions for recommender systems can be explored based on the latest progress in deep metric learning,
• we introduce a generic model for recommendations, which allows the impact of different architectures of session and item encodings on the final performance to be evaluated, which we do in the provided ablation studies,
• we evaluate our approach using known protocols from previous session-based recommendation works against strong baselines over four datasets; for the sake of reproducibility and future research, the source code is available at https://github.com/btwardow/dml4rec.

Time and sequence models in context-aware recommender systems were used before deep learning methods emerged. Many of these approaches can be applied to session-based recommendation problems with some additional effort to represent time, e.g., modeling it as a categorical contextual variable [7, 8] or as an explicit bias while making predictions [9]. The sequential nature of the problem can also be simplified and handled with other well-known methods, e.g., Markov chains [10], or by applying KNNs combined with calculating the similarities of session item sets [11]. The GRU4Rec method [1] has been an important milestone in applying RNNs to session-based recommendation tasks. The authors focused on sessions solely represented by interactions with items and proposed a few contributions: using GRU cells for session representation, negative exemplar mining within the mini-batch, and a new
TOP1 loss function. In the follow-up work [12], the authors proposed further improvements to the loss functions. Inspired by the successful application of convolutional neural networks (CNNs) to textual data [13], new methods were proposed. One example is the Caser approach [14], which uses a CNN-based architecture with max pooling layers for top-n recommendations for sessions. Another, proposed in [15], utilises dilated 1D convolutions similar to WaveNet [16]. The embedding techniques known from NLP, e.g., skip-gram and CBOW, were also extensively investigated for recommender systems; methods such as item2vec and prod2vec were proposed for embedding-based approaches. However, recently conducted experiments with similar approaches were unsuccessful in obtaining better results than simple neighbourhood methods for session-based recommendations [17].
Metric learning has a long history in information retrieval. Among the early works, the SVM algorithm was used to learn from relative comparisons in [18]. Such an approach directly relates to Mahalanobis distance learning, which was pursued in [19] and [20]. Even though new and more efficient architectures emerge constantly, the choice of loss functions and training methods still plays a significant role in metric learning. In [21], the authors proposed the use of contrastive loss, which minimizes the distance between similar pairs while ensuring the separation of non-similar objects by a given margin. For some applications it was found hard to train, and in [22] the authors proposed an improvement by using an additional data point, the anchor. All three data points form the input to the triplet loss function, where the objective is to keep the negative examples further away from the anchor than the positive ones by a given margin. Recently, more advanced loss functions were proposed: using angular calculation in triplets [23], the signal-to-noise ratio [24], and multi-similarity loss [25]. Still, contrastive and triplet losses have proven to be a strong baseline in many applications when trained correctly [26]. Nevertheless, the high computational complexity of data preparation (i.e., creating point tuples for training) for contrastive and triplet approaches cannot be solved by changing only the loss function. These problems are addressed by different dataset sampling strategies and efficient mining techniques. One notable group here is online approaches, which explore relations within a given mini-batch while training, e.g., hard mining [26], n-pairs [27], the lifted structure method [28], and weighting by distance [29]. Many combinations of sampling and mining techniques, along with the loss functions, can be created, which makes a fair comparison hard [30, 31, 32].
An ordered output of the session-based recommender, in the form of a sorted list for a given input $s_k$, is the ranking $r_k$. In learning-to-rank, as well as in recommender systems, the main difficulty is the direct optimization of the output's quality measures (e.g., recall, mean average precision, or mean reciprocal rank). The task is hard for many (gradient-based) methods due to the non-smoothness of the optimized function [33]. This problem can be resolved either by minimizing a convex upper bound of the loss function, e.g., SVM-MAP [34], or by optimizing a smoothed version of an evaluation measure, e.g., SoftRank [35]. Many approaches exist, which depend on the model of ranking: pointwise (e.g., SLIM [36], MF [37], FM [8]), pairwise (BPR [38], pLPA [39], GRU4Rec [1]), or list-wise (e.g., CLIMF/xCLIMF [40, 41], GAPfm [42], TFMAP [43]). However, not all are applicable to session-based recommendations. Pairwise approaches for ranking top-N items are the most commonly used, along with neural network approaches. In the GRU4Rec method, two pairwise loss functions were used for training, Bayesian Personalized Ranking (BPR) [38] and TOP1:

$$l_{BPR}(s_k, i_p, i_n) = -\ln\big(\sigma(\hat{y}_{s_k,i_p} - \hat{y}_{s_k,i_n})\big) \quad (1)$$

$$l_{TOP1}(s_k, i_p, i_n) = \sigma(\hat{y}_{s_k,i_n} - \hat{y}_{s_k,i_p}) + \sigma(\hat{y}_{s_k,i_n}^2) \quad (2)$$

where $s_k$ denotes the session for which $i_p$ is a positive example of the next item, and $i_n$ is a negative one. $\hat{y}_{s_k,i}$ is the score value predicted by the model for the session $s_k$ and the item $i$. The score value allows items to be compared and the ordered list $r_k$ to be produced, where, e.g., $i_p >_{r_k} i_n$, and $>_{r_k}$ denotes a total order over $I$ [38].

In metric learning, the main goal is to align distance with the dissimilarity of objects.
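The two pairwise losses in Eqs. (1) and (2) can be written directly from the score differences. A small NumPy sketch (the epsilon guard against log(0) is our addition):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss(y_pos, y_neg):
    # Eq. (1): push the positive item's score above the negative item's score.
    return -np.log(sigmoid(y_pos - y_neg) + 1e-12)

def top1_loss(y_pos, y_neg):
    # Eq. (2): rank term plus a regularization term pulling negative scores to zero.
    return sigmoid(y_neg - y_pos) + sigmoid(y_neg ** 2)
```

Both losses shrink as the positive score rises above the negative one, which is exactly the pairwise ordering objective.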
In [21], the contrastive loss function for two vectors $x_i, x_j \in \mathbb{R}^d$ is given as:

$$l_{Cont}(x_i, x_j) = y\, d(x_i, x_j) + (1 - y) \max\big(0,\; m - d(x_i, x_j)\big) \quad (3)$$

where $y$ is an indicator variable (1 if both vectors are from the same class, 0 otherwise), $m \in \mathbb{R}_+$ is the margin, and $d(x_i, x_j)$ is a distance function, e.g., Euclidean or cosine. This loss function pulls similar items together ($y = 1$) and pushes dissimilar ones apart. A direct extension, the triplet loss [22], is defined as follows:

$$l_{Triplet}(x_a, x_p, x_n) = \max\big(0,\; d(x_a, x_p) - d(x_a, x_n) + m\big) \quad (4)$$

where $x_p$ and $x_n$ are, respectively, positive and negative items for a given anchor $x_a$, and $m \in \mathbb{R}_+$ is the margin. Both contrastive and triplet losses can be used to optimize the goal of a total ordering of objects [44, 38] as induced by the learned metric. If $d(x_i, x_j) = 0$ does not imply $x_i = x_j$, then $d$ is a pseudometric [18], and a total order cannot be induced. If we assume that two functions $\varphi(s_a) = x_a$ and $\omega(i_k) = x_k$ are given to embed the session and the item into the same $\mathbb{R}^d$ space, where scoring is done by cosine similarity $\hat{y}_{s,i} = 1 - d(\varphi(s), \omega(i))$, then the previously defined ranking losses and the metric loss can be presented as:

$$l_{BPR}(s_k, i_p, i_n) = -\ln(\sigma(d_{kn} - d_{kp})), \quad (5)$$

$$l_{TOP1}(s_k, i_p, i_n) = \sigma(d_{kp} - d_{kn}) + \sigma((1 - d_{kn})^2), \quad (6)$$

$$l_{Triplet}(s_k, i_p, i_n) = \max(0,\; d_{kp} - d_{kn} + m) \quad (7)$$

where $d_{kj} = d(\varphi(s_k), \omega(i_j))$. A direct connection can be seen: minimizing each of the loss functions will try to keep $i_p$ closer to $s_k$ than $i_n$. In all cases, for session-based recommendations, positive items are known, while the negatives are sampled from the rest of the items (e.g., uniformly or by a given heuristic). In both BPR and
TOP1, a sigmoid $\sigma(x)$ function is used to optimize AUC in place of the non-differentiable Heaviside function, as explained in [38]. In TOP1, the authors added a regularization term for negative predictions, which further constrains the embedding space by keeping negatives close to zero. Metric learning losses use a rectifier nonlinearity ($\max(0, x)$) to prevent moving data points that are already in order. When considering the partial derivatives w.r.t. the distances between our anchor session $s_k$ and the positive and negative items, they contribute equally, as discussed in [25]. If more relations are explored in a single calculation (usually inside the same mini-batch), techniques like lifted structures [28] are used. However, the relations are made between known classes of examples. In learning to rank, each instance inside a selected set can be ordered, which can be used, e.g., to estimate the overall ranking, as in Weighted Approximate-Rank Pairwise (WARP) [45]. All these losses have one more important thing in common: they do not take into account the relationship between the positive and negative items themselves (without the anchor). This is a subject of further improvements in metric learning methods like [23, 25]. In our solution, we propose using a simple ranking weighting to address this shortcoming.

We propose a method for session-based recommendations using deep metric learning, where the main input is the sequence of a user's actions (i.e., the session) $s_k = \{e_{k,1}, e_{k,2}, \ldots, e_{k,t}\} \in S$, and items $i \in I$. At a high level, the network's architecture can be described as $\hat{y}_{s_k,i} = d(\varphi(s_k), \omega(i))$, where $\varphi$ and $\omega$ denote the session and item encoders respectively, and $\hat{y}_{s_k,i}$ denotes the score for recommending item $i$ in the context of session $s_k$. We decided on a simple and modular approach in order to investigate the impact of each module on the final outcome, focusing mainly on the session encoder and different metric loss functions.
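For intuition, the triplet loss on cosine distances, Eq. (7), can be sketched as follows (the margin of 0.3 matches the value used in the experiments later in the paper):

```python
import numpy as np

def cosine_distance(u, v):
    # Cosine distance for nonzero vectors; 0 for identical directions, up to 2.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def triplet_loss(session_emb, pos_emb, neg_emb, margin=0.3):
    # Eq. (7): hinge on the gap between the positive and negative distances;
    # the loss is zero once the negative is at least `margin` farther away.
    d_kp = cosine_distance(session_emb, pos_emb)
    d_kn = cosine_distance(session_emb, neg_emb)
    return max(0.0, d_kp - d_kn + margin)
```

Note how a well-ordered triplet (positive closer than the negative by more than the margin) contributes nothing to the gradient, which is the rectifier behavior discussed above.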
The only constraint the model places on the used networks is a common dimensionality $\varphi(s_k), \omega(i) \in \mathbb{R}^d$ for learning a shared metric space. The outputs of the networks are normalized, and the cosine distance function $d(\varphi(s_k), \omega(i))$ is used in the final scoring $\hat{y}_{s_k,i}$ calculation.

The overall triplet loss function is calculated over the prepared training dataset. Assuming that session $s_k$ has $L$ positive items, the final triplet loss function for balanced positive-negative sampling is as follows:

$$\mathcal{L} = \frac{1}{|S|} \sum_{s_k \in S} \sum_{j=0}^{L} w_j \max\big(0,\; d(\varphi(s_k), \omega(i_p)) - d(\varphi(s_k), \omega(i_n)) + m\big) \quad (8)$$

where $w_j$ is the weight used for a particular position. In the experiments, we used $p/(1+j)$ for weighting, which is expected to change the magnitude of the calculated gradient based on the ranking position. To incorporate the relation between positive and negative items, we used a swapping technique for the triplet loss, where the anchor is exchanged with the positive and the final distance to a negative point is taken as the minimum $d'_{kn} = \min(d_{kn}, d_{pn})$.

Based on the NCA loss [46, 47], commonly used in deep metric learning, we introduce a version prepared for ranking session-based recommendations as follows:

$$p(i_j \mid s_k) = \frac{\exp(-d(\varphi(s_k), \omega(i_j)))}{\sum_{i_z \in Z} \exp(-d(\varphi(s_k), \omega(i_z)))} \quad (9)$$

$$\mathcal{L}_{NCAS} = \frac{1}{|S|} \sum_{s_k \in S} \mathrm{KLD}\big(p(i \mid s_k)\, \|\, p'(i)\big) \quad (10)$$

where predictions of the true labels inside $N$-sized mini-batches are smoothed with $p'(i) = (1 - \epsilon)p(i) + \epsilon/N$, and $Z$ is a sampled set containing positive and negative examples for each session $s_k$. The main goal of using this loss function was to compare the triplet loss to other functions applicable in our setting, in order to gain more insight into its applicability and results. We use several neural network architectures for the session encoder module.
Each of these networks takes as input a sequence of session events, which are clicked items in all used datasets, and encodes it to a vector of embedding size $d$. The network architectures used as session encoders are as follows:

• Pooling - this architecture embeds the sequence of clicked items into a vector of size $d$ by pooling the maximum or average value in each dimension. It is inspired by how pre-trained embeddings (e.g., word2vec) are used in NLP downstream tasks. However, all sequential relations are lost.
• CNN-based approaches, including TextCNN [13], TagSpace [48], and Caser [14].
• RNN-based approaches - these use one of the chosen recurrent networks (GRU, LSTM, RNN) to encode the sequence, followed by multiple fully connected layers to generate recommendation scores for individual items.

Training data is prepared from all available users' sessions $S$. We want to predict the user's next action for a given session $s_k$; thus, training data preparation tries to enforce this for the model. Each session is split randomly: the first part is used as the network input $s_k$, and the following actions with items are used as positive examples for that session, $i_{p,1}, \ldots, i_{p,l}$. For each of the $l$ positives, the same number of negatives is sampled randomly. We investigated a few different strategies for the case when a session after the random split does not have enough positive examples. One successful approach we used is to prepare more positives before training using the KNN method. Sampling is done at the beginning of each training epoch. However, the improved MRR score is counterbalanced by lower item coverage. In other works, the negative sampling is done randomly from all non-positive items, e.g., [8].
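The split-based preparation described above can be sketched as follows (a simplified illustration with hypothetical helper names; the KNN-based positive augmentation mentioned above is omitted here):

```python
import random

def prepare_training_example(session, all_items, seed=0):
    # Randomly split the session: the prefix becomes the model input, the items
    # that follow are positives; one random negative is drawn per positive from
    # items that do not occur in the session (balanced sampling).
    rng = random.Random(seed)
    split = rng.randint(1, len(session) - 1)
    prefix, positives = session[:split], session[split:]
    candidates = [i for i in all_items if i not in set(session)]
    negatives = rng.sample(candidates, len(positives))
    return prefix, positives, negatives
```

Re-running this at the start of every epoch, as described above, gives each session a fresh split point and fresh negatives.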
From the optimization perspective, [1] took a different approach and sampled negative examples from the same mini-batch given to the network. This relates to the online sample mining used in deep metric learning techniques, but here without enforcing a margin of error as in hard negative mining [26].
Dataset    Source                        Items        Sessions     Events
RR/5       Retail Rocket                 32K (117K)   64K (380K)   242K (606K)
RSC15/64   RecSys Challenge 2015
SI-T       Proprietary e-commerce data
SI-D       Proprietary e-commerce data
Table 1: Experimental dataset stats — (before) and after preprocessing.
Figure 1: Session length distribution (left) and repeating items (right) for each dataset.
To conduct our experiments, we followed the procedure used by [6]: five splits were used for RR, and 1/64 of the RSC15 data. For each dataset, we split the events into individual user sessions and removed those that contained only a single event. Furthermore, we included only items that occurred at least five times in the data. A train-test split was prepared by taking the last portion of the sessions. We evaluated our models using common information retrieval and ranking metrics: mean average precision, mean reciprocal rank, recall, precision, and hit ratio. All metrics were computed on a list of the top 20 recommendations. Following [6] and [1], in the case of MRR@20 and HR@20 only the next item was used as the ground truth. This no-look-ahead evaluation can be considered more adequate: after each user action, the session state is updated and predictions for the user's next step are given.
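The next-item variants of HR@20 and MRR@20 can be sketched as follows (hypothetical helper; `recommended` is the ranked list, `ground_truth` the single next item):

```python
def hr_mrr_at_k(recommended, ground_truth, k=20):
    # No-look-ahead evaluation: a single ground-truth next item is scored
    # against the top-k recommendation list.
    top_k = list(recommended)[:k]
    if ground_truth in top_k:
        rank = top_k.index(ground_truth) + 1  # 1-based rank
        return 1.0, 1.0 / rank                # (hit, reciprocal rank)
    return 0.0, 0.0
```

Averaging the two returned values over all evaluation steps yields HR@k and MRR@k respectively.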
To conduct the experiments, we used four datasets from the e-commerce domain, summarized in Table 1. Two of these (RR/5 and RSC15/64) are standard benchmarks for session-based recommender systems, while the remaining ones are smaller, real-world proprietary datasets gathered in early 2020. The difference between SI-T and SI-D is the category of products for which the data were collected. In all datasets, users' events are represented only by interactions with products (i.e., view, click), thus $e_{k,l} \sim I$.

Fig. 1 (left) presents histograms of session lengths for the preprocessed datasets, showing that short sessions dominate in all of them. This may be challenging for methods that focus on the sequential nature of the users' data. Furthermore, when analyzing the percentage of recurring items within sessions, presented in Fig. 1 (right), it may be noticed that sessions frequently contain multiple interactions with the same products. The data suggests that users revisit already seen items quite often. However, this also poses an interesting question from the perspective of recommender systems: should such a system suggest items that a user has already seen in the given session, or only new ones? The answer depends on the specific use case and whether the system should provide a more explorative or exploitative user experience.

We compared our Session-based Metric Learning (SML) method against six baseline algorithms. Starting from the simplest ones,
POP denotes a popularity-based algorithm, which simply recommends the top-n most popular items. SPOP recommends items already seen in the session, ordered by the number of occurrences, and fills the rest with popular ones. This recommender performs well when predictions are expected to be repetitions. The KNN algorithm was the basis of the next two baseline methods: SKNN and VSKNN. The SKNN approach, for a given session, recommends the top-n most frequent items among the $K$ most similar sessions from the training data, for which a cosine distance is used. The VSKNN [6] approach works similarly; however, it puts more weight on the more recent events in a given session. The last two methods are a Markov first-order recommender, reported as MARKOV-1, and GRU4Rec+ [12].
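As a reference point, the SKNN baseline described above can be sketched as follows (a simplified version using binary item-set representations and cosine similarity; the actual baseline results follow the implementations of [6]):

```python
import numpy as np
from collections import Counter

def sknn_recommend(session, train_sessions, k=2, top_n=20):
    # Cosine similarity between binary item-set representations of sessions;
    # assumes non-empty sessions.
    s = set(session)
    def cos(a, b):
        return len(a & b) / (np.sqrt(len(a)) * np.sqrt(len(b)))
    # Score items by frequency among the K most similar training sessions.
    neighbors = sorted(train_sessions, key=lambda t: cos(s, set(t)), reverse=True)[:k]
    counts = Counter(item for t in neighbors for item in t)
    return [item for item, _ in counts.most_common(top_n)]
```

VSKNN differs mainly in weighting each neighbor's items by the recency of the matching events rather than counting them uniformly.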
All variations of the proposed model were implemented using the PyTorch [49] library and trained end-to-end with the Adam optimizer for at most 150 epochs (with early stopping after lowering the learning rate three times when the improvement on validation data falls below a threshold), with a batch size of 32 and 8 positive/negative samples per session. The maximum session length was 15 for RR/5 and RSC15/64, and 8 for SI; this plays an important role for CNNs, where all sessions are padded to exactly the same size. For item embedding, a simple feed-forward network with tanh activation is used. The embedding dimension is set to 400 for all methods. The margin value $m$ for the triplet loss is set to 0.3, and the smoothing parameter for NCAS is set to a small $\epsilon$. For the RNN encoder, GRU cells with 400 dimensions are used. For the TextCNN, convolution filters of several sizes were used.

In Table 2 we present the results obtained during the experiments conducted with the proposed method and compare them against the baselines. Not all combinations of session encoders with loss functions are presented, only the most promising or interesting ones from a future research perspective (e.g.,
NCAS for RSC15/64 and RR/5). The modification introduced by VSKNN to the non-weighted version of the method (i.e., SKNN) seems to be effective for all the datasets, thus making VSKNN a strong baseline indeed. Nevertheless, in some cases (like RR/5), the simpler SKNN method still obtains better results. Dataset specifics and the used metrics play an important role here, as can be seen in Fig. 1 (right): RR/5, in comparison with the other datasets (especially RSC15/64), contains more repeating items. If we place them at the beginning of our recommendations and fill up the rest with the most popular items, we can achieve high MRR@20 values. However, the practical usefulness of such recommendations is questionable.

The low results of MARKOV-1 for all datasets show that a simple association between an item and the next action is not enough to obtain good results. Extracting additional information from entire sequences is needed to improve recommendations, which is the basis on which sequential modeling with the GRU4Rec+ method stands. Still, in most cases, it is less accurate in terms of the used metrics than the simple heuristic of VSKNN. One possible explanation is that the VSKNN model additionally incorporates recency in the scoring function. We can consider that as simply encoded contextual information about when the sequence occurred. This information is not used in the other models. When scoring sequences within short periods of time, this may not make a big difference, but it becomes important as the time difference increases, e.g., as some trends arise and others fade out.

From the overall results, our SML family of methods is the best for two datasets, the proprietary
SI-T and the openly available RSC15/64. For SI-T, the proposed triplet loss function seems to be the right choice, while in the case of RSC15/64, training with NCAS is more stable and gives better overall results. This may be caused by the far bigger inventory size and number of events in this dataset. Moreover, on SI-D and RR/5 our methods position themselves as the second best, with a minimal margin to the kNN-based methods, VSKNN and SKNN, respectively. For SI-D, only the PREC@20 is lower, due to the far better results of SKNN (which, given such good results for both SI datasets, we double-checked for correctness). The Retail Rocket dataset results are consistent with [6], where many new methods struggle to beat SKNN. With SML-MaxPooling-NCAS, we come close to the lead.

Among the investigated encoders, we can observe that simple max-pooling performs well and falls very close to the best score for
the SI-* datasets. Intuitively, GRU- and CNN-based methods should be better at encoding longer sequences of actions, as in RSC15/64 and RR/5 (see Fig. 1 (left)). However, this proved to be true only for the RSC15/64 results, where CNN- and RNN-based methods are among the best ones. For RR/5, simple pooling with the proposed NCAS loss function is the best from the SML method family. Additionally, in practical terms, CNN-based models can be preferred from a GPU utilization perspective, as the architectures and many libraries are optimized for computer vision and image processing.

Similarly to [17, 6], we investigated the distribution of predicted items for the selected approaches. Interestingly, our metric learning based methods usually give a wider spectrum of recommended items. Even a simple count of all unique items being recommended shows that for the SI datasets our methods return almost twice as many unique items as the VSKNN method (666 vs. 1,542 and 1,522 vs. 2,542 for a sample run, out of 2k and 3k items respectively; see Table 1), while for RR/5 and RSC the difference is not as big (16,334 vs. 19,063 and 12,232 vs. 11,216 for a sample run).
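The coverage statistic above can be computed with a short helper (a trivial sketch with a hypothetical name):

```python
def unique_item_coverage(recommendation_lists):
    # Count distinct items appearing across all top-n lists produced by a model.
    seen = set()
    for recs in recommendation_lists:
        seen.update(recs)
    return len(seen)
```

Comparing this count across models, relative to the full inventory size, gives the per-dataset coverage figures quoted above.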
Dataset    Methods (rows sorted by PREC@20, best first; reported metrics: MAP, PREC, REC, HR, MRR)
SI-T       SML-TextCNN-Triplet, SML-RNN-Triplet, VSKNN, SML-MaxPool-Triplet, SML-MaxPool-NCAS, SML-RNN-NCAS, SML-TextCNN-NCAS, SPOP, SML-TagSpace-Triplet, GRU4Rec+, SKNN, POP, MARKOV-1
SI-D       VSKNN, SML-RNN-Triplet, SML-TextCNN-Triplet, SML-MaxPool-NCAS, SML-RNN-NCAS, SML-TextCNN-NCAS, SKNN, SML-MaxPool-Triplet, SML-TagSpace-Triplet, SPOP, GRU4Rec+, MARKOV-1, POP
RSC15/64   SML-RNN-NCAS, SML-MaxPool-NCAS, SML-TextCNN-NCAS, SML-RNN-Triplet, VSKNN, SML-MaxPool-Triplet, SKNN, GRU4Rec+, SML-TextCNN-Triplet, SML-TagSpace-Triplet, MARKOV-1, SPOP, POP
RR/5       SKNN, SML-MaxPool-NCAS, VSKNN, SML-RNN-NCAS, GRU4Rec+, SML-RNN-Triplet, SML-TextCNN-NCAS, SML-MaxPool-Triplet, SPOP, SML-TextCNN-Triplet, SML-TagSpace-Triplet, MARKOV-1, POP
Table 2: Results obtained during the experiments. The baseline SKNN, VSKNN and GRU4Rec+ values for RR/5 and RSC15/64 are taken from the supplementary materials to [6]. The best results for each measure–dataset pair are in boldface, while the second best are underlined; ▽ indicates the sort column. The SML naming convention is SML-SessionEncoder-LossFunction, where RNN and MaxPool denote the encoders described in Section 4.4, and the three loss functions are Contrastive, Triplet, and smoothed NCA (NCAS).

To verify the impact of each component of our proposed solution, we ran a series of experiments on the SI-T dataset for two encoders, RNN and MaxPool, enabling each improvement one by one. The results, with REC@20 and MRR@20, are shown in Table 3.

One of the first sampling methods evaluated with SML was a simple sliding-window technique. For a defined number of events (padded if necessary), we take only the next following items as positive examples, and negative ones are randomly sampled. We quickly switched to the sampling presented in Section 4.5, as we noticed that the windowing technique does not reflect how the system is used in real cases. Specifically, predictions are also required for various sub-sequences from the beginning of a session, regardless of the sliding window size. As the next step, we evaluated the impact of the inner elements of the triplet loss, like normalization (which is very common), margin usage
Comm. Emb.   Sampler   Loss     RNN REC@20 ▽   RNN MRR@20   MaxPool REC@20   MaxPool MRR@20
True         Pos–Neg   N-M      0.7435         0.5973       0.7402           0.5978
True         Pos–Neg   N        0.7377         0.5973       0.7340           0.5932
True         Pos–Neg   N-M-S    0.7371         0.5908       0.7359           0.5888
False        Pos–Neg   N        0.6565         0.5746       0.6508           0.5727
False        Pos–Neg   N-M      0.6341         0.5783       0.6192           0.5767
False        Pos–Neg   N-M-S    0.6192         0.5678       0.6247           0.5796
False        SW        N-M-S    0.0022         0.0006       0.0525           0.0191
Table 3: Ablation results obtained for the RNN and MaxPool session encoders. Column labels in order: (1) True/False: whether a common embedding was used; (2) Sampler: SW (sliding window), Pos–Neg (session positive-negative sampling as described in Section 4.5); (3) Triplet loss with: N (L2 normalization), M (0.3 margin), S (swapping the anchor session with the positive item). Results are sorted by REC@20.

(which for some datasets is set to very small values), and the swapping of the anchor and positive elements. To our surprise, swapping does not always give good results in the session-based recommendation setting.

A crucial role in improving our model was played by the use of common embeddings for both the session encoder $\varphi(s_k)$ and the item encoder $\omega(i_j)$ in the prediction. This lowered the number of parameters to train and positively influenced the overall results. We think that even further improvements could be made to the proposed method by a more global network parameter search, but this was out of the scope of our computational possibilities. Thus, we constrained some of the network's related hyper-parameters (e.g., the GRU hidden state dimension and the following feed-forward network dimension) to be the same.

In this paper, we have presented a novel approach to session-based recommendations that utilizes concepts from the field of metric learning. The proposed method has a clear and modular architecture that combines session and item embeddings with a metric loss function. Each of these elements may be individually tweaked and thus defines a potential direction for further research. We tested our approach against independent results obtained for strong baseline methods using a well-established evaluation procedure and achieved state-of-the-art results. The analysis is also extended by ablation studies, which confirm that the proposed solution does not have unnecessary elements.

Our approach's main advantage is a modular design and extensibility that make it possible to tweak its components to best match the dataset or to incorporate prior knowledge. Moreover, since SML is based on principles originating from metric learning, many improvements from that field can still be transferred and evaluated for session-based recommendations. From a usage perspective, our approach can be attractive in combination with existing pipelines (KNN recommendations) and libraries (optimized CNNs).

We can identify two main weaknesses of our method. Firstly, sampling has a significant impact on the results, both in terms of quality and computational efficiency, so careful GPU usage and memory management are required. Secondly, many improvements that can be taken for granted within computer vision do not necessarily improve the final model when combined with other elements for session-based recommendations, as presented in the ablation study.

Although we achieved promising results with the current method, this work has only touched on the subject of applying metric learning to session-based recommendations, and much more is to be explored. Apart from the already mentioned embeddings, the positive/negative sampling strategy used during the training phase deserves more attention. Based on the good experimental results achieved by some baselines, introducing the missing time context of users' actions into session-based recommendation also seems worth exploring. Further investigation of improvements in the deep metric learning field can result in even better session-based recommendations, and similar synergies can be found, as in the case of NLP.
Acknowledgments and Disclosure of Funding
We acknowledge the support from Sales Intelligence and co-funding by the European Regional Development Fund, project number: POIR.01.01.01-00-0632/18.