Metric Learning for Session-based Recommendations
A PREPRINT
Bartłomiej Twardowski
Computer Vision Center, Universitat Autónoma de Barcelona
Warsaw University of Technology, Institute of Computer Science
[email protected]
Paweł Zawistowski
Warsaw University of Technology,Institute of Computer Science [email protected]
Szymon Zaborowski
Sales Intelligence

January 8, 2021

ABSTRACT
Session-based recommenders, used for making predictions out of users' uninterrupted sequences of actions, are attractive for many applications. Here, we propose using metric learning for this task: a common embedding space for sessions and items is created, and distance measures the dissimilarity between the provided sequence of users' events and the next action. We discuss and compare metric learning approaches to commonly used learning-to-rank methods, with which some synergies exist. We propose a simple architecture for problem analysis and demonstrate that neither extensively big nor deep architectures are necessary to outperform existing methods. Experimental results against strong baselines on four datasets are provided, together with an ablation study.

Keywords: session-based recommendations · deep metric learning · learning to rank

We consider the session-based recommendation problem, which is set up as follows: a user interacts with a given system (e.g., an e-commerce website) and produces a sequence of events (each described by a set of attributes). Such a continuous sequence is called a session; thus we denote $s_k = \{e_{k,1}, e_{k,2}, \ldots, e_{k,t}\}$ as the $k$-th session in our dataset, where $e_{k,j}$ is the $j$-th event in that session. The events are usually interactions with items (e.g., products) within the system's domain. In comparison to other recommendation scenarios, in session-based recommendations information about the user across sessions is not available (in contrast to session-aware recommendations).
Also, the browsing sessions originate from a single site (which is different from task-based recommendations). The sequential nature of session-based recommendations means that the task shares similarities with tasks found in natural language processing (NLP), where sequences of characters, words, sentences, or paragraphs are analyzed. This connection leads to a situation where many methods that are successful in NLP are later applied to the field of recommendations. One such example is connected with recurrent neural networks (RNNs), which have led to a variety of approaches applied to recommender systems [1, 2, 3]. Another is connected with the transformer model [4] applied to model users' behavior [5].

Despite the apparent steady progress connected with neural methods, there are indications that properly applied classical methods may very well beat these approaches [6]. Therefore, in this paper we propose combining the classical KNN algorithm with a neural embedding function, based on an efficient neighborhood selection of top-n recommendations. The method learns embeddings of sessions and items in the same metric space, where a given distance function measures dissimilarity between the user's current session and the next items. For this task, a metric learning loss function and data sampling are used to train the model. During evaluation, the nearest neighbors are found for the embedded session. This makes the method attractive for real-life applications, as existing tools and methods for neighborhood selection can be used.
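To make the inference step concrete, here is a minimal NumPy sketch with hypothetical names: the session is embedded (max-pooling over its items' embeddings is used here as one simple encoder choice), and the nearest items in the shared space become the recommendations. In practice, an approximate nearest-neighbor index would replace the brute-force search.

```python
import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x)

def recommend_top_n(session_item_embs, all_item_embs, n=20):
    # Embed the session by dimension-wise max-pooling over its items' embeddings,
    # then rank all items by cosine distance (1 - dot product on unit vectors).
    session_emb = l2_normalize(session_item_embs.max(axis=0))
    dists = 1.0 - all_item_embs @ session_emb
    return np.argsort(dists)[:n]  # indices of the n nearest items
```

This is the step where off-the-shelf neighborhood-selection tooling can be plugged in, as noted above.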
The main contributions of this paper are as follows:
• we verify selected metric learning tools for session-based recommendations,
• we present a comparison of the metric learning approach and learning to rank, where some potential future directions for recommender systems can be explored based on the latest progress in deep metric learning,
• we introduce a generic model for recommendations, which allows the impact of different architectures of session and item encodings on the final performance to be evaluated, which we do in the provided ablation studies,
• we evaluate our approach using known protocols from previous session-based recommendation works against strong baselines over four datasets; for the sake of reproducibility and future research, the source code is available at https://github.com/btwardow/dml4rec.

Time and sequence models in context-aware recommender systems were used before deep learning methods emerged. Many of these approaches can be applied to session-based recommendation problems with some additional effort to represent time, e.g., modeling it as a categorical contextual variable [7, 8] or as an explicit bias while making predictions [9]. The sequential nature of the problem can also be simplified and handled with other well-known methods, e.g., Markov chains [10], or by applying KNNs combined with calculating the similarities of session item sets [11]. The GRU4Rec method [1] has been an important milestone in applying RNNs to session-based recommendation tasks. The authors focused on sessions solely represented by interactions with items and proposed a few contributions: using GRU cells for session representation, negative exemplar mining within the mini-batch, and a new
TOP1 loss function. In the follow-up work [12], the authors proposed further improvements to the loss functions. Inspired by the successful application of convolutional neural networks (CNNs) to textual data [13], new methods were proposed. One example is the Caser approach [14], which uses a CNN-based architecture with max pooling layers for top-n recommendations for sessions. Another, proposed in [15], utilises dilated 1D convolutions similar to WaveNet [16]. The embedding techniques known from NLP, e.g., skip-gram and CBOW, were also extensively investigated for recommender systems; methods such as item2vec and prod2vec were proposed for embedding-based approaches. However, recently conducted experiments with similar approaches were unsuccessful in obtaining better results than simple neighbourhood methods for session-based recommendations [17].
Metric learning has a long history in information retrieval. Among the early works, the SVM algorithm was used to learn from relative comparisons in [18]. Such an approach directly relates to Mahalanobis distance learning, which was pursued in [19] and [20]. Even though new and more efficient architectures emerge constantly, the choice of loss functions and training methods still plays a significant role in metric learning. In [21], the authors proposed the use of contrastive loss, which minimizes the distance between similar pairs while ensuring the separation of non-similar objects by a given margin. For some applications it was found hard to train, and in [22] the authors proposed an improvement by using an additional data point, the anchor. All three data points form the input to the triplet loss function, where the objective is to keep the negative examples further away from the anchor than the positive ones by a given margin. Recently, more advanced loss functions were proposed: using angular calculation in triplets [23], the signal-to-noise ratio [24], and multi-similarity loss [25]. Still, contrastive and triplet losses have proven to be a strong baseline in many applications when trained correctly [26]. Nevertheless, the high computational complexity of data preparation (i.e., creating point tuples for training) for contrastive and triplet approaches cannot be solved by changing only the loss function. These problems are addressed by different dataset sampling strategies and efficient mining techniques. One notable group here is online approaches, which explore relations within a given mini-batch while training, e.g., hard mining [26], n-pairs [27], the lifted structure method [28], and weighting by distance [29]. Many combinations of sampling and mining techniques, along with the loss functions, can be created, which makes a fair comparison hard [30, 31, 32].
An ordered output of the session-based recommender, in the form of a sorted list for a given input $s_k$, is the ranking $r_k$. In learning-to-rank, as well as in recommender systems, the main difficulty is the direct optimization of the output's quality measures (e.g., recall, mean average precision, or mean reciprocal rank). The task is hard for many (gradient-based) methods due to the non-smoothness of the optimized function [33]. This problem can be resolved either by minimizing a convex upper bound of the loss function, e.g., SVM-MAP [34], or by optimizing a smoothed version of an evaluation measure, e.g., SoftRank [35]. Many approaches exist, which depend on the model of ranking: pointwise (e.g., SLIM [36], MF [37], FM [8]), pairwise (BPR [38], pLPA [39], GRU4Rec [1]), or list-wise (e.g., CLIMF/xCLIMF [40, 41], GAPfm [42], TFMAP [43]). However, not all are applicable to session-based recommendations. Pairwise approaches for ranking top-N items are the most commonly used, along with neural network approaches. In the GRU4Rec method, two pairwise loss functions were used for training, Bayesian Personalized Ranking (BPR) [38] and TOP1:

$$l_{BPR}(s_k, i_p, i_n) = -\ln\big(\sigma(\hat{y}_{s_k,i_p} - \hat{y}_{s_k,i_n})\big) \quad (1)$$

$$l_{TOP1}(s_k, i_p, i_n) = \sigma(\hat{y}_{s_k,i_n} - \hat{y}_{s_k,i_p}) + \sigma(\hat{y}_{s_k,i_n}^2) \quad (2)$$

where $s_k$ denotes the session for which $i_p$ is a positive example of the next item, and $i_n$ is a negative one. $\hat{y}_{s_k,i}$ is the score value predicted by the model for the session $s_k$ and the item $i$. The score value allows items to be compared and the ordered list $r_k$ to be produced, where, e.g., $i_p >_{r_k} i_n$, and $>_{r_k}$ denotes a total order over $I$ [38].

In metric learning, the main goal is to align distance with the dissimilarity of objects.
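The two pairwise losses in Eqs. (1) and (2) can be written directly from the score differences. A small NumPy sketch (the epsilon guard against log(0) is our addition):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bpr_loss(y_pos, y_neg):
    # Eq. (1): push the positive item's score above the negative item's score.
    return -np.log(sigmoid(y_pos - y_neg) + 1e-12)

def top1_loss(y_pos, y_neg):
    # Eq. (2): rank term plus a regularization term pulling negative scores to zero.
    return sigmoid(y_neg - y_pos) + sigmoid(y_neg ** 2)
```

Both losses shrink as the positive score rises above the negative one, which is exactly the pairwise ordering objective.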
In [21], the contrastive loss function for two vectors $x_i, x_j \in \mathbb{R}^d$ is given as:

$$l_{Cont}(x_i, x_j) = y\, d(x_i, x_j) + (1 - y) \max\big(0,\; m - d(x_i, x_j)\big) \quad (3)$$

where $y$ is an indicator variable (1 if both vectors are from the same class, 0 otherwise), $m \in \mathbb{R}_+$ is the margin, and $d(x_i, x_j)$ is a distance function, e.g., Euclidean or cosine. This loss function pulls similar items together ($y = 1$) and pushes dissimilar ones apart. A direct extension, the triplet loss [22], is defined as follows:

$$l_{Triplet}(x_a, x_p, x_n) = \max\big(0,\; d(x_a, x_p) - d(x_a, x_n) + m\big) \quad (4)$$

where $x_p$ and $x_n$ are, respectively, positive and negative items for a given anchor $x_a$, and $m \in \mathbb{R}_+$ is the margin. Both contrastive and triplet losses can be used to optimize the goal of a total ordering of objects [44, 38] as induced by the learned metric. If $d(x_i, x_j) = 0$ does not imply $x_i = x_j$, then $d$ is a pseudometric [18], and a total order cannot be induced. If we assume that two functions $\varphi(s_a) = x_a$ and $\omega(i_k) = x_k$ are given to embed the session and the item into the same $\mathbb{R}^d$ space, where scoring is done by cosine similarity $\hat{y}_{s,i} = 1 - d(\varphi(s), \omega(i))$, then the previously defined ranking losses and the metric loss can be presented as:

$$l_{BPR}(s_k, i_p, i_n) = -\ln(\sigma(d_{kn} - d_{kp})), \quad (5)$$

$$l_{TOP1}(s_k, i_p, i_n) = \sigma(d_{kp} - d_{kn}) + \sigma((1 - d_{kn})^2), \quad (6)$$

$$l_{Triplet}(s_k, i_p, i_n) = \max(0,\; d_{kp} - d_{kn} + m) \quad (7)$$

where $d_{kj} = d(\varphi(s_k), \omega(i_j))$. A direct connection can be seen: minimizing each of the loss functions will try to keep $i_p$ closer to $s_k$ than $i_n$. In all cases, for session-based recommendations, positive items are known, while the negatives are sampled from the rest of the items (e.g., uniformly or by a given heuristic). In both BPR and
TOP1, a sigmoid $\sigma(x)$ function is used to optimize AUC in place of the non-differentiable Heaviside function, as explained in [38]. In TOP1, the authors added a regularization term for negative predictions, which further constrains the embedding space by keeping negatives close to zero. Metric learning losses use a rectifier nonlinearity ($\max(0, x)$) to prevent moving data points that are already in order. When considering the partial derivatives w.r.t. the distances between our anchor session $s_k$ and the positive and negative items, they contribute equally, as discussed in [25]. If more relations are explored in a single calculation (usually inside the same mini-batch), techniques like lifted structures [28] are used. However, the relations are made between known classes of examples. In learning to rank, each instance inside a selected set can be ordered, which can be used, e.g., to estimate the overall ranking, as in Weighted Approximate-Rank Pairwise (WARP) [45]. All these losses have one more important thing in common: they do not take into account the relationship between the positive and negative items themselves (without the anchor). This is a subject of further improvements in metric learning methods like [23, 25]. In our solution, we propose using a simple ranking weighting to address this shortcoming.

We propose a method for session-based recommendations using deep metric learning, where the main input is the sequence of a user's actions (i.e., the session) $s_k = \{e_{k,1}, e_{k,2}, \ldots, e_{k,t}\} \in S$, and items $i \in I$. At a high level, the network's architecture can be described as $\hat{y}_{s_k,i} = d(\varphi(s_k), \omega(i))$, where $\varphi$ and $\omega$ denote the session and item encoders respectively, and $\hat{y}_{s_k,i}$ denotes the score for recommending item $i$ in the context of session $s_k$. We decided on a simple and modular approach in order to investigate the impact of each module on the final outcome, focusing mainly on the session encoder and different metric loss functions.
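For intuition, the triplet loss on cosine distances, Eq. (7), can be sketched as follows (the margin of 0.3 matches the value used in the experiments later in the paper):

```python
import numpy as np

def cosine_distance(u, v):
    # Cosine distance for nonzero vectors; 0 for identical directions, up to 2.
    return 1.0 - np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

def triplet_loss(session_emb, pos_emb, neg_emb, margin=0.3):
    # Eq. (7): hinge on the gap between the positive and negative distances;
    # the loss is zero once the negative is at least `margin` farther away.
    d_kp = cosine_distance(session_emb, pos_emb)
    d_kn = cosine_distance(session_emb, neg_emb)
    return max(0.0, d_kp - d_kn + margin)
```

Note how a well-ordered triplet (positive closer than the negative by more than the margin) contributes nothing to the gradient, which is the rectifier behavior discussed above.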
The only constraint the model places on the used networks is a common dimensionality $\varphi(s_k), \omega(i) \in \mathbb{R}^d$ for learning a shared metric space. The outputs of the networks are normalized, and the cosine distance function $d(\varphi(s_k), \omega(i))$ is used in the final scoring $\hat{y}_{s_k,i}$ calculation.

The overall triplet loss function is calculated over the prepared training dataset. Assuming that session $s_k$ has $L$ positive items, the final triplet loss function for balanced positive-negative sampling is as follows:

$$\mathcal{L} = \frac{1}{|S|} \sum_{s_k \in S} \sum_{j=0}^{L} w_j \max\big(0,\; d(\varphi(s_k), \omega(i_p)) - d(\varphi(s_k), \omega(i_n)) + m\big) \quad (8)$$

where $w_j$ is the weight used for a particular position. In the experiments, we used $p/(1+j)$ for weighting, which is expected to change the magnitude of the calculated gradient based on the ranking position. To incorporate the relation between positive and negative items, we used a swapping technique for the triplet loss, where the anchor is exchanged with the positive and the final distance to a negative point is taken as the minimum $d'_{kn} = \min(d_{kn}, d_{pn})$.

Based on the NCA loss [46, 47], commonly used in deep metric learning, we introduce a version prepared for ranking session-based recommendations as follows:

$$p(i_j \mid s_k) = \frac{\exp(-d(\varphi(s_k), \omega(i_j)))}{\sum_{i_z \in Z} \exp(-d(\varphi(s_k), \omega(i_z)))} \quad (9)$$

$$\mathcal{L}_{NCAS} = \frac{1}{|S|} \sum_{s_k \in S} \mathrm{KLD}\big(p(i \mid s_k)\, \|\, p'(i)\big) \quad (10)$$

where predictions of the true labels inside $N$-sized mini-batches are smoothed with $p'(i) = (1 - \epsilon)p(i) + \epsilon/N$, and $Z$ is a sampled set containing positive and negative examples for each session $s_k$. The main goal of using this loss function was to compare the triplet loss to other functions applicable in our setting, in order to gain more insight into its applicability and results. We use several neural network architectures for the session encoder module.
Each of these networks takes as input a sequence of session events, which are clicked items in all used datasets, and encodes it to a vector of embedding size $d$. The network architectures used as session encoders are as follows:

• Pooling - this architecture embeds the sequence of clicked items into a vector of size $d$ by pooling the maximum or average value in each dimension. It is inspired by how pre-trained embeddings (e.g., word2vec) are used in NLP downstream tasks. However, all sequential relations are lost.
• CNN-based approaches, including TextCNN [13], TagSpace [48], and Caser [14].
• RNN-based approaches - these use one of the chosen recurrent networks (GRU, LSTM, RNN) to encode the sequence, followed by multiple fully connected layers to generate recommendation scores for individual items.

Training data is prepared from all available users' sessions $S$. We want to predict the user's next action for a given session $s_k$; thus, training data preparation tries to enforce this for the model. Each session is split randomly: the first part is used as the network input $s_k$, and the following actions with items are used as positive examples for that session, $i_{p,1}, \ldots, i_{p,l}$. For each of the $l$ positives, the same number of negatives is sampled randomly. We investigated a few different strategies for the case when a session after the random split does not have enough positive examples. One successful approach we used is to prepare more positives before training using the KNN method. Sampling is done at the beginning of each training epoch. However, the improved MRR score is counterbalanced by lower item coverage. In other works, the negative sampling is done randomly from all non-positive items, e.g., [8].
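The split-based preparation described above can be sketched as follows (a simplified illustration with hypothetical helper names; the KNN-based positive augmentation mentioned above is omitted here):

```python
import random

def prepare_training_example(session, all_items, seed=0):
    # Randomly split the session: the prefix becomes the model input, the items
    # that follow are positives; one random negative is drawn per positive from
    # items that do not occur in the session (balanced sampling).
    rng = random.Random(seed)
    split = rng.randint(1, len(session) - 1)
    prefix, positives = session[:split], session[split:]
    candidates = [i for i in all_items if i not in set(session)]
    negatives = rng.sample(candidates, len(positives))
    return prefix, positives, negatives
```

Re-running this at the start of every epoch, as described above, gives each session a fresh split point and fresh negatives.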
From the optimization perspective, [1] took a different approach and sampled negative examples from the same mini-batch given to the network. This relates to the online sample mining used in deep metric learning techniques, but here without enforcing a margin of error as in hard negative mining [26].
Dataset    Source                        Items        Sessions     Events
RR/5       Retail Rocket                 32K (117K)   64K (380K)   242K (606K)
RSC15/64   RecSys Challenge 2015
SI-T       Proprietary e-commerce data
SI-D       Proprietary e-commerce data
Table 1: Experimental dataset stats — (before) and after preprocessing.
Figure 1: Session length distribution (left) and repeating items (right) for each dataset.
To conduct our experiments, we followed the procedure used by [6]: five splits were used for RR, and 1/64 of the RSC15 data. For each dataset, we split the events into individual user sessions and removed those that contained only a single event. Furthermore, we included only items that occurred at least five times in the data. A train-test split was prepared by taking the last portion of the sessions. We evaluated our models using common information retrieval and ranking metrics: mean average precision, mean reciprocal rank, recall, precision, and hit ratio. All metrics were computed on a list of the top 20 recommendations. Following [6] and [1], in the case of MRR@20 and HR@20 only the next item was used as the ground truth. This no-look-ahead evaluation can be considered more adequate: after each user action, the session state is updated and predictions for the user's next step are given.
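The next-item variants of HR@20 and MRR@20 can be sketched as follows (hypothetical helper; `recommended` is the ranked list, `ground_truth` the single next item):

```python
def hr_mrr_at_k(recommended, ground_truth, k=20):
    # No-look-ahead evaluation: a single ground-truth next item is scored
    # against the top-k recommendation list.
    top_k = list(recommended)[:k]
    if ground_truth in top_k:
        rank = top_k.index(ground_truth) + 1  # 1-based rank
        return 1.0, 1.0 / rank                # (hit, reciprocal rank)
    return 0.0, 0.0
```

Averaging the two returned values over all evaluation steps yields HR@k and MRR@k respectively.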
To conduct the experiments, we used four datasets from the e-commerce domain, summarized in Table 1. Two of these (RR/5 and RSC15/64) are standard benchmarks for session-based recommender systems, while the remaining ones are smaller, real-world proprietary datasets gathered in early 2020. The difference between SI-T and SI-D is the category of products for which the data were collected. In all datasets, users' events are represented only by interactions with products (i.e., view, click), thus $e_{k,l} \sim I$.

Fig. 1 (left) presents histograms of session lengths for the preprocessed datasets, showing that short sessions dominate in all of them. This may be challenging for methods that focus on the sequential nature of the users' data. Furthermore, when analyzing the percentage of recurring items within sessions, presented in Fig. 1 (right), it may be noticed that sessions frequently contain multiple interactions with the same products. The data suggests that users revisit already seen items quite often. However, this also poses an interesting question from the perspective of recommender systems: should such a system suggest items that a user has already seen in the given session, or only new ones? The answer depends on the specific use case and whether the system should provide a more explorative or exploitative user experience.

We compared our Session-based Metric Learning (SML) method against six baseline algorithms. Starting from the simplest ones,
POP denotes a popularity-based algorithm, which simply recommends the top-n most popular items. SPOP recommends items already seen in the session, ordered by the number of occurrences, and fills the rest with popular ones. This recommender performs well when predictions are expected to be repetitions. The KNN algorithm was the basis of the next two baseline methods: SKNN and VSKNN. The SKNN approach, for a given session, recommends the top-n most frequent items among the $K$ most similar sessions from the training data, for which a cosine distance is used. The VSKNN [6] approach works similarly; however, it puts more weight on the more recent events in a given session. The last two methods are a Markov first-order recommender, reported as MARKOV-1, and GRU4Rec+ [12].
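As a reference point, the SKNN baseline described above can be sketched as follows (a simplified version using binary item-set representations and cosine similarity; the actual baseline results follow the implementations of [6]):

```python
import numpy as np
from collections import Counter

def sknn_recommend(session, train_sessions, k=2, top_n=20):
    # Cosine similarity between binary item-set representations of sessions;
    # assumes non-empty sessions.
    s = set(session)
    def cos(a, b):
        return len(a & b) / (np.sqrt(len(a)) * np.sqrt(len(b)))
    # Score items by frequency among the K most similar training sessions.
    neighbors = sorted(train_sessions, key=lambda t: cos(s, set(t)), reverse=True)[:k]
    counts = Counter(item for t in neighbors for item in t)
    return [item for item, _ in counts.most_common(top_n)]
```

VSKNN differs mainly in weighting each neighbor's items by the recency of the matching events rather than counting them uniformly.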
All variations of the proposed model were implemented using the PyTorch [49] library and trained end-to-end with the Adam optimizer for at most 150 epochs (with early stopping after lowering the learning rate three times when the improvement on validation data falls below a threshold), with a batch size of 32 and 8 positive/negative samples per session. The maximum session length was 15 for RR/5 and RSC15/64, and 8 for SI; this plays an important role for CNNs, where all sessions are padded to exactly the same size. For item embedding, a simple feed-forward network with tanh activation is used. The embedding dimension is set to 400 for all methods. The margin value $m$ for the triplet loss is set to 0.3, and the smoothing parameter for NCAS is set to a small $\epsilon$. For the RNN encoder, GRU cells with 400 dimensions are used. For the TextCNN, convolution filters of several sizes were used.

In Table 2 we present the results obtained during the experiments conducted with the proposed method and compare them against the baselines. Not all combinations of session encoders with loss functions are presented, only the most promising or interesting ones from a future research perspective (e.g.,
NCAS for RSC15/64 and RR/5). The modification introduced by VSKNN to the non-weighted version of the method (i.e., SKNN) seems to be effective for all the datasets, thus making VSKNN a strong baseline indeed. Nevertheless, in some cases (like RR/5), the simpler SKNN method still obtains better results. Dataset specifics and the used metrics play an important role here, as can be seen in Fig. 1 (right): RR/5, in comparison with the other datasets (especially RSC15/64), contains more repeating items. If we place them at the beginning of our recommendations and fill up the rest with the most popular items, we can achieve high MRR@20 values. However, the practical usefulness of such recommendations is questionable.

The low results of MARKOV-1 for all datasets show that a simple association between an item and the next action is not enough to obtain good results. Extracting additional information from entire sequences is needed to improve recommendations, which is the basis on which sequential modeling with the GRU4Rec+ method stands. Still, in most cases, it is less accurate in terms of the used metrics than the simple heuristic of VSKNN. One possible explanation is that the VSKNN model additionally incorporates recency in the scoring function. We can consider that as simply encoded contextual information about when the sequence occurred. This information is not used in the other models. When scoring sequences within short periods of time, this may not make a big difference, but it becomes important as the time difference increases, e.g., as some trends arise and others fade out.

From the overall results, our SML family of methods is the best for two datasets, the proprietary
SI-T and the openly available RSC15/64. For SI-T, the proposed triplet loss function seems to be the right choice, while in the case of RSC15/64, training with NCAS is more stable and gives better overall results. This may be caused by the far bigger inventory size and number of events in this dataset. Moreover, on SI-D and RR/5 our methods position themselves as the second best, with a minimal margin to the kNN-based methods, VSKNN and SKNN, respectively. For SI-D, only the PREC@20 is lower, due to the far better results of SKNN (which, given such good results for both SI datasets, we double-checked for correctness). The Retail Rocket dataset results are consistent with [6], where many new methods struggle to beat SKNN. With SML-MaxPooling-NCAS, we come close to the lead.

Among the investigated encoders, we can observe that simple max-pooling performs well and falls very close to the best score for
the SI-* datasets. Intuitively, GRU- and CNN-based methods should be better at encoding longer sequences of actions, as in RSC15/64 and RR/5 (see Fig. 1 (left)). However, this proved to be true only for the RSC15/64 results, where CNN- and RNN-based methods are among the best ones. For RR/5, simple pooling with the proposed NCAS loss function is the best from the SML method family. Additionally, in practical terms, CNN-based models can be preferred from a GPU utilization perspective, as the architectures and many libraries are optimized for computer vision and image processing.

Similarly to [17, 6], we investigated the distribution of predicted items for the selected approaches. Interestingly, our metric learning based methods usually give a wider spectrum of recommended items. Even a simple count of all unique items being recommended shows that for the SI datasets our methods return almost twice as many unique items as the VSKNN method (666 vs. 1,542 and 1,522 vs. 2,542 for a sample run, out of 2k and 3k items respectively; see Table 1), while for RR/5 and RSC the difference is not as big (16,334 vs. 19,063 and 12,232 vs. 11,216 for a sample run).
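The coverage statistic above can be computed with a short helper (a trivial sketch with a hypothetical name):

```python
def unique_item_coverage(recommendation_lists):
    # Count distinct items appearing across all top-n lists produced by a model.
    seen = set()
    for recs in recommendation_lists:
        seen.update(recs)
    return len(seen)
```

Comparing this count across models, relative to the full inventory size, gives the per-dataset coverage figures quoted above.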
Dataset    Methods (rows sorted by PREC@20, best first; reported metrics: MAP, PREC, REC, HR, MRR)
SI-T       SML-TextCNN-Triplet, SML-RNN-Triplet, VSKNN, SML-MaxPool-Triplet, SML-MaxPool-NCAS, SML-RNN-NCAS, SML-TextCNN-NCAS, SPOP, SML-TagSpace-Triplet, GRU4Rec+, SKNN, POP, MARKOV-1
SI-D       VSKNN, SML-RNN-Triplet, SML-TextCNN-Triplet, SML-MaxPool-NCAS, SML-RNN-NCAS, SML-TextCNN-NCAS, SKNN, SML-MaxPool-Triplet, SML-TagSpace-Triplet, SPOP, GRU4Rec+, MARKOV-1, POP
RSC15/64   SML-RNN-NCAS, SML-MaxPool-NCAS, SML-TextCNN-NCAS, SML-RNN-Triplet, VSKNN, SML-MaxPool-Triplet, SKNN, GRU4Rec+, SML-TextCNN-Triplet, SML-TagSpace-Triplet, MARKOV-1, SPOP, POP
RR/5       SKNN, SML-MaxPool-NCAS, VSKNN, SML-RNN-NCAS, GRU4Rec+, SML-RNN-Triplet, SML-TextCNN-NCAS, SML-MaxPool-Triplet, SPOP, SML-TextCNN-Triplet, SML-TagSpace-Triplet, MARKOV-1, POP
Table 2: Results obtained during the experiments. The baseline SKNN, VSKNN and GRU4Rec+ values for RR/5 and RSC15/64 are taken from the supplementary materials to [6]. The best results for each measure–dataset pair are in boldface, while the second best are underlined; ▽ indicates the sort column. The SML naming convention is SML-SessionEncoder-LossFunction, where RNN and MaxPool denote the encoders described in Section 4.4, and the three loss functions are Contrastive, Triplet, and smoothed NCA (NCAS).

To verify the impact of each component of our proposed solution, we ran a series of experiments on the SI-T dataset for two encoders, RNN and MaxPool, enabling each improvement one by one. The results, with REC@20 and MRR@20, are shown in Table 3.

One of the first sampling methods evaluated with SML was a simple sliding-window technique. For a defined number of events (padded if necessary), we take only the next following items as positive examples, and negative ones are randomly sampled. We quickly switched to the sampling presented in Section 4.5, as we noticed that the windowing technique does not reflect how the system is used in real cases. Specifically, predictions are also required for various sub-sequences from the beginning of a session, regardless of the sliding window size. As the next step, we evaluated the impact of the inner elements of the triplet loss, like normalization (which is very common), margin usage
Comm. Emb.   Sampler   Loss     RNN REC@20 ▽   RNN MRR@20   MaxPool REC@20   MaxPool MRR@20
True         Pos–Neg   N-M      0.7435         0.5973       0.7402           0.5978
True         Pos–Neg   N        0.7377         0.5973       0.7340           0.5932
True         Pos–Neg   N-M-S    0.7371         0.5908       0.7359           0.5888
False        Pos–Neg   N        0.6565         0.5746       0.6508           0.5727
False        Pos–Neg   N-M      0.6341         0.5783       0.6192           0.5767
False        Pos–Neg   N-M-S    0.6192         0.5678       0.6247           0.5796
False        SW        N-M-S    0.0022         0.0006       0.0525           0.0191
Table 3: Ablation results obtained for the RNN and MaxPool session encoders. Column labels in order: (1) True/False: whether a common embedding was used; (2) Sampler: SW (sliding window), Pos–Neg (session positive-negative sampling as described in Section 4.5); (3) Triplet loss with: N (L2 normalization), M (0.3 margin), S (swapping the anchor session with the positive item). Results are sorted by REC@20.

(which for some datasets is set to very small values), and the swapping of the anchor and positive elements. To our surprise, swapping does not always give good results in the session-based recommendation setting.

A crucial role in improving our model was played by the use of common embeddings for both the session encoder $\varphi(s_k)$ and the item encoder $\omega(i_j)$ in the prediction. This lowered the number of parameters to train and positively influenced the overall results. We think that even further improvements could be made to the proposed method by a more global network parameter search, but this was out of the scope of our computational possibilities. Thus, we constrained some of the network's related hyper-parameters (e.g., the GRU hidden state dimension and the following feed-forward network dimension) to be the same.

In this paper, we have presented a novel approach to session-based recommendations that utilizes concepts from the field of metric learning. The proposed method has a clear and modular architecture that combines session and item embeddings with a metric loss function. Each of these elements may be individually tweaked and thus defines a potential direction for further research. We tested our approach against independent results obtained for strong baseline methods using a well-established evaluation procedure and achieved state-of-the-art results. The analysis is also extended by ablation studies, which confirm that the proposed solution does not have unnecessary elements.

Our approach's main advantage is a modular design and extensibility that make it possible to tweak its components to best match the dataset or to incorporate prior knowledge. Moreover, since SML is based on principles originating from metric learning, many improvements from that field can still be transferred and evaluated for session-based recommendations. From a usage perspective, our approach can be attractive in combination with existing pipelines (KNN recommendations) and libraries (optimized CNNs).

We can identify two main weaknesses of our method. Firstly, sampling has a significant impact on the results, both in terms of quality and computational efficiency, so careful GPU usage and memory management are required. Secondly, many improvements that can be taken for granted within computer vision do not necessarily improve the final model when combined with other elements for session-based recommendations, as presented in the ablation study.

Although we achieved promising results with the current method, this work has only touched on the subject of applying metric learning to session-based recommendations, and much more is to be explored. Apart from the already mentioned embeddings, the positive/negative sampling strategy used during the training phase deserves more attention. Based on the good experimental results achieved by some baselines, introducing the missing time context of users' actions into session-based recommendation also seems worth exploring. Further investigation of improvements in the deep metric learning field can result in even better session-based recommendations, and similar synergies can be found, as in the case of NLP.
Acknowledgments and Disclosure of Funding
We acknowledge the support from Sales Intelligence and co-funding by the European Regional Development Fund, project number: POIR.01.01.01-00-0632/18.