Attentive Neural Architecture Incorporating Song Features For Music Recommendation
Noveen Sachdeva
International Institute of Information Technology, Hyderabad
[email protected]

Kartik Gupta∗
International Institute of Information Technology, Hyderabad
[email protected]

Vikram Pudi
International Institute of Information Technology, Hyderabad
[email protected]
ABSTRACT
Recommender systems are an integral part of music sharing platforms. Often the aim of these systems is to increase the time a user spends on the platform, and hence they carry high commercial value. Systems which aim at increasing the average time a user spends on the platform need to recommend, at each point in time, songs which the user might want to listen to next. This is different from recommendation systems which try to predict an item that might interest the user at some point in the user's lifetime, but not necessarily in the very near future. Predicting the next song the user might like requires some modeling of the user's interests at the given point in time. Attentive neural networks have exploited the sequence in which items were selected by the user to model the user's implicit short-term interests for the task of next item prediction. However, we believe that features of the songs occurring in the sequence can also convey important information about the short-term user interest which the items alone cannot. In this direction, we propose a novel attentive neural architecture which, in addition to the sequence of items selected by the user, uses the features of these items to better learn the user's short-term preferences and recommend the next song to the user.
CCS CONCEPTS
• Information systems → Recommender systems; Content ranking

KEYWORDS
Recommender Systems; Short Term Interest
ACM Reference Format:
Noveen Sachdeva, Kartik Gupta, and Vikram Pudi. 2018. Attentive Neural Architecture Incorporating Song Features For Music Recommendation. In
Proceedings of Twelfth ACM Conference on Recommender Systems (RecSys '18).
ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3240323.3240397

∗Noveen Sachdeva and Kartik Gupta had equal contribution towards the research work demonstrated in the paper.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected].

RecSys '18, October 2–7, 2018, Vancouver, BC, Canada
© 2018 Association for Computing Machinery.
ACM ISBN 978-1-4503-5901-6/18/10...$15.00
https://doi.org/10.1145/3240323.3240397
INTRODUCTION

There has recently been an intense focus on recommendation systems by the Information Retrieval community because of their commercial value and their ability to provide a better experience to a user interacting with a large database of items. Often there are a very large number of items in the database that might be of interest to the user, to the extent that the user might not even know they exist; hence, they need to be presented to the user as recommendations. For example, on websites which sell many kinds of products and have a huge catalog, users might prefer not having to browse for the items they might like and instead being recommended them by the system, saving the user's time and effort and creating a pleasant experience.

The content of the items chosen by the user is often an indication of the items that might be of interest to the user. In the case of music, this interest is not always constant and might change with time. In a recent work, Gupta [8] tries to model the short-term preferences of the user for music recommendation. He uses Last.fm [11] tags, instead of content derived from the audio, to find the song features important to the user. Last.fm tags look promising for describing the contents of a song, and they also provide much information about the song which could be very hard to derive from either the audio or the metadata of the song. We align with the claim that Last.fm tags could very well be used to model the song features which might be of interest to the user. However, the similarity function used by Gupta could be learned better and thereby provide better performance. Gupta also claims that it is the group of items occurring together that matters for recommendation, not the exact sequence in which they occur.

Towards this claim made by Gupta, and towards finding a better similarity function, we apply attentive neural networks to the problem of next item prediction.
Attentive neural networks give a different weight to each item in the sequence, and these weights are not tied to the order of the items: the third-last item selected by the user could receive more weight than the last item selected, so the choice of attentive neural networks takes the claim into account. We also introduce a content attention component which deals with the tags of the items, assuming these tags can indeed model the short-term interests of the user. This component takes as input the tags of the items selected by the user in the recent past.
RELATED WORK

Recommender systems are a well-researched topic and a wide variety of systems have been developed; it is important that we cover some of them here to provide context to the reader.

Collaborative filtering exploits the user-item interactions to find similar users based on the number of common items selected. A variant is item-level collaborative filtering [2], wherein two items selected by the same user are considered similar. There have been improvements to collaborative filtering, such as matrix factorization [19] of the user-item matrix into a user-feature matrix and an item-feature matrix. Further, ranking algorithms such as Bayesian personalized ranking [15] provide better, personalized recommendations to users.

Content-based systems recommend items based on the similarity of their content to the items already selected by the user [1, 3]. If the content of a song is similar to the ones the user likes, then that song is more likely to be recommended to the user. For example, there are systems which recommend songs based on the melody of the song [7]. Another example, which also assumes that tags can be sufficient to model the features of the songs that might matter to the user, is by Liang [5], who generates a latent vector for each song based on its semantic tags and then applies collaborative filtering to provide recommendations to the users.
Recommendation can also be modeled as a sequence prediction problem, and the first attempt at this was by Brafman [12]. The initial attempts were based on simple models such as Markov chains, and these have been improved further; one such improvement is having a personal Markov chain for each user [16]. With the popularity of recurrent neural networks, they have been applied [20] to the problem of next item prediction and have performed much better than the other systems. With the success of attentive neural networks in fields such as language and speech processing, they have been applied to recommender systems as well [10]. Our model applies attention to the sequence of items as well as to the content of those items: two context vectors are computed independently, one giving a context based solely on the items and the other a context based only on the tags of the items.
Hybrid systems combine two or more techniques in order to provide better recommendations. Yoshii [18] proposed a system wherein the recommendations are based on the ratings as well as the content, modeled using the polyphonic timbres of the songs. Hariri [9] applied topic modeling, representing the sequence of songs heard by the user as a sequence of topics, and then tries to predict the next topic and the next song within that topic; the transitions between topics are learned from a collection of playlists. Gupta [8] proposes a hybrid model which takes into account the different songs played together and the tags of the songs; the approach is able to tell, at any given point in time, the features of the songs the user is interested in. Shobu [14] builds an interesting system which bases its recommendations on the transition of acoustic features over songs, trying to generate a sequence of songs over which the transition of acoustic features is smooth.
PROPOSED APPROACH

We present an attentive neural architecture to tackle the problem of next item prediction, which is able to include the tags of the items and models the short-term user interests based on the features of the items as well as the items themselves. We now present the formal problem statement that we tackle in this paper.
Predicting Next Song
Given the set of songs heard by the user in sequence $S_s = \{s_1, s_2, ..., s_{i-1}\}$ and the tag set for each song, $T_i = \{t_{1i}, t_{2i}, ..., t_{ji}\}$, predict $s_i$.

The architecture we propose is shown in Figure 1. The output of the model is the probability of each item occurring next, given the items that occurred in the user history, $P(s_i \mid s_{i-1}, s_{i-2}, ..., s_{i-m})$. The first component receives as input the one-hot encodings of the songs which occurred before the song to be predicted. The second component receives the one-hot encodings of all the tags of the items occurring before the song to be predicted. In the first component, the song-embedding layer maps the one-hot representations of the songs to a vector space, and these vectors are fed to a Bi-GRU. Similarly, the tags for each song are converted to their distributed representations using another embedding layer; for each song, the average of the distributed representations of all its tags is fed to a Bi-GRU in the second component. For both components, the hidden states are given as input to an attention layer, where the attention score (weight) for each hidden state is computed. The output of the attention layer is the context vector, which is the weighted sum (given by the attention layer) of the hidden states of the Bi-GRU. The context vectors coming from both components are concatenated and fed to a smaller-dimension non-linear dense layer using ReLU as the activation function. The output of this dense layer is then fed to another dense layer followed by a softmax operation, used to calculate the probabilities over all songs of being the next song. Below we present the equations for a better understanding of the model. Let $V = \{v_1, v_2, ..., v_{|V|}\}$ be the set of all songs.

$$s'_i = E \cdot s_i \qquad (1)$$

where $s_i$ is the one-hot representation of the song, $E \in \mathbb{R}^{d \times |I|}$ is the embedding layer, $d$ is the length of the embedded song vector and $I$ is the set of all songs.
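As a minimal illustration of Eq. (1): multiplying the embedding matrix by a one-hot vector simply selects the column of the matrix indexed by the song, which is why implementations use an index lookup rather than a full matrix product. The sketch below uses plain Python with a hypothetical toy embedding matrix:

```python
# Embedding lookup sketch: E * s_i with one-hot s_i reduces to
# selecting column `song_index` of the d x |I| matrix E.
def embed(E, song_index):
    """Return the embedding (column `song_index`) of E, stored as rows."""
    return [row[song_index] for row in E]

# Toy matrix with d = 2 dimensions and |I| = 3 songs (values are made up).
E = [[0.1, 0.5, -0.2],
     [0.3, -0.4, 0.7]]

assert embed(E, 1) == [0.5, -0.4]
```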
$$t'_{ji} = E' \cdot t_{ji} \qquad (2)$$

where $t_{ji}$ is the one-hot representation of the $j$th tag of the $i$th song, $E' \in \mathbb{R}^{d' \times |T|}$ is the embedding layer, $d'$ is the length of the embedded tag vector and $T$ is the set of all tags.

$$t'_i = \frac{1}{n_i} \sum_{j=1}^{n_i} t'_{ji} \qquad (3)$$

where $n_i$ is the number of tags associated with the $i$th song, and $t'_i$ is the average of the embedding vectors of all the tags associated with the $i$th song.

The hidden states of both Bi-GRUs, $H_i$ and $G_i$, which are fed to the attention layer, are a concatenation of the two individual unidirectional hidden states: $\overrightarrow{h_i}, \overleftarrow{h_i}$ and $\overrightarrow{g_i}, \overleftarrow{g_i}$ respectively.

Figure 1: Attentive Neural Network Architecture for Next Song Prediction
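The per-song tag averaging of Eq. (3) can be sketched as follows; the tag vectors below are hypothetical and stand in for the embedded tags $t'_{ji}$:

```python
def average_tag_embedding(tag_vectors):
    """Elementwise average of a song's n_i embedded tag vectors (Eq. 3)."""
    n = len(tag_vectors)
    dim = len(tag_vectors[0])
    return [sum(v[k] for v in tag_vectors) / n for k in range(dim)]

# Three hypothetical d' = 2 tag embeddings for one song.
tags = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
assert average_tag_embedding(tags) == [1.0, 1.0]
```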
Both attention layers output a context vector which is a weighted sum of all the hidden states. $C_s$ is the context vector computed from the song component of the model and $C_t$ is the context vector computed from the tag component.

$$C_s = \sum_{j=i-m}^{i-1} \alpha_j H_j \qquad (4)$$

$$C_t = \sum_{j=i-m}^{i-1} \beta_j G_j \qquad (5)$$

Both context vectors, $C_s$ and $C_t$, are then concatenated, resulting in a final context vector $C$, which is fed to a dense layer using the standard equations:

$$C' = \mathrm{ReLU}(W_1 C + b_1) \qquad (6)$$

$C'$ is a representation of $C$ in a smaller-dimension vector space, which significantly reduces the training time because the following dense layer has a huge dimension (the number of songs). The final output is a dense layer of the size of the total number of songs, followed by a softmax function which gives the probability of occurrence of each song given the user's history.

$$O = W_2 C' + b_2 \qquad (7)$$

$$P(v_l = s_i \mid s_{i-1}, s_{i-2}, ..., s_{i-m}) = \frac{e^{O_l}}{\sum_{p=1}^{|V|} e^{O_p}} \qquad (8)$$
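The weighted sum of Eqs. (4)-(5) can be sketched in plain Python. For illustration the attention scores are supplied directly and softmax-normalised, whereas in the model they are produced by the learned attention layer; the hidden states below are hypothetical:

```python
import math

def context_vector(hidden_states, scores):
    """Softmax-normalise raw scores into attention weights, then return
    the weighted sum of the hidden states (Eqs. 4-5)."""
    exps = [math.exp(s) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]  # attention weights alpha_j / beta_j
    dim = len(hidden_states[0])
    return [sum(w * h[k] for w, h in zip(weights, hidden_states))
            for k in range(dim)]

# Two hypothetical Bi-GRU hidden states with equal scores:
# equal weights reduce the context vector to a plain average.
H = [[1.0, 0.0], [0.0, 1.0]]
assert context_vector(H, [0.0, 0.0]) == [0.5, 0.5]
```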
Description                    Value
Total Logs                     3553321
Total Users                    759
Total Sessions                 110410
Total Unique Songs             386046
Total Unique Tags              487844
Average Songs Per Session      32.18
Average Logs Per User          4681.58

Table 1: Dataset Statistics

Model     k=10    k=20    k=30    k=40    k=50
POP       0.85    0.97    1.24    1.69    2.14
BPR-MF    7.34    8.13    8.56    8.98    9.27
SSCF      13.69   17.12   19.66   21.30   22.34
RNN       14.42   16.26   16.74   17.09   17.38
SBRS      19.15   26.14   28.83   30.35   31.40
SABR      26.36   28.61   29.97   31.72   32.47
STABR

Table 2: HitRatio@k results
Negative log-likelihood was used as the loss function, and the optimization problem becomes:

$$\underset{X, Y, W_1, W_2, b_1, b_2}{\arg\min} \; -\sum_{s''} \sum_{t} \log P(v_l = s_i \mid s_{i-1}, s_{i-2}, ..., s_{i-m}) \qquad (9)$$

where $s''$ is a user session in the dataset and $v_l$ is the actual song which occurs after the $m$ given songs. $X$ and $Y$ are the matrices consisting of the song and tag embeddings respectively. We iterate over all the sessions in the dataset and all time steps in those sessions.

EXPERIMENTS

The dataset was a subset taken from the Last.fm dataset [11]. Each log in the dataset consists of a user id, song name, artist name and time stamp. We performed experiments on a subset consisting of the 6-month histories of all the users, and the tags for each song were retrieved using the Last.fm public API. The user histories were divided into sessions as done by Gupta [8]. The first 70 percent of the sessions for each user (in order of occurrence) were put in the training set and the last 30 percent in the test set. Sessions having fewer than 5 songs were discarded.
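The negative log-likelihood objective of Eq. (9) can be computed from the softmax outputs of Eq. (8) as below; the probability values in the example are hypothetical:

```python
import math

def nll(probabilities, target_indices):
    """Negative log-likelihood over time steps (Eq. 9).
    `probabilities[t]` is the softmax distribution over all songs at
    step t; `target_indices[t]` is the index of the true next song."""
    return -sum(math.log(p[t]) for p, t in zip(probabilities, target_indices))

# Two time steps over a toy catalogue of three songs (made-up values).
probs = [[0.7, 0.2, 0.1],
         [0.1, 0.8, 0.1]]
loss = nll(probs, [0, 1])  # true next songs: song 0, then song 1
assert abs(loss - (-(math.log(0.7) + math.log(0.8)))) < 1e-12
```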
The architecture is tested against the following baselines:

(1) POP: The most popular items in the training set are recommended to the users.

(2) BPR-MF: A matrix factorization based model which ranks items differently for each user [15]. The implementation by MyMediaLite was used with default parameters, except for the number of features, which was kept at 100 for best results. We report the mean over 5 runs for this model.

(3) Session Based Collaborative Filtering (SSCF): Instead of making a user-item matrix, this system makes a session-item matrix and recommends items by finding sessions in the database similar to the active session, based on the songs which have already occurred in the current session. The similar sessions were found based on the last 5 songs heard by the user, and the results are reported based on the 100 nearest sessions.

(4) RNN: In this method, the sequence of items occurring together is fed to a recurrent neural network which tries to predict the next item at each time step. All sequences in the train set are used to learn the model; to get the next recommendation, all the songs heard by the user until that point are fed to the network. We used the implementation provided by the authors of [13], based on mini-batch stochastic gradient descent, and we kept the batch size at 20, using the categorical cross-entropy loss function with 100 hidden units for the RNN and a learning rate of 0.1.

(5) Subsession Based Recommender System (SBRS): This method was proposed by Gupta [8]. Short-term user preferences are found using the tags of the songs the user heard; the user history is divided into small windows of constant preference, and songs are found based on the window in the training set similar to the active window.

We use the mini-batch Stochastic Gradient Descent (SGD) algorithm coupled with Adagrad [6] and a learning rate of 0.05 to train each model. A batch size of 32 was used; the embeddings for tags were kept at length 25 and those of songs at 50.
The length of the middle layer, $C'$, was kept at 50, and the length of the output, $O$, was equal to the number of songs in our dataset, 386046. Dropout regularization with a 0.1 discard probability was used for both the middle and the output layers. We trained the model on a single GTX 1080Ti GPU, and the proposed model was implemented using PyTorch [17].

For testing the models, we adopt the same methodology as followed by Gupta [8]. We iterate through the test histories of the users, predicting the next song in the history while giving the songs until that point in time as input to the system. We report HitRatio@k [4], where k is the number of songs in the predicted set. We tested two systems based on attentive neural networks: one with only the component which takes the songs into account and not the tags, referred to as SABR (Song Attention Based Recommendation), and a second with both components, referred to as STABR (Song and Tag Attention Based Recommendation).

The results are shown in Table 2. Attentive neural networks perform significantly better than all other baseline models, and even among the attentive neural networks, the one with the tag component gives a large gain over the one without it. This shows that the tags are indeed powerful in modeling the short-term user preference, and probably the neural network learns a better similarity function than the one proposed by Gupta, hence the gain.
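HitRatio@k measures the fraction of test predictions for which the true next song appears in the top-k recommended set. A sketch of this evaluation, with hypothetical model rankings and ground-truth song ids:

```python
def hit_ratio_at_k(ranked_lists, true_items, k):
    """Fraction of test steps whose true next item is in the top-k list."""
    hits = sum(1 for ranked, true in zip(ranked_lists, true_items)
               if true in ranked[:k])
    return hits / len(true_items)

# Three test steps; each ranked list is a model's ordering of song ids.
ranked = [[3, 1, 2], [2, 0, 1], [1, 3, 0]]
truth = [1, 0, 2]
assert hit_ratio_at_k(ranked, truth, 2) == 2 / 3  # hits at steps 1 and 2 only
```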
REFERENCES
[1] A. van den Oord, S. Dieleman, and B. Schrauwen. Deep content-based music recommendation. In NIPS, pp. 2643–2651, 2013.