Support the Underground: Characteristics of Beyond-Mainstream Music Listeners
Dominik Kowald, Peter Muellner, Eva Zangerle, Christine Bauer, Markus Schedl, Elisabeth Lex
RESEARCH
Support the Underground: Characteristics of Beyond-Mainstream Music Listeners
Dominik Kowald, Peter Muellner, Eva Zangerle, Christine Bauer, Markus Schedl and Elisabeth Lex
*Correspondence: [email protected]
Know-Center GmbH, Graz, Austria
Full list of author information is available at the end of the article
Abstract
Music recommender systems have become an integral part of music streaming services such as Spotify and Last.fm to assist users navigating the extensive music collections offered by them. However, while music listeners interested in mainstream music are traditionally served well by music recommender systems, users interested in music beyond the mainstream (i.e., non-popular music) rarely receive relevant recommendations. In this paper, we study the characteristics of beyond-mainstream music and music listeners and analyze to what extent these characteristics impact the quality of music recommendations provided. To this end, we create a novel dataset consisting of Last.fm listening histories of several thousand beyond-mainstream music listeners, which we enrich with additional metadata describing music tracks and music listeners. Our analysis of this dataset shows four subgroups within the group of beyond-mainstream music listeners that differ not only with respect to their preferred music but also with respect to their demographic characteristics. Furthermore, we evaluate the quality of the music recommendations that four different recommendation algorithms provide to these subgroups, and we find significant differences between the groups. Specifically, our results show a positive correlation between a subgroup’s openness towards music listened to by members of other subgroups and recommendation accuracy. We believe that our findings provide valuable insights for developing improved user models and recommendation approaches to better serve beyond-mainstream music listeners.
Keywords:
Music Recommender Systems; Acoustic Features; Last.fm; Clustering; User Modeling; Fairness; Popularity Bias; Beyond-Mainstream Users
In the digital era, users have access to continually increasing amounts of music via music streaming services such as Spotify and Last.fm. Music recommender systems have become an essential means to help users deal with content and choice overload as they assist users in searching, sorting, and filtering these extensive music collections. Simultaneously, both music listeners and artists benefit from the employed segmentation and personalization approaches that are typically leveraged in music recommendation approaches [1]. As a result, users with different preferences and needs can be targeted in various ways with the goal that all users are presented the information and content that they need or prefer. This also means that current recommendation techniques should serve all users equally well, independent of their inclination to popular content.
Figure 1
Recommendation accuracy measured by the mean absolute error (MAE) of a non-negative matrix factorization-based approach (i.e., NMF [2]) and a neighborhood-based approach (i.e., UserKNN [3]) for mainstream and beyond-mainstream user groups in Last.fm. We see that beyond-mainstream users receive a substantially lower recommendation quality (i.e., higher MAE) compared to mainstream music listeners. Thus, for recommender systems, it is harder to provide high-quality recommendations to beyond-mainstream music listeners than to mainstream music listeners.
Present work.
In the paper at hand, we focus on music consumers who listen to music beyond the mainstream (i.e., users who listen to non-popular music) in the music streaming platform Last.fm [a]. As highlighted in Figure 1, current recommender systems do not work well for consumers of beyond-mainstream music (see Section 3.5 for details on this analysis). In contrast, music consumers who listen to popular music seem to get better recommendations. This finding is not essentially new. In fact, it is a widely-known problem that recommender systems (and those based on collaborative filtering, in particular) are prone to popularity bias, which leads to the behavior that long-tail items (i.e., items with few user interactions) have little chance of being recommended. This phenomenon is also present across different application domains such as movies [4] or music [5].

Our previous work [6] has shown that users interested in beyond-mainstream music tend to have larger user profile sizes (i.e., individual users show a high(er) number of distinct artists they have listened to) compared to users interested in mainstream music. The observation that beyond-mainstream music listeners produce a substantial amount of digital footprints motivates the need to improve the recommendation quality for this group. However, although related research has already studied the long-tail recommendation problem (e.g., [7, 8, 9, 10]; see Section 2 for a more detailed discussion of related work), it is still a fundamental challenge to understand and identify the characteristics of beyond-mainstream music and beyond-mainstream music listeners. Additionally, related work [11] has shown that the group-specific concepts of openness and diversity influence recommendation quality, where openness is defined as across-group diversity (i.e., do users of one group listen to the music of other groups?) and diversity is defined as within-group variability (i.e., how dissimilar is the music listened to by users within groups?). Thus, we are also interested in the correlation between the characteristics of beyond-mainstream music and music listeners with openness and diversity patterns as well as with recommendation quality. Concretely, our work is guided by the following research question:
RQ: What are the characteristics of beyond-mainstream music tracks and music listeners, and how do these characteristics correlate with openness and diversity patterns as well as with recommendation quality?
To address this research question, we create, provide, and analyze a novel dataset called LFM-BeyMS, which contains complete listening histories of more than 2,000 beyond-mainstream music listeners mined from the Last.fm music streaming platform. Besides, our dataset is enriched with acoustic features and genres of music tracks. Using this enriched dataset, we identify different types of beyond-mainstream music via unsupervised clustering applied to the acoustic features of music tracks. We then characterize the resulting music clusters using music genres. Then, we assign beyond-mainstream users to the clusters to further divide the beyond-mainstream users into subgroups. We study how the characteristics of these beyond-mainstream subgroups correlate with openness and diversity patterns as well as with recommendation quality measured through prediction accuracy.
Findings and contributions.
We identify four clusters of beyond-mainstream music in our dataset: (i) C_folk, music with high acousticness such as “folk”, (ii) C_hard, high energy music such as “hardrock”, (iii) C_ambi, music with high acousticness and high instrumentalness such as “ambient”, and (iv) C_elec, music with high energy and high instrumentalness such as “electronica”. By assigning users to these clusters, we get four distinct subgroups of beyond-mainstream music listeners: (i) U_folk, (ii) U_hard, (iii) U_ambi, and (iv) U_elec. We also find that these groups differ considerably with respect to the accuracy of recommendations they receive, where group U_ambi gets significantly better recommendations than U_hard. When relating our results to openness and diversity patterns of the subgroups, we find that U_ambi is the most open but least diverse group, while U_hard is the least open but most diverse group. This is in line with related research [11], which has shown that openness is more strongly correlated with accurate recommendations than diversity. This means that users are more likely to accept recommendations from different groups (i.e., openness) than recommendations that vary within a group (i.e., diversity).

Summed up, our contributions are:
• We identify more than 2,000 beyond-mainstream music listeners on the Last.fm platform and enrich their listening profiles with acoustic features and genres of music tracks listened to (Sections 3.1–3.4).
• We validate related research by showing that beyond-mainstream music listeners receive a significantly lower recommendation accuracy than mainstream music listeners (Section 3.5).
• We identify four clusters of beyond-mainstream music using unsupervised clustering and characterize them with respect to acoustic features and music genres (Section 4.1).
• We define four subgroups of beyond-mainstream music listeners by assigning users to the music clusters and discuss the relationship between openness, diversity, and recommendation quality of these groups (Section 4.2).
• To foster reproducibility of our research, we make available our novel LFM-BeyMS dataset via Zenodo [b] and the entire Python-based implementation of our analyses via GitHub [c].
We believe that our findings provide useful insights for creating user models and recommendation algorithms that better serve beyond-mainstream music listeners.
We identify three strands of research that are relevant to our work: (i) modeling of music preferences, (ii) long-tail recommendations, and (iii) popularity bias in music recommender systems.
Modeling of music preferences.
A multitude of factors [12] influences musical tastes and musical preferences of users. Characteristics of music listeners and music preferences have been studied in various research domains [13], ranging from music sociology [14] and psychology [15] to music information retrieval and music recommender systems [1]. Studies on music listening behavior showed that personal traits and long-term music preferences are correlated as people tend to prefer music styles that align with their personalities [16, 17]. Furthermore, related work found a relationship between music and motivation [18], music and emotion [19, 20, 21, 22] or both personality and emotion [23]. Openness, a personality trait from the Five Factor Model [24], has also been shown to positively influence a user’s preference for music recommendations [11]. Specifically, the authors of [11] found that people tend to prefer recommendations from different kinds of music (i.e., openness) over recommendations that vary within a specific kind of music (i.e., diversity). Others showed that familiarity has a positive influence on music preferences [25, 26] and that music preferences may change over time [27]. Another strand of research on modeling users’ music preferences leverages content features, e.g., acoustic features. It has been shown that the distribution of acoustic features of a user’s preferred genre substantially influences the user’s choice of music within other genres [28]. Also, acoustic features have been utilized to model users’ preferences under different contextual conditions, in order to refine recommendation quality [29]. Based on tracks’ acoustic features, the authors of [30] identified several types of music, and subsequently modeled each user by linearly combining the acoustic features of the music types. In contrast to these works, we focus on using acoustic features of music tracks for modeling and clustering beyond-mainstream music. Additionally, we link these beyond-mainstream music clusters to music genres and users in our Last.fm data sample.
Long-tail recommendations.
Related research [8, 9] has found that individual music consumption is biased towards popular music and that usage data for less popular music is scarce. Due to the scarcity problem, items with no or few ratings (i.e., long-tail items) have little chance of being recommended [7]. As a consequence, users that particularly favor items with few ratings or interactions are less likely to be recommended those items that they like [5]. That is problematic because many users, from time to time, prefer niche music [10]. Therefore, such users are not well served as a result of their preference for less popular items. That has been attributed to popularity bias, which corresponds to over-representation of popular items in the recommendation lists [31, 32, 33]. Abdollahpouri et al. [4] studied popularity bias in a dataset of movies (i.e., the MovieLens 1M dataset [34]) from the user perspective. Their study showed that commonly used recommendation techniques tend to deliver worse recommendations to users who prefer less popular movies. In our work [6], we found evidence for popularity bias in a Last.fm dataset and showed that traditional personalized recommendation algorithms such as collaborative filtering deliver worse recommendations for consumers of niche music. In the present work, we aim to gain a deeper understanding of the behavior and preferences of this beyond-mainstream user group. Thus, in contrast to existing works in long-tail recommendations, we focus on the user rather than the item perspective.
Popularity bias in music recommender systems.
Music recommender systems [1] are crucial tools in online streaming services such as Last.fm, Pandora, or Spotify. They help users find music that is tailored to their preferences. Music recommender systems build on user models derived from users’ listening behavior, user properties such as personality (e.g., [35]), content features of music, or hybrid combinations of both, e.g., [36, 37, 38, 39]. As discussed earlier, due to insufficient amounts of usage data for less popular items, many music recommendation algorithms do not provide useful recommendations for consumers of less popular and niche items. As a remedy, in [40], an approach is suggested that divides music consumers into experts and novices according to the long-tail distribution in their playlists. These experts are then converted to nodes with bidirectional links connecting all the experts. These links are created to perform link analysis on the graph and to assign fine-grained weights to songs. The presented approach helps add music from the long tail into the recommendation list. In our previous research [41], we use a framework [42] that employs insights from human memory theory to design a music recommendation algorithm that provides more accurate recommendations than collaborative filtering-based approaches for three groups of users, i.e., low-mainstream, medium-mainstream and high-mainstream users. While the awareness of popularity bias in music recommender systems increases (e.g., [43]), the characteristics of music consumers whose preferences lie beyond popular, mainstream music are still not well understood. In the present work, we shed light on the characteristics of such beyond-mainstream music consumers and relate them to openness and diversity patterns as well as recommendation quality. With this, we aim to provide useful insights for creating novel music recommendation models that mitigate popularity bias.
We investigate the characteristics of beyond-mainstream music listeners in a dataset mined from Last.fm, a popular music streaming platform. We characterize the tracks in our dataset with acoustic features. Besides, we compare the recommendation accuracy of beyond-mainstream music listeners with that of mainstream music listeners to motivate our subsequent analysis of the characteristics of beyond-mainstream music listeners.
For our analyses, we characterize music tracks using acoustic features that describe the content of a given track. Following the lines of, e.g., [44, 45, 46, 30], we rely on acoustic features provided by the Spotify API as a compact characterization of tracks [d]. The following eight features are extracted from the audio signal of a track:
Danceability captures how suitable a track is for dancing and is computed based “on a combination of musical elements including tempo, rhythm stability, beat strength, and overall regularity”.
Energy describes the perceived intensity and activity of a track and is based on the dynamic range, perceived loudness, timbre, onset rate, and general entropy of a track.
Speechiness captures the presence of spoken words in a track. High speechiness values indicate a high degree of spoken words (e.g., an audiobook), whereas medium values indicate tracks with both music and speech (e.g., rap music). Low values represent typical music tracks.
Acousticness measures the probability that the given track only contains acoustic instruments.
Instrumentalness quantifies the probability that a track contains no vocals, i.e., the track is instrumental.
Tempo measures the rate of the track’s beat in beats per minute.
Valence describes the “emotional positiveness” conveyed by a track (i.e., cheerful and euphoric tracks reach high valence values).
Liveness measures the probability that a track was performed live, i.e., whether an audience is present in the recording.
To study characteristics of beyond-mainstream users and their listening preferences, we create a novel dataset called LFM-BeyMS that contains the required information for such analyses. We base our work on a dataset gathered from the Last.fm music platform, which we considerably enrich with the music tracks’ acoustic features (see Section 3.1) [47]. Additionally, we combine this data with mainstreaminess information of Last.fm users (see Section 3.3) as well as music genre information to identify beyond-mainstream listeners and music (see Section 3.4).

An overview of our new LFM-BeyMS dataset and its data sources is depicted in Figure 2. As shown, the starting point for our new dataset is the publicly available LFM-1b dataset [e] of music listening information shared by users of the online music platform Last.fm [48]. LFM-1b contains listening histories of 120,322 users; their listening records (or “listening events”) have been created between January 2005 and August 2014. They sum up to over 1.1 billion listening events (LEs), where each LE is described by an (anonymous) user identifier, the artist name, the album name, the track name, and the timestamp of the listening event. Also, the LFM-1b dataset includes demographics of some users (i.e., country, age, and gender).

To enrich the LFM-1b dataset to suit our task, we utilize our previously created
CultMRS music recommendation dataset [49]. This dataset contains 55,191 users, who have listened to a total of 26,022,625 distinct tracks, captured by a total of 807,890,921 listening events [47].

[Figure 2 diagram elements: LFM-1b dataset (user, track, artist, listening events); mainstreaminess score (1); acoustic features (8) via Spotify API; CultMRS dataset; Last.fm tags with IDF scoring; track genres (3,034) matched with Spotify genre taxonomy; user filtering; LFM-BeyMS dataset (BeyMS, Recommendation)]
Figure 2
Overview of our new LFM-BeyMS dataset and its data sources. We depict the different features, their origin, and relation, and show the feature groups with the number of contained features in brackets. LFM-BeyMS contains BeyMS, i.e., data to study the beyond-mainstream user group, and Recommendation, i.e., data to conduct recommendation experiments of beyond-mainstream and mainstream music listeners.

To further enrich the dataset with music acoustic features, we gather the acoustic features described in Section 3.1 for the tracks remaining in the dataset after the filtering described above. To this end, we rely on the Spotify API to gather content-based acoustic features for each track. Particularly, we search tracks using the <track, artist, album> triples extracted from the LFM-1b dataset using the Spotify search API [f] to gather the Spotify track URI of each track by using all three parts of the triple in a conjunctive query. In total, this allowed gathering 4,326,809 Spotify URIs. For the remainder of the tracks, we were not able to retrieve a URI. We attribute this to two factors: either the searched track is not provided by Spotify or the track, artist, and album information cannot be matched to a Spotify track unambiguously. Subsequently, we use the obtained track URI to query the acoustic features API, which returns the acoustic features of a given track (cf. Section 3.1). In a subsequent cleaning step, we remove all tracks for which the Spotify API did not provide the full set of acoustic features.

That procedure provides us with a set of 3,478,399 unique tracks and their acoustic features. Within the LFM-1b dataset, this amounts to 13.36% of the distinct tracks. Overall, these account for as much as 48.89% of all listening events (i.e., the tracks listened to by users) of the LFM-1b dataset. The resulting dataset, now enriched by acoustic music descriptors, comprises a total of approximately 394 million listening events of 55,149 users. In Table 1 (column “CultMRS”), we provide further descriptive statistics of the
CultMRS dataset. We refine this dataset to create our new LFM-BeyMS dataset (column “BeyMS” in Table 1), which consists of BeyMS, i.e., data to study the characteristics of beyond-mainstream music listeners, and Recommendation, i.e., data to conduct recommendation experiments of beyond-mainstream and mainstream music listeners.

Table 1
Descriptive statistics of the CultMRS dataset and our novel LFM-BeyMS dataset. CultMRS comprises acoustic features of tracks. LFM-BeyMS is based on CultMRS and consists of BeyMS and Recommendation. Our analyses of beyond-mainstream music listeners utilize BeyMS and our recommendation experiments utilize Recommendation, which includes listening events of both users with beyond-mainstream and mainstream music taste.

Item                      CultMRS [49]        LFM-BeyMS (our novel dataset)
                                              BeyMS           Recommendation
Users                     55,149              2,074           4,148
Tracks                    3,478,399           157,444         1,084,922
Artists                   337,840             14,922          110,898
Listening Events (LEs)    394,944,868         4,916,174       16,687,363
Min. LEs per user         1                   3               9
Q1 LEs per user           1,442               1,254           2,604
Median LEs per user       5,667               2,048           3,766
Q3 LEs per user           9,738               3,239           5,252
Max. LEs per user         399,210             10,536          11,177
Avg. LEs per user         7,161.41 (± […])    […] (± […])     […] (± […])
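As a rough illustration of the conjunctive track lookup described in this section, the sketch below builds a search request from a <track, artist, album> triple. It assumes Spotify's documented field filters (`track:`, `artist:`, `album:`) and the `https://api.spotify.com/v1/search` endpoint. How the original crawl formatted its queries, authenticated, and resolved multiple hits is not stated in the text, so `build_track_query` and `build_search_url` are hypothetical helper names, and no request is actually sent.

```python
from urllib.parse import urlencode

SEARCH_ENDPOINT = "https://api.spotify.com/v1/search"  # Spotify Web API search


def build_track_query(track: str, artist: str, album: str) -> str:
    """Combine all three parts of the triple into one conjunctive query."""
    parts = {"track": track, "artist": artist, "album": album}
    # Field filters narrow the match to the given field (query format assumed).
    return " ".join(f"{field}:{value}" for field, value in parts.items())


def build_search_url(track: str, artist: str, album: str, limit: int = 1) -> str:
    """Return the URL of a track search; sending it requires an OAuth token."""
    params = {
        "q": build_track_query(track, artist, album),
        "type": "track",
        "limit": limit,
    }
    return f"{SEARCH_ENDPOINT}?{urlencode(params)}"


# Illustrative triple, not taken from the dataset.
query = build_track_query("So What", "Miles Davis", "Kind of Blue")
url = build_search_url("So What", "Miles Davis", "Kind of Blue")
```

A track for which such a query returns no hit, or several equally plausible hits, would correspond to the unmatched remainder mentioned in the text.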
To identify beyond-mainstream music listeners, for each user, we compute a mainstreaminess score, which is generally defined as the overlap between a user’s individual listening history and the aggregated listening history of all Last.fm users in the dataset. In this vein, the mainstreaminess score reflects a user’s inclination to music listened to by the Last.fm mainstream listeners (i.e., the “average” Last.fm listener in the dataset). In [50], several measures of user mainstreaminess are defined. Out of these, we choose the M-global-R-APC definition since it yielded good results in context-based music recommendation experiments for the LFM-1b dataset, as evidenced in [50]. The M-global-R-APC measure approximates a user’s mainstreaminess score by computing Kendall’s τ [51] rank correlation between the user’s vector of artist play counts and the global vector of artist play counts (aggregated over all users in the dataset). This definition also explains the name of the measure, where “M” refers to mainstreaminess, “global” indicates the global perspective, “R” stands for rank correlation, and “APC” refers to artist play counts. Next, we describe how we identify our beyond-mainstream users via filtering the users by the number of listening events (see Figure 3 and Section 3.3.1) and by mainstreaminess scores (see Figure 4 and Section 3.3.2).

For our study, we select the users so that listeners of different levels of “listening activity” are equally represented. We conduct a Gaussian kernel density estimation (KDE) [52] on the distribution of listening events over users to estimate the continuous probability density function (PDF) [53]. However, KDE estimates the PDF via discrete bins and hence, it is necessary to approximate the gradient via the principle of finite differences. The gradient of the PDF helps us identify regions of increasing or decreasing probability.

Figure 3 shows that two large subsets of users exist that exhibit either very few or an abundance of listening events. For our analysis, we consider only users who are not in one of the subsets as mentioned earlier. On the one hand, we exclude users with too little data available for studying their listening behavior; and on the other hand, we exclude so-called power listeners who might bias our analyses. Furthermore, such users with a very high number of listening events are often radio stations, which do not contribute reliable data to our investigations.

Hence, we define lower and upper bounds regarding the number of users’ listening events to include in our study, such that the rate of change in terms of the number of listening events is minimal and stable within these boundaries. That requires the gradient of the region within the lower and upper bound to be near zero (i.e., ± […]). By computing the second-order accurate central differences [54], we obtain an approximation of the gradient and find the longest cohesive region fulfilling the requirements between a lower bound of 4,688 and an upper bound of 14,787 listening events per user, which leads to 12,814 users.

[Figure 3 plot: panels “Density” and “Gradient” over “Number of listening events per user”, each marking the lower bound (4,688) and the upper bound (14,787)]
Figure 3
Distribution of listening events in our set of Last.fm users. We set the lower and upper bounds, marked as dashed and dotted lines, respectively, based on the gradient, which results in 12,814 users with a similar number of listening events.

Figure 4 illustrates the mainstreaminess distribution of the 12,814 users that we have extracted based on the number of listening events. Here, mainstreaminess is defined according to the
M-global-R-APC definition taken from [50] (explained in Section 3.3). By setting an appropriate upper bound, we aim to exclude mainstream music listeners. In other words, we aim to set the upper bound to the beginning of the distribution’s bulk, which is motivated as follows: Firstly, the first inflection point (i.e., maximal gradient) of a Gaussian distribution is found at $E[X] - \mathrm{std}(X)$, where $E[X]$ is the expectation, and $\mathrm{std}(X)$ is the standard deviation of the Gaussian random variable $X$. Secondly, the first inflection point of a Gaussian distribution is equivalent to the 15.9-percentile. By setting the mainstreaminess threshold to this point, we intend to omit the majority of users and hence, only consider the 15.9% of users with the lowest mainstreaminess scores. Utilizing this upper bound on the mainstreaminess score, we obtain a set of 2,074 beyond-mainstream users. Furthermore, the Gaussian assumption can be strengthened by the observation that the 2,074 beyond-mainstream users represent 16.19% of users. In the remainder of this paper, we refer to this set of beyond-mainstream music listeners as BeyMS.

[Figure 4 plot: panels “Density” and “Gradient” over “Mainstreaminess”, each marking the bound at 0.097732]
Figure 4
Mainstreaminess distribution of the 12,814 users illustrated in Figure 3. Based on the maximum gradient, we select an upper bound of 0.097732 to identify the 2,074 beyond-mainstream users of the BeyMS user group.

We aim to study beyond-mainstream listeners in terms of their music taste. We characterize music via its acoustic features, as described in Section 3.1, and also investigate genres as an alternative way to describe a music track via conventional categories. As the LFM-1b dataset does not contain genre annotations of tracks and the Spotify API only provides genres on artist level [g], we leverage the tags assigned to each track by Last.fm users to identify genre annotations. To obtain these tags, we use the respective Last.fm API endpoint [h]. After having fetched the tags for each track, we de-capitalize them and remove all non-alpha-numeric characters. Since not all tags used by Last.fm users correspond to actual music genres (e.g., the “seenlive” tag is used to indicate that a user has seen an artist performing this track live), we use a fine-grained music genre taxonomy consisting of 3,034 genres that are also utilized by Spotify, which we gather from the EveryNoise service (2019-07-24) [i]. Specifically, for each track listened to by any of our BeyMS users, we remove all tags that are not part of the EveryNoise genre taxonomy, using a case-insensitive matching approach.
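Returning briefly to the user selection of Section 3.3, the following sketch computes a mainstreaminess score as a Kendall rank correlation between a user's artist play counts and the global artist play counts, and checks the Gaussian argument behind the 15.9-percentile. It is a minimal illustration under assumptions the text does not confirm: the tau-b variant (which handles ties), a shared artist vocabulary with a play count of 0 for artists a user never played, the population standard deviation, invented toy play counts, and the hypothetical names `kendall_tau_b`, `mainstreaminess`, and `low_mainstream_threshold`. The paper itself selects the bound from the maximum gradient of the estimated density; mean minus one standard deviation is only its Gaussian counterpart.

```python
from math import sqrt
from statistics import NormalDist, fmean, pstdev


def kendall_tau_b(x, y):
    """Kendall's tau-b of two equally long sequences (O(n^2), for clarity)."""
    concordant = discordant = tied_x = tied_y = 0
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dx, dy = x[i] - x[j], y[i] - y[j]
            if dx == 0:
                tied_x += 1
            if dy == 0:
                tied_y += 1
            if dx * dy > 0:
                concordant += 1
            elif dx * dy < 0:
                discordant += 1
    pairs = n * (n - 1) // 2
    denom = sqrt((pairs - tied_x) * (pairs - tied_y))
    return (concordant - discordant) / denom if denom else 0.0


def mainstreaminess(user_apc, global_apc):
    """Rank-correlate a user's artist play counts with the global ones."""
    artists = sorted(global_apc)  # shared artist vocabulary (assumption)
    user_vec = [user_apc.get(a, 0) for a in artists]
    global_vec = [global_apc[a] for a in artists]
    return kendall_tau_b(user_vec, global_vec)


def low_mainstream_threshold(scores):
    """First inflection point of a fitted Gaussian: mean minus std."""
    return fmean(scores) - pstdev(scores)


# Toy play counts, invented for illustration.
global_apc = {"A": 900, "B": 500, "C": 120, "D": 15}
fan_of_hits = {"A": 40, "B": 22, "C": 5, "D": 1}
niche_fan = {"D": 60, "C": 30, "B": 2}

# Probability mass of a Gaussian below mean - std (about 0.159).
share_below = NormalDist().cdf(-1.0)
```

In this toy setting the first user ranks artists exactly like the global population and the second user ranks them in reverse, so their scores sit at the two extremes of the correlation range.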
We note that Last.fm users tend to assign very general genre tags to a large number of tracks, such as “pop” or “rock”. To remove these coarse-grained genres and to identify fine-grained beyond-mainstream music genres, we calculate the inverse document frequency (IDF) [55] metric of our genre-track distribution by treating genres as terms and tracks as documents, i.e.,

$\mathrm{IDF}(g) = \log \frac{|T|}{|\{t \in T \text{ with } g \in G_t\}|}$.

More precisely, the IDF-score of genre $g$ is determined by relating the number of all tracks $|T|$ to the number of tracks annotated with genre $g$, where $G_t$ is the set of genres assigned to track $t$. This way, a coarse-grained genre receives a small IDF-score, while a fine-grained genre receives a high IDF-score. Figure 5 shows the IDF-score distribution of the top-100 genres in ascending order (i.e., from coarse-grained to fine-grained). Here, we identify two groups of genres, where the first group consists of 6 genres with small IDF-scores, and the second group consists of 94 genres with high IDF-scores. The visual inspection of Figure 5 indicates that the lower bound of 0.90 serves as a discriminant between these two groups of coarse-grained and fine-grained genres. Consequently, we remove the 6 coarse-grained genres (i.e., “rock”, “pop”, “electronic”, “metal”, “alternativerock”, “indierock”) from the genre assignments of our tracks, which leads to 157,444 out of 799,659 tracks listened to by BeyMS users with at least one remaining genre. In total, these tracks are annotated with 1,418 unique genre identifiers.

We are aware of the fact that our track filtering procedure leads to incomplete listening profiles of users. Since we rely on genres to describe beyond-mainstream music, these filtering steps are necessary for our study. To ensure that the BeyMS users’ reduced listening profiles are still representative of their music preferences, we further investigate the consequences of the filtering procedure. Here, we find that a user’s listening history (i.e., the entirety of a user’s listening events) is reduced to 61% on average. However, we also find that there are only 62 of the 2,074 BeyMS users for whom the listening history is reduced to less than 20%. For these users most affected by the filtering, we compare the acoustic feature distributions of their listened tracks before and after the filtering steps, and find that filtering only marginally affects the acoustic feature distributions (i.e., average change in mean = […] ± […]). The descriptive statistics of BeyMS are summarized in column “BeyMS” in Table 1.
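The tag cleaning and IDF-based genre filtering described above can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the logarithm base is not stated in the text (base 10 is assumed here), the toy taxonomy and track annotations are invented, and `normalize_tag`, `genre_idf`, and `drop_coarse_genres` are hypothetical names. The ASCII-only pattern also strips accented letters, which may differ from the original cleaning.

```python
import math
import re
from collections import Counter


def normalize_tag(tag: str) -> str:
    """Lower-case a tag and drop every non-alphanumeric character."""
    return re.sub(r"[^a-z0-9]", "", tag.lower())


def genre_idf(track_genres: dict, base: float = 10.0) -> dict:
    """IDF(g) = log(|T| / number of tracks annotated with g)."""
    n_tracks = len(track_genres)
    doc_freq = Counter(g for genres in track_genres.values() for g in genres)
    return {g: math.log(n_tracks / df, base) for g, df in doc_freq.items()}


def drop_coarse_genres(track_genres: dict, idf: dict, bound: float = 0.90) -> dict:
    """Remove genres with IDF below the bound; drop tracks left without genres."""
    kept = {t: {g for g in genres if idf[g] >= bound}
            for t, genres in track_genres.items()}
    return {t: genres for t, genres in kept.items() if genres}


# Toy stand-in for the genre taxonomy and the raw Last.fm tags.
taxonomy = {"rock", "shoegaze", "darkambient"}
raw_tags = {f"t{i}": ["Rock", "seen live"] for i in range(10)}
raw_tags["t0"].append("Shoegaze")
raw_tags["t1"].append("Dark Ambient")

# Keep only normalized tags that are part of the taxonomy.
track_genres = {t: {normalize_tag(tag) for tag in tags} & taxonomy
                for t, tags in raw_tags.items()}
idf = genre_idf(track_genres)
fine_grained = drop_coarse_genres(track_genres, idf)
```

In this toy example “rock” is attached to every track and therefore receives an IDF-score of 0, so only the two tracks carrying a rarer genre survive the filtering.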
In order to compare the accuracy of recommendations received by the users of our BeyMS group and by mainstream users, we construct a dataset consisting of BeyMS’s listening events and the listening events of an equally-sized group of mainstream users. Therefore, we define the MS user group as 2,074 (i.e., the size of our BeyMS group) randomly-chosen users with a mainstreaminess score that is higher than the upper bound for low mainstreaminess, identified in Figure 4. Furthermore, the MS users are also in between the lower and upper bounds for listening events identified in Figure 3. As shown in Table 1 (column “Recommendation”), the dataset used for the evaluation of recommendations contains data of 4,148 distinct BeyMS and MS users, 1,084,922 distinct tracks, and 16,687,363 listening events.

[Figure 5 plot: y-axis “Genre IDF-score”; lower bound (0.90) marked]
Figure 5
IDF-score distribution of the top-100 genres in ascending order (i.e., from coarse-grained to fine-grained). The 6 coarse-grained genres below the lower bound of 0.90 are removed from the genre assignments, i.e., “rock”, “pop”, “electronic”, “metal”, “alternativerock”, “indierock”.
We use the Python-based open-source recommendation library Surprise to compute and evaluate recommendations. One advantage of using Surprise is that it provides built-in recommendation algorithms as well as a standardized evaluation pipeline, which enhances the reproducibility of our research. Since Surprise is focused on rating prediction, we formulate our music recommendation scenario also as a rating prediction problem, in which we predict the preference of a target user u for a target track t. As done in [56], we model the preference of u for t by scaling the play count (i.e., number of listening events) of t by u to a range of [1; 1,000] using min-max normalization. We perform this normalization on the individual user level to ensure that all users share the same preference value ranges. Thus, with this method, we ensure that each user’s most listened track has a preference value of 1,000, while their least listened track has a preference value of 1. To ensure that this min-max normalization procedure does not disrupt the play count distribution of our users, we compare the original play count distribution with the normalized distribution and find that both distributions are strongly right-skewed. Specifically, we find very similar distributions for large amounts of our play count data.

We utilize a selection of Surprise’s built-in recommendation methods consisting of one baseline approach (i.e., UserItemAvg), two neighborhood-based approaches (i.e., UserKNN and UserKNNAvg), and one matrix factorization-based approach (i.e., NMF). Specifically, UserItemAvg predicts the average play count in the dataset by also accounting for deviations of u and t, for example, if a user u tends to have more listening events than the average Last.fm user [57].
UserKNN [3] is a user-based collaborative filtering approach and is calculated using k = 40 nearest neighbors and the cosine similarity metric, which are the default settings of Surprise. UserKNNAvg is an extension of UserKNN [3] that also takes the average rating of target user u into account. Finally, NMF, i.e., non-negative matrix factorization [2], is calculated using 15 latent factors, which is the default parameter in the Surprise library. As shown in our previous work [6], NMF is also capable of recommending non-popular items from the long tail and should therefore especially be of interest for our beyond-mainstream recommendation setting.

We use Surprise’s default parameters and refrain from performing any hyperparameter tuning since we are only interested in assessing (relative) performance differences between the two user groups BeyMS and MS, and not in outperforming any state-of-the-art algorithm. This is also the reason why we focus on traditional algorithms instead of investigating the most recent deep learning architectures, which would also require a much higher computational effort.

Table 2
Mean absolute error (MAE) results for the two user groups MS and BeyMS of different mainstreaminess and a selection of standard recommendation algorithms. A one-tailed Mann-Whitney-U test (α = . ) provides significant evidence, indicated by ***, that all algorithms perform worse on BeyMS than on MS in terms of MAE. Furthermore, NMF (as shown in bold) outperforms the other three approaches UserItemAvg, UserKNN and UserKNNAvg.
User group UserItemAvg UserKNN UserKNNAvg NMF
BeyMS ***
MS
Overall 62.2315 69.8962 65.2469

The resulting mean absolute error (MAE) results can be observed in Table 2 (and correspond to the ones already shown in Figure 1). We favor MAE over the commonly used root mean squared error (RMSE) due to several pitfalls of RMSE, especially regarding the comparison of groups with different numbers of observations [58]. Here, we perform 5-fold cross-validation leading to 5 different 80/20 train-test splits and average the MAE over the 5 folds.
NMF clearly outperforms UserItemAvg as well as the two neighborhood-based methods (i.e., UserKNN and UserKNNAvg), both for the two user groups separately (see rows “BeyMS” and “MS”) and overall without distinguishing between the user groups (see row “Overall”). Additionally, we conduct a one-tailed Mann-Whitney-U test (α = . ) with the null hypothesis of the MAE for MS being larger than or equal to the MAE for BeyMS. Results marked with *** indicate that the null hypothesis was rejected for every fold. Thus, all algorithms (including NMF) provide a significantly larger error for BeyMS than for MS. In other words, recommendation quality is significantly better for users with mainstream taste than for users who prefer beyond-mainstream music across all recommendation approaches.

These initial results underpin the need to study the characteristics of the BeyMS user group that receives worse recommendations. The corresponding experiments are presented in the next section.
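The per-user scaling of play counts to preference values and the MAE used in this section can be illustrated with a short, library-free sketch. It does not reproduce the Surprise pipeline; the function names and toy numbers are hypothetical, and the handling of users whose play counts are all equal is an assumption, since the text does not cover that case.

```python
def scale_play_counts(play_counts, low=1.0, high=1000.0):
    """Min-max scale one user's play counts {track: count} to [low, high]."""
    c_min = min(play_counts.values())
    c_max = max(play_counts.values())
    if c_max == c_min:
        # Assumption: a user with constant play counts gets the lower bound.
        return {track: low for track in play_counts}
    span = c_max - c_min
    return {
        track: low + (count - c_min) * (high - low) / span
        for track, count in play_counts.items()
    }


def mean_absolute_error(pairs):
    """MAE over (true preference, predicted preference) pairs."""
    return sum(abs(true - pred) for true, pred in pairs) / len(pairs)


# Hypothetical user with three tracks and their play counts.
user_counts = {"track_a": 1, "track_b": 3, "track_c": 11}
scaled = scale_play_counts(user_counts)

# Hypothetical predictions for two test items of this user.
mae = mean_absolute_error([(1000.0, 900.0), (1.0, 51.0)])
```

Because the scaling is applied per user, the most listened track of every user maps to 1,000 and the least listened track to 1, so errors are comparable across users with very different activity levels.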
We identify the types of beyond-mainstream music using unsupervised clustering and characterize these types with respect to acoustic features and music genres. In addition, we detect subgroups of beyond-mainstream music listeners by assigning users to these clusters and evaluate the recommendation quality obtained for these subgroups. Finally, we discuss the recommendation quality with respect to openness and diversity. For this, we rely on the definitions given by [11]:

Openness is the across-groups diversity (or categorical diversity) and describes whether users of one group also listen to the music of other groups.
Diversity is the within-groups diversity (or thematic diversity) and describes the dissimilarity of music listened to by users within groups.

Based on the findings of [11], we would expect that subgroups with high openness should receive more accurate recommendations than subgroups with high diversity.
To study the different types of music listened to by the users in our BeyMS group, we conduct a cluster analysis. Specifically, we cluster the 157,444 tracks listened to by BeyMS users, where each track is described by the eight acoustic features danceability, energy, speechiness, acousticness, instrumentalness, tempo, valence, and liveness (see Section 3.1). We scale the value ranges of these features to [0, 1] using min-max normalization. The use of latent representations of musical elements such as tracks was shown to be efficient in the area of music information retrieval [59, 60, 30]. Furthermore, for visually analyzing the obtained music clusters and decreasing computation time, we favor a reduction of dimensionality to two dimensions.

We conduct experiments with a broad range of dimensionality reduction methods, i.e., linear and nonlinear principal component analysis (PCA) [61], locally linear embedding [62], multidimensional scaling [63], Isomap [64], spectral embedding [65], t-distributed stochastic neighbor embedding (t-SNE) [66] and uniform manifold approximation and projection (UMAP) [67]. We visually inspect the 2-dimensional feature spaces created by these methods with regard to the clustering quality, and we obtain the visually most homogeneous results with UMAP. Moreover, UMAP has already been successfully used in the music domain [30] and thus, we use it for the remainder of our experiments. Specifically, we utilize the open-source implementation of UMAP [68], which requires four parameters: (i) the distance metric M in the input space, (ii) the number of latent dimensions D, (iii) the minimum distance of points in the latent space d_min, and (iv) the number of neighbors of a point N. Based on experimentation and related literature (e.g., [68]), we set the distance metric M to the Euclidean distance, the number of latent dimensions D to 2, the distance d_min to 0.1 and the number of neighbors N to 15.

In a next step, we perform clustering on the dimensionality-reduced acoustic features of tracks.
Again, we conduct experiments with various clustering methods, i.e., DBSCAN [69], K-Means [70], Gaussian mixture models [71], affinity propagation [72], spectral clustering [73], hierarchical agglomerative clustering [74], OPTICS [75] and HDBSCAN* [76]. Here, we obtain the best results with respect to cluster cohesion and separation using HDBSCAN*. Furthermore, HDBSCAN* was also already used by related work to cluster music items [77]. We employ the open-source implementation of HDBSCAN* [78] that requires four parameters: (i) the minimum cluster size s_min that defines the minimum size of a group of points to be considered a cluster, (ii) the minimum number of samples in the neighborhood of a core point N_min, which quantifies how conservative the clustering is, (iii) ε, which enables the recovery of DBSCAN clusters if the s_min value is not reached, and (iv) the scaling of the distance α, which is another measure of the clustering’s conservativeness. In detail, α scales the distance between two points, which determines whether these points are merged into a cluster. This scaling is used in the construction of HDBSCAN*’s hierarchy of clusterings. Again, we find the best-suited parameters based on experimentation and related literature (e.g., [76]). Specifically, we require each cluster to comprise a sufficiently large number of tracks to increase the level of significance of our subsequent experiments. We expect the existence of very small music clusters and thus, search for the optimal value of the minimal cluster size s_min in the search space of { , , . . . ; 1 , , }, where we obtain the best results with respect to the within-cluster variance for s_min = 1 , N_min = s_min = 1 , ε = 0 and α = 1.

Figure 6
Music clustering results obtained with HDBSCAN* and UMAP for the 2-dimensional mapping. The outputs are four clusters with the following cluster sizes: 12,148 (blue, hatch: /), 92,798 (green, hatch: +), 7,629 (orange, hatch: o) and 30,379 (pink, hatch: x) tracks. 14,490 of our 157,444 BeyMS tracks have not been assigned to a cluster.

Figure 6 shows the results of the clustering process using HDBSCAN* and UMAP for the 2-dimensional mapping. This process leads to four music clusters. Here, the green cluster (hatch: +) is the largest one with 92,798 tracks, followed by the pink cluster (hatch: x) with 30,379 tracks and the blue cluster (hatch: /) with 12,148 tracks. The smallest cluster is the orange one (hatch: o) as it contains 7,629 tracks. The remaining 14,490 of our 157,444 BeyMS tracks have not been assigned to a cluster and thus, will not be included in further analyses and interpretations. Next, we describe how we name these clusters based on their music genre distributions.
In Figure 7, we illustrate the top-10 genres of the four music clusters. For this, we refer to the genre IDF-scores presented in Section 3.4 and weight each genre assigned to a track in a cluster with its corresponding IDF-score. For example, if a genre with an IDF-score of 1.4 is assigned to 1,000 tracks in a cluster, it is visualized as an aggregated genre IDF-score of 1,400 in the corresponding plot of Figure 7. Based on the genre distributions, we label each cluster according to its top genre. With respect to the blue cluster (hatch: /) in Plot (a), we find top genres such as “folk” and “singersongwriter”, which typically reflect music with high acousticness. In the remainder of this paper, therefore, we refer to this cluster as C_folk. The top genres of the green cluster (hatch: +) in Plot (b) are typical high energy music genres such as “hardrock”, “punk”, “poprock”, and “hiphop”. Based on this, we name this cluster C_hard.

Figure 7
Top-10 genres of the four music clusters according to the aggregated genre IDF-scores. We name the clusters according to the top genre, i.e., (a) blue (hatch: /) → C_folk (“folk”), (b) green (hatch: +) → C_hard (“hardrock”), (c) orange (hatch: o) → C_ambi (“ambient”), and (d) pink (hatch: x) → C_elec (“electronica”).

For the orange cluster (hatch: o) in Plot (c), we find genres that reflect music with high acousticness and high instrumentalness such as “ambient”, “experimental”, “newage”, and “postrock”. As “ambient” clearly dominates the genre distribution for this cluster, we name this cluster C_ambi. Similarly to C_folk, this cluster contains music with high acousticness; yet, while C_folk is characterized by low instrumentalness music, C_ambi is characterized by a high level of instrumentalness. Finally, Plot (d) shows the genre distribution of the pink cluster (hatch: x) with “electronica” as the top genre, which leads to the name C_elec for this cluster.

Thus, both C_elec and C_hard consist of high energy music, but in contrast to C_hard, C_elec also comprises high instrumentalness values. This also makes sense when looking at other top genres of C_elec such as “deathmetal” and “blackmetal”, where guttural vocal techniques are often mistakenly classified as another type of instrument [79].

To compare the genre distributions among the four music clusters, we illustrate the relative genre frequency distribution of the clusters in Figure 8. The relative frequency of a genre g depicts the fraction of listening events of tracks within a cluster c that are annotated with g. Here, we only show genres with a minimum relative genre frequency of 0.1. We see that there are clearly dominating genres in C_folk and C_ambi, whereas the genres in C_hard and C_elec are more evenly distributed. When relating this finding to the findings of Figure 7, we clearly see that the results correspond to each other: C_hard and C_elec contain a more diverse genre spectrum (e.g., “hardrock” and “hiphop” are both part of C_hard’s top genres) than C_folk and C_ambi (e.g., in C_ambi’s top genres, we find “ambient” and “darkambient”).

Figure 8
Relative genre frequency distribution of the four music clusters. While there are dominating genres in C_folk and C_ambi, the genre distribution is more diverse in C_hard and C_elec.

To understand the musical content of these four music clusters, we analyze their acoustic feature distributions using boxplots in Figure 9. This visualization does not show any obvious differences with respect to danceability and tempo among the four clusters. For the acoustic features energy, speechiness, acousticness, valence, and liveness, there are similar values for the cluster pairs C_folk and C_ambi, and C_hard and C_elec. We observe differences between these two cluster pairs with respect to energy and acousticness. While C_hard and C_elec provide high energy values and small acousticness values, C_folk and C_ambi feature small energy values and high acousticness values.

In contrast, for instrumentalness, we see similar values for the cluster pairs C_folk and C_hard as well as for C_ambi and C_elec. We observe very high values for C_ambi and C_elec, and very small values for C_folk and C_hard. This difference is also visible in Figure 6 in the form of the gap between C_folk and C_hard on the left, and C_ambi and C_elec on the right.

Figure 9
Distribution of the eight acoustic features for the four music clusters. While the clusters do not show obvious differences with respect to danceability and tempo, we find large differences with respect to energy, acousticness and instrumentalness.

Summing up, in C_folk, we find music with low energy, high acousticness, and low instrumentalness; C_hard contains music with high energy, low acousticness, and low instrumentalness; in C_ambi, we observe music with low energy, high acousticness, and high instrumentalness; and in C_elec, we find high energy, low acousticness, and high instrumentalness. Thus, these findings are in line with the genre distributions presented in Figure 7.
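The two genre statistics used to describe the clusters, i.e., the aggregated genre IDF-score behind Figure 7 and the relative genre frequency behind Figure 8, can be sketched as follows. The function names, the data layout (plain dictionaries), and the toy cluster are hypothetical; only the worked example of an IDF-score of 1.4 on 1,000 tracks and the display threshold of 0.1 come from the text.

```python
from collections import Counter


def aggregated_genre_idf(cluster_tracks, track_genres, idf):
    """Per genre: number of annotated tracks in the cluster times the genre's IDF-score."""
    n_tracks = Counter(g for t in cluster_tracks for g in set(track_genres.get(t, [])))
    return {g: n * idf[g] for g, n in n_tracks.items()}


def relative_genre_frequency(cluster_tracks, track_genres, listening_events):
    """Per genre: fraction of the cluster's listening events on tracks annotated with it."""
    total = sum(listening_events.get(t, 0) for t in cluster_tracks)
    per_genre = Counter()
    for t in cluster_tracks:
        for g in set(track_genres.get(t, [])):
            per_genre[g] += listening_events.get(t, 0)
    return {g: n / total for g, n in per_genre.items()}


# Hypothetical cluster: 1,000 tracks tagged "ambient", 100 of them also "darkambient".
cluster = [f"t{i}" for i in range(1000)]
genres = {t: ["ambient"] for t in cluster}
for t in cluster[:100]:
    genres[t] = ["ambient", "darkambient"]
events = {t: 1 for t in cluster}          # one listening event per track
idf = {"ambient": 1.4, "darkambient": 2.0}

scores = aggregated_genre_idf(cluster, genres, idf)
freqs = relative_genre_frequency(cluster, genres, events)
shown = {g: f for g, f in freqs.items() if f >= 0.1}   # threshold used for Figure 8
```

Here, "ambient" reaches an aggregated score of about 1,400, matching the example given for Figure 7, while the relative frequency weights tracks by how often they were listened to rather than by how specific their genres are.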
Table 3
Descriptive statistics of the four subgroups. Here, |U| is the number of users, |A| is the number of artists, |T| is the number of tracks, |LE| is the number of listening events, |G| is the number of genres, |LE_u| is the average number of listening events per user, |T_u| is the average number of tracks per user and Age is the average age (along with the standard deviation) of users in the group.
Subgroup |U| |A| |T| |LE| |G| |LE_u| |T_u| Age (std.)
U_folk 369 9,559 72,663 702,635 811 1,904.160 549.650 27.599 (± )
U_hard 919 11,966 107,952 2,150,246 1,274 2,339.767 557.470 23.867 (± )
U_ambi 143 6,869 39,649 224,327 918 1,568.720 473.308 29.571 (± 14.138)
U_elec 642 11,814 105,907 1,416,354 1,005 2,206.159 670.402 24.639 (± )

In the next step, we assign the 2,074
BeyMS users to the four music clusters to categorize them into four distinct beyond-mainstream subgroups for further analyses. For each user u, we count the number of listening events LE_{u,c} that u has contributed to the tracks in each cluster c, where c ∈ C = {C_folk, C_hard, C_ambi, C_elec}. Then, we assign u to the cluster c for which the number of contributed listening events LE_{u,c} is the highest. However, because we have varying cluster sizes, the probability of u listening to a track t of the two larger clusters C_hard and C_elec is much higher than for the two smaller clusters C_folk and C_ambi, although C_folk and C_ambi could be more representative choices for u. Thus, similar to the IDF distribution of genres (see Figure 5), we take advantage of the IDF scoring to reduce the influence of the larger clusters and to assign higher weights to the smaller clusters. Specifically, these cluster IDF-scores are given by $\mathrm{IDF}(c) = \log \frac{|T|}{|\{t \in T \text{ with } c_t = c\}|}$, i.e., by relating the number of all tracks |T| to the number of tracks in cluster c, where c_t is the music cluster assigned to track t. That lets us define the user-cluster weight w_{u,c} for user u and cluster c as $w_{u,c} = \mathrm{IDF}(c) \cdot LE_{u,c}$.

Consequently, users are assigned to the highest weighted music cluster and thus, a subgroup U_c for cluster c is given by $U_c = \{u \in U : c = \arg\max_{c' \in C} w_{u,c'}\}$.

Out of the 2,074 BeyMS users, we can assign 2,073 users to these subgroups. Thus, only 1 user listened solely to tracks not contained in any cluster in Figure 6. Similar to the naming scheme of music clusters, we label the subgroups according to the name of their assigned music cluster. Hence, we obtain four subgroups U_folk, U_hard, U_ambi, and U_elec.

Table 3 provides basic descriptive statistics of these four resulting subgroups. Here, U_hard is the largest subgroup with |U| = 919 users, followed by U_elec with |U| = 642 users, U_folk with |U| = 369 users, and U_ambi with |U| = 143 users.
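The IDF-weighted assignment of users to music clusters described above can be sketched as follows. The cluster sizes and the total of 157,444 tracks are the ones reported for Figure 6; the function names, the toy user, the base-10 logarithm, and the choice to count unclustered tracks in |T| are assumptions made for illustration.

```python
import math


def cluster_idf(cluster_sizes, n_tracks, base=10):
    """IDF(c) = log(|T| / |{t in T with c_t = c}|) for every cluster c."""
    return {c: math.log(n_tracks / size, base) for c, size in cluster_sizes.items()}


def assign_user(listening_events, idf):
    """Return the cluster maximizing w_{u,c} = IDF(c) * LE_{u,c}, or None without events."""
    weights = {c: idf[c] * listening_events.get(c, 0) for c in idf}
    best = max(weights, key=weights.get)
    return best if weights[best] > 0 else None


# Cluster sizes as reported for Figure 6; |T| = 157,444 BeyMS tracks.
sizes = {"folk": 12148, "hard": 92798, "ambi": 7629, "elec": 30379}
idf = cluster_idf(sizes, n_tracks=157444)

# Hypothetical user: most listening events fall into the largest cluster.
user_events = {"hard": 300, "folk": 100}
by_raw_count = max(user_events, key=user_events.get)
by_weight = assign_user(user_events, idf)
```

For this toy user, the raw listening-event counts point to the large cluster C_hard, whereas the IDF weighting assigns the user to the much smaller cluster C_folk, which is the effect the weighting is meant to achieve. A user without any listening events on clustered tracks cannot be assigned.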
The differences with respect to the number of users also correspond to the differences regarding the number of artists |A|, the number of tracks |T|, and the number of listening events |LE| contained in the clusters. In the case of the number of genres |G|, this differs slightly because the users in the smaller U_ambi subgroup listen to more genres (i.e., 918) than the users in the bigger U_folk subgroup (i.e., 811). This indicates that the users in U_ambi listen to a broader set of music than the users in U_folk.

Considering the average number of listening events per user (i.e., |LE_u|) and the average number of tracks per user (i.e., |T_u|), we see that, while there is little difference between U_hard and U_elec with respect to |LE_u|, |T_u| is much higher for U_elec (i.e., 670.402) than for U_hard (i.e., 557.470). This indicates that, although the number of listening events is nearly the same, users of U_elec tend to listen to a wider set of tracks than users of U_hard. With respect to the average age of the users (Age), we see that the users of U_folk and U_ambi are the oldest ones, and users of U_hard and U_elec are the youngest ones. However, it is worth noting that the group with the highest average age (i.e., U_ambi) also shows by far the highest standard deviation of age (i.e., 14.138 years).

In Figure 10, we show the contribution of each music cluster to each subgroup in the form of a radar plot. For this, we use the user-cluster weights w_{u,c} introduced before and calculate, for each cluster c, the average weight over all users of a subgroup. One consequence of the IDF scoring applied to w_{u,c} is that the weight contributions of a user group to the four clusters do not sum up to 1, which influences the interpretation of the values shown in Figure 10. However, in return, these values account for the varying cluster sizes and can also be interpreted as preference weights for a user group towards a specific music cluster.

We observe that the weight distribution of the two larger subgroups U_hard and U_elec is rather narrow, which indicates that these users do not listen to many tracks of other clusters. Contrary to that, the weights of the two smaller subgroups U_folk and U_ambi are more broadly distributed over the four music clusters. This suggests that users of U_folk and U_ambi are more open to music outside of their own music cluster than users of U_hard and U_elec.

Figure 10
Radar plot illustrating the contribution of each music cluster to a subgroup. While the weight distribution of U_hard and U_elec is rather narrow, it is broader in the case of U_folk and U_ambi, suggesting that these groups are more open to music outside their own music cluster.

To better understand the correlations and connections between the music clusters and subgroups, we plot the Pearson correlation matrix of the four music clusters as a heatmap in Figure 11. Here, we represent each music cluster c by a 2,073-dimensional vector (i.e., one entry for each user) consisting of the user-cluster weights w_{u,c} introduced before. Each element in the matrix is then calculated using the Pearson correlation measure based on these cluster vectors. For example, if there is a positive correlation between two clusters, we assume that a user who enjoys music from one cluster likely also enjoys music from the other cluster. This can also give us an indication of the openness of a subgroup towards music mainly listened to by other subgroups. Specifically, for C_folk, we see a positive correlation between C_folk and C_ambi, and a negative correlation between C_folk and both C_hard and C_elec. Users listening to the music of C_hard seem to represent the most closed subgroup, as C_hard solely has negative correlations with all other clusters, especially with C_ambi and C_elec. In contrast, users listening to the music of C_ambi seem to represent the most open subgroup, as C_ambi has positive correlations with two other clusters, i.e., C_folk and C_elec. The fourth cluster, C_elec, is negatively correlated with C_folk and especially with C_hard, and positively correlated with C_ambi. These results are also in line with the ones shown in Figure 10, in which we identify the users of U_ambi as more open music listeners than the ones of U_hard.

Figure 11
Pearson correlation matrix of the four music clusters. While C_hard has solely negative correlations with all other clusters, and thus, listeners of C_hard seem to be the most closed subgroup, C_ambi has positive correlations with C_folk and C_elec, and thus, listeners of C_ambi seem to be the most open subgroup.

In order to relate the openness of the subgroups to the diversity of the users within the subgroups, we calculate the average pairwise user similarity using the cosine similarity metric computed on the users’ genre distributions, i.e., number of listening events per genre.
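The two measures used above, i.e., the Pearson correlation between cluster weight vectors and the average pairwise cosine similarity between users' genre distributions, can be sketched in plain Python. The function names and toy data are hypothetical, and computing one average per user (its mean similarity to all other users of the subgroup) is an assumed reading of "average pairwise user similarity"; all profiles are assumed to be non-empty.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equally long weight vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def cosine(u, v):
    """Cosine similarity between two genre distributions {genre: listening events}."""
    dot = sum(count * v.get(genre, 0) for genre, count in u.items())
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v)


def average_similarity_per_user(profiles):
    """For each user, the mean cosine similarity to all other users of the subgroup."""
    result = []
    for i, u in enumerate(profiles):
        sims = [cosine(u, v) for j, v in enumerate(profiles) if j != i]
        result.append(sum(sims) / len(sims))
    return result


# Hypothetical weight vectors of two clusters over four users (one entry per user).
r = pearson([1.0, 2.0, 3.0, 4.0], [8.0, 6.0, 4.0, 2.0])

# Hypothetical genre distributions of three users in one subgroup.
profiles = [
    {"ambient": 10, "darkambient": 5},
    {"ambient": 20, "darkambient": 10},
    {"hardrock": 7},
]
avg_sims = average_similarity_per_user(profiles)
```

In the toy data, the two cluster vectors are perfectly negatively correlated, and the first two users share the same genre proportions, so their average similarity is pulled down only by the third user, who listens to an unrelated genre.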
Figure 12 shows the resulting boxplots for the four identified subgroups (i.e., U_folk, U_hard, U_ambi, and U_elec). Users in U_hard and U_elec have a rather small average pairwise user similarity and, thus, exhibit a more diverse listening behavior, whereas users in U_folk and U_ambi tend to listen to more similar music genres and, thus, have a narrow listening behavior within the group. Summed up, we find pronounced differences with respect to openness and diversity across the subgroups. Although U_ambi is the most open subgroup (i.e., also listens to music of other subgroups), it is also the least diverse subgroup (i.e., the users within the group listen to very similar music). That observation is in line with what is shown in Figures 7 and 8. Here, we see that C_ambi, i.e., the most tightly connected music cluster to U_ambi, contains the dominating genre “ambient” as well as genres that are strongly associated with this dominating genre (e.g., “darkambient”). For U_hard, we observe the opposite. While it is the least open subgroup, it is also the most diverse one (e.g., it contains “hardrock” as well as “hiphop” listeners).

Figure 12
Boxplots showing the average pairwise user similarity of the four subgroups using the cosine similarity calculated on the users’ genre distributions. While the users in U_hard and U_elec exhibit a more diverse listening behavior, users in U_folk and U_ambi tend to listen to more similar, i.e., less diverse, music genres.

In Section 3.5, we have shown that the recommendation accuracy of four personalized recommendation algorithms is significantly worse for BeyMS users than for MS users. Now, we extend this analysis and evaluate the recommendation accuracy of these algorithms for the four subgroups (i.e., U_folk, U_hard, U_ambi, and U_elec). Table 4 shows our results with respect to the mean absolute error (MAE). Additionally, we analyze these results with respect to statistically significant differences in Table 5 by performing ANOVA (α = .01) and a subsequent Tukey-HSD test (α = . ). The worst accuracy results (i.e., highest MAE scores) are reached for the U_hard subgroup. Next, U_folk, U_ambi and U_elec reach significantly better results (i.e., lower MAE scores) than U_hard for all algorithms. However, there is no statistically significant difference between the recommendation accuracy of U_folk and U_elec. The overall best accuracy results (i.e., lowest MAE scores) are reached for the U_ambi subgroup. These results are also statistically significant when compared with the other subgroups for the NMF algorithm. NMF also gives the overall best accuracy results for all subgroups, which is in line with our results presented in Section 3.5 and in our previous work [6].

Table 4
Mean absolute error (MAE) measurements for the four subgroups and four personalized recommendation algorithms. NMF (in bold) outperforms all other algorithms for all subgroups. Among the subgroups, the best accuracy results (i.e., lowest MAE scores) are reached by U_ambi, while the worst accuracy results (i.e., highest MAE scores) are reached by U_hard. To facilitate comparison, we also show the MAE measurements for the BeyMS and MS user groups.
Subgroup UserItemAvg UserKNN UserKNNAvg NMF
U_folk
U_hard
U_ambi
U_elec
BeyMS
MS

Table 5
Statistically significant differences between pairs of subgroups, as determined by ANOVA (α = . ) and a subsequent Tukey-HSD test (α = . ).
UserItemAvg UserKNN UserKNNAvg NMF
Subgroup U_folk U_hard U_ambi U_elec U_folk U_hard U_ambi U_elec U_folk U_hard U_ambi U_elec U_folk U_hard U_ambi U_elec
U_folk ** ** ** ** ** **
U_hard ** ** ** ** ** ** ** ** ** ** ** **
U_ambi ** ** ** ** ** ** **
U_elec ** ** ** ** **

Furthermore, we find a relationship between openness, diversity, and recommendation quality. Here, U_hard is the least open but most diverse subgroup and gets the worst recommendations, while U_ambi is the most open but least diverse subgroup and gets the best recommendations.
This is in line with the findings of [11], who have shown that users are more likely to accept recommendations from different groups (i.e., openness) rather than varied within a group (i.e., diversity). Thus, we find a relationship between the quality of recommendations provided to beyond-mainstream music listeners and openness as well as diversity patterns of these users.

Finally, in Figure 13, we visually compare the MAE scores reached by the best performing approach NMF for the four subgroups. Additionally, we depict the MAE score for BeyMS as a black dashed line and the MAE score for MS as a grey dashed line. We see that U_hard reaches worse results than BeyMS, while U_folk and U_elec reach slightly better results than BeyMS. Interestingly, U_ambi not only reaches better results than BeyMS but also better results than MS. Although this improvement over MS is not statistically significant (according to a one-tailed Mann-Whitney-U test with α = . ), it highlights the differences within the group of BeyMS users, where specific subgroups (i.e., U_hard) are disadvantaged in terms of recommendation accuracy by recommendation algorithms while others (i.e., U_ambi) are not.

Figure 13
Comparison of the mean absolute error (MAE) scores reached by NMF for the four subgroups with the ones reached by NMF for BeyMS (black dashed line) and MS (grey dashed line). While specific subgroups (i.e., U_hard) are treated in an unfair way by recommendation algorithms, others (i.e., U_ambi) are not.

In this paper, we shed light on the characteristics of beyond-mainstream music and music listeners. As our first contribution, we identified 2,074 beyond-mainstream music listeners (i.e., BeyMS) in the Last.fm platform, and subsequently created a novel dataset called LFM-BeyMS based on the listening histories of these users. We further enriched this dataset with (i) acoustic features of music tracks gathered from Spotify, and (ii) genre information of tracks derived from Last.fm tags and matched with the Spotify microgenre taxonomy. Additionally, for reasons of comparability, LFM-BeyMS contains data of 2,074 Last.fm users listening to mainstream music. Using this dataset, as our second contribution, we validated related research by showing that beyond-mainstream music listeners receive a significantly lower recommendation accuracy than mainstream music listeners by four standard recommendation algorithms (i.e., UserItemAvg, UserKNN, UserKNNAvg and NMF).

As our third contribution, we applied the clustering algorithm HDBSCAN* on the acoustic features of tracks listened to by
BeyMS and identified four clusters of beyond-mainstream music: (i) C folk , music with high acousticness such as “folk”, (ii) C hard ,high energy music such as “hardrock”, (iii) C ambi , music with high acousticness andinstrumentalness such as “ambient”, and (iv) C elec , music with high energy andinstrumentalness such as “electronica”.As our fourth contribution, we mapped these clusters to our BeyMS users, whichled to four beyond-mainstream subgroups: (i) U folk , (ii) U hard , (iii) U ambi , and(iv) U elec . We analyzed these subgroups with respect to their openness (i.e., across-groups diversity – do users of one group listen to music of other groups?) anddiversity (i.e., within-groups diversity – how dissimilar is the music listened to byusers within groups?). Here, we found large differences between U hard and U ambi .Although U hard is the most closed subgroup (i.e., users do not listen to music ofother subgroups), it is also the most diverse subgroup (i.e., users listen to a diverseset of genres such as “hardrock” and “hiphop”). For U ambi , we get opposite results:while it is the most open subgroup (i.e., users listen to music of other subgroups aswell), it is also the least diverse one (i.e., the users within the group listen to verysimilar music such as “ambient” and “darkambient”). We related these character-istics of the subgroups to the recommendation quality of the four recommendationalgorithms UserItemAvg, UserKNN, UserKNNAvg and NMF. Here, we found that U hard got music recommendations with lowest accuracy, while U ambi got musicrecommendations with highest accuracy. This is in line with related research [11],which has shown that openness is stronger correlated with accurate recommenda-tions than diversity. U ambi even received better recommendations than the group of owald et al. Page 25 of 28 mainstream music listeners. This result highlights that there are large differencesbetween the subgroups of beyond-music listeners. 
Finally, to foster reproducibility of our research, we provide our novel LFM-BeyMS dataset via Zenodo as well as our source code via GitHub.
We believe that our findings provide useful insights for creating user models and recommendation algorithms that better serve beyond-mainstream music listeners. As shown in [6], beyond-mainstream music listeners tend to have larger user profiles than users interested in mainstream music, which means that they provide a substantial amount of listening interaction data for services such as Last.fm and Spotify. We assume that improving the recommendation quality for this active user group also leads to another effect, namely a more prominent exposure of (long-tail) music artists due to a better-connected recommendation network [80]. We leave such investigations to future work.
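The group-level accuracy comparison summarized in this section can be illustrated with a short sketch. It assumes the mean absolute error (MAE) as the accuracy measure, where a lower value means more accurate recommendations. The function name `mae_per_group`, the record layout, and the toy values are hypothetical and are not results from our experiments.

```python
from collections import defaultdict


def mae_per_group(records):
    """Compute the MAE separately for each user group.

    records: iterable of (group, observed_value, predicted_value) tuples.
    """
    abs_errors = defaultdict(list)
    for group, observed, predicted in records:
        abs_errors[group].append(abs(observed - predicted))
    return {group: sum(errs) / len(errs) for group, errs in abs_errors.items()}


# Toy values (hypothetical, chosen only to make the arithmetic easy to follow).
toy_records = [
    ("BeyMS", 4.0, 2.0),
    ("BeyMS", 1.0, 2.0),
    ("MS", 3.0, 2.5),
    ("MS", 2.0, 2.5),
]
result = mae_per_group(toy_records)
print(result)  # {'BeyMS': 1.5, 'MS': 0.5}
```

The same function could be applied to subgroup labels such as U_hard or U_ambi instead of BeyMS and MS, which is the kind of comparison that reveals whether one subgroup is served worse than another.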
Limitations and future work.
Despite the merits of this work, we are aware of its limitations. The first limitation we recognize is that our analyses are based on a sample of the Last.fm community. The extent to which the listening behavior of this sample is representative of the Last.fm community at large, or of similar music streaming communities such as Spotify, needs further investigation.
Next, since we conducted a comparative study of the accuracy of recommender systems algorithms, and were therefore not interested in beating state-of-the-art algorithms, we focused on traditional algorithms (e.g., KNN-based collaborative filtering) instead of investigating the most recent deep learning architectures, which would also require a much higher computational effort. Furthermore, an award-winning paper by Dacrema et al. [81] has recently shown that traditional algorithms are able to outperform almost all of the deep learning architectures examined there.
While our work serves as a first milestone towards better characterizing beyond-mainstream music and listeners of such music, future work should focus on user modeling techniques that individually target the different subgroups, for example by integrating knowledge about openness and diversity. With respect to analyzing the openness and diversity of users and user groups, we would also like to work on a more formal definition of these dimensions, which would not only allow us to measure them more precisely but also to integrate them into the recommendation calculation process.
Additionally, since previous research has shown that the listener’s cultural background impacts the quality of music recommendations [47], we plan to compare the cultural and socioeconomic aspects of beyond-mainstream and mainstream music listeners. We plan to capture these aspects by means of Hofstede’s cultural dimensions [82] and the World Happiness Report [83].
Finally, another avenue for future work is research in the area of fair music recommender systems.
Here, we plan to build user models that are capable of accounting for the complex characteristics of beyond-mainstream music listeners presented in this paper. While we believe that more specialized user models could help to provide better recommendations for users who currently receive worse recommendations (e.g., the U_hard subgroup identified in this paper), we also aim to highlight that such user models still need to be generalizable to avoid any unfair treatment of other users. Hence, future research should work on achieving a specialization-generalization trade-off in music recommender systems. We hope that our open LFM-BeyMS dataset as well as our source code will be of use to the scientific community for subsequent analyses.
Availability of Data and Materials
The CultMRS dataset can be found on Zenodo: https://doi.org/10.5281/zenodo.3477842. Additionally, we provide our novel LFM-BeyMS dataset via Zenodo: https://doi.org/10.5281/zenodo.3784764. Our Python-based implementations are available via GitHub: https://github.com/pmuellner/supporttheunderground.
Competing interests
The authors declare that they have no competing interests.
Funding
This work is funded by the TU Graz Open Access Publishing Fund and the Austrian Science Fund (FWF): V579.
Authors’ contributions
All authors contributed to manuscript revision, read, and approved the submitted version.
Acknowledgements
This work is supported by the Know-Center GmbH within the Austria FFG COMET program.
Endnotes
a.
b. https://doi.org/10.5281/zenodo.3784764
c. https://github.com/pmuellner/supporttheunderground
d. https://developer.spotify.com/web-api/get-several-audio-features/
e.
f. https://developer.spotify.com/web-api/search-item/
g. https://developer.spotify.com/documentation/web-api/reference-beta/
h.
i. http://everynoise.com/
j. http://surpriselib.com/
Author details
Know-Center GmbH, Graz, Austria. University of Innsbruck, Innsbruck, Austria. Utrecht University, Utrecht, The Netherlands. Johannes Kepler University Linz, Linz, Austria. Linz Institute of Technology AI Lab, Linz, Austria. Graz University of Technology, Graz, Austria.
References
1. Schedl, M., Knees, P., McFee, B., Bogdanov, D., Kaminskas, M.: Music recommender systems. In: Recommender Systems Handbook, pp. 453–492 (2015)
2. Luo, X., Zhou, M., Xia, Y., Zhu, Q.: An efficient non-negative matrix-factorization-based approach to collaborative filtering for recommender systems. IEEE Transactions on Industrial Informatics (2), 1273–1284 (2014)
3. Herlocker, J.L., Konstan, J.A., Terveen, L.G., Riedl, J.T.: Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems (TOIS) (1), 5–53 (2004)
4. Abdollahpouri, H., Mansoury, M., Burke, R., Mobasher, B.: The unfairness of popularity bias in recommendation. In: RMSE Workshop at ACM Recsys (2019)
5. Celma Herrada, Ò., et al.: Music recommendation and discovery in the long tail. PhD thesis, Universitat Pompeu Fabra (2009)
6. Kowald, D., Schedl, M., Lex, E.: The unfairness of popularity bias in music recommendation: A reproducibility study. In: European Conference on Information Retrieval, pp. 35–42 (2020). Springer
7. Celma, Ò., Cano, P.: From hits to niches?: or how popular artists can bias music recommendation and discovery. In: Proceedings of KDD’2008 (Netflix Prize Workshop) (2008)
8. Celma, O.: Music Recommendation and Discovery – The Long Tail, Long Fail, and Long Play in the Digital Music Space (2010)
9. Oord, A.v.d., Dieleman, S., Schrauwen, B.: Deep content-based music recommendation. In: Proceedings of NIPS’2013, pp. 2643–2651. Curran Associates Inc., USA (2013)
10. Goel, S., Broder, A., Gabrilovich, E., Pang, B.: Anatomy of the long tail: ordinary people with extraordinary tastes. In: Proceedings of the Third ACM International Conference on Web Search and Data Mining, pp. 201–210 (2010)
11. Tintarev, N., Dennis, M., Masthoff, J.: Adapting recommendation diversity to openness to experience: A study of human behaviour. In: Carberry, S., Weibelzahl, S., Micarelli, A., Semeraro, G. (eds.) User Modeling, Adaptation, and Personalization, pp. 190–202. Springer, Berlin, Heidelberg (2013)
12. Schedl, M., Zamani, H., Chen, C.-W., Deldjoo, Y., Elahi, M.: Current challenges and visions in music recommender systems research. International Journal of Multimedia Information Retrieval (2), 95–116 (2018)
13. Haas, R., Brandes, V.: Music that Works: Contributions of Biology, Neurophysiology, Psychology, Sociology, Medicine and Musicology (2010)
14. Adorno, T.W.: Introduction to the Sociology of Music (1988)
15. Deutsch, D.: Psychology of Music (2013)
16. Laplante, A.: Improving music recommender systems: What can we learn from research on music tastes? In: Proceedings of the International Society for Music Information Retrieval Conference (2014)
17. Rentfrow, P.J., Gosling, S.D.: The content and validity of music-genre stereotypes among college students. Psychology of music (2), 306–326 (2007)
18. Kim, Y., Aiello, L.M., Quercia, D.: Pepmusic: motivational qualities of songs for daily activities. EPJ Data Science (1), 13 (2020)
19. Juslin, P.N., Sloboda, J.A.: Music and Emotion: Theory and Research (2001)
20. Zentner, M., Grandjean, D., Scherer, K.R.: Emotions evoked by the sound of music: characterization, classification, and measurement. Emotion (4), 494 (2008)
21. Juslin, P.N., Laukka, P.: Expression, perception, and induction of musical emotions: A review and a questionnaire study of everyday listening. Journal of new music research (3), 217–238 (2004)
22. Yang, Y.-H., Chen, H.H.: Music Emotion Recognition (2011)
23. Ferwerda, B., Schedl, M., Tkalcic, M.: Personality & emotional states: Understanding users’ music listening needs. CEUR-WS.org (2015)
24. Goldberg, L.R.: The structure of phenotypic personality traits. American psychologist (1), 26 (1993)
25. Schubert, E.: The influence of emotion, locus of emotion and familiarity upon preference in music. Psychology of Music (3), 499–515 (2007)
26. Pereira, C.S., Teixeira, J., Figueiredo, P., Xavier, J., Castro, S.L., Brattico, E.: Music and emotions in the brain: Familiarity matters. PLoS ONE (11) (2011)
27. Moore, J.L., Chen, S., Turnbull, D., Joachims, T.: Taste over time: The temporal dynamics of user preferences. In: Proceedings of the International Society for Music Information Retrieval Conference, pp. 401–406 (2013)
28. Barone, M.D., Bansal, J., Woolhouse, M.H.: Acoustic features influence musical choices across multiple genres. Frontiers in psychology, 931 (2017)
29. Gong, B., Kaya, M., Tintarev, N.: Contextual personalized re-ranking of music recommendations through audio features. Master’s thesis, TU Delft (2020)
30. Zangerle, E., Pichl, M.: Content-based user models: Modeling the many faces of musical preference. In: Proceedings of the 19th International Society for Music Information Retrieval Conference 2018 (ISMIR 2018), pp. 709–716 (2018)
31. Ekstrand, M.D., Tian, M., Azpiazu, I.M., Ekstrand, J.D., Anuyah, O., McNeill, D., Pera, M.S.: All the cool kids, how do they fit in?: Popularity and demographic biases in recommender evaluation and effectiveness. In: Conference on Fairness, Accountability and Transparency, pp. 172–186 (2018)
32. Brynjolfsson, E., Hu, Y.J., Smith, M.D.: From niches to riches: Anatomy of the long tail. Sloan Management Review (4), 67–71 (2006)
33. Jannach, D., Lerche, L., Kamehkhosh, I., Jugovac, M.: What recommenders recommend: an analysis of recommendation biases and possible countermeasures. User Modeling and User-Adapted Interaction (5), 427–491 (2015)
34. Harper, F.M., Konstan, J.A.: The movielens datasets: History and context. Acm transactions on interactive intelligent systems (tiis) (4), 1–19 (2015)
35. Cheng, R., Tang, B.: A music recommendation system based on acoustic features and user personalities. In: Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 203–213 (2016). Springer
36. Kaminskas, M., Ricci, F., Schedl, M.: Location-aware music recommendation using auto-tagging and hybrid matching. In: Proceedings of RecSys’2013, p. 8. ACM, Hong Kong, China (2013)
37. Donaldson, J.: A hybrid social-acoustic recommendation system for popular music. In: Proceedings of RecSys’2007, pp. 187–190. ACM, New York, NY, USA (2007)
38. Aggarwal, C.C.: Ensemble-based and hybrid recommender systems. In: Recommender Systems, pp. 199–224 (2016)
39. Zangerle, E., Pichl, M.: Content-based user models: Modeling the many faces of musical preference. In: 19th International Society for Music Information Retrieval Conference (2018)
40. Lee, K., Lee, K.: My head is your tail: applying link analysis on long-tailed music listening behavior for music recommendation. In: Proceedings of the 5th ACM Conference on Recommender Systems, pp. 213–220 (2011)
41. Lex, E., Kowald, D., Schedl, M.: Modeling popularity and temporal drift of music genre preferences. Transactions of the International Society for Music Information Retrieval (1) (2020)
42. Kowald, D., Kopeinik, S., Lex, E.: The tagrec framework as a toolkit for the development of tag-based recommender systems. In: Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization, pp. 23–28 (2017)
43. Bauer, C.: Allowing for equal opportunities for artists in music recommendation. In: 1st Workshop on Designing Human-Centric MIR Systems (2019)
44. Pichl, M., Zangerle, E., Specht, G.: Understanding playlist creation on music streaming platforms. In: IEEE International Symposium on Multimedia, ISM 2016, pp. 475–480 (2016)
45. Andersen, J.S.: Using the echo nest’s automatically extracted music features for a musicological purpose. In: 4th International Workshop on Cognitive Information Processing (CIP), pp. 1–6 (2014)
46. McVicar, M., Freeman, T., De Bie, T.: Mining the correlation between lyrical and audio features and the emergence of mood. In: Proceedings of the 11th International Society for Music Information Retrieval Conference (ISMIR 2011), pp. 783–788 (2011)
47. Zangerle, E., Pichl, M., Schedl, M.: User models for culture-aware music recommendation: Fusing acoustic and cultural cues. Transactions of the International Society for Music Information Retrieval (TISMIR) (1), 1–16 (2020). doi:10.5334/tismir.37
48. Schedl, M.: The lfm-1b dataset for music retrieval and recommendation. In: Proceedings of the 2016 ACM on International Conference on Multimedia Retrieval, pp. 103–110 (2016). ACM
49. Zangerle, E.: Culture-Aware Music Recommendation Dataset. https://doi.org/10.5281/zenodo.3477842. doi:10.5281/zenodo.3477842
50. Bauer, C., Schedl, M.: Global and country-specific mainstreaminess measures: Definitions, analysis, and usage for improving personalized music recommendation systems. PLOS ONE (6), 1–36 (2019). doi:10.1371/journal.pone.0217389
51. Kendall, M.G.: A new measure of rank correlation. Biometrika (1/2), 81–93 (1938)
52. Sheather, S.J.: Density estimation. Statistical science, 588–597 (2004)
53. Davis, R.A., Lii, K.-S., Politis, D.N.: Remarks on some nonparametric estimates of a density function. In: Selected Works of Murray Rosenblatt, pp. 95–100. Springer, New York (2011)
54. Quarteroni, A., Sacco, R., Saleri, F.: Numerical Mathematics. Springer, New York (2007)
55. Jones, K.S.: A statistical interpretation of term specificity and its application in retrieval. Journal of documentation (1972)
56. Schedl, M., Bauer, C.: Distance- and rank-based music mainstreaminess measurement. In: Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization, pp. 364–367 (2017). ACM
57. Koren, Y.: Factor in the neighbors: Scalable and accurate collaborative filtering. ACM Transactions on Knowledge Discovery from Data (TKDD) (1), 1 (2010)
58. Willmott, C.J., Matsuura, K.: Advantages of the mean absolute error (mae) over the root mean square error (rmse) in assessing average model performance. Climate research (1), 79–82 (2005)
59. Moore, J.L., Chen, S., Joachims, T., Turnbull, D.: Learning to embed songs and tags for playlist prediction. In: ISMIR, vol. 12, pp. 349–354 (2012)
60. Levy, M., Sandler, M.: Learning latent semantic models for music from social tags. Journal of New Music Research (2), 137–150 (2008)
61. Tipping, M.E., Bishop, C.M.: Mixtures of probabilistic principal component analyzers. Neural computation (2), 443–482 (1999)
62. Roweis, S.T., Saul, L.K.: Nonlinear dimensionality reduction by locally linear embedding. science (5500), 2323–2326 (2000)
63. Kruskal, J.B.: Nonmetric multidimensional scaling: a numerical method. Psychometrika (2), 115–129 (1964)
64. Tenenbaum, J.B., De Silva, V., Langford, J.C.: A global geometric framework for nonlinear dimensionality reduction. science (5500), 2319–2323 (2000)
65. Ng, A.Y., Jordan, M.I., Weiss, Y.: On spectral clustering: Analysis and an algorithm. In: Advances in Neural Information Processing Systems, pp. 849–856 (2002)
66. Maaten, L.v.d., Hinton, G.: Visualizing data using t-sne. Journal of machine learning research (Nov), 2579–2605 (2008)
67. McInnes, L., Healy, J., Saul, N., Großberger, L.: Umap: Uniform manifold approximation and projection. Journal of Open Source Software (29), 861 (2018)
68. McInnes, L., Healy, J., Saul, N., Grossberger, L.: Umap: Uniform manifold approximation and projection. The Journal of Open Source Software (29), 861 (2018)
69. Ester, M., Kriegel, H.-P., Sander, J., Xu, X., et al.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Kdd, vol. 96, pp. 226–231 (1996)
70. Bishop, C.M.: Pattern Recognition and Machine Learning, pp. 424–429. Springer, New York (2006)
71. Reynolds, D.: Gaussian mixture models. Encyclopedia of biometrics, 827–832 (2015)
72. Frey, B.J., Dueck, D.: Clustering by passing messages between data points. science (5814), 972–976 (2007)
73. Shi, J., Malik, J.: Normalized cuts and image segmentation. Departmental Papers (CIS), 107 (2000)
74. Murtagh, F., Legendre, P.: Ward’s hierarchical agglomerative clustering method: which algorithms implement ward’s criterion? Journal of classification (3), 274–295 (2014)
75. Ankerst, M., Breunig, M.M., Kriegel, H.-P., Sander, J.: Optics: ordering points to identify the clustering structure. In: ACM Sigmod Record, vol. 28, pp. 49–60 (1999). ACM
76. McInnes, L., Healy, J.: Accelerated hierarchical density based clustering. In: Data Mining Workshops (ICDMW), 2017 IEEE International Conference On, pp. 33–42 (2017). IEEE
77. Yoo, S., Lee, K.: A data-driven approach to identifying music listener groups based on users’ playrate distributions of listening events. In: Adjunct Publication of the 25th Conference on User Modeling, Adaptation and Personalization, pp. 77–81 (2017)
78. McInnes, L., Healy, J., Astels, S.: hdbscan: Hierarchical density based clustering. The Journal of Open Source Software (11) (2017)
79. York, W.: Voices from hell–the dark, not-so-dulcet cookie monster vocals of extreme metal. The San Francisco Bay Guardian, 14–20 (2004)
80. Lamprecht, D., Strohmaier, M., Helic, D.: A method for evaluating discoverability and navigability of recommendation algorithms. Computational social networks 4