Music Data Analysis: A State-of-the-art Survey
Shubhanshu Gupta
Dhirubhai Ambani Institute of Information and Communication Technology
shubhanshu [email protected]
Abstract
Music accounts for a significant share of interest among various online activities. This is reflected by the wide array of alternatives offered in music-related web and mobile apps and information portals, featuring millions of artists, songs and events, and attracting user activity at a similar scale. The availability of large-scale structured and unstructured data has attracted a similar level of attention from the data science community. This paper attempts to capture the current state of the art in music-related analysis. Various approaches involving machine learning, information theory, social network analysis, semantic web and linked open data are represented in the form of a taxonomy, along with the data sources and use cases addressed by the research community.
Music accounts for a significantly large part of online activity today, with the availability of various online music stores, streaming services, news and podcast services, social networks, and even cloud-based personal music collections. With these developments, we are moving towards an interesting contrasting trend. In the days of physical record sales (discs, cassette tapes, records) it was easy to keep track of sales but difficult or impossible to track the number of times records were played by listeners on their music systems. As music is increasingly released, distributed, played and discussed online, it has become possible to keep track of these individual aspects by analyzing the data in near real time. However, with millions of songs, artists and events touched by potentially billions of listeners online, the resulting online activity opens up vast avenues of research for the data science community. While interested in exploring research opportunities involving current Big Data technologies, we first attempt to capture the current state of the art in music data analysis, to clearly identify the kinds of datasets, features and analytical techniques used by the research community to support various applications and use cases.

We structure our survey around use cases and attempt to select representative research employing each class of techniques. We also structure the commentary to reveal the kinds of datasets utilized, features extracted, analytical techniques applied and variations in the experiments performed, along with the outcome of each effort.
Music data analysis is widely used for the automated prediction or recognition of various musical aspects such as musical style, genre, mood, emotion, onsets and melodic sequences, along with predicting the success of a song. Appropriate musical features associated with these aspects are identified, extracted, processed and subjected to various analytical techniques for the purpose of prediction. This use case is especially useful for automated tagging of songs, synthesis of new music, and determining the potential success that a song might garner.
In our survey, we came across both supervised and semi-supervised learning algorithms applied to music analysis. Supervised statistical classification methods, including naive Bayes, linear classifiers and neural network approaches, can be used to recognize musical style, that is, to classify music as being played lyrically, frantically, pointillistically, with syncopation, high, low, quote or blues. The dataset consisted of trumpet performances of various music styles recorded as MIDI from actual performances; it covered a total of 8 styles, each consisting of 25 examples, resulting in 1200 five-second training examples. A real-time music style classifier has also been built that employs a naive Bayes classifier, a linear classifier and neural networks using 13 low-level features extracted from MIDI data. All the training examples in the dataset were used for classifying improvisational style rather than for music feature selection and feature learning [7].

In a separate approach, music style modeling consists of deriving a mathematical model, such as a set of stochastic rules, from a set of musical examples. The dataset consisted of several MIDI files, including polyphonic instrumental and piano music in styles ranging from the early renaissance and baroque to hard-bop jazz, drawn from eclectic sources. Some of the regularity in the composition process can be captured by using statistical and information-theoretic tools to analyze musical pieces, and the resulting model can be used to infer and predict music style. These tools consist of dictionary-based methods and selective dictionary-based methods. The former operates by parsing an existing musical text into a lexicon of phrases or patterns, called motifs, and then provides a rule that infers which musical object to choose next so as to best follow the current past context; it uses the Incremental Parsing algorithm to build a dictionary of distinct motifs. The latter, with the help of the Prediction Suffix Trees algorithm, builds a restricted dictionary of only those motifs that both appear a significant number of times throughout the complete source sequence and, at the same time, are meaningful for predicting the immediate future [9]. Additional work is under way towards more general real-time performance systems based on OpenMusic (a Lisp-based open-source visual programming environment for music composition and analysis) that can capture the music style while interacting with several performers and responding at the same time.

Figure 1: Music data analysis approaches applied to style prediction/recognition
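To make the dictionary-based modeling of [9] concrete, the following is a minimal sketch of the LZ78-style Incremental Parsing step that builds a motif lexicon from a pitch sequence. The toy melody and the omission of continuation statistics are our simplifications, not part of the original system.

```python
# A minimal sketch of the Incremental Parsing (LZ78-style) step used by
# dictionary-based style modeling: the pitch sequence is parsed into a
# lexicon of previously unseen motifs. The example melody is invented.

def incremental_parse(sequence):
    """Parse a sequence into a dictionary of distinct motifs (LZ78)."""
    motifs = set()
    phrase = []
    for symbol in sequence:
        phrase.append(symbol)
        key = tuple(phrase)
        if key not in motifs:
            motifs.add(key)   # new motif: store it and restart the phrase
            phrase = []
    return motifs

# MIDI pitch numbers of a toy melody (hypothetical data).
melody = [60, 62, 60, 62, 64, 60, 62, 64, 65, 64]
for motif in sorted(incremental_parse(melody)):
    print(motif)
```

A full system would additionally record continuation counts for each motif, so that the prediction rule described above can choose the most likely next musical object given the current context.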
Genre detection has been an active area of application in music data analysis, and various state-of-the-art techniques have been applied and reported by the research community. The most frequently used datasets for genre classification are GTZAN, ISMIR genre, ISMIR rhythm, and the Latin music database, containing about 698 to 3227 songs. The most common approach is statistical classification, using kNN, SVM, Random Forest, naive Bayes and J48 decision trees for large-scale music genre classification. Significant performance gains have been achieved by beat-aligned vector sequences of the features for large volumes of data. In order to capture the temporal domain, statistical moments were calculated over six different combinations of Echonest features, although there is still considerable room for a large-scale evaluation of the remaining features provided by the Million Song Dataset (MFCCs, i.e., Mel Frequency Cepstral Coefficients, chroma, loudness, tempo, key, danceability, hotness information, etc.) [25]. In another attempt, a kNN classifier combined with SMBGT was used for genre classification of symbolic music. The dataset consisted of 100 MIDI songs spanning four genres: classical, blues, rock, and pop. Although it was a novel approach, there were several shortcomings in arriving at a proper genre classification through SMBGT, a similarity measure combined with a kNN classifier. To begin with, a combination of several diverse, independently trained classifiers could be used instead of just kNN. Moreover, short segments of musical pieces could have been used for feature extraction. One of the most prominent gaps left to be worked upon is the analysis of polyphonic music instead of just MIDI [17]. Music genre recognition at web scale has been demonstrated using Linked Open Data and semantic web techniques: adopting the e-science approach for data- and compute-intensive jobs, an e-Research infrastructure was configured to run the NEMA (Network Environment for Music Analysis) genre classification workflow over the Jamendo free music collection, converted into semantic representations [8].

Figure 2: Music data analysis approaches applied to genre recognition
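The common pipeline described above can be illustrated with a short sketch: frame-level features are collapsed into one fixed-length vector per song via statistical moments and then classified with an SVM. The feature values below are random placeholders, not real GTZAN or Echonest data.

```python
# A minimal sketch of the common genre-classification pipeline: frame-level
# features (e.g., MFCCs) are summarized by statistical moments into one
# fixed-length vector per song, then fed to an SVM. Random stand-in data.
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

def song_vector(frames):
    """Summarize an (n_frames, n_coeffs) feature matrix by its moments."""
    return np.hstack([frames.mean(axis=0), frames.std(axis=0)])

# 200 hypothetical songs, 4 genres, 13 MFCC-like coefficients per frame.
X = np.array([song_vector(rng.normal(size=(300, 13))) for _ in range(200)])
y = rng.integers(0, 4, size=200)

clf = SVC(kernel="rbf", C=1.0)
print(cross_val_score(clf, X, y, cv=5).mean())
```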
The number of social, personal and other activities we are involved in during our day-to-day lives is proliferating, and each activity usually calls for a different mood. Music lovers therefore prefer listening to the kind of music that best fits their current mood, which is why mood classification of songs is gaining importance. MIR techniques used for mood classification of music include the Network Environment for Music Analysis (NEMA) system and the jMIR suite. MIR solutions that use semantic web technologies incorporate the GNAT and GNARQL software tools, which use the Music Ontology; the Sonic Visualiser and Sonic Annotator tools and their VAMP audio analysis plug-ins also use the Music Ontology. One methodology for automatic mood classification of songs relies on song data such as lyrics and metadata, with classification carried out through SVM and naive Bayes classifiers among the supervised learning algorithms, and through graph-based methods among the semi-supervised algorithms (NB: content-based naive Bayes classifier; GC-Oh: graph-based method by Oh et al.; GC-New: graph-based method with an extended neighbor function). The dataset consisted of about 6000 songs tagged with mood categories drawn from up to 132 predefined moods on the blog site LiveJournal, while all the lyrics came from the LyricWiki website. It was found that this framework and methodology was not sufficiently assertive for mood classification in a real music search engine system, but combining lyrics with proper audio information, artist data, sentiment words, and more weight on words in the chorus and title parts might fetch better accuracy in mood classification [6].

Besides MIR solutions for mood classification of songs, semantic information retrieval of music also plays an instrumental role in determining listeners' emotional responses. The framework in this case evolves from a low semantic concept level (the audio signal) to a high semantic concept level (mood). The inputs extracted from the web are various social and metadata-based pieces of information such as socio-cultural tags, editorial data and annotations. In the web extraction module, since social media is augmented with ever-growing rich context and social metadata information, an SVM-based music mood machine learning method helps in the audio feature extraction. Finally, the mood annotation is predicted via semantic association and semantic reasoning through the TBox (the ontology's terminology) and the ABox (the ontology's assertional axioms). The mood-oriented TBox is constructed on the basis of Music Ontology terms and refines the specific music mood; it has two main parts: a web-based part that refines high-level social metadata information and an audio-based part that refines audio-based information. The ABox is constructed with information extracted from raw audio and web sources. The web-based information is extracted from metadata-rich websites such as Last.fm and AllMusic to obtain ID3 metadata, tags, annotations, editorial information, comments, and so on. The dataset consists of about 1804 tracks covering about 21 major genres and 56 sub-genres and including 1022 different artists.
It was also observed that the accuracy of the mood annotation can improve to a large extent through the enrichment of other metadata, owing to the proliferating context-based social metadata information emerging from social media sites [29]. Whereas in the first study mood classification was carried out through statistical classification and graph-based machine learning methods, and it was suggested that combining audio information with lyrical data might yield better results, this was addressed in the second methodology, in which ontology-based methods were used to link audio information with the available web-based information; it simply outperformed all the other methods, confirming the earlier speculation.

Figure 3: Music data analysis approaches applied to mood analysis
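As a minimal illustration of the lyrics-based side of this methodology, the following sketch trains a naive Bayes classifier on bag-of-words lyric features, in the spirit of [6]. The four snippets and two mood labels are invented, not the LiveJournal/LyricWiki data.

```python
# A minimal sketch of lyrics-based mood classification with naive Bayes.
# Lyrics and mood labels below are invented placeholders.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

lyrics = [
    "sunshine dancing all night happy love",
    "tears falling rain alone goodbye",
    "party loud jump celebrate tonight",
    "cold empty silence missing you",
]
moods = ["happy", "sad", "happy", "sad"]

model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(lyrics, moods)
print(model.predict(["alone in the rain tonight"]))  # -> likely 'sad'
```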
Machine learning is also used for modeling fixed-length musical pitch sequences in monophonic melodies. Musical pitch serves as a starting point for more comprehensive analysis of sequential structures in music, which also includes other musical features and polyphonic structures. The Restricted Boltzmann Machine, an artificial neural network model, is used for learning sequences of musical pitch. The dataset consisted of 185 J.S. Bach chorale melodies, and it was found that although the neural probabilistic model accomplished the modeling of musical pitch sequences quite well, it is not yet known whether predictions can be improved by introducing other musical features such as note durations and intervals. Also, the current model has not been offered polyphonic music for modeling and analysis, so there is room for extending it to wider and bigger datasets instead of limiting it to the scope of monophonic melodies [5].
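Below is a generic numpy sketch of a Restricted Boltzmann Machine trained with one-step contrastive divergence (CD-1) on binarized pitch windows. It is not the exact model of [5]; the data are random stand-ins for one-hot melody encodings.

```python
# A minimal RBM trained with CD-1 on binary "pitch window" vectors.
import numpy as np

rng = np.random.default_rng(0)
n_visible, n_hidden, lr = 64, 32, 0.05
W = 0.01 * rng.normal(size=(n_visible, n_hidden))
b_v, b_h = np.zeros(n_visible), np.zeros(n_hidden)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V = (rng.random((500, n_visible)) < 0.1).astype(float)  # fake pitch windows

for epoch in range(20):
    # Positive phase: hidden activations driven by the data.
    p_h = sigmoid(V @ W + b_h)
    h = (rng.random(p_h.shape) < p_h).astype(float)
    # Negative phase: one Gibbs step down to the visible layer and up again.
    p_v = sigmoid(h @ W.T + b_v)
    p_h2 = sigmoid(p_v @ W + b_h)
    # CD-1 updates: data-driven minus model-driven statistics.
    W += lr * (V.T @ p_h - p_v.T @ p_h2) / len(V)
    b_v += lr * (V - p_v).mean(axis=0)
    b_h += lr * (p_h - p_h2).mean(axis=0)

print("reconstruction MSE:", float(np.mean((V - p_v) ** 2)))
```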
Besides the various kinds of music classification, it is sometimes necessary to obtain information about higher-level tasks under the music information retrieval paradigm, such as onset detection. An onset detection function helps in figuring out the starting points of musically relevant events in an audio stream, supporting tasks such as beat tracking, score following and music transcription. A peak-picking algorithm based on artificial neural networks (here, a bidirectional recurrent neural network), trained in a supervised manner, has been proposed for common onset detection functions. The dataset used for evaluation consisted of 321 audio excerpts covering different musical genres, performed on various instruments, with a total length of approximately 102 minutes and 25,927 annotated onsets [27]. It was found that, in comparison to existing hand-crafted methods such as basic peak-selection algorithms based on psychoacoustic theory and heuristics, the new neural-network-based peak-picking algorithm is able to clearly outperform them on existing onset detection functions such as spectral flux.
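For contrast with the learned peak picker, the following is a minimal sketch of the hand-crafted baseline: a spectral-flux onset detection function with simple threshold-based peak picking, run on a synthetic gated sine tone. Frame timing is approximate, and the threshold rule is a generic heuristic rather than the method evaluated in [27].

```python
# Spectral-flux onset detection with simple peak picking (numpy only).
import numpy as np

sr, hop, win = 22050, 512, 1024
t = np.arange(sr * 2) / sr
signal = np.sin(2 * np.pi * 440 * t) * (np.sin(2 * np.pi * 2 * t) > 0)

frames = np.lib.stride_tricks.sliding_window_view(signal, win)[::hop]
mags = np.abs(np.fft.rfft(frames * np.hanning(win), axis=1))

# Spectral flux: summed positive magnitude increase between frames.
flux = np.maximum(np.diff(mags, axis=0), 0).sum(axis=1)

# Peak picking: local maxima above a moving-average threshold.
thresh = np.convolve(flux, np.ones(10) / 10, mode="same") + 0.5 * flux.std()
onsets = [i for i in range(1, len(flux) - 1)
          if flux[i] > flux[i - 1] and flux[i] >= flux[i + 1]
          and flux[i] > thresh[i]]
print([round(i * hop / sr, 2) for i in onsets])  # onset times in seconds
```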
Song Hit Prediction

Machine learning is also used for predicting the success of songs even before they are released in the market, referred to as hit song science. Here, accurate models are built to predict whether a song will be a top-10 dance hit or not, for which a dataset of dance hits was retrieved comprising 21,692 instances with five features: song title, artist, position, peak position and date; these instances were retrieved from 3,452 out of 4,120 unique songs in the hit-list database. Five machine learning classification techniques are used to build the hit-song classifier: a C4.5 decision tree, logistic regression (a linear classifier within the statistical classification methods), Support Vector Machines (SVM), the RIPPER rule learner and naive Bayes. The results clearly show that logistic regression fares best in comparison to all the other techniques, followed by naive Bayes. Although it has been observed that machine learning is an effective means of identifying top hit songs, the use of music information retrieval systems has not yet been explored for hit song prediction. There has been some work on determining the popularity of a song based on acoustic, lyric, and human-based features, but these factors too have not been able to deliver conclusive results [13].
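A minimal sketch of hit prediction as binary classification with logistic regression, the technique reported to fare best in [13], is given below. The eight features and the labels are random stand-ins for the real chart data.

```python
# Hit prediction as binary classification with logistic regression.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 8))          # e.g. tempo, energy, danceability...
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=1000) > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("top-10 hit accuracy:", clf.score(X_te, y_te))
```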
Many areas of research in MIR involve music classification (genre and speech segmentation, emotion and chord recognition, playlist generation, audio-to-symbolic transcription, etc.). The fundamental inputs for music classification are musical data collections (called instances): audio recordings, scores and cultural data (e.g., playlists, album reviews, billboard statistics), along with metadata about the instances such as artist ID, title, composer, performer, genre and date. These collections undergo feature extraction, wherein features represent characteristic information about instances; finally, machine learning algorithms (classifiers and learners) learn to associate the feature patterns of instances with their classes. jMIR, a powerful, flexible and accessible software suite, has been developed to meet the need for standardized MIR research software to design, share and apply a wide range of automatic music classification technologies. jMIR has been designed to facilitate the extraction of meaningful information, available on the web, from audio recordings, symbolic musical representations and cultural information; it uses machine learning techniques to build classification models automatically; it collects profiling statistics and automatically detects metadata errors in musical collections; it supports experiments on music collections in both audio and symbolic formats, over large, stylistically diverse and well-labeled collections of music; and it helps in storing and distributing information in expressive, flexible, standardized file formats for use in automatic music classification. Significant performance gains for music classification were observed when features extracted from multi-modal information (audio recordings, symbolic recordings and cultural data) were combined, instead of using features from just one type of data [18].
Today, the amount of music available on online music stores is continuously increasing, even though they already house millions of downloadable songs in their catalogs. This leads to a requirement for intelligent music search algorithms that can discover and navigate several millions of tracks to find their acoustic neighbors. The filter-and-refine method, designed to work with very large databases, is based on FastMap, which allows quick music similarity processing; it uses Gaussian timbre models and the Kullback-Leibler divergence as the music similarity measure. The dataset used is a collection of 2.5 million songs in the form of 30-second snippets. FastMap is a Multi-Dimensional Scaling (MDS) technique; MDS is a widely used method for visualizing high-dimensional data. It takes as input the distance matrix of a set of items and maps the data to vectors in an arbitrary-dimensional Euclidean space; usually, higher dimensions yield a more accurate mapping of the original similarity space [26]. Since this music similarity method is designed for Gaussian timbre features under the symmetric Kullback-Leibler divergence, it was observed that it could be extended and generalized to other distance measures too.

Music similarity, which helps to understand why two pieces of music or two artists are perceived as alike by a listener who may be able to state the resemblance between two songs but not quantify their similarity, has been the subject of much research. Some work has addressed the measurement of similarity between music artists via text-based features extracted from web pages. Music similarity not only helps find the acoustic neighbors of a particular piece but also supports automated playlist generation, music recommender systems, music information systems and intelligent user interfaces for accessing music collections. Hence, there is enormous room for text-based features extracted from artist-related web pages to contribute to context-based music similarity research. The dataset for this approach was constructed by querying a search engine for each artist and thereby building a collection of web pages, which might take the form of fan pages, biographies, album reviews, track lists, and so on. No matter how many web pages are retrieved for an artist, the whole collection is considered one large virtual document describing that artist. Web-based music similarity estimation therefore revolves around constructing text-based feature vectors for IR purposes: term frequency, inverse document frequency, virtual document modeling, normalization with respect to page length, and a similarity function. The term frequency of a term in a document estimates the importance the term carries for the document (i.e., the artist). The inverse document frequency estimates the overall importance of the term in the whole corpus. Virtual document modeling relates to the way individual documents retrieved for the same artist are aggregated. The different similarity functions estimate the proximity between the term vectors of two documents or artists.
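These ingredients can be combined in a short sketch: each artist's crawled pages are merged into one virtual document, tf-idf vectors are computed, and cosine similarity compares artists, in the spirit of [24]. The documents below are invented placeholders for real crawled pages.

```python
# Web-based artist similarity from tf-idf vectors of virtual documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

virtual_docs = {
    "Artist A": "heavy metal guitar riffs album tour drummer",
    "Artist B": "metal band guitar album tour vocalist",
    "Artist C": "piano concerto orchestra symphony conductor",
}
names = list(virtual_docs)
X = TfidfVectorizer().fit_transform(virtual_docs.values())
sim = cosine_similarity(X)
for i, a in enumerate(names):
    for j, b in enumerate(names):
        if i < j:
            print(f"{a} ~ {b}: {sim[i, j]:.2f}")
```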
However, the interdependency between these variants leads to a problematic situation in which it becomes difficult to choose which combination (e.g., which variant of term frequency, which variant of similarity measure) would produce an overall winning result, even though each variant addresses text-based similarity estimation of music, a specific and important task in music information research [24]. It was also concluded that this methodology holds latent potential for the development of personalized music retrieval systems. Only text-based representations of music data derived from artist web pages have been mentioned above, but it is also possible to include data arising from user-generated content such as instant messages or posts and updates made on social networking websites; this would help in devising better music similarity measures driven by social factors, and the inclusion of such new datasets might also help improve music playlist generation systems.

Metric Learning to Rank (MLR) is an extension of the Structural Support Vector Machine (SVM) approach that can be applied to learning a Mahalanobis distance which, according to relative similarity ratings by users, captures perceived or stated music similarity; based on the MagnaTagATune dataset of acoustic music recordings, it can be applied in music exploration or recommendation systems. The MagnaTagATune dataset comes from TagATune, a web-based game that collects tags associated with songs in a human-computation manner; it consists of features and tagging information for 25,863 29-second audio clips generated from 5,405 source MP3s [28], [30]. It was also observed that this methodology need not be restricted to models of distance as a weighted linear combination of facets; it can be extended to an approach that incorporates a full Mahalanobis distance.

Figure 4: Music data analysis applied to similarity

Music similarity techniques also help in music retrieval and recommendation. By developing methods to identify and extract relevant entities (e.g., artists, full names, band line-ups, album and track titles, related artists) and the relationships between them, better and more multi-faceted similarity measures can be pursued. This leads to a possible solution for determining the members of a music band, i.e., which persons a band consists (or consisted) of, by analyzing texts from the web and taking for granted that any person who has been a member of a band at any point is considered a band member. Band member detection is a case of named entity recognition, which comprises the identification of proper names as well as the classification of these names, achieved through rule-based or supervised learning approaches. There are two rule-based approaches: the Hearst Pattern approach automatically extracts the line-up of a music band, where line-up information includes the member information and the corresponding roles the members play (for example, the instruments they play); another rule-based approach uses GATE (General Architecture for Text Engineering), an open-source framework, to automatically identify artist names, extract band-membership relations, and extract released albums and media for artists.
After the rule-based approaches, the supervised approach to band member detection proceeds as follows: named entity recognition in GATE, followed by extraction of band members using supervised learning algorithms such as hidden Markov models, decision trees, or support vector machines (SVM), with SVM chosen as the classifier here. To extract band members with an SVM classifier, the dataset is constructed by first querying Google for the members of 51 rock and metal bands, yielding a total of 5,028 web pages; secondly, the dataset includes band biographies fetched from the band-membership information of 34,238 bands with the help of the Echonest API, leading to a total of 38,753 biographies. Next, feature construction takes place, wherein two distinct SVM classifiers are trained to detect person entities to be marked as band members. Entity extraction is then carried out (to detect band members and assign a confidence score), followed by entity consolidation and member prediction, in which a list of potential band members is obtained from the named entity extraction step for each processed text. On the metal page set (the web pages obtained by querying the search engine), the advanced rule-based approach performs better than the supervised learning approach, whereas the opposite holds for the biography set, where the supervised learning approach performs better [16]. With this methodology for finding band members, it is also possible to generate highly valuable meta-information on music: every biography contains information on the band members, composers, musicians, vocalists, guitarists (if any), and so on. If all of this semantic information is extracted and properly annotated, it will not only provide unprecedented solutions in music retrieval and recommendation but also allow proper credit (and hence royalties) to be given to every artist who contributed in any way to the making of a song.
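The rule-based flavour can be illustrated with a small sketch that applies regular-expression line-up patterns of the kind described above; the patterns and the sample sentence are illustrative, not the exact patterns used in [16].

```python
# Rule-based band-member extraction with simple line-up patterns such as
# "<name> (vocals)" or "<name> on guitar". Patterns and text are invented.
import re

text = ("The band was founded by John Smith (vocals) and Ann Lee (drums); "
        "Bob Ray on guitar joined later.")

patterns = [
    r"([A-Z][a-z]+ [A-Z][a-z]+)\s*\((vocals|drums|guitar|bass|keyboards)\)",
    r"([A-Z][a-z]+ [A-Z][a-z]+)\s+on\s+(vocals|drums|guitar|bass|keyboards)",
]

members = set()
for pat in patterns:
    for name, role in re.findall(pat, text):
        members.add((name, role))
print(sorted(members))
```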
Besides fields like music genre identification, mood detection, style recognition, and music similarity, music emotion recognition is seeing compounding growth in research interest, because music enjoys a prominent status in human lives owing to its ability to elicit emotions, which are subject to our mood and to changes in physical condition and actions. Music emotion recognition can be performed using melodic features extracted from polyphonic music excerpts and machine learning algorithms. The dataset used was a set of 903 audio excerpts of 30 seconds each, organized in 5 relatively balanced clusters of 170, 164, 215, 191 and 163 excerpts respectively. Several supervised learning algorithms, namely Support Vector Machines (SMO, LibSVM), k-Nearest Neighbors, C4.5, Bayes Network, naive Bayes, and Simple Logistic, were run on Weka, a data mining and machine learning platform, with the best results achieved using SVM classifiers [23]. This methodology was applied to melodic audio features, which are of three types: pitch and duration, vibrato, and contour typology. It was found that if, along with melodic audio features, standard audio features (spectral shape features like centroid, spread, bandwidth and skewness among the low-level descriptors, and tempo, tonality and key among the high-level descriptors) are also incorporated for music emotion classification, both performance and classification accuracy might increase.

Initial works in music emotion recognition used an audio-based approach that treated music as associated with discrete emotion categories. Timbral, rhythmic and pitch features trained in Support Vector Machines (SVM) lead to large variations in the accuracy of estimating the different categories. A Backpropagation Neural Network (BPNN) recognizes the extent to which music pieces belong to four emotional categories: happiness, sadness, anger and fear. Two datasets, CAL500 and another consisting of approximately 21,000 clips from Magnatune, were modeled using statistical distributions of spectral, timbral and beat features with Multi-Label k-Nearest Neighbors (MLkNN), Calibrated Label Ranking (CLR), Backpropagation for Multi-Label Learning (BPMLL), Hierarchy of Multi-Label Classifiers (HOMER), Instance-Based Logistic Regression (IBLR), and Binary Relevance kNN (BRkNN) models. It was found that a CLR classifier using a Support Vector Machine outperformed all other approaches, while Decision Trees, BPMLL and MLkNN performed competitively [1]. It was also found that in order to improve the efficiency of music emotion recognition, mid- or high-level descriptors carrying semantic or syntactic meaning, such as genre and culture, moods and instruments, or rhythm and tempo, need to be incorporated, rather than just low-level descriptors (which revolve around tempo-related aspects of a song). Moreover, most current approaches to music emotion recognition do not account for the relationships that exist between features and emotion components, and hence fall short of faithful music emotion recognition. There also lies great scope in adopting semantic web ontologies in this field, which has not been delved into as of now.
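Among the multi-label models listed above, binary relevance with kNN is the simplest to sketch: one independent kNN classifier per emotion label, which is a simplified analogue of the BRkNN model. Features and labels below are random stand-ins for real spectral, timbral and beat descriptors.

```python
# Multi-label emotion recognition via binary relevance with kNN:
# one independent classifier per emotion label. Random stand-in data.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))            # audio feature vectors
Y = rng.integers(0, 2, size=(300, 4))     # happy/sad/angry/fearful labels

classifiers = [KNeighborsClassifier(n_neighbors=5).fit(X, Y[:, j])
               for j in range(Y.shape[1])]

x_new = rng.normal(size=(1, 20))
print([int(c.predict(x_new)[0]) for c in classifiers])  # one bit per label
```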
It has also been suggested that a multi-modal music emotion recognition model built by capitalizing on audio content and semantic association reasoning is bound to give promising results in performance; hence there remain immense possibilities for better yields in music emotion recognition.

Figure 5: Music data analysis applied to music emotion recognition

There also exist ways to maximize the performance of a music emotion recognition system based on the regression approach of machine learning. The dataset consisted of 50 ratings per clip for 288 clips, where a clip is an excerpt of a track. The various regression algorithms that approach music emotion recognition differently are: Linear Regression (LR), which assumes a linear relationship between input and output variables and minimizes the least-squares error; Regression Tree (RT), in which each leaf node is a numeric value rather than a class label; Locally Weighted Regression (LWR-SLR), which constructs a one-factor linear model on the fly based on nearby training points when presented with a test sample; Model Tree (M5P), in which each leaf node is a linear model rather than a numeric value, containing a number of parameters that must be optimized during training with a parameter search; Support Vector Regression (SVR-RBF), implemented in LIBSVM using the Radial Basis Function kernel; and Support Vector Regression with No Parameter Search (SVR-RBF-NP), in which the parameter values are hard-coded to sensible defaults, the rest being the same as SVR-RBF. Using all standardized features and a coarse-grid search for the best parameters, SVR with an RBF kernel performs best [14]. This methodology achieved strong results in music emotion recognition, but in order to scale up, efficient techniques are needed for gathering very large datasets properly annotated with emotion labels. Higher-level music features are also required, as ascertained by human music cognition. In addition, models of the temporal evolution of music can play a role in the advancement of music emotion recognition, along with the development of personalized systems that can predict the emotional characteristics and responses of people of culturally diverse backgrounds and tastes.
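The best-performing configuration above, SVR with an RBF kernel tuned by a coarse grid search over standardized features, can be sketched as follows; the 288 "clips" and their ratings are synthetic stand-ins for the annotated data of [14].

```python
# Emotion recognition as regression: SVR with RBF kernel, coarse grid search.
import numpy as np
from sklearn.svm import SVR
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import GridSearchCV

rng = np.random.default_rng(0)
X = rng.normal(size=(288, 15))                        # per-clip features
y = X[:, 0] * 0.7 + rng.normal(scale=0.3, size=288)   # fake mean ratings

grid = {"svr__C": [0.1, 1, 10], "svr__gamma": ["scale", 0.01, 0.1]}
model = make_pipeline(StandardScaler(), SVR(kernel="rbf"))
model = GridSearchCV(model, grid, cv=5).fit(X, y)
print(model.best_params_, model.best_score_)
```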
Audio Analysis

On similar grounds to jMIR, for audio analysis and audio-based music information retrieval there exists an open-source, cross-platform C++ library, Essentia 2.0, released under the Affero GPL license. The library houses an extensive collection of reusable algorithms implementing audio input/output functionality, standard digital signal processing blocks, statistical characterization of data, and a large set of spectral, temporal, tonal and high-level musical descriptors. Essentia provides algorithms for basic processing of audio streams (audio input/output and filtering); computation of low-level spectral descriptors; computation of time-domain descriptors; computation of tonal descriptors; computation of rhythm descriptors; computation of SFX descriptors; and, in addition to all these low-level descriptors, various mid- and high-level descriptors. Essentia has been used in a variety of research activities, with its major contributions in music classification, mood classification, semantic auto-tagging, music similarity and recommendation, visualization of and interaction with music, sound indexing, detection of musical instruments in polyphonies, cover detection, instrument solo detection, and acoustic analysis of stimuli for neuroimaging studies [3]. Essentia may see updates for real-time applications and the addition of new semantic categories to its set of high-level classifier-based descriptors.
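Typical Essentia usage in its standard (non-streaming) mode looks roughly like the sketch below; 'song.wav' is a placeholder path, and exact algorithm names and defaults may vary across Essentia versions.

```python
# A sketch of low-level descriptor extraction with Essentia's standard mode.
import essentia.standard as es

audio = es.MonoLoader(filename="song.wav")()   # audio input
window = es.Windowing(type="hann")             # standard DSP block
spectrum = es.Spectrum()
mfcc = es.MFCC()                               # low-level spectral descriptor

mfccs = []
for frame in es.FrameGenerator(audio, frameSize=1024, hopSize=512):
    bands, coeffs = mfcc(spectrum(window(frame)))
    mfccs.append(coeffs)
print(len(mfccs), "frames of MFCC coefficients")
```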
SoCo is a context-aware recommender system that incorporates thoroughly processed social network information and recommends music by applying the random decision trees algorithm, a statistical classification method of machine learning. It takes into account various kinds of contextual information: static context, which includes a user's age, gender, membership and role, or an item's category, cost and physical properties; dynamic context, which is associated with a rating's spontaneous information and might include a user's mood or location while rating an item; and social factors, which enable personalized and accurate music recommendations. Social factors bring a new perspective to recommendation because they carry a whole new set of information about a user's preference for an item (music), which can be inferred from the user's social circle of friends and followers, who are expected to share similar taste profiles [10].

dbrec, a music recommendation system based on Linked Data, has been built on top of DBpedia (chosen for two main reasons: the availability of data on more than 39,000 artists, and the availability of pictures and descriptions of artists, which is useful for building the system's user interface) and offers recommendations for more than 39,000 bands and solo artists. It uses the Linked Data Semantic Distance (LDSD) algorithm as the basis of its recommendation engine. The system was built along a roadmap that required: identifying the relevant subset of DBpedia; reducing the dataset so as to optimize the query process; computing the distances using the LDSD algorithm and representing them using its ontology; and, as a last step, building a user interface for browsing the recommendations [20].

It is also possible to let people find and recommend music and related content based on what they are consuming or producing by lifting social music data onto the semantic web. Rather than following conventional music recommendation practices such as collaborative filtering (recommending music to a user based on the stated tastes of other related users), content-based recommendation, or recommendation by modeling musical audio similarity, relationships between various types of data (social networks, published content, tags, artist information, etc.) are modeled in RDF from the social music websites. This is achieved by: interlinking FOAF (Friend Of A Friend) and linked data with various social networks so as to provide a complete, distributed and open social graph that can be queried and processed; SIOC (Semantically-Interlinked Online Communities), a shared semantics for representing user-generated data coming from various places in a common way, offering a model to represent the activities of online communities and their contributions; and the MOAT framework, which allows people to tag their content with URIs rather than simple keywords, after which the relationships between those URIs can be used to suggest related data. For example, when browsing a blog post about The Clash, such a recommender system would suggest browsing a picture tagged with the URI of Joe Strummer on Flickr, because both the blog post and the picture have a relationship defined in DBpedia.
In this way, the FOAF ontology is reused by SIOC and MO (the Music Ontology), and the linking of SIOC data to LOD (Linked Open Data) URIs is enabled by MOAT [21].

One of the challenging aspects of music recommendation is implementing a situation-aware personalized music recommendation service that takes both the user's situation and the user's preferences into consideration. This requires multidisciplinary effort, including human mood and emotion recognition from the extraction and analysis of low-level features (like beat, pitch, rhythm and tempo). Hence a new scheme, the Context-based Music Recommendation (COMUS) ontology, was devised for situation-aware, user-adaptive music recommendation services in the semantic web environment. COMUS provides various query interfaces to the user: Query By Situation (QBS), Query By Detailed Situation (QBDS), and Query By Mood (QBM). The Jena SPARQL engine is used for fetching and recommending purposes, i.e., for expressing and processing the necessary queries to the ontology [22]. The dataset used for this methodology is counted by the number of RDF triples, each representing a subject, a predicate and an object; the COMUS ontology is a collection of 826 OWL classes and instances parsed into 3,645 RDF triples. The methodology covered presenting and building an ontology based on context modeling and reasoning for the purpose of music recommendation, modeling the musical domain and capturing low-level musical features and factors that represent various music moods and situations, like time and location, which can influence the craving for different types of music.

Figure 6: Music data analysis applied to recommendation

Music information retrieval has also been a consistent approach for music search and recommendation, including the search for items related to a specific query song or set of songs. Various online communities provide a huge amount of user-generated browsing traces, reviews, playlists and recommendations, which can be analyzed through collaborative filtering methods to derive relationships between artists, songs and genres; these relationships in turn enable recommending music to users based on their music activity [4].

There also exist some novel methods of evaluating and recommending music, such as a user-agnostic evaluation method (a network-based evaluation applied to artists and large-scale user similarity graphs) based on the analysis of the item (or user) similarity network and item popularity. There is a system prototype, named FOAFing the Music, which provides music recommendations based on user preferences, listening habits, profiling, context-based information (extracted from music-related RSS feeds), and content-based descriptions (automatically extracted from the audio itself). There is also a music search engine, named Searchsounds, providing keyword-based search as well as the exploration of similar songs using audio similarity, thereby allowing users to discover music previously unknown to them [12].

Context-aware music recommendation retrieves and suggests music depending on the user's actual situation. For example, the user's emotional state (which can be influenced by age, gender, personality traits, socio-economic and cultural background, etc.), which varies over time and is quite complex for a machine to understand, influences the user's perception of music.
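In the spirit of COMUS's Query By Mood, a hypothetical SPARQL query over a toy RDF graph can be expressed with rdflib as below; the namespace, the property name hasMood, and the triples are invented for illustration and are not the actual COMUS vocabulary.

```python
# A hypothetical Query-By-Mood-style SPARQL query over a toy RDF graph.
from rdflib import Graph, Literal, Namespace

EX = Namespace("http://example.org/comus#")
g = Graph()
g.add((EX["track1"], EX.hasMood, Literal("calm")))
g.add((EX["track2"], EX.hasMood, Literal("energetic")))

query = """
PREFIX ex: <http://example.org/comus#>
SELECT ?track WHERE { ?track ex:hasMood "calm" . }
"""
for row in g.query(query):
    print(row.track)
```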
The importance of context in music recommendation has led to more refined research for better results. Several approaches improve the quality of recommendation. Collaborative filtering (CF) relies on user-generated content (ratings or implicit feedback): items are recommended to a user if they were liked by similar users; the dataset used in this case was derived from the Last.fm social network, which provides a weighted social graph among users, the tracks they play, and the tags they annotate the tracks with. The content-based approach relies on traditional music information retrieval techniques like acoustic fingerprinting or genre detection. The hybrid approach incorporates the following techniques: the scores resulting from various techniques are combined to produce a single recommendation; the system switches its recommendation judgment based on certain criteria, e.g., dataset properties or the quality of the produced recommendations; all the different techniques that produce recommendations are mixed and presented together; item features like ratings and content features from different recommendation techniques are thrown together into a single recommendation algorithm; one recommendation technique refines the output of another (for example, CF can be used to produce a ranking of the items and then content-based filtering can be applied to break ties); one recommender's output acts as input for another (for example, CF might be used to find items relevant to the target user, and this information is then used in the content-based approach); and the model learned by one recommender acts as input for the other, i.e., one system produces a model as input for a second system. Since it is not clear which of the two (CF or the content-based approach) has a bigger impact on recommendation quality, it is best to mix the two techniques in a hybrid approach for music recommendation [15].

Another aspect of music recommendation, relying heavily on machine learning, is the automatic prediction of tags for music and audio. Applying tags (user-generated keywords) to music can express, say, that a listener likes rock music with female voices. The dataset used for this comes from various sources. Social tags are those applied by humans to artists, albums or songs; they were gathered from sources like Last.fm, which contains more than 960,000 free-text tags and millions of annotated songs. Games: various tagging games have been developed to gather clean tag data; the MagnaTagATune dataset contains tags applied to about 20,000 songs and is the largest game dataset made available. Web documents: documents available on the internet can also be used to describe audio, but they contain a lot of noise. kNN (k-Nearest Neighbors) is one of the simplest and most effective ML techniques used for automatic tagging. Neural networks handle multi-label classification and regression cases and are thereby able to capture highly complex relations between audio and tags. HGMM, SVM, and Boosting are three of the best-performing algorithms for automatic tagging [2].
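The kNN tagging idea described above can be sketched in a few lines: a new clip inherits the tags most frequent among its nearest neighbours in audio-feature space. Features and tags below are invented stand-ins.

```python
# kNN-based automatic tagging: propagate the most frequent tags from the
# nearest neighbours of an untagged clip. Random stand-in data.
import numpy as np
from collections import Counter
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 12))                         # tagged clips
tags = [rng.choice(["rock", "female vocals", "mellow", "electronic"],
                   size=2, replace=False).tolist() for _ in range(100)]

nn = NearestNeighbors(n_neighbors=5).fit(X)
_, idx = nn.kneighbors(rng.normal(size=(1, 12)))       # a new, untagged clip
votes = Counter(tag for i in idx[0] for tag in tags[i])
print(votes.most_common(2))                            # propagated tags
```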
Similar to SoCo, there exists another system, called MyMusic, that exploits social media sources for generating personalized music playlists. It is based on information extracted from social networks like Facebook and Last.fm, used to carry out the personalization task of defining a model of user interest from a user's music-related preferences on social networks. The social-media-based playlist is enriched with new artists related to those the user already likes; specifically, two enrichment techniques are used: in the first, the knowledge stored in DBpedia is leveraged, whereas the second is based on the content-based similarity between descriptions of artists. Thereafter the final playlist is ranked and presented to the user for listening and feedback [19].

It is also possible to contextualize a playlist, a set of songs, as a recommendation engine with the help of novel multi-modal similarity measures integrating content-based similarity with artist relational social graphs. In an attempt to evaluate the application (driving a user-steerable radio station using complex similarity and community segmentation), playlists are compared on a novel low-dimensional song-level feature using social tag descriptors, which greatly improves the understanding and construction of playlists for music recommendation. The dataset used for these techniques was gathered from radio station logs [11]:

Source           Song entries   Untagged songs   Playlists   Avg. runtime   Mean songs/playlist
yes.com          885,810        2,543            70,190      55 min         12.62
Rock stations    105,952        865              9,414       53 min         11.25
Jazz stations    36,593         1,092            3,787       55 min         9.66
Radio Paradise   195,691        2,246            45,284      16 min         4.32
In this investigation, we attempted to identify the various aspects of music data analysis addressed by the research community. The datasets used for the various analyses reported here were identified and stated clearly. In most cases, datasets consist of relatively small numbers of audio files in the form of MIDI sequences, user-generated tags, accompanying web pages, or user context. While these types of musical datasets are certainly a critical part of the music domain, many other aspects remain untouched by such research efforts. Some of these include: music credits data; licensing and rights data; digital supply chain data; music sales and distribution data; cataloging, classification and archival data; music organization data; music-related standards data; live-events data; and studio-recordings data, among many others generated throughout the lifecycle of a music professional. These datasets can be maintained by various organizations at multiple levels of detail, accuracy and update frequency, with potential overlaps. As the majority of these data sources are updated constantly at varying frequencies, the task of integrating and managing the datasets will itself require the application of appropriate, currently available Big Data technologies. The next important factor is the features extracted from the datasets, selected according to the application or use-case requirements. In this survey we identified various high- and low-level features typically used by the community; however, additional musical features can be identified from the extended list of datasets above. Applying known analytical techniques to these novel features will open up opportunities for novel applications and use cases.
In this paper we attempted to offer a state-of-the-art survey of research efforts involving music data analysis. Our objective was to investigate and report the various analytical approaches adopted by the research community, focusing on unique musical features. This resulted in the depiction of a technology landscape of analytical techniques including machine learning, semantic web, social network analysis, information retrieval, statistics and information theory. Our analysis also identified opportunities for further exploration, keeping in mind the new possibilities offered by recent developments in the Big Data discipline.
References

[1] M. Barthet. Multidisciplinary Perspectives on Music Emotion Recognition: Implications for Content and Context-Based Models. Pages 19–22, June 2012.
[2] T. Bertin-Mahieux, D. Eck, and M. I. Mandel. Automatic Tagging of Audio: The State-of-the-Art. In W. Wang, editor, Machine Audition: Principles, Algorithms and Systems, chapter 14, pages 334–352. IGI Publishing, 2010.
[3] D. Bogdanov, N. Wack, E. Gómez, and S. Gulati. Essentia: An Audio Analysis Library for Music Information Retrieval. ISMIR, pages 2–7, 2013.
[4] K. Brandenburg, C. Dittmar, M. Gruhne, J. Abeßer, H. Lukashevich, P. Dunker, D. Gärtner, K. Wolter, and H. Grossmann. Music Search and Recommendation. Pages 349–384, Jan. 2009.
[5] S. Cherla, A. Garcez, and T. Weyde. A Neural Probabilistic Model for Predicting Melodic Sequences. 2013.
[6] T.-T. Dang and K. Shirai. Machine Learning Approaches for Mood Classification of Songs toward Music Search Engine. Pages 144–149. IEEE, Oct. 2009.
[7] R. B. Dannenberg, B. Thom, and D. Watson. A Machine Learning Approach to Musical Style Recognition. 1997.
[8] D. De Roure, K. R. Page, B. Fields, T. Crawford, J. S. Downie, and I. Fujinaga. An e-Research Approach to Web-Scale Music Analysis. Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 369(1949):3300–3317, Aug. 2011.
[9] S. Dubnov, G. Assayag, O. Lartillot, and G. Bejerano. Using Machine-Learning Methods for Musical Style Modeling. Computer, pages 3–10, Aug. 2003.
[10] X. Liu and K. Aberer. SoCo: A Social Network Aided Context-Aware Recommender System. Pages 781–791.
[11] B. Fields. Contextualize Your Listening: The Playlist as Recommendation Engine. 2011.
[12] Ò. Celma Herrada. Music Recommendation and Discovery in the Long Tail. 2008.
[13] D. Herremans and D. Martens. Dance Hit Song Science. Pages 1–4, 2013.
[14] A. Huq, J. P. Bello, and R. Rowe. Automated Music Emotion Recognition: A Systematic Evaluation. Journal of New Music Research, 39(3):227–244, Sept. 2010.
[15] M. Kaminskas and F. Ricci. Contextual Music Information Retrieval and Recommendation: State of the Art and Challenges. Computer Science Review, 6(2):89–119, 2012.
[16] P. Knees. An Approach to Automatic Music Band Member Detection Based on Supervised Learning.
[17] A. Kotsifakos, E. E. Kotsifakos, P. Papapetrou, and V. Athitsos. Genre Classification of Symbolic Music with SMBGT. In Proceedings of the 6th International Conference on PErvasive Technologies Related to Assistive Environments (PETRA '13), pages 1–7, New York, NY, USA, 2013. ACM Press.
[18] C. McKay. Automatic Music Classification with jMIR. PhD thesis, McGill University, 2010.
[19] C. Musto, G. Semeraro, P. Lops, M. de Gemmis, and F. Narducci. Leveraging Social Media Sources to Generate Personalized Music Playlists. Pages 112–123, Jan. 2012.
[20] A. Passant. dbrec: Music Recommendations Using DBpedia. 1380:1–16, 2007.
[21] A. Passant and Y. Raimond. Combining Social Music and Semantic Web for Music-Related Recommender Systems.
[22] S. Rho, S. Song, Y. Nam, E. Hwang, and M. Kim. Implementing Situation-Aware and User-Adaptive Music Recommendation Service in Semantic Web and Real-Time Multimedia Computing Environment. Multimedia Tools and Applications, 65(2):259–282, May 2011.
[23] B. Rocha, R. Panda, and R. P. Paiva. Music Emotion Recognition: The Importance of Melodic Features. Pages 1–4, 2013.
[24] M. Schedl, T. Pohle, P. Knees, and G. Widmer. Exploring the Music Similarity Space on the Web. ACM Transactions on Information Systems, 29(3):1–24, July 2011.
[25] A. Schindler and A. Rauber. Capturing the Temporal Domain in Echonest Features for Improved Classification Effectiveness. Proc. Adaptive Multimedia Retrieval, pages 1–15, 2012.
[26] D. Schnitzer, A. Flexer, and G. Widmer. A Filter-and-Refine Indexing Method for Fast Similarity Search in Millions of Music Tracks. Pages 537–542, Apr. 2009.
[27] S. Böck and J. Schlüter. Enhanced Peak Picking for Onset Detection with Recurrent Neural Networks. Pages 1–4, 2013.
[28] S. Stober and A. Nürnberger. An Experimental Comparison of Similarity Adaptation Approaches.
[29] J. Wang, X. Anguera, X. Chen, and D. Yang. Enriching Music Mood Annotation by Semantic Association Reasoning. Pages 1445–1450. IEEE, July 2010.
[30] D. Wolff and T. Weyde. Combining Sources of Description for Approximating Music Similarity Ratings. In M. Detyniecki, A. García-Serrano, A. Nürnberger, and S. Stober, editors, Adaptive Multimedia Retrieval: Large-Scale Multimedia Retrieval and Evaluation, volume 7836 of Lecture Notes in Computer Science. Springer.