Chord Embeddings: Analyzing What They Capture and Their Role for Next Chord Prediction and Artist Attribute Prediction
Allison Lahnala, Gauri Kambhatla, Jiajun Peng, Matthew Whitehead, Gillian Minnehan, Eric Guldan, Jonathan K. Kummerfeld, Anıl Çamcı, Rada Mihalcea
Department of Computer Science & Engineering, School of Information, and Department of Performing Arts Technology
University of Michigan, Ann Arbor MI 48109, USA
{alcllahn,gkambhat,pjiajun,mwwhite,gminn,eguldan,jkummerf,acamci,mihalcea}@umich.edu

Abstract.
Natural language processing methods have been applied in a variety of music studies, drawing the connection between music and language. In this paper, we expand those approaches by investigating chord embeddings, which we apply in two case studies to address two key questions: (1) what musical information do chord embeddings capture? and (2) how might musical applications benefit from them? In our analysis, we show that they capture similarities between chords that adhere to important relationships described in music theory. In the first case study, we demonstrate that using chord embeddings in a next chord prediction task yields predictions that more closely match those made by experienced musicians. In the second case study, we show the potential benefits of using the representations in tasks related to musical stylometrics.
Keywords: Chord Embeddings · Representation Learning · Musical Artificial Intelligence.
1 Introduction

Natural language processing (NLP) methods such as classification, parsing, or generation models have been used in many studies on music, drawing on the argument that music is a form of language. However, while word embeddings are an important piece of almost all modern NLP applications, embeddings over musical notations have not been extensively explored. In this paper, we explore the use of chord embeddings and argue that they are yet another NLP methodology that can benefit the analysis of music as a form of language. Our objectives are (1) to probe embeddings to understand what musical information they capture, and (2) to demonstrate the value of embeddings in two example applications.
Using word2vec [18] to create embeddings over chord progressions, we first perform qualitative analyses of the chord similarities captured by the embeddings using Principal Component Analysis (PCA). We then present two case studies on chord embedding applications, first in next chord prediction, and second in artist attribute prediction. To show their value, we compare models that use chord embeddings with ones using other forms of chord representations.

In the next chord prediction study, we provide a short chord sequence and ask what the next chord should be. We collected human annotations for the task as a point of reference. By comparing model predictions with the human annotations, we observe that models using chord embeddings yield chords that are more similar to the predictions of more experienced musicians. We also measure the system's performance on a larger set drawn from real songs. This task demonstrates a use case for chord embeddings that involves human perception, interaction, and composition. For the artist attribute prediction study, we perform binary classification tasks on artist type (solo performer or group), artist gender (when applicable), and primary country of the artist. Results on these tasks demonstrate that chord embeddings could be used in studies of musical style variations, including numerous studies in musicology.

This paper contributes analyses of the musical semantics captured in chord2vec embeddings and of their benefits to two different computational music applications. We find that the embeddings encode musical relationships that are important in music theory, such as the circle of fifths and relative major and minor chords. The case studies provide insight into how musical applications may benefit from using chord embeddings in addition to NLP methods that have previously been employed.
2 Related Work

Methods for learning word embeddings [18,22,23,4] have been useful for domains outside of language (e.g., network analysis [7]). Recent work has explored embeddings for chords, including an adaptation of word2vec [15], their use in a chord progression prediction module of a music generation application [3], and for aiding the analysis and visualization of musical concepts in Bach chorales [24]. However, understanding the musical information captured as latent features in the embeddings has been limited by the decision to ground evaluation in language modeling metrics (e.g., perplexity) rather than analyses of their behavior in downstream tasks. In this work, our first case study shows that language models with no remarkable differences in perplexity nonetheless exhibit remarkable relationships between their predictions and the experience of musicians; furthermore, we provide insights into what is captured by the embeddings.

NLP methods have benefited computational musicology topics such as authorship attribution [30], lyric analysis [6], and music classification tasks using audio and lyrics [16]. In the task of composer identification, many approaches draw inspiration from NLP, applying musical stylometry features and melodic n-grams [2,31,8]. One study used language modeling methods on musical n-grams to perform composer recognition [10], and another used a neural network over encodings of the pitches of musical pieces [12]. Similar methods have been used to study the stylistic characteristics of eight jazz composers [19,1] over chord sequences similar to ours.
While these studies operated on small datasets (on the order of hundreds of samples) to identify and analyze the music of a small set of musicians, we use a large dataset (on the order of tens of thousands of samples) and predict attributes of artists based on the music.

Our attribute prediction tasks are related to NLP work in authorship attribution and are motivated by studies on the connection between language and music in psychology [21,11], and on the intersection of society, music, and language, or sociomusicology [5,28]. For instance, Sergeant & Himonides studied the perception of gender in musical composition and found no significant match between a listener's guess of a composer's gender and their actual gender [27].
3 Data

We compile a dataset of 92,000 crowdsourced chord charts with lyrics from the Ultimate Guitar website. We identify and remove duplicate songs in our data using Jaccard similarity, then extract the chord progressions from our data representation to learn chord embeddings.

Fig. 1. Number of chords in the dataset and a fitted trendline, with the parameters given in the figure, for the 61 most common chords, showing a power law distribution. A sample of chords that also appear in the embedding visualization in Figure 2 are labeled.
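The duplicate-removal step above can be sketched as a greedy Jaccard filter over the sets of chords in each chart. The `deduplicate` helper and the 0.9 threshold below are illustrative assumptions, not the paper's exact procedure.

```python
def jaccard(a, b):
    """Jaccard similarity between two chord sequences, treated as sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def deduplicate(songs, threshold=0.9):
    """Keep a chart only if it is not near-identical to one already kept.

    `songs` maps a chart id to its list of chord symbols.
    """
    kept = {}
    for title, chords in songs.items():
        if all(jaccard(chords, other) < threshold for other in kept.values()):
            kept[title] = chords
    return kept

songs = {
    "song_a": ["G", "C", "D", "Em"],
    "song_b": ["G", "C", "D", "Em"],   # duplicate chart of song_a
    "song_c": ["Am", "F", "C", "G"],
}
kept = deduplicate(songs)              # keeps song_a and song_c
```

The greedy pass is quadratic in the number of charts; at the scale of ~92,000 charts a real implementation would typically block candidates (e.g., by title) before comparing.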
We remove songs with fewer than six chords, leaving us with a final set of 88,874 songs and 4,913 unique chords. The chords' song frequencies follow a power law distribution, much like Zipf's word frequency law in natural language. Figure 1 shows the song frequency and the power law trend line fitted to the top 61 chords for demonstration, though the trend is even stronger over all chords. Beyond the existence of many possible chord notations, we observe a relation between chord frequency and the physical difficulty of playing the chord on the guitar or other instruments (e.g., G, C, and D), as well as variations in notation. (Note that multiple users sometimes submit chord charts for the same song.)

4 Chord Representations

4.1 Chord Embeddings

Word embeddings have been used extensively to represent the meaning of words based on their use in a large collection of documents. We consider a similar process for chords, forming representations based on their use in a large collection of songs.

We create chord embeddings for chords that appear in at least 0.1% of our songs (237 chords) using the continuous bag-of-words model (CE_cbow) and the skip-gram language model (CE_sglm) from word2vec [18], a widely used method for creating word embeddings. For CE_cbow, a target chord is predicted based on context chords, while CE_sglm is the reverse: context chords are predicted given a target chord. In both cases, this has the effect of learning representations that are more similar for chords that appear in similar contexts. We tested context sizes of two, five, and seven, and varied the vector dimensions between 50, 100, 200, and 300, but observed only minor differences across models, and chose to use a context window of five and a vector dimension of 200.
To better understand the information encoded in chord embeddings, we perform a qualitative analysis using PCA and present a 2D projection for the CE_sglm model in Figure 2 (our main observations are consistent for CE_cbow). We observe that chords that form a fifth interval are closer together, which suggests that the embeddings capture an important concept known as the circle of fifths. Fifth-interval relationships serve broad purposes in tonal music (music that adheres to a model of relationships to a central tone). Saker [26] encapsulates their structural importance by stating that "circle of fifths relationships dominate all structural levels of tonal compositions" and that "the strongest, most copious harmonic progressions to be found in tonal music are fifth related." The circle of fifths relationship is observed in our chord embeddings over different chord qualities, specifically over major chords (highlighted in Figure 2b), minor chords (highlighted in Figure 2c), major-minor 7 chords, and minor 7 chords. (Quality refers to sound properties that are consistent across chords with different roots but equidistant constituent pitches; the interaction of intervals between pitches determines the quality.) For both chord qualities, the layout captured by the chord embeddings is similar to the ideal/theoretical circle of fifths, illustrated in Figure 2a. This pattern is particularly interesting as it does not follow the style of word analogy patterns
observed in language. This makes sense, as the "is-a-fifth" relation forms a circle in chords, whereas word analogies connect pairs of words without forming a circle.

Fig. 2. In (a), we show the circle of fifths with the same colors as in (b), (c), and (d), which show the same 2-dimensional PCA projection of the chord embedding space, with lines denoting the circle of fifths over major chords (b) and minor chords (c), and lines denoting major-minor relatives (d).

Additionally, we observe that relative major and minor chords appear relatively close together in the embedding space, as shown by their proximity in the PCA plots (highlighted in Figure 2d). We also observe that enharmonics, notes with different names but the same pitch, are often close together. Moreover, there is a consistent pattern in the positioning of enharmonics, with sharps to the left and flats to the right. (Relative refers to the relation between the chords' roots, in which the scale beginning on the minor chord's root shares the same notes as the scale beginning on the major chord's root, but the ordering of the notes gives different qualities to the scales.)
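A 2-D projection of the kind shown in the PCA plots can be sketched with a plain SVD-based PCA; a random matrix stands in here for the learned 200-dimensional embeddings of the 237 chords.

```python
import numpy as np

def pca_2d(vectors):
    """Project embedding vectors to 2-D via PCA (center, then SVD)."""
    X = np.asarray(vectors, dtype=float)
    X = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by explained variance;
    # keep the top two for a scatter plot.
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    return X @ Vt[:2].T

# Toy stand-in for the chord embedding matrix (rows = chords, cols = dims).
rng = np.random.default_rng(0)
emb = rng.normal(size=(237, 200))
proj = pca_2d(emb)   # shape (237, 2)
```

With real embeddings, each projected row would be plotted and labeled with its chord symbol, which is how patterns such as the circle of fifths become visible.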
These observations suggest that chord embeddings are capable of representing musical relationships that are important in music theory. Transitions between the tonic (I), dominant (V), and subdominant (IV) chords of a scale are prescriptive components of musical cadences [25]. Since these chords frequently appear in the same context, their embeddings are more similar. A common deceptive cadence is a transition between the fifth and sixth root chords of a scale [20]. An example progression with a deceptive cadence in C major is C - F - G - Am; these chords appear in a similar neighborhood in the PCA plots. Because these chords frequently co-occur in music, the embeddings capture a relationship between them.

We also note that relationships between chords that are used more frequently are more strongly represented. The major and minor relative pairs (G, Em), (C, Am), and (D, Bm) are among the top ten chords ranked by song frequency (Figure 1) and have clear relations in Figure 2. In contrast, the pairs (Ab, Fm) and (Db, Bbm) are ranked lower, and their minor-major relative relationship appears weaker given their distance.
4.2 Other Chord Representations

In addition to chord embeddings, we also explore two other chord representations: Pitch Representations (PR) and Bag-of-Chords (BOC). For a fair comparison, we use the same vocabulary of 237 chords for these representations.
Pitch Representations. A chord's identity is determined by its pitches, so we test whether the individual pitches provide a better representation of a chord as a whole than our chord embeddings. This method represents each chord by encoding each of its pitches by its index in the chromatic scale {C = 1, C# = 2, ..., B = 12}. The pitches are in order of the triad, followed by an additional pitch if marked, and by one extra dimension for special cases. Additional pitches are indicated by the interval, relative to the root, of the pitch being added. We also represent chord inversions, e.g., the chord G/B, which is an inversion of G such that the bottom pitch is B.

Bag-of-Chords. In this representation, each chord is represented as a "one-hot" vector, where the vectors have length equal to the vocabulary size. We consider two ways of determining the value for a chord. For BOC_count, we use the frequency of a chord in a song, divided by the number of chords in the song. For BOC_tfidf, we use the TF-IDF (term frequency-inverse document frequency) of each chord.
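A simplified sketch of the pitch encoding follows, assuming a hypothetical `TRIADS` lookup and sharps-only spelling; the paper's full scheme (interval-coded added pitches, inversions, flats, and special cases) is omitted here.

```python
# Chromatic indices C=1 ... B=12, as in the scheme above (sharps only here;
# a full implementation would also map flat spellings such as Bb and Db).
PITCH_INDEX = {"C": 1, "C#": 2, "D": 3, "D#": 4, "E": 5, "F": 6,
               "F#": 7, "G": 8, "G#": 9, "A": 10, "A#": 11, "B": 12}

# Hypothetical lookup from chord symbol to triad pitches; a real encoder
# would parse the symbol into root, quality, additions, and bass note.
TRIADS = {
    "G":  ["G", "B", "D"],
    "Am": ["A", "C", "E"],
}

def pitch_representation(chord, extra=None, special=0):
    """Encode a chord as its triad indices, an optional added pitch slot,
    and one extra dimension for special cases."""
    triad = [PITCH_INDEX[p] for p in TRIADS[chord]]
    return triad + [PITCH_INDEX[extra] if extra else 0, special]

pitch_representation("G")    # [8, 12, 3, 0, 0]
```

Note that under this encoding, chords sharing pitches get overlapping vectors, a property that matters for the discussion of PR performance later in the paper.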
(Special cases include the "*" marking on a chord, which is a marker specific to the ultimate-guitar.com site; "UNK", which we use to replace chords that do not meet the 0.1% document frequency threshold; and "H" and "Hm", which indicate "hammer-ons" in the notation used on ultimate-guitar.com.)

5 Case Study 1: Next Chord Prediction

In this section, we present our first case study, which investigates whether there is a relationship between chord embedding representations and the ways humans perceive and interact with chords. We test the use of chord representations for predicting the most likely next chord following a given sequence of chords, and then compare these predictions to human-annotated responses.

We train a next chord prediction model using a long short-term memory (LSTM) model [9]. We follow standard practice and do not freeze the embeddings, meaning the chord representations are updated throughout training, adjusting to capture the musical features most important for the task. Our main model uses the pre-trained chord embeddings to initialize the chord prediction architecture. We test both CE_cbow and CE_sglm embeddings, and refer to these models by those acronyms. We also define a baseline model whose encoder is randomly initialized (denoted NI, for no initialization). Finally, we also evaluate a model whose encoder is initialized with the pitch representations introduced in Section 4.2 (denoted PR).

We divide our data into three sets, with 69,985 songs (80%) for training, 8,748 songs (10%) for validation, and 8,748 (10%) for testing. We train using a single GPU with parameters: epochs = 40, sequence length = 35, batch size = 20, dropout rate = 0.2, hidden layers = 2, learning rate = 20, and gradient clipping = 0.

To evaluate the next chord prediction models, we collect data through a human annotation task in which annotators are asked to add a new chord at the end of a chord progression. For example, given the progression "A, D, E," they must pick a chord that would immediately follow E. They are also asked to pick one or two alternatives to this selection. Continuing the example, if an annotator provides E7 and A, then the chord progressions they have specified are "A, D, E, E7" and "A, D, E, A." The annotators are given a tool to play 48 different chords (all major, minor, major-minor 7, and minor 7 chords) so they can hear how different options would sound as part of the progression. They were given a total of 39 samples, shown in the same order to all annotators. The samples were chosen randomly from our entire dataset of songs, permitting only one sequence to come from a single song, and requiring that they contain the same 48 chords. We presented sequences of length three and six, as we expect that patterns in the given sequence affect the responses.
Participants.
The annotators were first asked to estimate their expertise in music theory on a scale from 0 to 100, where 0 indicates no knowledge of music theory, 25-75 indicates some level of knowledge from pre-university training through self-teaching, private lessons/tutoring, or classroom settings, and 75-100 indicates substantial expertise gained through formal university studies, performing, and/or composing experience. They were given the option to provide comments about how they estimated their expertise. We collected this information because we expected that the annotations provided by a participant may vary depending on their background education in music theory. It also allows us to perform comparisons of our system with sub-groups defined by self-reported knowledge. Nine participants provided complete responses, with expertise ratings of 0, 0, 10, 10, 19, 25, 25, 50, and 73. For the following analyses, we define a beginner set containing annotators who provided 0 for their self-rating, an intermediate set containing annotators with ratings greater than 0 but less than 50, and an expert set containing annotators whose ratings are at least 50. (We use an open-source repository of neural language models: https://github.com/pytorch/examples/blob/master/word_language_model/model.py. We did not limit our next chord prediction models to these 48 chords.)
Inter-annotator Agreement.
For pairwise agreement, we compute the proportion of chord progressions in which a pair of annotators provided the same chord, averaged over all pairs of annotators. The pairwise agreement across all annotators is 22.51; it is 23.08 for the beginner set, 25.38 for the intermediate set, and 17.95 for the expert set.

To account for responses of similar but not identical chords (discussed in Section 5.2), we measure pairwise agreement on response pitches. We compute the fraction of matching pitches for a pair of annotators' responses for a given chord progression, averaged over all pairs of annotators. The pairwise pitch agreement score for all annotators is 38.00; it is 37.01 for the beginner set, 40.90 for the intermediate set, and 33.59 for the expert set. The average number of unique chords used by each annotator is 30.2; it is 35.5 for the beginner set, 27.4 for the intermediate set, and 32.0 for the expert set.
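The chord-level pairwise agreement described above can be sketched as follows; the annotator names and responses are illustrative, not the study's data.

```python
from itertools import combinations

def pairwise_chord_agreement(responses):
    """Mean, over annotator pairs, of the fraction of progressions on which
    the pair chose the same chord (as a percentage).

    `responses` maps annotator -> list of first-choice chords, one per
    progression, in the same order for every annotator.
    """
    scores = []
    for a, b in combinations(responses.values(), 2):
        scores.append(sum(x == y for x, y in zip(a, b)) / len(a))
    return 100 * sum(scores) / len(scores)

responses = {
    "ann1": ["G", "C", "Am"],
    "ann2": ["G", "C", "D"],
    "ann3": ["Em", "C", "Am"],
}
pairwise_chord_agreement(responses)   # mean of 2/3, 2/3, 1/3 -> 55.56%
```

The pitch-based variant replaces the exact-match test with the fraction of shared pitches between the two chords.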
The main objective of this case study is to investigate whether the chord similarities captured by our embeddings reflect human-perceived similarities. We use the chord-prediction systems to perform the same task given to the annotators. Each model provides a probability distribution over the full set of chords; we therefore treat the chords with the highest probabilities as each model's selections. We evaluate the predictions with the following metrics, which are inspired by the metrics employed by the Lexical Substitution task [17] but modified for our setup, and which weight more frequent responses higher:
Match_best: For each example, we calculate the fraction of people who included the model's top prediction in their answer. These values are then averaged over all examples.

Match_oo: This adds together the values of the previous metric across the model's top four predictions.

Mode_best: The fraction of cases in which the top model prediction is the same as the most common annotator response, when there is a single most common response.

Mode_oo: The fraction of cases in which one of the top four model predictions is the same as the most common annotator response.

Note that only 25 out of the 39 examples had a unique most common response. Of these, 20 had a chord chosen by three or four annotators, and five had a chord chosen by five to seven annotators. The remaining examples are not considered in the Mode metrics.

Pitch Matches: The metrics above penalise all differences in predictions equally, even though some chord pairs are more different than others; e.g., A7 and A differ only in the addition of a pitch, whereas B and A share no pitches. To address this, we use a metric that totals, for each question, the number of pitches that match between the model's top response and the annotator's first response. We calculate this separately per annotator, and then average across annotators (PM_ave).

Loss and Perplexity (PPL): These are two measures from the language modeling literature that we apply to see how well the models do on the true content of songs. Note that this evaluation is over a different set: 8,748 randomly chosen songs that are not included in model training.
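The Match_best metric above can be sketched as follows; the predictions and annotator answer sets are illustrative.

```python
def match_best(top_predictions, annotations):
    """Match_best: for each example, the fraction of annotators whose answer
    set includes the model's top prediction, averaged over examples.

    `top_predictions[i]` is the model's top chord for example i;
    `annotations[i]` is a list of per-annotator answer sets for example i.
    """
    per_example = []
    for pred, answer_sets in zip(top_predictions, annotations):
        per_example.append(
            sum(pred in answers for answers in answer_sets) / len(answer_sets))
    return sum(per_example) / len(per_example)

preds = ["G", "Am"]
annos = [
    [{"G", "Em"}, {"C"}, {"G"}],   # example 1: 2 of 3 annotators include G
    [{"Am"}, {"Am", "F"}, {"C"}],  # example 2: 2 of 3 include Am
]
match_best(preds, annos)           # -> 2/3
```

Match_oo sums this quantity over the model's top four predictions instead of just the top one.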
Table 1. Match_best, Match_oo, Mode_best, Mode_oo, and PM_ave results for the NI, PR, CE_cbow, and CE_sglm models when compared with different sets of annotators based on their expertise.
Table 1 reports metrics for each system when compared with the beginner set, intermediate set, expert set, and full set of annotators. When evaluating against all annotators, CE_sglm is best in Match_best and Match_oo, NI is best in Mode_best, and PR is best in Mode_oo.

In the results for subsets of annotators, all systems tend to match experts better than the beginner or intermediate groups. In particular, CE_cbow obtains the lowest scores when evaluated against the beginner group, but the highest when evaluated against the expert group. To investigate this pattern, we compute Match_best and Match_oo for each individual annotator and then the Spearman correlation coefficient r_s between these metrics and their expertise. In Table 2, we observe a strong significant correlation between CE_cbow and expertise, and no significant correlation for the other models. This can be explained by the fact that the models are trained on a large collection of songs composed by experts, and the chord embeddings seem to capture the chord use style of the experts.

Pitch Matches.
We report PM_ave for the models and expertise groups in Table 1. Similarly to the Match and Mode metrics, we observe that all models perform better when compared to annotators with higher expertise, and the differences between groups are most extreme with CE_cbow. Table 2 shows the results of the correlation analysis for pitch matches (Pearson correlation coefficient r_p), with a significant linear correlation only for CE_cbow. This trend is shown visually in Figure 3.

Fig. 3. Total pitch matches between each annotator and the output of the CE_cbow model, plotted by the annotator's expertise. The red line is the line of best fit computed by linear regression.

Table 2. Correlation coefficients with expertise (r_s with p-values for Match_best and Match_oo, r_p with p-values for Pitch Matches) for the NI, PR, CE_cbow, and CE_sglm models.

Table 3. Loss and perplexity metrics for the chord prediction models on a held-out test set.
Loss and Perplexity.
Table 3 shows the results on our 8,748-song test set. All models perform similarly in this setting.
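The rank-correlation analysis reported in Table 2 (Spearman's r_s between per-annotator Match scores and expertise) can be sketched in plain Python; the expertise ratings below are from the study, while the per-annotator match scores are illustrative placeholders.

```python
def _ranks(xs):
    """Average ranks (1-based), handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # average of the tied positions
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's r_s: Pearson correlation computed on the ranks."""
    rx, ry = _ranks(x), _ranks(y)
    mx, my = sum(rx) / len(rx), sum(ry) / len(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

expertise = [0, 0, 10, 10, 19, 25, 25, 50, 73]       # self-ratings from the study
match = [.10, .12, .15, .14, .18, .20, .22, .25, .30]  # illustrative scores
rho = spearman(expertise, match)
```

In practice `scipy.stats.spearmanr` also returns the p-value reported in Table 2.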
We observe that automatic models produce predictions that resemble human responses. Comparing against annotators grouped by expertise, CE_cbow compares best to the high-expertise group for all metrics, while the best model varies among the other groups. CE_cbow's predictions also correlate significantly with pitch matches and annotator expertise. NI and PR achieve the highest Mode_best and Mode_oo scores; however, fewer samples are considered for these metrics, because only twenty-five had a unique mode chord across the annotators, and only five of these samples had more than half the annotators agree. Additionally, the chord-symbol-based metrics are strict, requiring an exact match on chords, and had lower inter-annotator agreement than the pitch-based metrics.

While CE_cbow's predictions exhibit a strong pitch-match correlation, CE_sglm's predictions exhibit no significant correlation at all. However, differences between the CE_cbow and CE_sglm embeddings may not be as apparent in other downstream applications; in fact, by the perplexity and test loss metrics shown in Table 3, there is barely a difference between these two, or any, of the models. Investigating key differences between these embedding models in musical contexts is a direction for future work.

6 Case Study 2: Artist Attribute Prediction

Our second case study introduces the task of artist attribute prediction, demonstrating that these chord representations could be used more broadly in tasks involving musical stylometry and musical author profiling. With binary classifiers using our chord representations, we predict three attributes as separate tasks: gender (male or female), performing country (U.S. or U.K.), and type of artist (group or solo artist).
Data.
For these experiments, we augment the dataset with information obtained through the MusicBrainz API (https://musicbrainz.org/), which includes the song artist's location, gender, lifespan, tags that convey musical genres, and other available information, for 35,351 English songs (identified using Google's Language Detection project [29]). From this extracted information, we choose artist type, performing country, and gender because sufficient data is available with these attributes to enable the tasks. We note that tasks dedicated to genre or time period are of interest for future investigations; our preliminary experiments using the artists' lifespan and tags as proxies for time period and genre indicated that these tasks are promising use cases for chord embeddings.

We use the top two most frequent classes of each attribute and balance the data to have the same number of examples for each class. For artist type, there are 20,000 songs per class (group and solo). For performing country, there are 8,000 songs per class (U.S. and U.K.). For gender, there are 6,000 songs per class (male and female). The number of samples varies because of differences in the raw class counts and because not all songs have a label for each property.

Experimental Setup. We build two binary classifiers and compare their performance with each chord representation. The first uses logistic regression (LR) over a single vector for each song, obtained by aggregating chord representations. We also experimented with an SVM classifier, but LR was more efficient with minimal performance trade-offs. The BOC methods are defined in Section 4.2. The PR method aggregates the chords with a many-hot encoding vector counting each chord pitch, normalized by the total number of pitches. The CE methods aggregate chord embeddings by max-pooling, taking the most extreme absolute value in each dimension across all chords.

The second classifier is a Convolutional Neural Network (CNN) that considers the chords in sequence. We experimented with an LSTM and found that the CNN performs better for these tasks. We use the CNN model for sentence classification by Kim [13] over the chord progressions of each song, using the same NI, PR, CE_sglm, and CE_cbow representations from the first case study.

For our model parameters, chosen in preliminary experiments on a subset of the data, we use L2 regularization for the LR classifier; the CNN model uses filter window sizes 3, 4, and 5 with 30 feature maps, a drop-out rate of 0.5, and Adam [14] optimization. The sequence limit is 60 chords, cutting off extra chords and padding when there are fewer.
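The max-pooling aggregation described above can be sketched as follows; whether the sign of the extreme value is kept is one reading of "most extreme absolute value", and it is preserved here.

```python
import numpy as np

def max_pool_chords(chord_vectors):
    """Aggregate a song's chord embeddings into one vector by taking, in each
    dimension, the value with the largest absolute magnitude (sign kept)."""
    X = np.asarray(chord_vectors, dtype=float)   # shape (n_chords, dim)
    idx = np.abs(X).argmax(axis=0)               # row index of extreme value
    return X[idx, np.arange(X.shape[1])]

# Toy song with three 3-dimensional chord vectors.
song = [[0.2, -0.9, 0.1],
        [-0.5, 0.3, 0.4],
        [0.1, 0.8, -0.6]]
max_pool_chords(song)   # -> array([-0.5, -0.9, -0.6])
```

The resulting fixed-length vector is what the LR classifier consumes, regardless of how many chords the song contains.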
Table 4. Accuracy scores from 10-fold cross-validation on the artist gender, country, and artist type prediction tasks, for the logistic regression models (BOC_count, BOC_tfidf, PR, CE_cbow, CE_sglm) and the CNN models (NI, PR, CE_cbow, CE_sglm). Significance markers: * p < .05 over BOC_tfidf; † p < .05 over PR; ‡ p < .05 over CE_sglm; § p < .05 over NI. The significance tests are performed among the logistic regression models and the CNN models separately.
Table 4 shows the models' accuracy scores from experiments using 10-fold cross-validation. The CE_sglm CNN is the top performer for all tasks, significantly outperforming the PR CNN and all LR models on all tasks, and NI on country. All models significantly outperform a random baseline (50%) in each task, by a paired t-test with p < .05. In each task, the CNN models significantly outperform their LR counterparts. (The CNN model is built on https://github.com/Shawn1993/cnn-text-classification-pytorch.)

For insight into the models' performance, we analyze the gender prediction task, the only attribute for which the LR CE_cbow and LR BOC_count predictions differed significantly. First, we compare the rate of use of each chord between genders. To show the differences in Figure 4, we divide the higher rate by the lower rate, subtract one to set equal use to zero, and flip the sign when female use is higher. We observe greater variations among chords with lower song frequency. For instance, C/G, F, Bbm, and Ab are twice as salient for one gender as for the other. The highest variation among the top 20 chords reaches 1.5 times more salient, and among the top 10, 1.2 times more salient for one gender.

To investigate the impact on the CE models of the musical relationships captured in the embeddings (Section 4), we also compared the use of five chord qualities. Figure 5 shows a higher relative frequency of suspended and diminished chords among the songs of male artists, augmented and minor chords among the songs of female artists, and fairly similar use of major chords.

Fig. 4. Chord variations by gender, ordered from least to greatest by song frequency. The labeled chords are at least 1.5 times more salient for one gender than the other.

Fig. 5. Variation in chord quality usage by gender, by ratio of use percentage.
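The signed salience used in Figure 4 can be sketched directly from the recipe above; the rates are illustrative and assumed nonzero.

```python
def salience(rate_male, rate_female):
    """Signed salience of a chord's usage between genders: divide the higher
    rate by the lower, subtract one so equal use maps to zero, and flip the
    sign when female use is higher. Rates are assumed to be nonzero."""
    hi, lo = max(rate_male, rate_female), min(rate_male, rate_female)
    value = hi / lo - 1
    return -value if rate_female > rate_male else value

salience(0.04, 0.02)   # -> 1.0  (twice as salient for male artists)
salience(0.02, 0.04)   # -> -1.0 (twice as salient for female artists)
salience(0.03, 0.03)   # -> 0.0  (equal use)
```

A value of ±1.0 thus corresponds to the "twice as salient" chords labeled in the figure.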
To our knowledge, this is the first time that an association between chord representations and author attributes has been explored. Each model for each attribute showed significant improvement over a random baseline of 50%, indicating that there are quantifiable differences in our music data between the genders, countries, and artist types. In addition to the tasks we presented, we also observed improvement when using the system to predict the life-period of the artist and their associated genres. As life-period is a proxy for the music's time period, chord embeddings could benefit future work in musical historical analysis.

For gender, the variation in rare chords may contribute to BOC_count outperforming CE_cbow. However, CE_cbow significantly outperforms BOC_tfidf, which gives more weight to rare chords through their inverse document frequency. This suggests that chord rarity is not the only critical feature. The variations in chord quality use may contribute to the performance of the CE models, as the embeddings capture musical relationships.

LR PR consistently underperforms all other models, which may indicate the importance of chord structures. Different chords with the same pitches (e.g., G and G/B) have the same PR vector, and chords with overlapping pitches have similar PR vectors. However, CNN PR, which performs closer to the others, encodes pitch orderings (Section 4.2), and the BOC methods encode chord symbols, which indicate structure. CE representations are learned from chord symbols, likely capturing contextual functions of chord structures. These functions would matter for the CNN models, which make predictions from chord sequences rather than a single aggregated vector. Since we observed the best performance with the CE_sglm CNN, this suggests the importance of the contextual semantics of chord structures. A deeper study into the structural semantics captured by chord embeddings is a direction for future work.

7 Conclusion

In this paper, we presented an analysis of the information captured by chord embeddings and explored how they can be applied in two case studies. We found that chord embeddings capture chord similarities that are consistent with important musical relationships described in music theory. Our case studies showed that the embeddings are beneficial when integrated into models for downstream computational music tasks. Together, these results indicate that chord embeddings are another useful NLP tool for musical studies. The code to train chord embeddings and the resulting embeddings, as well as the next-chord annotations, are publicly available at https://lit.eecs.umich.edu/downloads.html.

Acknowledgements.
We would like to thank the anonymous reviewers and the members of the Language and Information Technologies lab at Michigan for their helpful suggestions. We are grateful to MeiXing Dong and Charles Welch for helping with the design and interface of the next-chord annotation task. This material is based in part upon work supported by the Michigan Institute for Data Science, and by Girls Encoded and Google for sponsoring Jiajun Peng through the Explore Computer Science Research program. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the Michigan Institute for Data Science, Girls Encoded, or Google.