A Hybrid Approach to Audio-to-Score Alignment
Ruchit Agrawal, Simon Dixon

Abstract
Audio-to-score alignment aims at generating an accurate mapping between a performance audio and the score of a given piece. Standard alignment methods are based on Dynamic Time Warping (DTW) and employ handcrafted features. We explore the usage of neural networks as a preprocessing step for DTW-based automatic alignment methods. Experiments on music data from different acoustic conditions demonstrate that this method generates robust alignments whilst being adaptable at the same time.
1. Introduction and Motivation
Audio-to-score alignment is the task of finding the optimal mapping between a performance and the score for a given piece of music. Dynamic Time Warping (Sakoe & Chiba, 1978) has been the de facto standard for this task, typically incorporating handcrafted features (Dixon, 2005; Arzt, 2016). Recent advances in Music Information Retrieval (MIR) have demonstrated the efficacy of Deep Neural Networks (DNNs) on a variety of tasks such as music generation (Eck & Schmidhuber, 2002), audio classification (Lee et al., 2009), onset detection (Marolt et al., 2002), music transcription (Marolt, 2001; Hawthorne et al., 2017) as well as music alignment (Dorfer et al., 2018a). The primary advantage of DNNs is that they can learn directly from data in an end-to-end manner, thereby eschewing the need for complex feature engineering. However, DNNs struggle to model long-term dependencies (Bengio et al., 1994) in temporal sequences. End-to-end alignment is a challenging task, since it involves handling multiple inputs of different modalities in addition to very long-term dependencies. This paper is an endeavor towards employing neural networks for music alignment. We present a hybrid approach to audio-to-score alignment, which consists of a neural network based preprocessing step as a precursor to Dynamic Time Warping. The approach computes a frame similarity matrix, which is then passed to a DTW algorithm that computes the optimal warping path through this matrix. The advantage of our method is that the preprocessing step is trainable, making our method adaptable to a particular acoustic setting, unlike traditional DTW-based methods which employ handcrafted features.

This research is supported by the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068. Centre for Digital Music, Queen Mary University of London. Correspondence to: Ruchit Agrawal <[email protected]>. Proceedings of the 36th International Conference on Machine Learning, Long Beach, California, PMLR 97, 2019. Copyright 2019 by the author(s).
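For reference, the warping-path computation at the heart of DTW can be written in a few lines. The sketch below is a textbook quadratic-time implementation for illustration only; it is not the FastDTW approximation used in the experiments, and the cost-matrix convention (lower values mean more similar frames) is an assumption of this example:

```python
import numpy as np

def dtw_path(cost):
    """Compute an optimal warping path through a pairwise cost matrix.

    cost[i, j] is the dissimilarity between frame i of one sequence
    and frame j of the other (lower = more similar).
    """
    n, m = cost.shape
    # Accumulated-cost matrix with a border row/column of infinities.
    acc = np.full((n + 1, m + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            acc[i, j] = cost[i - 1, j - 1] + min(
                acc[i - 1, j],      # step in the first sequence only
                acc[i, j - 1],      # step in the second sequence only
                acc[i - 1, j - 1],  # step in both (diagonal match)
            )
    # Backtrack from the end to recover the alignment path.
    i, j, path = n, m, []
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]])
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]
```

On a cost matrix with a cheap diagonal, the recovered path follows that diagonal, i.e. a one-to-one frame correspondence.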
2. Related Work
Early works on feature learning for MIR tasks employ algorithms such as Hidden Markov Models (Joder et al., 2013) or deep belief networks (Schmidt et al., 2012). Recently, a number of works have explored feature learning for MIR using deep neural networks (Sigtia & Dixon, 2014; Oramas et al., 2017; Thickstun et al., 2016; Lattner et al., 2018; Arzt & Lattner, 2018; Korzeniowski & Widmer, 2016). Work specifically on learning features for audio-to-score alignment has mainly focused on an evaluation of current feature representations (Joder et al., 2010), learning of the mapping for several common audio representations based on a best-fit criterion (Joder et al., 2011) and learning transposition-invariant features for alignment (Arzt & Lattner, 2018). Hamel et al. (2013) propose transfer learning for MIR tasks by learning a shared latent representation across the related tasks of classification and similarity detection. Weaknesses in standard approaches to choosing similarity thresholds have been explored by Kinnaird (2017). İzmirli & Dannenberg (2010) propose the idea of learning features for aligning two sequences of music, as opposed to employing a standard chroma-based feature representation. Nieto & Bello (2014) present a novel algorithm to capture music segment similarity using two-dimensional Fourier-magnitude coefficients. Korzeniowski & Widmer (2016) explore frame-level audio feature learning for chord recognition using artificial neural networks.
3. Experiments and Results
The standard feature representation choice for music alignment is a time-chroma representation generated from the log-frequency spectrogram. Since this representation relies only on pitch class information, it ignores variations in timbre and instrumentation, and is not adaptable to different acoustic settings. Using neural networks allows us to bypass manual feature engineering whilst providing the capability to adapt to different settings. Rather than extracting a feature representation from the inputs, we focus on the task of constructing a frame-similarity matrix. This matrix is then passed to a DTW-based algorithm to generate the alignments. We employ a "Siamese" Convolutional Neural Network (Bromley et al., 1994) for this task. This framework has shown promising results in the field of computer vision for computing image similarity (Zagoruyko & Komodakis, 2015), as well as in natural language processing, for learning sentence similarity (Mueller & Thyagarajan, 2016) and speaker identification (Lukic et al., 2016), amongst others.

We train our Siamese CNN model to determine whether two patches, one from the audio and one from the synthesized MIDI, "match" or not. A similar approach has been used by İzmirli & Dannenberg (2010); however, they use a Multi-Layer Perceptron (MLP) framework to compute whether two frames are the same or not. In addition to using an enhanced framework better suited to this task, our work differs in that we also compute non-binary similarity labels and use the resulting distance matrix for alignment. We explain the preprocessing steps below.

In order to keep the modality constant, we first convert the MIDI files to audio using FluidSynth (Henningsson & Team, 2011). We then transform the frame-level audio patches to image spectrograms using librosa (McFee et al., 2015), a Python library for audio and music analysis.
We conduct experiments using both the Short-Time Fourier Transform (STFT) and the Constant-Q Transform (CQT) of the raw audio. We briefly explain our choice of loss function below.

The objective of our Siamese CNN model is not to classify the inputs, but to differentiate between them. Hence, a contrastive loss function is much better suited to this task than a standard classification loss function such as cross entropy. The contrastive loss function is computed as follows:

L(Y, X_1, X_2) = (1 - Y) * (1/2) * D_w^2 + Y * (1/2) * (max(0, m - D_w))^2

where Y is the ground-truth label (0 for a similar pair, 1 for a dissimilar pair), m is a margin, and D_w is the Euclidean distance between the outputs of the two Siamese twin networks. More formally, D_w can be expressed as follows:

D_w = || G_w(X_1) - G_w(X_2) ||_2

where G_w is the output of each of the twin networks and X_1 and X_2 are the two inputs.

We train the model on the MAPS database (Emiya et al., 2010), which provides MIDI-aligned audio for a range of acoustic settings. We select only the subset containing recordings played on a real piano, and discard those which are software-synthesized. We compute the similarity matrix using two mechanisms:

• Using binary labels: we employ the output labels of our Siamese CNN, where 0 implies a similar pair and 1 implies a dissimilar pair.

• Using distances: we calculate D_w, which directly corresponds to the dissimilarity between the two inputs. The higher the value of D_w, the higher the dissimilarity.

We then generate an alignment path through this matrix using FastDTW (Salvador & Chan, 2007), via a readily available DTW implementation in Python (https://pypi.org/project/fastdtw/). We test the performance of our model on a subset of the Mazurka dataset (Sapp, 2007), which contains recordings from various acoustic settings. The results obtained using both methods are given in Table 1.
Table 1. Alignment accuracy (in %)

Type of matrix    STFT    CQT
Binary            76.3    78.6
Distance          78.1    81.4

Our results suggest that this method is a promising approach to alignment, especially in non-standard acoustic conditions, since the pre-processing Siamese network is trainable on such data, unlike the manually handcrafted features used by standard DTW-based algorithms.
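The contrastive loss used to train the Siamese CNN can be written out as a small function. The sketch below follows the standard formulation of the loss given in Section 3; the margin value m = 1.0 is an assumed hyperparameter, not a setting reported here:

```python
import numpy as np

def contrastive_loss(out1, out2, y, margin=1.0):
    """Contrastive loss for one Siamese pair.

    out1, out2 : embeddings G_w(X_1), G_w(X_2) from the twin networks
    y          : 0 for a similar ("matching") pair, 1 for a dissimilar pair
    margin     : dissimilar pairs are only penalised within this margin
    """
    d_w = np.linalg.norm(out1 - out2)  # Euclidean distance D_w
    return (1 - y) * 0.5 * d_w ** 2 + y * 0.5 * max(0.0, margin - d_w) ** 2
```

Similar pairs are pulled together (loss grows with D_w), while dissimilar pairs are pushed apart until they are at least the margin away, after which they contribute zero loss.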
4. Conclusion and Future Work
We demonstrated that our hybrid approach to audio-to-score alignment is capable of generating robust alignments across various acoustic conditions. The advantage of our method is that it is adaptable to a particular acoustic setting without requiring a large amount of labeled training data. In the future, we would like to conduct an exhaustive evaluation of this approach on musically relevant parameters and analyze its limitations. We would also like to learn the features as well as the alignments in a completely end-to-end manner.
References
Arzt, A. Flexible and robust music tracking. PhD thesis, Universität Linz, Linz, 2016.

Arzt, A. and Lattner, S. Audio-to-score alignment using transposition-invariant features. arXiv preprint arXiv:1807.07278, 2018.

Arzt, A., Widmer, G., and Dixon, S. Adaptive distance normalization for real-time music tracking. pp. 2689–2693. IEEE, 2012.

Bell, S. and Bala, K. Learning visual similarity for product design with convolutional neural networks. ACM Transactions on Graphics (TOG), 34(4):98, 2015.

Bengio, Y., Simard, P., Frasconi, P., et al. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994.

Bromley, J., Guyon, I., LeCun, Y., Säckinger, E., and Shah, R. Signature verification using a "siamese" time delay neural network. In Advances in Neural Information Processing Systems, pp. 737–744, 1994.

Carabias-Orti, J. J., Rodríguez-Serrano, F. J., Vera-Candeas, P., Ruiz-Reyes, N., and Cañadas-Quesada, F. J. An audio to score alignment framework using spectral factorization and dynamic time warping. In International Society for Music Information Retrieval, pp. 742–748, 2015.

Dixon, S. An on-line time warping algorithm for tracking musical performances. In IJCAI, pp. 1727–1728, 2005.

Dorfer, M., Arzt, A., and Widmer, G. Towards score following in sheet music images. arXiv preprint arXiv:1612.05050, 2016.

Dorfer, M., Arzt, A., and Widmer, G. Learning audio-sheet music correspondences for score identification and offline alignment. arXiv preprint arXiv:1707.09887, 2017.

Dorfer, M., Hajič Jr, J., Arzt, A., Frostel, H., and Widmer, G. Learning audio–sheet music correspondences for cross-modal retrieval and piece identification. Transactions of the International Society for Music Information Retrieval, 1(1), 2018a.

Dorfer, M., Henkel, F., and Widmer, G. Learning to listen, read, and follow: Score following as a reinforcement learning game. arXiv preprint arXiv:1807.06391, 2018b.

Eck, D. and Schmidhuber, J. A first look at music composition using LSTM recurrent neural networks. Istituto Dalle Molle Di Studi Sull Intelligenza Artificiale, 103, 2002.

Emiya, V., Bertin, N., David, B., and Badeau, R. MAPS - a piano database for multipitch estimation and automatic transcription of music. 2010.

Hamel, P., Davies, M., Yoshii, K., and Goto, M. Transfer learning in MIR: Sharing learned latent representations for music audio classification and similarity. 2013.

Hawthorne, C., Elsen, E., Song, J., Roberts, A., Simon, I., Raffel, C., Engel, J., Oore, S., and Eck, D. Onsets and frames: Dual-objective piano transcription. arXiv preprint arXiv:1710.11153, 2017.

Henningsson, D. and Team, F. D. FluidSynth real-time and thread safety challenges. In Proceedings of the 9th International Linux Audio Conference, Maynooth University, Ireland, pp. 123–128, 2011.

Hu, N., Dannenberg, R. B., and Tzanetakis, G. Polyphonic audio matching and alignment for music retrieval. In Applications of Signal Processing to Audio and Acoustics, 2003 IEEE Workshop on., pp. 185–188. IEEE, 2003.

İzmirli, Ö. and Dannenberg, R. B. Understanding features and distance functions for music sequence alignment. In International Society for Music Information Retrieval, pp. 411–416. Citeseer, 2010.

Joder, C., Essid, S., and Richard, G. A comparative study of tonal acoustic features for a symbolic level music-to-score alignment. pp. 409–412. IEEE, 2010.

Joder, C., Essid, S., and Richard, G. Optimizing the mapping from a symbolic to an audio representation for music-to-score alignment. pp. 121–124. IEEE, 2011.

Joder, C., Essid, S., and Richard, G. Learning optimal features for polyphonic audio-to-score alignment. IEEE Transactions on Audio, Speech, and Language Processing, 21(10):2118–2128, 2013.

Kinnaird, K. M. Examining musical meaning in similarity thresholds. In International Society for Music Information Retrieval, pp. 635–641, 2017.

Korzeniowski, F. and Widmer, G. Feature learning for chord recognition: The deep chroma extractor. arXiv preprint arXiv:1612.05065, 2016.

Lattner, S., Grachten, M., and Widmer, G. Learning transposition-invariant interval features from symbolic music and audio. arXiv preprint arXiv:1806.08236, 2018.

Lee, H., Pham, P., Largman, Y., and Ng, A. Y. Unsupervised feature learning for audio classification using convolutional deep belief networks. In Advances in Neural Information Processing Systems, pp. 1096–1104, 2009.

Lukic, Y., Vogt, C., Dürr, O., and Stadelmann, T. Speaker identification and clustering using convolutional neural networks. pp. 1–6. IEEE, 2016.

Mandel, M. I., Poliner, G. E., and Ellis, D. P. Support vector machine active learning for music retrieval. Multimedia Systems, 12(1):3–13, 2006.

Marolt, M. Sonic: Transcription of polyphonic piano music with neural networks. In Workshop on Current Research Directions in Computer Music, pp. 217–224, 2001.

Marolt, M., Kavcic, A., and Privosnik, M. Neural networks for note onset detection in piano music. In Proceedings of the 2002 International Computer Music Conference, 2002.

McFee, B., Raffel, C., Liang, D., Ellis, D. P., McVicar, M., Battenberg, E., and Nieto, O. librosa: Audio and music signal analysis in python. In Proceedings of the 14th Python in Science Conference, pp. 18–25, 2015.

Mueller, J. and Thyagarajan, A. Siamese recurrent architectures for learning sentence similarity. In Thirtieth AAAI Conference on Artificial Intelligence, 2016.

Muller, M., Ewert, S., and Kreuzer, S. Making chroma features more robust to timbre changes. pp. 1877–1880. IEEE, 2009.

Nieto, O. and Bello, J. P. Music segment similarity using 2D-Fourier magnitude coefficients. pp. 664–668. IEEE, 2014.

Oramas, S., Nieto, O., Barbieri, F., and Serra, X. Multi-label music genre classification from audio, text, and images using deep features. arXiv preprint arXiv:1707.04916, 2017.

Pampalk, E., Dixon, S., and Widmer, G. On the evaluation of perceptual similarity measures for music. In Proceedings of the Sixth International Conference on Digital Audio Effects (DAFx-03), pp. 7–12, 2003.

Pons, J., Nieto, O., Prockup, M., Schmidt, E., Ehmann, A., and Serra, X. End-to-end learning for music audio tagging at scale. arXiv preprint arXiv:1711.02520, 2017.

Sakoe, H. and Chiba, S. Dynamic programming algorithm optimization for spoken word recognition. IEEE Transactions on Acoustics, Speech, and Signal Processing, 26(1):43–49, 1978.

Salvador, S. and Chan, P. Toward accurate dynamic time warping in linear time and space. Intelligent Data Analysis, 11(5):561–580, 2007.

Sapp, C. S. Comparative analysis of multiple musical performances. In International Society for Music Information Retrieval, pp. 497–500, 2007.

Schmidt, E. M., Scott, J. J., and Kim, Y. E. Feature learning in dynamic environments: Modeling the acoustic structure of musical emotion. In International Society for Music Information Retrieval, pp. 325–330. Citeseer, 2012.

Sigtia, S. and Dixon, S. Improved music feature learning with deep neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pp. 6959–6963. IEEE, 2014.

Thickstun, J., Harchaoui, Z., and Kakade, S. Learning features of music from scratch. arXiv preprint arXiv:1611.09827, 2016.

Vinyals, O., Blundell, C., Lillicrap, T., Wierstra, D., et al. Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pp. 3630–3638, 2016.

Zagoruyko, S. and Komodakis, N. Learning to compare image patches via convolutional neural networks. In