Low Resource Audio-to-Lyrics Alignment From Polyphonic Music Recordings
Emir Demirel, Sven Ahlbäck, Simon Dixon
Centre for Digital Music, Queen Mary University of London, UK
Doremir Music Research AB, SE
ABSTRACT
Lyrics alignment in long music recordings can be memory exhaustive when performed in a single pass. In this study, we present a novel method that performs audio-to-lyrics alignment with a low memory consumption footprint regardless of the duration of the music recording. The proposed system first spots the anchoring words within the audio signal. With respect to these anchors, the recording is then segmented and a second-pass alignment is performed to obtain the word timings. We show that our audio-to-lyrics alignment system performs competitively with the state of the art, while requiring far fewer computational resources. In addition, we utilise our lyrics alignment system to segment the music recordings into sentence-level chunks. Notably, on the segmented recordings, we report the lyrics transcription scores on a number of benchmark test sets. Finally, our experiments highlight the importance of the source separation step for good performance on the transcription and alignment tasks. For reproducibility, we publicly share our code with the research community.
Index Terms — audio-to-lyrics alignment, music information retrieval, automatic lyrics transcription, long audio alignment, automatic speech recognition
1. INTRODUCTION
Audio-to-lyrics alignment has a variety of applications within the music technology industry. Some of these applications include lyrics prompting for karaoke applications, captioning, and music score and video editing. Moreover, lyrics alignment systems can be leveraged to generate new training data for several tasks within MIR research, such as lyrics transcription, singing voice detection, source separation, music transcription and cover song identification.

The task of aligning song lyrics with their corresponding music recordings is among the most challenging tasks in music information retrieval (MIR) research due to three major factors: the multimodality of the information to be processed, namely music and speech; the presence of the musical accompaniment in the acoustic scene; and the length of the music recording to be aligned. For processing linguistically relevant information, previous studies have taken the approach of adapting automatic speech recognition (ASR) paradigms to singing voice signals [1, 2, 3, 4]. Regarding the musical accompaniment, researchers have aligned lyrics on either source-separated vocal tracks [4] or utilized acoustic models trained on polyphonic recordings [2, 3, 5].
ED received funding from the European Union's Horizon 2020 research and innovation programme under the Marie Skłodowska-Curie grant agreement No. 765068.
For relevant alignment tasks within MIR research, previous studies have presented efficient ways to handle the duration issue in alignment by using methods based on dynamic time warping (DTW) [6, 7]. The results in the MIREX 2017 public evaluation of audio-to-lyrics alignment systems showed that Viterbi-based alignment outperforms DTW [8]. Most lyrics alignment research since then has utilized a similar approach due to its performance and efficiency; however, even Viterbi decoding may become intractable when processing long audio recordings. Within this context, performing robust and efficient lyrics alignment in long music recordings still remains a bottleneck and a limiting factor for audio-to-lyrics alignment technology in industrial applications involving large-scale web services or mobile deployment. In this paper, we aim to contribute to this field of research by proposing a low-resource solution that is robust and efficient with respect to both the audio length and the musical accompaniment to singing. Leveraging its duration-invariant capability, we further show that this approach can be exploited to generate sentence-level lyrics annotations for extending existing lyrics transcription training sets.

This paper continues as follows: in Section 2, we provide a brief overview of the state of the art in audio-to-lyrics alignment. We explain the overall details of the proposed biased lyrics search pipeline in Section 3. Then, we evaluate the practical utility of the overall system through lyrics alignment and transcription experiments. Also in this section, we conduct the first experiments evaluating different source separation algorithms in the lyrics alignment and transcription context. Finally, we report results that are competitive with the state of the art in lyrics alignment and the best recognition scores for lyrics transcription on a public benchmark evaluation dataset.
2. RELATED WORK
Early audio-to-lyrics alignment systems in research use Hidden Markov Model (HMM) based acoustic models which are utilized to extract frame-level phoneme (or character) posterior probabilities. A forward-pass decoding algorithm is then applied on these posteriorgrams, obtaining phoneme alignments. Then, using a language model (LM), phoneme posteriorgrams can be converted to word posteriorgrams to retrieve word-level alignments [1, 3, 4]. One recent successful system [2] showed a considerable performance boost compared to previous research using an end-to-end approach trained on a large corpus, where alphabetic characters are used as sub-word units of speech. Vaglio et al. [5] also employed an end-to-end approach for lyrics alignment applied in a multilingual context, but using phonemes as an additional intermediate representation, and obtained competitive results. In addition, the authors used a public dataset [9] that is much smaller than the training set used in [2]. Gupta et al. [3] reported state-of-the-art results using an acoustic model trained on polyphonic music using genre-specific phonemes. According to the authors, their system applies forced alignment with a large beam size, as it attempts to process the entire music recording at once. Although achieving impressive results, the application of forced alignment in a single pass can be memory exhaustive.

A similar challenge within automatic speech recognition (ASR) research is the lightly supervised alignment task [10], where the goal is to obtain timings of human-generated captions in long TV broadcasts that would be displayed to TV viewers as subtitles. Moreno et al. [11] present a system for long audio alignment which searches for the input word locations through a recursive application of speech recognition on a gradually restricted search space and language model. The method is then further improved in terms of robustness and efficiency in the search space using the factor transducer approach [12, 13]. The factor transducer introduces an important drawback within the lyrics alignment task as it exerts weak timing constraints during decoding, i.e. the output word alignments are not constrained to be in order. This becomes a significant issue especially during successive patterns of similar word sequences, which occur frequently in song lyrics [14].

Another major challenge during lyrics alignment (and also transcription) is the influence of accompanying non-vocal musical sounds. One way to minimize this during transcription and alignment is to isolate the vocal track using a vocal source separation system. A number of powerful open-source music source separation toolkits have recently been made available for research [15, 16, 17], yet the output vocal track is not guaranteed to be free of sonic artifacts introduced during separation. In turn, these artifacts might affect word intelligibility and thus the accuracy of the automatic lyrics transcription and alignment (ALTA) tasks. The effect of different source separation algorithms on lyrics transcription and alignment has not been studied extensively; we provide a comparison in this paper.
3. SYSTEM DETAILS
Our overall lyrics alignment pipeline is illustrated in Figure 1. We initially extract vocals from the original polyphonic mix and retrieve vocal segments using energy-based voice activity detection. We search for lyrics within these segments by applying automatic transcription using a decoding graph constructed via a biased language model (LM). The matching portions of the transcribed and original lyrics and their locations in the audio signal are obtained through text alignment and forced alignment respectively. The music recording is then segmented with respect to these matching portions and a final-pass alignment is applied to obtain the timings of all the words in the original lyrics. In order to be able to align and recognize out-of-vocabulary (OOV) words during decoding, we extend the pronunciation dictionary for the words in the input lyrics.

Fig. 1: Pipeline of our lyrics transcription and alignment system
For achieving a robust lyrics alignment system, out-of-vocabulary (OOV) words have to be taken into account. Lyrics may contain linguistically ambiguous sequences of words such as 'la', 'na' and 'ooh', as well as words with repeated syllables or phonemes like 'do re mi fa so oh oh oh', out-of-language (OOL) words, or special names. Thus, we extend the pronunciation dictionary with respect to the input lyrics and construct a pronunciation transducer prior to decoding. We trained a grapheme-to-phoneme conversion (g2p) model [18], using the CMU English pronunciation dictionary, to generate new pronunciations for each OOV word in the input lyrics.
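The paper trains a Phonetisaurus-style g2p model [18] for this step, and those training commands are not reproduced here. As a hedged illustration of the same idea, the sketch below generates pronunciations for OOV lyrics tokens using the open-source g2p_en package as a stand-in (not the authors' tool); the word list and lexicon format are illustrative assumptions.

```python
# Illustrative OOV pronunciation generation. g2p_en is used here as a stand-in
# for the Phonetisaurus model trained in the paper; the lexicon format
# (word -> space-separated ARPAbet phonemes) is an assumption.
from g2p_en import G2p

def extend_lexicon(lyrics_words, known_lexicon):
    """Return pronunciations for lyrics words missing from the base lexicon."""
    g2p = G2p()
    new_entries = {}
    for word in lyrics_words:
        if word in known_lexicon:
            continue
        # g2p_en returns ARPAbet symbols plus punctuation tokens; keep phones only.
        phones = [p for p in g2p(word) if p[0].isalpha()]
        new_entries[word] = " ".join(phones)
    return new_entries

# Example: repeated-syllable and OOL-like tokens commonly found in lyrics.
print(extend_lexicon(["ooh", "lalala", "nanana"], known_lexicon=set()))
```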
Initially, we separate the vocal track from the original mix and determine the voice activity regions (VARs) based on the log-energy (the zeroth component of the MFCC features). We merge consecutive VARs if the silence between them is less than τ_silence seconds, although we do not merge segments that are already more than τ_max seconds long. The values for τ are determined empirically. Note that if τ_silence is too short, there is a risk of over-segmenting the audio, which could potentially reduce the word recognition rate. We have set τ_silence = 0. and τ_max = 6, based on our empirical observations.

The goal of this stage is to detect the word positions in the lyrics and the audio that we will use for segmentation later on. In order to detect the words in the input lyrics with a higher recognition rate, we restrict the search space by building an n-gram language model (LM) on the input lyrics. First, we transcribe the contents within the VARs using the pretrained acoustic model in [19] and the biased LM. Song lyrics often contain repetitive word sequences, for which the system might recognize a word in the wrong position or repetition in the lyrics, potentially causing accumulated errors in segmentation. For robustness against such cases, we exert strong constraints on the word order by building the LM with 20-grams. Moreover, prior to processing, we add a <NOISE> tag at the beginning and end of each lyrics line to further reduce the risk of alignment errors between long non-vocal segments.

Building the LM from only the input lyrics has two major advantages. First, it increases the chance of recognizing the words in the input lyrics while minimizing the risk of wrong word predictions. Secondly, by constructing the LM on the fly, we do not need an external pretrained LM to perform lyrics alignment. This aspect of our system makes it applicable in a multilingual setting in the presence of a g2p model for the target languages.

3.4. Anchor Selection

Next, we apply text alignment between the transcriptions and the reference lyrics to determine the matching portions, i.e. anchoring words. To impose further constraints on word order, we consider N_anchor successive matching words as the anchoring instances between the lyrics and the audio signal. On the corresponding VARs, we apply forced alignment using these anchoring words to get their positions in the audio signal. We refer to these portions of the audio signal matched with their lyrics as islands of confidence. Using a large N_anchor would carry the risk of detecting fewer anchoring words, while a very small value could cause over-segmentation. In our experiments, we chose N_anchor = 5 (Fig. 2).

Fig. 2: Anchor selection. W_n and Ŵ_n are the reference and predicted words respectively. D and S stand for word deletions and substitutions after text alignment. C are the labels for correctly recognized (matching) words.
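A minimal sketch of this anchor-selection rule is given below. It assumes the text alignment is available as per-word labels (C/S/D, as in Fig. 2), which is an illustrative simplification rather than the data structure used in our released code.

```python
# Minimal sketch of anchor selection, assuming the text alignment is given as a
# list of (reference_index, label) pairs where label is 'C' (correct match),
# 'S' (substitution) or 'D' (deletion), as in Fig. 2.

def select_anchors(alignment, n_anchor=5):
    """Return lists of reference word indices forming runs of >= n_anchor matches."""
    anchors, run = [], []
    for ref_idx, label in alignment:
        if label == 'C':
            run.append(ref_idx)
        else:
            if len(run) >= n_anchor:
                anchors.append(run)
            run = []
    if len(run) >= n_anchor:
        anchors.append(run)
    return anchors

# Example: one run of five matching words qualifies as an island of confidence.
alignment = [(0, 'C'), (1, 'C'), (2, 'C'), (3, 'C'), (4, 'C'), (5, 'S'), (6, 'D')]
print(select_anchors(alignment))   # [[0, 1, 2, 3, 4]]
```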
Once detected, the anchoring words are utilized to split the music recording into shorter segments. In order to further alleviate the risk of over-segmenting the recording, we allow a maximum of N_segment anchoring words in each segment. We start segmenting monotonically from the beginning to the very end of the recording. Once N_segment words are spotted, the audio is split with respect to the beginning of the first and the ending of the N_segment-th word. Empirically, we found this approach to function well for N_segment > . If the first word in the original lyrics is not spotted, the beginning of the first segment in the initial voice-activity-based segmentation is used. A similar approach is applied for the last word.

Finally, we can apply forced alignment on the shorter audio segments extracted in the previous step using a smaller beam size. In our experiments, we were able to obtain alignments without any memory issues using a beam size of 30 and a retry beam size of 300, which are much lower than the values reported in [4].
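The anchor-grouping rule used to produce these shorter segments can be sketched as follows. The (start, end) word timings are assumed to come from the forced-alignment pass, and the handling of the recording boundaries is a simplified reading of the fallback described above, not our exact implementation.

```python
# Sketch of the anchor-based segmentation rule: walk through the time-stamped
# anchoring words monotonically and cut a new segment every n_segment words,
# using the start of the first and the end of the last word in each group.
# Input/output formats are illustrative assumptions.

def split_by_anchors(anchor_times, n_segment, recording_start=0.0, recording_end=None):
    """anchor_times: list of (start_sec, end_sec) per anchoring word, in order."""
    segments = []
    for i in range(0, len(anchor_times), n_segment):
        group = anchor_times[i:i + n_segment]
        segments.append((group[0][0], group[-1][1]))
    if segments:
        # Fall back to the recording (or first/last VAR) boundaries at the edges;
        # the paper applies this only when the first/last lyric words are not spotted.
        segments[0] = (recording_start, segments[0][1])
        if recording_end is not None:
            segments[-1] = (segments[-1][0], recording_end)
    return segments

print(split_by_anchors([(1.0, 1.4), (2.0, 2.3), (5.1, 5.5), (7.0, 7.2)],
                       n_segment=2, recording_end=10.0))
# [(0.0, 2.3), (5.1, 10.0)]
```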
4. EXPERIMENTAL SETUP
To test the quality of the output audio-to-lyrics segmentation, we evaluate the precision of the word timings produced by the final-pass alignment. Further, we evaluate the transcription performance on the initial voice-activity segments and on the final segmentation, to gain an insight into whether these segments can be used for further training.
Vocal Extraction:
The alignment and decoding are applied on the separated vocal tracks, in order to minimize the influence of accompanying musical sounds. Additionally, the acoustic model employed in decoding is trained on monophonic singing voice recordings [19]. The output of the vocal source separation has a direct effect on the performance of decoding, and hence on the overall performance of lyrics alignment. There are two mainstream approaches in vocal separation: spectrogram-based and waveform-based models. While spectrogram-based approaches have been more widely used [20, 15], there have been recent successful waveform-based music source separation methods, motivated by capturing the phase information in the signal [16]. To test the effect of the source separation step, we compare a waveform-based model (Demucs) [17] and a state-of-the-art spectrogram-based model (Spleeter) [16] in our experiments.
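For reference, the snippet below shows how the spectrogram-based separator (Spleeter) can be invoked through its documented Python API to obtain the vocal stem; Demucs is normally driven from its own command-line interface, whose flags differ across versions, so it is omitted here. File paths are placeholders, not paths from our experiments.

```python
# Extracting the vocal stem with Spleeter's pretrained 2-stems model
# (vocals / accompaniment). File paths are placeholders; Demucs, the
# waveform-based alternative compared in this paper, is usually run via its CLI.
from spleeter.separator import Separator

def extract_vocals(audio_path: str, output_dir: str) -> None:
    separator = Separator("spleeter:2stems")           # pretrained vocals/accompaniment model
    separator.separate_to_file(audio_path, output_dir)

extract_vocals("song.mp3", "separated/")               # by default writes separated/song/vocals.wav
```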
Lyrics Transcriber:
We use the acoustic model of the lyrics transcriber in [19]. The system was trained on 150 hours of a cappella singing recordings with a non-negligible amount of noise [21]. After decoding with a 4-gram LM, we rescore the lattices with an RNNLM [22] and report this value as the final transcription result.
Data:
For testing, we use the benchmark evaluation set for the lyrics transcription and alignment tasks, namely JamendoLyrics [2], which consists of 20 music recordings (1.1 hours) from a variety of genres including metal, pop, reggae, country, hiphop, R&B and electronic. In addition, the lyrics transcription results are also reported on the Mauch [23] dataset, which consists of 20 English-language pop songs.
5. RESULTS & DISCUSSION

5.1. Audio-to-Lyrics Alignment
The word alignments are evaluated in terms of the standard audio-to-lyrics alignment metrics: mean and median absolute error (AE) [24] in seconds, and the percentage of correctly predicted segments (PCS) with a tolerance window of 0.3 seconds [23]. These metrics are computed for each sample in the evaluation set and the mean of each metric is reported as the final result.

We compare the performance of our system with the most recent successful lyrics alignment systems. SD1 [2] applies alignment on polyphonic recordings using an end-to-end system trained on a private dataset consisting of over 44,000 songs. In SD2, alignment is performed on the source-separated vocal track using the Wave-U-Net [20] architecture. VA [5] also uses an end-to-end model, but trained on the DALI (v1.0) dataset, which has over 200 hours of polyphonic music recordings, and extracts the vocals using Spleeter. GC1 [3] uses the same training data for constructing an acoustic model in the hybrid ASR setting [25] and performs alignment on the original polyphonic mix as well. In addition to these models, we refer to our models which align words to the vocal tracks separated by Demucs and Spleeter as DE1 and DE2 respectively.
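These metrics admit a compact implementation. The sketch below computes the mean and median absolute onset error and the fraction of onsets within the 0.3 s tolerance; the latter is an onset-level simplification and may differ from the exact PCS formulation in [23].

```python
# Sketch of the alignment metrics, assuming word onsets are given in seconds.
# The tolerance-based score below is an onset-level simplification and may
# differ from the exact PCS definition in [23].
import statistics

def alignment_metrics(ref_onsets, hyp_onsets, tolerance=0.3):
    errors = [abs(r - h) for r, h in zip(ref_onsets, hyp_onsets)]
    return {
        "mean_AE": sum(errors) / len(errors),
        "median_AE": statistics.median(errors),
        "within_tolerance": sum(e <= tolerance for e in errors) / len(errors),
    }

print(alignment_metrics([0.5, 2.1, 4.0], [0.6, 2.0, 5.2]))
# approximately: mean_AE 0.47, median_AE 0.10, within_tolerance 0.67
```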
            Mean AE   Median AE   PCS
SD1 [2]     0.82      0.10        0.85
SD2 [2]     0.39      0.10        0.87
VA [5]      0.37      N/A         0.92
GC1 [3]
DE1         0.31
DE2
Table 1: Lyrics alignment results on the Jamendo dataset

According to Table 1, our system that uses the waveform-based source separation model (Demucs) for vocal extraction outperforms other methods that use end-to-end learning, and obtains results competitive with the state of the art (GC1). Using Spleeter, we were able to achieve similar results to VA [5], which also uses Spleeter for source separation. Note that the training data we have used is smaller and less diverse than the datasets used by all other systems, highlighting that there is room for performance improvement in our method. Moreover, the alignment performance is better when using Demucs instead of Spleeter in terms of mean AE and PCS, even though the median AE is the same.

Figure 3 shows a comparison of the systems presented in this paper, DE{1,2}, and GC1 in regard to memory efficiency. The memory usage on the RAM is monitored every 10 seconds during a run over the Jamendo dataset. We executed the code for the experiments on an Intel Xeon E5-2620 with 24 GB of RAM. Figure 3 and Table 2 show that the memory consumption of our system is less than that of GC1 by more than an order of magnitude. The larger standard deviation in our case is potentially due to the varying lengths and complexities of the segmented music signals input to decoding.
Fig. 3: Memory usage on RAM in megabytes (MB)
               GC1              DE{1,2}
Mean (Std.%)   13,740 (8.8%)    (31%)
Max            16,745
Table 2: Statistics on memory usage in MB
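The memory statistics above were collected by polling RAM usage every 10 seconds during decoding. A minimal way to reproduce this kind of trace is sketched below with the psutil package; the target process and sampling interval are assumptions mirroring the described setup, not our exact monitoring script.

```python
# Minimal RSS-sampling sketch (psutil) for reproducing the kind of memory trace
# reported above: poll the resident set size of a target process every 10 s.
# The target PID, interval and sample count are assumptions.
import os
import time
import psutil

def sample_memory_mb(pid: int, interval_s: float = 10.0, n_samples: int = 6):
    proc = psutil.Process(pid)
    samples = []
    for _ in range(n_samples):
        samples.append(proc.memory_info().rss / 1e6)   # resident memory in MB
        time.sleep(interval_s)
    return samples

if __name__ == "__main__":
    # Sample this very process, with a short interval just for demonstration.
    print(sample_memory_mb(os.getpid(), interval_s=1.0, n_samples=3))
```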
5.2. Lyrics Transcription

In order to gain an insight as to whether the final lyrics-to-audio segments can be leveraged as sentence-level annotations for training data, we conduct lyrics transcription experiments. We compare word recognition rates for a pure inference case on the VARs and on the segments extracted as described in Section 3.5. Additionally, we provide comparisons with the state of the art in lyrics transcription from polyphonic music recordings: SD1 [2], GC1 and GC2 [3] (vocals extracted using [20]). For the evaluation, we use the word (WER) and character (CER) error rates computed using the Kaldi toolkit.

According to Table 3, unlike the alignment results, using Spleeter for separating the vocals seems to be more beneficial for lyrics transcription. Notice that while the gap between the recognition rates of DE1 and DE2 is small on the Mauch dataset, it is much larger on the Jamendo dataset. For both DE1 and DE2, the transcription error rates are consistently lower on the audio-to-lyrics segments compared to the VARs, implying that our segmentation method can be exploited in a semi-supervised setting. During pure inference on the VARs, our system outperforms the state of the art on Jamendo, while the performance on the Mauch dataset is still far behind. In all cases, our system outperforms the other systems that apply transcription on separated vocal tracks by a great margin. Note that this could be due to the fact that all of the previous methods use datasets consisting of polyphonic recordings, even though they are much larger in size.
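While the reported scores are computed with the Kaldi scoring tools, WER itself reduces to a length-normalized edit distance; the self-contained sketch below illustrates the computation (CER is obtained identically over character sequences) and is not the Kaldi scorer.

```python
# Self-contained word error rate: Levenshtein distance between reference and
# hypothesis word sequences, normalized by reference length. CER is computed
# the same way over character sequences. This mirrors, but is not, Kaldi's scorer.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution / match
    return dp[-1][-1] / len(ref)

print(word_error_rate("you are my sunshine", "you are sunshine oh"))  # 0.5
```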
                    WER                   CER
                    Mauch     Jamendo     Mauch     Jamendo
SD1 [2]             70.09     77.80       48.90     49.20
GC1 [3]                                   N/A       N/A
GC2 [3]             78.85     71.83       N/A       N/A
DE1 - VAR           60.92     62.55       44.15     47.02
DE1 - segmented     50.44     55.47       38.65     42.11
DE2 - VAR           57.36
DE2 - segmented     49.92
Table 3: Lyrics transcription results

Overall, the results show that the application of lyrics transcription and alignment on separated vocal tracks can achieve competitive performance to state-of-the-art systems working directly on polyphonic music input. It should be noted that the choice of source separation method has a crucial but inconsistent effect on the final transcription and alignment results. While it is not yet possible to draw a general conclusion regarding which method works better for which task, we have shown that the effect of vocal extraction on the performance of these tasks is worth considering when developing music source separation algorithms.
6. CONCLUSION
We presented a novel system that segments polyphonic music recordings with respect to the given lyrics, and further aligns the words with the audio signal. We have reported results competitive with the state of the art while outperforming other end-to-end based models. Through lyrics transcription experiments, we provided evidence supporting the capability of our system to be exploited for generating additional training data for a variety of MIR tasks. As a pilot study, we have conducted the first experiments on the effect of different source separation models on the lyrics transcription and alignment tasks. The recognition rates on separated vocals show that our system performs better than the previous best systems by a considerable margin, while showing comparable performance with the state of the art in ALT from polyphonic music recordings on a public benchmark evaluation set. Moreover, it is shown that our acoustic model can be exploited for both of the ALTA tasks.

It should be noted that the system presented here achieves this performance while requiring considerably lower computational and data resources compared to the best performing published work to date, which is shown via a quantitative comparison of the memory consumption during runtime. From this perspective, this framework is shown to be applicable in use cases where low-resource solutions are required, such as large-scale web services and mobile applications. As an additional advantage, our approach does not rely on a pretrained language model, which makes it possible to extend it to a multilingual setup. As a final remark, we have publicly shared our code for open science and reproducibility (https://github.com/emirdemirel/ASA_ICASSP2021).
7. REFERENCES

[1] Georgi Bogomilov Dzhambazov and Xavier Serra, "Modeling of phoneme durations for alignment between polyphonic audio and lyrics," in Sound and Music Computing Conference (SMC), 2015.
[2] Daniel Stoller, Simon Durand, and Sebastian Ewert, "End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019.
[3] Chitralekha Gupta, Emre Yılmaz, and Haizhou Li, "Automatic lyrics transcription in polyphonic music: Does background music help?," arXiv preprint arXiv:1909.10200, 2019.
[4] Bidisha Sharma, Chitralekha Gupta, Haizhou Li, and Ye Wang, "Automatic lyrics-to-audio alignment on polyphonic music using singing-adapted acoustic models," in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 396–400.
[5] Andrea Vaglio, Romain Hennequin, Manuel Moussallam, Gael Richard, and Florence d'Alché-Buc, "Multilingual lyrics-to-audio alignment," in International Society for Music Information Retrieval Conference (ISMIR), 2020.
[6] Meinard Müller, Frank Kurth, and Tido Röder, "Towards an efficient algorithm for automatic score-to-audio synchronization," in ISMIR, 2004.
[7] Simon Dixon, "An on-line time warping algorithm for tracking musical performances," in IJCAI, 2005, pp. 1727–1728.
[8] Anna Kruspe, "Lyrics alignment using HMMs, posteriorgram-based DTW and phoneme-based Levenshtein alignment," in MIREX 2017 Audio-to-Lyrics Alignment Challenge, 2017.
[9] Gabriel Meseguer-Brocal, Alice Cohen-Hadria, and Geoffroy Peeters, "DALI: A large dataset of synchronized audio, lyrics and notes, automatically created using teacher-student machine learning paradigm," arXiv preprint arXiv:1906.10606, 2019.
[10] Peter Bell, Mark J. F. Gales, Thomas Hain, Jonathan Kilgour, Pierre Lanchantin, Xunying Liu, Andrew McParland, Steve Renals, Oscar Saz, Mirjam Wester, et al., "The MGB challenge: Evaluating multi-genre broadcast media recognition," in Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015, pp. 687–693.
[11] Pedro J. Moreno, Chris Joerg, Jean-Manuel Van Thong, and Oren Glickman, "A recursive algorithm for the forced alignment of very long audio segments," in Fifth International Conference on Spoken Language Processing, 1998.
[12] Peter Bell and Steve Renals, "A system for automatic alignment of broadcast media captions using weighted finite-state transducers," in Workshop on Automatic Speech Recognition and Understanding (ASRU). IEEE, 2015.
[13] Pedro J. Moreno and Christopher Alberti, "A factor automaton approach for the forced alignment of long speech recordings," in International Conference on Acoustics, Speech and Signal Processing. IEEE, 2009.
[14] Emir Demirel, Sven Ahlbäck, and Simon Dixon, "A recursive search method for lyrics alignment," in MIREX 2020 Audio-to-Lyrics Alignment and Lyrics Transcription Challenge, 2020.
[15] Fabian-Robert Stöter, Stefan Uhlich, Antoine Liutkus, and Yuki Mitsufuji, "Open-Unmix: A reference implementation for music source separation," Journal of Open Source Software, 2019.
[16] Romain Hennequin, Anis Khlif, Felix Voituret, and Manuel Moussallam, "Spleeter: A fast and state-of-the-art music source separation tool with pre-trained models," Late-Breaking/Demo Session at ISMIR, 2019.
[17] Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach, "Music source separation in the waveform domain," arXiv preprint arXiv:1911.13254, 2019.
[18] Josef R. Novak, D. Yang, N. Minematsu, and K. Hirose, "Phonetisaurus: A WFST-driven phoneticizer," The University of Tokyo, Tokyo Institute of Technology, 2011.
[19] Emir Demirel, Sven Ahlbäck, and Simon Dixon, "Automatic lyrics transcription with dilated convolutional networks with self-attention," in International Joint Conference on Neural Networks (IJCNN). IEEE, 2020.
[20] Daniel Stoller, Sebastian Ewert, and Simon Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," arXiv preprint arXiv:1806.03185, 2018.
[21] "Smule Sing! 300x30x2 dataset," accessed August 2020, https://ccrma.stanford.edu/damp/.
[22] Hainan Xu, Tongfei Chen, Dongji Gao, Yiming Wang, Ke Li, Nagendra Goel, Yishay Carmiel, Daniel Povey, and Sanjeev Khudanpur, "A pruned RNNLM lattice-rescoring algorithm for automatic speech recognition," in International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018.
[23] Matthias Mauch, Hiromasa Fujihara, and Masataka Goto, "Integrating additional chord information into HMM-based lyrics-to-audio alignment," IEEE Transactions on Audio, Speech, and Language Processing, vol. 20, no. 1, pp. 200–210, 2011.
[24] Annamaria Mesaros and Tuomas Virtanen, "Automatic alignment of music audio and lyrics," in Proceedings of the 11th Int. Conference on Digital Audio Effects (DAFx-08), 2008.
[25] Daniel Povey, Vijayaditya Peddinti, Daniel Galvez, Pegah Ghahremani, Vimal Manohar, Xingyu Na, Yiming Wang, and Sanjeev Khudanpur, "Purely sequence-trained neural networks for ASR based on lattice-free MMI," in Interspeech, 2016.