END-TO-END LYRICS RECOGNITION WITH VOICE TO SINGING STYLE TRANSFER

Sakya Basak$, Shrutina Agarwal$, Sriram Ganapathy$, Naoya Takahashi∗

$Learning and Extraction of Acoustic Patterns (LEAP) Lab, Indian Institute of Science, Bangalore, India
∗Sony Corporation, Tokyo, Japan

This work was funded by Sony Corporation, Tokyo.
ABSTRACT
Automatic transcription of monophonic/polyphonic music is a challenging task due to the lack of availability of large amounts of transcribed data. In this paper, we propose a data augmentation method that converts natural speech to singing voice based on a vocoder based speech synthesizer. This approach, called voice to singing (V2S), performs the voice style conversion by modulating the F0 contour of the natural speech with that of a singing voice. The V2S model based style transfer can generate good quality singing voice, thereby enabling the conversion of large corpora of natural speech to singing voice that is useful in building an E2E lyrics transcription system. In our experiments on monophonic singing voice data, the V2S style transfer provides a significant relative improvement for the E2E lyrics transcription system. We also discuss additional components like transfer learning and lyrics based language modeling to improve the performance of the lyrics transcription system.

Index Terms: Voice-to-singing style transfer, Lyrics Transcription, End-to-end modeling.
1. INTRODUCTION
In music, lyrics constitute the textual component of singing voice. They form an important constituent of the music signal that contributes to the emotional perception of the song [1] and also aids in foreign language learning [2]. In music information extraction, two problems of interest are the automatic alignment of lyrics and the automatic transcription of singing voice. The alignment problem is the task of finding the timing of the word boundaries for the given lyrics with respect to the polyphonic audio [3], while transcription is the task of recognizing the lyrics [4]. Several applications such as generating karaoke, music subtitling [5], query-by-singing [6], keyword spotting, and automatic indexing of music according to transcribed keywords [7] rely on accurate alignment and transcription of music.

The key challenges in the automatic recognition of lyrics are the unique style of singing voice, the high variation of fundamental frequency and pronunciation, the lack of large amounts of transcribed singing data, and background scores. The earlier studies used a phoneme recognition approach [8]. Mesaros et al. [4] adopted an automatic speech recognition (ASR) based approach for phoneme and word recognition of singing vocals in monophonic and polyphonic music. In dealing with polyphonic music, one of the common approaches is to apply a voice source separation module [9]. In a recent work, Gupta et al. [3] attempted an adaptation of the models trained from solo music to polyphonic music for lyrics alignment.

In this paper, we attempt to perform automatic recognition of lyrics by utilizing the large amount of resources available for speech recognition.
The conventional modular approach to ASR consists of several modules like the acoustic model, lexical model and language model [10]. The end-to-end (E2E) ASR is a simplified approach to overcome the limitations of the conventional approach by using a single neural network to perform entirely data driven learning. It consists of a single deep neural network (DNN) model which is directly trained on word, sub-word or character targets, thereby eliminating the need for a hand-crafted pronunciation dictionary. The earliest approach to E2E ASR used the connectionist temporal classification (CTC) cost function to optimize a recurrent neural network (RNN) model [11]. The attention based models proposed recently do not make any conditional independence assumption and attempt to learn an implicit language model using an encoder-decoder attention framework [12]. In order to combine the best of both worlds, Watanabe et al. proposed a hybrid CTC-attention model [13]. In the last year, the performance of E2E models has been further improved with the use of Transformer based architectures [14]. The performance of E2E ASR models can be improved by applying data augmentation techniques to the input features, using methods like time warping, frequency masking, and time masking [15]. However, E2E ASR tends to be data demanding in training, which makes it difficult to adopt this framework for lyrics transcription tasks as there is a considerable lack of large supervised datasets.

In this paper, we propose a novel approach to data augmentation for end-to-end recognition of lyrics in singing voice. The proposed approach, termed voice-to-singing (V2S), converts natural speech to singing voice using a vocoder based speech synthesizer [16]. The V2S model uses the pitch contours from singing voice recordings along with the spectral envelope of the natural speech to perform voice to singing conversion. The proposed V2S approach can generate large amounts of "singing" voice for use in E2E model training. In addition, we investigate a transfer learning approach to leverage a large "singing" speech trained model. The use of source-separated singing voice data from polyphonic music, which is relatively easy to obtain compared to monophonic singing voice data, is also explored. We also develop a language model (LM) suitable for lyrics transcription by mining a large text corpus of lyrics. The experiments are performed on the DALI corpus [17] and a proprietary music dataset provided by Sony. In these experiments, we show that the proposed V2S approach provides significant performance gains over baseline systems trained purely on natural speech.

The key contributions of this work are:
i. We propose the V2S approach for data augmentation to train the E2E lyrics transcription model.
ii. We investigate a transfer learning approach to leverage a large speech corpus and source-separated singing voice data from polyphonic music for the E2E system.
iii. We develop a language model for lyrics transcription by mining a large text corpus of lyrics.
iv. Experimental results on polyphonic and monophonic lyrics transcription show that the proposed V2S data augmentation, transfer learning using speech, source-separated singing voice data, speed perturbation and the lyrics LM significantly improve the word error rate over the baseline system trained on natural speech.

To the best of our knowledge, this paper constitutes one of the earliest efforts in developing E2E systems for automatic lyrics transcription of singing voice.
2. RELATED PRIOR WORK
Recently, Gupta et al. [18] explored singing voice alignment on a cappella singing vocals using the state-of-the-art Google ASR model. The authors segmented the audio into s segments, and the ASR model is used to obtain the time aligned transcription for each audio segment. The transcriptions are later refined using published lyrics and are used in training a conventional ASR system, and an iterative process is then used to refine the alignments. For polyphonic music, Sharma et al. [19] developed a lyrics alignment system, where the ASR models were adapted to singing voices using speaker adaptive training on a small dataset of solo singing voices. In addition, the authors use vocals source-separated from polyphonic music using convolutional neural network (CNN) based U-Net models. In a recent work on E2E models, Stoller et al. [20] investigated the Wave-U-Net to predict character probabilities from raw audio. Short audio snippets ( s) are processed with the end-to-end model, which uses a CNN with CTC training. There have also been efforts in exploring additional features like voicing, energy, auditory, spectral, and chroma features during training of the model for alignment tasks [21].

The background music may affect the intelligibility of the lyrics. Gupta et al. [22] explored the significance of background music by performing genre based learning. Here, the music is divided into three different genres (hip-hop, metal, and pop) and for each genre different phoneme and silence (non-vocal) models are trained using a standard HMM-GMM architecture. The authors report improvements using the genre specific modeling.

With the control of several acoustic features, Saitou et al. [23] proposed a voice conversion approach to transform speech to singing style. An encoder-decoder approach to model-based learning of speech to singing voice conversion was explored by Parekh et al. [24]. However, many of the past methods are not scalable to the large volumes of speech data needed for acoustic model training in an E2E lyrics transcription system.
3. PROPOSED METHODS
In this section, we describe our approach to the V2S data augmentation, the transfer learning using source separated singing voice, and the language model development for lyrics.
The V2S approach converts natural speech to singing voice using a vocoder based speech synthesizer. We use the WORLD synthesizer [16] since it provides high-quality voice with low computational complexity. We first describe the WORLD model, followed by a description of the proposed V2S.
[Figure 1 block diagram: an opera vocal and a natural (male) voice are each decomposed into the fundamental frequency (F0), spectral envelope and aperiodic parameter; the synthesis stage recombines these components into the synthesized singing voice.]

Fig. 1: WORLD vocoder used in the proposed V2S system.
Speech analysis and synthesis systems which provide high quality reconstruction after modification of parameters are useful in various applications like singing synthesis and voice conversion systems. In order to process large amounts of data, the algorithms need to be computationally efficient. The WORLD model is a fast, vocoder based algorithm which generates high quality speech. The speech synthesis system decomposes the speech signal into three components: the fundamental frequency (F0), estimated using the DIO algorithm [25]; the spectral envelope, estimated using the CheapTrick algorithm [26]; and the excitation signal, estimated using the PLATINUM algorithm [27] and denoted as the aperiodic parameter. The F0 information is used to determine the temporal positions of the origin of the vocal cord vibrations. The F0 in the WORLD vocoder is estimated using a series of low-pass filters which gives multiple candidates. Using these multiple candidate estimates for F0, a reliability measure based on the variance of the estimates is used to find the final F0 value.
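As a concrete illustration of this decomposition, the sketch below uses the open-source pyworld bindings to the WORLD vocoder. Note that pyworld exposes the D4C aperiodicity estimator rather than PLATINUM, so this is an approximation of the analysis/synthesis chain described above rather than the exact configuration used here; the helper function names are our own.

```python
import numpy as np
import pyworld as pw
import soundfile as sf

def world_decompose(path):
    """Decompose a (mono) waveform into F0, spectral envelope and aperiodicity."""
    x, fs = sf.read(path)
    x = x.astype(np.float64)          # pyworld expects float64 samples
    f0, t = pw.dio(x, fs)             # coarse F0 candidates (DIO)
    f0 = pw.stonemask(x, f0, t, fs)   # F0 refinement
    sp = pw.cheaptrick(x, f0, t, fs)  # spectral envelope (CheapTrick)
    ap = pw.d4c(x, f0, t, fs)         # aperiodicity (D4C here; the paper cites PLATINUM)
    return f0, sp, ap, fs

def world_synthesize(f0, sp, ap, fs):
    """Re-synthesize a waveform from the three WORLD components."""
    return pw.synthesize(f0, sp, ap, fs)
```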
The approach to data augmentation using the V2S model is shown in Figure 1. We use the WORLD vocoder to independently decompose the natural speech and the singing voice (such as opera vocals) into the constituent components. The F0 contour from the singing voice is then used along with the spectral envelope and the aperiodic parameter from the natural speech and fed to the synthesizer. The synthesized output is the singing voice version of the natural speech.

The western opera vocal dataset consists of both male and female opera singers, and during the synthesis we make sure that the speech and the opera vocals are gender matched. Further, our analysis showed that, instead of randomly matching a natural speech recording with an opera vocal sample, choosing the opera vocal whose average F0 value is closest to that of the speech signal under consideration improves the quality of the synthesized output drastically. To facilitate this operation, we perform the decomposition of the opera vocals in the dataset apriori and also store the average F0 value. Then, for the given speech signal, the decomposition is performed and the average F0 value is computed. The opera vocal in the database with the closest average F0 value is chosen and its F0 contour is used in the synthesis of the singing speech. We did not perform any alignment of the F0 track with the speech signal.

[Figure 2: V2S singing voice conversion of natural speech. (a) Spectrogram of speech sample; (b) Spectrogram of synthesized waveform.]

An example illustration of the synthesized output from the proposed V2S approach is shown in Figure 2b, and the original speech sample is shown in Figure 2a. As seen here, the synthesized output has different harmonicity properties. However, the phonemic activity of the speech is well preserved in the reconstructed output (some audio samples are available at https://github.com/iiscleap/V2S_Samples).

Although V2S can provide large amounts of synthetic singing voice data and yields a significant performance gain over the model trained without V2S data augmentation, as shown in Section 5, there is still a domain mismatch between real and synthetic singing voice since the spectral features from standard speech are used in V2S. However, a large amount of monophonic singing voice data is not available, especially for singing voice of professional artists. To overcome this challenge, we propose a transfer learning approach using source-separated singing voice data. Since polyphonic music with transcriptions is relatively easy to obtain, as in the DALI corpus [17], we first separate the vocal track from polyphonic music using the state-of-the-art DNN-based source separation model D3Net [28]. Then, the E2E lyrics recognition model trained with V2S data augmentation is fine tuned on the source-separated (SS) singing voice data. One can also consider fine tuning the E2E models on polyphonic music data directly. However, we experimentally show in Section 5 that incorporating the singing voice source separation works significantly better for both the polyphonic and monophonic cases.
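As a sketch of the V2S conversion with average-F0 matching described earlier in this section, the snippet below reuses the world_decompose/world_synthesize helpers from the previous sketch. The database layout, gender field and the trim/zero-pad handling of length mismatches are our own assumptions; the paper states only that no alignment of the F0 track with the speech is performed.

```python
import numpy as np

def average_f0(f0):
    """Mean F0 over voiced frames (frames with F0 > 0)."""
    voiced = f0[f0 > 0]
    return float(voiced.mean()) if voiced.size else 0.0

def pick_opera_vocal(speech_mean_f0, opera_db, gender):
    """opera_db: precomputed entries {'f0', 'mean_f0', 'gender'} per opera clip."""
    candidates = [e for e in opera_db if e["gender"] == gender]  # gender matching
    return min(candidates, key=lambda e: abs(e["mean_f0"] - speech_mean_f0))

def v2s_convert(speech_path, opera_db, gender):
    f0_speech, sp, ap, fs = world_decompose(speech_path)   # from the earlier sketch
    opera = pick_opera_vocal(average_f0(f0_speech), opera_db, gender)
    # Borrow the singing F0 contour; trim or zero-pad it to the speech length
    # (an assumption about how the length mismatch is handled).
    n = len(f0_speech)
    f0_sing = np.zeros(n)
    f0_sing[: min(n, len(opera["f0"]))] = opera["f0"][:n]
    return world_synthesize(f0_sing, sp, ap, fs)            # from the earlier sketch
```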
The probability distribution of words and their transition probabilities in lyrics are considerably different from those of standard speech. In particular, lyrics often contain artistic expressions and may potentially violate many grammar usage rules of natural speech. To incorporate this difference, we develop a language model using a combination of text from audio-books and lyrics data from DALI [17], the Sony Singing Voice (SSV) dataset, and web resources. This text data contained lyrics from a variety of genres like hip-hop, pop, classical, country-music, rock and jazz, and consisted of about M lines of text.
Table 1: Performance (WER %) of different E2E architectures trained on natural speech from the Librispeech corpus.
Architecture     | SSV
Hybrid LSTM [13] | 53.4
Transformer [14] | 48.4

Table 2: WERs of models trained with and without V2S data augmentation. Here, SS-DALI denotes source separated DALI and Poly-DALI denotes polyphonic DALI. The results indicated as '–' correspond to conditions where some of the files fail to decode completely.
Models               | SS-DALI Dev./Test | Poly-DALI Dev./Test | SSV
LS (original)        | 47.7 / 49.5       | 80.9 / 86.3         | 48.4
LS with V2S          | 43.7 / 46.8       | – / –               | 41.6
LS with V2S (F0 map) |                   |                     |

Table 3: Perplexity values on various test data for the LM. Here, LS refers to audio-books in the LibriSpeech test and dev set. The LM model in the last row is referred to as the Lyrics LM.
Training Data | LS | SSV | DALI
Audio-book    |    |     |
4. EXPERIMENTAL SETUP

4.1. Dataset
We use the LibriSpeech corpus [29] as the natural speech data corpus in all our experiments. The LibriSpeech corpus contains read English speech sampled at 16 kHz. The training portion of Librispeech contains both female and male speakers, with each speaker providing a limited duration of speech. The language model training data is also released as part of the Librispeech corpus; it contains text from thousands of public domain books amounting to several hundred million tokens. We use the train, test and dev partitions as prescribed in the release [29].

For fine tuning to real singing voice, we use the DALI dataset [17]. This dataset consists of English polyphonic songs with lyrics transcriptions, amounting to a substantial total duration. We split the dataset into training, development, and test partitions. The splits are carefully performed so as not to have any overlap in the artists performing in each of these splits, and they are also gender balanced to avoid any bias in training/testing. For testing on monophonic singing voice, in order to avoid the effect of accompaniment sounds, we also used a proprietary dataset from Sony Corp. termed EnglishSongs (referred to as the Sony Singing Voice, SSV, set in the tables). Each of these songs has an approximate duration of a few minutes and covers a mix of multiple genres, and the singing voice in these recordings is of professional quality. We use fixed-length chunks of the audio for model training. The other transformer model parameters follow the baseline Librispeech setup from the ESPNET toolkit.

Table 4: WER values obtained after training the models on mapped modulated LibriSpeech and normal LibriSpeech. SS-DALI: source separated DALI, Poly-DALI: polyphonic DALI, SP: speed perturbation.

Audiobook LM:
Model                        | SS-DALI Dev./Test | Poly-DALI Dev./Test | SSV
No fine tuning               | 61.4 / 65.3       | 84.4 / 87.3         | 39.7
Fine tuned on SS-DALI        | 46.2 / 49.4       | 77.3 / 80.9         | 39.1
Fine tuned on SS-DALI (SP)   | 42.8 / 44.9       | 75.2 / 78.9         | 36.1
Fine tuned on Poly-DALI (SP) | 56.0 / 58.7       | 59.1 / 61.8         | 45.5

Lyrics LM:
Model                        | SS-DALI Dev./Test | Poly-DALI Dev./Test | SSV
No fine tuning               | 58.9 / 62.5       | 82.2 / 85.3         | 38.1
Fine tuned on SS-DALI        | 44.8 / 47.0       | 75.2 / 78.1         | 38.4
Fine tuned on SS-DALI (SP)   |                   |                     |
Fine tuned on Poly-DALI (SP) | 55.1 / 57.4       |                     |

We train two different language models (LM): an audio-book LM using the text resources from the Librispeech corpus, and a lyrics LM using a combination of the audio-book text with the M lines of lyrics text described in Sec. 3.3. The LMs are recurrent neural network (RNN) models with LSTM cells. The language model is incorporated into the E2E system as described in [12].
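For concreteness, the following is a minimal PyTorch sketch of a subword-level LSTM language model of this kind. The class name, embedding and hidden sizes, and single-layer configuration are illustrative assumptions; in the actual system this corresponds to the RNN LM trained within the ESPNET setup described below.

```python
import torch
import torch.nn as nn

class SubwordLSTMLM(nn.Module):
    """Minimal autoregressive LSTM language model over subword units.
    Sizes here are assumptions for illustration, not the paper's values."""
    def __init__(self, vocab_size, emb_dim=650, hidden_dim=650, num_layers=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, tokens, state=None):
        # tokens: (batch, time) subword ids; returns next-token logits and LSTM state.
        out, state = self.lstm(self.embed(tokens), state)
        return self.proj(out), state
```

During beam-search decoding, the LM scores are combined with the E2E model scores, following the integration described in [12].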
We have used the ESPNET toolkit [30] to perform our E2E recognition experiments. The features used to train the models are log-filterbank features extracted with mel-spaced filters over short analysis windows with a small frame shift. The E2E model used in most of the experiments is based on the Transformer architecture [14]. The encoder and decoder are multi-layer Transformer networks with a fixed number of units in the projection layers, and the attention used is location attention. During training, multiple cost functions are used [14], namely the connectionist temporal classification (CTC) loss and the cross-entropy (CE) loss. The model is trained using Adam optimization, and training is performed for several epochs until the loss saturates on the validation data. The CTC weight is kept fixed during training, and the beam size is kept fixed during decoding. Both LMs (the audio-book LM as well as the lyrics LM) have the same architecture and use sub-word units at the output layer.
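As a rough sketch of this front-end, the log mel-filterbank extraction can be written with torchaudio as below. The 25 ms window, 10 ms shift and 80 mel bins are assumptions based on common ESPnet defaults; the exact values used in the experiments were not preserved here.

```python
import torch
import torchaudio

def logmel_features(wav_path, n_mels=80, win_ms=25, hop_ms=10):
    """Log mel-filterbank features of shape (frames, n_mels).
    Window/shift/bin values are assumed typical defaults, not the paper's."""
    wav, sr = torchaudio.load(wav_path)
    mel = torchaudio.transforms.MelSpectrogram(
        sample_rate=sr,
        n_fft=int(sr * win_ms / 1000),
        win_length=int(sr * win_ms / 1000),
        hop_length=int(sr * hop_ms / 1000),
        n_mels=n_mels,
    )(wav)
    return torch.log(mel + 1e-10).squeeze(0).transpose(0, 1)
```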
5. RESULTS

E2E model architecture:
The first set of results, shown in Table 1, highlights the lyrics transcription results on the SSV data using the Hybrid LSTM E2E architecture [13] and the Transformer architecture [14]. Both models are trained on natural speech. As seen in this table, the Transformer architecture provides improved robustness for the singing voice transcription task even when the model is trained for a speech recognition task [14].
V2S:
Table 2 shows the impact of the proposed V2S data augmentation for training the E2E model. The E2E model trained on natural speech produces very high word error rates (WER) on the polyphonic DALI dataset. The V2S augmentation improves the lyrics transcription performance on the SSV data significantly relative to the natural speech based WER. Further, V2S applied with F0 mapping, where the opera vocals are selected to match the average F0 of the speech file under consideration, further improves the lyrics transcription accuracy. These results validate the effectiveness of the proposed V2S data augmentation and F0 mapping on the SSV dataset.

LM comparison:
The comparison of various language models in terms of perplexity values is shown in Table 3. The audio-book LM refers to the LM trained using the text data from the Librispeech corpus. The perplexity values highlighted here suggest that training the LM on the mixed corpus of speech and lyrics text (from SSV, DALI and the web) provides the best perplexity compared to either fine-tuning the audio-book LM or using only lyrics text for LM training. The best model (last row) in Table 3 is referred to as the lyrics LM in the experiments that follow. Further, the fine-tuned version of the audio-book LM (second row) is referred to as the audio-book LM for the remainder of the experiments.
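For clarity, the perplexity reported here is the exponentiated average negative log-likelihood per token on held-out text. A minimal sketch, assuming an autoregressive LM such as the SubwordLSTMLM sketched earlier (the model interface is our assumption):

```python
import math
import torch
import torch.nn.functional as F

@torch.no_grad()
def perplexity(model, token_ids):
    """Token-level perplexity of an autoregressive LM on a 1-D tensor of subword ids."""
    inp, tgt = token_ids[:-1].unsqueeze(0), token_ids[1:].unsqueeze(0)
    logits, _ = model(inp)                       # (1, T, vocab)
    nll = F.cross_entropy(logits.transpose(1, 2), tgt, reduction="mean")
    return math.exp(nll.item())
```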
Transfer learning:
The impact of fine-tuning the E2E models which were trained with V2S is shown in Table 4. These results show that fine-tuning on the source separated (SS) DALI training data improves the performance on the DALI test data. Further, the application of speed perturbation in training [31] improves the lyrics transcription performance on both the DALI and SSV test data. The application of V2S data augmentation in model training along with speed perturbation improves the lyrics transcription WER on the SSV data by a large relative margin over the natural speech based E2E system. In addition, the WER results of the final system on the SSV and DALI data suggest that the highly challenging task of lyrics transcription on monophonic/polyphonic music with a large variety of music genres can be addressed with partial success using the techniques proposed in this work.
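For reference, speed perturbation as in [31] is typically implemented by resampling the waveform at factors such as 0.9, 1.0 and 1.1. A minimal sketch using torchaudio's sox effects is given below; the specific factors and the use of torchaudio (which requires a sox backend) are assumptions, not details from this work.

```python
import torch
import torchaudio

def speed_perturb(wav: torch.Tensor, sr: int, factor: float) -> torch.Tensor:
    """Change playback speed by `factor` while keeping the output sample rate."""
    out, _ = torchaudio.sox_effects.apply_effects_tensor(
        wav, sr, [["speed", f"{factor}"], ["rate", f"{sr}"]]
    )
    return out

# Usage: create copies of a training utterance at 0.9x, 1.0x and 1.1x speed.
# wav, sr = torchaudio.load("utt.wav")
# augmented = [speed_perturb(wav, sr, f) for f in (0.9, 1.0, 1.1)]
```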
6. SUMMARY
This paper presents a data augmentation method for E2E lyrics transcription. The proposed method, termed voice-to-singing (V2S), modulates natural speech to a singing style by replacing the fundamental frequency contour of the natural speech with that of a singing voice using a vocoder based speech synthesizer. We also propose a transfer learning based approach to leverage a large amount of source-separated real singing voices from polyphonic music. The application of the proposed methods is explored in the design of a lyrics transcription system based on a transformer based E2E model. Various experiments highlight the performance benefits of using the proposed V2S along with lyrics language modeling and transfer learning. The E2E ASR model has been shown to provide useful features for other applications such as singing voice separation [32]. Adapting the E2E lyrics recognition model to such music applications is one of our future directions.

7. REFERENCES
[1] S. O. Ali and Z. F. Peynircioğlu, "Songs and emotions: are lyrics and melodies equal partners?," Psychology of Music, vol. 34, no. 4, pp. 511–534, 2006.
[2] A. J. Good, F. A. Russo, and J. Sullivan, "The efficacy of singing in foreign-language learning," Psychology of Music, vol. 43, no. 5, pp. 627–640, 2015.
[3] C. Gupta, E. Yılmaz, and H. Li, "Acoustic modeling for automatic lyrics-to-audio alignment," arXiv preprint arXiv:1906.10369, 2019.
[4] A. Mesaros and T. Virtanen, "Automatic recognition of lyrics in singing," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, pp. 1–11, 2010.
[5] G. Dzhambazov et al., Knowledge-based probabilistic modeling for tracking lyrics in music audio signals, Ph.D. thesis, Universitat Pompeu Fabra, 2017.
[6] T. Hosoya, M. Suzuki, A. Ito, S. Makino, L. A. Smith, D. Bainbridge, and I. H. Witten, "Lyrics recognition from a singing voice based on finite state automaton for music information retrieval," in ISMIR, 2005, pp. 532–535.
[7] H. Fujihara, M. Goto, and J. Ogata, "Hyperlinking lyrics: A method for creating hyperlinks between phrases in song lyrics," in ISMIR, 2008, pp. 281–286.
[8] M. Gruhne, C. Dittmar, and K. Schmidt, "Phoneme recognition in popular music," in ISMIR, 2007, pp. 369–370.
[9] H. Fujihara, M. Goto, J. Ogata, and H. G. Okuno, "LyricSynchronizer: Automatic synchronization system between musical audio signals and lyrics," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1252–1261, 2011.
[10] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, et al., "Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups," IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 82–97, 2012.
[11] A. Graves and N. Jaitly, "Towards end-to-end speech recognition with recurrent neural networks," in International Conference on Machine Learning (ICML), 2014, pp. 1764–1772.
[12] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, "End-to-end attention-based large vocabulary speech recognition," in IEEE ICASSP, 2016, pp. 4945–4949.
[13] S. Watanabe, T. Hori, S. Kim, J. R. Hershey, and T. Hayashi, "Hybrid CTC/attention architecture for end-to-end speech recognition," IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1240–1253, 2017.
[14] S. Karita, N. Chen, T. Hayashi, T. Hori, H. Inaguma, Z. Jiang, M. Someki, N. E. Y. Soplin, R. Yamamoto, X. Wang, et al., "A comparative study on Transformer vs RNN in speech applications," in IEEE ASRU, 2019, pp. 449–456.
[15] D. S. Park, W. Chan, Y. Zhang, C.-C. Chiu, B. Zoph, E. D. Cubuk, and Q. V. Le, "SpecAugment: A simple data augmentation method for automatic speech recognition," arXiv preprint arXiv:1904.08779, 2019.
[16] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A vocoder-based high-quality speech synthesis system for real-time applications," IEICE Transactions on Information and Systems, vol. E99.D, no. 7, pp. 1877–1884, 2016.
[17] G. Meseguer-Brocal, "The DALI dataset," Feb. 2019.
[18] C. Gupta, R. Tong, H. Li, and Y. Wang, "Semi-supervised lyrics and solo-singing alignment," in ISMIR, 2018.
[19] B. Sharma, C. Gupta, H. Li, and Y. Wang, "Automatic lyrics-to-audio alignment on polyphonic music using singing-adapted acoustic models," in IEEE ICASSP, 2019, pp. 396–400.
[20] D. Stoller, S. Durand, and S. Ewert, "End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model," in IEEE ICASSP, 2019, pp. 181–185.
[21] C. Gupta, E. Yilmaz, and H. Li, "Acoustic modeling for automatic lyrics-to-audio alignment," in INTERSPEECH, 2019.
[22] C. Gupta, E. Yılmaz, and H. Li, "Automatic lyrics alignment and transcription in polyphonic music: Does background music help?," in IEEE ICASSP, 2020, pp. 496–500.
[23] T. Saitou, M. Goto, M. Unoki, and M. Akagi, "Speech-to-singing synthesis: Converting speaking voices to singing voices by controlling acoustic features unique to singing voices," in IEEE WASPAA, 2007, pp. 215–218.
[24] J. Parekh, P. Rao, and Y.-H. Yang, "Speech-to-singing conversion in an encoder-decoder framework," in IEEE ICASSP, 2020, pp. 261–265.
[25] M. Morise, H. Kawahara, and H. Katayose, "Fast and reliable F0 estimation method based on the period extraction of vocal fold vibration of singing voice and speech," in Audio Engineering Society Conference: 35th International Conference: Audio for Games, 2009.
[26] M. Morise, "CheapTrick, a spectral envelope estimator for high-quality speech synthesis," Speech Communication, vol. 67, pp. 1–7, 2015.
[27] M. Morise, "PLATINUM: A method to extract excitation signals for voice synthesis system," Acoustical Science and Technology, vol. 33, no. 2, pp. 123–125, 2012.
[28] N. Takahashi and Y. Mitsufuji, "D3Net: Densely connected multidilated DenseNet for music source separation," arXiv preprint arXiv:2010.01733, 2020.
[29] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "Librispeech: An ASR corpus based on public domain audio books," in IEEE ICASSP, 2015, pp. 5206–5210.
[30] S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, "ESPnet: End-to-end speech processing toolkit," in INTERSPEECH, 2018.
[31] T. Ko, V. Peddinti, D. Povey, and S. Khudanpur, "Audio augmentation for speech recognition," in INTERSPEECH, 2015.
[32] N. Takahashi, M. K. Singh, S. Basak, P. Sudarsanam, S. Ganapathy, and Y. Mitsufuji, "Improving voice separation by incorporating end-to-end speech recognition," in IEEE ICASSP, 2020.