Exploring Aligned Lyrics-Informed Singing Voice Separation
Chang-Bin Jeon, Hyeong-Seok Choi and Kyogu Lee
Department of Intelligence and Information, Music and Audio Research Group (MARG), Center for Superintelligence, Seoul National University
{vinyne, kekepa15, kglee}@snu.ac.kr
ABSTRACT
In this paper, we propose a method of utilizing aligned lyrics as additional information to improve the performance of singing voice separation. We combined a highway network-based lyrics encoder with the Open-unmix separation network and show that the model trained with the aligned lyrics indeed performs better than the uninformed model. The question remains whether the performance increase is actually due to the phonetic content of the informed aligned lyrics. To this end, we investigated the source of the performance increase in multifaceted ways by observing the change in performance when incorrect lyrics were given to the model. Experimental results show that the model can use not only the vocal activity information but also the phonetic content of the aligned lyrics.
1. INTRODUCTION
Singing voice separation is one of the most widely studied areas in the field of audio signal processing. In particular, its importance is greatly emphasized because it can contribute to the pre-processing step of research in various fields of Music Information Retrieval (MIR), such as automatic music transcription and automatic lyrics alignment. With the recent development of deep neural networks, a number of music source separation studies have been published and have shown excellent performance. These studies share the common feature of separating music sources by using only information from the sound source itself, such as a 1-dimensional waveform [5, 19] or a 2-dimensional spectrogram [9, 20, 23].

One of the distinguishing characteristics that differentiate music signals from other audio signals is that corresponding music scores or lyrics usually exist. Therefore, several studies have attempted to separate the sources by utilizing additional information beyond the sound source itself. For example, additional information such as pitch [1] or even the whole score [26] can be used as a prior to help separate the source of interest from the mixture. In general, however, music scores for certain songs are not readily available, while lyrics can easily be collected on the web.

Lyrics are particularly closely related to the singing voice and thus have promising potential as additional information for singing voice separation. Recent studies proposed ways of bringing linguistic features extracted from an end-to-end automatic speech recognition model [24] or a voice conversion model [3] into the singing voice separation framework. However, the use of explicit lyrics information has not been studied enough so far, which motivates us to study the possibility of lyrics-informed singing voice separation. We expect that singing voice separation systems can benefit from lyrics information because of the rich information contained in phonetic features such as formant frequencies.

To utilize the lyrics, we combined a highway network-based lyrics encoder [11] with the current state-of-the-art music source separation network, Open-unmix [20]. In addition, we tried two conditioning methods, 1. local conditioning and 2. concatenation, and compared their performance. Note that the alignment between the lyrics and the songs is itself another separate line of research [6, 18]. For our study, we assume that the alignment is already done and focus only on the use of the aligned lyrics.

The information in the aligned lyrics can be seen from two perspectives: 1. the timing information of vocal activity, and 2. phonetic information. Therefore, it is important to check whether the network uses the phonetic information beyond the vocal activity information. Various evaluations were conducted to examine whether the phonetic information of the aligned lyrics actually contributes to improving the performance. We found that the proposed model trained with the aligned lyrics clearly shows better performance than the baseline model trained without any additional information. Furthermore, the experimental results show that the performance of the proposed model even exceeds that of the model trained only with additional vocal activity information. To the best of our knowledge, this is the first research to directly use lyrics information for a singing voice separation task.

Figure 1: The structure of the baseline Open-unmix network.
2. RELATED WORK

2.1 Informed Source Separation
Several studies using machine learning algorithms other than deep neural networks have attempted to separate the singing voice with side information. For example, robust Principal Component Analysis (rPCA) was used with additional vocal activity information [2]. Also, rPCA and Non-negative Matrix Factorization (NMF) were used with pronounced lyrics [4]. Although only a few studies have tried to use additional information for singing voice separation with deep neural networks, it was reported that vocal activity information can be used as an input to the network along with the spectrogram to enhance the performance of a singing voice separation network [14]. Very recently, in the speech enhancement field, attempts have been made to utilize text information to increase separation performance [15].
2.2 Open-unmix

Open-unmix [20] is the state-of-the-art music source separation network on the MUSDB18 dataset [13]. It consists of 3 bi-directional Long Short-Term Memory (LSTM) layers for source separation with 3 additional fully-connected layers. Batch normalization [8] is used after every fully-connected layer, and a skip connection [7] is used between the inputs and outputs of the 3 consecutive bi-directional LSTM layers. Trainable input and output scalers along the frequency axis are also special features of Open-unmix, differentiating it from other studies that use decibel scales.
Figure 2: The structure of the lyrics encoder.

We used the Open-unmix network as our baseline model because we wanted to check whether the aligned lyrics information could improve the performance of the current state-of-the-art model. The channel inputs and outputs are mono in our study, although the original study used stereo. This is because our singing dataset is made up of clean singing voices without any reverberation, chorus, or doubling, as opposed to the singing tracks in MUSDB18, which are already processed for stereo. Details and the full structure of the baseline Open-unmix are illustrated in Figure 1.
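For readers who want a concrete picture, the following is a loose PyTorch sketch of the baseline architecture as we describe it above, not the reference Open-unmix implementation; the layer sizes, the tanh/ReLU placement, and the omission of the trainable input/output scalers are simplifications on our part.

```python
import torch
import torch.nn as nn

class SimplifiedOpenUnmix(nn.Module):
    """Loose sketch: FC -> 3-layer BLSTM (with skip connection) -> 2 FCs,
    with batch normalization after each fully-connected layer.
    The trainable frequency-axis scalers are omitted for brevity."""
    def __init__(self, n_bins=513, hidden=512):
        super().__init__()
        self.fc1 = nn.Linear(n_bins, hidden)
        self.bn1 = nn.BatchNorm1d(hidden)
        self.blstm = nn.LSTM(hidden, hidden // 2, num_layers=3,
                             bidirectional=True, batch_first=True)
        self.fc2 = nn.Linear(hidden * 2, hidden)  # BLSTM output + skip input
        self.bn2 = nn.BatchNorm1d(hidden)
        self.fc3 = nn.Linear(hidden, n_bins)

    def forward(self, spec):  # spec: (batch, frames, freq_bins) magnitudes
        x = torch.tanh(self.bn1(self.fc1(spec).transpose(1, 2)).transpose(1, 2))
        h, _ = self.blstm(x)
        x = torch.cat([x, h], dim=-1)             # skip connection around BLSTM
        x = torch.relu(self.bn2(self.fc2(x).transpose(1, 2)).transpose(1, 2))
        mask = torch.relu(self.fc3(x))            # non-negative frequency mask
        return mask * spec                        # masked mixture spectrogram
```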
2.3 Singing Voice Synthesis

Singing voice synthesis and lyrics-informed singing voice separation share a similar framework in that lyrics serve as the input and singing voice spectrograms as the output. Recently, [11] proposed a singing voice synthesis network that succeeds in creating high-quality singing based on 60 Korean songs sung by a single singer. It is based on a Text-to-Speech model [22], which consists of 1-dimensional convolutional neural networks and highway networks [17]. Therefore, we borrowed the idea of using the highway network-based lyrics encoder and integrated it into the source separation network.
3. PROPOSED METHOD

3.1 Lyrics Encoder
The detailed structure of the lyrics encoder is shown in Figure 2, and the highway network used in the lyrics encoder is defined as follows:

y = ReLU(x ∗ W_H) · σ(x ∗ W_T) + x · (1 − σ(x ∗ W_T)),   (1)

where x and y are the input and output of the network, and · refers to element-wise multiplication. x ∗ W_H and x ∗ W_T are 1-dimensional convolution layers with the same input and output channel sizes; biases are omitted in Eqn (1). Zero-padding was applied to keep the input and output lengths equal in every convolution layer. Dropout [16] with a rate of 0.05 was applied after the activation functions. Dilated convolutions [27] were used to expand the receptive field to 165 frames, about seven times larger than without the dilation; in our experimental settings, 165 frames correspond to about 1.915 seconds.
3.2 Local Conditioning

The local conditioning method [11] was used to insert the encoded lyrics information into the singing voice separation network. The local conditioning is defined as follows:

y = ReLU(x ∗ W_f + L_f) · σ(x ∗ W_g + L_g),   (2)

where x ∗ W_f and x ∗ W_g are 1-dimensional convolution layers with the same input and output channel sizes, and L_f and L_g are features split equally along the channel axis from the output of the lyrics encoder followed by a 1-dimensional convolution layer. This convolution layer, with filter size 1, was added on the output of the lyrics encoder so that L_f and L_g each have a channel size of 512. σ refers to the sigmoid activation function. Details of the full structure are in Figure 3a.

3.3 Concatenation

A concatenation method, a simple but powerful conditioning method, was also used to pass the encoded lyrics information into the singing voice separation network. By concatenating the output of the first fully-connected layer with the lyrics encoder output, the channel size of the LSTM input becomes 1024. The channel size of the second fully-connected layer input then becomes 1536 through the skip connection of the LSTM input and output. Details of the full structure are in Figure 3b.

Figure 3: The structure of the Open-unmix networks combined with the lyrics encoder using (a) the local conditioning and (b) the concatenation method.
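A minimal sketch of the gated local conditioning in Eqn (2), assuming PyTorch; the 512-channel sizes follow the figures above, while the filter size of 1 for the gating convolutions is our assumption.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LocalConditioning(nn.Module):
    """Sketch of Eqn (2): lyrics features gate the separator activations."""
    def __init__(self, channels=512, lyrics_channels=512):
        super().__init__()
        self.conv_f = nn.Conv1d(channels, channels, 1)
        self.conv_g = nn.Conv1d(channels, channels, 1)
        # The filter-size-1 conv on the lyrics encoder output; its 2*channels
        # output is split along the channel axis into L_f and L_g.
        self.cond = nn.Conv1d(lyrics_channels, 2 * channels, 1)

    def forward(self, x, lyrics):                 # both: (batch, C, frames)
        l_f, l_g = self.cond(lyrics).chunk(2, dim=1)
        return F.relu(self.conv_f(x) + l_f) * torch.sigmoid(self.conv_g(x) + l_g)

# The concatenation variant would instead be, e.g.:
#   torch.cat([fc1_output, lyrics_features], dim=-1)  # -> 1024-channel LSTM input
```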
Singer  Gender  Train  Validation  Test  Total
1       Female     79           5     8     92
2       Male        8           2     0     10
3       Female      8           2     0     10
4       Female      9           0     1     10
5       Female      8           0     1      9
6       Male       10           0     0     10
7       Female      8           0     2     10
8       Female      9           1     0     10
9       Male        9           0     1     10
10      Male        7           1     1      9
11      Female      7           3     0     10
12      Female      0           5     5     10
13      Male        0           0     1      1
Total             162          19    20    201

Table 1: The composition of our singing dataset.
4. EXPERIMENTS

4.1 Dataset
Here we used a total of 201 Korean pop songs sung by 13 amateur singers as the target clean singing sources. This dataset has a total length of 11 hours and 44 minutes. Of these, we used 162 songs (9h 3m) for training, 19 songs (1h 7m) for validation, and 20 songs (1h 7m) for the test dataset. The detailed composition of the dataset is described in Table 1.

We aligned Korean syllables following [11]: one Korean syllable is made up of an onset (consonant), a nucleus (vowel), and a coda (consonant); we aligned the onset and coda to 4 frames each, and the nucleus to the remaining frames. An example of the alignment applied to the spectrogram is shown in Figure 6.

A total of 19,113 instrumental songs were used as accompaniments for training the networks because we did not have real accompaniment tracks corresponding to the singing voice dataset. Since various studies using the MUSDB18 [13] or DSD100 [12] datasets also used random mixing techniques, i.e., creating a random accompaniment for each iteration that was unrelated to the original singing, we decided that using arbitrary accompaniments would not be a problem for training. In addition, by fixing specific accompaniments for the test singing voice dataset, we can reasonably identify how the phonetic features contained in the aligned lyrics changed the performance of the network, which is exactly what we wanted to identify. Therefore, for the validation and test datasets, we randomly chose 19 and 20 instrumental songs, respectively, each longer than the corresponding singing data, and shortened them to the same length as the singing.
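To make the alignment rule concrete, here is a hypothetical sketch of expanding one aligned syllable to frame-level symbols; the function name and string symbols are illustrative only.

```python
def expand_syllable(onset, nucleus, coda, n_frames, boundary=4):
    """Onset and coda each take `boundary` frames; the nucleus fills the rest,
    following the alignment rule of [11] described above."""
    frames = [onset] * boundary
    frames += [nucleus] * max(n_frames - 2 * boundary, 0)
    frames += [coda] * boundary
    return frames[:n_frames]

# A syllable such as "kang" (k / a / ng) lasting 12 frames:
print(expand_syllable("k", "a", "ng", 12))
# ['k', 'k', 'k', 'k', 'a', 'a', 'a', 'a', 'ng', 'ng', 'ng', 'ng']
```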
4.2 Training

In our singing dataset, the number of songs recorded by the first singer outnumbers the others, accounting for about 46 percent of the total. To prevent bias toward a particular singer when training the networks, a singer was first selected with equal probability when constructing the batch for each iteration, and the vocal source used for training was sampled only from the songs recorded by the selected singer.

Figure 4: An example of inserting the aligned lyrics into the networks.

Mono sound sources with a sample rate of 22050 Hz were used in the experiments. The FFT size and window size were set to 1024 samples (0.0464 seconds) to convert them into spectrograms, and the Short-time Fourier transform (STFT) hop size to 256 samples (0.0116 seconds). The Adam optimization method [10] was used for training with a learning rate of 0.001, β₁ = 0.9, and β₂ = 0.999. A mean squared error (MSE) loss between the ground truth and the outputs of the models was used in our study. We trained the models for 500 epochs, calculating the validation loss every epoch. The learning rate was reduced to 30 percent of its value if there was no decrease in validation loss for 25 epochs, and early stopping was applied after 50 epochs without a decrease in validation loss.

4.3 Evaluation

Here we briefly summarize the experiments presented in Section 5, which were conducted in three ways. First, in Section 5.1, we compare and analyze how much performance improvement there is between the baseline model trained without the lyrics and the model trained with the aligned lyrics. Second, in Section 5.2, we check whether the network exploits the vocal activity information included in the aligned lyrics; the lyrics include both vocal activity information and phonetic information, and the network is thus expected to use the vocal activity information correctly. Third, in Section 5.3, we check whether the given input is actually being used by the network trained with the aligned lyrics; this experiment was done by giving incorrect inputs in the evaluation stage, which are expected to reduce the performance significantly.

For network performance evaluation, Signal-to-Distortion Ratio (SDR), Signal-to-Interference Ratio (SIR), and Signal-to-Artifact Ratio (SAR) scores [25] were computed with the museval Python library [21].
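As a rough illustration of the metric computation, evaluating a pair of estimated sources with museval might look like the sketch below; the dummy random signals stand in for real references and estimates.

```python
import numpy as np
import museval

sr = 22050                                   # sample rate used in our experiments
rng = np.random.default_rng(0)

# Dummy 10-second (vocals, accompaniment) pairs;
# museval expects shape (n_sources, n_samples, n_channels).
references = rng.standard_normal((2, 10 * sr, 1))
estimates = references + 0.1 * rng.standard_normal((2, 10 * sr, 1))

# 1-second evaluation frames, matching the per-frame scores in Section 5.
sdr, isr, sir, sar = museval.evaluate(references, estimates, win=sr, hop=sr)
print(np.nanmedian(sdr, axis=1))             # per-source median over frames
```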
5. RESULTS
5.1 Separation Performance

The configuration of the models trained in our experiments is given in Table 2. We trained four models each for the local conditioning and concatenation methods. model 1 is the baseline model trained without the lyrics encoder. model 2 is the model trained with meaningless 0-value inputs to the lyrics encoder; this model only serves to check the performance change caused by the extended network capacity. model 3 is the model trained with only vocal activity information given to the lyrics encoder; we simply used the value 0 for unvoiced sections and 1 for voiced sections so that the 128-dimensional embedding can learn a useful meaning from it. model 4 is the model trained with the aligned lyrics information. For both model 3 and model 4, note that the value 0 has a clear meaning, unvoiced sections, unlike the meaningless 0 value in model 2. Since the baseline model is the same for the local conditioning and concatenation methods, we trained a total of 7 models for the experiments. Except for model 1, we prefix the model names with the abbreviations of the local conditioning and concatenation methods, LC and CC, for convenience; for example, the model trained with the aligned lyrics and the local conditioning method is LC-model 4.

Model name   Inputs to the lyrics encoder
model 1      None
model 2      Meaningless inputs (all 0)
model 3      Vocal activity information
model 4      Aligned lyrics

Table 2: The description of each model in our experiments.

Figure 5: Examples of the mixture, ground-truth vocal, and separated vocal spectrograms of the baseline model 1 and CC-model 4.

The quantitative performance evaluation scores of the models are shown in Table 3. Median scores are the median over the 20 test tracks of the per-track median over frames (a median of frames, then a median of tracks); mean scores are computed analogously with means. Each frame was set to 1 second.

Table 3: Evaluation scores (median and mean SDR, SIR, and SAR) of our singing voice separation models: model 1 and the LC/CC variants of models 2-4. All scores are in [dB] scale.

It was confirmed that the separation performance of both LC-model 4 and CC-model 4 improved over model 1. This implies that the aligned lyrics information can serve as a helpful feature for singing voice separation networks. Comparing with LC-model 3 and CC-model 3, we could verify that the clear performance gains come not only from the lyrics alignment information but also from the phonetic features of the lyrics itself. It was also confirmed that the performance gains do not come merely from the growth in network capacity, given that there is no significant difference between the performance of model 1 and model 2. The spectrograms of a separated sample are shown in Figure 5.

Despite the expectation that vocal activity information would be powerful information for the networks, the performance gains observed in LC-model 3 were very slight, much smaller than the improvements achieved by CC-model 3. From this, we conclude that the concatenation method is slightly better at making the networks reflect the vocal activity information. Nevertheless, we consider both conditioning methods effective for conditioning the networks on the aligned lyrics information.
5.2 Vocal Activity Information

In this section, we quantitatively assess how well the networks leverage the vocal activity information of the aligned lyrics. The purpose is to verify that the models have not been trained to focus on only one of the vocal activity information and the phonetic information of the aligned lyrics, both of which are critical for the separation performance.

To this end, the separated spectrogram values were divided by the largest value of each source for normalization, so that the minimum and maximum values become 0 and 1. Then, the energy at each time frame was summed to create a vector containing the vocal activity information. Vocal activity was decided to exist or not based on whether the values were larger or smaller than 0.1, as was done in [14]. Precision, recall, and F1 scores were calculated from the created vocal activity vectors, taking the places where the lyrics exist as the ground-truth voiced sections. The scores are given in Table 4.

Figure 6: An example of making the vocal activity vector from the separated vocal spectrogram.

From the results in Table 4, we confirmed that model 4 separates the vocal source while reflecting the vocal timing information more accurately than model 1 for both lyrics conditioning methods. It was also confirmed that model 3 achieved higher scores on all measures than model 4. This is a reasonable result because model 3 was trained with vocal activity information only, while model 4 needed to learn how to leverage both vocal activity information and phonetic information appropriately during training. Nevertheless, the F1 score differences were negligible, which means model 4 was also capable of reflecting timing information as well as model 3.

Table 4: Precision, recall, and F1 scores evaluating how well the networks used the vocal activity information from the aligned lyrics, for model 1 and the LC/CC variants of models 2-4.
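A minimal NumPy sketch of the vocal activity decision described above, assuming a magnitude spectrogram as input; the exact normalization order is our reading of the description.

```python
import numpy as np

def vocal_activity(mag_spec, threshold=0.1):
    """mag_spec: magnitude spectrogram of a separated vocal, (freq_bins, frames).
    Returns a boolean per-frame voiced/unvoiced vector."""
    spec = mag_spec / mag_spec.max()          # normalize values into [0, 1]
    energy = spec.sum(axis=0)                 # sum the energy per time frame
    energy = energy / energy.max()            # rescale the per-frame vector
    return energy > threshold                 # voiced where above 0.1, as in [14]

def precision_recall_f1(pred, truth):
    """pred, truth: boolean per-frame vectors; truth = where the lyrics exist."""
    tp = np.sum(pred & truth)
    precision = tp / max(pred.sum(), 1)
    recall = tp / max(truth.sum(), 1)
    f1 = 2 * precision * recall / max(precision + recall, 1e-8)
    return precision, recall, f1
```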
5.3 Evaluation with Incorrect Lyrics Inputs

To check whether model 4 effectively uses the information in the lyrics, we observed the performance change when incorrect lyrics were given as input during the evaluation stage. The results are shown in Table 5.

If the networks had learned to use the information in the lyrics effectively, they would be expected to output silence when lyrics indicating unvoiced sections are given in the evaluation step. To validate this assumption, we inserted 0 values (Zero) into the lyrics encoder of model 4 in the evaluation step. As expected, almost every sound was erased from the mixture with only a little noise left, as seen in Figure 7, and a critical performance degradation, over 10 dB in SDR score, occurred.

Furthermore, performance degradation was observed when all the lyrics were replaced with random values (Random). This also shows that the network depends significantly on the encoded lyrics information and that the proposed conditioning method is applied effectively.

Next, we experimented to see whether the network is able to use the phonetic information included in the lyrics. We show this by removing all the phonetic information from the aligned lyrics while keeping the vocal activity information. More specifically, all the voiced sections were replaced with random values while the unvoiced sections were left intact (VA+Random). Interestingly, the performance was still far below that of the model tested on the aligned lyrics, which indicates that the network can reflect the phonetic information in the separation process.

It is noteworthy that the SIR scores of VA+Random are not much different from those of the aligned lyrics input (AlignedLyrics). Since SIR scores are heavily related to the accompaniment remaining in the separated singing voice, we infer that the networks were still capable of removing the accompaniment using only the vocal activity information. On the other hand, the impact on the SDR and SAR scores was significant. This implies that while the network was able to erase the accompaniment in unvoiced sections using vocal activity information only, it removed the accompaniment better in voiced sections by using the phonetic information.

In the experiments using the Random and VA+Random inputs, median values over 5 experimental runs with different random seeds were taken.

Table 5: Performance comparisons (SDR, SIR, and SAR) of LC-model 4 and CC-model 4 when different inputs are given in the evaluation stage. Zero: 0-value inputs. Random: random-value inputs. VA+Random: voiced sections replaced with random values. AlignedLyrics: aligned lyrics (the proposed method). All scores are in [dB] scale.

Figure 7: Examples of the separated vocal spectrograms when incorrect inputs and the correct aligned lyrics are given to LC-model 4. The dashed line shows the parts enhanced when the aligned lyrics are used; note that the region is closely related to the formant frequencies.
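For completeness, here is a hypothetical sketch of constructing the four evaluation-stage inputs of Table 5, assuming the lyrics arrive as frame-level token ids with 0 marking unvoiced frames; the vocabulary size is an arbitrary placeholder.

```python
import numpy as np

def make_eval_input(aligned, mode, n_tokens=52, seed=0):
    """aligned: frame-level token ids (0 = unvoiced). Returns the corrupted
    (or intact) lyrics input for the chosen evaluation condition."""
    rng = np.random.default_rng(seed)
    if mode == "Zero":                         # everything marked unvoiced
        return np.zeros_like(aligned)
    if mode == "Random":                       # all frames randomized
        return rng.integers(1, n_tokens, size=aligned.shape)
    if mode == "VA+Random":                    # keep timing, destroy phonetics
        out = aligned.copy()
        voiced = aligned != 0
        out[voiced] = rng.integers(1, n_tokens, size=int(voiced.sum()))
        return out
    return aligned                             # "AlignedLyrics": proposed input
```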
6. CONCLUSION
In this study, we proposed an integrated framework combining a lyrics encoder with the state-of-the-art Open-unmix separation network. Local conditioning and concatenation methods were shown to effectively condition the aligned lyrics into the singing voice separation networks. Through various experiments, it was confirmed that the phonetic information of the aligned lyrics, as well as the vocal activity information, can contribute to the performance improvements. We plan to use unaligned lyrics for singing voice separation in future work.
7. ACKNOWLEDGEMENTS
This work was supported partly by Kakao and Kakao Brain corporations, and partly by the Next-Generation Information Computing Development Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Science and ICT (NRF-2017M3C4A7078548).
8. REFERENCES

[1] Estefanía Cano, Gerald Schuller, and Christian Dittmar. Pitch-informed solo and accompaniment separation towards its use in music education applications. EURASIP Journal on Advances in Signal Processing, 2014(1):23, 2014.

[2] Tak-Shing Chan, Tzu-Chun Yeh, Zhe-Cheng Fan, Hung-Wei Chen, Li Su, Yi-Hsuan Yang, and Roger Jang. Vocal activity informed singing voice separation with the iKala dataset. In ICASSP, pages 718–722. IEEE, 2015.

[3] Pritish Chandna, Merlijn Blaauw, Jordi Bonada, and Emilia Gómez. Content based singing voice extraction from a musical mixture. In ICASSP 2020, pages 781–785. IEEE, 2020.

[4] Z. Chen, P.-S. Huang, and Y.-H. Yang. Spoken lyrics informed singing voice separation. In Proc. HAMR, 2013.

[5] Alexandre Défossez, Nicolas Usunier, Léon Bottou, and Francis Bach. Music source separation in the waveform domain. arXiv preprint arXiv:1911.13254, 2019.

[6] Chitralekha Gupta, Emre Yılmaz, and Haizhou Li. Acoustic modeling for automatic lyrics-to-audio alignment. arXiv preprint arXiv:1906.10369, 2019.

[7] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[8] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[9] Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde. Singing voice separation with deep U-Net convolutional networks. In ISMIR, 2017.

[10] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[11] Juheon Lee, Hyeong-Seok Choi, Chang-Bin Jeon, Junghyun Koo, and Kyogu Lee. Adversarially trained end-to-end Korean singing voice synthesis system. arXiv preprint arXiv:1908.01919, 2019.

[12] Antoine Liutkus, Fabian-Robert Stöter, Zafar Rafii, Daichi Kitamura, Bertrand Rivet, Nobutaka Ito, Nobutaka Ono, and Julie Fontecave. The 2016 signal separation evaluation campaign. In International Conference on Latent Variable Analysis and Signal Separation, pages 323–332. Springer, 2017.

[13] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner. The MUSDB18 corpus for music separation, December 2017.

[14] Kilian Schulze-Forster, Clément Doire, Gaël Richard, and Roland Badeau. Weakly informed audio source separation. In WASPAA, pages 273–277. IEEE, 2019.

[15] Kilian Schulze-Forster, Clement S. J. Doire, Gaël Richard, and Roland Badeau. Joint phoneme alignment and text-informed speech separation on highly corrupted speech. In ICASSP 2020, pages 7274–7278. IEEE, 2020.

[16] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[17] Rupesh Kumar Srivastava, Klaus Greff, and Jürgen Schmidhuber. Highway networks. arXiv preprint arXiv:1505.00387, 2015.

[18] Daniel Stoller, Simon Durand, and Sebastian Ewert. End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model. In ICASSP 2019, pages 181–185. IEEE, 2019.

[19] Daniel Stoller, Sebastian Ewert, and Simon Dixon. Wave-U-Net: A multi-scale neural network for end-to-end audio source separation. arXiv preprint arXiv:1806.03185, 2018.

[20] F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji. Open-Unmix - a reference implementation for music source separation. Journal of Open Source Software, 2019.

[21] Fabian-Robert Stöter, Antoine Liutkus, and Nobutaka Ito. The 2018 signal separation evaluation campaign. In Latent Variable Analysis and Signal Separation: 14th International Conference, LVA/ICA 2018, Surrey, UK, pages 293–305, 2018.

[22] Hideyuki Tachibana, Katsuya Uenoyama, and Shunsuke Aihara. Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention. In ICASSP, pages 4784–4788. IEEE, 2018.

[23] Naoya Takahashi, Nabarun Goswami, and Yuki Mitsufuji. MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation. In IWAENC, pages 106–110. IEEE, 2018.

[24] Naoya Takahashi, Mayank Kumar Singh, Sakya Basak, Parthasaarathy Sudarsanam, Sriram Ganapathy, and Yuki Mitsufuji. Improving voice separation by incorporating end-to-end speech recognition. In ICASSP 2020, pages 41–45. IEEE, 2020.

[25] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte. Performance measurement in blind audio source separation. IEEE Transactions on Audio, Speech, and Language Processing, 14(4):1462–1469, 2006.

[26] John F. Woodruff, Bryan Pardo, and Roger B. Dannenberg. Remixing stereo music with score-informed source separation. In ISMIR, pages 314–319, 2006.

[27] Fisher Yu and Vladlen Koltun. Multi-scale context aggregation by dilated convolutions. arXiv preprint arXiv:1511.07122, 2015.