Depthwise Separable Convolutions Versus Recurrent Neural Networks for Monaural Singing Voice Separation
Pyry Pyykkönen*, Stylianos I. Mimilakis†, Konstantinos Drossos‡, and Tuomas Virtanen‡
* 3D Media Research Group, Tampere University, Tampere, Finland. Email: {firstname.lastname}@tuni.fi
† Semantic Music Technologies Group, Fraunhofer-IDMT, Ilmenau, Germany. Email: [email protected]
‡ Audio Research Group, Tampere University, Tampere, Finland. Email: {firstname.lastname}@tuni.fi

K. Drossos and T. Virtanen would like to acknowledge CSC Finland for computational resources. Stylianos I. Mimilakis is supported in part by the German Research Foundation (AB 675/2-1, MU 2686/11-1).

Abstract: Recent approaches for music source separation are almost exclusively based on deep neural networks, mostly employing recurrent neural networks (RNNs). Although RNNs are in many cases superior to other types of deep neural networks for sequence processing, they are known to have specific difficulties in training and parallelization, especially for the typically long sequences encountered in music source separation. In this paper we present a use-case of replacing RNNs with depth-wise separable (DWS) convolutions, which are a lightweight and faster variant of typical convolutions. We focus on singing voice separation, employing an RNN-based architecture, and we replace the RNNs with DWS convolutions (DWS-CNNs). We conduct an ablation study and examine the effect of the number of channels and layers of the DWS-CNNs on the source separation performance, using the standard metrics of signal-to-artifacts, signal-to-interference, and signal-to-distortion ratio. Our results show that replacing the RNNs with DWS-CNNs yields an improvement of 1.20, 0.06, and 0.37 dB, respectively, while using only 20.57% of the amount of parameters of the RNN architecture.
Index Terms: Depthwise separable convolutions, recurrent neural networks, MaD, MaDTwinNet, monaural singing voice separation
I. INTRODUCTION
The task of audio source separation is to extract the underlying audio sources from an observed audio mixture. A particular problem that has attracted great attention in audio and music source separation is the estimation of the singing voice and accompaniment sources [1]. To address this problem, a common and successfully employed workflow consists of computing non-negative signal representations and employing deep neural networks (DNNs) to estimate the target sources.

Although different methods have recently been proposed for computing and learning signal adaptive/dependent representations for source separation [2]–[4], the short-time Fourier transform (STFT) remains a popular choice among state-of-the-art (SOTA) approaches in music source separation [5]–[9].
Specifically, by using the STFT, the complex-valued representation of the mixture signal is computed. Then, the corresponding magnitude information of the mixture signal is processed by an appropriate method, e.g. DNNs, yielding the magnitude information of the target source. Using the phase information of the mixture, the time-domain signals of the estimated sources are recovered by means of the inverse STFT (ISTFT).

Focusing on the DNNs that estimate the target source in the STFT domain, a certain approach that state-of-the-art methods employ is that of filtering/masking. This approach constrains the DNNs to output filters that are optimized for separating audio and music sources, and has led to good results both in separation quality [5], [7], [8] and in computational cost [9]. In more detail, the DNNs are conditioned on the magnitude spectrogram of the mixture signal and are optimized, in a supervised fashion, to yield a time-varying filter, i.e., a time-frequency mask. The time-frequency mask is applied to the input mixture spectrogram, resulting in a filtered version of the input mixture. The parameters of the DNNs are optimized to minimize the difference between the filtered spectrogram and the target source spectrogram, available in the training dataset. The main benefit of employing such an approach over other approaches is that the DNNs are more efficient in learning the spectrogram structure of the target music source [10].

Typical DNN masking-based approaches for music source separation rely on recurrent neural networks (RNNs) to encode information from the mixture magnitude spectrogram [5], [6], [8], which is then decoded to obtain the source-dependent mask. However, many previous works have highlighted that the optimization of such DNNs can be difficult, due to the involved RNNs, resulting in a very slow, or even sub-optimal, learning process. A few reasons for this are improper gradient norms of the RNN parameters during training [11], and the large number of parameters that RNNs require to efficiently process long sequences [12]. Although techniques such as skip-connections [12], bi-directional sequence sampling [12], and regularization schemes [13] have been proposed to alleviate these issues, CNNs have been gaining popularity [7], [14]–[16]. In contrast to RNNs, CNNs have fewer parameters and can be easily parallelized, resulting in a faster learning process. Furthermore, recent works have shown that depth-wise separable CNNs can even perform better than typical CNNs in a wide range of applications, spanning from image recognition [17] to sound event detection [16], speech [18], and music source separation [15].

Because of the above, in this work we conduct an ablation study and examine the objective performance differences in singing voice separation when replacing RNNs with depth-wise separable CNNs. To that aim, we focus on the Masker and Denoiser (MaD) architecture presented in [5], [6], [19]. We do so because the MaD architecture incorporates the RNN techniques previously presented in [6] and [5], serving as a fair, yet competitive, baseline for the scope of this work.

The rest of the paper is organized as follows. In Section II we present our proposed method, consisting of the replacement of the RNNs in the MaD architecture with depth-wise separable convolutions.
In Section III we present the followed evaluation procedure, and the obtained results are presented in Section IV. Section V concludes the paper.

II. PROPOSED METHOD
Our method accepts as an input the magnitude spectrogram $\mathbf{V} \in \mathbb{R}_{\geq 0}^{(T+L) \times F}$ of the musical mixture, consisting of $T + L$ time frames with $F$ frequency bands, and outputs the magnitude spectrogram $\hat{\mathbf{V}}_j \in \mathbb{R}_{\geq 0}^{T \times F}$ of the $j$-th targeted source, by applying a two-step process. First, our method filters $\mathbf{V}$, producing an initial estimate of the magnitude spectrogram of the $j$-th source, $\hat{\mathbf{V}}'_j \in \mathbb{R}_{\geq 0}^{T \times F}$, where the extra $L$ vectors of $\mathbf{V}$ are used as temporal context for the initial estimate $\hat{\mathbf{V}}'_j$. Then, our method enhances $\hat{\mathbf{V}}'_j$, producing the final estimate of the magnitude spectrogram of the $j$-th source, $\hat{\mathbf{V}}_j$.

Our proposed method is based on the MaD system [5], [6], [19], which takes $\mathbf{V}$ as an input and employs two denoising auto-encoders (DAEs), one for estimating $\hat{\mathbf{V}}'_j$ and one for calculating $\hat{\mathbf{V}}_j$. The first DAE in MaD is based on RNNs, which are known to be hard to use in parallelized training and harder to optimize than CNNs [16], [20], [21].
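Before detailing the modules, a minimal sketch of the surrounding filter-and-resynthesize loop (STFT, magnitude masking, ISTFT with the mixture phase) may be helpful. Here `model` is only a stand-in for the magnitude-domain separator described below, the analysis parameters follow Section III, and the handling of the $L$ context frames is omitted:

```python
import torch

def separate(mixture, model, n_fft=4096, hop=384, win_length=2049):
    """Sketch: recover one source from a time-domain mixture signal."""
    window = torch.hamming_window(win_length)
    # Complex STFT of the mixture.
    mix_stft = torch.stft(mixture, n_fft=n_fft, hop_length=hop,
                          win_length=win_length, window=window,
                          return_complex=True)
    mix_mag, mix_phase = mix_stft.abs(), mix_stft.angle()
    # The model maps the mixture magnitude to the target-source magnitude
    # (context-frame handling omitted in this sketch).
    est_mag = model(mix_mag)
    # Resynthesize with the *mixture* phase, as described above.
    est_stft = torch.polar(est_mag, mix_phase)
    return torch.istft(est_stft, n_fft=n_fft, hop_length=hop,
                       win_length=win_length, window=window)
```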
A. MaD system

MaD consists of two modules: the masker and the denoiser. The masker accepts $\mathbf{V}$ as an input and outputs $\hat{\mathbf{V}}'_j$, and it consists of a trimming operation, $\text{Tr}$, a bi-directional RNN encoder, $\text{RNN}_{\text{enc}}$, a uni-directional RNN decoder, $\text{RNN}_{\text{dec}}$, and a feed-forward layer, $\text{FNN}_{\text{m}}$.

The trimming operation, $\text{Tr}$, takes $\mathbf{V}$ as an input and reduces the amount of frequency bands from $F$ to $N$, resulting in $\mathbf{V}_{\text{tr}} \in \mathbb{R}_{\geq 0}^{(T+L) \times N}$. This is done in order to reduce the input dimensionality of $\text{RNN}_{\text{enc}}$, consequently reducing its amount of parameters. The complete $\mathbf{V}$, though, is used later on, after $\text{RNN}_{\text{enc}}$. The bi-directional $\text{RNN}_{\text{enc}}$ consists of a forward RNN, $\overrightarrow{\text{RNN}}_{\text{enc}}$, and a backward RNN, $\overleftarrow{\text{RNN}}_{\text{enc}}$; it takes $\mathbf{V}_{\text{tr}}$ as an input and processes it according to
$$\overrightarrow{\mathbf{h}}'^{t'}_{\text{enc}} = \overrightarrow{\text{RNN}}_{\text{enc}}(\mathbf{v}^{t'}_{\text{tr}}, \overrightarrow{\mathbf{h}}'^{t'-1}_{\text{enc}}) \quad \text{and} \tag{1}$$
$$\overleftarrow{\mathbf{h}}'^{t'}_{\text{enc}} = \overleftarrow{\text{RNN}}_{\text{enc}}(\overleftarrow{\mathbf{v}}^{t'}_{\text{tr}}, \overleftarrow{\mathbf{h}}'^{t'-1}_{\text{enc}}), \tag{2}$$

where $\overrightarrow{\mathbf{h}}'^{t'}_{\text{enc}}, \overleftarrow{\mathbf{h}}'^{t'}_{\text{enc}} \in [-1, 1]^{N}$ are the latent outputs of $\overrightarrow{\text{RNN}}_{\text{enc}}$ and $\overleftarrow{\text{RNN}}_{\text{enc}}$, respectively, at the $t'$-th time frame, $t' = 1, \ldots, T+L$, $\overrightarrow{\mathbf{h}}'^{0}_{\text{enc}} = \overleftarrow{\mathbf{h}}'^{0}_{\text{enc}} = \{0\}^{N}$, $\overleftarrow{\mathbf{v}}^{t'}_{\text{tr}}$ is the time-flipped (i.e. backwards) version of $\mathbf{V}_{\text{tr}}$, and $\mathbf{H}'_{\text{enc}} = [\mathbf{h}'^{1}_{\text{enc}}, \ldots, \mathbf{h}'^{T+L}_{\text{enc}}]$. The bi-directional $\text{RNN}_{\text{enc}}$ is used to encode the input magnitude spectrogram, using extra information from the $L$ temporal context vectors. The output of the encoder, $\mathbf{H}'_{\text{enc}}$, is summed with a forward and a time-flipped copy of the input $\mathbf{V}_{\text{tr}}$ through residual connections, as

$$\mathbf{H}_{\text{enc}} = \mathbf{H}'_{\text{enc}} + [\mathbf{V}_{\text{tr}}^{\top}, \overleftarrow{\mathbf{V}}_{\text{tr}}^{\top}]^{\top}, \tag{3}$$

where $\overleftarrow{\mathbf{V}}_{\text{tr}}$ is the magnitude spectrogram $\mathbf{V}_{\text{tr}}$ flipped in time (i.e. backwards) and $\mathbf{H}_{\text{enc}} \in \mathbb{R}^{(T+L) \times 2N}$. Finally, the extra $L$ time frames are dropped from $\mathbf{H}_{\text{enc}}$, so that the subsequent decoder can focus on the time frames that correspond to the targeted output, as

$$\mathbf{H}_{\text{enc-tr}} = [\mathbf{h}^{\lfloor L/2 \rfloor}_{\text{enc}}, \ldots, \mathbf{h}^{T+\lfloor L/2 \rfloor}_{\text{enc}}], \tag{4}$$

where $\mathbf{h}^{i}_{\text{enc}} \in [-1, 1]^{2N}$ is the $i$-th vector of $\mathbf{H}_{\text{enc}}$ and $\lfloor \cdot \rfloor$ is the floor function. $\mathbf{H}_{\text{enc-tr}}$ is used as an input to the $\text{RNN}_{\text{dec}}$ of the masker, obtaining $\mathbf{H}_{\text{dec}}$ as

$$\mathbf{h}^{t}_{\text{dec}} = \text{RNN}_{\text{dec}}(\mathbf{h}^{t}_{\text{enc-tr}}, \mathbf{h}^{t-1}_{\text{dec}}), \tag{5}$$

where $\mathbf{h}^{t}_{\text{dec}}$ is the latent output of $\text{RNN}_{\text{dec}}$ at the $t$-th time frame, $t = 1, \ldots, T$, $\mathbf{h}^{0}_{\text{dec}} = \{0\}^{2N}$, and $\mathbf{H}_{\text{dec}} = [\mathbf{h}^{1}_{\text{dec}}, \ldots, \mathbf{h}^{T}_{\text{dec}}]$. $\mathbf{H}_{\text{dec}}$ is given as an input to a feed-forward linear layer with shared weights through time, followed by a rectified linear unit (ReLU), as

$$\mathbf{h}^{t}_{\text{m}} = \text{ReLU}(\text{FNN}_{\text{m}}(\mathbf{h}^{t}_{\text{dec}})), \tag{6}$$

where $\mathbf{h}^{t}_{\text{m}} \in \mathbb{R}^{F}_{\geq 0}$ and $\mathbf{H}_{\text{m}} = [\mathbf{h}^{1}_{\text{m}}, \ldots, \mathbf{h}^{T}_{\text{m}}]$. Finally, the output of the masker, $\hat{\mathbf{V}}'_j$, is calculated as

$$\hat{\mathbf{V}}'_j = \mathbf{V}' \odot \mathbf{H}_{\text{m}}, \tag{7}$$

where "$\odot$" is the Hadamard product and $\mathbf{V}' = [\mathbf{v}^{\lfloor L/2 \rfloor}, \ldots, \mathbf{v}^{T+\lfloor L/2 \rfloor}]$ is a time-trimmed version of the input magnitude spectrogram $\mathbf{V}$ (i.e. before the trimming process $\text{Tr}$).

The denoiser accepts $\hat{\mathbf{V}}'_j$ as an input and outputs $\hat{\mathbf{V}}_j$. It consists of two feed-forward layers with shared weights through time, $\text{FNN}_{\text{d1}}$ and $\text{FNN}_{\text{d2}}$, functioning as an auto-encoder, with each layer followed by a ReLU. Specifically, the first layer, $\text{FNN}_{\text{d1}}$, processes the input to the denoiser as

$$\mathbf{h}^{t}_{\text{d1}} = \text{ReLU}(\text{FNN}_{\text{d1}}(\hat{\mathbf{v}}'^{t}_{j})), \tag{8}$$

where $\mathbf{h}^{t}_{\text{d1}} \in \mathbb{R}^{\lfloor F/2 \rfloor}_{\geq 0}$ and $\mathbf{H}_{\text{d1}} = [\mathbf{h}^{1}_{\text{d1}}, \ldots, \mathbf{h}^{T}_{\text{d1}}]$. Then, the second layer, $\text{FNN}_{\text{d2}}$, processes $\mathbf{H}_{\text{d1}}$ as

$$\mathbf{h}^{t}_{\text{d2}} = \text{ReLU}(\text{FNN}_{\text{d2}}(\mathbf{h}^{t}_{\text{d1}})), \tag{9}$$

where $\mathbf{h}^{t}_{\text{d2}} \in \mathbb{R}^{F}_{\geq 0}$ and $\mathbf{H}_{\text{d2}} = [\mathbf{h}^{1}_{\text{d2}}, \ldots, \mathbf{h}^{T}_{\text{d2}}]$. The output of the denoiser, $\hat{\mathbf{V}}_j$, is calculated as

$$\hat{\mathbf{V}}_j = \hat{\mathbf{V}}'_j \odot \mathbf{H}_{\text{d2}}. \tag{10}$$

Finally, the masker and the denoiser are jointly optimized by minimizing

$$\mathcal{L} = D_{\text{KL}}(\mathbf{V}_j \| \hat{\mathbf{V}}'_j) + D_{\text{KL}}(\mathbf{V}_j \| \hat{\mathbf{V}}_j) + \lambda_1 |\text{diag}\{\mathbf{W}_{\text{FNN}_{\text{m}}}\}|_{\ell_1} + \lambda_2 \|\mathbf{W}_{\text{FNN}_{\text{d2}}}\|_{L_2}, \tag{11}$$

where $\mathbf{V}_j$ is the targeted magnitude spectrogram of the $j$-th source, $D_{\text{KL}}$ is the generalized Kullback-Leibler divergence, $\lambda_1$ and $\lambda_2$ are regularization weights, $|\cdot|_{\ell_1}$ is the $\ell_1$ vector norm, and $\|\cdot\|_{L_2}$ is the $L_2$ matrix norm. $\text{diag}\{\mathbf{W}_{\text{FNN}_{\text{m}}}\}$ is the main diagonal of the weight matrix of $\text{FNN}_{\text{m}}$ (i.e. the elements $w_{ij}$ of $\mathbf{W}_{\text{FNN}_{\text{m}}}$ with $i = j$). More information about the specific regularization terms and the optimization process can be found in the original MaD and MaDTwinNet papers [5], [19].
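A compact sketch of the masker of Eqs. (1)-(7) is given below, assuming GRUs for the RNNs and zero initial hidden states; the class and constructor arguments are illustrative, not the original implementation:

```python
import torch
import torch.nn as nn

class Masker(nn.Module):
    """Sketch of the MaD masker, Eqs. (1)-(7), with GRUs as the RNNs."""

    def __init__(self, n_bands, n_trim, context):
        super().__init__()
        self.n_trim = n_trim         # N: bands kept by the trimming op Tr
        self.context = context       # L: extra temporal-context frames
        self.rnn_enc = nn.GRU(n_trim, n_trim, batch_first=True,
                              bidirectional=True)
        self.rnn_dec = nn.GRU(2 * n_trim, 2 * n_trim, batch_first=True)
        self.fnn_m = nn.Linear(2 * n_trim, n_bands)

    def forward(self, v):            # v: (batch, T + L, F)
        l = self.context
        v_tr = v[:, :, :self.n_trim]                 # trimming, Tr
        h_enc, _ = self.rnn_enc(v_tr)                # Eqs. (1)-(2)
        h_fwd, h_bwd = h_enc.chunk(2, dim=-1)
        h_bwd = h_bwd.flip(dims=(1,))                # backward states in processing order
        # Residual connections of Eq. (3): forward and time-flipped input copies.
        h = torch.cat([h_fwd + v_tr, h_bwd + v_tr.flip(dims=(1,))], dim=-1)
        h = h[:, l // 2: h.shape[1] - (l - l // 2)]  # drop L context frames, Eq. (4)
        h_dec, _ = self.rnn_dec(h)                   # Eq. (5)
        mask = torch.relu(self.fnn_m(h_dec))         # Eq. (6)
        v_prime = v[:, l // 2: v.shape[1] - (l - l // 2)]
        return v_prime * mask                        # Eq. (7)
```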
B. Replacing RNNs

In our proposed method, we replace the bi-directional $\text{RNN}_{\text{enc}}$ and the uni-directional $\text{RNN}_{\text{dec}}$ with two sets of convolutional blocks, $\text{CNN}_{\text{enc}}$ and $\text{CNN}_{\text{dec}}$, respectively. Following recent SOTA published work [16], we opt to employ depth-wise separable (DWS) convolutions instead of typical convolutions for our CNN blocks. The DWS convolution is a factorized version of the typical convolution that first applies a spatial-wise convolution and then a channel-wise convolution. The spatial-wise convolution learns spatial relationships in the input features to the convolution. The channel-wise convolution learns cross-channel relationships between the channels of the spatial-wise convolution.

Specifically, each DWS convolution block of our method consists of a CNN (the spatial-wise convolution, $\text{CNN}_{\text{d}}$), followed by a leaky ReLU (LReLU), a batch-normalization process, another CNN (the channel-wise convolution, $\text{CNN}_{\text{s}}$), and a ReLU, as

$$\mathbf{H} = \text{ReLU}(\text{CNN}_{\text{s}}(\text{BN}(\text{LReLU}(\text{CNN}_{\text{d}}(\mathbf{X}))))), \text{ where} \tag{12}$$

$$\mathbf{D}_{c_i, x_h - K_h, x_w - K_w} = \text{CNN}_{\text{d}}(\mathbf{X}_{c_i}; \mathbf{K}^{c_i}_{\text{d}}) = (\mathbf{K}^{c_i}_{\text{d}} * \mathbf{X}_{c_i})(x_h - K_h, x_w - K_w) = \sum_{k_h=1}^{K_h} \sum_{k_w=1}^{K_w} \mathbf{X}_{c_i, x_h - k_h, x_w - k_w} \mathbf{K}^{c_i, k_h, k_w}_{\text{d}}, \tag{13}$$

$$\mathbf{H}_{c_o, \phi_h, \phi_w} = \text{CNN}_{\text{s}}(\mathbf{D}_{:, \phi_h, \phi_w}; \mathbf{K}^{c_o}_{\text{s}}) = \sum_{c_i=1}^{C_i} \mathbf{D}_{c_i, \phi_h, \phi_w} \mathbf{K}^{c_o, c_i}_{\text{s}}, \tag{14}$$

$$\text{LReLU}(x) = \begin{cases} x, & \text{if } x \geq 0, \\ \beta x, & \text{otherwise}, \end{cases} \tag{15}$$

$\text{BN}$ is the batch-normalization process, $*$ indicates convolution, $\mathbf{D} \in \mathbb{R}^{C_i \times \Phi_h \times \Phi_w}$ and $\mathbf{K}_{\text{d}} \in \mathbb{R}^{C_i \times K_{\text{d}h} \times K_{\text{d}w}}$ are the output and kernel tensors of $\text{CNN}_{\text{d}}$, respectively, $\mathbf{H} \in \mathbb{R}^{C_o \times \Phi'_h \times \Phi'_w}$ and $\mathbf{K}_{\text{s}} \in \mathbb{R}^{C_o \times C_i}$ are the output tensor and kernel matrix of $\text{CNN}_{\text{s}}$, respectively, and $\beta < 1$ is a hyper-parameter. Eq. (13) is used to learn the spatial relationships of the data $\mathbf{X} \in \mathbb{R}^{C_i \times X_h \times X_w}$, and Eq. (14) is used to learn the cross-channel relationships. We employ the LReLU according to previous studies using depth-wise separable convolutions [16].

Our $\text{CNN}_{\text{enc}}$ consists of one DWS convolution block that is followed by a batch-normalization process, a max-pooling operation, and a dropout with probability $p_{\text{enc}}$, and then of $L_{\text{enc}}$ DWS convolution blocks, each followed by a batch-normalization process and a dropout with probability $p_{\text{enc}}$ (but no max-pooling operation). The output of each of the $L_{\text{enc}}$ DWS convolution blocks has the same dimensionality as its input. That is, at each of the $L_{\text{enc}}$ DWS convolution blocks, we utilize proper zero padding (i.e. depending on the kernel size) in order not to alter the dimensions of the input. Each block of $\text{CNN}_{\text{enc}}$ gets as an input the output of the previous one, the first gets $\mathbf{V}$ as an input, and the last outputs the tensor $\mathbf{H}^{L_{\text{enc}}} \in \mathbb{R}^{C_{\text{enc}} \times H_{\text{enc}} \times W_{\text{enc}}}_{\geq 0}$.

$\text{CNN}_{\text{dec}}$ consists of a transposed convolution, followed by two DWS convolution blocks, batch-normalization and max-pooling processes, a dropout with probability $p_{\text{dec}}$, a CNN, and the $\text{FNN}_{\text{enc}}$. The transposed convolution of $\text{CNN}_{\text{dec}}$ gets $\mathbf{H}^{L_{\text{enc}}}$ as an input, and the $\text{FNN}_{\text{enc}}$ outputs $\mathbf{H}_{\text{m}}$. Finally, the output of the masker of our method is calculated according to Eq. (7), and the final audio signal of the output is calculated according to the original MaD paper [5].
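A single DWS convolution block of Eq. (12) maps directly onto a grouped convolution followed by a 1x1 convolution. A minimal PyTorch sketch, with assumed zero padding that preserves the input dimensions:

```python
import torch
import torch.nn as nn

class DWSConvBlock(nn.Module):
    """One DWS convolution block, Eq. (12): ReLU(CNN_s(BN(LReLU(CNN_d(X)))))."""

    def __init__(self, in_channels, out_channels, kernel_size=5, beta=1e-2):
        super().__init__()
        # Spatial-wise (depth-wise) convolution, Eq. (13): one kernel per channel.
        self.cnn_d = nn.Conv2d(in_channels, in_channels, kernel_size,
                               padding=kernel_size // 2, groups=in_channels)
        self.lrelu = nn.LeakyReLU(negative_slope=beta)  # Eq. (15)
        self.bn = nn.BatchNorm2d(in_channels)
        # Channel-wise (point-wise) convolution, Eq. (14): 1x1 kernel mixing channels.
        self.cnn_s = nn.Conv2d(in_channels, out_channels, kernel_size=1)

    def forward(self, x):
        return torch.relu(self.cnn_s(self.bn(self.lrelu(self.cnn_d(x)))))
```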
III. EXPERIMENTAL PROCEDURE

A. Dataset and pre-processing
We use the development subset of the Demixing Secret Dataset (DSD100) for optimizing the parameters of the proposed method, in a supervised fashion. From each multi-track we compute a monaural version of each of the four sources, by averaging the two available channels. Then, we compute the STFT of each monaural signal using a Hamming window of 2049 samples (46 ms) with a step size of 384 samples (8 ms). Each windowed segment is zero-padded to 4096 samples. After the STFT, we remove the redundant information, retaining the first N = 2049 frequency bands, and then compute the absolute values. Then the magnitude spectrograms of the mixture and the singing voice are segmented into B = ⌈M/T⌉ sequences, with M being the total amount of time frames, T being the length of each sequence, and ⌈·⌉ the ceiling function. Each sequence b is employed as our V and V_j, for the mixture and the target source respectively, and overlaps with the preceding one by an empirical factor proportional to L. The overlap is used for aggregating context information in the previously described stages of encoding.
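A sketch of this pre-processing, using SciPy's STFT; the context length L = 10 and the handling of the last (shorter) sequence are assumptions:

```python
import numpy as np
from scipy.signal import stft

def preprocess(track, fs=44100, T=60, L=10):
    """Sketch: stereo track (2, samples) -> stack of (T + L, 2049) magnitude sequences."""
    mono = track.mean(axis=0)  # monaural down-mix: average the two channels
    # STFT: 2049-sample (~46 ms) Hamming window, 384-sample (~8 ms) hop,
    # zero-padded to 4096 samples -> N = 2049 frequency bands.
    _, _, spec = stft(mono, fs=fs, window='hamming',
                      nperseg=2049, noverlap=2049 - 384, nfft=4096)
    mag = np.abs(spec).T  # (M, 2049) magnitude frames
    # Segment into roughly B = ceil(M / T) sequences of T + L frames, where
    # consecutive sequences overlap so each carries L frames of temporal context
    # (the exact overlap factor is an assumption; a short tail is dropped here).
    starts = range(0, mag.shape[0] - (T + L) + 1, T)
    return np.stack([mag[s:s + T + L] for s in starts])
```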
B. Hyperparameters and training of the proposed method

We evaluate our method by conducting an ablation study, employing different amounts of CNN_enc blocks, L_enc, and different numbers of channels, C_o, for our convolutional kernels.

TABLE I: SDR, SIR, and SAR values, and amount of parameters (N_params) for the different amounts of CNN_enc blocks (L_enc) and channels of the corresponding kernel (C_o). Values of SDR, SIR, and SAR are presented in dB. The combination of L_enc and C_o that yields the biggest SDR is marked with an asterisk.

Value of L_enc   SDR (C_o: 64 / 128 / 256)   SIR (64 / 128 / 256)   SAR (64 / 128 / 256)   N_params (64 / 128 / 256)
5                4.47 / 4.84 / 4.91          8.11 / 8.07 / 8.59     6.71 / 6.74 / 6.98     4 783 426 / 4 922 754 / 5 447 170
7                4.44 / 4.65 / 4.94*         – / – / 8.23           – / – / 7.15           – / – / –
Specifically, we employ six different values of L_enc, namely 5, 7, 9, 11, 13, and 15, and three different values of C_o, namely 64, 128, and 256. We indicate the amounts of L_enc and C_o using subscripts, e.g. CNN_enc-5,64 for the L_enc = 5 and C_o = 64 combination. All DWS convolution blocks of CNN_enc have a square kernel of K_dh = K_dw = 5. At the CNN_dec we utilize the same amount of channels C_o as at the CNN_enc, K_dh = K_dw = 5, a unit stride, and β = 1e-2. The values for K_dh and K_dw are chosen according to previous work that employed DWS convolutions [16], and the value for β is the default value for the LReLU in the PyTorch framework.

We optimize the parameters of our method following the approach in the original papers of MaD [6], [19], using 100 epochs on the training dataset, with a batch size of 4. We utilize the Adam optimizer for updating the weights of our method, with a learning rate of 1e-4 and the beta values proposed in the original corresponding paper [22]. Additionally, we employ clipping of the gradient L2 norm at 0.5, similar to the training process of the original MaD system. The above are implemented using the PyTorch framework, and our code is freely available online (https://github.com/pppyykknen/mad-twinnet).
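The optimization setup above can be sketched as follows; the generalized KL divergence implements the data terms of Eq. (11), while the model interface, the epsilon guard, and the omission of the regularization terms are assumptions of this sketch:

```python
import torch
from torch.nn.utils import clip_grad_norm_

def generalized_kl(v, v_hat, eps=1e-24):
    """Generalized Kullback-Leibler divergence D_KL(V || V_hat) of Eq. (11);
    eps is a small assumed constant guarding the logarithm."""
    return (v * torch.log((v + eps) / (v_hat + eps)) - v + v_hat).sum()

def train(model, loader, epochs=100, lr=1e-4, max_grad_norm=0.5):
    # Adam with the betas of [22] (the PyTorch defaults), as described above.
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    for _ in range(epochs):
        for v_mix, v_target in loader:         # batches of size 4, set in the loader
            v_hat_prime, v_hat = model(v_mix)  # masker and denoiser outputs (assumed API)
            loss = generalized_kl(v_target, v_hat_prime) \
                 + generalized_kl(v_target, v_hat)  # regularizers of Eq. (11) omitted
            optimizer.zero_grad()
            loss.backward()
            clip_grad_norm_(model.parameters(), max_grad_norm)  # L2-norm clip at 0.5
            optimizer.step()
```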
C. Objective evaluation

We compare our method with an established masking-based approach to singing voice separation, the Masker and Denoiser (MaD) architecture and its variants, namely MaDTwinNet [5] and the MaD architecture with the recurrent inference algorithm [6]. The length of the sequences for MaD and its variants is set to T = 60 time frames, according to the corresponding papers [5]. We focus on these particular approaches because, to the best of our knowledge, they are the only ones that do not estimate all the other music sources in an attempt to refine the estimated singing voice signal [6]–[8], [16]. This allows us to clearly examine the potential of using depth-wise separable convolutional networks for masking-based approaches to singing voice separation. For assessment, the evaluation subset of DSD100 (50 mixtures and corresponding sources) is used for measuring the objective performance of our method, in terms of signal-to-distortion (SDR), signal-to-interference (SIR), and signal-to-artifacts (SAR) ratios. The computation of SDR, SIR, and SAR for all the compared methods is performed over overlapping signal segments, following the proposed rules of the official Signal Separation and Evaluation Campaign (SiSEC) [23].

TABLE II: Comparison of our proposed method with MaD, on the DSD100 dataset. Values of SDR, SIR, and SAR are presented in dB. N_params-M is the amount of parameters of the Masker. Results of MaD are taken from the literature.

Approach                 SDR    SIR    SAR    N_params-M
MaD [6]                  3.62   7.06   5.88   22 996 113
MaD-Rec.Inference [6]    4.20   7.94   5.91
MaDTwinNet [5]           4.57   8.17   5.95
CNN_enc-7,256 (ours)     4.94   8.23   7.15
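The segment-wise metric computation can be sketched with the BSS Eval implementation [24] of the mir_eval library; the 30 s segment length and 15 s hop are assumed, SiSEC-style values, not taken from [23]:

```python
import numpy as np
import mir_eval

def evaluate_track(voice_ref, acc_ref, voice_est, acc_est,
                   seg=44100 * 30, hop=44100 * 15):
    """Sketch: segment-wise SDR/SIR/SAR for the singing voice of one track."""
    results = []
    for start in range(0, len(voice_ref) - seg + 1, hop):
        sl = slice(start, start + seg)
        ref = np.stack([voice_ref[sl], acc_ref[sl]])  # (2, n_samples)
        est = np.stack([voice_est[sl], acc_est[sl]])
        if not (ref.any(axis=1).all() and est.any(axis=1).all()):
            continue  # BSS Eval is undefined for silent sources; skip the segment
        sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(ref, est)
        results.append((sdr[0], sir[0], sar[0]))      # index 0: singing voice
    return np.median(np.array(results), axis=0)       # median over segments
```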
IV. RESULTS AND DISCUSSION

Table I lists the amount of parameters and the obtained values of SDR, SIR, and SAR versus the different L_enc and C_o. From that table, it can be seen that increasing C_o has a bigger impact on the obtained SDR, SIR, and SAR than increasing L_enc. That is, increasing the amount of channels benefits the obtained SDR, SIR, and SAR more than increasing the depth of the CNNs. This benefit from C_o could, though, be attributed to the more pronounced effect that C_o has on N_params: as Table I shows, increasing C_o affects the total amount of parameters N_params more than increasing L_enc does. Regarding the best performing combination, we focus on the SDR and consider as best performing the combination of L_enc = 7 and C_o = 256.

To evaluate the benefit of our proposed method compared to the usage of RNNs, we compare our results with the vanilla, the recurrent inference, and the twin networks variants of the MaD system. Table II shows the SDR, SIR, and SAR values of the best performing combination according to SDR and Table I (i.e. CNN_enc-7,256), compared to the values of the same metrics obtained by the MaD system. Additionally, since one of the main benefits of DWS convolutions is that they have considerably fewer parameters, we also list in Table II the amount of parameters of the Masker. We do not list the parameters of the Denoiser, since the Denoiser is the same in all the systems listed in Table II. For reference, the amount of parameters of the Denoiser is 4 199 425.

As can be seen from Table II, our proposed method surpasses all variants of the MaD system while, at the same time, having only 6% of the parameters of the Masker (i.e. a 94% reduction) compared to MaD. Specifically, we achieve an increase of 0.37 dB, 0.06 dB, and 1.20 dB in SDR, SIR, and SAR, respectively, when comparing our method to the MaD system trained with the TwinNet regularization (i.e. MaDTwinNet), which is the best performing variant of MaD. As can be seen, the improvement is mainly attributed to the reduction of artifacts in the separated signal (i.e. the increase in SAR). This indicates that replacing the RNNs with DWS convolutions can result in signals that have less distortion from the separation method [24].

Finally, comparing Tables I and II, we can see that our method with L_enc = 5 and C_o = 64 uses only 2.54% of the parameters of the Masker and still yields an increase of 0.76 dB in SAR, with a marginal reduction of 0.10 dB and 0.06 dB in SDR and SIR, respectively. This means that with our method we can reduce the parameters of the Masker by 97.5% while still obtaining some improvement in the reduction of distortion from the separation method (i.e. an increase in SAR). In terms of the total amount of parameters, with the best performing combination of our method, CNN_enc-7,256, we get a reduction of 79.43% (i.e. we use only 20.57% of the total MaD parameters), and with CNN_enc-5,64 we get a reduction of 82.41% (i.e. we use only 17.59% of the total MaD parameters).
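To make the parameter savings concrete, the following snippet counts the weights of a standard 5x5 convolution against its DWS counterpart for C_i = C_o = 256; the numbers illustrate the factorization itself, not the exact Masker configuration:

```python
import torch.nn as nn

def n_params(module):
    return sum(p.numel() for p in module.parameters())

c_in, c_out, k = 256, 256, 5
standard = nn.Conv2d(c_in, c_out, k)  # C_i * C_o * K^2 weights (plus biases)
dws = nn.Sequential(
    nn.Conv2d(c_in, c_in, k, groups=c_in),  # spatial-wise convolution, Eq. (13)
    nn.Conv2d(c_in, c_out, 1),              # channel-wise convolution, Eq. (14)
)
print(n_params(standard), n_params(dws))  # 1 638 656 vs. 72 448: ~23x fewer
```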
V. CONCLUSIONS

In this work we examined the effect on objective separation performance of replacing RNNs with depth-wise separable (DWS) CNNs. To assess our proposed approach, we focused on the singing voice separation task: we employed a SOTA architecture for monaural singing voice separation that is based on RNNs, we implemented our proposed replacements, and we evaluated the performance of the resulting method using an established and freely available dataset for music source separation. We evaluated the singing voice separation performance using the widely employed source separation metrics of signal-to-distortion (SDR), signal-to-artifacts (SAR), and signal-to-interference (SIR) ratios. The results show a clear benefit of using our approach, both in performance and in the total amount of parameters needed. Specifically, with our approach we managed to reduce the total amount of parameters by 79.43%, and to achieve an increase of 0.37 dB, 0.06 dB, and 1.20 dB in SDR, SIR, and SAR, respectively, compared to the original method with RNNs. For future work, we intend to examine the usage of dilated convolutions, in order to exploit the strong temporal context of music (e.g. melody). Additionally, further investigation could be carried out regarding the benefit of having a bigger kernel in the channel-wise convolution of the depth-wise separable convolution.

REFERENCES
[1] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo, "An overview of lead and accompaniment separation in music," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 8, pp. 1307–1335, Aug. 2018.
[2] S. Venkataramani, J. Casebeer, and P. Smaragdis, "End-to-end source separation with adaptive front-ends," in 52nd Asilomar Conference on Signals, Systems, and Computers, Oct. 2018, pp. 684–688.
[3] E. Tzinis, S. Venkataramani, Z. Wang, C. Subakan, and P. Smaragdis, "Two-step sound source separation: Training on learned latent targets," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020.
[4] S. I. Mimilakis, K. Drossos, and G. Schuller, "Unsupervised interpretable representation learning for singing voice separation," in 28th European Signal Processing Conference (EUSIPCO), Aug. 2020.
[5] K. Drossos, S. I. Mimilakis, D. Serdyuk, G. Schuller, T. Virtanen, and Y. Bengio, "MaD TwinNet: Masker-denoiser architecture with twin networks for monaural sound source separation," in 2018 International Joint Conference on Neural Networks (IJCNN), 2018, pp. 1–8.
[6] S. I. Mimilakis, K. Drossos, J. F. Santos, G. Schuller, T. Virtanen, and Y. Bengio, "Monaural singing voice separation with skip-filtering connections and recurrent inference of time-frequency mask," in Proceedings of the 43rd International Conference on Acoustics, Speech and Signal Processing (ICASSP 2018), Apr. 2018.
[7] L. Prétet, R. Hennequin, J. Royo-Letelier, and A. Vaglio, "Singing voice separation: A study on training data," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 506–510.
[8] F.-R. Stöter, S. Uhlich, A. Liutkus, and Y. Mitsufuji, "Open-Unmix - A reference implementation for music source separation," Journal of Open Source Software, 2019. [Online]. Available: https://doi.org/10.21105/joss.01667
[9] M. Huber, G. Schindler, C. Schörkhuber, W. Roth, F. Pernkopf, and H. Fröning, "Towards real-time single-channel singing-voice separation with pruned multi-scaled DenseNets," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 806–810.
[10] S. I. Mimilakis, K. Drossos, E. Cano, and G. Schuller, "Examining the mapping functions of denoising autoencoders in singing voice separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 28, pp. 266–278, 2020.
[11] R. Pascanu, T. Mikolov, and Y. Bengio, "On the difficulty of training recurrent neural networks," in Proceedings of the 30th International Conference on Machine Learning (ICML'13), JMLR.org, 2013, pp. III-1310–III-1318.
[12] A. Graves, "Generating sequences with recurrent neural networks," CoRR, vol. abs/1308.0850, 2013.
[13] D. Serdyuk, N.-R. Ke, A. Sordoni, A. Trischler, C. Pal, and Y. Bengio, "Twin Networks: Matching the future for sequence generation," CoRR, vol. abs/1708.06742, 2017. [Online]. Available: http://arxiv.org/abs/1708.06742
[14] N. Takahashi and Y. Mitsufuji, "Multi-scale multi-band DenseNets for audio source separation," in 2017 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), Oct. 2017.
[15] A. Défossez, N. Usunier, L. Bottou, and F. Bach, "Music source separation in the waveform domain," HAL, Tech. Rep. 02379796v1, 2019.
[16] K. Drossos, S. I. Mimilakis, S. Gharib, Y. Li, and T. Virtanen, "Sound event detection with depthwise separable and dilated convolutions," in Proceedings of the 2020 IEEE International Joint Conference on Neural Networks (IJCNN), Jul. 2020.
[17] J. Guo, Y. Li, W. Lin, Y. Chen, and J. Li, "Network decoupling: From regular to depthwise separable convolutions," in British Machine Vision Conference (BMVC), Sep. 2018.
[18] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time-frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, Aug. 2019.
[19] S. I. Mimilakis, K. Drossos, T. Virtanen, and G. Schuller, "A recurrent encoder-decoder approach with skip-filtering connections for monaural singing voice separation," in 2017 IEEE 27th International Workshop on Machine Learning for Signal Processing (MLSP), 2017, pp. 1–6.
[20] Y. He and J. Zhao, "Temporal convolutional networks for anomaly detection in time series," Journal of Physics: Conference Series, vol. 1213, p. 042050, Jun. 2019.
[21] C. Lea, R. Vidal, A. Reiter, and G. D. Hager, "Temporal convolutional networks: A unified approach to action segmentation," in Computer Vision – ECCV 2016 Workshops, G. Hua and H. Jégou, Eds. Springer International Publishing, 2016, pp. 47–54.
[22] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proceedings of the International Conference on Learning Representations (ICLR-15), 2015.
[23] A. Liutkus, F.-R. Stöter, Z. Rafii, D. Kitamura, B. Rivet, N. Ito, N. Ono, and J. Fontecave, "The 2016 signal separation evaluation campaign," in Latent Variable Analysis and Signal Separation: 13th International Conference, LVA/ICA 2017, Feb. 2017, pp. 323–332.
[24] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.