LaSAFT: Latent Source Attentive Frequency Transformation for Conditioned Source Separation
Woosung Choi*, Minseok Kim*, Jaehwa Chung†, Soonyoung Jung*

*Department of Computer Science, Korea University
†Department of Computer Science, Korea National Open University

This work was supported by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2020R1A2C1012624, 2019R1A6A3A13095526).
ABSTRACT
Recent deep-learning approaches have shown that Frequency Transformation (FT) blocks can significantly improve spectrogram-based single-source separation models by capturing frequency patterns. The goal of this paper is to extend the FT block to fit the multi-source task. We propose the Latent Source Attentive Frequency Transformation (LaSAFT) block to capture source-dependent frequency patterns. We also propose the Gated Point-wise Convolutional Modulation (GPoCM), an extension of Feature-wise Linear Modulation (FiLM), to modulate internal features. By employing these two novel methods, we extend the Conditioned-U-Net (CUNet) for multi-source separation, and the experimental results indicate that our LaSAFT and GPoCM can improve the CUNet's performance, achieving state-of-the-art SDR performance on several MUSDB18 source separation tasks.
Index Terms — conditioned source separation, attention
1. INTRODUCTION
Most of the deep learning-based models for Music Source Separation (MSS) are dedicated to a single instrument. However, this approach forces us to train an individual model for each instrument. Besides, trained models cannot use the commonalities between different instruments. A simple extension to multi-source separation is to generate several outputs at once. For example, models proposed in [1, 2] generate multiple outputs. Although it shows promising results, this approach still has a scaling issue: the number of heads increases as the number of instruments increases, leading to (1) performance degradation caused by the shared bottleneck, and (2) inefficient memory usage. Adopting conditioning learning [3, 4] is a neat alternative. It can separate different instruments with the aid of the control mechanism. Since it does not need a multi-head output layer, there is no shared bottleneck. For example, the Conditioned-U-Net (CUNet) [3] extends the U-Net [5, 6] by exploiting Feature-wise Linear Modulation (FiLM) [7].
It takes as input the spectrogram of a mixture and a control vector that indicates which instrument we want to separate, and outputs the estimated spectrogram of the target instrument.

Meanwhile, recent spectrogram-based methods for Singing Voice Separation (SVS) [8] or Speech Enhancement (SE) [9] employed Frequency Transformation (FT) blocks to capture frequency patterns. Although stacking 2-D convolutions has shown remarkable results [6, 10], it is hard for fully convolutional networks with small kernel sizes to capture long-range dependencies along the frequency axis. FT blocks, which are fully-connected layers applied in a time-distributed manner, are useful to this end. Both models designed their building block to have a series of 2-D convolution layers followed by an FT block, reporting state-of-the-art performance on SVS and SE, respectively.

In this paper, we aim to exploit FT blocks in a CUNet architecture. However, merely injecting an FT block into a CUNet does not inherit the spirit of the FT block (although it does improve Source-to-Distortion Ratio (SDR) [11] performance by capturing common frequency patterns observed across all instruments). We propose the Latent Source-Attentive Frequency Transformation (LaSAFT), a novel frequency transformation block that can capture instrument-dependent frequency patterns by exploiting the scaled dot-product attention [12]. We also propose the Gated Point-wise Convolutional Modulation (GPoCM), a new modulation that extends the Feature-wise Linear Modulation (FiLM) [7]. Our CUNet with LaSAFT and GPoCMs outperforms the existing methods on several MUSDB18 [13] tasks. Our ablation study indicates that adding LaSAFT or replacing FiLMs with GPoCMs improves separation quality.

Our contributions are three-fold:
• We propose LaSAFT, a novel attention-based frequency transformation block that captures instrument-dependent frequency patterns.
• We propose GPoCM, an extension of FiLM, to modulate internal features for conditioned source separation.
• Our model achieves state-of-the-art performance on several MUSDB18 tasks. We provide an ablation study to investigate the role of each component.

2. BASELINE ARCHITECTURE

The baseline is similar to the CUNet [3]. It consists of (1) the Conditioned U-Net and (2) the Condition Generator.
1. The Conditioned U-Net is a U-Net [5] which takes a mixture spectrogram as input and outputs the estimated target spectrogram. It applies FiLM layers to modulate intermediate features with condition parameters generated by the condition generator.
2. The Condition Generator takes as input a condition vector and generates condition parameters. A condition vector is a one-hot encoding vector that specifies which instrument we want to separate.
2.1. The Conditioned U-Net

We extend the U-Net architecture used in [8], on which a state-of-the-art singing voice separation model is based. We first describe the parts that ours has in common with the original [8]. As shown in the left part of Fig. 1, it consists of an encoder and a decoder: the encoder transforms the input mixture spectrogram M into a downsized spectrogram-like representation, and the decoder takes it and returns the estimated target spectrogram T̂. It should be noted that the spectrograms M and T̂ are complex-valued, adopting the Complex-as-Channel (CaC) separation method [8]. In CaC, we view the real and imaginary parts as separate channels; thus, if the original mixture waveform has c channels (i.e., c = 2 for stereo), then the number of channels of M and T̂ is 2c.

[Fig. 1. The baseline architecture]

There are four types of components in the structure. 1 × 1 convolutions are used for adjusting the number of channels: to increase the number of channels, we apply a 1 × 1 convolution with C̄ output channels followed by a ReLU [14] activation to the given input M; intermediate layers keep the number of channels at C̄; to restore the original number of channels, we apply another 1 × 1 convolution with 2c output channels to the last intermediate block's output. Since the target spectrogram is complex-valued, we do not apply any activation function there. An Intermediate Block transforms an input spectrogram-like tensor into an equally-sized tensor. For each block in the baseline, we use a Time-Frequency Convolution [8] (TFC), a block of densely connected 2-D convolution layers [15]. We denote the number of intermediate blocks in the encoder by L; the decoder also has L blocks, and there is an additional block between them. A Down/Up-sampling Layer halves/doubles the scale of an input tensor; we use a strided/transposed convolution.
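To make the CaC input format concrete, here is a minimal sketch (our own illustrative PyTorch code, not the authors' implementation; the function name is hypothetical, and the STFT parameters are the ones stated in §4) of how a stereo mixture waveform can be converted into the 2c-channel real-valued tensor the network consumes.

```python
import torch

def to_cac_input(mixture_waveform: torch.Tensor,
                 n_fft: int = 2048, hop_length: int = 1024) -> torch.Tensor:
    """Complex-as-Channel (CaC) input: real and imaginary parts become channels.

    mixture_waveform: (c, samples) waveform, e.g. c = 2 for a stereo mixture.
    Returns a real-valued tensor of shape (2c, T, F).
    """
    window = torch.hann_window(n_fft)
    spec = torch.stft(mixture_waveform, n_fft, hop_length,
                      window=window, return_complex=True)   # (c, F, T), complex
    spec = spec.transpose(-1, -2)                             # (c, T, F)
    return torch.cat([spec.real, spec.imag], dim=0)           # (2c, T, F)
```

The estimated target T̂ uses the same layout, so a waveform can be recovered by re-pairing the real/imaginary channels into a complex spectrogram and applying the inverse STFT.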
Skip Connections concatenate output feature maps of the same scale between the encoder and the decoder. They help the U-Net recover fine-grained details of the target.

Unlike the original U-Net [8], we modulate internal features in the decoder by applying FiLM layers, as shown in the right part of Fig. 1. Applying FiLM is an effective way to condition a network; it applies the following operation to intermediate feature maps. We define a FiLM layer as follows:
FiLM(X^i_c | γ^i_c, β^i_c) = γ^i_c · X^i_c + β^i_c,   (1)

where γ^i_c and β^i_c are parameters generated by the condition generator, and X^i is the output of the i-th decoder's intermediate block, whose subscript c refers to the c-th channel of X^i.

2.2. The Condition Generator

The condition generator is a network that predicts the condition parameters γ = (γ^1_1, ..., γ^L_C̄) and β = (β^1_1, ..., β^L_C̄). Our condition generator is similar to the 'Fully-Connected Embedding' of the CUNet [3] except for the usage of the embedding layer. It takes as input the one-hot encoding vector z ∈ {0, 1}^I that specifies which one we want to separate among I instruments. The condition generator projects z into e_z ∈ R^E, the embedding of the target instrument, where E is the dimension of the embedding space. It then applies a series of fully-connected (i.e., linear or dense) layers, each of which doubles the dimension. We use ReLU [14] as the activation function for each layer, and apply a dropout followed by a Batch Normalization (BN) [16] to the output of each of the last two layers. The last hidden units are fed to two fully-connected layers to predict γ = (γ^1_1, ..., γ^L_C̄) ∈ R^|γ| and β = (β^1_1, ..., β^L_C̄) ∈ R^|β|, where |γ| = |β| = C̄L.
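As a concrete illustration of Eq. (1) and of the condition generator described above, the following PyTorch sketch is our own (class and argument names are hypothetical; the dropout and BN layers of the paper are omitted for brevity); the dimensions follow the ablation setting in §4 (C̄ = 24, L = 3, E = 32).

```python
import torch
import torch.nn as nn

def film(x, gamma, beta):
    """Eq. (1): channel-wise affine modulation of a (B, C, T, F) feature map,
    given gamma and beta of shape (B, C)."""
    return gamma[..., None, None] * x + beta[..., None, None]

class ConditionGenerator(nn.Module):
    """One-hot condition vector z -> instrument embedding e_z -> (gamma, beta)."""
    def __init__(self, num_instruments=4, embed_dim=32, c_bar=24, num_blocks=3):
        super().__init__()
        self.embedding = nn.Linear(num_instruments, embed_dim, bias=False)
        self.hidden = nn.Sequential(
            nn.Linear(embed_dim, embed_dim * 2), nn.ReLU(),
            nn.Linear(embed_dim * 2, embed_dim * 4), nn.ReLU(),
        )
        num_params = c_bar * num_blocks             # |gamma| = |beta| = C_bar * L
        self.to_gamma = nn.Linear(embed_dim * 4, num_params)
        self.to_beta = nn.Linear(embed_dim * 4, num_params)

    def forward(self, z_onehot):                    # z_onehot: (B, num_instruments)
        h = self.hidden(self.embedding(z_onehot))
        return self.to_gamma(h), self.to_beta(h)    # each: (B, C_bar * L)
```

At the i-th decoder block, the slice of γ and β belonging to that block's C̄ channels would be applied with `film`.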
3. PROPOSED METHODS

3.1. Latent Source Attentive Frequency Transformation
We introduce the Time-Distributed Fully-connected layers (TDF) [8], an existing FT block, and then propose our LaSAFT.
A TDF block is a series of two fully-connected layers. Suppose that a 2-D convolutional layer (e.g., a TFC [8]) takes an input spectrogram-like internal representation X ∈ R^{C̄ × T × F} and outputs an equally-sized tensor X'. A TDF block is applied separately and identically to each frame (i.e., X'[i, j, :]) in a time-distributed fashion. Each layer is defined as consecutive operations: a fully-connected layer, BN, and ReLU. The number of hidden units is defined as ⌊F/bf⌋, where we denote the bottleneck factor by bf.

Although injecting TDFs into the baseline also improves the SDR performance by capturing the common frequency patterns observed across all instruments (see §4), it does not inherit the spirit of the TDF. To this end, we propose the Latent Source Attentive Frequency Transformation (LaSAFT) by adopting the scaled dot-product attention mechanism [12]. We first duplicate I_L copies of the second layer of the TDF, as shown on the right side of Fig. 2, where I_L refers to the number of latent instruments. I_L is not necessarily the same as I, for the sake of flexibility. For a given frame V ∈ R^F, we obtain the I_L latent-instrument-dependent frequency-to-frequency correlations, denoted by V' ∈ R^{F × I_L}. We use the components on the left side of Fig. 2 to determine how much each latent source should be attended. The LaSAFT takes as input the instrument embedding e_z ∈ R^{1 × E}. It has a learnable weight matrix K ∈ R^{I_L × d_k}, where we denote the dimension of each instrument's hidden representation by d_k. By applying a linear layer of size d_k to e_z, we obtain Q ∈ R^{d_k}. We now can compute the output of the LaSAFT as follows:

Attention(Q, K, V') = softmax(QK^T / √d_k) V'   (2)

[Fig. 2. Latent Source Attentive Frequency Transformation]

We apply a LaSAFT after each TFC in the encoder and after each FiLM/GPoCM layer in the decoder. We employ a skip connection for LaSAFT and TDF, as in TFC-TDF [8].
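The following is a compact sketch of a LaSAFT block (illustrative PyTorch code written by us, not the released implementation; names such as `num_latent` and `bf` are our own, and the BN layers of the TDF are dropped for brevity). It shows the shared first TDF layer, the I_L copies of the second layer, and the attention of Eq. (2) over latent sources.

```python
import math
import torch
import torch.nn as nn

class LaSAFT(nn.Module):
    """Sketch of a LaSAFT block applied to a (B, C, T, F) representation."""
    def __init__(self, freq_bins, embed_dim=32, num_latent=6, d_k=32, bf=16):
        super().__init__()
        hidden = freq_bins // bf
        # shared first TDF layer (bottleneck along the frequency axis)
        self.first = nn.Sequential(nn.Linear(freq_bins, hidden), nn.ReLU())
        # num_latent copies of the TDF's second layer, one per latent source
        self.seconds = nn.ModuleList(
            [nn.Sequential(nn.Linear(hidden, freq_bins), nn.ReLU())
             for _ in range(num_latent)])
        self.keys = nn.Parameter(torch.randn(num_latent, d_k))   # K
        self.query_proj = nn.Linear(embed_dim, d_k)               # e_z -> Q
        self.d_k = d_k

    def forward(self, x, e_z):
        # x: (B, C, T, F) internal representation, e_z: (B, E) instrument embedding
        h = self.first(x)                                          # (B, C, T, F//bf)
        v = torch.stack([f(h) for f in self.seconds], dim=-2)      # (B, C, T, I_L, F)
        q = self.query_proj(e_z)                                   # (B, d_k)
        attn = torch.softmax(q @ self.keys.t() / math.sqrt(self.d_k), dim=-1)
        attn = attn[:, None, None, :, None]                        # broadcast over C, T, F
        return (attn * v).sum(dim=-2) + x                          # weighted mix + skip
```

In the full model, such a block would follow each TFC in the encoder and each GPoCM layer in the decoder, as stated above.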
3.2. Gated Point-wise Convolutional Modulation

Before we describe the Gated Point-wise Convolutional Modulation (GPoCM), we first introduce the Point-wise Convolutional Modulation (PoCM).
The PoCM is an extension of FiLM. While FiLM does not have inter-channel operations, PoCM has them, as follows:
PoCM(X^i_c | ω^i_c, β^i_c) = β^i_c + Σ_j ω^i_{cj} · X^i_j,   (3)

where ω^i_c = (ω^i_{c1}, ..., ω^i_{cC̄}) and β^i_c are condition parameters, and X^i is the output of the i-th decoder's intermediate block, as shown in Fig. 3. Since this channel-wise linear combination can also be viewed as a point-wise convolution, we name it PoCM. With inter-channel operations, PoCM can modulate features more flexibly and expressively than FiLM.

[Fig. 3. PoCM layers]
In the decoder, we use the following 'Gated PoCM (GPoCM)' instead:
GPoCM(X^i_c | ω^i_c, β^i_c) = σ(PoCM(X^i_c | ω^i_c, β^i_c)) ⊙ X^i_c,   (4)

where σ is the sigmoid function and ⊙ denotes the Hadamard product.
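A minimal sketch of Eqs. (3) and (4) (our own illustrative code; the tensor layouts are assumptions, with the condition generator supplying ω as a per-example C × C matrix and β as a per-example C-vector):

```python
import torch

def pocm(x, omega, beta):
    """Eq. (3): point-wise (1x1) convolutional modulation.

    x:     (B, C, T, F) decoder features
    omega: (B, C, C)    channel-mixing weights predicted by the condition generator
    beta:  (B, C)       biases predicted by the condition generator
    """
    mixed = torch.einsum('bcd,bdtf->bctf', omega, x)   # channel-wise linear combination
    return mixed + beta[..., None, None]

def gpocm(x, omega, beta):
    """Eq. (4): gated PoCM, used to modulate the decoder features."""
    return torch.sigmoid(pocm(x, omega, beta)) * x
```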
4. EXPERIMENT

4.1. Experiment Setup
We use the MUSDB18 dataset [13], which contains 84 tracks for training, 15 tracks for validation, and 50 tracks for testing. Each track is stereo, sampled at 44100 Hz, and consists of the mixture and four sources (i.e., I = 4): vocals, drums, bass, and other. We train models using Adam [17] with one of two learning rates depending on the model depth. Each model is trained to minimize the Mean Squared Error (MSE) between the ground-truth spectrogram and the estimated one. For validation, we use the MSE between the target signal and the estimate. We apply data augmentation [18] on the fly to obtain mixture clips comprised of sources from different tracks. We use SDR [11] as the evaluation metric, using the official tool for MUSDB18 (https://github.com/sigsep/sigsep-mus-eval). We use the median SDR value over all test-set tracks for each run and report the mean SDR over three runs (a short sketch of this aggregation appears at the end of this section). More details are available online (https://github.com/ws-choi/Conditioned-Source-Separation-LaSAFT).

4.2. Results

Table 1. An ablation study: 'dedicated' means U-Nets for single-source separation, trained separately. 'FiLM CUNet' refers to the baseline in §2. The last row is our proposed model.

model                  vocals  drums  bass   other  AVG
dedicated [8]           7.07    5.38   5.62   4.61   5.66
FiLM CUNet              5.14    5.25   4.20   3.40   4.49
  + TDF                 5.88    5.70   4.55   3.67   4.95
  + LaSAFT              6.74    5.64   5.13   4.32   5.46
GPoCM CUNet             5.43    5.30   4.43   3.51   4.67
  + TDF                 5.94    5.46   4.47   3.81   4.92
  + LaSAFT (proposed)

We perform an ablation study to validate the effectiveness of the proposed methods compared with the baseline. For every model in Table 1, we use an FFT window size of 2048 and a hop size of 1024 for spectrogram estimation. The configuration of the baseline (FiLM CUNet) is as follows: we use a 7-blocked CUNet (i.e., L = 3), and we use a TFC [8] for each block with the same configuration used in [8], where five convolution layers are densely connected and the growth rate [15] is set to 24. We set C̄ to 24, as in [8]. We use 32 for the dimensionality of the embedding space of conditions (i.e., E = 32).

In Table 1, we observe a considerable performance degradation when employing the existing method (FiLM CUNet), compared to the dedicated U-Net, which is trained separately for each instrument with the same configuration. Injecting TDF blocks into the baseline (FiLM CUNet + TDF) improves SDR by capturing the common frequency patterns. Replacing TDFs with LaSAFTs (FiLM CUNet + LaSAFT) significantly improves the average SDR score by 0.51 dB, indicating that our LaSAFTs are more appropriate for multi-instrument tasks than TDFs. We set I_L to 6 in this experiment. Our proposed model (GPoCM CUNet + LaSAFT), which uses GPoCM instead of FiLM and LaSAFT instead of TDF, outperforms the others, achieving comparable but slightly inferior results to the dedicated models.

Also, we compare against existing state-of-the-art models on the MUSDB18 benchmark. The first four rows of Table 2 show SDR scores of SOTA models; we take the SDR performance from the respective papers. For a fair comparison, we use a 9-blocked 'GPoCM CUNet + LaSAFT' with the same frequency resolution as the other SOTA models [1, 2] (FFT window size = 4096). Our model yields comparable results against the existing methods and even outperforms the others on 'vocals' and 'other'. LaSAFT's ability to extract latent-instrument-attentive frequency patterns significantly improves the SDR of 'other', since it contains various instruments such as piano and guitars.
Also, existing works [8, 9] have shown that FT-based methods are beneficial for voice separation, which explains our model's excellent SDR performance on vocals.

Table 2. A comparison of the SDR performance of our model with other systems. '*' denotes a model operating in the time domain.

model               vocals  drums  bass   other  AVG
DGRU-DConv [1]       6.85    5.86   4.86   4.65   5.56
Meta-TasNet [4] *
proposed
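To make the evaluation protocol of §4.1 concrete, the following small NumPy sketch (ours; it assumes the per-track SDR values have already been computed with the official evaluation tool) shows the aggregation used for the numbers reported above: the median over test tracks for each run, averaged over three runs.

```python
import numpy as np

def aggregate_sdr(per_run_track_sdr: np.ndarray) -> float:
    """per_run_track_sdr: shape (num_runs, num_tracks), e.g. (3, 50),
    holding one SDR value per test track and training run."""
    per_run_median = np.median(per_run_track_sdr, axis=1)   # median over tracks
    return float(per_run_median.mean())                      # mean over runs
```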
5. RELATION TO PRIOR WORK
The original CUNet [3] is the most relevant work. We extend their model by employing LaSAFT and GPoCM. Besides, our baseline is also different from it. We summarize the differences as follows: (1) our U-Net is based on a generalized U-Net for source separation used in [8], whereas the U-Net of [3] is more similar to the original U-Net [5], and (2) we apply FiLM to internal features in the decoder, whereas [3] applies it in the encoder. The authors of [3] tried to manipulate the latent space in the encoder, assuming the decoder can perform as a general spectrogram generator that is 'shared' by different sources. However, we found that this approach is not practical, since it makes the latent space (i.e., the decoder's input feature space) more discontinuous. Via preliminary experiments, we observed that applying FiLMs in the decoder was consistently better than applying FiLMs in the encoder.

For multi-source separation, [4] employed meta-learning, which is similar to conditioning learning. It also has external networks that generate parameters for the target instrument. While we focus on modulating internal representations with GPoCM, [4] focuses on generating parameters of the masking subnetwork. Also, the models proposed in [4, 19] operate in the time domain, while ours operates in the time-frequency domain.

While other multi-source separation methods [1, 2] estimate multiple sources simultaneously, we try to condition a shared U-Net. We expect that ours can be easily extended to more complicated tasks such as audio manipulation.
6. CONCLUSION
We propose LaSAFT, which captures source-dependent frequency patterns by extending the TDF to fit the multi-source task. We also propose the GPoCM, which modulates features more flexibly and expressively than FiLM. We have shown that employing our LaSAFT and GPoCM in a CUNet can significantly improve SDR performance. In future work, we will try to reduce the number of parameters and the memory usage of LaSAFT so that we can consider more latent instruments. We will also extend the proposed model to audio manipulation tasks, where we can condition the model by providing various instructions.

7. REFERENCES

[1] Jen-Yu Liu and Yi-Hsuan Yang, "Dilated convolution with dilated GRU for music source separation," in Proceedings of the Twenty-Eighth International Joint Conference on Artificial Intelligence (IJCAI-19), July 2019, pp. 4718-4724.
[2] Naoya Takahashi and Yuki Mitsufuji, "D3Net: Densely connected multidilated DenseNet for music source separation," arXiv preprint arXiv:2010.01733, 2020.
[3] Gabriel Meseguer-Brocal and Geoffroy Peeters, "Conditioned-U-Net: Introducing a control mechanism in the U-Net for multiple source separations," in ISMIR, November 2019.
[4] David Samuel, Aditya Ganeshan, and Jason Naradowsky, "Meta-learning extractors for music source separation," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 816-820.
[5] Olaf Ronneberger, Philipp Fischer, and Thomas Brox, "U-Net: Convolutional networks for biomedical image segmentation," in International Conference on Medical Image Computing and Computer-Assisted Intervention. Springer, 2015, pp. 234-241.
[6] Andreas Jansson, Eric Humphrey, Nicola Montecchio, Rachel Bittner, Aparna Kumar, and Tillman Weyde, "Singing voice separation with deep U-Net convolutional networks," in ISMIR, 2017, pp. 23-27.
[7] Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron C. Courville, "FiLM: Visual reasoning with a general conditioning layer," in AAAI, 2018.
[8] Woosung Choi, Minseok Kim, Jaehwa Chung, Daewon Lee, and Soonyoung Jung, "Investigating U-Nets with various intermediate blocks for spectrogram-based singing voice separation," in ISMIR, October 2020.
[9] Dacheng Yin, Chong Luo, Zhiwei Xiong, and Wenjun Zeng, "PHASEN: A phase-and-harmonics-aware speech enhancement network," arXiv preprint arXiv:1911.04697, 2019.
[10] Naoya Takahashi, Nabarun Goswami, and Yuki Mitsufuji, "MMDenseLSTM: An efficient combination of convolutional and recurrent neural networks for audio source separation," IEEE, 2018, pp. 106-110.
[11] Emmanuel Vincent, Rémi Gribonval, and Cédric Févotte, "Performance measurement in blind audio source separation," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 4, pp. 1462-1469, 2006.
[12] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Advances in Neural Information Processing Systems, 2017, pp. 5998-6008.
[13] Zafar Rafii, Antoine Liutkus, Fabian-Robert Stöter, Stylianos Ioannis Mimilakis, and Rachel Bittner, "MUSDB18 - a corpus for music separation," Dec. 2017.
[14] Xavier Glorot, Antoine Bordes, and Yoshua Bengio, "Deep sparse rectifier neural networks," in Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, 2011, pp. 315-323.
[15] Gao Huang, Zhuang Liu, Laurens van der Maaten, and Kilian Q. Weinberger, "Densely connected convolutional networks," in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 4700-4708.
[16] Sergey Ioffe and Christian Szegedy, "Batch normalization: Accelerating deep network training by reducing internal covariate shift," in International Conference on Machine Learning, 2015, pp. 448-456.
[17] Diederik P. Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," Yoshua Bengio and Yann LeCun, Eds., 2015.
[18] Stefan Uhlich, Marcello Porcu, Franck Giron, Michael Enenkl, Thomas Kemp, Naoya Takahashi, and Yuki Mitsufuji, "Improving music source separation based on deep neural networks through data augmentation and network blending," IEEE, 2017, pp. 261-265.
[19] Eliya Nachmani, Yossi Adi, and Lior Wolf, "Voice separation with an unknown number of multiple speakers," arXiv preprint arXiv:2003.01531, 2020.