Evolving Multi-Resolution Pooling CNN for Monaural Singing Voice Separation
Weitao Yuan, Bofei Dong, Shengbei Wang, Masashi Unoki, and Wenwu Wang, Senior Member, IEEE
Abstract—Monaural Singing Voice Separation (MSVS) is a challenging task and has been studied for decades. Deep neural networks (DNNs) are the current state-of-the-art methods for MSVS. However, the existing DNNs are often designed manually, which is time-consuming and error-prone. In addition, the network architectures are usually pre-defined, and not adapted to the training data. To address these issues, we introduce a Neural Architecture Search (NAS) method to the structure design of DNNs for MSVS. Specifically, we propose a new multi-resolution Convolutional Neural Network (CNN) framework for MSVS, namely the Multi-Resolution Pooling CNN (MRP-CNN), which uses various-size pooling operators to extract multi-resolution features. Based on NAS, we then develop an evolving framework, namely the Evolving MRP-CNN (E-MRP-CNN), by automatically searching for effective MRP-CNN structures using genetic algorithms, optimized in terms of a single objective considering only the separation performance, or multiple objectives considering both the separation performance and the model complexity. The multi-objective E-MRP-CNN gives a set of Pareto-optimal solutions, each providing a trade-off between separation performance and model complexity. Quantitative and qualitative evaluations on the MIR-1K and DSD100 datasets are used to demonstrate the advantages of the proposed framework over several recent baselines.
Index Terms—Evolving multi-resolution pooling CNN, neural architecture search, genetic algorithm, monaural singing voice separation
I. INTRODUCTION
Popular music, which plays a central role in entertainment industries, usually consists of two components: singing voice (Vocal) and music accompaniment (Acc) [1]. Human beings can easily hear out and distinguish the singing voice from the music accompaniment when listening to a popular song. This effortless task for humans, however, is very difficult for machines, which raises both challenges and opportunities for advancing audio signal processing techniques [1], [2]. Monaural singing voice separation (MSVS), an important research branch of music source separation (MSS), aims to separate the singing voice and the background music accompaniment from a single-channel mixture signal. Research on MSVS is useful in many areas such as automatic lyrics recognition/alignment, singer identification, and music information retrieval [2]. Moreover, it would benefit our understanding of the perception and interpretation mechanisms of the human auditory system [2].
Weitao Yuan, Bofei Dong, and Shengbei Wang are with the Tianjin Key Laboratory of Autonomous Intelligence Technology and Systems, School of Computer Science and Technology, Tianjin Polytechnic University, Tianjin, China (e-mail: [email protected], [email protected], [email protected]). Masashi Unoki is with the School of Information Science, Japan Advanced Institute of Science and Technology, Japan (e-mail: [email protected]). Wenwu Wang is with the Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, UK (e-mail: [email protected]).
Traditional (largely unsupervised) methods have provided many effective frameworks for MSVS [1], e.g., time-frequency (T-F) masking methods [2] and robust principal component analysis (RPCA) based methods [3]. A comprehensive overview of the traditional MSVS methods can be found in [1]. Benefiting from these methods, recent data-driven methods, especially Deep Neural Networks (DNNs) [4], have strongly boosted the performance of MSVS with the help of large-scale data. The basic building blocks of DNNs for MSVS mainly include the Feed-Forward Network (FFN) [5], the Recurrent Neural Network (RNN) [6], the Convolutional Neural Network (CNN) [7], and the attention mechanism [8]. Among these building blocks, the CNN has proven very effective in extracting vocal/musical features for MSVS, since efficient representations related to discriminative features of vocals and music can be learned by convolutional filters via weight sharing.

In fact, music relies heavily on its multi-scale repetitions (e.g., from very basic elements such as individual notes, timbre, or pitch to larger structures such as chords [9]) to build its logical structure and meaning [10]. These multi-resolution repetitions, appearing at various musical levels, also distinguish the music accompaniment from the vocals, which are less redundant and mostly harmonic [1]. As an important CNN for MSVS, the Multi-Resolution CNN (MR-CNN) [11]–[14], which can capture multi-resolution features by constructing various-size receptive fields (RFs), has been found effective in modeling the multi-scale repetitive music structures and extracting discriminative features (e.g., global or local features). The MR-CNN has been widely employed by many state-of-the-art (SOTA) MSVS methods, and it is also the research focus of this work.

According to the different implementations of multi-resolution RFs, existing MR-CNNs for MSVS/MSS can be divided into two types. The first type, e.g., the Stacked Hourglass Network (SHN) [11] and U-net [12], is constructed in a cascade manner with a fixed-size or single-resolution RF in each layer. The input signal is repeatedly convolved and downsampled to form multiple consecutive layers. In this case, different resolution features can only be found in different layers, and thus the cascade structure of the first-type MR-CNN must be deep enough to extract effective multi-resolution features. In contrast, the second-type MR-CNN, such as the Multi-Resolution Convolutional Auto-Encoder (MRCAE) [13] and the Multi-Resolution Fully Convolutional Neural Network (MR-FCNN) [14], directly implements multi-resolution RFs in the same layer by using multiple sets of various-size convolutional operators. Accordingly, multi-resolution features can be extracted in one or a few layers without deepening the cascade structure.

In spite of these achievements, several issues need to be addressed for current MR-CNNs:

(1) Architecture limitations:
The first-type MR-CNN depends on its cascade structure to extract multi-scale music features. However, according to [15], optimization algorithms are less effective in capturing dependencies across multiple layers. This problem can be aggravated in the first-type MR-CNN, since it relies heavily on its deep cascade structure to improve the separation performance of MSVS.

In contrast, the second-type MR-CNN does not suffer from this optimization issue. However, in order to extract global features, large-size convolutional filters have to be used, and according to [16], large convolutional filters result in low computational efficiency. Moreover, for MSVS, a minor linear shift in the T-F representation (e.g., the magnitude spectrogram) can cause significant distortions in vocal and music perception [12]. To address this issue, many MSVS networks employ skip (or similar) connections to directly transmit low-level information between different layers [11], [12]. However, such skip connections (or similar mechanisms) have not been implemented for the second-type MR-CNN.

(2) Manual design:
Current MR-CNN (or DNN) based MSVS methods are usually designed manually. This manual design procedure has the following shortcomings.

1) Manual design is often achieved empirically via trial and error: MSVS is a challenging task, as the music accompaniment and vocals often exhibit highly synchronous non-stationary spectro-temporal structures over time and frequency [1]. The MR-CNN learns hierarchical feature extractors (e.g., the coefficients of the convolutional operators) from the data in an "end-to-end" fashion. In this case, slight modifications to the architecture may significantly affect the separation performance. To find suitable structures for MSVS, a large number of architecture modifications and repetitive training and testing runs are required, which is inevitably time-consuming, error-prone, and ineffective.

2) Domain knowledge may not be sufficient for detailed architecture design: For MSVS, domain knowledge may suggest using vertical filters to learn timbral representations [17] and horizontal filters to learn long temporal cues [18] in the T-F domain. However, when designing an actual MSVS network, how to combine and deploy these filters, and how to select an effective combination from the many possible ones, may not be answered sufficiently by domain/expert knowledge.

3) Pre-designed structures lack a mechanism to adapt their architectures to the training data: The data-driven optimization process of MR-CNNs can learn the parameters of the convolutional filters. However, the pre-defined convolutional operator sizes, the hyper-parameters, and the architecture of MR-CNNs cannot be changed or adapted to the dataset during the training process. As a result, the information learned from real data is not utilized to improve the pre-designed structures.

To address these issues, this paper proposes a flexible and effective MR-CNN for MSVS, namely the Multi-Resolution Pooling CNN (MRP-CNN). We also extend the proposed MRP-CNN into an evolving framework, i.e., the E-MRP-CNN, using the Neural Architecture Search (NAS) technique. The E-MRP-CNN can automatically evolve its neural architecture according to the training data using two kinds of genetic algorithms: a single-objective genetic algorithm and a multi-objective genetic algorithm. The details of our work are described below.

(1) Multi-resolution Pooling CNN:
The MRP-CNN utilizes sets of average pooling operators of various sizes in parallel at the same layer to obtain multi-resolution features. All these pooling operators are embedded in stacked convolution networks with small, fixed-size convolutional kernels. Compared with the cascade frameworks U-net and SHN (the first-type MR-CNN), the MRP-CNN does not need to optimize a deep cascade structure. Compared with the second-type MR-CNN, large-size pooling (downsampling) operators rather than large-size convolutional filters are used to extract global features, which reduces the number of trainable parameters and leads to much better memory and computational efficiency. Moreover, the MRP-CNN is a flexible design and allows skip connections (or other similar connections) to be implemented between different layers for low-level feature transmission.

(2) Automatic Neural Architecture Search:
We introduce NAS to the MRP-CNN and construct the E-MRP-CNN, which can automatically search for effective MRP-CNN architectures for MSVS. As the first attempt to introduce NAS to the MSVS field, we aim to enhance the existing MR-CNNs and make DNN-based MSVS methods less dependent on domain/expert knowledge, with a single-objective and a multi-objective E-MRP-CNN.

The single-objective E-MRP-CNN evolves its architecture with the sole objective of optimizing the separation performance. This evolving process provides insight into how different MRP-CNN architectures affect the separation performance and which structures work well for the MSVS problem. The single-objective E-MRP-CNN tends to optimize the separation performance at the cost of choosing a more complex model. In some real applications (e.g., embedded FPGA platforms [19]), however, the computing resources and on-chip memory are usually limited; in this case, both the model complexity and the separation performance should be considered. The multi-objective E-MRP-CNN is proposed to address the balance between model complexity and separation performance. It provides a set of Pareto-optimal solutions [20] for MSVS, i.e., Pareto-optimal MRP-CNN architectures. Each solution (architecture) is Pareto-optimal, that is, no objective can be improved without degrading the other, e.g., the separation performance cannot be improved without increasing the model complexity. We approximate the Pareto-optimal solution set based on a classic multi-objective evolutionary genetic algorithm: the Non-dominated Sorting Genetic Algorithm II (NSGA-II) [20]. With the multi-objective E-MRP-CNN, we can obtain multiple architectures, each providing a good separation performance under a fixed model complexity.

Our main contributions are summarized as follows.

• We propose a flexible MR-CNN framework, i.e., the MRP-CNN, for extracting multi-resolution spectro-temporal features for MSVS;
Fig. 1: The architecture of the proposed MRP-CNN. (a) Full structure (code: FC-FS-B-B-B-B-B); (b) Block (block code: B = CG-PL...PL-PCG; pooling code: PL = PS-PC-PCG; CG code: CG = C-S-A-A; PCG code: PCG = S-A-A); (c) Convolution group.

• Based on the MRP-CNN, we introduce the first evolutionary scheme for MSVS, i.e., the E-MRP-CNN, which can evolve its architecture and search for effective architectures for MSVS based on the training data. This automatic scheme not only avoids the empirical manual design process but also provides better separation performance (via the single-objective E-MRP-CNN) and a well-balanced trade-off between model complexity and separation performance (via the multi-objective E-MRP-CNN) for MSVS.

II. RELATED WORKS
The existing deep networks for MSVS/MSS mainly use RNN [21], [22] and CNN structures [7], [11], [12], [23], [24]. The RNN can effectively model the dependencies of temporal patterns and structures of music (e.g., rhythm, beat/tempo, melody) [21], [22]. The CNN, which is effective for feature extraction in the T-F domain, is usually constructed as a convolutional encoder-decoder architecture with skip connections, such as the U-net [12], Wave-U-net [23], Exp-Wave-U-Net [24], and SHN [11]. The CNN can also be combined with other structures to obtain better MSS/MSVS performance. For example, in [25], CNN and RNN are combined to improve the MSS performance, and the Skip Attention (SA) [8], inspired by the Transformer [26], was introduced into a CNN encoder-decoder structure to improve the separation performance. In addition to these works, [27], [28] considered using generative adversarial networks (GANs) for (semi-supervised) MSVS; [29] designed the Chimera network for singing voice separation based on deep clustering; and [30] examined the mapping functions of neural networks based on the denoising autoencoder (DAE) model for MSS. However, these works lack the flexibility to adapt their architectures to the data, as compared with the use of various-size pooling operators in our approach. In addition, all of these networks are designed manually, and none of them considered the use of NAS for automatic architecture design.

Over the past few years, NAS has achieved impressive progress in many research areas and begun to outperform human-designed deep models [31], [32]. As a classic search strategy of NAS, the NeuroEvolution of Augmenting Topologies (NEAT) [33] adopted the Genetic Algorithm (GA) to evolve both its artificial neural networks and their weights. Recently, the Evolved Transformer [31] considered the use of NAS to find a better alternative to the Transformer for sequence-to-sequence tasks. Reinforcement Learning (RL) based NAS has also been introduced to GANs [34]. However, to our knowledge, NAS has not been explored for the MSVS/MSS tasks, and no work has yet attempted to design an evolving MR-CNN framework for MSVS/MSS. In particular, since a neural architecture for MSVS usually has millions of weights, we use a GA to optimize the neural architecture and a gradient-based method to optimize the weights [32], which is different from NEAT [33]. In addition, compared with RL based NAS (e.g., [34]), evolution-guided NAS is simpler and more efficient for MSVS.

III. THE MRP-CNN
A. Proposed framework
The proposed MRP-CNN in Fig. 1(a) is composed of five stacked Blocks. Each Block (indexed by i, 1 ≤ i ≤ 5) works as a basic unit to extract multi-resolution features, and the five Blocks form a stacked structure. Skip connections (dotted lines in Fig. 1(a)) can optionally be used between different Blocks to improve the separation performance. (The number of Blocks was chosen empirically here and can be chosen flexibly in an application. With more than five stacked Blocks, a higher separation performance may be obtained, but at a higher computational complexity.)

As illustrated in Fig. 1(b), each Block consists of a convolution group (CG), multiple pooling layers (PLs, indexed by j, 1 ≤ j ≤ J), a concatenation, and a post-convolution-group (PCG) layer. A skip connection can be used optionally (dotted lines in Fig. 1(b)). The j-th PL in the i-th Block is composed of three components: an average pooling operator of size T_{i,j} × F_{i,j}, a PCG layer, and an upsampling operation. Each PL (1 ≤ j ≤ J) is responsible for extracting one specific resolution of features, so a Block with multiple PLs can extract multi-resolution features. The CG and PCG in each Block have the same structure. As shown in Fig. 1(c), both CG and PCG are made of two consecutive convolution layers of the same size, with a small fixed 2D convolutional kernel and C channels, and a possible skip connection.

Using the hyper-parameters (e.g., T_{i,j}, F_{i,j}, C, etc.) and flexible components (e.g., skip connections) of the basic MRP-CNN framework, many different MRP-CNN architectures can be induced. For example, in each Block, the exact PL number, i.e., J, can be adjusted by the data-driven evolution process of the E-MRP-CNN. In particular, when the size of the average pooling operator of one PL is changed to T_{i,j} = F_{i,j} = 1 during the evolution process, this PL is not used in the current Block. In addition, the CG/PCG can have different channel numbers (different C), and when C = 0 the CG/PCG is turned into a direct connection; skip connections can be used optionally between different Blocks; and the nonlinear activation functions can differ (e.g., ReLU or sigmoid). Hence, the proposed MRP-CNN provides a flexible framework for MSVS. A sketch of one Block is given below.
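To make the Block structure concrete, here is a minimal PyTorch sketch of one Block under stated assumptions: a 3×3 kernel for the convolution groups (the paper fixes a small kernel but its size is not reproduced above), ReLU activations, and nearest-neighbour upsampling. All class and parameter names are illustrative, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvGroup(nn.Module):
    """Two consecutive 2D convolutions (the CG/PCG unit) with an optional skip."""
    def __init__(self, in_ch, out_ch, skip=True):
        super().__init__()
        self.conv1 = nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1)
        self.skip = skip and in_ch == out_ch   # skip needs matching channels

    def forward(self, x):
        y = F.relu(self.conv2(F.relu(self.conv1(x))))
        return x + y if self.skip else y

class MRPBlock(nn.Module):
    """One Block: CG -> parallel PLs (pool, PCG, upsample) -> concat -> PCG."""
    def __init__(self, in_ch, cg_ch, pl_sizes, pl_ch, out_ch):
        super().__init__()
        self.cg = ConvGroup(in_ch, cg_ch)
        self.pl_sizes = pl_sizes   # e.g. [(4, 4), (16, 16)] for (T_ij, F_ij)
        self.pl_pcgs = nn.ModuleList(ConvGroup(cg_ch, pl_ch) for _ in pl_sizes)
        self.pcg = ConvGroup(cg_ch + pl_ch * len(pl_sizes), out_ch, skip=False)

    def forward(self, x):
        h = self.cg(x)
        feats = [h]
        for (t, f), pcg in zip(self.pl_sizes, self.pl_pcgs):
            p = F.avg_pool2d(h, kernel_size=(t, f))        # T_ij x F_ij pooling
            p = F.interpolate(pcg(p), size=h.shape[-2:])   # back to input size
            feats.append(p)
        return self.pcg(torch.cat(feats, dim=1))

x = torch.randn(1, 1, 64, 512)                  # (batch, ch, frames, freq)
y = MRPBlock(1, 32, [(4, 4), (16, 16)], 16, 32)(x)
```

Note the design point the paper exploits: the large receptive fields come from cheap average pooling, not from large (parameter-heavy) convolution kernels.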
TABLE I: Encoding method of the proposed MRP-CNN.

Full-level:        FC (2 bits): 00(32), 01(64), 11(128), 10(256);  FS (10 bits, b ∈ {0, 1}): b-bb-bbb-bbbb
Block-level:       B = CG-PL-...-PL-PCG
Convolution-level: CG = C-S-A-A;  PCG = S-A-A
Pooling-level:     PL = PS-PC-PCG;  PS (2 bits)×(2 bits): 00(1), 01(4), 11(16), 10(64);  PC (2 bits): 00(16), 01(32), 11(64), 10(128)

B. Encoding method
The encoding process assigns each specific MRP-CNN architecture a unique code, i.e., a gene. With the gene-encoded MRP-CNN architectures, a search space is constructed, thus enabling our NAS to find appropriate architectures for MSVS (see Section IV) under the defined objective. For convenience of presentation, we divide the proposed MRP-CNN framework in Fig. 1 into the following four levels, from low to high:

Convolution-level ⊂ Pooling-level ⊂ Block-level ⊂ Full-level,

where the Convolution-level represents convolutional layers (CG and PCG belong to this level), and the Pooling-level, Block-level, and Full-level correspond to the PL, the Block, and the whole MRP-CNN structure, respectively. The whole MRP-CNN structure can be encoded as in Table I, where all four levels are included.
1) Full-level: The Full-level, i.e., the whole MRP-CNN structure, is encoded by FC − FS − B − B − B − B − B, where FC encodes the number of channels of the last PCG layer in all Blocks, i.e., C_{i,J+1} (see Fig. 1(b)), FS encodes possible skip connections between different Blocks, B stands for a Block, and "−" represents concatenation of codes.

The value of FC can be 32/64/128/256, as shown in Table I, where we use 2 bits to represent the four options: 00(32), 01(64), 11(128), and 10(256). Here, the same FC (one of the four options) is used for all Blocks in one MRP-CNN structure, since the output channels of different Blocks should be the same to enable skip connections.

The FS is encoded in the form "b-bb-bbb-bbbb" using ten bits (see the second row in Table I). The first bit 'b' stands for the skip connection from the first Block to the second Block, the next 'bb' stands for the skip connections from the first and second Blocks to the third Block, and so on. The value of b decides whether a skip connection exists (b=1) or not (b=0). A decoding sketch is given below.
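As a concrete illustration of this Full-level encoding, here is a hypothetical decoder sketch; the function name, dictionary layout, and example bit string are our own, not from the paper.

```python
# FC is 2 bits (00->32, 01->64, 11->128, 10->256); FS is the 10-bit
# skip-connection map "b-bb-bbb-bbbb" described above.
FC_TABLE = {"00": 32, "01": 64, "11": 128, "10": 256}

def decode_full_level(code: str):
    fc, fs = FC_TABLE[code[:2]], code[2:12]
    skips, pos = {}, 0
    for target in range(2, 6):          # Blocks 2..5 can receive skips
        n_src = target - 1              # ...from Blocks 1..target-1
        group = fs[pos:pos + n_src]
        skips[target] = [s + 1 for s, b in enumerate(group) if b == "1"]
        pos += n_src
    return fc, skips

print(decode_full_level("01" + "1010001001"))
# (64, {2: [1], 3: [2], 4: [], 5: [1, 4]})
```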
2) Block-level: This level is important for extracting multi-resolution features. Each Block is encoded as

B = CG − PL_1 − · · · − PL_J − PCG,

where CG, PL, and PCG have been defined earlier, and there are J PLs in total. Both CG and PCG belong to the Convolution-level, and the PLs, working in parallel, belong to the Pooling-level.
TABLE II: The code (gene) of an example MRP-CNN. The code consists of FC and FS, followed by five Blocks; each Block is encoded as CG(C-S-A-A) − PL(PS-PC-PCG) − PL(PS-PC-PCG) − PCG(S-A-A).
3) Convolution-level: The CG and PCG, which have the same architecture (see Fig. 1(c)), are encoded differently. The CG is encoded as

CG = C − S − A − A,

where C encodes the number of channels of the convolutional layers in CG, i.e., C_{i,0} in Fig. 1(b), S stands for the skip connection (S ∈ {0, 1}), and the two consecutive bits A − A imply the activation functions of the two-layer convolution operators, where A=0 represents ReLU and A=1 represents Sigmoid.

The value of C takes one of four options (encoded with 2 bits), one of which is C=0. When C=0, the CG turns into a direct connection, i.e., there is no convolution, activation, or skip connection; in this case, the S − A − A part is ignored.

The code of PCG is similar to that of CG but without the channel number information, i.e., PCG = S − A − A. According to Fig. 1(b), the PCG is employed in both the Block and the PL. Thus, the channel number of the PCG in a Block and in a PL is decided by FC at the Full-level and PC at the Pooling-level (see below), respectively. A small decoding sketch follows.
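A similarly hedged sketch of decoding a Convolution-level code. Note that the concrete channel options for C are not reproduced in the text above (only that C=0 means a direct connection), so the values in C_TABLE below are placeholders for illustration only.

```python
C_TABLE = {"00": 0, "01": 16, "11": 32, "10": 64}   # assumed options; 0 = direct connection
ACT = {"0": "relu", "1": "sigmoid"}

def decode_cg(code: str):
    """Decode a 5-bit CG code C-S-A-A (2+1+1+1 bits)."""
    c = C_TABLE[code[:2]]
    if c == 0:                       # direct connection: S-A-A is ignored
        return {"channels": 0}
    return {"channels": c, "skip": code[2] == "1",
            "act": (ACT[code[3]], ACT[code[4]])}

print(decode_cg("01101"))  # {'channels': 16, 'skip': True, 'act': ('relu', 'sigmoid')}
```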
4) Pooling-level: Each PL is encoded using

PL = PS − PC − PCG,

where PS is the size of the pooling operator in the PL, PC is the channel number of its PCG (i.e., C_{i,j} of the j-th PL in the i-th Block in Fig. 1(b)), and PCG represents the post-convolution group. For the j-th PL in the i-th Block, PS is defined as [T_{i,j}, F_{i,j}], where T_{i,j} is the downsampling size along the time axis and F_{i,j} along the frequency axis. When T_{i,j} = F_{i,j} = 1, the j-th PL does not appear in the i-th Block and the code PC − PCG is ignored. We use 2 bits each to encode T_{i,j} and F_{i,j} of PS. As shown in Table I, the four possible values are represented by 00(1), 01(4), 11(16), and 10(64). The PC is also encoded by 2 bits: 00(16), 01(32), 11(64), and 10(128). The upsampling operator in the PL is not encoded, since it has no freedom but to upsample the extracted features back to the same size as the input of the current PL.

A simple example of an MRP-CNN is shown in Table II, where all five Blocks have two PLs. The PS of the second PL is 0000 (T_{i,2} = F_{i,2} = 1), i.e., its PC and PCG are ignored. This MRP-CNN (or another MRP-CNN architecture) can be used as a seed in the E-MRP-CNN.

IV. THE E-MRP-CNN

Using the above encoding method, each possible MRP-CNN structure can be represented by a unique code (i.e., gene). All these genes form a big search space. The proposed E-MRP-CNN utilizes genetic algorithms to automatically search for effective genes, i.e., effective MRP-CNN structures, in this search space. Here, we propose two types of evolution schemes: the single-objective and the multi-objective E-MRP-CNN scheme. Both schemes start with an initial population, which is made of a seed gene (a specific MRP-CNN structure) and other genes (structures) randomly mutated from this seed gene. After initialization, the schemes iteratively generate new offspring genes by applying genetic operations (i.e., crossover and mutation, sketched below) to randomly selected gene(s) from the current population. The new offspring genes are decoded to the corresponding MRP-CNN structures, which are then trained/tested and assigned fitness values. The fitness values for the single-objective and multi-objective schemes are computed in different ways: the single-objective scheme considers only the separation performance, while the multi-objective scheme considers both the separation performance and the model complexity. When the fitness values of all genes have been computed, the genes with low fitness in the current generation are removed. This evolution iteration is repeated and finally ends in a generation made of well-performing genes (structures).
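The crossover and mutation operators referred to above can be sketched at the bit level as follows. This is a toy illustration over fixed-length bit-string genes; the probabilities follow Table IV, but the function names and the demo gene are our own.

```python
import random

def crossover(base: str, other: str, p: float = 0.5) -> str:
    """Each bit of the baseline gene is exchanged with probability p."""
    assert len(base) == len(other)
    return "".join(o if random.random() < p else b
                   for b, o in zip(base, other))

def mutate(gene: str, p: float = 0.02) -> str:
    """Each bit is flipped independently with probability p."""
    return "".join(("1" if b == "0" else "0") if random.random() < p else b
                   for b in gene)

seed = "10" + "1010001001" + "0" * 40      # FC + FS + (toy) Block codes
child = mutate(crossover(seed, mutate(seed, 0.5)))
```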
A. Single-objective E-MRP-CNN
According to the BSS-EVAL toolkit [35], three metrics are usually used to measure the separation performance of MSVS: the source-to-distortion ratio (SDR), the source-to-interferences ratio (SIR), and the sources-to-artifacts ratio (SAR). As a proof of concept, we choose the SDR as the fitness function to guide the evolution process of the single-objective scheme, because it is a global performance measure that treats three goals as equally important [35]. (According to [35], the three goals are (i) rejection of the interferences, (ii) absence of forbidden distortions and "burbling" artifacts, and (iii) rejection of the sensor noise.) In particular, since each gene is only partially trained in the evolution process (to accelerate the computation), the global measure SDR is more suitable than the SIR and SAR; an illustrative, simplified SDR computation is sketched below.

The single-objective scheme is presented in Algorithm 1, where Rows 1-4 show the initialization process and Rows 5-12 show the evolution process.
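For intuition only, the following is a heavily simplified SDR computation: it treats the whole reference source as the target signal, whereas BSS-EVAL [35] decomposes the estimate via projections, so the actual toolkit should be used in practice.

```python
import numpy as np

def simple_sdr(reference: np.ndarray, estimate: np.ndarray) -> float:
    """Naive SDR in dB: energy of the target over energy of the error."""
    error = estimate - reference
    return float(10 * np.log10(np.sum(reference ** 2) /
                               (np.sum(error ** 2) + 1e-12)))
```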
1) Initialization process:

• In the first step (Row 1), we generate the initial population of size n, including one seed gene and n − 1 other genes randomly mutated from this seed. To do this, n_b bits of the seed gene are flipped to generate a new gene, where n_b is a random number with 1 ≤ n_b ≤ u (u is the maximum flipping number). We repeat this process until n − 1 different genes are obtained.

• In the second step (Row 2), we divide the training dataset, denoted by D, into three subsets, D → {D_tr, D_te, D_v}, where the training subset D_tr is used for training, the testing subset D_te is used for computing the fitness, and the validation subset D_v is used to decide when to stop the evolution process of the single-objective scheme.

• In the third step (Rows 3-4), we compute the fitness of each gene in the initial population. Specifically, the MRP-CNN structure decoded from each gene is trained on D_tr for only a few iterations (i.e., partial training). These partially trained structures are tested on D_te, and the averaged SDR performance over all clips of D_te is computed as the fitness of each gene. (This averaged SDR score, computed on the subset D_te, can be considered an approximation of the separation performance on the full testing dataset in the final evaluation.) The genes with low fitness are removed according to the population limit Z. A sketch of this initialization is given after this list.
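A minimal sketch of the initialization step (Row 1 of Algorithm 1), assuming genes are bit strings and using u as in Table IV; the helper names are illustrative.

```python
import random

def random_mutant(seed: str, u: int = 20) -> str:
    """Flip a random number n_b of bits, 1 <= n_b <= u."""
    n_b = random.randint(1, u)
    idx = random.sample(range(len(seed)), n_b)
    bits = list(seed)
    for i in idx:
        bits[i] = "1" if bits[i] == "0" else "0"
    return "".join(bits)

def init_population(seed: str, n: int) -> list:
    pop = {seed}
    while len(pop) < n:        # keep mutants distinct from each other
        pop.add(random_mutant(seed))
    return list(pop)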
2) Evolution process:

• In each iteration of the evolution, we use the crossover (Row 6) and mutation (Row 7) operators to generate new offspring genes. The crossover operator recombines the information of two randomly selected genes, where one gene is used as the baseline and each bit within it has a probability (prob.) p of being exchanged with the corresponding bit of the other gene. We apply the crossover to create o_c new offspring. The mutation operator produces a new offspring by randomly flipping each bit of one gene with a prob. p. We apply the mutation operator to each gene of the current generation and to the o_c offspring newly obtained by the crossover, creating o_m = o_c + Z new offspring in total.

• The SDR fitness values of all new offspring (o_c + o_m) are computed (Row 8). All genes, including the new offspring (o_c + o_m) and the current population (Z), are sorted by their fitness (Row 9), and the low-fitness genes are removed according to the population limit Z (Row 10).

• We check whether the stopping criterion is satisfied on the validation subset D_v (Row 11). Specifically, we test the best-fitness gene of the current generation on D_v to compute its SDR, which can be considered the best SDR performance of the current generation. This SDR is then compared with the SDRs of several recent generations; if there is no improvement of this value for a few generations (S generations), the evolution iteration is stopped, and the earliest generation with no SDR improvement is given as the output. A toy version of this loop is sketched after Algorithm 1 below.

Algorithm 1 Single-objective E-MRP-CNN
1: Generate the initial population of size n
2: Data preparation: training set D → {D_tr, D_te, D_v}
3: Compute the SDR fitness of each gene in the initial population
4: Remove low-fitness genes according to the population limit Z
5: for i = 1 to N (maximum generation) do
6:   Generate o_c new genes by crossover with prob. p
7:   Generate o_m new genes by mutation with prob. p
8:   Compute the SDR fitness of all new offspring
9:   Sort all genes (current + new) by SDR fitness
10:  Remove low-fitness genes by population limit Z
11:  break, if the stopping criterion is satisfied
12: end for
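A toy version of this evolution loop, reusing the crossover/mutate sketches above. fitness() stands in for partial training plus SDR evaluation on D_te (it would be expensive and should be cached per gene in practice), and the stopping test here uses the fitness history rather than the separate validation subset D_v; all of this is a simplification, not the authors' code.

```python
import random  # crossover() and mutate() as sketched earlier

def evolve(pop, fitness, Z=15, o_c=10, N=100, patience=8):
    pop = sorted(set(pop), key=fitness, reverse=True)[:Z]
    best = []
    for _ in range(N):
        # Rows 6-7: o_c crossover offspring, then mutate current genes + offspring
        children = [crossover(random.choice(pop), random.choice(pop))
                    for _ in range(o_c)]
        children += [mutate(g) for g in pop + children]      # o_m = Z + o_c
        # Rows 8-10: evaluate, sort, truncate to the population limit Z
        pop = sorted(set(pop + children), key=fitness, reverse=True)[:Z]
        best.append(fitness(pop[0]))
        # Row 11: stop if the best SDR has not improved for `patience` rounds
        if len(best) > patience and best[-1] <= best[-patience - 1]:
            break
    return pop
```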
B. Multi-objective E-MRP-CNN

The single-objective scheme evolves only to improve the separation performance. Thus it may pick the more complex neural structures that provide better separation performance. Since the model complexity is an important factor in limited-memory applications [19], the multi-objective scheme is designed to balance two objectives, i.e., the model complexity and the separation performance. In fact, these two objectives are conflicting: a complicated structure is more likely to provide a higher performance than a simple one. Thus the multi-objective scheme tries to approximate the Pareto front set, where many solutions are included and each solution provides a good separation performance under a fixed model complexity.
There are generally two properties to consider when designing evolutionary multi-objective optimization algorithms: convergence and diversity [36]. The convergence measures the distances of the solutions toward the Pareto front (i.e., the Pareto-optimal front), which should be as small as possible [36]. The diversity is the spread of the solutions along the Pareto front, which should be as uniform as possible [36]. For MSVS, the convergence encourages each evolved structure to offer a separation performance as good as possible under a certain complexity; the diversity encourages the evolved structures to be varied enough to handle different complexity levels. To achieve these, the proposed multi-objective scheme is implemented based on NSGA-II [20], where fast non-dominated sorting is used to promote convergence and the crowded-comparison operator is employed to address diversity [20]; a sketch of the sorting step is given at the end of this subsection.

The multi-objective scheme is presented in Algorithm 2, where Rows 1-4 show the initialization process and Rows 5-11 show the evolution iteration.

The first two steps of the initialization process (Rows 1-2) are the same as those in the single-objective scheme (note that the subset D_v is not used here). In the third step, we compute the fitness of each gene in the initial population, but instead of considering the SDR as the only fitness, we calculate both the SDR score and the model complexity (measured by the number of parameters (Params)) of each gene. Then we use the fast non-dominated sorting of NSGA-II [20] to calculate the non-dominated levels of all genes. By sorting all these levels with the crowded-comparison operator, low-fitness genes are removed according to the population limit Z (Row 4).

In each iteration of the evolution, we use crossover (Row 6) and mutation (Row 7) to generate o_c and o_m (o_m = o_c + Z) new offspring, respectively. The SDR and model complexity of all o_c + o_m new offspring are computed. Both the current population (Z) and the new offspring (o_c + o_m) are sorted by the fast non-dominated sorting and crowded-comparison operator of NSGA-II. We remove low-fitness genes according to the population limit Z. The multi-objective scheme stops when the maximum iteration number is reached.
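To illustrate the fast non-dominated sorting used here, the sketch below partitions genes, each summarized as an (SDR, Params) pair, into successive non-dominated fronts. It is a simplified version without the crowding-distance step, not the authors' implementation.

```python
def dominates(a, b):
    """a = (sdr, params): no worse in both objectives and better in one."""
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def non_dominated_fronts(points):
    fronts, remaining = [], list(range(len(points)))
    while remaining:
        front = [i for i in remaining
                 if not any(dominates(points[j], points[i])
                            for j in remaining if j != i)]
        fronts.append(front)
        remaining = [i for i in remaining if i not in front]
    return fronts

# (SDR in dB, Params in M) for four hypothetical genes:
pts = [(11.0, 6.8), (10.9, 0.4), (10.2, 0.3), (9.0, 2.0)]
print(non_dominated_fronts(pts))   # [[0, 1, 2], [3]]
```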
V. EXPERIMENT SETTING

A. Datasets and evaluation metrics
The proposed method was evaluated on two popular datasets: MIR-1K [37] and DSD100 [38].
Algorithm 2 Multi-objective E-MRP-CNN
1: Generate the initial population of size n
2: Data preparation: training set D → {D_tr, D_te, D_v}
3: Compute the SDR and model complexity, then perform fast non-dominated sorting and crowded-comparison
4: Remove low-fitness genes according to the population limit Z
5: for i = 1 to N (maximum generation) do
6:   Generate o_c new genes by crossover with prob. p
7:   Generate o_m new genes by mutation with prob. p
8:   Compute the SDR and Params of all new offspring
9:   Sort all genes (current + new) using fast non-dominated sorting and crowded-comparison
10:  Remove low-fitness genes by the population limit Z
11: end for
The MIR-1K dataset contains a thousand song clips extracted from 110 karaoke songs. For a fair comparison, we followed the evaluation conditions in [8], [11], [39], [40]: the 175 clips performed by one male singer ('abjones') and one female singer ('amy') were used for training, and the other 825 clips, performed by the remaining 17 singers, were used for testing. On DSD100, the 50 songs of the "Dev" subset were used for training, and we followed [8], [11] to convert all sources to monophonic and then add the three sources other than the vocals together to form the musical-component (i.e., Acc) source.

The separation performance was quantitatively measured by the BSS-EVAL toolkit [35] with respect to three criteria: SDR, SIR, and SAR. The Normalized SDR (NSDR) [41] was calculated to show the improvement of the SDR compared with the original mixture. The Global NSDR (GNSDR), Global SIR (GSIR), and Global SAR (GSAR) were computed by taking the weighted means of the NSDRs, SIRs, and SARs, respectively, over all the test clips, weighted by their length [11], [39] (a small sketch is given below). Some qualitative results are also presented to verify the separation performance of the proposed method.
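A small worked sketch of these global metrics under the definitions above (NSDR as the SDR improvement over the raw mixture; global scores as length-weighted means); the function names are our own.

```python
def nsdr(sdr_estimate: float, sdr_mixture: float) -> float:
    """NSDR: SDR improvement of the estimate over the raw mixture [41]."""
    return sdr_estimate - sdr_mixture

def global_score(scores, lengths):
    """GNSDR/GSIR/GSAR: mean over test clips weighted by clip length."""
    return sum(s * l for s, l in zip(scores, lengths)) / sum(lengths)

# e.g. two clips of 8 s and 12 s with NSDRs of 10.0 dB and 11.0 dB:
print(global_score([10.0, 11.0], [8.0, 12.0]))  # 10.6
```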
B. T-F masking framework

The proposed MRP-CNN and E-MRP-CNN were evaluated with the T-F masking framework in Fig. 2, where the red rectangle is the key separation module (which can be the proposed structure or one of the compared structures). The output of the separation module is fed to a convolution layer (blue rectangle), which has two outputs for estimating the T-F masks of the Vocal and Acc sources in MSVS. This framework (or a similar one) is widely employed in many MSVS/MSS methods (see [11], [21], [22], [42]). (Although it is advantageous to use an independent separation module for each source, i.e., two separation modules for two sources, this is computationally expensive according to [11]. Hence, following [11], we use only one separation module.)

The above framework was used in both the evolution process (denoted by Evo) and the final evaluation (denoted by Eva). For each situation, we have two scenarios: training (Tra) and testing (Tes). For Evo, we trained the evolved structures in the T-F masking framework using D_tr (Evo&Tra) and then tested the trained structures on D_te (Evo&Tes) to obtain the SDR fitness. For Eva, the final evolved structures were trained in the T-F masking framework using the full training set (Eva&Tra) and then tested on the full testing set (Eva&Tes).
Fig. 2: The T-F masking framework.

When using the T-F masking framework, the input mixture signal (time-domain) was first downsampled to 8 kHz to speed up the computation [11]. The 8 kHz mixture signal was transformed to its spectrogram (a complex matrix) via the short-time Fourier transform (STFT) with fixed window and hop sizes. The magnitude spectrogram of the mixture was normalized by dividing by its maximum value and then split into fixed-size blocks (frequency × frames) to form batches. The batches of the mixture were fed to the separation module, and its output was fed to the convolution layer to predict the masks (in batches) for the Vocal and Acc sources. The predicted masks were used in (i) the training process (Evo&Tra and Eva&Tra) to compute the loss function and (ii) the testing process (Evo&Tes and Eva&Tes) to output the time-domain estimated sources.

In the training process (Evo&Tra and Eva&Tra), the L_{1,1}-norm loss function in [11], [43] was adopted for a fair comparison. Formally, given the mixture X, the i-th ground-truth source Y_i, and the predicted mask M_i for the i-th source (i = 1 ... s, with s = 2 in MSVS), the loss function is defined as J = Σ_{i=1}^{s} ||Y_i − X ⊙ M_i||_{1,1}, where ⊙ denotes the element-wise multiplication of matrices (a sketch is given at the end of this subsection). Note that when computing the loss function, the magnitude spectrograms of the ground-truth Vocal and Acc sources were also normalized by dividing by the maximum value of their mixture's magnitude spectrogram.

In the testing process (Evo&Tes and Eva&Tes), the predicted masks for Vocal and Acc were truncated to the range [0, 1] and multiplied with the normalized spectrogram of the mixture [11]. After de-normalization and batch combination, the time-domain sources were obtained via the inverse STFT (ISTFT) followed by upsampling.

In particular, for Eva&Tra, two data augmentation operations, gain and sliding, were applied to the original time-domain ground-truth sources to create new mixtures. The gain operation multiplied the original source by a random factor a, and the sliding operation added a random short delay d to the beginning of the original source. The newly obtained ground-truth sources were mixed to form new mixtures, and the augmented data were combined with the original data at a fixed ratio. All the differences in using the T-F masking framework in the four scenarios are summarized in Table III.
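A minimal PyTorch sketch of this L_{1,1} masking loss, following the formula above; the shapes and function name are illustrative, not the authors' code.

```python
import torch

def masking_loss(X, Y, M):
    """J = sum_i || Y_i - X (.) M_i ||_{1,1} for s sources.
    X: mixture magnitude (batch, F, T); Y, M: (batch, s, F, T)."""
    return (Y - X.unsqueeze(1) * M).abs().sum()

X = torch.rand(4, 512, 64)      # normalized mixture spectrograms
Y = torch.rand(4, 2, 512, 64)   # ground-truth Vocal/Acc spectrograms
M = torch.rand(4, 2, 512, 64)   # predicted masks
print(masking_loss(X, Y, M))
```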
C. Hyperparameters of the E-MRP-CNN

Table IV lists the hyperparameters of the E-MRP-CNN. Since the multi-objective scheme requires more diversity, its population limit Z and mutation number o_m were higher than those of the single-objective scheme. For MIR-1K, D_tr, D_te, and D_v were set to 100/55/20 (clips); for DSD100, they were set to 30/15/5 (songs). For the multi-objective scheme, D_v and S were not used.

TABLE III: Differences among the four scenarios of using the T-F framework.
Scenario             Evo&Tra   Evo&Tes      Eva&Tra   Eva&Tes
Data augmentation                           ✓
Training dataset D   ✓         ✓            ✓
Testing dataset                                       ✓
Subset D_tr of D     ✓
Subset D_te of D               ✓
Subset D_v of D                ✓ (Single)
Truncation                     ✓                      ✓

TABLE IV: Hyperparameters of the E-MRP-CNN.
Scheme   n    u    N     Z    o_c   o_m   p (crossover)   p (mutation)   D_tr     D_te    D_v    S
Single   22   20   100   15   10    25    0.5             0.02           100/30   55/15   20/5   8
Multi.   37   20   100   25   10    35    0.5             0.02           100/30   55/15   –      –

(D_tr/D_te/D_v entries are given as MIR-1K/DSD100 values.)

D. Training parameters
The Adam optimizer [44] was employed to train the T-F masking framework. In Evo&Tra, the aim is to compute the 'fitness' of each gene, so the T-F masking framework was only partially trained, with a fixed (small) number of iterations on each of the MIR-1K and DSD100 datasets. In Eva&Tra, the framework was fully trained with a much larger number of iterations on each dataset.

In both Evo&Tra and Eva&Tra, two tricks were used: (i) a cosine-decay learning rate with warm restarts [45] and (ii) learning-rate warmup [46]. For (i), we set T_0 = 100 and T_m = 2 for both datasets in Evo&Tra, and T_0 = 1000 for MIR-1K (a different value was used for DSD100) and T_m = 2 in Eva&Tra, where T_0 is the length of the first decay period [45] and T_m is the multiplication factor for the decay period length at every new warm restart [45]. Fixed maximum and minimum learning rates were used for Evo&Tra and Eva&Tra (more details can be found in [45]). For (ii), we scaled the learning rate in the first iterations of Evo&Tra (Eva&Tra) by a small factor, to avoid the maximum learning rate being too large for some genes. A schedule sketch is given below.
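A sketch of this learning-rate schedule, combining cosine decay with warm restarts [45] and a scaled warmup phase [46]. The warmup length, scale factor, and default periods below are illustrative assumptions, since the exact values are not reproduced above.

```python
import math

def lr_at(step, lr_max, lr_min, T0=100, Tm=2, warmup=500, warmup_scale=0.1):
    """Learning rate at a given iteration (all defaults illustrative)."""
    if step < warmup:                 # (ii) warmup: scaled-down learning rate
        return lr_max * warmup_scale
    t, T = step - warmup, T0
    while t >= T:                     # locate the current restart period,
        t, T = t - T, T * Tm          # whose length grows by Tm each restart
    cos = 0.5 * (1.0 + math.cos(math.pi * t / T))
    return lr_min + (lr_max - lr_min) * cos   # (i) cosine decay within period
```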
VI. EXPERIMENTAL RESULTS

A. Evolution process of the E-MRP-CNN
For both the single-objective and multi-objective schemes, the MRP-CNN structure in Table II was used as the seed of the initial population of the E-MRP-CNN on the two datasets. The evolved genes (structures) of the E-MRP-CNN are denoted in the form "S/M-G-Index-Dataset", where S and M denote the single-objective and multi-objective schemes, respectively, G is the generation (evolution) number, Index is the gene index in the G-th generation, and Dataset can be MIR (MIR-1K) or DSD (DSD100). For the single-objective scheme, the Index is the SDR ranking of a gene in the current generation; for the multi-objective scheme, the Index is the gene index in the current generation. For example, "S-25-2-MIR" represents the structure with the second-highest SDR performance in the 25th generation of the single-objective scheme on the MIR-1K dataset, and "M-99-2-DSD" represents the No. 2 evolved structure in the 99th generation of the multi-objective scheme on the DSD100 dataset.

Fig. 3: The evolution processes of the single-objective and multi-objective E-MRP-CNN on MIR-1K. (a) Multi-objective scheme; (b) single-objective scheme.

Fig. 4: The evolution processes of the single-objective and multi-objective E-MRP-CNN on DSD100. (a) Multi-objective scheme; (b) single-objective scheme.

We recorded the dynamic evolution process of the E-MRP-CNN in Fig. 3 for the MIR-1K dataset and Fig. 4 for the DSD100 dataset. The vertical axis in each figure represents the model complexity measured by Params and the horizontal axis represents the fitness score measured by SDR. Each colored data point stands for a gene, i.e., an MRP-CNN structure. The genes of different generations are distinguished by colors changing from red (initial generation) to pink (highest generation). We set the highest evolution number to 100. In our experiments, the single-objective scheme stopped evolving on the MIR-1K and DSD100 datasets once the SDR of the best gene showed no improvement for S = 8 consecutive generations. For the multi-objective scheme, we can observe the evolution process of all generations (i.e., 0 ≤ G ≤ 99) on both the MIR-1K and DSD100 datasets. By comparing the two subfigures in Fig. 3 and in Fig. 4, we can find that the single-objective scheme and the multi-objective scheme had different evolution trends.

As shown in Fig. 3(a) and Fig. 4(a), the multi-objective scheme pushed the genes toward the Pareto front generation by generation during the evolution process. In each generation, a set of genes with different model complexities and SDR fitnesses was obtained. More specifically, we can see that the seed gene (represented by the black inverted triangle) had a relatively high model complexity (Params = 2.33 M) and a low SDR score on both datasets. As the evolution proceeded, the new generations gradually moved toward the Pareto-optimal front. For example, the first generations in Fig. 3(a) and Fig.
4(a) (red and yellow points) spread widely, the intermediate generations (yellow and green points) started to move toward the lower-right boundary, and the higher generations (blue and pink points) converged approximately to the Pareto-optimal front. These results suggested that better genes (in model complexity, in SDR performance, or in both) could be obtained as the evolution proceeded. Finally, a set of structures with better overall performance in model complexity and/or SDR performance was obtained, which can deal with different complexity requirements.

Compared with the multi-objective scheme, the model complexity of the genes in the single-objective scheme was not reduced during the evolution process, as shown in Fig. 3(b) and Fig. 4(b). This is because the model complexity was not considered in the single-objective scheme. In particular, we can see from Fig. 3(b) and Fig. 4(b) that the single-objective scheme, without the constraint of model complexity, could steadily improve the SDR performance generation by generation, while in the multi-objective scheme the genes of one generation (Fig. 3(a) and Fig. 4(a)) showed much variety in SDR performance (so that they can deal with different complexity levels). In addition, by comparing Fig. 3(a) with Fig. 3(b), and Fig. 4(a) with Fig. 4(b), we can find that the single-objective scheme could achieve a similar SDR performance to the multi-objective scheme within much fewer generations. For example, in Fig. 3(b), the single-objective scheme reached SDR = 11 dB within its first few generations, while this required considerably more generations in the multi-objective scheme. Nevertheless, we can observe that the multi-objective scheme could achieve a lower model complexity at SDR = 11 dB than the single-objective scheme. It is also found that the single-objective scheme behaved differently on the two datasets. On the MIR-1K dataset (see Fig. 3(b)), the model complexity increased significantly at high SDR scores, while this phenomenon was not apparent on the DSD100 dataset (see Fig. 4(b)).

We also labelled some representative genes in Fig. 3 and Fig. 4 (see the legend in each subfigure). For the multi-objective scheme in Fig. 3(a) and Fig. 4(a), the seed gene, genes of early generations (G=25 and G=50), and genes of the final generation (G=99) are plotted. It is clear that better genes (in model complexity, in SDR performance, or in both) were obtained during the evolution process. For the single-objective scheme, we intentionally continued the evolution process for a few more generations. Typical genes, including the seed gene, genes of early generations (G=1, G=8 for MIR-1K and G=1, G=8, G=16 for DSD100), genes of the final generation (G=16 for MIR-1K and G=31 for DSD100), and genes after the final generation (G=29 for MIR-1K and G=49 for DSD100), are plotted in Fig. 3(b) and Fig. 4(b). It is found from Fig. 3(b) that the gene of a later generation, i.e., S-29-1-MIR, provided higher SDR performance than the best gene obtained in the evolution process, i.e., S-16-1-MIR, on the testing subset D_te. For DSD100 in Fig. 4(b), the gene after the final generation, i.e., S-49-1-DSD, provided similar SDR performance to the final evolved genes, i.e., S-31-1-DSD and S-31-2-DSD. The performance of all these evolved genes was evaluated and compared using the full training and testing datasets, as shown in the next section.

B. Final evaluations
In this section, we compare the evolved structures with other SOTA MSVS methods using the full MIR-1K and DSD100 datasets. In accordance with previous methods [47]–[51], on the DSD100 dataset we computed the SDRs/SIRs/SARs of all songs and then took their median values.

TABLE V: Computational efficiency of the proposed method (Seed, M-∗, and S-∗) and the existing methods (MR-FCNN, SHN-∗, and SA-SHN-∗).

Method       Params [M]   FLOPs [G]   Training Speed [bat./s]   Inferring Speed [bat./s]
Seed         2.33         129.72      31.61                     93.09
M-25-5-MIR   1.21         39.33       57.19                     194.54
M-50-2-MIR   0.37         18.27       97.50                     349.71
M-50-5-MIR   1.04         52.30       53.13                     171.80
M-99-2-MIR   0.30         11.64       121.37                    445.10
M-99-4-MIR   1.24         48.69       50.76                     171.02
M-99-5-MIR   2.42         130.66      31.58                     94.34
M-99-8-MIR   8.27         454.28      13.13                     35.78
M-25-4-DSD   0.77         26.91       71.30                     249.70
M-50-8-DSD   2.73         139.41      29.21                     87.59
M-99-2-DSD   0.15         7.91        175.22                    621.22
M-99-4-DSD   0.38         15.64       108.06                    408.32
M-99-6-DSD   0.59         21.63       87.76                     311.76
M-99-7-DSD   3.18         151.41      27.83                     82.70
S-1-1-MIR    2.47         135.31      30.40                     89.67
S-8-1-MIR    3.91         193.09      23.09                     67.28
S-16-1-MIR   6.76         404.45      13.89                     38.72
S-29-1-MIR   6.67         400.50      13.90                     38.79
S-1-1-DSD    2.55         135.85      30.16                     89.48
S-8-1-DSD    2.80         138.79      29.87                     88.55
S-16-1-DSD   2.53         136.47      30.45                     91.05
S-31-1-DSD   2.15         116.94      33.62                     102.16
S-31-2-DSD   2.84         144.63      29.13                     85.66
S-49-1-DSD   2.48         125.90      32.72                     97.97
MR-FCNN      0.56         36.56       9.03                      18.59
SHN-1        9.06         168.29      29.94                     87.70
SHN-2        17.46        292.87      16.70                     49.19
SHN-4        34.18        537.66      8.84                      26.09
SA-SHN-1     9.85         197.29      14.41                     40.08
SA-SHN-2     19.03        350.87      7.56                      20.95
SA-SHN-4     37.33        653.67      3.87                      10.70
1) Quantitative evaluations:
The evolved structures in Fig. 3 and Fig. 4 were first compared with some typical MR-CNN based MSVS methods under the T-F masking framework: MR-FCNN [14], SHN [11], and SA-SHN [8]. The performance of all structures was evaluated with respect to computational efficiency and separation performance. The results on computational efficiency are listed in Table V, and the results on separation performance are listed in Table VI (MIR-1K dataset) and Table VII (DSD100 dataset). In these tables, we use SHN-n and SA-SHN-n to represent the n-layer SHN and n-layer SA-SHN, respectively.

Computational efficiency:
The computational efficiency in Table V was calculated in theory and measured in a real hardware/software environment. The theoretical efficiency is given by Params and FLOPs, where Params denotes the number of trainable parameters of each structure and FLOPs represents the number of floating-point operations for testing (inferring) one batch. In practice, two structures with similar Params and FLOPs may have different computation speeds, thus the computational efficiency was also measured in a real hardware/software environment (GPU: RTX 2080Ti; CPU: Intel Core i9 9900K; memory: 4 × 16 GB DDR4 (3200 MHz); under Linux, we used TensorFlow 2.0 with CUDA 10.1 and cuDNN 7.6). The real computational efficiency in training and inferring is given in bat./s, that is, the number of batches per second.

According to Table V, the multi-objective scheme provided multiple structures with varying model complexities in one generation, e.g., M-99-2/4/5/8-MIR. In particular, most evolved structures in the multi-objective scheme had a lower model complexity than the seed on both datasets. For the single-objective scheme, the model complexity of the evolved structures on MIR-1K increased generation by generation, and most structures had a slightly higher model complexity than the seed. On DSD100, however, the increase in model complexity during the evolution process was not apparent.

The theoretical model complexity of MR-FCNN was lower than those of the seed and some of the evolved structures (see Params and FLOPs). However, in the real environment, its computation speed was much slower than those of the seed and the evolved structures, e.g., MR-FCNN vs. S-8-1-MIR, MR-FCNN vs. M-50-8-DSD. In particular, some evolved structures, e.g., M-50-2-MIR, M-99-2-MIR, and M-99-2-DSD, achieved an even lower theoretical model complexity than MR-FCNN. For SHN and SA-SHN, the model complexity increased with the layer number, and the model complexities of these two methods were much higher than those of the seed, the multi-objective scheme, the single-objective scheme, and the MR-FCNN.

TABLE VI: The separation performance on MIR-1K (in dB) of the proposed method (Seed, M-∗, and S-∗) and the existing SOTA methods (MR-FCNN, SHN-∗, and SA-SHN-∗).

Method       Acc GNSDR/GSIR/GSAR     Vocal GNSDR/GSIR/GSAR   Mean GNSDR/GSIR/GSAR
Seed         10.23 / 14.16 / 13.08   11.26 / 17.29 / 12.94   10.74 / 15.72 / 13.01
M-25-5-MIR   10.03 / 13.24 / 13.56   11.80 / … / …           … / … / …
S-29-1-MIR   10.51 / 14.20 / 13.58   11.83 / 17.79 / 13.51   11.17 / 16.00 / 13.54
MR-FCNN      8.65 / 11.65 / 12.35    9.66 / 15.72 / 11.40    9.16 / 13.68 / 11.87
SHN-1        9.85 / 13.66 / 12.85    10.88 / 16.63 / 12.71   10.36 / 15.15 / 12.78
SHN-2        9.94 / 13.67 / 12.96    11.10 / 17.13 / 12.82   10.52 / 15.40 / 12.89
SHN-4        9.97 / 13.65 / 13.08    11.13 / 17.09 / 12.89   10.55 / 15.37 / 12.98
SA-SHN-1     10.12 / 13.78 / 13.25   11.32 / 17.15 / 13.10   10.72 / 15.47 / 13.18
SA-SHN-2     10.34 / 13.99 / 13.46   11.71 / 17.58 / 13.44   11.02 / 15.79 / 13.45
SA-SHN-4     10.53 / 14.54 / 13.38   11.75 / 17.87 / 13.40   11.14 / 16.21 / 13.39
Separation performance:
We can see from Table VI (MIR-1K dataset) that the evolved structures of both the single-objective and multi-objective schemes achieved higher GNSDR and GSIR performance on the Vocal source and higher GSAR performance on the Acc source than the seed. For DSD100 in Table VII, most evolved structures achieved higher SDR performance on the Acc and Vocal sources than the seed, and for the Vocal source most evolved structures also achieved higher SIR/SAR performance. The last three columns of Table VI and Table VII list the mean results over Vocal and Acc. One can see that the overall separation performance of most evolved structures in the single-objective and multi-objective schemes outperforms the seed on the three evaluation metrics. In addition, by comparing the proposed method (including the seed, the single-objective scheme, and the multi-objective scheme) with the other methods, one can see that the single-objective scheme, the multi-objective scheme, and the SA-SHN outperformed the MR-FCNN and the SHN.

TABLE VII: The separation performance on DSD100 (in dB) of the proposed method (Seed, M-∗, and S-∗) and the existing SOTA methods (MR-FCNN, SHN-∗, and SA-SHN-∗).

Method       Acc (Median) SDR/SIR/SAR   Vocal (Median) SDR/SIR/SAR   Mean SDR/SIR/SAR
Seed         12.18 / 18.36 / 14.47      5.47 / 13.16 / 7.01          8.83 / 15.76 / 10.74
M-25-4-DSD   … / … / …                  … / … / …                    … / … / …
S-1-1-DSD    12.33 / … / …              … / … / …                    … / … / …
Computational efficiency vs. Separation performance: (i) Proposed method:
By comparing Table V with the mean results in Tables VI-VII, we can find that, within the same generation of the multi-objective scheme, the structures with a higher model complexity can provide higher performance on both datasets, e.g., from M-99-2-MIR to M-99-8-MIR and from M-99-2-DSD to M-99-7-DSD. In the single-objective scheme, a higher generation (with increased model complexity) usually achieved better separation performance, e.g., from S-1-1-MIR to S-16-1-MIR and from S-1-1-DSD to S-31-1-DSD. In particular, according to Fig. 3(b), the S-29-1-MIR (a structure of a later generation, after the stopping criterion was satisfied) provided higher SDR performance than the final evolved gene S-16-1-MIR on the testing subset D_te, while according to Table VI this gene does not outperform the S-16-1-MIR on the full MIR-1K dataset. This result verified the effectiveness of the stopping criterion of the single-objective scheme.

By comparing the multi-objective scheme with the seed, one can see that the multi-objective scheme can obtain better separation results than the seed with a lower model complexity. For example, the M-99-6-DSD improved the mean SDR over the seed while using only 25.3% of the Params and 16.7% of the FLOPs of the seed; in the real environment, this structure was also about 2.8 times (training) and 3.3 times (inferring) faster than the seed. When comparing the single-objective scheme with the seed, we can find that the single-objective scheme achieved much better separation results with a slightly higher model complexity on the MIR-1K dataset, e.g., the S-16-1-MIR, which improved the mean GNSDR over the seed at the additional cost of 4.43 M Params and 274.73 G FLOPs. On the DSD100 dataset, the single-objective scheme achieved much better separation results with a similar or even lower model complexity than the seed, e.g., the S-31-1-DSD, which improved the mean SDR while having a lower model complexity (only 92.3% in Params and 90.1% in FLOPs) than the seed.

Fig. 5: Visualization of all structures in Params (vertical axis) and mean GNSDR/SDR (horizontal axis). (a) On the MIR-1K dataset; (b) on the DSD100 dataset.
Fig. 6: Qualitative comparison of the proposed method with MR-FCNN, SHN-4, and SA-SHN-4 on the MIR-1K dataset. (a) Vocal; (b) Acc.

When comparing the single-objective scheme with the multi-objective scheme, we can find that the multi-objective scheme could sometimes find more effective and efficient structures (with similar or lower model complexity but better separation performance) than the single-objective scheme. For example, the M-99-6-DSD is higher in the mean SDR than the S-31-1-DSD but has only 27.4% of the Params and 18.5% of the FLOPs of the S-31-1-DSD; in the real environment, the M-99-6-DSD was about 2.6 times (training) and 3.0 times (inferring) faster than the S-31-1-DSD. Such a phenomenon can also be observed on the MIR-1K dataset, e.g., the M-50-5-MIR, with only 26.6% of the Params and 27.1% of the FLOPs of the S-8-1-MIR, was only slightly lower than the S-8-1-MIR in mean GNSDR; in the real environment, the M-50-5-MIR was about 2.3 times (training) and 2.6 times (inferring) faster than the S-8-1-MIR. These observations suggest that the multi-objective scheme can greatly reduce the model complexity while maintaining an acceptable separation performance.

(ii) Proposed method vs. other methods: Compared with the proposed method (the seed, the single-objective scheme, and the multi-objective scheme), the MR-FCNN has a lower theoretical model complexity (in Params and FLOPs), while in the real environment it was much slower in training (9.03 bat./s) and inferring (18.59 bat./s) than the proposed method. In separation performance, the MR-FCNN was much worse than the proposed method. In particular, we can see from Tables VI-VII that the evolved structures of the multi-objective scheme, e.g., M-99-2-MIR, M-50-2-MIR, and M-99-2-DSD, achieved better separation performance in mean GNSDR/SDR than MR-FCNN even with a lower model complexity.

The SHN and SA-SHN also achieved good separation performance, especially the SA-SHN. However, these two methods had low computational efficiency. For example, on the MIR-1K dataset, the SHN-4 (the best-performing SHN) and the M-99-2-MIR of the multi-objective scheme have similar mean GNSDR results, while the model complexity of SHN-4 was about 114 times (Params) and 46 times (FLOPs) that of M-99-2-MIR; in the real environment, the SHN-4 was about 13.7 times (training) and 17.1 times (inferring) slower than the M-99-2-MIR. A similar phenomenon can also be observed from, e.g., SHN-4 vs. S-1-1-MIR, SHN-4 vs. S-1-1-DSD, and SHN-4 vs. M-99-6-DSD. For SA-SHN, we can see from Tables VI-VII
For the SA-SHN, we can see from Tables VI-VII that although the SA-SHN-4 (the best-performing SA-SHN) had similar GNSDR/SDR results to the single-objective and multi-objective schemes on the MIR-1K and DSD100 datasets, its model complexity was much higher than that of the proposed structures, e.g., SA-SHN-4 vs. S-16-1-MIR and SA-SHN-4 vs. M-25-4-DSD.

All the above results suggest that the proposed method (especially the multi-objective scheme) is more effective and efficient than the MR-FCNN, the SHN, and the SA-SHN. To clearly visualize these quantitative results, we plotted all the data (except for the MR-FCNN) of Tables V-VII in Fig. 5, where the vertical axis is the Params and the horizontal axis is the mean GNSDR/SDR.
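A plot in the style of Fig. 5 can be generated directly from the tabulated results. The sketch below assumes matplotlib; the three (name, mean GNSDR, Params) triples are hypothetical placeholders, and the real values come from Tables V-VII.

    import matplotlib.pyplot as plt

    # Hypothetical placeholder values; substitute the entries of Tables V-VII.
    structures = [
        ("Seed",       10.7, 0.9),   # (name, mean GNSDR [dB], Params [M])
        ("S-16-1-MIR", 11.9, 1.4),
        ("M-99-2-MIR", 10.9, 0.2),
    ]

    fig, ax = plt.subplots()
    for name, gnsdr, params in structures:
        ax.scatter(gnsdr, params)
        ax.annotate(name, (gnsdr, params))
    ax.set_xlabel("mean GNSDR [dB]")   # horizontal axis: separation performance
    ax.set_ylabel("Params [M]")        # vertical axis: model complexity
    ax.set_yscale("log")               # complexities span orders of magnitude
    fig.savefig("fig5_mir1k.png")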
2) Qualitative Results:
We also qualitatively compared the separation performance of the above methods. The separation results on an exemplar MIR-1K song clip (geniusturtle_6_04) are shown in Fig. 6. By comparing against the ground-truth (G.T.) Vocal and Acc, one can see that the MR-FCNN, the SHN-4, the SA-SHN-4, and the seed of the proposed method wrongly assigned an important frequency component of the Acc ( Hz, appearing around s ~ s) to the Vocal. Besides, the MR-FCNN and the SHN-4 could not capture some of the fine vocal details. In contrast, the evolved structures of the single-objective scheme, e.g., S-1-1-MIR, S-8-1-MIR, and S-16-1-MIR, correctly put this frequency component back into the Acc. For the multi-objective scheme, the separation results of several structures from the 99th generation, e.g., M-99-2/4/5/8-MIR, are exhibited. The M-99-4/5/8-MIR correctly assigned the Hz frequency component back to the Acc, while the M-99-2-MIR did not; according to Table V, the M-99-2-MIR trades some separation performance for a very low model complexity. Finally, one can see that the estimated magnitude spectrograms of the Vocal and Acc obtained by the M-99-4/5/8-MIR were quite similar to those of the ground-truth Vocal and Acc sources.
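Qualitative comparisons of this kind are produced by plotting the magnitude spectrograms of the reference and estimated sources side by side. A minimal sketch, assuming librosa and hypothetical file names (the actual clip is geniusturtle_6_04 from MIR-1K):

    import numpy as np
    import librosa
    import librosa.display
    import matplotlib.pyplot as plt

    # Hypothetical file names for the reference Vocal and one estimate of it.
    ref, sr = librosa.load("gt_vocal.wav", sr=None)
    est, _ = librosa.load("est_vocal.wav", sr=sr)

    fig, axes = plt.subplots(1, 2, sharey=True)
    for ax, (title, y) in zip(axes, [("G.T.: Vocal", ref), ("Estimate: Vocal", est)]):
        # Log-magnitude STFT, shown with time and frequency (Hz) axes as in Fig. 6.
        S_db = librosa.amplitude_to_db(np.abs(librosa.stft(y)), ref=np.max)
        librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", ax=ax)
        ax.set_title(title)
    fig.savefig("vocal_comparison.png")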
3) Comparison with Other Methods:
We finally compared the proposed method with other MSVS methods. The results are listed in Tables VIII-IX. These numerical results verified the separation performance of the proposed method.

TABLE IX: Median SDR values of the proposed method (Seed, M-*, and S-*) and other MSVS methods (DeepNMF, wRPCA, NUG, BLEND, and MM-DenseNet) on the DSD100 dataset (in dB).

    Method             Vocal     Acc
    DeepNMF [47]        2.75     8.90
    wRPCA [48]          3.92     9.45
    NUG [49]            4.55    10.29
    BLEND [50]          5.23    11.70
    MM-DenseNet [51]    6.00    12.10
    Seed                5.47    12.18
    M-25-4-DSD          6.21
    M-50-8-DSD          6.31    12.70
    M-99-2-DSD          5.36    11.96
    M-99-4-DSD          5.95    12.52
    M-99-6-DSD          6.15    12.64
    M-99-7-DSD
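For reference, the GNSDR values reported for MIR-1K follow the usual convention: NSDR is the SDR gain of the estimate over the unprocessed mixture, and GNSDR is the length-weighted average of NSDR over all test clips (for DSD100, the median per-track SDR is reported instead). A minimal sketch, assuming mir_eval for the underlying BSS Eval computation:

    import numpy as np
    import mir_eval

    def nsdr(est, ref, mix):
        # NSDR = SDR(est, ref) - SDR(mix, ref): the SDR gain over the raw mixture.
        sdr_est, _, _, _ = mir_eval.separation.bss_eval_sources(ref[None, :], est[None, :])
        sdr_mix, _, _, _ = mir_eval.separation.bss_eval_sources(ref[None, :], mix[None, :])
        return sdr_est[0] - sdr_mix[0]

    def gnsdr(clips):
        # clips: iterable of (est, ref, mix) waveforms; weights are clip lengths.
        clips = list(clips)
        weights = np.array([len(r) for _, r, _ in clips], dtype=float)
        scores = np.array([nsdr(e, r, m) for e, r, m in clips])
        return float(np.sum(weights * scores) / np.sum(weights))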
VII. CONCLUSIONS
As the first attempt of its kind in the field of MSVS, this paper proposed an evolutionary framework, the E-MRP-CNN, to automatically find effective neural network structures for MSVS. The proposed E-MRP-CNN is based on a novel MR-CNN, namely the MRP-CNN, which utilizes various-size average pooling operators for feature extraction. Compared with existing MR-CNNs, the MRP-CNN has a low computational complexity and can effectively extract multi-resolution features for MSVS. We derived the E-MRP-CNN using single-objective and multi-objective genetic algorithms. The single-objective E-MRP-CNN considers only the separation performance, while the multi-objective E-MRP-CNN considers both the separation performance and the model complexity, and thus provides a set of solutions that accommodate different separation performance and/or model complexity requirements. Experimental results on the MIR-1K and DSD100 datasets showed that the proposed method (especially the multi-objective scheme) is more effective and efficient than several SOTA MSVS methods.

VIII. ACKNOWLEDGMENT
This work was supported by the National Natural Science Foundation of China (No. 61902280, 61373104) and the Natural Science Foundation of Tianjin (No. 19JCYBJC15600). It was also supported by a Grant-in-Aid for Scientific Research (B) (No. 17H01761) and the I-O DATA Foundation.

REFERENCES

[1] Z. Rafii, A. Liutkus, F. Stöter, S. I. Mimilakis, D. FitzGerald, and B. Pardo, "An overview of lead and accompaniment separation in music," IEEE/ACM Trans. Audio, Speech & Language Processing, vol. 26, no. 8, pp. 1307–1335, 2018.
[2] Y. Li and D. Wang, "Separation of singing voice from music accompaniment for monaural recordings," IEEE Trans. Audio, Speech & Language Processing, vol. 15, no. 4, pp. 1475–1487, 2007.
[3] P. Huang, S. D. Chen, P. Smaragdis, and M. Hasegawa-Johnson, "Singing-voice separation from monaural recordings using robust principal component analysis," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2012, pp. 57–60.
[4] Y. LeCun, Y. Bengio, and G. E. Hinton, "Deep learning," Nature, vol. 521, no. 7553, pp. 436–444, 2015.
[5] A. J. R. Simpson, G. Roma, and M. D. Plumbley, "Deep karaoke: Extracting vocals from musical mixtures using a convolutional deep neural network," in Proc. 12th Int. Conf. Latent Var. Anal. Signal Separation, 2015, pp. 429–436.
[6] P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Joint optimization of masks and deep recurrent neural networks for monaural source separation," IEEE/ACM Trans. Audio, Speech & Language Processing, vol. 23, no. 12, pp. 2136–2147, 2015.
[7] E. M. Grais and M. D. Plumbley, "Single channel audio source separation using convolutional denoising autoencoders," in Proc. IEEE Global Conf. Signal and Info. Process., 2017, pp. 1265–1269.
[8] W. Yuan, S. Wang, X. Li, M. Unoki, and W. Wang, "A skip attention mechanism for monaural singing voice separation," IEEE Signal Process. Lett., vol. 26, no. 10, pp. 1481–1485, 2019.
[9] J. Paulus, M. Müller, and A. Klapuri, "State of the art report: Audio-based music structure analysis," in Proc. 11th Int. Soc. Music Inf. Ret. Conf., 2010, pp. 625–636.
[10] C. A. Huang, A. Vaswani, J. Uszkoreit, N. Shazeer, C. Hawthorne, A. M. Dai, M. D. Hoffman, and D. Eck, "An improved relative self-attention mechanism for transformer with application to music generation," CoRR, vol. abs/1809.04281, 2018.
[11] S. Park, T. Kim, K. Lee, and N. Kwak, "Music source separation using stacked hourglass networks," in Proc. 19th Int. Soc. Music Inf. Ret. Conf., 2018, pp. 289–296.
[12] A. Jansson, E. J. Humphrey, N. Montecchio, R. M. Bittner, A. Kumar, and T. Weyde, "Singing voice separation with deep U-Net convolutional networks," in Proc. 18th Int. Soc. Music Inf. Ret. Conf., 2017, pp. 745–751.
[13] E. M. Grais, H. Wierstorf, D. Ward, and M. D. Plumbley, "Multi-resolution fully convolutional neural networks for monaural audio source separation," in Proc. 14th Int. Conf. Latent Variable Anal. Signal Separation (LVA/ICA), 2018, pp. 340–350.
[14] E. M. Grais, D. Ward, and M. D. Plumbley, "Raw multi-channel audio source separation using multi-resolution convolutional auto-encoders," in Proc. 26th European Signal Processing Conf., 2018, pp. 1577–1581.
[15] H. Zhang, I. J. Goodfellow, D. N. Metaxas, and A. Odena, "Self-attention generative adversarial networks," in Proc. 36th Int. Conf. on Mach. Learn., 2019, pp. 7354–7363.
[16] A. Stoutchinin, F. Conti, and L. Benini, "Optimally scheduling CNN convolutions for efficient memory access," CoRR, vol. abs/1902.01492, 2019.
[17] H. Lee, P. T. Pham, Y. Largman, and A. Y. Ng, "Unsupervised feature learning for audio classification using convolutional deep belief networks," in Proc. Adv. Neural Inf. Process. Syst., 2009, pp. 1096–1104.
[18] J. Schlüter and S. Böck, "Improved musical onset detection with convolutional neural networks," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2014, pp. 6979–6983.
[19] K. Guo, S. Zeng, J. Yu, Y. Wang, and H. Yang, "A survey of FPGA-based neural network inference accelerators," TRETS, vol. 12, no. 1, pp. 2:1–2:26, 2019.
[20] K. Deb, S. Agrawal, A. Pratap, and T. Meyarivan, "A fast and elitist multiobjective genetic algorithm: NSGA-II," IEEE Trans. Evolutionary Computation, vol. 6, no. 2, pp. 182–197, 2002.
[21] S. I. Mimilakis, K. Drossos, J. F. Santos, G. Schuller, T. Virtanen, and Y. Bengio, "Monaural singing voice separation with skip-filtering connections and recurrent inference of time-frequency mask," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 721–725.
[22] S. I. Mimilakis, K. Drossos, T. Virtanen, and G. Schuller, "A recurrent encoder-decoder approach with skip-filtering connections for monaural singing voice separation," in Proc. Int. Workshop on Mach. Learn. for Signal Processing, 2017, pp. 1–6.
[23] D. Stoller, S. Ewert, and S. Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," in Proc. Conf. Int. Soc. Music Inf. Retrieval, 2018, pp. 334–340.
[24] O. Slizovskaia, L. Kim, G. Haro, and E. Gómez, "End-to-end sound source separation conditioned on instrument labels," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 306–310.
[25] J. Liu and Y. Yang, "Dilated convolution with dilated GRU for music source separation," in Proc. 28th Int. Joint Conf. on Artificial Intelligence, 2019, pp. 4718–4724.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, "Attention is all you need," in NIPS, 2017, pp. 6000–6010.
[27] M. Michelashvili, S. Benaim, and L. Wolf, "Semi-supervised monaural singing voice separation with a masking network trained on synthetic mixtures," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2019, pp. 291–295.
[28] Y. C. Sübakan and P. Smaragdis, "Generative adversarial source separation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2018, pp. 26–30.
[29] Y. Luo, Z. Chen, J. R. Hershey, J. L. Roux, and N. Mesgarani, "Deep clustering and conventional networks for music separation: Stronger together," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2017, pp. 61–65.
[30] S. I. Mimilakis, K. Drossos, E. Cano, and G. Schuller, "Examining the mapping functions of denoising autoencoders in music source separation," CoRR, vol. abs/1904.06157, 2019.
[31] D. So, Q. Le, and C. Liang, "The evolved transformer," in Proc. 36th Int. Conf. Mach. Learn., 2019, pp. 5877–5886.
[32] T. Elsken, J. H. Metzen, and F. Hutter, "Neural architecture search: A survey," J. Mach. Learn. Res., vol. 20, pp. 55:1–55:21, 2019.
[33] K. O. Stanley and R. Miikkulainen, "Evolving neural networks through augmenting topologies," Evolutionary Computation, vol. 10, no. 2, pp. 99–127, 2002.
[34] X. Gong, S. Chang, Y. Jiang, and Z. Wang, "AutoGAN: Neural architecture search for generative adversarial networks," CoRR, vol. abs/1908.03835, 2019.
[35] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE Trans. Audio, Speech & Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[36] K. Li, S. Kwong, Q. Zhang, and K. Deb, "Interrelationship-based selection for decomposition multiobjective optimization," IEEE Trans. Cybernetics, vol. 45, no. 10, pp. 2076–2088, 2015.
[37] C. Hsu and J. R. Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Trans. Audio, Speech & Language Processing, vol. 18, no. 2, pp. 310–319, 2010.
[38] N. Ono, Z. Rafii, D. Kitamura, N. Ito, and A. Liutkus, "The 2015 signal separation evaluation campaign," in Proc. 12th Int. Conf. Latent Var. Anal. Signal Separation, 2015, pp. 387–395.
[39] P. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, "Joint optimization of masks and deep recurrent neural networks for monaural source separation," IEEE/ACM Trans. Audio, Speech & Language Processing, vol. 23, no. 12, pp. 2136–2147, 2015.
[40] Y. Yang, "Low-rank representation of both singing voice and music accompaniment via learned dictionaries," in Proc. 14th Int. Soc. Music Inf. Ret. Conf., 2013, pp. 427–432.
[41] A. Ozerov, P. Philippe, F. Bimbot, and R. Gribonval, "Adaptation of Bayesian models for single-channel source separation and its application to voice/music separation in popular songs," IEEE Trans. Audio, Speech & Language Processing, vol. 15, no. 5, pp. 1564–1578, 2007.
[42] A. Liutkus and R. Badeau, "Generalized Wiener filtering with fractional power spectrograms," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2015, pp. 266–270.
[43] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. 18th Int. Conf. Med. Image Computing and Computer-Assisted Intervention, 2015, pp. 234–241.
[44] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," in Proc. 3rd Int. Conf. Learning Rep., 2015.
[45] I. Loshchilov and F. Hutter, "SGDR: Stochastic gradient descent with warm restarts," in Proc. 5th Int. Conf. Learning Representations, 2017.
[46] P. Goyal, P. Dollár, R. B. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, "Accurate, large minibatch SGD: Training ImageNet in 1 hour," CoRR, vol. abs/1706.02677, 2017.
[47] J. L. Roux, J. R. Hershey, and F. Weninger, "Deep NMF for speech separation," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2015, pp. 66–70.
[48] I. Jeong and K. Lee, "Singing voice separation using RPCA with weighted l1-norm," in Proc. 13th Int. Conf. Latent Var. Anal. Signal Separation, 2017, pp. 553–562.
[49] A. A. Nugraha, A. Liutkus, and E. Vincent, "Multichannel music separation with deep neural networks," in Proc. 24th European Signal Process. Conf., 2016, pp. 1748–1752.
[50] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, "Improving music source separation based on deep neural networks through data augmentation and network blending," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2017, pp. 261–265.
[51] N. Takahashi and Y. Mitsufuji, "Multi-scale multi-band DenseNets for audio source separation," in Proc. IEEE Workshop on App. of Signal Process. to Audio and Acoustics, 2017, pp. 21–25.
[52] J. Sebastian and H. A. Murthy, "Group delay based music source separation using deep recurrent neural networks," in