Content based singing voice source separation via strong conditioning using aligned phonemes
Gabriel Meseguer-Brocal
STMS UMR9912, Ircam/CNRS/SU, Paris [email protected]
Geoffroy Peeters
LTCI, Institut Polytechnique de Paris [email protected]
ABSTRACT
Informed source separation has recently gained renewed interest with the introduction of neural networks and the availability of large multitrack datasets containing both the mixture and the separated sources. These approaches use prior information about the target source to improve separation. Historically, Music Information Retrieval researchers have focused primarily on score-informed source separation, but more recent approaches explore lyrics-informed source separation. However, because of the lack of multitrack datasets with time-aligned lyrics, models use weak conditioning with non-aligned lyrics. In this paper, we present a multimodal multitrack dataset with lyrics aligned in time at the word level with phonetic information, and we explore strong conditioning using the aligned phonemes. Our model follows a U-Net architecture and takes as input both the magnitude spectrogram of a musical mixture and a matrix with aligned phonetic information. The phoneme matrix is embedded to obtain the parameters that control Feature-wise Linear Modulation (FiLM) layers. These layers condition the U-Net feature maps to adapt the separation process to the presence of different phonemes via affine transformations. We show that phoneme conditioning can be successfully applied to improve singing voice source separation.
1. INTRODUCTION
Music source separation aims to isolate the different instruments that appear in an audio mixture (a mixed music track), reversing the mixing process. Informed source separation uses prior information about the target source to improve separation. Researchers have shown that deep neural architectures can be effectively adapted to this paradigm [1, 2]. Music source separation is a particularly challenging task: instruments are usually correlated in time and frequency, with many different harmonic instruments overlapping at several dynamic variations. Without additional knowledge about the sources, the separation is often infeasible. To address this issue, Music Information Retrieval (MIR) researchers have integrated into the source separation process prior knowledge about the different instruments present in a mixture, or musical scores that indicate where sounds appear. This prior knowledge improves the performance [2–4]. Recently, conditioning learning has shown that neural network architectures can be effectively controlled for performing different music source isolation tasks [5–10].

Various multimodal context information can be used. Although MIR researchers have historically focused on score-informed source separation to guide the separation process, lyrics-informed source separation has become an increasingly popular research area [10, 11]. Singing voice is one of the most important elements in a musical piece [12]. Singing voice tasks (e.g. lyric or note transcription) are particularly challenging given its variety of timbre and expressive versatility. Fortunately, recent data-driven machine learning techniques have boosted the quality and inspired many recent discoveries [13, 14]. Singing voice works as a musical instrument and at the same time conveys a semantic meaning through the use of language [14]. The relationship between sound and meaning is defined by finite phonetic and semantic representations [15, 16]. Singing in popular music usually has a specific sound based on phonemes, which distinguishes it from the other musical instruments. This motivates researchers to use prior knowledge such as a text transcript of the utterance or linguistic features to improve singing voice source separation [10, 11]. However, the lack of multitrack datasets with time-aligned lyrics has limited the development of these ideas, and only weak conditioning scenarios have been studied, i.e. using the context information without explicitly informing where it occurs in the signal. Time-aligned lyrics provide abstract and high-level information about the phonetic characteristics of the singing signal. This prior knowledge can facilitate the separation and be beneficial to the final isolation.

Looking to combine the power of data-driven models with the adaptability of informed approaches, we propose a multitrack dataset with time-aligned lyrics. We then explore how we can use strong conditioning, where the content information about the lyrics is available frame-wise, to improve vocal source separation. We investigate strong and weak conditioning using the aligned phonemes via Feature-wise Linear Modulation (FiLM) layers [17] in a U-Net based architecture [18]. We show that phoneme conditioning can be successfully applied to improve standard singing voice source separation and that the simplest strong conditioning outperforms all other scenarios.
2. RELATED WORK
Informed source separation uses context information about the sources to improve the separation quality, introducing into models additional flexibility to adapt to observed signals. Researchers have explored different approaches for integrating different kinds of prior knowledge into the separation [19]. Most of the recent data-driven music source separation methods use weak conditioning with prior knowledge about the different instruments present in a mixture [3, 5, 7–9]. Strong conditioning has been primarily used in score-informed source separation. In this section, we review works related to this topic as well as novel approaches that explore lyrics-informed source separation.
Scores provide prior knowledge for source separation in various ways. For each instrument (source), they define which notes are played at which time, which can be linked to audio frames. This information can be used to guide the estimation of the harmonics of the sound source at each frame [2, 4]. Pioneering approaches rely on non-negative matrix factorization (NMF) [20–23]. These methods assume that the audio is synchronized with the score and use different alignment techniques to achieve this. Nevertheless, alignment methods introduce errors, and local misalignments influence the quality of the separation [21, 24]. This is compensated by allowing a tolerance window around note onsets and offsets [20, 23] or with context-specific methods to refine the alignment [25]. Current approaches use deep neural network architectures, filtering spectrograms with the scores and generating masks for each source [2]. The score-filtered spectrum is used as input to an encoder-decoder convolutional neural network (CNN) architecture similar to [26]. [27] propose an unsupervised method where scores guide the representation learning to induce structure in the separation. They add class activity penalties and structured dropout extensions to the encoder-decoder architecture. Class activity penalties capture the uncertainty about the target label value, and structured dropout uses labels to enforce a specific structure, canceling activity related to unwanted notes.
Due to the importance of the singing voice in a musical piece [12], it is one of the most useful sources to separate in a music track. Researchers have integrated vocal activity information to constrain a robust principal component analysis (RPCA) method, applying a vocal/non-vocal mask or an ideal time-frequency binary mask [28]. [10] propose a bidirectional recurrent neural network (BRNN) method that includes context information extracted from the text via an attention mechanism. The method takes as input a whole audio track and its associated text information, and learns an alignment between the mixture and the context information that enhances the separation. Recently, [11] extract a representation of the linguistic content in the mixture related to cognitively relevant features such as phonemes (but they do not explicitly predict the phonemes). The linguistic content guides the synthesis of the vocals.
3. FORMALIZATION
We use the multimodal information as context to guide and improve the separation. We formalize our problem by answering the following questions, as summarized in [29]:
How is the multimodal model constructed?
We divide the model into two distinct parts [30]: a generic network that carries out the main computation and a control mechanism that conditions the computation with respect to context information and adds additional flexibility. The conditioning itself is performed using FiLM layers [17]. FiLM can effectively modulate a generic source separation model by some external information, controlling a single model to perform different instrument source separations [3, 5]. With this strategy, we can explore the control and conditioning parts regardless of the generic network used.
Where is the context information used? We define at which place in the generic network we insert the context information and how it affects the computation, i.e. weak (or strong) conditioning without (or with) explicitly informing where it occurs in the signal.
What context information?
We explore here prior information about the phonetic evolution of the singing voice, aligned in time with the audio. To this end, we introduce a novel multitrack dataset with lyrics aligned in time.
4. DATASET
The DALI (Dataset of Aligned Lyric Information) dataset [31] is a collection of songs described as a sequence of time-aligned lyrics. Time-aligned lyrics are described at four levels of granularity: notes, words, lines and paragraphs:

A_g = (a_{k,g}), k = 1, ..., K_g, where a_{k,g} = (t_k^0, t_k^1, f_k, l_k, i_k)_g    (1)

where g is the granularity level and K_g the number of elements of the aligned sequence, t_k^0 and t_k^1 being a text segment's start and end times (in seconds) with t_k^0 < t_k^1, f_k a tuple (f_min, f_max) with the frequency range (in Hz) covered by all the notes in the segment (at the note level f_min = f_max, a vocal note), l_k the actual lyric information, and i_k = j the index that links an annotation a_{k,g} with its corresponding upper granularity level annotation a_{j,g+1}. The text segment events for a song are ordered and non-overlapping, that is, t_k^1 ≤ t_{k+1}^0 for all k.

There is a subset of DALI of 513 multitracks with the mixture and its separation into two sources, vocals and accompaniment. This subset comes from the WASABI dataset [32]. The multitracks are distributed over 247 different artists and 32 different genres. The dataset contains 35.4 hours of music and 14.1 hours with vocals, with an average duration per song of 220.83 s (98.97 s with vocals). All the songs are in English.

The original multitracks have the mixture decomposed into a set of unlabeled sources in the form track_1, track_2, ..., track_n. Depending on the song, the files can be RAW (where each source is an instrument track, e.g. a drum snare) or STEMS (where all the RAW files for an instrument are merged into a single file). In the following, we explain how the vocals and accompaniment tracks are automatically created from these unlabelled sources. The process is summarized in Figure 1.

Figure 1. Method used for creating the vocals, accompaniment and mixture versions.

For each track τ of a multitrack song, we compute a singing voice probability vector over time, using a pre-trained Singing Voice Detection (SVD) model [33]. We then obtain a global mean prediction value per track, ε_τ. Assuming that there is at least one track with vocals, we create the vocals source by merging all the tracks with ε_τ ≥ max_τ(ε_τ) · ν, where ν is a tolerance value. All the remaining tracks are fused to define the accompaniment. We manually checked the resulting sources. The dataset is available at https://zenodo.org/record/3970189.

The second version of DALI adds the phonetic information computed for the word level [33]. This level has the words of the lyrics transcribed into a vocabulary of 39 different phoneme symbols as defined in the Carnegie Mellon Pronouncing Dictionary (CMUdict, https://github.com/cmusphinx/cmudict). After selecting the desired time resolution, we can derive a time-frame-based phoneme context activation matrix Z, a binary matrix that indicates the phoneme activation over time. We add an extra row with the 'non-phoneme' activation, which is 1 at time frames with no phoneme activation and 0 otherwise. Figure 2 illustrates the final activation matrix. Although we work only with phonemes-per-word information, we can derive similar activation matrices for other context information such as notes or characters.
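To make the derivation of Z concrete, the sketch below (a minimal NumPy example; the annotation format, the phoneme-to-index mapping and the hop size are illustrative assumptions, not the dataset's actual API) builds the binary activation matrix with the extra 'non-phoneme' row from word-level (start, end, phonemes) entries:

```python
import numpy as np

# CMUdict defines 39 phoneme symbols; index 39 holds the extra 'non-phoneme' row.
N_PHONEMES = 39

def phoneme_activation_matrix(annotations, phoneme_to_idx, n_frames, hop_sec):
    """Build the binary activation matrix Z of shape (T, P) with P = 40.

    annotations: list of (start_sec, end_sec, phoneme_list) word-level entries.
    phoneme_to_idx: dict mapping the 39 CMUdict symbols to indices 0..38.
    hop_sec: time resolution of one frame, in seconds (assumption).
    """
    Z = np.zeros((n_frames, N_PHONEMES + 1), dtype=np.float32)
    for start, end, phonemes in annotations:
        a = int(np.floor(start / hop_sec))
        b = min(int(np.ceil(end / hop_sec)), n_frames)
        for ph in phonemes:
            # All phonemes of a word share the word's time span ("bag of phonemes").
            Z[a:b, phoneme_to_idx[ph]] = 1.0
    # Extra row: 1 at frames with no phoneme activation, 0 otherwise.
    Z[Z[:, :N_PHONEMES].sum(axis=1) == 0, N_PHONEMES] = 1.0
    return Z
```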
5. METHODOLOGY
Our method adapts the C-U-Net architecture [5] to the singing voice separation task, exploring how to use the prior knowledge defined by the phonemes to improve the vocals separation.
Input representations. Let X ∈ R^{T×M} be the magnitude of the Short-Time Fourier Transform (STFT) with M = 512 frequency bands and T time frames. We compute the STFT on an audio signal down-sampled at 8192 Hz using a window size of 1024 samples and a hop size of 768 samples. Let Z ∈ R^{T×P} be the aligned phoneme activation matrix with P = 40 phoneme types and T the same time frames as in X. Our model takes as inputs two submatrices x ∈ R^{N×M} and z ∈ R^{N×P} of N = 128 frames (11 seconds) derived from X and Z.

Figure 2. Binary phoneme activation matrix. Note how words are represented as a bag of simultaneous phonemes.

Model. The C-U-Net model has two components (see [5] for a general overview of the architecture): a conditioned network that processes x and a control mechanism that conditions the computation with respect to z. We denote by x_d ∈ R^{W×H×C} the intermediate features of the conditioned network at a particular depth d in the architecture. W and H represent the 'time' and 'frequency' dimensions and C the number of feature channels (or feature maps). A FiLM layer conditions the network computation by applying an affine transformation to x_d:

FiLM(x_d) = γ_d(z) ⊙ x_d + β_d(z)    (2)

where ⊙ denotes the element-wise multiplication and γ_d(z) and β_d(z) are learnable parameters with respect to the input context z. A FiLM layer can be inserted at any depth of the original model and its output has the same dimension as the x_d input, i.e. ∈ R^{W×H×C}. To perform Eqn (2), γ_d(z) and β_d(z) must have the same dimensionality as x_d, i.e. ∈ R^{W×H×C}. However, we can define them omitting some dimensions. This results in a non-matching dimensionality with x_d, which is solved by broadcasting (repeating) the existing information over the missing dimensions.

As in [5, 18, 34], we use the U-Net [18] as the conditioned network, which has an encoder-decoder mirror architecture based on CNN blocks with skip connections between layers at the same hierarchical level in the encoder and decoder. Each convolutional block in the encoder halves the size of the input and doubles the number of channels. The decoder is made of a stack of transposed convolutional operations, and its output has the same size as the input of the encoder. Following the original C-U-Net architecture, we insert the FiLM layers at each encoding block after the batch normalization and before the Leaky ReLU [5].
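As a minimal illustration of Eqn (2), the following NumPy sketch (with hypothetical shapes) applies the FiLM affine transformation to an intermediate feature map x_d; when γ_d and β_d omit some dimensions, broadcasting repeats them over the missing ones, as described above:

```python
import numpy as np

def film(x_d, gamma_d, beta_d):
    """Apply FiLM(x_d) = gamma_d * x_d + beta_d with broadcasting.

    x_d: feature map of shape (W, H, C).
    gamma_d, beta_d: scalars (FiLM simple) or vectors of shape (C,) (FiLM complex);
    NumPy broadcasting repeats them over the missing W and H dimensions.
    """
    return gamma_d * x_d + beta_d

x_d = np.random.randn(16, 32, 64)                       # (time, frequency, channels)
y_simple = film(x_d, gamma_d=0.5, beta_d=0.1)           # scalar conditioning
y_complex = film(x_d, np.ones(64), np.zeros(64))        # per-channel conditioning
```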
Model    θ (increment over the U-Net)
U-Net
W_si     +14,060
W_co
S_a
S_a*     +327,680
S_c      +80,640
S_c*     +40,960
S_f      +40,320
S_f*     +640
S_s      +480
S_s*     +80

Table 1. Number of parameters (θ) for the different configurations. We indicate the increment over the U-Net architecture.

We now describe the different control mechanisms we use for conditioning the U-Net.

Weak conditioning refers to the cases where:
• γ_d(z) and β_d(z) ∈ R: they are scalar parameters applied independently of the time W, frequency H and channel C dimensions. They depend only on the depth d of the layer within the network [5].
• γ_d(z) and β_d(z) ∈ R^C: this is the original configuration proposed by [17], with different parameters for each channel c ∈ 1, ..., C.

We call them FiLM simple (W_si) and FiLM complex (W_co), respectively. Note how they apply the same transformation without explicitly informing where it occurs in the signal (same value over the dimensions W and H). Starting from the context matrix z ∈ R^{N×P}, we define the control mechanism by first applying the auto-pool layer proposed by [35] (a tuned soft-max pooling that automatically adapts the pooling behavior to interpolate between mean- and max-pooling for each dimension) to reduce the input matrix to a time-less vector. We then feed this vector into a dense layer and two dense blocks, each composed of a dense layer, 50% dropout and batch normalization. For FiLM simple, the numbers of units of the dense layers are 32, 64 and 128. For FiLM complex, they are 64, 256 and 1024. All neurons have ReLU activations. The output of the last block is then used to feed two parallel and independent dense layers with linear activation which output all the needed γ_d(z) and β_d(z). While for the FiLM simple configuration we only need one γ_d and one β_d for each of the 6 different encoding blocks, for the FiLM complex configuration we need as many as the feature channel dimensions of the encoding blocks.
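A rough sketch of this control mechanism is given below (forward pass only, in NumPy, with random weights standing in for the learned ones and a simplified stand-in for the auto-pool layer; dropout and batch normalization are omitted). It maps the phoneme matrix z to one (γ_d, β_d) pair per encoding block, as in the FiLM simple configuration:

```python
import numpy as np

def auto_pool(z, alpha=1.0):
    """Soft-max-weighted average over time (a simplified stand-in for auto-pool [35])."""
    w = np.exp(alpha * z)
    w /= w.sum(axis=0, keepdims=True)
    return (w * z).sum(axis=0)                      # shape (P,)

def relu(a):
    return np.maximum(a, 0.0)

def control_simple(z, dense_ws, out_w_gamma, out_w_beta):
    """Map z (N x P) to one (gamma_d, beta_d) scalar pair per encoding block.

    dense_ws: weight matrices of the dense layers (32, 64 and 128 units).
    out_w_gamma / out_w_beta: final linear layers with 6 outputs (one per block).
    """
    h = auto_pool(z)                                # time-less vector of size P
    for w in dense_ws:
        h = relu(h @ w)
    return h @ out_w_gamma, h @ out_w_beta          # two vectors of length 6

P = 40
rng = np.random.default_rng(0)
z = rng.integers(0, 2, size=(128, P)).astype(np.float32)
ws = [rng.standard_normal((P, 32)), rng.standard_normal((32, 64)),
      rng.standard_normal((64, 128))]
gammas, betas = control_simple(z, ws, rng.standard_normal((128, 6)),
                               rng.standard_normal((128, 6)))
```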
Strong conditioning. We now extend the original FiLM mechanism to adapt it to the strong conditioning scenario. The context information represented in the input matrix z describes the presence of the phonemes p ∈ {1, ..., P} over time n ∈ {1, ..., N}. As in the popular Non-negative Matrix Factorization [36] (but without the non-negativity constraint), our idea is to represent this information as the product of tensors: an activation tensor and two basis tensors. The activation tensor z_d indicates which phoneme occurs at which time: z_d ∈ R^{W×P}, where W is the dimension that represents time at the current layer d (we therefore need to map the time range of z to that of the layer d) and P the number of phonemes. The two basis tensors are γ_d and β_d ∈ R^{H×C×P}, where H is the dimension that represents the frequencies at the current layer d, C the number of input channels and P the number of phonemes. In other words, each phoneme p is represented by a matrix in R^{H×C}. This matrix represents the specific conditioning to apply to x_d if the phoneme exists (see Figure 3). These matrices are learnable parameters (neurons with linear activations), but they do not depend on any particular input information (at a depth d they depend neither on x nor on z); they are rather "activated" by z_d at specific times. As for the weak conditioning, we can define different versions of the tensors:
• the all-version (S_a) described so far, with three dimensions: γ_d, β_d ∈ R^{H×C×P}
• the channel-version (S_c): each phoneme is represented by a vector over input channels (therefore constant over frequencies): γ_d, β_d ∈ R^{C×P}
• the frequency-version (S_f): each phoneme is represented by a vector over input frequencies (therefore constant over channels): γ_d, β_d ∈ R^{H×P}
• the scalar-version (S_s): each phoneme is represented as a scalar (therefore constant over frequencies and channels): γ_d, β_d ∈ R^P
Figure 3. Strong conditioning example with (γ_d × z_d) ⊙ x_d. The phoneme activation z_d defines how the basis tensors (γ_d) are employed for performing the conditioning on x_d.
The global conditioning mechanism can then be written as

FiLM(x_d, z_d) = (γ_d × z_d) ⊙ x_d + (β_d × z_d)    (3)

where ⊙ is the element-wise multiplication and × the matrix multiplication. We broadcast γ_d and β_d over the missing dimensions and transpose them properly to perform the matrix multiplication. We test two different configurations: inserting FiLM at each encoder block as suggested in [5], and inserting FiLM only at the last encoder block as proposed in [3]. We call the former 'complete' and the latter 'bottleneck' (denoted with ∗ after the model acronym). We summarize the different configurations in Table 1.
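The sketch below (NumPy, hypothetical shapes) illustrates Eqn (3) for the scalar version S_s: the basis tensors hold one learnable scalar per phoneme, and the time-resampled activation z_d selects which of them modulate each time frame of x_d:

```python
import numpy as np

def strong_film_scalar(x_d, z_d, gamma_d, beta_d):
    """FiLM(x_d, z_d) = (gamma_d x z_d) ⊙ x_d + (beta_d x z_d), scalar version S_s.

    x_d:     feature map of shape (W, H, C) at depth d.
    z_d:     phoneme activations resampled to the layer's time axis, shape (W, P).
    gamma_d, beta_d: one learnable scalar per phoneme, shape (P,).
    """
    g = z_d @ gamma_d          # (W,) per-frame multiplicative term
    b = z_d @ beta_d           # (W,) per-frame additive term
    # Broadcast the per-frame terms over the frequency and channel dimensions.
    return g[:, None, None] * x_d + b[:, None, None]

W, H, C, P = 16, 32, 64, 40
rng = np.random.default_rng(0)
x_d = rng.standard_normal((W, H, C))
z_d = rng.integers(0, 2, size=(W, P)).astype(np.float64)
out = strong_film_scalar(x_d, z_d, rng.standard_normal(P), rng.standard_normal(P))
```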
6. EXPERIMENTS

DATA.
We split DALI into three sets according to the normalized agreement score η presented in [31] (see Table 2). This score provides a global indication of the alignment correlation between the annotations and the vocal activity.

         Train  Val  Test
Songs    357    30   101

Table 2. DALI split according to the agreement score η.

Training      Test          Aug    SDR   SIR    SAR
Musdb18 (90)  Musdb18 (50)  False  4.27  13.17  5.17
                            True   4.46  12.62  5.29
DALI (357)    Musdb18 (50)  False  4.60  14.03  5.39
                            True   4.96  13.50  5.92
DALI (357)    DALI (101)    False  3.98  12.05  4.91
                            True   4.05  11.40  5.32

Table 3. Data augmentation experiment.
DETAILS.
We train the model using batches of spectrograms randomly drawn from the training set, with a fixed number of batches per epoch. The loss function is the mean absolute error between the predicted vocals (masked input mixture) and the original vocals. We use reduction-on-plateau and early-stopping callbacks evaluated on the validation set; we use 10 songs of the training set for these callbacks. Our output is a time/frequency mask to be applied to the magnitude of the input STFT mixture. We use the phase of the input STFT mixture to reconstruct the waveform with the inverse STFT algorithm. For the strong conditioning, we apply a softmax on the input phoneme matrix z over the phoneme dimension P to constrain the outputs to sum to 1, meaning it lies on a simplex, which helps in the optimization.

We evaluate the performance of the separation using the mir_eval toolbox [37]. We compute three metrics: Source-to-Interference Ratios (SIR), Source-to-Artifact Ratios (SAR), and Source-to-Distortion Ratios (SDR) [38]. In practice, SIR measures the interference from other sources, SAR the algorithmic artifacts introduced in the process, and SDR summarizes the overall performance. We obtain them globally for the whole track. However, these metrics are ill-defined for silent sources and targets. Hence, we also compute the Predicted Energy at Silence (PES) and Energy at Predicted Silence (EPS) scores [10]. PES is the mean of the energy in the predictions at those frames with a silent target, and EPS is the opposite, the mean of the target energy over all frames with a silent prediction and a non-silent target. For numerical stability, in our implementation we add a small constant ε, which results in a lower bound on these metrics [3]. We consider as silent those segments whose total energy is more than a fixed number of dB below the maximum absolute value in the audio. We report the median values of these metrics over all tracks in the DALI test set. For SIR, SAR, and SDR larger values indicate better performance; for PES and EPS smaller values mean better performance.
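As an illustration of this evaluation protocol, the following sketch computes the global SDR/SIR/SAR with mir_eval and a simple PES over framed energies (the silence threshold, the stabilizing constant and the frame length are hypothetical stand-ins, not the exact values used in the paper):

```python
import numpy as np
import mir_eval

def evaluate_track(ref_vocals, est_vocals, ref_acc, est_acc, frame=8192):
    """Global SDR/SIR/SAR via mir_eval plus a simple PES on framed energies."""
    sdr, sir, sar, _ = mir_eval.separation.bss_eval_sources(
        np.stack([ref_vocals, ref_acc]), np.stack([est_vocals, est_acc]))

    eps = 1e-9                       # hypothetical stabilizing constant
    silence_db = -60.0               # hypothetical silence threshold
    n = len(ref_vocals) // frame
    ref_e = 10 * np.log10(eps + np.sum(
        ref_vocals[:n * frame].reshape(n, frame) ** 2, axis=1))
    est_e = 10 * np.log10(eps + np.sum(
        est_vocals[:n * frame].reshape(n, frame) ** 2, axis=1))
    pes = np.mean(est_e[ref_e < silence_db])   # predicted energy at silent targets
    return sdr[0], sir[0], sar[0], pes
```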
Similarly to what is proposed in [39], we randomly create 'fake' input mixtures in addition to the real ones. In non-augmented training, we employ the mixture as input and the vocals as target. However, this does not make use of the accompaniment (which is only employed during evaluation). We can integrate it by creating 'fake' inputs, automatically mixing (mixing meaning simply adding) the target vocals with a randomly sampled accompaniment from our training set.

We test the data augmentation process using the standard U-Net architecture to see whether it improves the performance (see Table 3). We train two models, on DALI and on the Musdb18 dataset [40]. This data augmentation enables models to achieve better SDR and SAR but lower SIR. Our best results are not state-of-the-art: the best-performing models on Musdb18 achieve a higher SDR [41].

This technique does not yield a large improvement when the model trained on DALI is tested on DALI. However, when this model is tested on Musdb18, it shows better generalization (we have not seen any song of Musdb18 during training) than the model without data augmentation. One possible explanation for not having a large improvement on the DALI test set is its larger size. It can also be due to the fact that vocal targets in DALI still contain leaks, such as low-volume music accompaniment coming from the singer's headphones. We adopt this technique for training all the following models.

Finally, we confirmed a common belief that training with a large dataset of clean separated sources improves the separation over a small dataset [42]. Both models trained on DALI (with and without augmentation) improve the results obtained with the models trained on Musdb18. Since we cannot test the conditioning versions on Musdb18 (no aligned lyrics), the result on the DALI test set (4.05 dB SDR) serves as a baseline to measure the contribution of the conditioning techniques (our main interest).
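A minimal sketch of this augmentation (with hypothetical function and variable names; the proportion of fake to real mixtures is not reproduced here) simply adds a vocal stem to a randomly drawn accompaniment from the training set:

```python
import random

def fake_mixture(vocals, accompaniments):
    """Create a 'fake' training pair by mixing a vocal stem with a random accompaniment.

    vocals: 1-D waveform of the target vocals.
    accompaniments: list of 1-D accompaniment waveforms from the training set.
    """
    acc = random.choice(accompaniments)
    n = min(len(vocals), len(acc))
    mixture = vocals[:n] + acc[:n]      # mixing means simply adding the two stems
    return mixture, vocals[:n]          # (input, target) pair for training
```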
7. RESULTS
We report the median source separation metrics (SDR, SAR, SIR, PES, EPS) in Table 4.

Model   SDR    SIR    SAR    PES      EPS
U-Net   4.05   11.40  5.32   -42.44   -64.84
W_si                         -49.44   -65.47
W_co                         -59.53   -63.46
S_a                          -59.68   -61.73
S_a*                         -54.16   -64.56
S_c                          -57.11   -65.48
S_c*                         -54.27   -66.35
S_f
S_f*                         -48.75   -72.40
S_s                          -63.44
S_s*                         -57.37   -65.62

Table 4. Median performance in dB of the different models on the DALI test set. In bold are the results that significantly improve over the U-Net (p < 0.001) and inside the circles the best results for each metric.

To measure the significance of the improvement differences, we performed a paired t-test between each conditioning model and the standard U-Net architecture, the baseline. This test measures (via the p-value) whether the differences could have happened by chance: a low p-value indicates that the observed differences are unlikely to be due to chance. As expected, there is a marginal (but statistically significant) improvement for most of the proposed methods, with a generalized p < 0.001 for the SDR, SIR, and PES, except for the versions where the basis tensors have a 'frequency' H dimension. This is an expected result since, when singing, the same phoneme can be sung at different frequencies (appearing at many frequency positions in the feature maps). Hence, these versions have difficulties finding generic basis tensors. This also explains why the 'bottleneck' versions (for both S_f* and S_a*) outperform the 'complete' ones, while this is not the case for the other versions. Most versions also improve the performance on silent vocal frames, with a much lower PES. However, there is no difference in predicting silence at the right time (same EPS). The only metric that does not consistently improve is SAR, which measures the algorithmic artifacts introduced in the process. Our conditioning mechanisms cannot reduce the artifacts, which seem to depend more on the quality of the training examples (SAR is the metric with the largest improvement in the data augmentation experiment, Table 3). Figure 4 compares the distributions of SDR, SIR, and SAR for the best model S_s and the U-Net. We can see how the distributions move toward higher values.

Figure 4. Distribution of scores for the standard U-Net (blue) and S_s (orange).

One relevant remark is the fact that we can effectively control the network with just a few parameters. S_s adds just 480 new learnable parameters (just 80 for S_s*) and has significantly better performance than S_a, which adds far more. We believe that the more complex control mechanisms tend to find complex basis tensors that do not generalize well. In our case, it is more effective to perform a simple global transformation. In the case of weak conditioning, both models behave similarly although W_si has far fewer parameters than W_co. This seems to indicate that controlling channels is not particularly relevant. Regarding the different types of conditioning, when repeating the paired t-test between weak and strong models, only S_s outperforms the weak systems. We believe that strong conditioning can lead to higher improvements, but several issues need to be addressed. First, there are misalignments in the annotations that force the system to perform unnecessary operations, which harms the computation. This is one of the possible explanations of why models with fewer parameters perform better: they are forced to find more generic conditions. The weak conditioning models are robust to these problems since they process z and compute an optimal modification for a whole input patch (11 s). We also need to "disambiguate" the phonemes inside words, since they occur as a bag of phonemes at the same time (no individual onsets per phoneme inside one word, see Figure 2). This prevents strong conditioning models from properly learning the phonemes in isolation; instead, they consider them jointly with the other phonemes.
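For reference, the significance test used above corresponds to a standard paired t-test over per-track metrics; a sketch with SciPy (the arrays are placeholders for the per-track SDR values of each model) would be:

```python
import numpy as np
from scipy import stats

# Per-track SDR values for the baseline U-Net and a conditioned model (placeholders).
sdr_unet = np.random.randn(101) + 4.0
sdr_cond = np.random.randn(101) + 4.3

t_stat, p_value = stats.ttest_rel(sdr_cond, sdr_unet)   # paired t-test over tracks
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```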
8. CONCLUSIONS
The goal of this paper is twofold: first, to introduce a new multimodal multitrack dataset with lyrics aligned in time; second, to improve singing voice separation using the prior knowledge defined by the phonetic characteristics. We use the phoneme activation as side information and show that it helps the separation.

In future work, we intend to use other prior aligned knowledge such as vocal notes or characters, also defined in DALI. Regarding the conditioning approach, and since it is transparent to the conditioned network, we plan to explore recent state-of-the-art source separation methods such as Conv-TasNet [43]. The current formalization of the two basis tensors γ_d and β_d does not depend on any external factor. A way to exploit more complex control mechanisms is to make these basis tensors dependent on the input mixture x, which may add additional flexibility. Finally, we plan to jointly learn how to infer the alignment and perform the separation [44, 45].

The general idea of lyrics-informed source separation leaves room for many possible extensions. The present formalization relies on time-aligned lyrics, which is not the real-world scenario. Features similar to the phoneme activation [46, 47] can replace them or be used to align the lyrics as a pre-processing step; these two options adapt the current system to the real-world scenario. These features can also help in properly placing and disambiguating the phonemes of a word to improve the current annotations.

Acknowledgements. This research has received funding from the French National Research Agency under the contract ANR-16-CE23-0017-01 (WASABI project). Implementation available at https://github.com/gabolsgabs/vunet
9. REFERENCES

[1] K. Kinoshita, M. Delcroix, A. Ogawa, and T. Nakatani, "Text-informed speech enhancement with deep neural networks," in Sixteenth Annual Conference of the International Speech Communication Association, 2015.
[2] M. Miron, J. Janer Mestres, and E. Gómez Gutiérrez, "Monaural score-informed source separation for classical music using convolutional neural networks," in Proc. of the 18th International Society for Music Information Retrieval Conference (ISMIR), Suzhou, China, 2017, pp. 55–62.
[3] O. Slizovskaia, G. Haro, and E. Gómez, "Conditioned source separation for music instrument performances," arXiv preprint arXiv:2004.03873, 2020.
[4] S. Ewert, B. Pardo, M. Müller, and M. D. Plumbley, "Score-informed source separation for musical audio recordings: An overview," IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 116–124, 2014.
[5] G. Meseguer-Brocal and G. Peeters, "Conditioned-U-Net: Introducing a control mechanism in the U-Net for multiple source separations," in Proc. of ISMIR (International Society for Music Information Retrieval), Delft, Netherlands, 2019.
[6] E. Tzinis, S. Wisdom, J. R. Hershey, A. Jansen, and D. P. Ellis, "Improving universal sound separation using sound classification," arXiv preprint arXiv:1911.07951, 2019.
[7] O. Slizovskaia, L. Kim, G. Haro, and E. Gomez, "End-to-end sound source separation conditioned on instrument labels," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 306–310.
[8] P. Seetharaman, G. Wichern, S. Venkataramani, and J. Le Roux, "Class-conditional embeddings for music source separation," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 301–305.
[9] D. Samuel, A. Ganeshan, and J. Naradowsky, "Meta-learning extractors for music source separation," arXiv preprint arXiv:2002.07016, 2020.
[10] K. Schulze-Forster, C. Doire, G. Richard, and R. Badeau, "Weakly informed audio source separation." IEEE, 2019, pp. 273–277.
[11] P. Chandna, M. Blaauw, J. Bonada, and E. Gómez, "Content based singing voice extraction from a musical mixture," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 781–785.
[12] A. Demetriou, A. Jansson, A. Kumar, and R. M. Bittner, "Vocals in music matter: the relevance of vocals in the minds of listeners," in Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, September 2018.
[13] E. Gómez, M. Blaauw, J. Bonada, P. Chandna, and H. Cuesta, "Deep learning for singing processing: Achievements, challenges and impact on singers and listeners," arXiv preprint arXiv:1807.03046, 2018.
[14] E. J. Humphrey, S. Reddy, P. Seetharaman, A. Kumar, R. M. Bittner, A. Demetriou, S. Gulati, A. Jansson, T. Jehan, B. Lehner et al., "An introduction to signal processing for singing-voice analysis: High notes in the effort to automate the understanding of vocals in music," IEEE Signal Processing Magazine, vol. 36, no. 1, pp. 82–94, 2018.
[15] J. Goldsmith, "Autosegmental phonology," Ph.D. dissertation, MIT Press London, 1976.
[16] D. R. Ladd, Intonational Phonology. Cambridge University Press, 2008.
[17] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. C. Courville, "FiLM: Visual reasoning with a general conditioning layer," in Proc. of AAAI (Conference on Artificial Intelligence), New Orleans, LA, USA, 2018.
[18] A. Jansson, E. J. Humphrey, N. Montecchio, R. Bittner, A. Kumar, and T. Weyde, "Singing voice separation with deep U-Net convolutional networks," in Proc. of ISMIR (International Society for Music Information Retrieval), Suzhou, China, 2017.
[19] A. Liutkus, J.-L. Durrieu, L. Daudet, and G. Richard, "An overview of informed audio source separation." IEEE, 2013, pp. 1–4.
[20] S. Ewert and M. Müller, "Using score-informed constraints for NMF-based source separation." IEEE, 2012, pp. 129–132.
[21] Z. Duan and B. Pardo, "Soundprism: An online system for score-informed source separation of music audio," IEEE Journal of Selected Topics in Signal Processing, vol. 5, no. 6, pp. 1205–1215, 2011.
[22] F. J. Rodriguez-Serrano, Z. Duan, P. Vera-Candeas, B. Pardo, and J. J. Carabias-Orti, "Online score-informed source separation with adaptive instrument models," Journal of New Music Research, vol. 44, no. 2, pp. 83–96, 2015.
[23] J. Fritsch and M. D. Plumbley, "Score informed audio source separation using constrained nonnegative matrix factorization and score synthesis." IEEE, 2013, pp. 888–891.
[24] M. Miron, J. J. Carabias Orti, and J. Janer Mestres, "Improving score-informed source separation for classical music through note refinement," in Proceedings of the 16th International Society for Music Information Retrieval (ISMIR) Conference, Málaga, Spain, 2015.
[25] M. Miron, J. J. Carabias-Orti, J. J. Bosch, E. Gómez, and J. Janer, "Score-informed source separation for multichannel orchestral recordings," Journal of Electrical and Computer Engineering, vol. 2016, 2016.
[26] P. Chandna, M. Miron, J. Janer, and E. Gómez, "Monoaural audio source separation using deep convolutional neural networks," in Proc. of LVA/ICA (International Conference on Latent Variable Analysis and Signal Separation), Grenoble, France, 2017.
[27] S. Ewert and M. B. Sandler, "Structured dropout for weak label and multi-instance learning and its application to score-informed source separation." IEEE, 2017, pp. 2277–2281.
[28] T.-S. Chan, T.-C. Yeh, Z.-C. Fan, H.-W. Chen, L. Su, Y.-H. Yang, and J.-S. R. Jang, "Vocal activity informed singing voice separation with the iKala dataset," pp. 718–722, 2015.
[29] Y. Bengio, A. Courville, and P. Vincent, "Representation learning: A review and new perspectives," IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1798–1828, Aug 2013.
[30] V. Dumoulin, E. Perez, N. Schucher, F. Strub, H. de Vries, A. Courville, and Y. Bengio, "Feature-wise transformations," Distill, 2018, https://distill.pub/2018/feature-wise-transformations.
[31] G. Meseguer-Brocal, A. Cohen-Hadria, and G. Peeters, "DALI: a large dataset of synchronised audio, lyrics and notes, automatically created using teacher-student machine learning paradigm," in Proceedings of the 19th International Society for Music Information Retrieval Conference, Paris, France, September 2018.
[32] G. Meseguer-Brocal, G. Peeters, G. Pellerin, M. Buffa, E. Cabrio, C. F. Zucker, A. Giboin, I. Mirbel, R. Hennequin, M. Moussallam et al., "WASABI: A two million song database project with audio and cultural metadata plus WebAudio enhanced client applications," in Web Audio Conference 2017 – Collaborative Audio, 2017.
[33] G. Meseguer-Brocal, A. Cohen-Hadria, and G. Peeters, "Creating DALI, a large dataset of synchronized audio, lyrics, and notes," Transactions of the International Society for Music Information Retrieval, 2020.
[34] D. Stoller, S. Ewert, and S. Dixon, "Wave-U-Net: A multi-scale neural network for end-to-end audio source separation," in Proc. of ISMIR (International Society for Music Information Retrieval), Paris, France, 2018.
[35] B. McFee, J. Salamon, and J. P. Bello, "Adaptive pooling operators for weakly labeled sound event detection," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 11, pp. 2180–2193, 2018.
[36] D. D. Lee and H. S. Seung, "Algorithms for non-negative matrix factorization," in Advances in Neural Information Processing Systems, 2001, pp. 556–562.
[37] C. Raffel, B. McFee, E. J. Humphrey, J. Salamon, O. Nieto, D. Liang, and D. P. W. Ellis, "mir_eval: a transparent implementation of common MIR metrics," in Proc. of ISMIR (International Society for Music Information Retrieval), Porto, Portugal, 2014.
[38] E. Vincent, R. Gribonval, and C. Févotte, "Performance measurement in blind audio source separation," IEEE/ACM TASLP (Transactions on Audio Speech and Language Processing), vol. 14, no. 4, 2006.
[39] S. Uhlich, M. Porcu, F. Giron, M. Enenkl, T. Kemp, N. Takahashi, and Y. Mitsufuji, "Improving music source separation based on deep neural networks through data augmentation and network blending." IEEE, 2017, pp. 261–265.
[40] Z. Rafii, A. Liutkus, F.-R. Stöter, S. I. Mimilakis, and R. Bittner, "The MUSDB18 corpus for music separation," 2017, https://zenodo.org/record/1117372.
[41] F.-R. Stöter, A. Liutkus, and N. Ito, "The 2018 signal separation evaluation campaign," in International Conference on Latent Variable Analysis and Signal Separation. Springer, 2018, pp. 293–305.
[42] L. Prétet, R. Hennequin, J. Royo-Letelier, and A. Vaglio, "Singing voice separation: A study on training data," in ICASSP 2019-2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 506–510.
[43] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, 2019.
[44] K. Schulze-Forster, C. S. Doire, G. Richard, and R. Badeau, "Joint phoneme alignment and text-informed speech separation on highly corrupted speech," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 7274–7278.
[45] N. Takahashi, M. K. Singh, S. Basak, P. Sudarsanam, S. Ganapathy, and Y. Mitsufuji, "Improving voice separation by incorporating end-to-end speech recognition," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 41–45.
[46] A. Vaglio, R. Hennequin, M. Moussallam, G. Richard, and F. d'Alché-Buc, "Audio-based detection of explicit content in music," in ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2020, pp. 526–530.
[47] D. Stoller, S. Durand, and S. Ewert, "End-to-end lyrics alignment for polyphonic music using an audio-to-character recognition model," in