Revisiting Singing Voice Detection: a Quantitative Review and the Future Outlook
Kyungyun Lee (School of Computing, KAIST), Keunwoo Choi (Spotify Inc., USA), Juhan Nam (Graduate School of Culture Technology, KAIST)
[email protected], [email protected], [email protected]
ABSTRACT
Since the vocal component plays a crucial role in popular music, singing voice detection has been an active research topic in music information retrieval. Although several proposed algorithms have shown high performance, we argue that there is still room for improvement in building a more robust singing voice detection system. In order to identify the areas of improvement, we first perform an error analysis of three recent singing voice detection systems. Based on the analysis, we design novel methods to test the systems on multiple sets of internally curated and generated data to further examine the pitfalls, which are not clearly revealed with the current datasets. From the experimental results, we also propose several directions towards building a more robust singing voice detector.
1. INTRODUCTION
Singing voice detection (or VD, vocal detection) is a music information retrieval (MIR) task to identify vocal segments in a song. The length of each segment is typically at the frame level, for example, 100 ms. Since singing voice is one of the key components of popular music, VD can be applied to music discovery and recommendation as well as various MIR tasks such as melody extraction [7], audio-lyrics alignment [31], and artist recognition [2]. Existing VD methods can be categorized into three different classes.
First, the early approaches focused on the acoustic similarity between singing voice and speech, utilizing cepstral coefficients [1] and linear predictive coding [10]. The second class comprises the majority of existing methods, in which the systems take advantage of machine learning classifiers such as support vector machines or hidden Markov models, combined with large sets of audio descriptors (e.g., spectral flatness) as well as dedicated new features such as fluctograms [14].
Lastly, there is a recent trend towards feature learning using deep neural networks, with which the VD systems learn optimized features for the task using a convolutional neural network (CNN) [27] or a recurrent neural network (RNN) [11]. These have achieved state-of-the-art performance on commonly used datasets, with over 90% true positive rate (recall) and accuracy.

We hypothesize that there are common problems in existing VD methods in spite of the well-performing metrics that have been reported. Our scope primarily includes methods in the second and third classes, since they significantly outperform those in the first class. Our hypothesis was inspired by inspecting the assumptions made in existing algorithms. The most common one, for example, concerns the spectro-temporal characteristics of singing voices: that they include frequency modulation (or vibrato) [15, 24]. This leads to our analysis of whether problems arise when a system effectively becomes a vibrato detector. We can raise similar questions about the behavior of the systems in the third class, the deep learning-based systems, by examining their assumptions and results. Based on the analysis, we devise a set of empirical analysis methods and use them to reveal the exact types of problems in current VD systems. Our contributions are as follows:

• A quantitative analysis to clarify and classify common errors of three recent VD systems (Section 4)
• An analysis using curated and generated audio contents that exploit the discovered weaknesses of the systems (Section 5)
• Suggestions on future research directions (Section 6)

In addition, we review previous VD systems in Section 3 and summarize the paper in Section 7.
2. BACKGROUND

2.1 Problem definition
Singing voice detection is usually defined as a binary classification task: whether a short audio segment input includes singing voice. However, the details have been decided rather empirically. By 'short', the segment length for prediction is often 100 ms or 200 ms. 'Audio' can be provided as stereo, although it is frequently downmixed to mono. More importantly, 'singing voice' is not clearly defined, leaving open, for example, the question of whether background vocals should be regarded as singing voice or not. In previous works, this problem has been neglected, since the majority of songs in the datasets do not include background vocals that are independent of the main vocals. These issues will be further discussed in Section 6.
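To make this framing concrete, the following minimal Python sketch slices a mono signal into such short segments for classification; the 100 ms segment length and the use of librosa are illustrative assumptions, not a fixed convention.

import librosa

y, sr = librosa.load("song.wav", sr=22050, mono=True)  # downmix to mono
seg_len = int(0.1 * sr)                                # 100 ms per segment
n_seg = len(y) // seg_len
# Each row is one segment; a VD system maps each row to P(vocal), and
# ground truth comes from frame-level annotations (cf. Table 1).
segments = y[: n_seg * seg_len].reshape(n_seg, seg_len)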
2.2 Datasets

In Table 1, four public datasets for evaluating VD systems are summarized. Three of them are well described by Lehner et al. [12]: the Jamendo Corpus [22], the RWC Popular Music Database [4] and the MIR-1K Corpus [8]. In addition, we add MedleyDB [3], a multitrack dataset composed of raw mono recordings for each instrument as well as processed stereo mix tracks. Although it does not provide annotations for vocal/non-vocal segments, it is possible to utilize the annotations for instrument activation, which consider vocals as one of the instruments. There can be further benefits of using the multitrack dataset for VD research, which will be discussed in Section 6.

Dataset           | Size                       | Annotations                             | Past VD papers                     | Notes
Jamendo Corpus    | 93 tracks (443 mins)       | Vocal activation                        | [13], [27], [26], [11], [24], [12] | Train/valid/test split from [22]
RWC Popular Music | 100 tracks (407 mins)      | Vocal activation, instrument annotation | [13], [26], [27], [14]             | VD annotation by [16]
MIR-1K            | 100 short clips (113 mins) | Vocal activation, pitch contours        | [9]                                | Regular speech files provided
MedleyDB          | 122 tracks (437 mins)      | Melody annotation, pitch annotation     | [26]                               | Multitrack

Table 1: A summary of public datasets relevant to singing voice detection.
2.3 Audio representations

In this section, we present the properties as well as the underlying assumptions of various audio representations in the context of VD. Previous works have used combinations of numerous audio features, seeking easier ways for the algorithm to detect the singing voice. They range from representations such as the short-time Fourier transform (STFT) to high-level features such as onsets and pitch estimations.

• STFT provides a 2-dimensional representation of audio, decomposing the frequency components. The STFT is probably the most basic (or 'raw') representation in VD, based on which other representations are either designed and computed, or learned using deep learning methods.

• Mel-spectrogram is a mel-scaled frequency representation, usually more compact than the STFT, and originally inspired by the human perception of speech. Being closely related to speech provides a good motivation for its use in VD; accordingly, the mel-spectrogram has been actively used as an input representation for CNNs [27] and RNNs [11]. When deep learning methods are used, the mel-spectrogram is often preferred for its efficiency compared to the STFT.

• Spectral features such as spectral centroid and spectral roll-off are statistics of the spectral distribution of a single frame of a time-frequency representation (e.g., STFT). A particularly noteworthy example is Mel-Frequency Cepstral Coefficients (MFCCs). MFCCs were originally designed for automatic speech recognition and take advantage of the mel scale and Fourier analysis to provide approximately pitch-invariant, timbre-related information. They are often (assumed to be) relevant to MIR tasks including VD [12, 25]. Spectral features, in general, are not robust to additive noise, which means that they are heavily affected by the instrumental parts of the music when used for VD.
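As a minimal sketch, the representations above can be computed with librosa as follows; all parameter values (n_fft, hop length, number of mel bands or MFCCs) are illustrative assumptions rather than the settings of any reviewed system.

import numpy as np
import librosa

y, sr = librosa.load("song.wav", sr=22050, mono=True)

# STFT magnitude: the 'raw' time-frequency representation
S = np.abs(librosa.stft(y, n_fft=1024, hop_length=512))

# Mel-spectrogram: mel-scaled and more compact than the STFT
M = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                   hop_length=512, n_mels=80)
M_db = librosa.power_to_db(M, ref=np.max)  # log compression

# MFCCs: approximately pitch-invariant, timbre-related coefficients
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20)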
3. MODELS
In this section, we introduce three recent and distinctive VD systems that have improved the state-of-the-art performance, along with the details of our re-implementations (available at http://github.com/kyungyunlee/ismir2018-revisiting-svd). They are briefly illustrated in Figure 1, where x and y indicate the input audio signal and the prediction, respectively.

Figure 1: Block diagrams of the three VD systems: (a) FE-VD [14], (b) CNN-VD [27], and (c) RNN-VD [11]. x and y denote the input audio signal and the output prediction (probability of singing voice). Rounded, gray blocks are trainable classifiers or layers. The details of the features in (a) are explained in [14]. In (c), '+' indicates frequency-axis concatenation and 'h' and 'p' are the separated harmonic/percussive components.

FE-VD) This feature engineering (FE) method is based on the fluctogram, spectral flatness, vocal variance and other hand-engineered audio features. We select this model for its rich and task-specific feature extraction process, to compare with the other models. Although the features are ultimately computed frame-wise, 'context' from the adjacent frames is taken into account, supposedly enabling the system to use the dynamic aspects of the features. The features aim to reduce the false positive rate caused by confusion between singing voice and pitch-varying instruments such as woodwinds and strings. A random forest was adopted as the classifier, achieving an accuracy of 88.2% on the Jamendo dataset. While this method has shown a reduction in the false positive rate on strings, Lehner et al. mention that woodwinds such as pan flutes and saxophones still show a high error rate.

As in [14], we extract 6 different audio features (fluctograms, spectral flatness, spectral contraction, vocal variances, MFCCs and delta MFCCs), resulting in 116 features per frame. We use an input size of 1.1 seconds for the random forest classifier, for which we performed a grid search to find the optimal parameters. As a post-processing step, we apply a median filter of 800 ms to the predictions.
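A minimal sketch of this smoothing step follows, assuming frame-wise vocal probabilities at a hypothetical rate of 70 frames per second (the exact frame rate depends on the feature settings):

from scipy.signal import medfilt

def smooth_predictions(probs, frame_rate=70.0, window_sec=0.8):
    # Median-filter frame-wise probabilities over ~800 ms, then threshold.
    kernel = int(window_sec * frame_rate)
    kernel += (kernel + 1) % 2            # medfilt requires an odd kernel size
    smoothed = medfilt(probs, kernel_size=kernel)
    return (smoothed >= 0.5).astype(int)  # binary vocal / non-vocal decisions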
CNN-VD) Recently, VD systems using deep learning models have shown state-of-the-art results [11, 26, 27]. These systems often use a basic audio representation such as the STFT as the input to a model such as a CNN or an RNN, expecting the relevant features to be learned by the model. We first introduce a CNN-based system [27]. Schlüter et al. suggested a deep CNN architecture with 3-by-3 2D convolution layers, which we name CNN-VD. The system extracts trained, relevant local time-frequency patterns from its input, a mel-spectrogram. During training, they apply data augmentation such as pitch shifting and time stretching to the audio representation, reporting that it reduces the error rate from 9.4% to 7.7% on the Jamendo dataset. Our CNN architecture is identical to the original one, using an input size of 115 frames (1.6 sec).

Metric        | FE-VD | CNN-VD | RNN-VD
Accuracy (%)  | 87.9  | 86.8   | 87.5
Recall (%)    | 91.7  | 89.1   | 87.2
Precision (%) | 83.8  | 83.7   | 86.1
F-measure (%) | 87.6  | 86.3   | 86.6
FPR (%)       | 15.3  | 15.1   | 12.2
FNR (%)       | 8.3   | 10.9   | 12.8

Table 2: Results of our implementations on the Jamendo test set. FPR and FNR refer to false positive rate and false negative rate, respectively.

RNN-VD) As another deep learning-based system, Leglaive et al. [11] proposed a recurrent neural network with bi-directional long short-term memory units (Bi-LSTMs) [6], under the assumption that the temporal structure of music provides valuable information for detecting vocal segments. We name this system RNN-VD. For the classifier input, the system performs double-stage harmonic-percussive source separation (HPSS) [20] on the audio signal to extract signals relevant to the singing voice. For each frame, the mel-spectrograms of the obtained harmonic and percussive components are concatenated as the input to the classifier. Several recurrent layers followed by a shared densely-connected layer (also known as a time-distributed dense layer) yield the output prediction for each input frame. This model achieves state-of-the-art results without data augmentation, with an accuracy of 91.5% on the Jamendo dataset. Although this result may conflate the contributions of the additional preprocessing and of the recurrent layers, we can assume that past and future temporal context helps to identify vocal segments. For our RNN architecture, we use the best performing model from the original article [11], the one with three hidden layers of sizes 30, 20 and 40. The input to the model is 218 frames (3.5 seconds).
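Because the HPSS preprocessing is central to RNN-VD, we sketch the input preparation below. Note that librosa's single-stage HPSS is used here as a simplified stand-in for the double-stage procedure of [20], and the frame and mel parameters are illustrative assumptions.

import numpy as np
import librosa

y, sr = librosa.load("song.wav", sr=16000)
D = librosa.stft(y, n_fft=1024, hop_length=512)
H, P = librosa.decompose.hpss(D)  # harmonic and percussive spectrograms

mel_fb = librosa.filters.mel(sr=sr, n_fft=1024, n_mels=40)
mel_h = np.dot(mel_fb, np.abs(H) ** 2)  # mel-spectrogram of harmonic part
mel_p = np.dot(mel_fb, np.abs(P) ** 2)  # mel-spectrogram of percussive part

# Frequency-axis concatenation ('+' in Figure 1c): one vector per frame
x = np.concatenate([mel_h, mel_p], axis=0).T  # shape: (n_frames, 2 * n_mels)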
4. EXPERIMENT I: ERROR CATEGORIZATION
The purpose of this experiment is to identify common errors of the VD systems through our implementations of the models from Section 3. The results and observations motivate the experiments in Section 5. Librosa [18] is used in the audio processing and feature extraction stages.
4.1 Setup

The three systems (FE-VD, CNN-VD, RNN-VD) are trained on the Jamendo dataset with the suggested split of 61, 16 and 16 tracks for the train, validation and test sets [22], respectively. They are primarily tested on the Jamendo test set. For qualitative analysis, we also utilize MedleyDB. Note that MedleyDB does not provide vocal segment annotations, so we use the provided instrument activation annotations to create ground-truth labels for the vocal-containing songs.
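A minimal sketch of this label-creation step follows; the CSV layout (start, end, instrument) and the vocal label names are assumptions for illustration, not the exact MedleyDB schema.

import csv

VOCAL_LABELS = {"male singer", "female singer", "vocalists"}  # assumed names

def vocal_labels_from_activations(csv_path, n_frames, hop_sec=0.1):
    # One binary label per hop_sec frame: 1 if any vocal source is active.
    labels = [0] * n_frames
    with open(csv_path) as f:
        for start, end, instrument in csv.reader(f):
            if instrument.strip().lower() in VOCAL_LABELS:
                first = int(float(start) / hop_sec)
                last = min(n_frames, int(float(end) / hop_sec) + 1)
                for i in range(first, last):
                    labels[i] = 1
    return labels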
4.2 Results and discussion

The test results of our implementations are shown in Table 2. We did not focus on fine-tuning the individual models, because the three systems together are used as a tool to get a generalized view of recent VD systems; hence they show slightly lower performance than the results in the original papers. Overall, FE-VD, CNN-VD and RNN-VD show a negligible difference in test scores. We observe trends that are similar to the original papers in terms of performance and the precision/recall ratio.
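For reference, the frame-level metrics reported in Table 2 can be computed as in the following sketch, given binary frame-wise predictions and ground-truth labels:

import numpy as np

def frame_metrics(y_true, y_pred):
    tp = np.sum((y_pred == 1) & (y_true == 1))
    tn = np.sum((y_pred == 0) & (y_true == 0))
    fp = np.sum((y_pred == 1) & (y_true == 0))
    fn = np.sum((y_pred == 0) & (y_true == 1))
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)          # the true positive rate
    return {
        "accuracy": (tp + tn) / len(y_true),
        "precision": precision,
        "recall": recall,
        "f_measure": 2 * precision * recall / (precision + recall),
        "fpr": fp / (fp + tn),       # false positive rate
        "fnr": fn / (fn + tp),       # false negative rate
    }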
Upon listening to the misclassified segments, we categorize the sources of error into three classes: pitch-fluctuating instruments, low signal-to-noise ratio of the singing voice, and non-melodic sounds.

Song Title       | Confusing inst.
L'Irlandaise     | Woodwind, Synth
Castaway         | Elec. Guitar
Say me Good Bye  | N/A
Inside           | N/A

Table 3: The 4 songs from the Jamendo test set examined for per-system false positive rates, with their confusing instruments. The top 2 songs rank among the 5 lowest and the bottom 2 among the 5 highest song-level accuracies across all three systems.
4.2.1 Pitch-fluctuating instruments

Classes of instruments such as strings, woodwinds and brass exhibit similar characteristics to the singing voice, which we refer to as being 'voice-like' [28]. By 'voice-like', we consider three aspects of the signal: pitch range, harmonic structure, and temporal dynamics (vibrato). In particular, we find temporal dynamics to be an important attribute that the VD systems pick up on to identify vocal segments. Frequency modulation, also known as vibrato, resembles the modulation created by the vowel components of singing voice. This is illustrated in Figure 2, where the mel-spectrograms of both a female vocalist and an electric guitar show curved lines. We observe that this similarity causes further confusion in the systems.

In Table 3, we list two songs found among the top 5 least and two among the top 5 most accurately predicted songs in the test set across all three systems. The woodwind in '05 - L'Irlandaise' causes many false positives, which may be due to the presence of vibrato and a pitch range similar to that of soprano singers (above 220 Hz). FE-VD and CNN-VD show poor performance on woodwinds, probably because the fluctogram of FE-VD and the small 2D convolution kernels of CNN-VD are specifically designed to detect vibrato as one of the cues for identifying singing voice. In the same song, all three systems show confusion with the synthesizer. Synthesizers mimicking pitch-fluctuating instruments are particularly challenging, as it is difficult to characterize them as specific instrument types.

In addition, electric guitars are one of the most frequently found sources of false positives, as can be seen from '03 - Castaway', mostly caused by their recognizable vibrato patterns. We find that the confusion gets worse when the guitar is played with effects such as wah-wah pedals, which imitate the vowel sounds of the human voice. Lastly, we note that some of the other problematic instruments in our test sets include saxophones, trombones and cellos, which are well-known 'voice-like' instruments. This observation, regarding the systems' pitfalls on vibrato patterns, is further investigated in Section 5.1.
Figure 2: Excerpts of mel-spectrograms from MedleyDB: 'Handel TornamiAVagheggiar' with a female vocalist (left) and 'PurlingHiss Lolita' with an electric guitar (right). (See Section 4.2.1.)

4.2.2 Low signal-to-noise ratio of the singing voice

Next, we note that all three systems are affected by the signal-to-noise ratio (SNR), i.e., the relative gain of the vocal component, as one can easily expect. All three systems exhibit a high false negative rate when the vocal signal is at a relatively low level. In systems such as Lehner et al.'s, where audio features such as MFCCs or spectral flatness are used, the performance varies with SNR because the features are statistics over the whole bandwidth, which includes not only the target signal (vocal) but also the additive noise (instrumental). VD systems with deep neural networks are not free from this either, since the low-level operations in their layers amount to simple pattern matching by computing correlations. This is a common phenomenon in other tasks as well, e.g., speech recognition. We continue this discussion with a follow-up experiment in Section 5.2, and finally a suggestion on the problem definition and dataset composition in Section 6.
4.2.3 Non-melodic sounds

Although the interest of most VD systems appears to lie mainly in the melodic component of a song, we expected the systems to learn the percussive nature of the singing voice as well, which is exhibited by the singers' consonants. Our hypothesis is therefore that a system may confuse the consonants of singing voice with percussive instruments, resulting in either i) missing consonant parts (false negatives) or ii) misclassifying percussive instruments (false positives). In our test results, we encounter false positive segments containing snare drums and hi-hats, but the exact cause of this misclassification is unclear. We further tested the systems with drum set solos for potential false positives, and with a collection of consonant sounds such as plosives and fricatives from the human voice for potential false negatives, but we did not observe a clear pattern of misclassification. Although we do not conduct further experiments on this, it invites a deeper analysis, which may also lead to a clearer understanding of preprocessing strategies including HPSS.
5. EXPERIMENT II: STRESS TESTING

5.1 Testing with artificial vibrato
Based on the confusion between 'voice-like' instruments and singing voice, we hypothesize that the current VD systems use vibrato patterns as one of the main cues for vocal segment detection. We explore the degree of confusion for each VD system by testing them on synthetic vibratos with varying rate, extent and formant frequencies.

We create a set of synthetic vibratos from low-pass-filtered sawtooth waveforms with f0 = 220 Hz, varying the modulation rate and the frequency deviation (f∆, relative to f0) to investigate their effects. Furthermore, we apply bi-quad filters at the corresponding formant frequencies (3 per vowel) to synthesize tones that sound like the basic vowels 'a', 'e', 'i', 'o', 'u' [29]. As a result, the set consists of (rates) × (f∆'s) × (5 formants + 1 unfiltered) = 336 variations.
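A minimal sketch of this synthesis procedure is given below. The duration and low-pass cutoff are illustrative assumptions, and the formant (bi-quad) filtering stage is omitted for brevity.

import numpy as np
from scipy.signal import sawtooth, butter, lfilter

def make_vibrato(rate_hz, extent_semitones, sr=22050, dur=2.0, f0=220.0):
    t = np.arange(int(sr * dur)) / sr
    # Peak frequency deviation in Hz for the given extent in semitones
    f_delta = f0 * (2 ** (extent_semitones / 12.0) - 1.0)
    # Integrate the instantaneous frequency to obtain the phase
    inst_freq = f0 + f_delta * np.sin(2 * np.pi * rate_hz * t)
    phase = 2 * np.pi * np.cumsum(inst_freq) / sr
    tone = sawtooth(phase)
    # Low-pass filter to band-limit the sawtooth (cutoff is an assumption)
    b, a = butter(4, 8000.0 / (sr / 2.0), btype="low")
    return lfilter(b, a, tone)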
Figure 3 shows the predictions of the three VD systems on the synthetic vibratos. An accuracy of 1.0 indicates that the system does not confuse the artificial vibratos with singing voice. Here, we observe per-model differences that were not visible from the scores in Table 2. In general, the confusion areas tend to be concentrated from the bottom left to the center of each heat map. The extent and rate of the highly misclassified artificial tones appear to be around the range of singers' vibratos, which is said to be around 0.6 to 2 semitones in extent with a rate of around 5.5 to 8 Hz [30]. We also observe within-system differences, i.e., the presence and type of formant affects the models. For instance, vibratos mimicking the vowel 'a' cause higher misclassification in all three models.

FE-VD performs much better than the other two systems. Note that FE-VD is a feature engineering model whose unique features, such as the fluctogram and vocal variance, are mostly adapted from speech recognition. As these features were intentionally designed to reduce false positives from pitch-varying instruments, they appear to significantly reduce the error rate on vibratos whose rate and extent are beyond the range of human singers.

CNN-VD confuses a slightly wider range of vibratos. This is expected to some extent, since the model prominently uses 3 × 3 convolution kernels, i.e., local pattern detectors. In other words, the locality of the CNN results in a system that is easily confused by frequency modulation regardless of the non-singing-voice aspects of the signal. This implies that the model may benefit from looking at varying ranges of time and frequency to learn vocal-specific characteristics, such as timbre [21].

Lastly, RNN-VD performs better than CNN-VD, though worse than FE-VD. For detecting vocal and non-vocal segments, it seems natural, even for humans, that past and future temporal context helps. We also presume that the double-stage HPSS preprocessing contributes to the robustness of the system against vibrato. Again, this observation leaves open the question of separating the contributions of the preprocessing and the model structure.

Figure 3: Heat maps of the accuracies in the vibrato experiment. Each row corresponds to a VD system (FE-VD, CNN-VD, RNN-VD) and each column to a formant (unfiltered, 'a', 'e', 'i', 'o', 'u'). Within each heat map, the x- and y-axes correspond to the vibrato rate (Hz) and the frequency deviation (semitones). (See Section 5.1.)
5.2 Testing with varying SNR

In this experiment, the VD systems are tested on vocal-gain-adjusted tracks to further explore their behavior in various scenarios, which can reflect real-world audio settings such as live recordings and radio. We create a modified test set using the 61 vocal-containing tracks provided by MedleyDB, using the first 30 seconds of each song to build a (vocal, instrumental) track pair. The vocal tracks are then adjusted by {-12, -6, 0, +6, +12} dB relative to the original mix.
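A minimal sketch of this remixing procedure, with hypothetical stem file names:

import librosa

def mix_with_vocal_gain(vocal, inst, gain_db):
    # Remix a stem pair with the vocal level adjusted by gain_db (0 = REF).
    return vocal * (10 ** (gain_db / 20.0)) + inst

vocal, sr = librosa.load("vocal_stem.wav", sr=None)
inst, _ = librosa.load("inst_stem.wav", sr=sr)
n = min(len(vocal), len(inst))  # align stem lengths before mixing
mixes = {g: mix_with_vocal_gain(vocal[:n], inst[:n], g)
         for g in (-12, -6, 0, +6, +12)}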
The results of the energy-level robustness test are presented in Figure 4 in terms of false positive rate, false negative rate, and overall error rate. We see a consistent trend across all three VD systems, which is once again the pattern expected from Section 4.2.2: increasing the SNR reduces false negatives, and the overall error rate likewise decreases noticeably at higher SNRs. In practice, one could take advantage of data augmentation with varying SNR to build a more robust system. More importantly, it can be part of the evaluation procedure for VD, as we discuss in Section 6.

While the VD systems behave similarly on all test cases, we note that FE-VD, owing to its additional features, shows the lowest variance and the lowest false positive rate. Also, our assumption that the double-stage HPSS, which filters out vocal-related signals, would make RNN-VD more robust against SNR changes turns out not to be necessarily true, as we clearly see performance differences across the varying-SNR test cases.

Figure 4: False positive rates, false negative rates, and overall error rates of the three systems in the stress test with controlled SNR; the x-axes show vocal level adjustments of {-12, -6, 0 (REF), +6, +12} dB. (See Section 5.2.)
6. DIRECTIONS TO IMPROVE

6.1 Defining the problem and the datasets
By using the annotations in datasets such as Jamendo, many VD systems implicitly assume that the target 'singing voice' is defined as the vocal components that correspond to the main melody. Other voice-related components such as backing vocals, narration, humming, and breathing are not clearly defined as singing voice or not. In some applications, however, they can be of interest. For example, a system may want to find purely instrumental tracks, avoiding tracks with backing vocals; in this case, the method should consider backing vocals as singing voice. For a karaoke application, however, only the singing voice of the main melody would matter. Therefore, improvements can be made in defining the VD problem and in creating datasets. For the annotation, a hierarchy among the voice-related components can be useful for both structured training and evaluation of a system [17, 23]. For the audio input, we see a great benefit in multitracks, where the main vocal melody, backing vocals, and other components are provided separately.
6.2 Evaluation and augmentation with varying SNR

For a long while, varying the SNR has been one of the common ways to evaluate speech recognition or enhancement systems, using datasets such as Aurora [5]. As observed in Section 4.2.2, it can be used as a 'test-set augmentation' to measure the performance of a system more precisely. It can also serve as an additional data augmentation method, along with those in [27], to build a VD system that is more robust to various audio settings, such as audio from user-generated videos. Both can easily be achieved in practice with a multitrack dataset.
6.3 Temporal precision of annotations

Human annotators are neither perfect nor identical, causing annotation noise and disagreement. Since VD is a binary classification problem, we may remain optimistic by assuming that the annotation noise is a matter of temporal precision, which is arbitrary and has not been agreed upon among datasets so far. For example, in RWC Popular Music [16], "short background segments of less than 0.5-second duration were merged with the preceding region" and the annotations have 8 decimal digits (in seconds), while in Jamendo they have 3 decimal digits. The optimal precision may depend on human perception of sound, which is often said to be around 10 ms in general [19]. Although it would require a deeper investigation, the current temporal precision may be too high, leading to evaluating the systems against overly precise annotations.
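As an illustration, one possible post-processing rule in the spirit of the RWC convention quoted above can be sketched as follows; the 0.5-second threshold and the (start, end, label) segment format are assumptions.

def merge_short_segments(segments, min_dur=0.5):
    # Absorb segments shorter than min_dur (and same-label neighbors)
    # into the preceding region; segments are sorted (start, end, label).
    merged = [list(segments[0])]
    for start, end, label in segments[1:]:
        if end - start < min_dur or label == merged[-1][2]:
            merged[-1][1] = end
        else:
            merged.append([start, end, label])
    return [tuple(s) for s in merged]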
6.4 Exploiting voice-specific characteristics

The characteristics of the voice were the main motivation in the very early works exploiting speech-related features [1, 10]. Clearly, however, approaches that relied solely on speech features showed limited performance. While subsequent research has improved the performance, as our experiments throughout this paper have demonstrated, the systems do not fully take advantage of the cues that humans probably use, e.g., global formants, linguistic information, musical knowledge, etc.
6.5 Optimizing the preprocessing stage

A light-weight VD system was introduced in [12], where only MFCCs were used, achieving a precision of 0.788 on the Jamendo dataset. This implies that there is a possibility of achieving better performance by optimizing the preprocessing stage. One of the unanswered questions is the effect of the preprocessing stage in RNN-VD [11], as well as whether similar processing could lead to better performance with other systems, e.g., the CNN [27].
7. CONCLUSIONS
In this paper, we suggested that there are still several areas to improve in current singing voice detectors. In the first set of experiments, we identified common errors through an error analysis of three recent systems. Our observations that the main sources of error are pitch-fluctuating instruments and low signal-to-noise ratios of the singing voice motivated us to perform further stress tests. Testing with synthetic vibratos revealed that some systems (FE-VD) are more robust to non-vocal vibratos than others (CNN-VD and RNN-VD). The SNR-varying test showed that SNR manipulation greatly affects the current VD systems; it can therefore potentially be used to strengthen VD systems towards invariance to a wider range of audio settings. In proposing several directions towards a more robust singing voice detector, we note that the definition of the VD problem depends on the goal of the system, and thus using multitrack datasets can be beneficial. Our future interest is to further investigate SNR to extend VD systems to uncontrolled audio settings, and to examine the different components of individual systems, including the preprocessing stage.

8. ACKNOWLEDGEMENTS
We thank Bernhard Lehner and Simon Leglaive for active discussion and code, and Jeongsoo Park for sharing Ono's code. This work was supported by the National Research Foundation of Korea (Project 2015R1C1A1A02036962).
9. REFERENCES

[1] Adam L. Berenzweig and Daniel P. W. Ellis. Locating singing voice segments within music signals. In Applications of Signal Processing to Audio and Acoustics, 2001 IEEE Workshop on the, pages 119–122. IEEE, 2001.

[2] Adam L. Berenzweig, Daniel P. W. Ellis, and Steve Lawrence. Using voice segments to improve artist classification of music. In Audio Engineering Society Conference: 22nd International Conference: Virtual, Synthetic, and Entertainment Audio. Audio Engineering Society, 2002.

[3] Rachel M. Bittner, Justin Salamon, Mike Tierney, Matthias Mauch, Chris Cannam, and Juan Pablo Bello. MedleyDB: A multitrack dataset for annotation-intensive MIR research. In ISMIR, volume 14, pages 155–160, 2014.

[4] Masataka Goto, Hiroki Hashiguchi, Takuichi Nishimura, and Ryuichi Oka. RWC music database: Popular, classical and jazz music databases. In Proc. of the 3rd International Society for Music Information Retrieval Conference (ISMIR), volume 2, pages 287–288, 2002.

[5] Hans-Günter Hirsch and David Pearce. The Aurora experimental framework for the performance evaluation of speech recognition systems under noisy conditions. In ASR2000 - Automatic Speech Recognition: Challenges for the new Millenium, ISCA Tutorial and Research Workshop (ITRW), 2000.

[6] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[7] Chao-Ling Hsu, Liang-Yu Chen, Jyh-Shing Roger Jang, and Hsing-Ji Li. Singing pitch extraction from monaural polyphonic songs by contextual audio modeling and singing harmonic enhancement. In Proc. of the 10th International Society for Music Information Retrieval Conference (ISMIR), pages 201–206, 2009.

[8] Chao-Ling Hsu and Jyh-Shing Roger Jang. On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset. IEEE Transactions on Audio, Speech, and Language Processing, 18(2):310–319, 2010.

[9] Chao-Ling Hsu, DeLiang Wang, Jyh-Shing Roger Jang, and Ke Hu. A tandem algorithm for singing pitch extraction and voice separation from music accompaniment. IEEE Transactions on Audio, Speech, and Language Processing, 20(5):1482–1491, 2012.

[10] Youngmoo E. Kim and Brian Whitman. Singer identification in popular music recordings using voice coding features. In Proc. of the 3rd International Conference on Music Information Retrieval (ISMIR), volume 13, page 17, 2002.

[11] Simon Leglaive, Romain Hennequin, and Roland Badeau. Singing voice detection with deep recurrent neural networks. In Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on, pages 121–125. IEEE, 2015.

[12] Bernhard Lehner, Reinhard Sonnleitner, and Gerhard Widmer. Towards light-weight, real-time-capable singing voice detection. In Proc. of the 14th International Society for Music Information Retrieval Conference (ISMIR), pages 53–58, 2013.

[13] Bernhard Lehner, Gerhard Widmer, and Sebastian Böck. A low-latency, real-time-capable singing voice detection method with LSTM recurrent neural networks. In Signal Processing Conference (EUSIPCO), 2015 23rd European, pages 21–25. IEEE, 2015.

[14] Bernhard Lehner, Gerhard Widmer, and Reinhard Sonnleitner. On the reduction of false positives in singing voice detection. In Acoustics, Speech and Signal Processing (ICASSP), 2014 IEEE International Conference on, pages 7480–7484. IEEE, 2014.

[15] Maria E. Markaki, André Holzapfel, and Yannis Stylianou. Singing voice detection using modulation frequency feature. In SAPA@INTERSPEECH, pages 7–10, 2008.

[16] Matthias Mauch, Hiromasa Fujihara, Kazuyoshi Yoshii, and Masataka Goto. Timbre and melody features for the recognition of vocal activity and instrumental solos in polyphonic music. In Proc. of the 12th International Society for Music Information Retrieval Conference (ISMIR), pages 233–238, 2011.

[17] Brian McFee and Juan Pablo Bello. Structured training for large-vocabulary chord recognition. In Proc. of the 18th International Society for Music Information Retrieval Conference (ISMIR), 2017.

[18] Brian McFee, Matt McVicar, Oriol Nieto, Stefan Balke, Carl Thome, Dawen Liang, Eric Battenberg, Josh Moore, Rachel Bittner, Ryuichi Yamamoto, et al. librosa 0.5.0, 2017.

[19] Brian C. J. Moore. An Introduction to the Psychology of Hearing. Brill, 2012.

[20] Nobutaka Ono, Kenichi Miyamoto, Jonathan Le Roux, Hirokazu Kameoka, and Shigeki Sagayama. Separation of a monaural audio signal into harmonic/percussive components by complementary diffusion on spectrogram. In Signal Processing Conference, 2008 16th European, pages 1–4. IEEE, 2008.

[21] Jordi Pons, Olga Slizovskaia, Rong Gong, Emilia Gómez, and Xavier Serra. Timbre analysis of music audio signals with convolutional neural networks. In Signal Processing Conference (EUSIPCO), 2017 25th European, pages 2744–2748. IEEE, 2017.

[22] Mathieu Ramona, Gaël Richard, and Bertrand David. Vocal detection in music with support vector machines. In Acoustics, Speech and Signal Processing, 2008. ICASSP 2008. IEEE International Conference on, pages 1885–1888. IEEE, 2008.

[23] Joseph Redmon and Ali Farhadi. YOLO9000: better, faster, stronger. In Proc. of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6517–6525, 2017.

[24] Lise Regnier and Geoffroy Peeters. Singing voice detection in music tracks using direct voice vibrato detection. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference on, pages 1685–1688. IEEE, 2009.

[25] Martín Rocamora and Perfecto Herrera. Comparing audio descriptors for singing voice detection in music audio files. In Brazilian Symposium on Computer Music, 11th, São Paulo, Brazil, volume 26, page 27, 2007.

[26] Jan Schlüter. Learning to pinpoint singing voice from weakly labeled examples. In Proc. of the 17th International Society for Music Information Retrieval Conference (ISMIR), pages 44–50, 2016.

[27] Jan Schlüter and Thomas Grill. Exploring data augmentation for improved singing voice detection with neural networks. In Proc. of the 16th International Society for Music Information Retrieval Conference (ISMIR), pages 121–126, 2015.

[28] Emery Schubert and Joe Wolfe. Voicelikeness of musical instruments: A literature review of acoustical, psychological and expressiveness perspectives. Musicae Scientiae, 20(2):248–262, 2016.

[29] Julius Orion Smith. Introduction to Digital Filters: with Audio Applications, volume 2. Julius Smith, 2007.

[30] Renee Timmers and Peter Desain. Vibrato: Questions and answers from musicians and science. In Proc. Int. Conf. on Music Perception and Cognition, volume 2, 2000.

[31] Ye Wang, Min-Yen Kan, Tin Lay Nwe, Arun Shenoy, and Jun Yin. LyricAlly: automatic synchronization of acoustic musical signals and textual lyrics. In Proceedings of the 12th Annual ACM International Conference on Multimedia, 2004.