DEEPF0: End-To-End Fundamental Frequency Estimation for Music and Speech Signals
Satwinder Singh, Ruili Wang∗, Yuanhang Qiu
School of Natural and Computational Sciences, Massey University, Auckland, New Zealand
∗ Corresponding author
ABSTRACT
We propose a novel pitch estimation technique called DeepF0, which leverages the available annotated data to learn directly from raw audio in a data-driven manner. F0 estimation is important in various speech processing and music information retrieval applications. Existing deep learning models for pitch estimation have relatively limited learning capabilities due to their shallow receptive fields. The proposed model addresses this issue by extending the receptive field of the network through dilated convolutional blocks. The dilation factor increases the receptive field exponentially without an exponential increase in the number of model parameters. To make training more efficient and faster, DeepF0 is augmented with residual blocks and residual connections. Our empirical evaluation demonstrates that the proposed model outperforms the baselines in terms of raw pitch accuracy and raw chroma accuracy even while using 77.4% fewer network parameters. We also show that our model estimates pitch reasonably well even under various levels of accompaniment noise.

Index Terms — pitch estimation, f0 estimation, temporal convolutional network, speech processing
1. INTRODUCTION
The fundamental frequency, often represented by f0, is the lowest and predominant frequency in a complex periodic signal. It is also referred to as the pitch of the waveform [1]. However, there is a subtle difference between f0 and pitch, since f0 is a physical property of the audio signal, whereas pitch relates to its perceptual aspect. Nevertheless, outside the scope of psychoacoustics, both terms are used interchangeably in the literature [2] and in this paper as well. Pitch estimation has been studied for almost five decades because of its central importance in various domains such as speech recognition [3], speech synthesis [4], and music information retrieval [5].

Many algorithms have been proposed in the past to carry out the task of pitch estimation. These algorithms fall into two broad categories: digital signal processing (DSP) based approaches and data-driven approaches. The signal processing based approaches can be further classified into time-domain approaches (RAPT [6], YIN [7], and pYIN [8]), frequency-domain approaches (SWIPE [9] and PEFAC [10]), or hybrid (both time- and frequency-domain) approaches (YAAPT [11]). Most of them use a three-stage process: preprocessing of the perceived signal (usually framing and signal conditioning), followed by a candidate search using an auto-correlation, cross-correlation, or cepstrum function [12], and lastly post-processing to track down the best candidates for f0 using dynamic programming [8]. These approaches are computationally intensive, not robust in noisy environments, fail when the pitch changes rapidly, and do not learn anything from available data [12]. On the other hand, data-driven approaches take full advantage of the available data and learn the estimation model from the data itself. Most data-driven approaches are based on either traditional machine learning or deep learning. Due to the availability of annotated pitch estimation datasets and the success of deep learning models in various domains, it has become common practice to train pitch estimation models in a data-driven manner. Recently, numerous deep learning approaches have been proposed based either on hand-crafted features or on a raw audio front end. Much of the early work extracted hand-crafted features, such as the constant-Q transform (CQT) [2] and spectral-domain features [13]. While extracting acoustic features from raw audio, there is always a chance of leaving out important features that might be crucial for pitch estimation [14]. To deal with this, many researchers have attempted to exploit the raw waveform as the front-end features in various speech-related tasks [14, 15].

In [16], a deep neural network (DNN) based pitch estimation model was proposed that operates on raw audio. Kim et al. [12] designed a convolutional neural network (CNN) model that utilizes raw audio in the time domain and was able to outperform existing DSP based algorithms. Similarly, Dong et al. [14] proposed a convolutional residual network model to estimate pitch from raw polyphonic music. Even though these approaches perform better than DSP based algorithms, these models still have very shallow receptive fields. The authors in [17] and [18] showed the effectiveness and applications of larger receptive fields in sequence modeling tasks.
We intend to use that intuition in our pitch estimation task, where we augment the network with dilation to obtain a large memory or receptive field. We propose the DeepF0 model, which is based on a dilated causal temporal convolutional network (TCN). Dilation in a CNN increases the receptive field exponentially without putting a computational burden on the network in terms of the number of network parameters used [17]. To stabilize the training of the deep network, we introduce residual blocks and skip connections, which increase training efficiency and also help achieve high accuracy [19]. Residual networks (also known as ResNets) have been successfully applied to a range of diverse research areas [17, 20].

We evaluated our proposed model on standard datasets that include MIR-1k [21], MDB-stem-synth [22], and PTDB-TUG [23]. These datasets contain audio samples of heterogeneous timbre and characteristics. We compared our approach with the state-of-the-art CREPE and SWIPE baseline algorithms, where the former is a deep learning based data-driven approach and the latter is a DSP based method. Empirical evaluation demonstrates that the proposed model yields state-of-the-art results in terms of pitch accuracy. Besides, the proposed method also outperforms the baselines in the presence of a reasonable amount of background noise. The rest of the paper is organized as follows: Section 2 outlines the proposed architecture, Section 3 describes the experimental setup, the results are discussed in Section 4, and Section 5 concludes the paper.

Fig. 1: Network architecture of DeepF0 (raw audio samples, Conv1d, stacked residual blocks, average pooling, flatten, fully connected layer, 360 outputs; residual blocks use 128 filters and kernel size 64, average pool size 64).
Fig. 2: Internal view of a residual block of DeepF0 (dilated causal Conv1d: 128 filters, kernel size 64, dilation rates [1, 2, 4, 8]; weight normalization; ReLU; 1×1 Conv1d: 128 filters; residual connection from input x to output y).
2. PROPOSED ARCHITECTURE
The proposed approach is based on a dilated temporal convolutional network. This type of architecture has been applied to the text-to-speech task (WaveNet [17]) but has not been applied to pitch estimation. The receptive field of a basic CNN is limited and depends on the linear depth of the network [18]. We can improve the receptive field by adding more convolutional layers, which increases the receptive field linearly; however, this puts a computational burden on the network due to the increased number of parameters and can also lead to the vanishing gradient problem. To address these issues, a dilated CNN is adopted [17]. In dilated convolutions (also referred to as convolutions with holes, or atrous convolutions), we skip certain values to obtain a receptive area that is usually larger than the filter size. This way, we can achieve an exponentially large receptive field without increasing the network parameters exponentially. The dilated convolutions are causal, which ensures that the current output is derived from past samples only and does not look into future samples. As we increase the depth of the network, which is ideal for learning robust representations, we can run into the classical problem of vanishing gradients. Considering this issue, we adopt residual connections, which resolve the vanishing gradient problem by creating new paths for the gradients to flow [17]. They also ensure that the higher layers perform at least as well as the deeper layers by learning an identity mapping while training a deeper network, expressed as:

y = ReLU(x + F(x))    (1)

where x is the input and F(x) represents a series of transformations such as convolution operations and weight normalization.

The DeepF0 model resamples the raw audio waveform to 16 kHz and splits it into frames of 1024 samples, just like CREPE [12], with a 360-dimensional output vector. The architecture of the model is shown in Figure 1. The input is passed to a 1D convolution with 128 filters and a kernel size of 64. The large kernel size allows a wide receptive field and also contributes to learning directly from the raw audio [14]. The output of the first convolution goes through the residual blocks. As shown in Figure 2, each residual block consists of a 1D dilated causal convolution layer and a regular 1×1 convolutional layer, followed by a ReLU non-linearity [24] and a weight normalization layer [25]. To achieve extended receptive fields, we use dilation rates of d = 1, 2, 4, 8. In each residual block, we employ residual/skip connections. The output of the last residual block is downsampled using average pooling with a pool size of 64, followed by a dense layer and a sigmoid activation function. The model uses a binary cross-entropy loss function to calculate the error between the true values y_i and the predicted values ŷ_i. The model is optimized using the Adam optimizer [26] with a learning rate of 0.0002, and early stopping is enforced to prevent overfitting when the validation accuracy does not improve for 32 epochs. DeepF0 is trained on an Nvidia GeForce RTX 2080 Ti GPU.
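To make the block structure concrete, the following is a minimal PyTorch sketch of one such residual block. The framework choice, module names, and the assumption of a single dilated convolution per block (with rates [1, 2, 4, 8] across stacked blocks) are ours; the filter count, kernel size, and the y = ReLU(x + F(x)) wiring follow Eq. (1) and Figs. 1-2.

```python
import torch.nn as nn
import torch.nn.functional as F
from torch.nn.utils import weight_norm


class ResidualBlock(nn.Module):
    """DeepF0-style residual block: y = ReLU(x + F(x)), cf. Eq. (1)."""

    def __init__(self, channels=128, kernel_size=64, dilation=1):
        super().__init__()
        # Left-pad by (kernel_size - 1) * dilation so the convolution stays causal:
        # each output sample depends only on current and past input samples.
        self.pad = (kernel_size - 1) * dilation
        self.dilated = weight_norm(
            nn.Conv1d(channels, channels, kernel_size, dilation=dilation)
        )
        self.pointwise = nn.Conv1d(channels, channels, kernel_size=1)  # 1x1 conv
        self.relu = nn.ReLU()

    def forward(self, x):                         # x: (batch, channels, time)
        h = F.pad(x, (self.pad, 0))               # causal (left-only) padding
        h = self.relu(self.dilated(h))            # dilated causal conv + ReLU
        return self.relu(x + self.pointwise(h))   # residual connection, Eq. (1)


# Stacked blocks with dilation rates [1, 2, 4, 8] as listed in Fig. 2 (assumed wiring).
blocks = nn.Sequential(*[ResidualBlock(dilation=d) for d in (1, 2, 4, 8)])
```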
Following [12], the proposed model outputs a 360-dimensional vector (c_1, ..., c_360), which represents pitch on a logarithmic scale measured in cents (a unit for measuring small musical intervals). Each dimension of the output vector corresponds to a frequency bin, covering a frequency range from 32.70 Hz to 1975.5 Hz in 20-cent intervals. The target output vector approximates a Gaussian curve produced with a Gaussian kernel smoother [12]. Afterwards, we calculate the pitch value by taking the local weighted average of the pitches of the frequency bins closest to the bin with the highest peak value, as shown in Eq. (2). The resulting pitch value in cents is converted back to its frequency equivalent in Hz using Eq. (3), where f is the resulting frequency and 1200 cents correspond to a single octave. f_ref represents the reference frequency, which is set to 10 Hz throughout our experimentation.

ĉ = Σ_{i=m-4}^{m+4} (ŷ_i c_i) / Σ_{i=m-4}^{m+4} ŷ_i,  where m = argmax_i ŷ_i    (2)

f = f_ref · 2^(ĉ/1200)    (3)
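As a concrete illustration of Eqs. (2)-(3), a small NumPy sketch of the decoding step might look as follows; the bin-to-cent mapping is derived from the frequency range quoted above, and the variable names are our own assumptions rather than the authors' code.

```python
import numpy as np

F_REF = 10.0                                          # reference frequency in Hz
# Cent value of each of the 360 bins: 20-cent steps starting near 32.70 Hz.
CENTS = 1200 * np.log2(32.70 / F_REF) + 20 * np.arange(360)


def decode_pitch(y_hat, f_ref=F_REF):
    """Map a 360-dim activation vector y_hat to a frequency in Hz."""
    m = int(np.argmax(y_hat))
    lo, hi = max(0, m - 4), min(len(y_hat), m + 5)    # 9-bin window around the peak
    c_hat = np.sum(y_hat[lo:hi] * CENTS[lo:hi]) / np.sum(y_hat[lo:hi])   # Eq. (2)
    return f_ref * 2.0 ** (c_hat / 1200.0)            # Eq. (3)
```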
3. EXPERIMENTAL SETUP

3.1. Datasets
The proposed model is trained and evaluated on three publicly available standard datasets, namely MIR-1k [21], MDB-stem-synth [22], and PTDB-TUG [23]. MIR-1k contains 1000 audio clips of people singing Chinese pop songs (11 male and 8 female singers) with pitch annotations. The right channel of each clip contains the singing voice, and the left channel holds the musical accompaniment. The clips are between 4 and 13 seconds long, for a total of 133 minutes of recordings. The MDB-stem-synth dataset contains 230 tracks resynthesized from the MedleyDB dataset [27]; it comprises 418 minutes of diverse musical instruments and singing voices. Besides that, we also use the Pitch Tracking Database provided by the Graz University of Technology (PTDB-TUG). This dataset comprises 4720 speech and laryngograph recordings of 20 native English speakers (10 males and 10 females) with a total length of 576 minutes. These three datasets have different characteristics, covering various musical instruments, singing voices, and speech signals.
The proposed model is trained using 5-fold cross-validation with a 60/20/20 split of train, validation, and test data, respectively. The split is carried out in such a way that no artist/speaker/instrument overlaps between the train and test splits. To investigate the model's robustness against background noise, we also trained and evaluated the model on a dataset corrupted with musical accompaniment noise: for the MIR-1k dataset, we added the musical accompaniment to the original clean audio at signal-to-noise ratio (SNR) levels of 20 dB, 10 dB, and 0 dB.
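For illustration, one simple way to create such mixtures is to scale the accompaniment so the mixture reaches the target SNR relative to the clean signal; the helper below is a sketch under that assumption, not the authors' exact preprocessing.

```python
import numpy as np


def mix_at_snr(clean, noise, snr_db):
    """Add 'noise' to 'clean' so that the mixture has the requested SNR in dB."""
    noise = noise[: len(clean)]                     # align lengths
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2) + 1e-12           # avoid division by zero
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise


# e.g. the 20 dB, 10 dB, and 0 dB conditions reported in Table 2:
# noisy_20db = mix_at_snr(vocals, accompaniment, 20)
```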
We compare the proposed model with baseline models that include CREPE [12] and SWIPE [9]. The DSP based SWIPE algorithm estimates pitch by template matching against the spectrum of a sawtooth waveform. On the other hand, CREPE is a data-driven deep learning model based on a CNN and trained on multiple datasets (MIR-1k [21], MDB-stem-synth [22], MedleyDB [27], RWC-Synth [8], NSynth [28], and Bach10 [29]). We use the full version of the CREPE model (22.2M parameters) without Viterbi smoothing, as provided by the authors. Both chosen baselines operate directly on the raw waveform in the time domain.
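For reference, the authors' open-source crepe package can reproduce this baseline configuration; the call below follows the package's documented predict interface, though the exact arguments should be checked against the installed version.

```python
# pip install crepe
import crepe
from scipy.io import wavfile

sr, audio = wavfile.read("example.wav")
# Full-capacity model, no Viterbi smoothing, as in the comparison above.
time, frequency, confidence, activation = crepe.predict(
    audio, sr, model_capacity="full", viterbi=False
)
```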
To evaluate and compare the performance of the proposed model with the baselines, we use the evaluation metrics defined in [30]. Raw Pitch Accuracy (RPA) measures the percentage of audio frames for which the frequency estimate is accurate within a threshold, which is 50 cents in our case. Raw Chroma Accuracy (RCA) also measures the percentage of audio frames for which the frequency estimate is correct; however, octave errors are ignored, as estimates are mapped onto a single octave. Note that both RPA and RCA ignore voicing errors.
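The paper relies on the mir_eval implementation of these metrics [30]; purely for illustration, a stand-alone NumPy sketch of RPA and RCA over voiced frames (50-cent tolerance, octave errors folded out for RCA) could look like this.

```python
import numpy as np


def to_cents(freq_hz, f_ref=10.0):
    """Convert frequencies in Hz to cents above the reference frequency."""
    return 1200 * np.log2(freq_hz / f_ref)


def raw_pitch_accuracy(ref_hz, est_hz, tol_cents=50):
    diff = np.abs(to_cents(est_hz) - to_cents(ref_hz))
    return np.mean(diff <= tol_cents)


def raw_chroma_accuracy(ref_hz, est_hz, tol_cents=50):
    diff = np.abs(to_cents(est_hz) - to_cents(ref_hz))
    diff = np.abs(diff - 1200 * np.round(diff / 1200))   # ignore octave errors
    return np.mean(diff <= tol_cents)
```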
4. RESULTS AND DISCUSSION

4.1. Pitch Accuracy
The proposed model is compared with the state-of-the-art models CREPE [12] and SWIPE [9]. The results are shown in Table 1. Our proposed model outperforms the baseline models in terms of raw pitch/chroma accuracy on clean audio across all three datasets. In terms of raw pitch accuracy, the DeepF0 model shows 1.33% and 9.29% relative improvement over the CREPE and SWIPE models on the MIR-1k dataset, respectively. A similar trend can be seen in raw chroma accuracy, where the proposed model outperforms both baseline models with no added noise across all the datasets. On the MDB-stem-synth dataset, DeepF0 achieves near-perfect pitch estimation with 98.38% RPA and 98.44% RCA. We also evaluated our model on an additional dataset (PTDB-TUG), which has heterogeneous timbre (speaking voices) compared to MIR-1k (singing voices) and MDB-stem-synth (musical instruments). Our model performs significantly better on PTDB-TUG than CREPE and SWIPE. On the PTDB-TUG dataset, CREPE performs worst of all the models, which could be attributed to the fact that the model provided by the authors was not trained on the PTDB-TUG dataset. DeepF0 also shows more stability, as it demonstrates consistently lower variance in pitch accuracy than its baselines.
Table 1: Average raw pitch accuracy (RPA) and raw chroma accuracy (RCA) and their standard deviation (±) tested on three different test datasets.

Model  | Params | Metric  | MIR-1k | MDB-stem-synth | PTDB-TUG
SWIPE  | -      | RPA (%) | 88.73  |                |
       |        | RCA (%) | 89.24  |                |
CREPE  | 22.2M  | RPA (%) | 96.51  |                |
       |        | RCA (%) | 96.84  |                |
DeepF0 | 5M     | RPA (%) |        | 98.38          |
       |        | RCA (%) |        | 98.44          |

Table 2: Average raw pitch accuracy and raw chroma accuracy and their standard deviation (±) on the MIR-1k dataset with added noise at various SNR levels.

Model  | Metric  | Clean | 20dB | 10dB | 0dB
SWIPE  | RPA (%) | 88.73 |      |      |
       | RCA (%) | 89.24 |      |      |
CREPE  | RPA (%) | 96.51 |      |      |
       | RCA (%) | 96.84 |      |      |
DeepF0 | RPA (%) |       |      |      |
       | RCA (%) |       |      |      |

Fig. 3: The estimated pitch trajectories (frequency in Hz) of DeepF0 in comparison with the ground truth under clean (top) and 0 dB noise (bottom) conditions. Under the no-noise scenario DeepF0 produces near-perfect pitch estimation, while under noise there are a few errors.
Ideally, a model should still perform reasonably well even in noisy environments. We put our proposed model into such testing scenarios by contaminating the signals with musical accompaniment at various SNR levels on the MIR-1k dataset; the results are presented in Table 2. In general, our proposed method achieves higher RPA and RCA than the baselines under 10 dB and 20 dB noise. However, CREPE performs better when the SNR is as low as 0 dB. On the other hand, the performance of the SWIPE model is worst under 10 dB and 0 dB noise. Overall, DeepF0 achieves better performance under low to moderate noise, while CREPE works better under moderate to high noise.
Our proposed model is more efficient in terms of the number of parameters used for training, which is around 5 million, compared with the CREPE model, which uses 22.2 million parameters. Thus, our DeepF0 model can still outperform CREPE with 77.4% fewer parameters.

We also evaluated the proposed model with different dilation rates d = 1, 2, 4, 8, 16 on the MDB-stem-synth dataset; the raw pitch accuracies are depicted in Figure 4. DeepF0 with d = 1, which is basically a standard CNN, only achieves about 86.40% RPA and 86.55% RCA. Dilation rates d = 2 and d = 4 improve the performance and reach results similar to the CREPE model; however, these results are not very stable in terms of variance. We find that d = 8 gives the best results in terms of RPA (98.38%), RCA (98.44%), and variance. DeepF0 does not show any performance improvement beyond a dilation rate of d = 8.

We have also analyzed some of the design choices made while constructing the architecture of DeepF0; the results of our ablation study are presented in Table 3. We find that residual connections make a significant difference when it comes to stabilizing and speeding up the training process: not only does the model converge faster, but they also help achieve higher performance with lower variance. Besides this, we have not used dropout layers in our network, as we observed that they seemed redundant in the presence of the weight normalization layer and had almost no effect on the final results.

Fig. 4: Evaluation results (testing RPA, %) of the proposed model with different dilation rates (d = 1, 2, 4, 8, 16) on the MDB-stem-synth dataset. Dilation rate d = 8 shows the best results.

Table 3: Evaluation results of the ablation study of our DeepF0 model. Without residual connections, the accuracy of the model decreases. With a dropout layer included in the residual blocks, the performance remains more or less the same.

Model                           | Metric  | MIR-1k
DeepF0 baseline                 | RPA (%) | 97.82
                                | RCA (%) |
DeepF0 w/o residual connections | RPA (%) |
                                | RCA (%) |
DeepF0 with dropout             | RPA (%) |
                                | RCA (%) |
5. CONCLUSIONS
In this paper, we propose a data-driven approach based on dilated temporal convolutional networks for the task of fundamental frequency estimation. The proposed DeepF0 model operates on raw audio and outputs the pitch estimate. The experimental results on three heterogeneous datasets (singing voices vs. speaking voices vs. musical instruments) reveal that DeepF0 outperforms existing baseline models such as CREPE and SWIPE. Our proposed model not only achieves better results but also uses 77.4% fewer parameters than the CREPE model, and it performs reasonably well under low to moderate noise. Further, we gain crucial insights about the large receptive field, which earlier models lacked: the length of the network's receptive field is very significant in pitch estimation and aids in achieving excellent results with consistently low variance. In the future, we would like to improve the noise robustness of our proposed model by introducing changes in the architectural design, data augmentation, and speech enhancement techniques [31]. The performance could be further improved by post-processing the pitch estimates with temporal smoothing techniques.

6. REFERENCES

[1] Donn Morrison, Ruili Wang, and Liyanage C De Silva, "Ensemble methods for spoken emotion recognition in call-centres," Speech Communication, vol. 49, no. 2, pp. 98–112, 2007.
[2] Beat Gfeller, Christian Frank, Dominik Roblek, Matt Sharifi, Marco Tagliasacchi, and Mihajlo Velimirovic, "SPICE: Self-supervised pitch estimation," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2020.
[3] Pegah Ghahremani, Bagher BabaAli, Daniel Povey, Korbinian Riedhammer, Jan Trmal, and Sanjeev Khudanpur, "A pitch extraction algorithm tuned for automatic speech recognition," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 2494–2498.
[4] M Kiran Reddy and K Sreenivasa Rao, "Excitation modelling using epoch features for statistical parametric speech synthesis," Computer Speech & Language, vol. 60, pp. 101029, 2020.
[5] Sangeun Kum and Juhan Nam, "Joint detection and classification of singing voice melody using convolutional recurrent neural networks," Applied Sciences, vol. 9, no. 7, pp. 1324, 2019.
[6] David Talkin and W Bastiaan Kleijn, "A robust algorithm for pitch tracking (RAPT)," Speech Coding and Synthesis, vol. 495, pp. 518, 1995.
[7] Alain de Cheveigné and Hideki Kawahara, "YIN, a fundamental frequency estimator for speech and music," The Journal of the Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[8] Matthias Mauch and Simon Dixon, "pYIN: A fundamental frequency estimator using probabilistic threshold distributions," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2014, pp. 659–663.
[9] Arturo Camacho and John G Harris, "A sawtooth waveform inspired pitch estimator for speech and music," The Journal of the Acoustical Society of America, vol. 124, no. 3, pp. 1638–1652, 2008.
[10] Sira Gonzalez and Mike Brookes, "PEFAC - a pitch estimation algorithm robust to high levels of noise," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 2, pp. 518–530, 2014.
[11] Kavita Kasi, Yet Another Algorithm for Pitch Tracking (YAAPT), Ph.D. thesis, Old Dominion University, 2002.
[12] Jong Wook Kim, Justin Salamon, Peter Li, and Juan Pablo Bello, "CREPE: A convolutional representation for pitch estimation," in IEEE International Conference on Acoustics, Speech and Signal Processing, 2018, pp. 161–165.
[13] Kun Han and DeLiang Wang, "Neural network based pitch tracking in very noisy speech," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 2158–2168, 2014.
[14] Mingye Dong, Jie Wu, and Jian Luan, "Vocal pitch extraction in polyphonic music using convolutional residual network," in Interspeech, 2019, pp. 2010–2014.
[15] Tara N Sainath, Ron J Weiss, Andrew Senior, Kevin W Wilson, and Oriol Vinyals, "Learning the speech front-end with raw waveform CLDNNs," in Interspeech, 2015, pp. 1–5.
[16] Prateek Verma and Ronald W Schafer, "Frequency estimation from waveforms using multi-layered neural networks," in Interspeech, 2016, pp. 2165–2169.
[17] Aaron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew Senior, and Koray Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv preprint arXiv:1609.03499, 2016.
[18] Shaojie Bai, J Zico Kolter, and Vladlen Koltun, "An empirical evaluation of generic convolutional and recurrent networks for sequence modeling," arXiv preprint arXiv:1803.01271, 2018.
[19] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep residual learning for image recognition," in IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 770–778.
[20] Yan Tian, Xun Wang, Jiachen Wu, Ruili Wang, and Bailin Yang, "Multi-scale hierarchical residual network for dense captioning," Journal of Artificial Intelligence Research, vol. 64, pp. 181–196, 2019.
[21] Chao-Ling Hsu and Jyh-Shing Roger Jang, "On the improvement of singing voice separation for monaural recordings using the MIR-1K dataset," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 310–319, 2009.
[22] Justin Salamon, Rachel M Bittner, Jordi Bonada, Juan J Bosch, Emilia Gómez Gutiérrez, and Juan Pablo Bello, "An analysis/synthesis framework for automatic f0 annotation of multitrack datasets," in ISMIR, 2017.
[23] Gregor Pirker, Michael Wohlmayr, Stefan Petrik, and Franz Pernkopf, "A pitch tracking corpus with evaluation on multipitch tracking scenario," in 12th Annual Conference of the International Speech Communication Association, 2011.
[24] Vinod Nair and Geoffrey E Hinton, "Rectified linear units improve restricted Boltzmann machines," in ICML, 2010, pp. 807–814.
[25] Tim Salimans and Durk P Kingma, "Weight normalization: A simple reparameterization to accelerate training of deep neural networks," in Advances in Neural Information Processing Systems, 2016, pp. 901–909.
[26] Diederik P Kingma and Jimmy Ba, "Adam: A method for stochastic optimization," arXiv preprint arXiv:1412.6980, 2014.
[27] Rachel M Bittner, Justin Salamon, Mike Tierney, Matthias Mauch, Chris Cannam, and Juan Pablo Bello, "MedleyDB: A multitrack dataset for annotation-intensive MIR research," in ISMIR, 2014, vol. 14, pp. 155–160.
[28] Jesse Engel, Cinjon Resnick, Adam Roberts, Sander Dieleman, Mohammad Norouzi, Douglas Eck, and Karen Simonyan, "Neural audio synthesis of musical notes with WaveNet autoencoders," in ICML, 2017, pp. 1068–1077.
[29] Zhiyao Duan, Bryan Pardo, and Changshui Zhang, "Multiple fundamental frequency estimation by modeling spectral peaks and non-peak regions," IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 8, pp. 2121–2133, 2010.
[30] Colin Raffel, Brian McFee, Eric J Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel PW Ellis, "mir_eval: A transparent implementation of common MIR metrics," in ISMIR, 2014.
[31] Yuanhang Qiu and Ruili Wang, "Adversarial latent representation learning for speech enhancement," in 21st Annual Conference of the International Speech Communication Association, 2020.