Frequency-Temporal Attention Network for Singing Melody Extraction
Shuai Yu, Xiaoheng Sun, Yi Yu and Wei Li*
School of Computer Science and Technology, Fudan University, Shanghai, China
National Institute of Informatics (NII), Tokyo, Japan
Shanghai Key Laboratory of Intelligent Information Processing, Fudan University, Shanghai, China
ABSTRACT
Musical audio is generally composed of three physical properties: frequency, time and magnitude. Interestingly, the human auditory periphery also provides neural codes for each of these dimensions to perceive music. Inspired by these intrinsic characteristics, a frequency-temporal attention network is proposed to mimic the human auditory system for singing melody extraction. In particular, the proposed model contains frequency-temporal attention modules and a selective fusion module corresponding to these three physical properties. The frequency attention module is used to select activated frequency bands, as the cochlea does, and the temporal attention module is responsible for analyzing temporal patterns. Finally, the selective fusion module recalibrates magnitudes and fuses the raw information for prediction. In addition, we propose to use another branch to simultaneously predict the presence of singing voice melody. The experimental results show that the proposed model outperforms existing state-of-the-art methods.

Index Terms — frequency-temporal attention network, singing melody extraction, music information retrieval
1. INTRODUCTION
Singing melody extraction estimates the fundamental frequency (F0) of the singing melody, which is a challenging and critical task in music information retrieval (MIR) [1]. Recently it has become an active research topic with many downstream applications of melody-based AI music, such as cover song identification [2], query-by-humming [3] and voice separation [4].

With the advances of deep learning techniques, several neural network based methods have been proposed to learn the mapping between the audio and the melody [5, 6, 7, 8]. Bittner et al. [6] proposed a fully convolutional neural network to learn the pitch salience and achieved promising results. Hsieh et al. [8] proposed an encoder-decoder architecture and a way to use the bottleneck layer of the network to estimate the presence of a melody line. Unfortunately, other types of networks without a bottleneck layer cannot enjoy this advantage. On the other hand, researchers have also attempted to combine deep learning techniques with human auditory modeling [9, 10, 11]. For example, Gao et al. [11] proposed to integrate a multi-dilation semantic segmentation model with multi-resolution representations to respectively simulate the top-down and bottom-up processes of human perception for singing melody extraction. However, prior works treat the three dimensions (i.e., frequency, time and magnitude) equally, which is inconsistent with human audition. In addition, directly concatenating the spectral and temporal information may lower the performance of singing melody extraction. (Code is available at https://github.com/yushuai/FTANet-melodic.)

The human auditory periphery provides neural codes for frequency, time and magnitude [12]. When a sound enters the cochlea, different frequencies within the sound selectively stimulate different regions of the cochlea. Humans can then extract the pitch via temporal patterns generated by unresolved harmonics in the auditory periphery [13]. In this paper, we focus on designing a frequency-temporal attention based model for singing melody extraction to simulate this mechanism.

Based on psycho-acoustic research, the human auditory system selects the stimulated frequency bands in the cochlea. Accordingly, auto-correlation is performed to capture temporal correlations in the auditory cortex. To mimic this mechanism, we introduce frequency attention to assign different weights in the spectrogram along the frequency axis, which corresponds to selecting the stimulated frequency bands in the cochlea. To simulate the process of auto-correlation, a temporal attention module is proposed to capture temporal relations between adjacent frames, which can model more complex nonlinear temporal relationships.

Neuro-physiological research shows that some high-frequency signals cannot be perceived by the spectral mode, but temporal models can weakly perceive them [12]. We argue that features from spectral and temporal models cannot simply be concatenated, as the concatenated features may bring noise into melody extraction and hinder the further improvement of this task. Extending the idea of selective kernel networks used in computer vision [14], we suggest a selective fusion module to dynamically select the features from the frequency and temporal attention networks. Accordingly, we fuse the features for the subsequent singing melody extraction.

As shown in Fig. 1, in diagram (b), the related frequency bands are enhanced and the unrelated frequency bands are suppressed.
Diagram (c) shows that the melody line becomes more salient via the temporal attention. In diagram (d), the selective fusion module dynamically selects and fuses features from (b) and (c) to generate the predicted melody line. We hypothesize that a frequency-temporal attention based deep architecture can be learned for singing melody extraction to simulate these potential intrinsic characteristics. Three technical contributions are made: i) we propose a novel frequency-temporal attention network that mimics the human auditory system by assigning different weights along the time and frequency axes; to the best of our knowledge, there is no such work for singing melody extraction in the literature. ii) A selective fusion module is proposed to dynamically assign weights to the spectral and temporal features, which are then fused for melody extraction. iii) We propose to use another branch to directly predict the presence of melody.

Fig. 1: Procedures of singing melody extraction for the 128-256 ms segment of 'daisy4.wav' in the ADC2004 dataset performed by the proposed model: (a) input spectrogram, (b) frequency attention output, (c) temporal attention output, (d) selective fusion output.
2. PROPOSED MODEL
The overall architecture of our frequency-temporal attention network is shown in Fig. 2. It has two branches: the top branch contains stacked convolution layers performing fast downsampling to obtain a high-level semantic representation for singing melody detection, and the bottom branch contains the proposed frequency-temporal attention and selective fusion modules. The frequency and temporal attention, the selective fusion module, and the singing melody detection branch of the proposed deep architecture are addressed in the following subsections.

Fig. 2: The overall architecture of the proposed model. The top branch is proposed to predict the presence of a melody. The bottom branch consists of the proposed frequency-temporal attention module and selective fusion module for F0 estimation. 'k' and 'c' denote the kernel size and the number of channels, respectively.
2.1. Input Representation

We choose the Combined Frequency and Periodicity (CFP) representation [15] as the input of our model due to its effectiveness and popularity. The CFP representation has three parts: the power-scaled spectrogram, the generalized cepstrum (GC) [16, 17] and the generalized cepstrum of spectrum (GCoS) [18]. We use a 44100 Hz sampling rate, a 2,048-sample window size, and a 256-sample hop size for computing the short-time Fourier transform (STFT).
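For concreteness, the sketch below outlines how a CFP-style input can be computed with the STFT settings above, in the spirit of [15]. It is only an illustrative approximation: the gamma exponents are assumptions, and the high-pass filtering and the mapping of all three parts onto a common log-frequency grid used in practice are omitted.

```python
# Rough sketch of a CFP-style input (power-scaled spectrogram, generalized
# cepstrum, generalized cepstrum of spectrum) following the idea of [15].
# The gamma exponents are illustrative assumptions; high-pass filtering and
# the mapping to 320 log-frequency bins are omitted here.
import numpy as np
import librosa

def cfp_parts(path, sr=44100, n_fft=2048, hop=256, gammas=(0.24, 0.6, 1.0)):
    y, _ = librosa.load(path, sr=sr)
    mag = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop))      # (1025, T)
    z = np.power(mag, gammas[0])                                    # power-scaled spectrogram
    gc = np.power(np.abs(np.fft.irfft(z, axis=0)), gammas[1])       # generalized cepstrum (GC)
    gcos = np.power(np.abs(np.fft.rfft(gc, axis=0)), gammas[2])     # generalized cepstrum of spectrum (GCoS)
    return z, gc, gcos
```

In practice the three matrices would be resampled onto the same log-frequency grid and stacked as input channels of the network.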
2.2. Frequency-Temporal Attention Module

The design of the frequency-temporal attention module represents the first contribution of this work. Its detailed architecture is shown in Fig. 3. Unlike previous works for speech-related tasks, we do not employ classic attention models [19, 20]; instead, we use 1-D convolution to select relevant regions (frequency/time) for this task.

Fig. 3: The detailed architecture of the proposed frequency-temporal attention module. E_f and E_t are the outputs of the frequency attention and the temporal attention, respectively.

Formally, given the input feature map S ∈ R^{F×T×C'}, an average pooling along the time axis is first applied to summarize the distribution of magnitudes. Unlike 2-D average pooling, we use row average pooling to achieve this. The frequency descriptor f ∈ R^{F×C'} is calculated as

f_i = \frac{1}{T} \sum_{j=1}^{T} s_{ij},    (1)

where s_{ij} is the element in the i-th row and j-th column of S. Then 1-D convolution is used to dynamically select task-related frequencies for singing melody extraction. Since 1-D convolution is good at learning relationships along the frequency bins or the time axis, we choose to employ it to perform frequency-temporal attention. For a frequency descriptor f, the 1-D convolution can be written as

V^l = f * k^l,    (2)

where k^l is the kernel of the 1-D convolution at the l-th layer, V^l ∈ R^{F×C} is the newly generated feature map, and * stands for the convolution operation. Finally, we apply a softmax layer to the output V of the 1-D convolution layers to obtain the frequency attention map A_f ∈ R^{F×C}:

A_f = \mathrm{softmax}(V).    (3)

Similarly, we obtain the temporal attention map A_t ∈ R^{T×C} through the same process. Meanwhile, we feed the feature map S into 2-D convolution layers with kernel sizes of 3 and 5 to generate two new feature maps {S_f, S_t} ∈ R^{F×T×C}. Then we perform matrix multiplications between {S_f, S_t} and {A_f, A_t} to obtain the final output E = {E_f, E_t}:

E_f = S_f \otimes \mathrm{broadcast}(A_f), \quad E_t = S_t \otimes \mathrm{broadcast}(A_t),    (4)

where broadcast makes matrices with different shapes compatible for element-wise multiplication.
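For illustration, a minimal TensorFlow/Keras sketch of the frequency attention path (Eqs. (1)-(4)) is given below; the channel count, kernel sizes, and the axis over which the softmax is taken are our assumptions rather than the authors' exact configuration.

```python
# Minimal sketch of the frequency attention path of the FTA module.
# Channel count, kernel sizes, and the softmax axis are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def frequency_attention(S, channels=32, kernel_size=3):
    """S: feature map of shape (batch, F, T, C')."""
    f = tf.reduce_mean(S, axis=2)                                  # Eq. (1): row average pooling over time -> (batch, F, C')
    v = layers.Conv1D(channels, kernel_size, padding='same')(f)    # Eq. (2): 1-D convolution along the frequency axis
    a_f = tf.nn.softmax(v, axis=1)                                 # Eq. (3): attention map A_f (softmax over frequency, assumed)
    s_f = layers.Conv2D(channels, 3, padding='same')(S)            # 2-D convolution producing S_f
    e_f = s_f * tf.expand_dims(a_f, axis=2)                        # Eq. (4): broadcast A_f over time and reweight S_f
    return e_f
```

The temporal attention path is obtained analogously by averaging over the frequency axis and convolving along time.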
2.3. Selective Fusion Module

The design of the selective fusion module (SFM) represents the second contribution of this work. Inspired by the idea of selective kernel networks used in computer vision [14], we devise this module to dynamically select spectral and temporal features and fuse them. Its detailed architecture is shown in Fig. 4. The module takes three inputs: the feature map S' ∈ R^{F×T×C} generated from S via a (1 × 1) convolution, the frequency attention result E_f, and the temporal attention result E_t. First, an element-wise addition fuses the three inputs into a new feature map Γ, and then a global average pooling (GAP) produces the global descriptor g ∈ R^C:

g = \frac{1}{F \times T} \sum_{i=1}^{F} \sum_{j=1}^{T} \Gamma_{ij}.    (5)

After a fully connected (FC) layer for non-linear transformation, three FC layers are used to learn the importance of each channel of the feature maps, and a softmax layer is applied to obtain the attention maps. Matrix multiplication is then performed between the three inputs and the attention maps to obtain the weighted feature maps. Finally, we fuse the three weighted feature maps by element-wise addition. The fused feature map contains rich information selected from the spectral and temporal modes.

Fig. 4: The detailed architecture of the proposed selective fusion module.

2.4. Melody Detection Branch

The design of the melody detection branch (MDB) represents the third contribution of this work. The motivation of this branch is to design a general scheme that does not depend on a special structure and can quickly predict the presence of a melody. In this branch, we employ four stacked convolution layers to directly downsample the spectrograms. Specifically, the first three convolution layers use a kernel and stride of size 4 along the frequency axis, and the last convolution layer uses a kernel and stride of size 5, so that the 320 frequency bins are collapsed to a single row (4 · 4 · 4 · 5 = 320). As a result, the output of this branch is m ∈ R^{1×T}. Following [8], the output of the MDB is concatenated with the salience map to make an (F + 1) × T matrix. By including this branch, the voicing recall (VR) rate can be improved.
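A minimal sketch of the melody detection branch is given below, assuming the strided convolutions act along the 320-bin frequency axis so that one presence value per frame remains; the channel count and activation choices are illustrative assumptions.

```python
# Minimal sketch of the melody detection branch (MDB).
# Assumes strides along the frequency axis: 320 -> 80 -> 20 -> 5 -> 1.
# Channel count and activations are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def melody_detection_branch(x, channels=32):
    """x: input feature map of shape (batch, 320, T, C)."""
    for k in (4, 4, 4):
        x = layers.Conv2D(channels, (k, 1), strides=(k, 1),
                          padding='same', activation='relu')(x)
    x = layers.Conv2D(1, (5, 1), strides=(5, 1), padding='same')(x)
    x = tf.squeeze(x, axis=[1, 3])      # (batch, T): per-frame melody-presence scores
    return x                            # concatenated with the F x T salience map -> (F + 1) x T
```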
3. EXPERIMENTS

3.1. Experiment Setup
We randomly choose 60 vocal tracks from the MedleyDB dataset. To increase the amount of training data, we augment the training dataset by copying some of the chosen vocal tracks. Accordingly, there are 98 clips in the training dataset. We select only samples whose melody is sung by a human voice from ADC2004, MIREX 05 (available at https://labrosa.ee.columbia.edu/projects/melody) and MedleyDB as test sets. As a result, 12 clips in ADC2004, 9 clips in MIREX 05 and 12 clips in MedleyDB are selected. Note that there is no overlap between the training and testing datasets.

To adapt to the pitch ranges required in singing melody extraction, we follow [8] in setting the hyperparameters for computing the CFP for our model. For vocal melody extraction, the number of frequency bins is set to 320, with 60 bins per octave, and the frequency range is from 31 Hz (B0) to 1250 Hz (D♯6). We divide the training clips into fixed-length segments of T = 128 frames, which is 128 milliseconds. Our model is implemented with Keras (https://keras.io). For the model update, we choose binary cross entropy as the loss function. The Adam optimizer is used with a learning rate of 0.0001.

Following the convention in the literature, we use the following metrics for performance evaluation: overall accuracy (OA), raw pitch accuracy (RPA), raw chroma accuracy (RCA), voicing recall (VR) and voicing false alarm (VFA). We use the mir_eval library [21] with the default settings to calculate the metrics. For each metric other than VFA, a higher score means higher performance. In the literature [1], OA is often considered more important than the other metrics.
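As an illustration of how these metrics can be computed with mir_eval's default settings, the sketch below assumes reference and estimated melody contours given as time stamps in seconds and F0 values in Hz (0 Hz marking non-melody frames); the helper name is ours, not part of the paper.

```python
# Illustrative evaluation with mir_eval's default melody metrics.
# ref_*/est_* are numpy arrays of time stamps (s) and F0 values (Hz, 0 = unvoiced).
import mir_eval

def evaluate_track(ref_time, ref_freq, est_time, est_freq):
    scores = mir_eval.melody.evaluate(ref_time, ref_freq, est_time, est_freq)
    return {'OA':  scores['Overall Accuracy'],
            'RPA': scores['Raw Pitch Accuracy'],
            'RCA': scores['Raw Chroma Accuracy'],
            'VR':  scores['Voicing Recall'],
            'VFA': scores['Voicing False Alarm']}
```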
3.2. Ablation Study

To investigate how much the proposed frequency-temporal attention module contributes to the model, we first remove the frequency attention, so that only the temporal attention is used to encode the temporal information. As shown in Table 1, the performance on both datasets decreases. When focusing on OA, the performance drops by 2.4% on ADC2004 and by 2.8% on MIREX 05. We then remove the temporal attention and keep only the frequency attention; the OA on the two datasets drops by 2.9% and 7.3%, respectively.

We then investigate the effectiveness of the selective fusion module. When focusing on OA, the performance of the ablated version drops by 2.8% on ADC2004 and by 5.9% on MIREX 05. The results justify the assumption that direct concatenation may hinder the further improvement of the model. Lastly, we investigate the effectiveness of the proposed melody detection branch. The OA on the two datasets drops by 1.4% and 3.7%, respectively, which demonstrates the effectiveness of the proposed simple design for melody detection. However, when focusing on VFA, the ablated version achieves a better score than the proposed model. We analyzed this phenomenon and found that the shallow CNNs cannot capture semantic information well, so they are prone to predicting a non-melody frame with complex harmonic patterns as a melody one.

Table 1: Results of the ablation study on the ADC2004 and MIREX 05 datasets. All values are percentages. "w/o F-att." and "w/o T-att." denote removing the frequency and temporal attention, respectively; "w/o SFM" stands for without the selective fusion module; "w/o MDB" stands for without the melody detection branch.

(a) ADC2004 (vocal)
Method       OA    RPA   RCA   VR    VFA
w/o F-att.   83.9  82.4  84.2  84.1  8.8
w/o T-att.   83.5  82.3  84.7  85.2  8.4
w/o SFM      84.5  83.4  85.1  85.2  -
w/o MDB      84.7  84.0  85.6  87.1  11.3
Proposed     85.9  85.2  86.7  87.5  11.1

(b) MIREX 05 (vocal)
Method       OA    RPA   RCA   VR    VFA
w/o F-att.   81.7  76.5  77.7  81.0  -
w/o T-att.   78.3  78.0  79.0  84.0  20.2
w/o SFM      79.3  74.9  76.0  78.8  13.2
w/o MDB      81.0  79.6  80.6  84.6  16.6
Proposed     84.0  80.0  80.8  85.2  9.7

3.3. Comparison with Baseline Methods

Table 2: Results of the proposed and baseline methods on the ADC2004, MIREX 05 and MedleyDB datasets. All values are percentages.

(a) ADC2004 (vocal)
Method       OA    RPA   RCA   VR    VFA
MCDNN [5]    69.8  70.2  74.3  79.0  30.7
SegNet [8]   81.6  82.7  84.9  87.4  18.6
MD+MR [11]   82.8  82.4  84.6  85.3  12.7
Proposed     85.9  85.2  86.7  87.5  11.1

(b) MIREX 05 (vocal)
Method       OA    RPA   RCA   VR    VFA
MCDNN [5]    75.6  69.4  71.1  74.9  14.7
SegNet [8]   78.6  78.4  79.7  85.7  21.5
MD+MR [11]   80.7  78.6  79.5  84.3  16.9
Proposed     84.0  80.0  80.8  85.2  9.7

(c) MedleyDB (vocal)
Method       OA    RPA   RCA   VR    VFA
MCDNN [5]    60.0  52.2  56.3  59.5  19.6
SegNet [8]   65.8  -     -     -     -
The performances on the three datasets are listed in Table 2. Three baseline methods are compared: MCDNN [5], SegNet [8] and MD+MR [11]. We carefully tune the hyperparameters of the three baseline methods to ensure that they reach their peak performance on our training dataset, and the proposed model and the three baselines are trained on the same data. Compared with the baseline methods, the proposed method generally achieves the highest scores, which clearly confirms its effectiveness and robustness. When focusing on OA, the proposed method outperforms MCDNN by 23.1% on ADC2004, by 11.1% on MIREX 05, and by 10.5% on MedleyDB. Since all three baselines have melody detectors, when focusing on VR and VFA, the proposed method achieves the highest scores on the ADC2004 and MIREX 05 datasets, and comparable results on MedleyDB.
Fig. 5: Visualization of singing melody extraction results on two opera songs using different models: (a) opera_male3 with ours, (b) opera_male3 with SegNet, (c) opera_male5 with ours, (d) opera_male5 with SegNet.

To investigate what types of errors are solved by the proposed model, a case study is performed on two opera songs, "opera_male3.wav" and "opera_male5.wav", from the ADC2004 dataset. We choose SegNet [8] for comparison due to its effectiveness and popularity. As depicted in Fig. 5, there are fewer octave errors in diagrams (a) and (c) than in diagrams (b) and (d). Moreover, from 1000-1200 ms in diagram (d), we can find errors that predict a wrong frequency bin near the correct one, which are correctly predicted in diagram (c). Through the visualization of the predicted melody contours, we can say that the performance gains of the proposed model can be attributed to resolving octave errors and other errors. However, we can also observe that there seem to be more melody detection errors (i.e., predicting a non-melody frame as a melody one) than in SegNet [8]. The goal of the melody detection branch is to quickly predict the presence of the melody without depending on a special structure. We leave designing a more accurate and fast network for improving the quality of melody detection as a future research topic.
4. CONCLUSION
In this paper, we propose a novel frequency-temporal attention network that mimics the human auditory system for singing melody extraction. It mainly contains three novel modules: frequency-temporal attention, selective fusion, and singing melody detection. Frequency-temporal feature learning and singing melody detection are learned simultaneously in an end-to-end way. Experimental results show that the proposed model outperforms existing state-of-the-art models on three datasets. Designing more accurate and faster methods to improve the performance of singing melody detection will be our future work.
5. ACKNOWLEDGEMENT
This work was supported by the National Key R&D Program of China (2019YFC1711800) and NSFC (61671156).

6. REFERENCES

[1] Justin Salamon, Emilia Gómez, Daniel P. W. Ellis, and Gaël Richard, "Melody extraction from polyphonic music signals: Approaches, applications, and challenges," IEEE Signal Processing Magazine, vol. 31, no. 2, pp. 118–134, 2014.
[2] Joan Serra, Emilia Gómez, and Perfecto Herrera, "Audio cover song identification and similarity: background, approaches, evaluation, and beyond," in Advances in Music Information Retrieval, pp. 307–332, Springer, 2010.
[3] Chung-Che Wang and Jyh-Shing Roger Jang, "Improving query-by-singing/humming by combining melody and lyric information," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, no. 4, pp. 798–806, 2015.
[4] Yukara Ikemiya, Kazuyoshi Yoshii, and Katsutoshi Itoyama, "Singing voice analysis and editing based on mutually dependent F0 estimation and source separation," in Proc. ICASSP, 2015, pp. 574–578.
[5] Sangeun Kum, Changheun Oh, and Juhan Nam, "Melody extraction on vocal segments using multi-column deep neural networks," in Proc. ISMIR, 2016, pp. 819–825.
[6] Rachel M. Bittner, Brian McFee, Justin Salamon, Peter Li, and Juan Pablo Bello, "Deep salience representations for F0 estimation in polyphonic music," in Proc. ISMIR, 2017, pp. 63–70.
[7] Dogac Basaran, Slim Essid, and Geoffroy Peeters, "Main melody extraction with source-filter NMF and CRNN," in Proc. ISMIR, 2018, pp. 82–89.
[8] Tsung-Han Hsieh, Li Su, and Yi-Hsuan Yang, "A streamlined encoder/decoder architecture for melody extraction," in Proc. ICASSP, 2019, pp. 156–160.
[9] Hsin Chou, Ming-Tso Chen, and Tai-Shih Chi, "A hybrid neural network based on the duplex model of pitch perception for singing melody extraction," in Proc. ICASSP, 2018, pp. 381–385.
[10] Ming-Tso Chen, Bo-Jun Li, and Tai-Shih Chi, "CNN based two-stage multi-resolution end-to-end model for singing melody extraction," in Proc. ICASSP, 2019, pp. 1005–1009.
[11] Ping Gao, Cheng-You You, and Tai-Shih Chi, "A multi-dilation and multi-resolution fully convolutional network for singing melody extraction," in Proc. ICASSP, 2020, pp. 551–555.
[12] William A. Yost, "Pitch perception," Attention, Perception, & Psychophysics, vol. 71, no. 8, pp. 1701–1715, 2009.
[13] J. F. Schouten, R. J. Ritsma, and B. Lopes Cardozo, "Pitch of the residue," The Journal of the Acoustical Society of America (JASA), vol. 34, no. 9B, pp. 1418–1424, 1962.
[14] Xiang Li, Wenhai Wang, Xiaolin Hu, and Jian Yang, "Selective kernel networks," in Proc. CVPR, 2019, pp. 510–519.
[15] Li Su and Yi-Hsuan Yang, "Combining spectral and temporal representations for multipitch estimation of polyphonic music," IEEE/ACM Transactions on Audio, Speech, and Language Processing (TASLP), vol. 23, no. 10, pp. 1600–1612, 2015.
[16] Takao Kobayashi and Satoshi Imai, "Spectral analysis using generalised cepstrum," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1235–1238, 1984.
[17] Keiichi Tokuda, Takao Kobayashi, Takashi Masuko, and Satoshi Imai, "Mel-generalized cepstral analysis - a unified approach to speech spectral estimation," in Proc. ICSLP, 1994.
[18] Li Su, "Between homomorphic signal processing and deep neural networks: Constructing deep algorithms for polyphonic music transcription," in Proc. APSIPA ASC, 2017, pp. 884–891.
[19] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio, "Neural machine translation by jointly learning to align and translate," in Proc. ICLR, 2015.
[20] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin, "Attention is all you need," in Proc. NeurIPS, 2017, pp. 5998–6008.
[21] Colin Raffel, Brian McFee, Eric J. Humphrey, Justin Salamon, Oriol Nieto, Dawen Liang, and Daniel P. W. Ellis, "mir_eval: A transparent implementation of common MIR metrics," in Proc. ISMIR, 2014.