Between Homomorphic Signal Processing and Deep Neural Networks: Constructing Deep Algorithms for Polyphonic Music Transcription
Li Su
Institute of Information Science, Academia Sinica, Taiwan
E-mail: [email protected], Tel: +886-2788-3799
Abstract—This paper presents a new approach to understanding how deep neural networks (DNNs) work by applying homomorphic signal processing techniques. Focusing on the task of multi-pitch estimation (MPE), this paper demonstrates the equivalence relation between a generalized cepstrum and a DNN in terms of their structures and functionality. Such an equivalence relation, together with pitch perception theories and the recently established rectified-correlations-on-a-sphere (RECOS) filter analysis, provides an alternative way of explaining the role of the nonlinear activation function and the multi-layer structure, both of which exist in a cepstrum and a DNN. To validate the efficacy of this new approach, a new feature designed in the same fashion is proposed for the pitch salience function. The new feature outperforms the one-layer spectrum in the MPE task and, as predicted, it addresses the issue of the missing fundamental effect and also achieves better robustness to noise.
I. INTRODUCTION
Automatic music transcription (AMT) refers to the task of converting acoustic music signals into symbolic notation, such as the onset time, offset time, pitch, and others. Since music is typically composed of multiple overlapped sources with diverse spectral patterns spreading over a wide frequency range, AMT is a highly challenging task, and is considered to be the holy grail in the field of machine listening of music [1], [2]. The most fundamental technique required for AMT is pitch detection. Depending on the types of source signals and the application scenario, pitch detection algorithms are designed in different ways, such as single-pitch detection for monophonic music, or melody tracking and multi-pitch estimation (MPE) for polyphonic music. The main focus of this paper is on the MPE task, the core task in building an AMT system. Solutions to the MPE task include feature-based approaches such as pitch salience functions [3], [4], [5], nonnegative matrix factorization (NMF) [6], probabilistic latent component analysis (PLCA) [7], [8], as well as convolutional sparse coding (CSC) [9], [10], to name but a few.

Recently, deep learning approaches have gained increasing attention in polyphonic pitch detection [11], [12], [13], [14], [15]. In [13], Verma and Schafer used time-domain DNNs for melody tracking in polyphonic music, and found that the learned networks resemble traditional pitch detection methods: the first layer behaves like a spectral analyzer while the second layer behaves like a comb filter for saliency mapping. Although such observations keep providing indirect evidence on the physical meanings of deep learning, it is still a highly empirical technique, with its superior performance unexplainable and its limitations unknown. Moreover, unlike its success in other fields such as image processing and speech recognition, deep learning approaches have not been proven state-of-the-art in the problems of pitch detection.
In a recent work on MPE, Kelz shows that when the training data contain concurrent notes, neural networks suffer from the entanglement issue, which somehow implies a limited deduction power of deep learning on this problem [15]. These facts reveal the urgency of understanding the theory behind deep learning, and suggest a direction: to ask whether conventional signal processing techniques can help one understand deep learning.

Recent studies focusing on how to understand deep learning are based on diverse approaches, including the scattering transform [16], [17], [18], Taylor expansion [19], generative modeling [20], renormalization groups [21], probability theory [22], tensor analysis [23], saliency maps or layer-wise relevance propagation [24], [19], [25], and cascaded filtering [26], [27]. In particular, in [26], [27], Kuo interprets convolutional neural networks (CNNs) as the so-called REctified-COrrelations On a Sphere (RECOS) filters, which answer the question of why nonlinear activation functions and multi-layer structures are useful: a nonlinear activation function eliminates the information negatively correlated to the anchor vectors (i.e., frequently occurring patterns or dictionary atoms) and, when the filters are cascaded in layers, the system improves itself in each processing step by avoiding confusion with negative-correlation samples, noise, and the background. Since it attempts to bridge the gap between the different languages used in signal processing and deep learning, such an approach is noteworthy for research in audio signal processing among all the above-mentioned approaches.

This paper will show that a structure similar to the RECOS filter can also be applied in MPE and, perhaps surprisingly, such a structure does exist in many traditional pitch detection algorithms, including the autocorrelation function (ACF), the cepstrum (a.k.a.
homomorphic signal processing) [28], as well as the YIN algorithm [29], where the RECOS filters are represented as the Fourier transform followed by a piecewise multiplication. In other words, those traditional pitch detection functions inherently have a DNN-like structure. To the best of the author's knowledge, this fact has never been discussed before. In addition, by reviewing issues such as the missing fundamental effect in pitch perception, this paper will also provide a perceptual interpretation of the nonlinear activation function in a DNN. Moreover, to prove that the proposed statements do inspire one to understand how a deep algorithm works on MPE, a novel feature computed with a three-layer network, named the generalized cepstrum of spectrum (GCoS), is proposed based on the statements. Experiments on piano and mixed-instrument datasets show that the GCoS outperforms the baseline method by showing its ability in detecting missing fundamentals and its robustness to noise.

This paper is organized as follows. Section II reviews two classic topics on the role of nonlinearity in pitch detection, the first being the theories explaining the missing fundamental effect, and the second being the cepstrum. Section III gives a generalized framework for pitch detection algorithms and discusses its relation to DNNs and RECOS filters. The remaining parts present the experiments, results, and conclusions.

II. NONLINEARITY AND PITCH DETECTION
In this section, the nonlinearity of pitch detection will be investigated in the contexts of perception and signal processing. The missing fundamental effect will be taken as an example of study in the context of psychoacoustics, while the cepstrum will be taken as an example in the context of signal processing. In later sections, such investigation will be shown to be useful in explaining the use of nonlinear activation in pitch detection.
A. Missing fundamentals
The missing fundamental effect refers to the phenomenon that one can perceive a pitch when the source signal contains few or no spectral components corresponding to that pitch and its low-order harmonics [30]. More specifically, one can perceive the pitch of a note at frequency $f_0$ when there is no spectral component at $f_0$, and even no $2f_0$, in the spectrum: there could be only spectral components at $nf_0, (n+1)f_0, \cdots, (n+k)f_0$ for some $n, k > 1$. As a result, pitch detection algorithms cannot be designed by using the spectrum only. Though seemingly unusual, the missing fundamental effect does exist commonly in musical signals; it is commonly found in the low-pitch notes of various kinds of instruments such as the 88-key piano, where the physical size of the instrument is too small to effectively radiate the signal with a long acoustic wavelength. The same also applies to music played through a cellphone, where one can still hear bass notes, although distorted, out of a tiny Lo-Fi speaker.

The missing fundamental effect has been studied and explained with mainly three different approaches. First, the period or time theory holds that a pitch is determined by the minimal period of the waveform rather than the fundamental frequency of the signal. In contrast to the place theory, which is based on the Fourier transform, the time theory explains the missing fundamental effect: a signal with spectral components at $nf_0, (n+1)f_0, \cdots, (n+k)f_0$ obviously has a period of $1/f_0$.

(Footnote: The discussion here mainly focuses on the main ideas of the theories rather than the relations and debates among them in history. For more historical background on the pitch perception theories, readers are referred to [30].)

Fig. 1. Illustration of a pitch detection algorithm based on the cepstrum.
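This time-theory argument can be illustrated with a small numerical sketch (an illustrative example, not from the paper): a tone containing only the 3rd to 5th harmonics of 100 Hz still yields an autocorrelation peak at the lag of the missing fundamental.

```python
# Time-theory demo (illustrative): a tone whose spectrum contains only the
# 3rd, 4th, and 5th harmonics of f0 = 100 Hz still has waveform period 1/f0,
# so an autocorrelation-based detector recovers the "missing" fundamental.
import numpy as np

fs = 8000                                   # sampling rate (Hz)
f0 = 100                                    # missing fundamental (Hz)
t = np.arange(fs) / fs                      # 1 second of signal
x = sum(np.sin(2 * np.pi * k * f0 * t) for k in (3, 4, 5))

# Circular ACF via the Wiener-Khinchin relation: IDFT of the power spectrum.
r = np.fft.irfft(np.abs(np.fft.rfft(x)) ** 2)

lag = 1 + np.argmax(r[1:120])               # search below lag 120, skip lag 0
print(fs / lag)                             # -> 100.0, the missing fundamental
```

The ACF peaks at lag $f_s/f_0 = 80$ samples even though no energy exists at 100 Hz, which is exactly the behavior the time theory predicts.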
The idea that "pitch is period" is the basic assumption of many pitch detection algorithms, including the cepstrum, which will be introduced later.

Second, the theories incorporating nonlinear effects are many: for example, Helmholtz claimed that one can hear the missing fundamental because of the nonlinear effect in the ear; two sinusoidal components at frequencies $f_1$ and $f_2$ would generate components at the frequencies $mf_1 + nf_2$, $m, n \in \mathbb{Z}$ [31]. Although this theory was rejected by Schouten through further experiments [30], it is still inspiring for signal processing: missing fundamentals can be detected by producing cross terms through nonlinear effects. Another example is the neural cancellation filters proposed by de Cheveigné [32], which introduce the concept of neural networks for MPE.

Third, the pattern matching theory proposed by Goldstein holds that the auditory system resolves the sinusoidal components of a complex sound, and encodes its pitch as the one which best fits the harmonic series of the components, no matter whether the $f_0$ component exists in the source signal or not [33]. The idea of pattern matching is the basis of many MPE algorithms, such as those using NMF or PLCA [6], [2]. However, for MPE, a pattern matching approach also faces challenges such as overlapped harmonics and indeterminacy of the lowest note. Since this paper mainly focuses on the role of deep structure and nonlinearity in pitch detection, the pattern matching approach will not be discussed.

B. Cepstrum
The cepstrum, also referred to as homomorphic signal processing, is a classic signal processing method with three operations: a Fourier transform, a nonlinear transform (usually a logarithm function), followed by an inverse Fourier transform [34]. Since its invention in 1963 [35], the cepstrum and its derived features have been applied in various signal processing tasks, such as deconvolution, image enhancement, speech recognition, and pitch detection, to name but a few. A thorough review of the cepstrum can be found in [36], [34].

Fig. 2. Conceptual illustration of Equations (1)-(3). Homomorphic signal processing (i.e., the cepstrum) is equivalent to a 2-layer network.

The cepstrum-based pitch detection algorithm works under a general assumption: the fast-varying part of a spectrum, usually the regularly-spaced harmonic peaks, is important for pitch detection, while the slow-varying part, such as the spectral envelope, should be discarded since it is not related to pitch, as shown in Fig. 1. To separate the fast-varying part from the slow-varying one, the spectrum is transformed into the cepstrum in the quefrency (or lag) domain through an inverse
Fourier transform. As a result, the non-pitch information lies in the low-quefrency (i.e., short-time) region. This term can be discarded by means of a high-pass filter. Notice that since the levels of the harmonic peaks in the spectrum vary largely, a nonlinear scaling (usually a logarithmic scale) of the spectrum before the inverse Fourier transform is required. Finally, the detected pitch frequency is the inverse of the lag corresponding to the maximal value of the filtered cepstrum, as shown in the right part of Fig. 1. The relation between a filtered cepstrum and a DNN will be discussed in the next section.

III. GENERALIZED FORMULATION OF PITCH SALIENCE FUNCTIONS
Since the main focus of this paper is on frame-level transcription of polyphonic music, all features are described as the signal processing of an $N$-point, time-domain segment $\mathbf{x} \in \mathbb{R}^N$, a frame of the music signal in computing the short-time Fourier transform (STFT). All vectors throughout this paper are zero-indexed, i.e., $\mathbf{x} = [x[0], x[1], \cdots, x[n], \cdots, x[N-1]]$. The $N$-point windowed discrete Fourier transform (DFT) is denoted by the operator $\mathbf{F} \in \mathbb{C}^{N \times N}$. Similarly, the $N$-point inverse DFT is represented by $\mathbf{F}^{-1}$. Denote $|\mathbf{x}|$ as the absolute value of each element in $\mathbf{x}$. Consider the following equations:

$$\mathbf{z}^{(1)} = \sigma^{(1)}\left( |\mathbf{F}\mathbf{x}| + \mathbf{b}^{(1)} \right), \quad (1)$$
$$\mathbf{z}^{(2)} = \sigma^{(2)}\left( \mathbf{W}^{(2)} \mathbf{F}^{-1} \mathbf{z}^{(1)} + \mathbf{b}^{(2)} \right), \quad (2)$$
$$\mathbf{z}^{(3)} = \sigma^{(3)}\left( \mathbf{W}^{(3)} \mathbf{F} \mathbf{z}^{(2)} + \mathbf{b}^{(3)} \right), \quad (3)$$

where $\sigma^{(i)}$ is an element-wise nonlinear transform function such that, for $\gamma_i > 0$, $i = 1, 2, 3$,

$$\sigma^{(i)}(x) = \begin{cases} x^{\gamma_i}, & x > 0, \\ 0, & x \leq 0, \end{cases} \quad (4)$$

$\mathbf{W}^{(i)} \in \mathbb{R}^{N \times N}$ is the weighting matrix for filtering the feature of interest, and $\mathbf{b}^{(i)}$ is the bias vector.

Consider the first $\lfloor N/2 \rfloor$ elements (i.e., the positive-frequency or positive-quefrency part) of $\mathbf{z}^{(i)}$. First, assume $\mathbf{W}^{(i)} = \mathbf{I}$ and $\mathbf{b}^{(i)} = \mathbf{0}$ to simplify the discussion. By definition, $\mathbf{z}^{(1)}$ in (1) is the magnitude spectrum of $\mathbf{x}$ estimated from the DFT. The $k$-th element of $\mathbf{z}^{(1)}$, $z^{(1)}[k]$, is the spectral element at the corresponding frequency $f[k] = k f_s / N$, $k = 0, 1, \cdots, \lfloor N/2 \rfloor$, where $f_s$ is the sampling frequency. Then, $\mathbf{z}^{(2)}$ in (2) is known as the generalized autocorrelation function (GACF) or the generalized cepstrum (GC) in the literature, since this formulation is equivalent to the ACF when $\gamma_1 = 2$, and is also equivalent to the cepstrum up to a linear transformation when $\gamma_1 \to 0$ [28], [37].

(Footnote: Since quefrency and lag have the same unit as time, quefrency, lag, and time are used interchangeably in the following discussion.)
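As a concrete sketch of Equations (1)-(3), the following Python code computes the three layers under the simplifying assumptions $\mathbf{W}^{(i)} = \mathbf{I}$ and $\mathbf{b}^{(i)} = \mathbf{0}$; the function names and the default $\gamma_i$ values are illustrative choices, not taken from the paper's implementation.

```python
# Sketch of Eqs. (1)-(3), assuming W^(i) = I, b^(i) = 0, and a Blackman window.
# Names such as `gcos` and the default gamma values are illustrative only.
import numpy as np

def sigma(x, gamma):
    """Element-wise power activation of Eq. (4): x^gamma for x > 0, else 0."""
    return np.where(x > 0, np.abs(x) ** gamma, 0.0)

def gcos(x, gamma=(0.24, 0.6, 1.0)):
    """Three-layer pitch salience features: magnitude spectrum z1,
    generalized cepstrum z2 (lag domain), and GCoS z3 (frequency domain)."""
    w = np.blackman(len(x))
    z1 = sigma(np.abs(np.fft.fft(w * x)), gamma[0])   # layer 1: spectrum
    z2 = sigma(np.fft.ifft(z1).real, gamma[1])        # layer 2: GC / GACF
    z3 = sigma(np.fft.fft(z2).real, gamma[2])         # layer 3: GCoS
    return z1, z2, z3
```

With `gamma=(2, 1, 1)`, `z3` reduces to the ACF of spectrum discussed in the text, and `z2` with `gamma[0]=2` reduces to the (circular) ACF.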
As the inverse DFT of a spectral feature in the frequency domain, $\mathbf{z}^{(2)}$ is also said to be a feature in the lag or quefrency domain, which has the same unit as time. Therefore, for convenience, the elements of $\mathbf{z}^{(2)}$ are indexed by $n$ in this paper. The $n$-th element, $z^{(2)}[n]$, represents the salience of the feature at the corresponding lag $q[n] = n/f_s$, $n = 0, 1, \cdots, \lfloor N/2 \rfloor$. The GC has been known as a good feature for MPE when $\gamma_1$ is set between 0 and 1; see the values adopted in [38], [39], [40], and [41]. The GACF is also known as the root cepstrum, and it has been shown to be more robust to noise than the logarithmic cepstrum in the literature of speech processing [42], [43]. Interestingly, $\mathbf{z}^{(2)}$ with either $\gamma_1 = 2$ (i.e., the ACF) or $\gamma_1 \to 0$ (i.e., the cepstrum) works only for single-pitch detection rather than MPE [44], [38]. One may view the GC as a more stable feature than the cepstrum, since the logarithmic operation is numerically unstable.

Finally, $\mathbf{z}^{(3)}$ in (3) is again a feature indexed by $k$ in the frequency domain, and $z^{(3)}[k]$ represents its $k$-th element corresponding to the frequency $f[k]$. The feature $\mathbf{z}^{(3)}$ in Equation (3) is less discussed in the literature of pitch detection. An exception is [45], where Peeters considered using the ACF of the magnitude spectrum as a feature for single-pitch detection, and showed its advantage in identifying missing fundamentals unseen in a spectrum. This is because, although the peak at the fundamental frequency is attenuated, the ACF still captures the periodicity between the peaks of the high-order harmonics by mixing those peaks up and producing a cross term at the true fundamental frequency. The ACF of spectrum is a special case of $\mathbf{z}^{(3)}$, where $\gamma_1 = 2$, $\gamma_2 = 1$, and $\gamma_3 = 1$. It is therefore straightforward to generalize $\mathbf{z}^{(3)}$ by making all $\gamma_i$ tunable parameters, where $0 < \gamma_i \leq 2$.
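As a numerical sanity check (not from the paper) of the claim that $\gamma_1 = 2$ with $\gamma_2 = 1$ recovers the ACF, the following sketch verifies the Wiener-Khinchin relation for the circular case:

```python
# With gamma_1 = 2 and gamma_2 = 1, Eq. (2) reduces to the circular ACF:
# IDFT(|DFT(x)|^2) equals the circular autocorrelation of x (Wiener-Khinchin).
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(256)

gacf = np.fft.ifft(np.abs(np.fft.fft(x)) ** 2).real          # z^(2), gamma_1 = 2
acf = np.array([np.dot(x, np.roll(x, -n)) for n in range(256)])  # direct circular ACF

print(np.allclose(gacf, acf))   # -> True
```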
Since $\mathbf{z}^{(3)}$ is computed from $\mathbf{z}^{(1)}$ by applying an (inverse) DFT and a nonlinear activation function, and then another DFT and activation, it can be interpreted as the Generalized Cepstrum of Spectrum (GCoS). This novel feature is proposed here for the first time, and its superior performance over traditional spectral features will be shown in the experiments.

As mentioned in Section II-B, the terms lying at both the low-$k$ and low-$n$ indices are uninformative for MPE since they are unrelated to pitch. The function of $\mathbf{W}^{(i)}$ and $\mathbf{b}^{(i)}$ is therefore to discard the terms unrelated to pitch or outside the pitch detection region. Assuming that $\mathbf{b}^{(i)}$ is slow-varying, the contribution of $\mathbf{b}^{(i)}$ would simply be among the terms discarded by $\mathbf{W}^{(i+1)}$. Therefore, for simplicity, $\mathbf{b}^{(i)}$ is set to $\mathbf{0}$ throughout this paper, and $\mathbf{W}^{(i)}$ is a diagonal matrix such that

$$W^{(i)}[l, l] = \begin{cases} 1, & l > k_c \ (i = 1, 3) \text{ or } l > n_c \ (i = 2); \\ 0, & \text{otherwise}, \end{cases} \quad (5)$$

where $k_c$ and $n_c$ are the indices of the cutoff frequency and cutoff quefrency, respectively. This means that $\mathbf{W}^{(2)}$ is a high-pass filter with cutoff quefrency $q_c = n_c / f_s$ and $\mathbf{W}^{(3)}$ is a high-pass filter with cutoff frequency $f_c = k_c f_s / N$. A straightforward suggestion for the values of $f_c$ and $q_c$ is the lowest pitch frequency and the shortest pitch period. In this paper, $f_c = 27.5$ Hz (the frequency of A0) and $q_c = 0.24$ ms (approximately the period of C8).

(Footnote: If the term $x^{\gamma}$ in (4) is replaced by $(x^{\gamma} - 1)/\gamma$, then it converges to $\log x$ as $\gamma \to 0$ [28], [37]. In this case, $\mathbf{z}^{(2)}$ is the cepstrum.)

Besides the spectrum, the cepstrum, and the GCoS, there are also other types of pitch salience functions which can be summarized by Equations (1)-(3). For example, the pitch salience function of the YIN algorithm is [29]:

$$\sum_{q=1}^{N} \left( x[q] - x[q+n] \right)^2, \quad (6)$$

When the source signal is assumed stationary, derivations show that (6) can be represented by $\mathbf{z}^{(2)}$, where $\gamma_1 = 2$, $\gamma_2 = 1$, $\mathbf{W}^{(2)} = -\mathbf{I}$, and $\mathbf{b}^{(2)} = 2z^{(2)}[0]$. This bias term makes the algorithm more stable to amplitude changes.

A. Relation to DNNs
As illustrated in Fig. 2, Equations (1)-(3) explicitly resemble a DNN with three fully-connected layers. Specifically, a Fourier spectrum is equivalent to a one-layer network; an ACF, cepstrum, or GC is equivalent to a two-layer network; and an ACF of spectrum or GCoS is equivalent to a three-layer network.

Equations (1)-(3) also possess some important characteristics which are different from common DNNs:

1) The fully-connected layers are complex-valued (i.e., the DFT matrix), while commonly-used DNNs are typically real-valued. Notice that since $\mathbf{z}^{(1)}$ is symmetric, the Fourier transform in the second and third layers can be replaced by a real-valued discrete cosine transform (DCT) matrix without changing the result.

2) The nonlinear activation function is a power function; the widely-used rectified linear unit (ReLU) is a special case where $\gamma_i = 1$. However, previous studies consistently suggest a rather sharp nonlinearity where $\gamma_i < 1$, since taking $\gamma_i = 1$ does not work well in MPE [38], [39], [40], [41].

3) Since the fully-connected layers are determined explicitly by the Fourier transform, each feature in each layer has a clear dimensional unit. For example, $\mathbf{z}^{(1)}$ and $\mathbf{z}^{(3)}$ are in the frequency domain while $\mathbf{z}^{(2)}$ is in the time domain.

(Footnote: More precisely, $\mathbf{W}^{(2)}$ should be called a "long-pass lifter" rather than a "high-pass filter," in order to distinguish the filtering process in the quefrency domain from the one in the frequency domain. This paper uses "high-pass filter" for both cases to simplify the terminology.)

(Footnote: Equations (1)-(3) and the discussion on the ACF are based on the assumption that the source signal is stationary. If the source signal is non-stationary, the formulation of (1)-(3), as well as the resulting weighting and bias factors, should be slightly modified, since the Wiener-Khinchin theorem no longer holds. The resulting network should be formulated in a structure like a CNN and will not be discussed here.)

Moreover, it should be emphasized that, although the network parameters (i.e., the Fourier coefficients) mentioned here are predetermined rather than learned, they do share similar properties (i.e., frequency selection) in performing pitch detection [13]. Therefore, the analogy between a cepstrum and a DNN is not only physically plausible but also provides a new way for one to better understand why and how a DNN works in pitch detection problems. This fact can be seen from the following example.

Fig. 3. The spectra ($\mathbf{z}^{(1)}$, top), GCoS ($\mathbf{z}^{(3)}$ with $\gamma_1, \gamma_2 < 1$, middle), and ACF of spectrum ($\mathbf{z}^{(3)}$ with $\gamma_1 = 2$ and $\gamma_2 = 1$, bottom) of an audio sample from 17.91 to 18.09 seconds of 'MAPS MUS-alb se2 ENSTDkCl.wav' (i.e., Cataluña in Albéniz's Suite Española, Op. 47) from the MAPS dataset. The pitch frequencies of the four notes in the sample are labeled.

B. Examples and interpretation
Fig. 3 shows the spectrum (top), the GCoS with $\gamma_1, \gamma_2 < 1$ (middle), and the ACF of spectrum (bottom) of a piano signal with four pitches D♯2 (77.78 Hz), A♯2 (116.54 Hz), F♯3 (185.00 Hz), and A♯3 (233.08 Hz). In the last two cases $\gamma_3$ is set to 1. The sampling frequency is $f_s = 44100$ Hz, and the window function is the Blackman window. To investigate the effect of noise interference on the features, a contaminated signal with pink noise such that the SNR = 10 dB is also considered. All of the illustrated spectra and GCoS features are normalized to unit norm.

(Footnote: A different value of $\gamma_3$ merely changes the scale of the output saliency function but does not change the result of pitch detection. Therefore, for simplicity, $\gamma_3$ is set to 1 for all cases in this paper.)

The clean and noisy spectra both show that the fundamental frequencies of D♯2 and A♯2 are weak compared to their high-order harmonics. There are a number of peaks unrelated to the true fundamental frequencies and their harmonics, especially in the noisy spectrum. There is also no clear trend in the spectra such that one could set a threshold function to discard those peaks. As a result, the spectrum is not an effective pitch salience function, since it is sensitive to noise and is unable to identify weak fundamental frequencies.

These problems are well resolved in the GCoS, as shown in the middle of Fig. 3. By setting $\gamma_1$ and $\gamma_2$ to values below 1, the GCoS feature enhances two identifiable peaks at the frequencies of D♯2 and A♯2. For the noisy signal, most of the fluctuation peaks are eliminated except for a low-frequency peak at 35 Hz, which is the only unwanted cross term produced by the nonlinear activation function. Moreover, the GCoS of the noisy signal is nearly identical to that of the clean signal; such robustness to noise cannot be seen in the spectrum.

The reasons why the GCoS works well are explained as follows. From the discussion in Section II-B, the pitch-related information is the periodic components in the signal. Moreover, the Fourier transform of a periodic signal typically has a harmonic pattern, which is also periodic, while the non-periodic components lie in the low-frequency or low-quefrency regions. Therefore, the non-periodic components in the $i$-th layer are eliminated by the high-pass filter $\mathbf{W}^{(i+1)}$ in the $(i+1)$-th step. The periodic components are preserved and enhanced through the Fourier transform $\mathbf{F}$ in each step. Those components having negative correlation to the DFT basis, mostly noise and non-periodic ones, are eliminated through rectification in $\sigma^{(i)}$. In other words, similar to the discussion in [26], [27], $\mathbf{W}^{(i)}\mathbf{F}$ is a set of RECOS filters for pitch detection, whose anchor vectors form the Fourier basis.

For the ACF of spectrum, there could also be an alternative interpretation: since $\mathbf{z}^{(3)}$ is an ACF, the anchor vectors could be the shifted versions of the input spectrum (i.e., $\mathbf{z}^{(1)}$) itself. It computes the correlation of the spectrum to shifted versions of itself and discards the components inducing negative correlation. Such a viewpoint also explains why the ACF of spectrum is useful in single-pitch detection [45]. It should be noticed, however, that the ACF of spectrum is not feasible for MPE. The bottom of Fig. 3 shows that, although there are some cross terms produced in the ACF of spectrum, the peaks of these cross terms are actually not at the frequencies of D♯2 and A♯2, possibly because of other nonlinear effects produced by other terms. This is the reason why a sharp nonlinear activation function, i.e., $0 < \gamma_i \leq 1$, is usually suggested for MPE.

C. Combining features in different layers
The idea of combining frequency-domain and time-domain features has been proposed by Peeters [45] and Emiya et al. [46] for single-pitch detection, and by Su et al. for MPE. The idea works well in detecting the pitch frequencies by utilizing the complementary structure of the two features: the frequency-domain features with harmonic peaks and the time-domain features with sub-harmonic peaks. The frequency-domain feature is prone to upper octave errors while robust to lower octave errors [47], [48], and the time-domain feature is vice versa. Therefore, combining them could suppress both the upper and the lower octave errors. In other words, the fusion of two features at different layers, one in the time domain and the other in the frequency domain, gives rise to an improved pitch salience function $L := L(\mathbf{z}^{(i)}, \mathbf{z}^{(i+1)})$. More specifically, we consider

$$L\left(\mathbf{z}^{(1\text{ or }3)}, \mathbf{z}^{(2)}\right) = z^{(1\text{ or }3)}[k]\, z^{(2)}\left[\left\lfloor \frac{N}{k} \right\rceil\right], \quad (7)$$

where the time-domain feature $\mathbf{z}^{(2)}$ is nonlinearly mapped into the frequency domain, and $\lfloor \cdot \rceil$ is the rounding function. For example, [45] considered some cases including the combination of the ACF of spectrum ($\mathbf{z}^{(3)}$, $\gamma_1 = 1$, $\gamma_2 = 2$, $\gamma_3 = 1$) and the cepstrum ($\mathbf{z}^{(2)}$, $\gamma_1 \to 0$, $\gamma_2 = 1$). In this paper, $\mathbf{z}^{(2)}$ and $\mathbf{z}^{(3)}$ are constructed with the same set of networks, while the parameters $\gamma_i$ are arbitrary values between 0 and 2. Besides, to improve the performance, more steps of post-processing and pitch selection are used.

(Footnote: An octave error means the estimated pitch is off by one or multiple octaves.)

IV. EXPERIMENT SETTINGS
Since this paper aims to investigate the multi-layer construction of pitch-related features, a learning-based framework for MPE is not considered here. The MPE method combining $\mathbf{z}^{(1)}$ and $\mathbf{z}^{(2)}$ introduced in [4] is taken as the baseline method for comparison. The purpose of the experiment is to compare two frequency-domain features, the GCoS ($\mathbf{z}^{(3)}$) and the magnitude spectrum ($\mathbf{z}^{(1)}$), and to show that $\mathbf{z}^{(3)}$ outperforms $\mathbf{z}^{(1)}$ in the following senses:

1) Elegant design: the feature can be used directly, without the pseudo-whitening and adaptive thresholding used in [49], [4] to filter out the spectral envelope in $\mathbf{z}^{(1)}$, while achieving a more succinct feature representation.

2) Detection of missing fundamentals: the method can be designed to detect missing fundamentals without the hand-crafted rules in [4].

3) Robustness to noise: $\mathbf{z}^{(3)}$ outperforms $\mathbf{z}^{(1)}$ because it cancels the uncorrelated parts with more layers. This effect should be more obvious at low SNR.

The source code of the proposed as well as the baseline methods will be announced publicly.

A. Pre-processing
After computing the frequency-domain feature and the time-domain feature, the same procedure as in [4] is adopted to compute the pitch profiles of both features for pitch selection. The feature pairs $(\mathbf{z}^{(3)}, \mathbf{z}^{(2)})$ are mapped into the pitch profiles $(\bar{\mathbf{z}}^{(3)}, \bar{\mathbf{z}}^{(2)})$ through a filterbank whose center frequencies follow the equal-tempered scale indexed by the pitch number $p$ such that $p = P(f) = 69 + \lfloor 12 \log_2 (f/440) \rceil$, where $\lfloor \cdot \rceil$ denotes the rounding function. For A4 ($f = 440$ Hz), $p = 69$. Then, the elements corresponding to the pitch number $p$ in $\mathbf{z}^{(3)}$ and $\mathbf{z}^{(2)}$ are merged into the $p$-th elements of $\bar{\mathbf{z}}^{(3)}$ and $\bar{\mathbf{z}}^{(2)}$, respectively, through max pooling. That is,

$$\bar{z}^{(3)}[p] = \max_{k'} z^{(3)}[k'] \quad \text{s.t.} \quad P\!\left(\frac{f_s k'}{N}\right) = p, \quad (8)$$

$$\bar{z}^{(2)}[p] = \max_{n'} z^{(2)}[n'] \quad \text{s.t.} \quad P\!\left(\frac{f_s}{n'}\right) = p. \quad (9)$$

TABLE I. MPE results on the MAPS and TRIOS datasets: Precision (P), Recall (R), and F-score (F) in %.
It is worth mentioning that this process resembles a max-pooling layer in a CNN. However, to make the feature fit the log-frequency scale for pitch detection, the filter size and the stride vary with the frequency and the time index.
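A minimal sketch of the pooling in (8) and (9), assuming the standard MIDI mapping $P(f) = 69 + \lfloor 12\log_2(f/440)\rceil$ and a piano-range restriction; all helper names are hypothetical, not from the paper's code.

```python
# Sketch of the pitch-profile max pooling of Eqs. (8)-(9): frequency bins
# (resp. lags) mapping to the same equal-tempered pitch number are merged by
# taking the maximum. Helper names and the piano range (MIDI 21-108) are
# illustrative assumptions.
import numpy as np

def pitch_number(f):
    """MIDI-style pitch number: P(f) = 69 + round(12 * log2(f / 440))."""
    return 69 + int(round(12 * np.log2(f / 440.0)))

def pool_frequency(z3, fs, N, p_min=21, p_max=108):
    """Eq. (8): max-pool z^(3) over frequency bins k' with P(fs*k'/N) = p."""
    profile = np.zeros(p_max + 1)
    for k in range(1, N // 2):
        p = pitch_number(fs * k / N)
        if p_min <= p <= p_max:
            profile[p] = max(profile[p], z3[k])
    return profile

def pool_lag(z2, fs, p_min=21, p_max=108):
    """Eq. (9): max-pool z^(2) over lags n' with P(fs/n') = p."""
    profile = np.zeros(p_max + 1)
    for n in range(1, len(z2)):
        p = pitch_number(fs / n)
        if p_min <= p <= p_max:
            profile[p] = max(profile[p], z2[n])
    return profile
```

Note that in the lag domain, low pitches map to many lags while high pitches map to few, which is why the effective pooling filter size varies with the index, as remarked above.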
B. Pitch selection process
The criteria for selecting pitches and the sparsity constraints for reducing false positives proposed in [4] are applied to both the proposed and baseline methods. Specifically, by setting the harmonic/subharmonic series to length four (i.e., the fundamental frequency/period and the first three harmonics/subharmonics) and with the sparsity constraint parameter $\delta$, a pitch $p_i$ is a true positive if

1) $\bar{z}^{(3)}[p_i],\ \bar{z}^{(3)}[p_i + 12],\ \bar{z}^{(3)}[p_i + 19],\ \bar{z}^{(3)}[p_i + 24] > 0$;

2) $\bar{z}^{(2)}[p_i],\ \bar{z}^{(2)}[p_i - 12],\ \bar{z}^{(2)}[p_i - 19],\ \bar{z}^{(2)}[p_i - 24] > 0$;

3) $\|\bar{\mathbf{z}}^{(3)}[p_i : p_i + 24]\| < \delta$ or $\|\bar{\mathbf{z}}^{(2)}[p_i - 24 : p_i]\| < \delta$,

where $p_i : p_i + 24$ means the indices from $p_i$ to $p_i + 24$. All pitch activations satisfying these conditions form a piano roll. Finally, a median filter with a length of 25 frames is applied to the resulting piano roll in order to smooth the note activations and prune notes with too short a duration.

Notice that, unlike [4], the rules for selecting missing fundamentals and stacked harmonics are not applied in this work. Experiments will show that the proposed method can properly capture the information of missing fundamentals without introducing these rules.

C. Datasets
We consider two MPE datasets in our evaluation. The first dataset, MAPS, is a widely used piano transcription dataset created by Emiya et al. [50]. It contains audio recordings played on a Yamaha Disklavier, an automatically playable piano, together with MIDI-aligned ground-truth pitch annotations. Following previous work [6], [7], [8], [47], [51], [52], we use the first 30 seconds of the 30 music pieces in the subset ENSTDkCl in our evaluation, totalling 15 minutes.

The second dataset, TRIOS, consists of five pieces of fully synthesized music in trio form [53]. Each piece contains piano and two other pitched instruments (the two instruments differ from piece to piece). One piece also has non-pitched percussion. The pitch annotations cover the three pitched instruments in each piece, whose lengths range from 17 to 53 seconds, totalling 3 minutes and 11 seconds.

(Footnote: http://c4dm.eecs.qmul.ac.uk/rdr/handle/123456789/27)

Fig. 4. F-scores on the MAPS and TRIOS datasets at different SNR levels of pink noise. (◦): proposed. (□): baseline.

D. Parameters and evaluation metrics
The sampling frequency for all source signals is 44.1 kHz. For the computation of the frame-level features, the Blackman-Harris window with a size of 0.18 sec is used, and the hop size is 0.01 sec. The pitch range considered in the evaluation is from A1 (55 Hz) to C7 (2093 Hz); both the ground truth and the experimental results are restricted to this range. Further, to assess the performance on missing fundamentals, the bass notes, defined as the notes below C3, and the treble notes, defined as the notes above C3, are also evaluated separately.

Following the standard of the MIREX Multiple-F0 estimation challenge, the performance of MPE is evaluated using the micro-average frame-level Precision (P), Recall (R), and F-score (F). After counting the number of true positives ($N_{tp}$), false positives ($N_{fp}$), and false negatives ($N_{fn}$) over all the frames within a dataset, the evaluation metrics are defined as $P = N_{tp}/(N_{tp} + N_{fp})$, $R = N_{tp}/(N_{tp} + N_{fn})$, and $F = 2PR/(P + R)$.

To evaluate the robustness to noise of both the baseline and proposed methods, we use the Audio Degradation Toolbox (ADT) [54] to add pink noise with a signal-to-noise ratio (SNR) ranging from 30 dB (least noisy) to 0 dB (most noisy) to every piece in the dataset.

V. RESULTS AND DISCUSSION
Table I lists the resulting precision, recall, and F-score evaluated on three sets of pitch ranges: All (A1 to C7), Bass (A1 to C3), and Treble (C3 to C7). For all pitch sets and datasets, the proposed method outperforms the baseline method in F-score. This implies an overall improvement from using the GCoS. For the Bass set, the F-score is improved by 1.24% in MAPS and by 1.39% in TRIOS. These improvements are larger than the ones for the All and Treble sets in the MAPS dataset. This fact indicates that the proposed feature can better recognize missing fundamentals, even without the hand-crafted rules [4] for selecting missing fundamentals. However, the improvement of the proposed method in either P or R behaves inconsistently across the two datasets. In the Bass set of MAPS, the proposed method improves R by 9.33% while degrading P by 24.37%. Conversely, in TRIOS, the proposed method improves P while degrading R. A possible reason for this difference is the property of the input data: TRIOS contains less low-frequency noise since it is constructed from synthetic data. As a result, there are fewer unwanted low-frequency terms, making more of the cross terms be true pitches, and therefore improving precision more than recall.

Fig. 4 shows that the improvement in F-score on noisy source data is much larger than the improvement on clean data. The lower the SNR, the larger the improvement of the proposed method. Specifically, when the SNR is lower than 10 dB, the improvement on both datasets is over 5%. This verifies the statement that the GCoS is more robust to noise, since it has one more layer of RECOS filters refining the features.

A. Discussion
The idea of using multiple Fourier transforms can be interpreted through one intuition: the Fourier transform of a periodic signal also has periodic (i.e., harmonic) patterns and, perceptually, the strength of such periodic patterns depends on an appropriate nonlinear scaling of the input; this nonlinear scaling function has some perceptual basis, such as the dB scale or Stevens' power law [55]. This intuition directly suggests promising future directions: applying other kinds of nonlinear activation functions, and stacking more than three layers of Fourier transforms, to construct the pitch salience function.

Another remaining issue is that the discussion in this paper does not yet include the learning aspect. In fact, the parameters γ_i cannot be learned by gradient descent and back-propagation, since the gradient of σ^(i) diverges at zero. Therefore, there is a need either to find a differentiable nonlinear activation function that could replace the role of the power function in MPE, or to investigate the potential of other optimization methods such as coordinate descent. Besides, since using the DFT matrix works well for MPE, it is also worth investigating a scenario in which a learning-based DNN for the MPE task has its network parameters initialized with a DFT or DCT matrix.

VI. CONCLUSIONS
This paper has presented a signal-processing perspective for better understanding why deep learning works, by demonstrating a new MPE algorithm that generalizes the concepts of homomorphic signal processing and DNN structures. Containing layers of DFT matrices to extract the periodic components, high-pass filters to discard non-periodic components, and nonlinear activation functions to eliminate the components negatively correlated with the DFT matrix, the algorithm has shown superior performance in detecting missing fundamentals and in handling noisy sources. Although the MPE task is merely one specific pattern recognition problem, this approach can still be viewed as an example of RECOS filter analysis for better understanding deep learning. With the positive experimental results, this paper takes one more step toward bridging the philosophical and methodological gaps between signal processing and deep learning, and provides new cues for demystifying and understanding deep learning in other pattern recognition tasks.

REFERENCES

[1] E. Benetos, S. Dixon, D. Giannoulis, H. Kirchhoff, and A. Klapuri, "Automatic music transcription: Breaking the glass ceiling," in ISMIR, 2012.
[2] ——, "Automatic music transcription: challenges and future directions," J. Intelligent Information Systems, vol. 41, no. 3, pp. 407–434, 2013.
[3] C. Yeh, A. Röbel, and X. Rodet, "Multiple fundamental frequency estimation and polyphony inference of polyphonic music signals," IEEE Trans. Audio, Speech, Lang. Proc., vol. 18, no. 6, pp. 1116–1126, 2010.
[4] L. Su and Y.-H. Yang, "Combining spectral and temporal representations for multipitch estimation of polyphonic music," IEEE/ACM Trans. Audio, Speech, Lang. Proc., vol. 23, no. 10, pp. 1600–1612, 2015.
[5] L. Su, T.-Y. Chuang, and Y.-H. Yang, "Exploiting frequency, periodicity and harmonicity using advanced time-frequency concentration techniques for multipitch estimation of choir and symphony," in ISMIR, 2016.
[6] E. Vincent, N. Bertin, and R. Badeau, "Adaptive harmonic spectral decomposition for multiple pitch estimation," IEEE Trans. Audio, Speech, Lang. Proc., vol. 18, no. 3, pp. 528–537, 2010.
[7] E. Benetos and S. Dixon, "A shift-invariant latent variable model for automatic music transcription," Computer Music Journal, vol. 36, no. 4, pp. 81–94, 2012.
[8] E. Benetos, S. Cherla, and T. Weyde, "An efficient shift-invariant model for polyphonic music transcription," in Proc. Int. Workshop on Machine Learning and Music, 2013.
[9] A. Cogliati, Z. Duan, and B. Wohlberg, "Context-dependent piano music transcription with convolutional sparse coding," IEEE/ACM Trans. Audio, Speech, Lang. Proc., vol. 24, no. 12, pp. 2218–2230, 2016.
[10] ——, "Piano transcription with convolutional sparse lateral inhibition," IEEE Signal Processing Letters, vol. 24, no. 4, pp. 392–396, 2017.
[11] K. Han and D. Wang, "Neural network based pitch tracking in very noisy speech," IEEE/ACM Trans. Audio, Speech, Lang. Proc., vol. 22, no. 12, pp. 2158–2168, 2014.
[12] S. Sigtia, E. Benetos, and S. Dixon, "An end-to-end neural network for polyphonic piano music transcription," IEEE/ACM Trans. Audio, Speech, Lang. Proc., vol. 24, no. 5, pp. 927–939, 2016.
[13] P. Verma and R. W. Schafer, "Frequency estimation from waveforms using multi-layered neural networks," in INTERSPEECH, 2016, pp. 2165–2169.
[14] R. Kelz, M. Dorfer, F. Korzeniowski, S. Böck, A. Arzt, and G. Widmer, "On the potential of simple framewise approaches to piano transcription," arXiv preprint arXiv:1612.05153, 2016.
[15] R. Kelz and G. Widmer, "An experimental analysis of the entanglement problem in neural-network-based music transcription systems," arXiv preprint arXiv:1702.00025, 2017.
[16] J. Bruna and S. Mallat, "Invariant scattering convolution networks," IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 35, no. 8, pp. 1872–1886, 2013.
[17] S. Mallat, "Understanding deep convolutional networks," Phil. Trans. R. Soc. A, vol. 374, no. 2065, p. 20150203, 2016.
[18] T. Wiatowski and H. Bölcskei, "A mathematical theory of deep convolutional neural networks for feature extraction," arXiv preprint arXiv:1512.06293, 2015.
[19] G. Montavon, S. Lapuschkin, A. Binder, W. Samek, and K.-R. Müller, "Explaining nonlinear classification decisions with deep Taylor decomposition," Pattern Recognition, vol. 65, pp. 211–222, 2017.
[20] J. Dai, Y. Lu, and Y.-N. Wu, "Generative modeling of convolutional neural networks," arXiv preprint arXiv:1412.6296, 2014.
[21] P. Mehta and D. J. Schwab, "An exact mapping between the variational renormalization group and deep learning," arXiv preprint arXiv:1410.3831, 2014.
[22] A. B. Patel, T. Nguyen, and R. G. Baraniuk, "A probabilistic theory of deep learning," arXiv preprint arXiv:1504.00641, 2015.
[23] N. Cohen, O. Sharir, and A. Shashua, "On the expressive power of deep learning: A tensor analysis," in Conference on Learning Theory, 2016, pp. 698–728.
[24] S. Bach, A. Binder, G. Montavon, F. Klauschen, K.-R. Müller, and W. Samek, "On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation," PLoS ONE, vol. 10, no. 7, p. e0130140, 2015.
[25] W. Samek, A. Binder, G. Montavon, S. Lapuschkin, and K.-R. Müller, "Evaluating the visualization of what a deep neural network has learned," IEEE Trans. Neural Networks and Learning Systems, 2017.
[26] C.-C. J. Kuo, "Understanding convolutional neural networks with a mathematical model," Journal of Visual Communication and Image Representation, vol. 41, pp. 406–413, 2016.
[27] ——, "The CNN as a guided multilayer RECOS transform [lecture notes]," IEEE Signal Processing Magazine, vol. 34, no. 3, pp. 81–89, 2017.
[28] T. Kobayashi and S. Imai, "Spectral analysis using generalized cepstrum," IEEE Trans. Acoust., Speech, Signal Proc., vol. 32, no. 5, pp. 1087–1089, 1984.
[29] A. de Cheveigné and H. Kawahara, "YIN, a fundamental frequency estimator for speech and music," J. Acoustical Society of America, vol. 111, no. 4, pp. 1917–1930, 2002.
[30] T. D. Rossing, F. R. Moore, and P. A. Wheeler, The Science of Sound. Addison Wesley, San Francisco, 2002, vol. 3.
[31] H. von Helmholtz, On the Sensations of Tone, 1877 (English translation by A. J. Ellis, 1885, 1954).
[32] A. de Cheveigné, "Separation of concurrent harmonic sounds: Fundamental frequency estimation and a time-domain cancellation model of auditory processing," J. Acoustical Society of America, vol. 93, no. 6, pp. 3271–3290, 1993.
[33] J. L. Goldstein, "An optimum processor theory for the central formation of the pitch of complex tones," J. Acoustical Society of America, vol. 54, no. 6, pp. 1496–1516, 1973.
[34] A. Oppenheim and R. Schafer, Discrete-Time Signal Processing, 3rd ed. Prentice Hall, 2009.
[35] B. P. Bogert, M. J. Healy, and J. W. Tukey, "The quefrency alanysis of time series for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum and saphe cracking," in Proc. Symposium on Time Series Analysis, vol. 15, 1963, pp. 209–243.
[36] A. V. Oppenheim and R. W. Schafer, "From frequency to quefrency: A history of the cepstrum," IEEE Signal Processing Magazine, vol. 21, no. 5, pp. 95–106, 2004.
[37] K. Tokuda, T. Kobayashi, T. Masuko, and S. Imai, "Mel-generalized cepstral analysis: a unified approach to speech spectral estimation," in Proc. Int. Conf. Spoken Language Processing, 1994.
[38] T. Tolonen and M. Karjalainen, "A computationally efficient multipitch analysis model," IEEE Trans. Speech Audio Processing, vol. 8, no. 6, pp. 708–716, 2000.
[39] S. Kraft and U. Zölzer, "Polyphonic pitch detection by iterative analysis of the autocorrelation function," in Proc. Int. Conf. Digital Audio Effects, 2014, pp. 1–8.
[40] H. Indefrey, W. Hess, and G. Seeser, "Design and evaluation of double-transform pitch determination algorithms with nonlinear distortion in the frequency domain: preliminary results," in Proc. IEEE Int. Conf. Acoust. Speech Signal Proc., 1985, pp. 415–418.
[41] A. Klapuri, "Multipitch analysis of polyphonic music and speech signals using an auditory model," IEEE Trans. Audio, Speech, Lang. Proc., vol. 16, no. 2, pp. 255–266, 2008.
[42] J. S. Lim, "Spectral root homomorphic deconvolution system," IEEE Trans. Acoust., Speech, Signal Proc., vol. 27, no. 3, pp. 223–233, 1979.
[43] P. Alexandre and P. Lockwood, "Root cepstral analysis: A unified view. Application to speech processing in car noise environments," Speech Communication, vol. 12, no. 3, pp. 277–288, 1993.
[44] L. Rabiner, M. Cheng, A. Rosenberg, and C. McGonegal, "A comparative performance study of several pitch detection algorithms," IEEE Trans. Acoust., Speech, Signal Proc., vol. 24, no. 5, pp. 399–418, 1976.
[45] G. Peeters, "Music pitch representation by periodicity measures based on combined temporal and spectral representations," in Proc. IEEE Int. Conf. Acoust. Speech Signal Proc., 2006.
[46] V. Emiya, B. David, and R. Badeau, "A parametric method for pitch estimation of piano tones," in Proc. IEEE Int. Conf. Acoust. Speech Signal Proc., 2007, pp. 249–252.
[47] C.-T. Lee, Y.-H. Yang, and H. H. Chen, "Multipitch estimation of piano music by exemplar-based sparse representation," IEEE Trans. Multimedia, vol. 14, no. 3, pp. 608–618, 2012.
[48] L. Su and Y.-H. Yang, "Resolving octave ambiguities: A cross-dataset investigation," in Int. Conf. Computer Music/Sound and Music Computing (ICMC/SMC), 2014, pp. 962–966.
[49] A. P. Klapuri, "Multiple fundamental frequency estimation based on harmonicity and spectral smoothness," IEEE Trans. Speech Audio Proc., vol. 11, no. 6, pp. 804–816, 2003.
[50] V. Emiya, R. Badeau, and B. David, "Multipitch estimation of piano sounds using a new probabilistic spectral smoothness principle," IEEE Trans. Audio, Speech, Lang. Proc., vol. 18, no. 6, pp. 1643–1654, 2010.
[51] N. Keriven, K. O'Hanlon, and M. D. Plumbley, "Structured sparsity using backwards elimination for automatic music transcription," in Proc. IEEE Int. Works. Machine Learning for Sig. Proc., 2013, pp. 1–6.
[52] K. O'Hanlon, H. Nagano, and M. D. Plumbley, "Structured sparsity for automatic music transcription," in Proc. IEEE Int. Conf. Acoust. Speech Signal Proc., 2012, pp. 441–444.
[53] J. Fritsch, "High quality musical audio source separation," Master's thesis, Queen Mary Centre for Digital Music, 2012.
[54] M. Mauch and S. Ewert, "The audio degradation toolbox and its application to robustness evaluation," in Proc. Int. Soc. Music Information Retrieval Conf., 2013, pp. 83–88, https://code.soundsoftware.ac.uk/projects/audio-degradation-toolbox (v0.2).
[55] S. S. Stevens, "On the psychophysical law," Psychological Review, vol. 64, no. 3, pp. 153–181, 1957.