Time-Frequency Audio Features for Speech-Music Classification
Mrinmoy Bhattacharjee, Student Member, IEEE, S.R.M. Prasanna, Senior Member, IEEE, and Prithwijit Guha, Member, IEEE

Mrinmoy Bhattacharjee, S.R.M. Prasanna and P. Guha are with the Dept. of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India. S.R.M. Prasanna is also with the Dept. of Electrical Engineering, Indian Institute of Technology Dharwad, Dharwad-580011, India. Email: {mrinmoy.bhattacharjee, prasanna, pguha}@iitg.ac.in.
Abstract – Distinct striation patterns are observed in the spectrograms of speech and music. This motivated us to propose three novel time-frequency features for speech-music classification. These features are extracted in two stages. First, a preset number of prominent spectral peak locations are identified from the spectra of each frame. The peak locations obtained from each frame are used to form spectral peak sequences (SPS) for an audio interval. In the second stage, these SPS are treated as time series of frequency locations. The proposed features are extracted as the periodicity, average frequency and statistical attributes of these spectral peak sequences. Speech-music categorization is performed by learning binary classifiers on these features. We have experimented with Gaussian mixture model, support vector machine and random forest classifiers. Our proposal is validated on four datasets and benchmarked against three baseline approaches. Experimental results establish the validity of our proposal.
Index Terms – Time-frequency audio features, speech-music classification, spectrogram, SVM
I. INTRODUCTION

Content-based audio indexing and retrieval applications often involve an important preprocessing step of segmenting and classifying audio signals into distinct categories. Apart from general environmental sounds, speech and music are two important audio categories. Preprocessing steps necessarily require classification algorithms that ensure homogeneity of the category in audio segments. This work focuses on proposing features for better discrimination of speech and music for such audio segmentation applications.

Researchers have observed several differences between speech and music signals. For example, pitch in speech usually exists over a limited span of octaves, whereas music consists of fundamental tones spanning a much wider range of octaves [1]. Also, specific frequency tones play an important part in the production of music. Hence, unlike speech, music is expected to have strict structures in the frequency domain [2]. Furthermore, short silences usually punctuate speech sound units [3], while music is generally continuous and without breaks (Figure 1). The literature on the classification of speech and music (CSM, henceforth) includes many studies that exploit such (and other) differences between them [4], [5]. We briefly review a few closely related works next.

Fig. 1: Spectrograms (time vs. frequency) of (a) speech and (b) music. Note the distinct striation patterns of speech and music. This observation motivated our proposal of time-frequency audio features for speech-music discrimination.

Table I lists the most widely used feature sets of the CSM literature. We have categorized these features into two groups, viz. Spectral Features and
Temporal Features. The most widely used features from the spectral group are the Zero-Crossing Rate (ZCR, henceforth) [2], Spectral Centroid, Spectral Roll-off and Spectral Flux [6]. Energy [7], Entropy [8] and Root Mean Square (RMS) [2] values are the most popular ones from the temporal group. Apart from these, a few works have used spectrograms as features and processed them as images. For example, the approach proposed by Mesgarani et al. [9] is inspired by auditory cortical processing and uses Gabor-like spectro-temporal response fields for feature extraction from the spectrogram. On the other hand, Neammalai et al. [7] performed thresholding and smoothing on standard spectrograms to form binary images and used them as features for classification. Existing works on speech-music classification have mostly employed Gaussian Mixture Models (GMM) [2], [10], [11], Artificial Neural Networks (ANN) [8], k-Nearest Neighbors (kNN) [12], [13], [14] and Support Vector Machines (SVM) [11], [6], [7] as classifiers. Recent works have also used deep learning techniques for this task [15], [16]. Most existing works have attempted to characterize speech or music using pure temporal and/or spectral features. We believe that time-frequency feature based representations are necessary for better speech-music classification. Our motivation for this proposal is described next.

Figure 1 shows the spectrograms of speech and music. In the case of speech, pitch and harmonics change slowly from one frame to another [17]. This leads to the formation of smooth arc-like patterns in its spectrogram. On the other hand, pitch and harmonics in music remain stationary for some finite duration before performing sharp transitions [18]. As such, music spectrograms contain patterns in the form of many horizontal line segments. These can be attributed to the following reasons.
Inertia of the speech production system – The speech production system possesses inertia [19], [20]. It requires a finite amount of time to change from one sound unit to another, leading to the formation of slowly changing striation patterns in the speech spectrogram. In contrast, individual notes of music have a specific onset instant, marked by a relatively large burst of energy, which makes the striation patterns of music discontinuous [21].

TABLE I: Most widely used audio features in speech vs. music classification literature.

Group: Spectral Features
  Features: ZCR, Spectral Centroid, Spectral Flux, Spectral Rolloff, MFCC, Chroma, Log Mel spectrum energy, Harmonic ratio, Modulation spectrum energy, Pitch
  Papers: [10], [6], [26], [8], [2], [27], [5], [28]

Group: Temporal Features
  Features: Energy, Entropy, RMS, Peak-to-Sidelobe Ratio (PSR) from the Hilbert envelope of the LP residual, Normalized Autocorrelation Peak Strength (NAPS) of the zero-frequency filtered signal
  Papers: [10], [6], [8], [7], [2], [25], [5], [29], [28]
Slowly decaying harmonics in music – Music tones decay slowly. Comparatively, the speech production system is a damped system in which sound decays quite fast [22], [23].
Range of sounds produced – A musical instrument produces only a fixed number of tones and their overtones. On the other hand, the speech production system generates a large number of intermediate frequencies while transitioning from one sound unit to another [24], [2].

The tempo-spectral properties of speech and music are thus quite distinct. Hence, features capturing joint variations in the temporal and spectral domains should be harnessed for efficient classification of speech and music. Existing works in this area have used combinations of temporal and spectral audio features [2], [6], [8], [10], [25] for achieving better performance.

We propose three new audio features capable of capturing the joint tempo-spectral characteristics of an audio segment. Peaks in the spectra of audio frames appear as striation patterns in spectrograms. Prominent spectral peaks having relatively higher amplitudes correspond to the brightest patterns in spectrograms. We believe that the frequency locations of such prominent peaks carry class-specific information. Accordingly, we compute the features in a two-stage approach. First, these prominent spectral peaks are identified in all frames of an audio interval. Second, the locations of the detected peaks across frames are treated as temporal sequences, defined as spectral peak sequences (SPS). The proposed features are derived as the zero crossing rate, periodicity and second order statistics of each SPS. Speech-music classification is performed by training classifiers on these features. The proposed scheme for feature extraction is described in further detail in Section II. We have benchmarked our proposal on four audio datasets and against three baseline approaches [10], [2], [6]. The results of our experiments are reported in Section III. Finally, we conclude in Section IV and sketch possible future extensions of the present proposal.

II. PROPOSED WORK
The audio segment x (x[n] ∈ R; n = 0, ..., N_s − 1) is divided into L overlapping frames x_l (l = 0, ..., L − 1) of size N_f. Let X_l[k] = Σ_{m=0}^{N_f−1} x_l[m] e^{−jk(2π/N_f)m} (k = 0, ..., N_f − 1) be the DFT of x_l. These frames (x_l) are sequences of real numbers. Hence, we consider only the first half of the DFT coefficients (i.e., X_l[k]; k = 0, ..., N_f/2 − 1) from each frame. The proposed features are extracted in two stages, which are described next.

The first stage identifies the important spectral peaks present in each frame of the audio interval. The frequency locations of all spectral peaks in the l-th frame are stored in a set H_l. This set is constructed as

H_l = { k : [X_l[k − 1] < X_l[k]] ∧ [X_l[k] > X_l[k + 1]] }    (1)

where 1 ≤ k < (N_f/2 − 1). The number of spectral peaks (|H_l|) varies from frame to frame. Thus, we retain at most p prominent spectral peaks from each frame to construct the truncated set

tH_l = { k_0^(l), k_1^(l), ..., k_{p−1}^(l) : X_l[k_0] ≥ X_l[k_1] ≥ ... ≥ X_l[k_{p−1}] }.

However, if |H_l| = q < p, then the last frequency location k_{q−1} is repeated p − q times to maintain uniformity in the cardinality of tH_l across all frames. The elements of tH_l are further sorted in descending order to construct the vector pH_l = [k_{(0)}^(l), k_{(1)}^(l), ..., k_{(p−1)}^(l)] with k_{(0)}^(l) ≥ k_{(1)}^(l) ≥ ... ≥ k_{(p−1)}^(l). These vectors (pH_l) are used to construct a p × L peak sequence matrix S_peak = [pH_0^T, ..., pH_{L−1}^T] for an audio interval. Each row of S_peak is defined as a Spectral Peak Sequence (SPS, henceforth). It is noteworthy that the first row of S_peak corresponds to the SPS with the highest frequency locations, and the last row corresponds to the one with the lowest frequency locations.
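For concreteness, the first stage can be sketched in Python/NumPy as below. This is a minimal sketch, not the authors' implementation (the paper uses MATLAB toolboxes); the function name, the frame parameters and the bin-0 fallback for peakless frames are our own assumptions, while p = 20 follows Section III.

```python
import numpy as np

def spectral_peak_sequences(x, frame_len=512, hop=256, p=20):
    """Stage 1 sketch: build the p x L spectral peak sequence matrix S_peak.

    Per frame: detect local maxima of the magnitude spectrum (H_l of
    Equation 1), retain the p largest-magnitude peak locations (tH_l,
    padded by repeating the last location when |H_l| < p), and sort them
    in descending frequency order (pH_l) to form one column of S_peak.
    """
    frames = [x[s:s + frame_len]
              for s in range(0, len(x) - frame_len + 1, hop)]
    columns = []
    for frame in frames:
        # First half of the DFT magnitudes (the frames are real-valued).
        mag = np.abs(np.fft.rfft(frame))[:frame_len // 2]
        # H_l: indices k with mag[k-1] < mag[k] and mag[k] > mag[k+1].
        k = np.arange(1, len(mag) - 1)
        peaks = k[(mag[k - 1] < mag[k]) & (mag[k] > mag[k + 1])]
        if peaks.size == 0:           # degenerate frame (e.g. pure silence):
            peaks = np.array([0])     # fall back to bin 0 (our assumption)
        # tH_l: keep the p peaks with the largest magnitudes ...
        top = peaks[np.argsort(mag[peaks])[::-1]][:p]
        # ... repeating the last retained location when fewer than p exist.
        if top.size < p:
            top = np.concatenate([top, np.full(p - top.size, top[-1])])
        # pH_l: sort the retained locations in descending frequency order.
        columns.append(np.sort(top)[::-1])
    return np.stack(columns, axis=1)  # S_peak, shape (p, L)
```

Row r of the returned matrix is the r-th SPS; row 0 carries the highest retained frequency locations, matching the ordering described above.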
In the second stage, the proposed features are extracted from S_peak. For notational convenience, the index r (0 ≤ r < p) will be used to refer to the r-th row of S_peak, i.e. the r-th SPS. Attributes derived from the r-th SPS will also be indexed by r. This work proposes three different features derived from the SPS. These are (a) SPS Periodicity (SPS-P, henceforth), (b) SPS Zero Crossing Rate (SPS-ZCR, henceforth), and (c) SPS Standard Deviation, Centroid and its Gradient (SPS-SCG, henceforth).

The following quantities are computed from the SPS for feature extraction. Let

µ_r = (1/L) Σ_{l=0}^{L−1} S_peak[r][l]

be the centroid frequency location of the r-th SPS. These centroid frequencies are used to construct the zero-centered SPS C_r such that C_r[l] = S_peak[r][l] − µ_r (l = 0, ..., L − 1). The auto-correlation sequence of C_r can be estimated as

A_r[τ] = (1/L) Σ_{l=0}^{L−1−τ} C_r[l] C_r[l + τ]

where τ = 0, ..., L′ (L′ = L/2 if L is even and (L + 1)/2 otherwise). One or more of these attributes are used to compute each of the proposed features.

SPS-Periodicity – It is well known that quasi-periodic voiced sounds constitute a major part of speech signals [30], [31]. In contrast, music is created by musicians with their personalized styles of arranging sound items from multiple instruments. Hence, music signals need not necessarily have a periodic nature. Figures 2(a)-(e) show the average trends in the autocorrelation sequences of different speech and music SPS estimated from the GTZAN dataset.
Fig. 2: Proposed features computed from the GTZAN dataset. (a)-(e) show the trend of the autocorrelation sequence A_r; speech A_r indicates the presence of periodicity. (f)-(j) show the SPS-ZCR distributions; speech in general has higher SPS-ZCR values than music. (k)-(m) show the values of µ_r, σ_r and ∆µ_r; speech and music show distinct trends. The plots represent averaged behavior over the GTZAN dataset. SPS-ZCR and A_r are shown only for five selected SPS of speech and music.

The presence of peaks (other than the first one) in the autocorrelation sequence of a signal indicates its periodicity. Such peaks are observed in the autocorrelation sequences of the SPS of speech, but not in those of music. This motivated us to exploit the periodicity of the SPS as a feature for speech-music discrimination. The periodicity of the r-th SPS is estimated using its auto-correlation sequence. The peak locations τ^(r) of A_r are detected (Equation 1) and stored in a set T_r = {τ_0^(r), τ_1^(r), ...} (|T_r| < L′) in ascending order. We compute the quantities ∆τ_u^(r) = τ_u^(r) − τ_{u−1}^(r) (u = 1, ..., |T_r| − 1). The variance V_r of these quantities {∆τ_u^(r)} provides an estimate of the periodicity of the r-th SPS. The feature SPS-P is constructed as a p-dimensional vector such that SPS-P = [V_0, ..., V_{p−1}].

SPS-Zero Crossing Rate – Audio signals are non-stationary. Thus, the spectral peaks in a certain SPS may correspond to different frequency locations within the spectra of the audio frames in an interval. Hence, without any loss of generality, we can assume that spectral peak sequences contain varying values. The zero crossing rate (ZCR) provides a gross estimate of the average frequency of time-series data [32]. We propose to compute the ZCR of each SPS to estimate its average frequency and use this as a feature for CSM. The ZCR (Z_r) of the r-th zero-centered SPS is computed as

Z_r = (1/2L) Σ_{l=1}^{L−1} | sgn(C_r[l]) − sgn(C_r[l − 1]) |

where sgn(·) is the signum function. The SPS-ZCR feature is constructed as a p-dimensional vector such that SPS-ZCR = [Z_0, ..., Z_{p−1}]. Figures 2(f)-(j) show the distributions of ZCR values for different SPS of speech and music. We observe that the ZCR of lower-frequency SPS (e.g. Z_0, Figure 2(f)) exhibits significant overlap between the ZCR distributions of the two classes. However, this overlap reduces as music SPS-ZCR values gradually decrease (compared to those of speech) for higher-frequency spectral peak sequences (Figures 2(g)-(j)). In general, speech SPS-ZCR values are higher than those of music, indicating that speech SPS vary more than those of music. Hence, this property can be exploited as a discriminator between the two classes.

SPS-Standard Deviation, Centroid and its Gradient – We believe that the frequency locations in any r-th SPS are category specific (i.e. either speech or music). This motivated us to propose a set of features based on the statistical properties of the spectral peak sequences. These statistical attributes include the centroid µ_r and the standard deviation

σ_r = sqrt( (1/L) Σ_{l=0}^{L−1} ( S_peak[r][l] − µ_r )^2 )

of the r-th SPS. Also, the rate of change of µ_r (with respect to r) exhibits distinct trends for speech and music. We compute the gradient ∆µ_r = (1/2)(µ_{r+1} − µ_{r−1}) to represent this trend. Thus, we propose the SPS-SCG feature as a 3p-dimensional vector given by SPS-SCG = [µ_0, ..., µ_{p−1}, σ_0, ..., σ_{p−1}, ∆µ_0, ..., ∆µ_{p−1}]. Here, ∆µ_0 = (µ_1 − µ_0) and ∆µ_{p−1} = (µ_{p−1} − µ_{p−2}). Figures 2(k)-(m) show the trends of the SPS-SCG features averaged over several audio intervals for both speech and music (GTZAN dataset).

The proposed features capture prominent spectral information in the first stage, and temporal variations are characterized in the second stage. Binary classifiers are learned on these proposed features. In this proposal, we have experimented with Gaussian mixture model (GMM), support vector machine (SVM) and random forest (RF) classifiers. The results of our experiments with these tempo-spectral features are presented next.
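Continuing the NumPy sketch above, the second stage could look as follows. Again, this is a sketch rather than the authors' MATLAB implementation; the zero fallback for sequences with too few autocorrelation peaks is our own assumption.

```python
def sps_features(S_peak):
    """Second-stage sketch: SPS-P, SPS-ZCR and SPS-SCG from S_peak (p x L)."""
    p, L = S_peak.shape
    mu = S_peak.mean(axis=1)                    # centroids mu_r
    C = S_peak - mu[:, None]                    # zero-centered SPS C_r
    sigma = np.sqrt((C ** 2).mean(axis=1))      # standard deviations sigma_r

    n_lags = L // 2 if L % 2 == 0 else (L + 1) // 2
    sps_p, sps_zcr = [], []
    for r in range(p):
        # Autocorrelation sequence A_r[tau], tau = 0 .. n_lags.
        A = np.array([np.dot(C[r, :L - t], C[r, t:]) / L
                      for t in range(n_lags + 1)])
        # T_r: peak locations of A_r; V_r is the variance of their spacings
        # (zero fallback when fewer than two peaks exist: our assumption).
        k = np.arange(1, A.size - 1)
        taus = k[(A[k - 1] < A[k]) & (A[k] > A[k + 1])]
        sps_p.append(np.var(np.diff(taus)) if taus.size > 1 else 0.0)
        # Z_r: zero crossing rate of the zero-centered sequence C_r.
        sps_zcr.append(np.abs(np.diff(np.sign(C[r]))).sum() / (2.0 * L))

    # Gradient of the centroids: central differences inside, one-sided at
    # the ends, matching the definition of delta-mu_r above.
    dmu = np.gradient(mu)
    return (np.array(sps_p),                    # SPS-P   (p-dim)
            np.array(sps_zcr),                  # SPS-ZCR (p-dim)
            np.concatenate([mu, sigma, dmu]))   # SPS-SCG (3p-dim)
```

For one interval, sps_features(spectral_peak_sequences(x)) then yields the three vectors that feed the classifiers.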
TABLE II: Performance (mean (variance) of F-scores) of the baseline approaches and of the individual proposed features on the GTZAN dataset. Additionally, the performances of early (SPS-EF) and late (SPS-LF) fusion of the proposed features are also presented. Experiments are performed with GMM, Random Forest and SVM; the classifier parameters are optimized by grid search. SPS-SCG with SVM has better performance compared to the baseline approaches and the other features.

Feature        GMM           Random Forest   SVM
Khonglah-FS    0.91 (0.02)   0.93 (0.02)     0.93 (–)
Sell-FS        0.94 (0.01)   0.95 (0.01)     0.95 (–)
MFCC           –             0.92 (0.02)     0.97 (–)
SPS-P          0.83 (0.05)   0.86 (0.04)     0.84 (–)
SPS-ZCR        0.81 (0.04)   0.84 (0.04)     0.87 (–)
SPS-SCG        0.93 (–)      –               –
SPS-EF         0.93 (0.02)   0.95 (0.01)     0.98 (–)
SPS-LF         0.91 (0.02)   0.95 (0.01)     0.92 (–)
Fig. 3: Performance (average F-score) of the baseline (MFCC, Khonglah-FS, Sell-FS) and proposed (SPS-P, SPS-ZCR, SPS-SCG) features on the Broadcast News, GTZAN, Movie and Scheirer-Slaney datasets, using the SVM classifier (with radial basis function kernel). Among the proposed features, SPS-SCG has the best performance on three out of four datasets.

III. EXPERIMENTS AND RESULTS
The proposed approach is validated on four datasets. These are (a) the GTZAN Music/Speech collection [33], (b) the Scheirer-Slaney Music-Speech Corpus [34], (c) the Movie dataset, and (d) the TV News Broadcast dataset. The latter two datasets are created by us and are available on request for non-commercial usage. The Movie dataset consists of clips of pure speech and pure music from old Bollywood movies. The TV News Broadcast dataset contains clips of speech and non-vocal music recorded from Indian English news channels.

Our proposal is benchmarked against the following three baseline approaches. The first is the method proposed by Khonglah et al. in [10] (Khonglah-FS). The authors propose that speech-specific features like the Normalized Autocorrelation Peak Strength of the Zero Frequency Filtered Signal, the Peak-to-Sidelobe Ratio from the Hilbert Envelope of the LP residual, Log-Mel Spectrum Energy and 4-Hz Modulation Energy are better at characterizing speech and are hence good discriminators from music. The second approach, proposed by Sell et al. [2] (Sell-FS), uses novel chroma-based features that represent music tonality for better speech-music classification. Third, MFCC coefficients [6] (MFCC) are considered as features, as these are widely used in most speech processing applications.

For all our experiments, we have chosen audio intervals of a fixed duration, from which short overlapping frames are drawn with a constant shift. Features are extracted from each audio interval. Accordingly, each audio interval is classified as either speech or music. The number of prominent peaks p is empirically selected and is set to p = 20 for all our experiments. We have used MATLAB toolboxes for realizing the GMM and RF based classifiers. The libSVM toolbox [35] is used for the SVM classifier with a radial basis function kernel. The classifier parameters are optimized by grid search. The training and test data are chosen in a ratio of 70:30.
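As a rough illustration of this evaluation protocol, the sketch below uses scikit-learn, whose SVC class wraps the same libSVM library used in the paper. The grid values, the feature scaling step, the 5-fold cross-validation and the synthetic placeholder data are all our own assumptions; the paper only states that the parameters are grid-searched.

```python
import numpy as np
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Placeholder data standing in for per-interval SPS-SCG vectors
# (3p = 60 for p = 20) and speech(0)/music(1) labels.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 60))
y = rng.integers(0, 2, size=200)

# 70:30 train/test split, as in the paper.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y)

# RBF-kernel SVM with parameters optimized by grid search over an
# illustrative (C, gamma) grid, scored by F1 to match the reported metric.
grid = GridSearchCV(
    make_pipeline(StandardScaler(), SVC(kernel="rbf")),
    param_grid={"svc__C": [1, 10, 100],
                "svc__gamma": ["scale", 0.01, 0.1]},
    scoring="f1", cv=5)
grid.fit(X_tr, y_tr)
print("test F1:", grid.score(X_te, y_te))
```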
The experiments are repeated over several independent trials; the means and variances of the F-scores across these trials are reported.

The performance of the baseline approaches and of the individual features from our proposal (on GTZAN only) is presented in Table II. SPS-P and SPS-ZCR fail to outperform the baseline approaches. However, SPS-SCG provides a significant improvement over the best baseline. Additionally, we have experimented with early and late feature fusion schemes for our proposal. However, no significant improvement was observed over the performance of SPS-SCG. The comparative performance analysis of the proposed features and baseline approaches (with SVM only) for all four datasets is shown in Figure 3. The SPS-SCG features with the SVM classifier provide the best performance on the GTZAN, Scheirer-Slaney and TV News Broadcast datasets, and the second best performance on the Movie dataset. Thus, the experimental results establish that the proposed features can effectively capture the time-frequency characteristics of speech and music while discriminating one from the other.

IV. CONCLUSION
This work proposes a novel two-stage feature extraction scheme for representing the time-frequency characteristics of an audio interval. In the first stage, we detect the frequency locations of p prominent spectral peaks for each frame in an audio interval. These peak locations are stored as the columns of a matrix S_peak. The rows of this matrix are defined as the p spectral peak sequences (SPS) that characterize the audio interval. The proposed features are computed in the second stage by treating each SPS as a temporal sequence. We estimate the periodicity (SPS-P), the ZCR (SPS-ZCR), and the standard deviation, centroid and its gradient (collectively, SPS-SCG) as features of each SPS. The performance of our proposal is benchmarked on four datasets and against three baseline approaches. The proposed features are deployed with GMM, SVM and Random Forest based classifiers. Among the proposed features, SPS-SCG (with SVM) performs better than the baseline approaches and the other features on three datasets.

The spectral peak sequences are prominent peak locations (integer values) of frame spectra. This feature can be extended to incorporate sequences of other attributes of frame spectra. The present work focuses on the ZCR, periodicity and a few statistical attributes of the spectral peak sequences. This can be further enhanced by considering other temporal sequence features. The proposed features are applied here to the domain of speech-music classification. This work can be extended to deploy an enhanced set of these features for effective discrimination of speech, music and multiple categories of environmental sounds.

REFERENCES

[1] J. Saunders, "Real-time discrimination of broadcast speech/music," in Proc. IEEE ICASSP, vol. 2, May 1996, pp. 993–996.
[2] G. Sell and P. Clark, "Music tonality features for speech/music discrimination," in Proc. IEEE ICASSP, May 2014, pp. 2489–2493.
[3] C. Panagiotakis and G. Tziritas, "A speech/music discriminator based on RMS and zero-crossings," IEEE Transactions on Multimedia, vol. 7, no. 1, pp. 155–166, Feb. 2005.
[4] V. A. Masoumeh and M. B. Mohammad, "A review on speech-music discrimination methods," International Journal of Computer Science and Network Solutions, vol. 2, Feb. 2014.
[5] Y. Lavner and D. Ruinskiy, "A decision-tree-based algorithm for speech/music classification and segmentation," EURASIP Journal on Audio, Speech, and Music Processing, vol. 2009, no. 1, p. 239892, Jun. 2009.
[6] E. Mezghani, M. Charfeddine, C. B. Amar, and H. Nicolas, "Multifeature speech/music discrimination based on mid-term level statistics and supervised classifiers," Nov. 2016, pp. 1–8.
[7] P. Neammalai, S. Phimoltares, and C. Lursinsap, "Speech and music classification using hybrid form of spectrogram and fourier transformation," in Signal and Information Processing Association Annual Summit and Conference (APSIPA), 2014 Asia-Pacific, Dec. 2014, pp. 1–6.
[8] M. Srinivas, D. Roy, and C. K. Mohan, "Learning sparse dictionaries for music and speech classification," Aug. 2014, pp. 673–675.
[9] N. Mesgarani, M. Slaney, and S. A. Shamma, "Discrimination of speech from nonspeech based on multiscale spectro-temporal modulations," IEEE Transactions on Audio, Speech, and Language Processing, vol. 14, no. 3, pp. 920–930, May 2006.
[10] B. K. Khonglah and S. Mahadeva Prasanna, "Speech/music classification using speech-specific features," Digital Signal Processing, vol. 48, pp. 71–83, Jan. 2016.
[11] H. Zhang, X.-K. Yang, W. Q. Zhang, W.-L. Zhang, and J. Liu, "Application of i-vector in speech and music classification," Dec. 2016, pp. 1–5.
[12] J. G. A. Barbedo and A. Lopes, "A robust and computationally efficient speech/music discriminator," J. Audio Eng. Soc., vol. 54, no. 7/8, pp. 571–588, 2006.
[13] E. Alexandre-Cortizo, M. Rosa-Zurera, and F. Lopez-Ferreras, "Application of Fisher linear discriminant analysis to speech/music classification," in EUROCON 2005 - The International Conference on "Computer as a Tool", vol. 2, Nov. 2005, pp. 1666–1669.
[14] J. J. Burred and A. Lerch, "Hierarchical automatic audio signal classification," J. Audio Eng. Soc., vol. 52, no. 7/8, pp. 724–739, 2004.
[15] A. Kruspe, D. Zapf, and H. Lukashevich, "Automatic speech/music discrimination for broadcast signals," in INFORMATIK 2017, M. Eibl and M. Gaedke, Eds. Gesellschaft für Informatik, Bonn, 2017, pp. 151–162.
[16] A. Pikrakis and S. Theodoridis, "Speech-music discrimination: A deep learning perspective," Sept. 2014, pp. 616–620.
[17] Y. Xu and X. Sun, "Maximum speed of pitch change and how it may relate to speech," The Journal of the Acoustical Society of America, vol. 111, no. 3, pp. 1399–1413, 2002.
[18] J. F. Alm and J. S. Walker, "Time-frequency analysis of musical instruments," SIAM Review, vol. 44, no. 3, pp. 457–476, Aug. 2002.
[19] K. S. R. Murty and B. Yegnanarayana, "Epoch extraction from speech signals," IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 8, pp. 1602–1613, Nov. 2008.
[20] Z. Zhang, "Mechanics of human voice production and control," The Journal of the Acoustical Society of America, vol. 140, no. 4, pp. 2614–2635, 2016.
[21] J. P. Bello, L. Daudet, S. Abdallah, C. Duxbury, M. Davies, and M. B. Sandler, "A tutorial on onset detection in music signals," IEEE Transactions on Speech and Audio Processing, vol. 13, no. 5, pp. 1035–1047, Sept. 2005.
[22] J. Meyer, Structure of Musical Sound. New York, NY: Springer New York, 2009, pp. 23–44.
[23] L. Oller and S. Ternström, Analysis of Voice Signals for the Harmonics-to-Noise Crossover Frequency. KTH Royal Institute of Technology, School of Computer Science and Communication, Department of Speech, Music and Hearing, 2008.
[24] B. K. Khonglah and S. R. M. Prasanna, "Low frequency region of vocal tract information for speech/music classification," Nov. 2016, pp. 2593–2597.
[25] C. Lim and J.-H. Chang, "Enhancing support vector machine-based speech/music classification using conditional maximum a posteriori criterion," IET Signal Processing, vol. 6, no. 4, pp. 335–340, June 2012.
[26] B. K. Khonglah and S. R. M. Prasanna, "Speech/music classification using vocal tract constriction aspect of speech," Dec. 2015, pp. 1–6.
[27] A. Gallardo-Antolin and J. M. Montero, "Histogram equalization-based features for speech, music, and song discrimination," IEEE Signal Processing Letters, vol. 17, no. 7, pp. 659–662, July 2010.
[28] A. Pikrakis, T. Giannakopoulos, and S. Theodoridis, "A speech/music discriminator of radio recordings based on dynamic programming and Bayesian networks," IEEE Transactions on Multimedia, vol. 10, no. 5, pp. 846–857, Aug. 2008.
[29] J. H. Song, K. H. Lee, J. H. Chang, J. K. Kim, and N. S. Kim, "Analysis and improvement of speech/music classification for 3GPP2 SMV based on GMM," IEEE Signal Processing Letters, vol. 15, pp. 103–106, 2008.
[30] A. Biswas, P. K. Sahu, A. Bhowmick, and M. Chandra, "Feature extraction technique using ERB like wavelet sub-band periodic and aperiodic decomposition for TIMIT phoneme recognition," International Journal of Speech Technology, vol. 17, no. 4, pp. 389–399, Dec. 2014.
[31] H. Kawahara, M. Morise, R. Nisimura, and T. Irino, "Higher order waveform symmetry measure and its application to periodicity detectors for speech and singing with fine temporal resolution," May 2013, pp. 6797–6801.
[32] D. S. Shete and S. B. Patil, "Zero crossing rate and energy of the speech signal of Devanagari script," vol. 4, Jan. 2014, pp. 01–05.
[33] G. Tzanetakis and P. Cook, "Musical genre classification of audio signals," IEEE Transactions on Speech and Audio Processing, vol. 10, no. 5, pp. 293–302, Jul. 2002.
[34] E. Scheirer and M. Slaney, "Construction and evaluation of a robust multifeature speech/music discriminator," in Proc. IEEE ICASSP, vol. 2, Apr. 1997, pp. 1331–1334.
[35] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Transactions on Intelligent Systems and Technology, vol. 2, no. 3, pp. 27:1–27:27, 2011.