A Subband-Based SVM Front-End for Robust ASR
Jibran Yousafzai, Zoran Cvetkovic, Peter Sollich, Matthew Ager
aa r X i v : . [ c s . C L ] D ec A Subband-Based SVM Front-End for Robust ASR
Jibran Yousafzai † , Member, IEEE
Zoran Cvetkovi´c † , Senior Member, IEEE
Peter Sollich ‡ Matthew Ager ‡ Abstract — This work proposes a novel support vector machine (SVM)based robust automatic speech recognition (ASR) front-end that operateson an ensemble of the subband components of high-dimensional acousticwaveforms. The key issues of selecting the appropriate SVM kernels forclassification in frequency subbands and the combination of individualsubband classifiers using ensemble methods are addressed. The proposedfront-end is compared with state-of-the-art ASR front-ends in terms ofrobustness to additive noise and linear filtering. Experiments performedon the TIMIT phoneme classification task demonstrate the benefits of theproposed subband based SVM front-end: it outperforms the standardcepstral front-end in the presence of noise and linear filtering forsignal-to-noise ratio (SNR) below 12-dB. A combination of the proposedfront-end with a conventional front-end such as MFCC yields furtherimprovements over the individual front ends across the full range ofnoise levels.
Index Terms —Speech recognition, robustness, subbands, support vec-tor machines.
I. INTRODUCTIONAUTOMATIC speech recognition (ASR) systems suffer severeperformance degradation in the presence of environmental distortions,in particular additive and convolutive noise. Humans, on the otherhand, exhibit a very robust behavior in recognizing speech evenin extremely adverse conditions. The central premise behind thedesign of state-of-the-art ASR systems is that combining front-ends based on the non-linear compression of speech, such as Mel-Frequency Cepstral Coefficients (MFCC) [1] and Perceptual LinearPrediction (PLP) coeffcients [2], with appropriate language andcontext modelling techniques can bring the recognition performanceof ASR close to humans. However, the effectiveness of context andlanguage modelling depends critically on the accuracy with whichthe underlying sequence of elementary phonetic units is predicted [3],and this is where there are still significant performance gaps betweenhumans and ASR systems. Humans recognize isolated speech unitsabove the level of chance already at − -dB SNR, and significantlyabove it at − -dB SNR [4]. At such high noise levels, humanspeech recognition performance exceeds that of the state-of-the-art ASR systems by over an order of magnitude. Even in quietconditions, the machine phone error rates for nonsense syllables aresignificantly higher than human error rates [3, 5–7]. Although thereare a number of factors preventing the conventional ASR systems toreach the human benchmark, several studies [7–12] have attributedthe marked difference between human and machine performance tothe fundamental limitations of the ASR front-ends. These studiessuggest that the large amount of redundancy in speech signals, whichis removed in the process of the extraction of cepstral features suchas Mel-Frequency Cepstral Coefficients (MFCC) [1] and PerceptualLinear Prediction (PLP) coefficients [2], is in fact needed to copewith environmental distortions. Among these studies, the work onhuman speech perception [7, 9, 11, 12] has shown explicitly thatthe information reduction that takes place in the conventional front-ends leads to a severe degradation in human speech recognitionperformance and, furthermore, that in noisy environments there is The authors are with the Department of Informatics † and theDepartment of Mathematics ‡ at King’s College London (e-mail: { jibran.yousafzai, zoran.cvetkovic, peter.sollich, matthew.ager } @kcl.ac.uk). a high correlation between human and machine errors in recognitionof speech with distortions introduced by typical ASR front-endprocessing. Over the years, techniques such as cepstral mean-and-variance normalization (CMVN) [13, 14], vector Taylor series (VTS)compensation [15] and ETSI advanced front-end (AFE) [16] havebeen developed that aim to explicitly reduce the effects of noise on theshort-term spectra in order to make the ASR front-ends less sensitiveto noise. 
However, the distortion of the cepstral features caused byadditive noise and linear filtering critically depends on the speechsignal, filter characteristics, noise type and noise level in a complexfashion that makes effective feature compensation or adaptation veryintricate and not sufficiently effective [14].In our previous work we showed that using acoustic waveformsdirectly, without any compression or nonlinear transformation canimprove the robustness of ASR front-ends to additive noise [17].In this paper, we propose an ASR front-end derived from thedecomposition of speech into its frequency subbands, to achieveadditional robustness to additive noise as well as linear filtering.This approach draws its motivation primarily from the experimentsconducted by Fletcher [18], which suggest that the human decodingof linguistic messages is based on decisions within narrow frequencysubbands that are processed quite independently of each other. Thisreasoning further implies that accurate recognition in any subbandshould result in accurate recognition overall, regardless of the errorsin other subbands. While this theory has not been proved and somestudies on the subband correlation of speech signals [19, 20] haveeven put its validity into question, there are some technical reasonsfor considering classification in frequency subbands. First of all,decomposing speech into its frequency subbands can be beneficialsince it allows a better exploitation of the fact that certain subbandsmay inherently provide better separation of some phoneme classesthan others. Secondly, the effect of wideband noise in sufficientlynarrow subbands can be approximated as that of narrowband whitenoise and thus make the compensation of features be approximatelyindependent of the spectral characteristics of the additive noiseand linear filtering. Moreover, appropriate ensemble methods foraggregation of the decisions in individual frequency subbands canfacilitate selective de-emphasis of unreliable information, particularlyin the presence of narrowband noise.Previously, the subband approach has been used in [21–27] whichresulted in marginal improvements in recognition performance overits full band counterparts. Note that the front-ends employed in theprevious works were the subband-based variants of cepstral featuresor multi-resolution cepstral features. By contrast, our proposed front-end features are extracted from an ensemble of subband componentsof high-dimensional acoustic waveforms and thus retain more in-formation about speech that is potentially relevant to discriminationof phonetic units than the corresponding cepstral representations.In addition to investigation of robustness of the proposed front-endto additive noise, we also assess its robustness to linear filteringdue to room reverberation. This form of distortion causes temporalsmearing of short-term spectra which degrades the performance ofASR systems. This can be attributed primarily to the use of analysiswindows for feature extraction in the conventional front-ends such asMFCC that are much shorter than typical room impulse responses. Furthermore, the distortion caused by linear filtering is correlatedwith the underlying speech signal. Hence, conventional methods forrobust ASR that are tuned for recognition of data corrupted byadditive noise only will not be effective in reverberant environments.Several speech dereverberation techniques that rely on multi-channelrecordings of speech such as [28, 29] exist in the literature. 
However,these consideration extend beyond the scope of this paper and instead,standard single channel feature compensation methods for additivenoise and linear filtering such as VTS and CMVN compensation areused throughout this paper.Robustness of the proposed front-end to additive noise and linearfiltering is demonstrated by its comparison with the MFCC front-end on a phoneme classification task; this task remains important incomparing different methods and representations [21, 30–40]. Theimprovements achieved on the classification task can be expected toextend to continuous speech recognition tasks [41, 42] as SVMs havebeen employed in hybrid frameworks [42, 43] with hidden Markovmodels (HMMs) as well as in frame-based architectures using thetoken passing algorithm [44] for recognition of continuous speech.The results demonstrate the benefits of the subband classification interms of robustness to additive noise and linear filtering. The subband-waveform classifiers outperform even the MFCC classifiers trainedand tested under matched conditions for signal-to-noise ratios below6-dB. Furthermore, in classifying noisy reverberant speech, the sub-band classifier outperforms the MFCC classifier compensated usingVTS for all signal-to-noise ratios (SNRs) below a crossover pointbetween 12-dB and 6-dB. Finally, their convex combination yieldsfurther performance improvements over both individual classifiers.This paper is organized as follows: the proposed subband classifi-cation approach is described in Section II. Experimental results thatdemonstrate its robustness to additive noise and linear filtering arepresented in Section III. Finally, Section IV draws some conclusionsand suggests future directions of this work towards application of theproposed front-end in continuous speech recognition tasks.II. SUBBAND CLASSIFICATION USING SUPPORTVECTOR MACHINESSupport vector machines (SVMs) are receiving increasing attentionas a tool for speech recognition applications due to their goodgeneralization properties [17, 35, 42, 43, 45–48]. Here we use them inconjunction with the proposed subband-based representation aimingto improve the robustness of the standard cepstral front-end to noiseand filtering. To this end we construct a fixed-length representationthat could potentially be used as the front-end for a continuous speechrecognition systems based on e.g. hidden Markov models (HMMs)[42–44]. Dealing with variable phoneme length has been addressedby means of generative kernels such as Fisher kernels [47, 49] anddynamic time-warping kernels [50], but lies beyond the scope of thispaper. Hence, the features of the proposed front-end are derived fromfixed-length segments of acoustic waveforms of speech and theseare studied in comparison with the MFCC features derived from thesame speech segments. Several possible extensions of the proposedfront-end for application in continuous speech recognition tasks arehighlighted in Section IV and will be investigated in a future study.
A. Support Vector Machines
A binary SVM classifier estimates a decision surface that jointlymaximizes the margin between the two classes and minimizes themisclassification error on the training set. For a given training set ( x , . . . , x p ) with corresponding class labels ( y , . . . , y p ) , y i ∈{ +1 , − } , an SVM classifies a test point x by computing a score function, h ( x ) = p X i =1 α i y i K ( x , x i ) + b (1)where α i is the Lagrange multiplier corresponding to the i th trainingsample, x i , b is the classifier bias – these are optimized duringtraining – and K is a kernel function. The class label of x is thenpredicted as sgn ( h ( x )) . While the simplest kernel K ( x , ˜ x ) = h x , ˜ x i produces linear decision boundaries, in most real classification tasks,the data is not linearly separable. Nonlinear kernel functions implic-itly map data points to a high-dimensional feature space where thedata could potentially be linearly separable. Kernel design is thereforeeffectively equivalent to feature-space selection, and using an appro-priate kernel for a given classification task is crucial. Commonlyused is the polynomial kernel, K p ( x , ˜ x ) = (1 + h x , ˜ x i ) Θ , wherethe polynomial order Θ in K p is a hyper-parameter that is tunedto a particular classification problem. More sophisticated kernels canbe obtained by various combinations of basic SVM kernels. Herewe use a polynomial kernel for classification with cepstral features(MFCC) whereas classification with acoustic waveforms in frequencysubbands is performed using a custom-designed kernel described inthe following.For multiclass problems, binary SVMs are combined via error-correcting output codes (ECOC) methods [51, 52]. In this work,for an M -class problem we train N = M ( M − / binarypairwise classifiers, primarily to lower the computational complexityby training on only the relevant two classes of data. The trainingscheme can be captured in a coding matrix w mn ∈ { , , − } , i.e. classifier n is trained only on data from the two classes m for which w mn = 0 , with sgn( w mn ) as the class label. One then predicts fortest input x the class that minimizes the loss P Nn =1 χ ( w mn f n ( x )) where f n ( x ) is the output of the n th binary classifier and χ is aloss function. We experimented with a variety of loss functions,including hinge, Hamming, exponential and linear. The hinge lossfunction χ ( z ) = max(1 − z, performed best and is therefore usedthroughout.For classification in frequency subbands, each waveform x isprocessed through an S -channel maximally-decimated perfect recon-struction cosine modulated filter bank (CMFB) [53] and decomposedinto its subband components, x s , s = 1 , . . . , S . Several othersubband decompositions such as discrete wavelet transform, waveletpacket decomposition and discrete cosine transform also achievedcomparable, albeit somewhat inferior performance. A summary of theclassification results obtained with different subband decompositionsin quiet conditions is presented in Section III-B. The CMFB consistsof a set of orthonormal analysis filters g s [ k ] = 1 √ S g [ k ] cos (cid:18) s − S (2 k − S − π (cid:19) ,s = 1 , . . . , S, k = 1 , . . . , S, (2)where g [ k ] = √ π ( k − . / S ) , k = 1 , . . . , S , is a low-pass prototype filter. Such a filter bank implements an orthogonaltransform, hence the collection of the subband components is arepresentation of the original waveform in a different coordinatesystem [53]. 
The subband components x s [ n ] are thus given by x s [ n ] = X k x [ k ] g s [ nS − k ] . (3)A maximally-decimated filter bank was chosen primarily becausethe sub-sampling operation avoids introducing additional unneces-sary redundancies and thus limits the overall computational burden.However, we believe that redundant expansions of speech signals obtained using over-sampled filter banks could be advantageous toeffectively account for the shift invariance of speech.For classification in frequency subbands, an SVM kernel is con-structed by partly following steps from our previous work [17],which attempted to capture known invariances or express explicitlythe waveform qualities which are known to correlate with phonemeidentity. First, an even kernel is constructed from a baseline polyno-mial kernel K p to account for the sign-invariance of human speechperception as K e ( x s , x si ) = K ′ p ( x s , x si ) + K ′ p ( x s , − x si ) (4)where K ′ p is a modified polynomial kernel given by K ′ p ( x s , x si ) = K p (cid:18) x s k x s k , x si k x si k (cid:19) = (cid:18) (cid:28) x s k x s k , x si k x si k (cid:29)(cid:19) Θ . (5)Kernel K ′ p , which acts on normalized input vectors, will be usedas a baseline kernel for the acoustic waveforms. On the other hand,the standard polynomial kernel K p is used for classification withthe cepstral representations where feature standardization by cepstralmean-and-variance normalization (CMVN) [13] already ensures thatfeature vectors typically have unit norm.Next, the temporal dynamics of speech are explicitly taken intoaccount by means of features that capture the evolution of energyin individual subbands. To obtain these features, each subbandcomponent x s is first divided into T frames, x t,s , t = 1 , . . . , T ,and then a vector of their energies ω s is formed as, ω s = (cid:20) log (cid:13)(cid:13) x ,s (cid:13)(cid:13) , . . . , log (cid:13)(cid:13)(cid:13) x T,s (cid:13)(cid:13)(cid:13) (cid:21) . Finally, time differences [54] of ω s are evaluated to form thedynamic subband feature vector Ω s as Ω s = (cid:2) ω s ∆ ω s ∆ ω s (cid:3) .This dynamic subband feature vector Ω s is then combined with thecorresponding acoustic waveform subband component x s formingkernel K Ω given by K Ω ( x s , x si , Ω s , Ω si ) = K e ( x s , x si ) K p ( Ω s , Ω si ) , (6)where Ω si is the dynamic subband feature vector corresponding tothe s th subband component x si of the i -th training point x i . B. Ensemble Methods
For each binary classification problem, decomposing an acousticwaveform into its subband components produces an ensemble of S classifiers. The decision of the subband classifiers in the ensemble,given by f s ( x s , Ω s ) = X i α si y i K Ω ( x s , x si , Ω s , Ω si ) + b s , s = 1 , . . . , S (7)are then aggregated using ensemble methods to obtain the binaryclassification decision for a test waveform x . Here α si and b s arethe Lagrange multiplier corresponding to x si and the bias of the s th subband binary classifier.
1) Uniform Aggregation:
Under a uniform aggregation scheme,the decisions of the subband classifiers in the ensemble are assigneduniform weights. Majority voting is the simplest uniform aggregationscheme commonly used in machine learning. In our context it isequivalent to forming a meta-level score function as h ( x ) = S X s =1 sgn ( f s ( x s , Ω s )) , (8)then predicting the class label as y = sgn ( h ( x )) . In addition tothis conventional majority voting scheme, which maps the scores in individual subbands to the corresponding class labels ( ± ), we alsoconsidered various smooth squashing functions, e.g. sigmoidal, asalternatives to the sgn function in (8), and obtained similar results.To gain some intuition about the potential of ensemble methods suchas the majority voting in improving the classification performance,consider the ideal case when the errors of the individual subbandclassifiers in the ensemble are independent with error probability p < / . Under these conditions, a simple combinatorial argument showsthat the error probability p e of the majority voting scheme is givenby p e = S X s = ⌈ S/ ⌉ Ss ! p s (1 − p ) S − s . (9)where the largest contribution to the overall error is due to term with s = ⌈ S/ ⌉ . For a large ensemble cardinality S , this error probabilitycan be bounded as: p e < p ⌈ S/ ⌉ (1 − p ) S −⌈ S/ ⌉ S X s = ⌈ S/ ⌉ Ss ! ≈
12 (4 p (1 − p )) S/ . (10)Therefore, in ideal conditions, the ensemble error decreases expo-nentially in S even with this simple aggregation scheme [55, 56].However, it has been shown that there exists a correlation between thesubband components of speech and the resulting speech recognitionerrors in individual frequency subbands [19, 20]. As a result, themajority voting scheme may not yield considerable improvements inthe classification performance. Furthermore, the uniform aggregationschemes also suffer from a major drawback; they do not exploitthe differences in the relative importance of individual subbands indiscriminating among specific pairs of phonemes. To remedy this,we use stacked generalization [57] as discussed next, to explicitlylearn weighting functions specific to each pair of phonemes for non-uniform aggregation of the outputs of base-level SVMs.
2) Stacked Generalization:
Our practical implementation ofstacked generalization [57] consists of a hierarchical two-layer SVMarchitecture, where the outputs of subband base-level SVMs areaggregated by a meta-level linear SVM. The decision function ofthe meta-level SVM classifier is of the form h ( x ) = h f ( x ) , w i + v = X s w s f s ( x s , Ω s ) + v , (11)where f ( x ) = (cid:2) f ( x , Ω ) , . . . , f S ( x S , Ω S ) (cid:3) is the base-level SVMscore vector of the test waveform x , v is the classifier bias, and w = (cid:2) w , . . . , w S (cid:3) is the weight vector of the meta-level classifier.Note each of the binary classifiers will have its specific weight vectordetermined from an independent development/validation set { ˜ x j , ˜ y j } .Each weight vector can, therefore, be expressed as w = X j β j ˜ y j f (˜ x j ) , (12)where f (˜ x j ) = h f (˜ x j , ˜ Ω j ) , . . . , f S (˜ x Sj , ˜ Ω Sj ) i is the base-levelSVM score vector of the training waveform ˜ x j , and β j and ˜ y j are the Lagrange multiplier and class label corresponding to f (˜ x j ) ,respectively. While a base-level SVM assigns a weight to eachsupporting feature vector, stacked generalization effectively assignsan additional weight w s to each subband based on the performanceof the corresponding base-level subband classifier. Again, ECOCmethods are used to combine the meta-level binary classifiers formulticlass classification.An obvious advantage of the subband approach for ASR isthat the effect of environmental distortions in sufficiently narrowsubbands can be approximated as similar to that of a narrow-band white noise. This, in turn, facilitates the compensation of features to be independent of the spectral characteristics of theadditive and convolutive noise sources. In a preceding paper [17],we proposed an ASR front-end based on the full-band acousticwaveform representation of speech where a spectral shape adaptationof the features was performed in order to account for the varyingstrength of contamination of the frequency components due to thepresence of colored noise. In this work, compensation of the featuresis performed using standard approaches such as cepstral mean-and-variance normalization (CMVN) and vector Taylor series (VTS)methods which do not require any prior knowledge of the additiveand convolutive noise sources. Furthermore, we found that the stackedgeneralization also depends on the level of noise contaminating itstraining data. To this end, the weight vectors corresponding to thestacked classifiers can be tuned for classification in a particularenvironment by introducing similar distortion to its training data.In scenarios where a performance gain over a wide range of SNRsis desired, a multi-style training approach that offers a reasonablecompromise between various test conditions can also be employed.For instance, a meta-level classifier can be trained using the scorefeature vectors of noisy data or the score feature vectors of a mixtureof clean and noisy data.Note that since the dimension of the score feature vectors thatform the input to the stacked subband classifier ( S ) is very smallcompared to the typical MFCC or waveform feature vectors, onlya very limited amount of data is required to learn optimal weightsof the meta-level classifiers. As such, stacked generalization offersflexibility and some coarse frequency selectivity for the individualbinary classification problems, and can be particularly useful in de-emphasizing information from unreliable subbands. 
The experimentspresented in this paper show that the subband approach attains majorgains in classification performance over its full-band counterpart [17]as well as the state-of-the-art front-ends such as MFCC.III. EXPERIMENTAL RESULTS A. Experimental Setup
Experiments are performed on the ‘si’ (diverse) and ‘sx’ (compact)sentences of the TIMIT database [58]. The training set consistsof 3696 sentences from 168 different speakers. For testing we usethe core test set which consists of 192 sentences from 24 differentspeakers not included in the training set. The development set consistsof 1152 sentences uttered by 96 male and 48 female speakers notincluded in either the training or the core test set, with speakersfrom 8 different dialect regions. In training the meta-level subbandclassifiers, we use a small subset, randomly selecting an eighth ofthe data points in the complete TIMIT development set. The glottalstops /q/ are removed from the class labels and certain allophonesare grouped into their corresponding phoneme classes using thestandard Kai-Fu Lee clustering [59], resulting in a total of M = 48 phoneme classes and N = M ( M − / classifiers.Among these classes, there are 7 groups for which the contribution ofwithin-group confusions toward multiclass error is not counted, againfollowing standard practice [35, 59]. Initially, we experimented withdifferent values of the hyperparameters for the binary SVM classifiersbut decided to use fixed values for all classifiers as parameteroptimization had a large computational overhead but only a smallimpact on the multiclass classification error: the degree of K p is setto Θ = 6 and the penalty parameter (for slack variables in the SVMtraining algorithm) to C = 1 .To test the classification performance in noise, each TIMIT testsentence is normalized to unit energy per sample and then a noisesequence is added to the entire sentence to set the sentence-levelSNR. Hence for a given sentence-level SNR, signal-to-noise ratio at π rad/sample) M agn i t ude ( d B ) π rad/sample) P ha s e ( x π r ad i an s ) R(z) (Coloration=−0.5dB)R ′ (z) (Coloration=−0.9dB) Fig. 1:
Frequency response of the ICSI conference room filters withspectral coloration -0.5-dB and -0.9-dB. Here, spectral coloration isdefined as the ratio of the geometric mean to the arithmetic mean ofspectral magnitudes. R ( z ) is used to add reverb to the test data whereas R ′ ( z ) , a proxy filter recorded at a different location in the same room,is used for the training of cepstral and meta-level subband classifiers. the level of individual phonemes will vary widely. Both artificialnoise (white, pink) and recordings of real noise (speech-babble)from the NOISEX-92 database are used in our experiments. Whitenoise was selected due to its attractive theoretical interpretation asprobing in an isotropic manner the separation of phoneme classesin different representation domains. Pink noise was chosen because /f -like noise patterns are found in music melodies, fan and cockpitnoises, in nature etc. [60–62]. In order to further test the classificationperformance in the presence of linear filtering, noisy TIMIT sentencesare convolved with an impulse response with reverberation time T = 0 . sec. This impulse response is one that was measuredusing an Earthworks QTC1 microphone in the ICSI conferenceroom [63] populated with people; its magnitude response R ( e jω ) is shown in Figure 1, where we also show the spectrum of animpulse response corresponding to a different speaker position inthe same room, R ′ ( e jω ) . While the substantial difference betweenthese filters is evident from their spectra and spectral colorations(defined as a ratio of the geometric mean to the arithmetic mean ofspectral magnitude), R ′ ( e jω ) can be viewed as an approximation ofthe effect of the R ( e jω ) on the speech spectrum and is used in someof our experiments for training of the cepstral and meta-level subbandclassifiers in order to reduce the mismatch between training and testdata.To obtain the cepstral (MFCC) representation, each sentence isconverted into a sequence of 13 dimensional feature vectors, theirtime derivatives and second order derivatives which are combinedinto a sequence of 39 dimensional feature vectors. Then, T = 10 frames (with frame duration of ms and a frame rate of frames/sec) closest to the center of a phoneme are concatenated togive a representation in R . Noise compensation of the MFCCfeatures is performed via vector Taylor series (VTS) method whichhas been extensively used in recent literature and is consideredas state-of-the-art. This scheme estimates the distribution of noisyspeech given the distribution of clean speech, a segment of noisyspeech, and the Taylor series expansion that relates the noisy speechfeatures to the clean ones, and then uses it to predict the unobservedclean cepstral feature vectors. In our experiments, a Gaussian mixturemodel (GMM) with 64 mixture components was used to learn thedistribution of the Mel-log spectra of clean training data. Addition- ally, cepstral mean-and-variance normalization (CMVN) [13, 14] isperformed to standardize the cepstral features, fixing their range ofvariation for both training and test data. CMVN computes the meanand variance of the feature vectors across a sentence and standardizesthe features so that each has zero mean and a fixed variance. Thefollowing training-test scenarios are considered for classifiers withthe cepstral front-end:1. Anechoic training with VTS - training of the SVM classifiersis performed with anechoic clean speech and the test data iscompensated via VTS.2.
Reverberant training with VTS - training of the SVM clas-sifiers is performed with reverberant clean speech with featurecompensation of the test data via VTS. Two particular casesin this scenario are considered. (a)
The clean training data andthe noisy test data are convolved using the same linear filter, R ( e jω ) ). This case provides a lower bound on the classificationerror in the unlikely event when the exact knowledge of theconvolutive noise source is known. (b) This case investigates theeffects of a mismatch of the linear filter used for convolutionwith the training and test data. In particular, the data usedfor training of the SVM classifiers as well as learning ofthe distribution of log-spectra in VTS feature compensationis convolved with R ′ ( e jω ) while the test data is convolvedwith R ( e jω ) ). Since the exact knowledge of the linear filtercorrupting the test data is usually difficult to determine, thiscase offers a more practical solution to the problem and itsperformance is expected to lie between the brackets obtainedwith the two scenarios mentioned above i.e. anechoic training,and reverberant training and testing using the same filter.3. Matched training - In this scenario, both the training and testingconditions are identical. Again, this is an impractical target;nevertheless, we present the results (only in the presence ofadditive noise) as a reference, since this setup is considered togive the optimal achievable performance with cepstral features[64–66].Furthermore, note that the MFCC features of both training and testdata are standardized using CMVN [13] in all scenarios.Acoustic waveforms segments x are extracted from the TIMITsentences by applying a 100ms rectangular window at the centre ofeach phoneme and are then decomposed into subband components { x s } Ss =1 using a cosine-modulated filter bank (see (2)). We conductedexperiments to examine the effect of the number of filter bankchannels ( S ) on classification accuracy. Generally, decompositionof speech into wider subbands does not effectively capture thefrequency-specific dynamics of speech and thus results in relativelypoor performance. On the other hand, decomposition of speech insufficiently narrow subbands improves classification performance asdemonstrated in [22], but at the cost of an increase in the overallcomputational complexity. For the results presented in this paper, thenumber of filter bank channels is limited to in order to reduce thecomputational complexity. The dynamic subband feature vector, Ω s is computed by extracting T = 10 equal-length (25ms with an overlapof 10ms) frames around the centre of each phoneme thus yielding avector of dimension . These feature vectors are further standardizedwithin each sentence of TIMIT for the evaluation of kernel K Ω . Notethat the training of base-level SVM subband classifiers is alwaysperformed with clean data. The development subset is used fortraining of the meta-level subband classifiers as learning the optimalweights requires only a limited amount of data. Several scenarios areconsidered for training of the meta-level classifiers:1. Anechoic clean training - training the meta-level SVM clas-sifier with the base-level SVM score vectors obtained from anechoic clean data.2.
Anechoic multi-style training - training the meta-level SVMclassifier with the base-level SVM score vectors of anechoicdata containing a mixture of clean waveforms and waveformscorrupted by white noise at 0-dB SNR,3.
Reverberant multi-style training - training the meta-levelSVM classifier with the base-level SVM score vectors of re-verberant data containing a mixture of clean waveforms andwaveforms corrupted by white noise at 0-dB SNR. Similarto the MFCC training-test scenarios, two particular cases areconsidered here: (a) the development data for training as wellas the test data are convolved with the same filter R ( e jω ) , and (b) the development data for training is convolved with R ′ ( e jω ) whereas the test data is convolved with R ( e jω ) ,4. Matched training - training and testing with the meta-levelclassifier under identical noise level and type conditions. Resultsfor this scenario are shown only in the presence of additive noise.Next, we present the results of TIMIT phoneme classification withthe setup detailed above.
B. Results: Robustness to Additive Noise
First we compare various frequency decompositions and ensemblemethods for subband classification. A summary of their respectiveclassification errors in quiet condition is presented in Table I. Wefind the stacked generalization to yield significantly better resultsthan majority voting; it consistently achieves over improvementover majority voting for all subband decompositions consideredhere. Among these decompositions, classification with the 16-channelcosine-modulated filter bank achieves the largest improvement of . over the composite acoustic waveforms [17] and is thereforeselected for further experiments.TABLE I: Errors obtained with different subband decompositions [53](listed in the left column) and aggregation schemes for subband classifi-cation in quiet condition.
ERROR [ % ]Subband Analysis Maj. Voting Stack. Gen. Level-4 wavelet decomposition . . Level-4 wavelet packet decomposition . . DCT (16 uniform-width bands) . . . Composite Waveform [17] . Let us now consider classification of phonemes in the presence ofadditive noise. Robustness of the proposed method to both additivenoise and linear filtering is discussed in Section III-C. In Figure 2,we compare the classification in frequency subbands using ensemblemethods with composite acoustic waveform classification (resultsas reported in [17]) in the presence of white and pink noise. Thedashed curves correspond to subband classification using ensemblemethods i.e uniform combination (majority voting) and stacked gen-eralization with different training scenarios for meta-level classifiers(see Section III-A). The meta-level classifiers of multi-style stackedsubband classifier are trained according to scenario 2. The resultsshow that stacked generalization generally attains better performancethan uniform aggregation. The majority voting scheme also performspoorly in comparison to the composite acoustic waveforms across allSNRs. On the other hand the stacked subband classifier trained inquiet condition improves over the composite waveform classifier inlow noise conditions. But its performance then degrades relativelyquickly in high noise because its corresponding meta-level binary −18 −12 −6 0 6 12 18 Q 30405060708090100 SNR [dB] E RR O R [ % ] Composite WaveformSubband − Majority VotingStacked Subband − QuietStacked Subband − Multi−styleStacked Subband − Matched −18 −12 −6 0 6 12 18 Q 30405060708090100 SNR [dB] E RR O R [ % ] Composite WaveformSubband − Majority VotingStacked Subband − QuietStacked Subband − Multi−styleStacked Subband − Matched
Fig. 2:
Ensemble methods for aggregation of subband classifiers andtheir comparison with composite acoustic waveform classifiers (resultsas reported in [17]) in the presence of white noise (top) and pinknoise (bottom). The curves correspond to uniform combination (majorityvoting) and stacked generalization with different training scenarios forthe meta-level classifiers. The multi-style stacked subband classifier istrained only with the small development subset (one eighth randomlyselected score vectors from the development set) consisting of clean andwhite-noise (0-dB SNR) corrupted anechoic data. The classifiers are thentested on data corrupted with white noise (matched) and pink noise(mismatched). W e i gh t s ( M ean ± S t anda r d D e v i a t i on ) Majority VotingStacked Subband − QuietStacked Subband − Multi−style
Fig. 3:
Weights (mean ± standard deviation across N = 1128 binaryclassifiers) assigned to S = 16 subbands by the multi-style meta-levelclassifiers and by the meta-level classifiers trained in quiet conditions. classifiers are trained to assign weights to different subbands thatare tuned for classification in quiet. To improve the robustness toadditive noise, the meta-level classifiers can be trained on a mixtureof base-level SVM score vectors obtained from both clean and datacorrupted by white noise (0-dB SNR), as explained above. Figure 3shows the weights (mean ± standard deviation across N = 1128 binary classifiers) assigned to S = 16 subbands by the stackedclassifier with its metal-level binary classifiers trained in quiet andin multi-style conditions. It can be observed that relatively highweights are assigned to the low frequency subband components bythe multi-style training. This is reasonable as low frequency subbandshold a substantial portion of speech energy and can provide reliablediscriminatory information in the presence of wideband noise. Thelarge amount of variation in the assigned weights as indicated by theerror bars is consistent with the variation of speech data encounteredby the N = 1128 binary phoneme classifiers. It can be observedthat the multi-style subband classifiers consistently improves overthe composite waveform classifier as well as the stacked subbandclassifier trained in quiet condition. Overall, it achieves averageimprovements of . and . over the composite waveformclassifier in the presence of white (matched noise type) and pink(mismatched noise type) noise, respectively. As expected, the stackedsubband classifier trained in matched conditions finally outperformsthe other classifiers in all noise conditions as shown in Figure 2.Next, we compare the performance of the multi-style subband clas-sifier with the VTS-compensated MFCC classifier and the compositeacoustic waveform classifier [17] in the presence of additive white andpink noise. These results along with classification with the stackedsubband classifier and MFCC classifier in matched training-testconditions are presented in Figure 2. The results show that the stackedsubband classifier exhibits better classification performance than theVTS-compensated MFCC classifier for SNR below 12-dB whereasthe performance crossover between MFCC and composite acousticwaveform classifiers is between 6-dB and 0-dB SNR. The stackedsubband classifier achieves average improvements of . and . over the MFCC classifier across the range of SNRs considered in thepresence of white and pink noise, respectively. Moreover, and quiteremarkably, the stacked subband classifier also significantly improvesover the MFCC classifier trained and tested in matched conditionsfor SNRs below a crossover point between 6-dB and 0-dB SNR, eventhough its meta-level classifiers are trained only using clean data anddata corrupted by white noise at 0-dB SNR and the number of datapoints used to learn the optimal weights amounts only to a smallfraction of the data set used for training of the MFCC classifier inmatched conditions. In particular, an average improvement of . in the phoneme classification error is achieved by the multi-stylesubband classifier over the matched MFCC classifier for SNRs below6-dB in the presence of white noise.In [67] we showed that the MFCC classifiers suffer performancedegradation in case of a mismatch of the noise type between trainingand test data. 
On the other hand, the stacked subband classifierdegrades gracefully in a mismatched environment as shown in Figure4. This can be attributed to the decomposition of acoustic waveformsinto frequency subbands where the effect of wideband colored noiseon each binary subband classifier can be approximated as that of anarrow-band white noise. In comparison to the result reported in [33],where a . error was obtained at -dB SNR in pink noise using asecond-order regularized least squares algorithm (RLS2) trained usingMFCC feature vectors with variable length encoding, the proposedmethod achieves a relative improvement with a fixed lengthrepresentation in similar conditions.Figure 4 also shows a comparison of the stacked subband classifierwith the MFCC classifier trained and tested in matched conditions. −18 −12 −6 0 6 12 18 Q 2030405060708090100 SNR [dB] E RR O R [ % ] MFCC − VTSMFCC − MatchedComposite WaveformStacked Subband − Multi−styleStacked Subband − Matched−18 −12 −6 0 6 12 18 Q 2030405060708090100 SNR [dB] E RR O R [ % ] MFCC − VTSMFCC − MatchedComposite WaveformStacked Subband − Multi−styleStacked Subband − Matched
Fig. 4:
SVM classification in the subbands of acoustic waveforms andits comparison with MFCC and composite acoustic waveform classifiersin the presence of white noise (top) and pink noise (bottom). The multi-style stacked subband classifier is trained only with a small subset ofthe development data (one eighth randomly selected score vectors fromthe development set) consisting of clean and white-noise (0-dB SNR)corrupted data. In the matched training case, noise levels as well asnoise types of training and test data are identical for both MFCC andstacked subband classifiers.
The matched-condition subband classifier significantly outperformsthe matched MFCC classifier for SNRs below 6-dB. Around av-erage improvement is achieved by the subband classifier over MFCCclassifier for SNRs below 6-dB in the presence of both white and pinknoise. This suggests that the high-dimensional subband representationobtained from acoustic waveforms might provide a better separationof phoneme classes compared to cepstral representation.
C. Results: Robustness to Linear Filtering
We now consider classification in the presence of additive noise aswell as linear filtering. First, Figure 5 presents results of the ensemblesubband classification using stacked generalization with multipletraining-test scenarios (see Section III-A) in the presence of whiteand pink noise. To reiterate, three different scenarios are consideredfor training of the multi-style stacked subband classifier: one involvestraining the meta-level classifiers with the base-level SVM scorevectors of the development subset consisting of clean and white-noise (0-dB SNR) corrupted anechoic data, second involves trainingwith the score vectors of the same development data convolved with R ′ ( e jω ) (mismatched reverberant conditions) while the third involvestraining in matched reverberant conditions i.e. training with the same −18 −12 −6 0 6 12 18 Q 30405060708090100 SNR [dB] E RR O R [ % ] Quiet TrainingMulti−style (Anechoic)Multi−style (Reverberant Mismatch)Multi−style (Reverberant Matched)−18 −12 −6 0 6 12 18 Q 30405060708090100 SNR [dB] E RR O R [ % ] Quiet TrainingMulti−style (Anechoic)Multi−style (Reverberant Mismatch)Multi−style (Reverberant Matched)
Fig. 5:
Classification in frequency subbands using ensemble methods inthe presence of the linear filter R ( e jω ) with white noise (top) and pinknoise (bottom). The curves correspond to stacked generalization withdifferent training scenarios for meta-level subband classifier. development subset convolved with R ( e jω ) . These classifiers, whichare referred to as anechoic and reverberant multi-style subbandclassifiers (see Section III-A), are then tested on data corrupted bywhite, pink or speech-babble noise, and convolved with R ( e jω ) .Similar to our findings in the previous section, the results in Figure5 show that the anechoic multi-style subband classifier consistentlyimproves over the stacked subband classifier trained only in quietcondition. Moreover, the reverberant multi-style subband classifiers(both matched and mismatched) further reduce the mismatch withthe test data and hence exhibit more superior performances. Forinstance, in the presence of pink noise and linear filtering, the subbandclassifiers trained in mismatched and matched reverberant conditionsattain average improvements of and . across all SNRs overthe anechoic multi-style subband classifier, respectively. Note that anaccurate measurement of the linear filter corrupting the test data maybe difficult to obtain in practical scenarios. Nonetheless, classificationresults in matched reverberant condition are presented to determinea lower bound on the error. On the other hand, the mismatchedreverberant case can be considered as a more practical solution to theproblem and its performance is expected to lie between the bracketsobtained with the anechoic training and matched reverberant training.Figure 6 compares the classification performances of the subbandand VTS-compensated MFCC classifiers trained under three differentscenarios (see Section III-A) in the presence of linear filtering, andpink and speech-babble noise. The first is an agnostic (anechoic) case that does not rely on any information at all regarding the source ofthe convolutive noise R ( e jω ) , the second (reverberant mismatch case)employs a proxy reverberation filter R ′ ( e jω ) in order to reduce themismatch of the training and the reverberant test environments up to acertain degree, whereas the third (reverberant matched case) employsaccurate knowledge of the reverberation filter R ( e jω ) in the trainingof the MFCC classifiers and the meta-level subband classifiers.These training scenarios are respectively represented by squares,stars and circles in Figure 6. The results show that the comparisonsof the stacked subband classifiers and MFCC classifiers under thedifferent training regimes exhibit similar trends. Generally speaking,the MFCC classifier outperforms the corresponding subband classifierin quiet and low noise conditions however the latter yield significantimprovements in high noise conditions. For example, the anechoicsubband classifiers yields better classification performance than theanechoic MFCC classifier for SNRs below a crossover point between -dB and -dB. Quantitatively similar conclusions apply to thecomparative performances of the MFCC and subband classifiers inthe reverberant training scenarios. Under the three different trainingregimes and two different noise types, the subband classifiers attainan average improvement of . over the MFCC classifiers across allSNRs below -dB. Note that in the reverberant training scenarios,the MFCC classifier is trained with the complete TIMIT reverberanttraining set. 
On the other hand, the meta-level subband classifier istrained using the reverberant development subset with a number ofdata points less than of that in the TIMIT training set. Moreover,the dimension of the feature vectors that form the input to the meta-level classifiers is almost 24 times smaller than the MFCC featurevectors. To this end, the subband approach offers more flexibilityin terms of training and adaptation of the classifiers to a newenvironment.Since an obvious performance crossover between the subband andMFCC classifiers exists at moderate SNRs, we therefore considera convex combination of the scores of the SVM classifiers witha combination parameter λ as discussed in [17]. Here λ = 0 corresponds to the MFCC classification whereas λ = 1 correspondsto the subband classification. The combination approach was alsomotivated by the differences in the confusion matrices of the twoclassifiers (not shown here). This suggests that the errors of thesubband and MFCC classifiers may be independent up to a certaindegree and therefore a combination of the two may yield betterperformance than either of classifiers individually. Two differentvalues of the combination parameter λ are considered. First, thevalue of λ is set to / which corresponds to the arithmetic meanof the MFCC and subband SVM classifier scores. In the secondcase, we set the combination parameter λ to a function λ emp ( σ ) which approximates the optimal combination parameter values foran independent development set. This approximated function wasdetermined empirically in our previous experiments [17] and is givenby λ emp ( σ ) = η + ζ/ [1 + (cid:0) σ /σ (cid:1) ] , with η = 0 . , ζ = 0 . and σ = 0 . . Note that λ emp ( σ ) also requires an estimate of the noisevariance ( σ ) which was explicitly measured using the decision-directed estimation algorithm [68, 69].Figure 7 compares the classification performances of the subbandand MFCC classifiers with their convex combination in the presenceof speech-babble noise under anechoic and reverberant mismatchedtraining regimes. One can observe that the combined classificationwith λ emp consistently outperforms either of the individual classifiersacross all SNRs. For instance, under the anechoic training of theclassifiers, the combined classification with λ emp attains a . and . average improvement over the subband and MFCC classifiersrespectively, across all SNRs considered. Moreover, the combinedclassification via a simple averaging of the subband and MFCC −18 −12 −6 0 6 12 18 Q 2030405060708090100 SNR [dB] E RR O R [ % ] MFCC + VTS (Anechoic)MFCC + VTS (Reverb. Mismatch)MFCC + VTS (Reverb. Matched)Subband − Multi−style (Anechoic)Subband − Multi−style (Reverb. Mismatch)Subband − Multi−style (Reverb. Matched) −18 −12 −6 0 6 12 18 Q 2030405060708090100 SNR [dB] E RR O R [ % ] MFCC + VTS (Anechoic)MFCC + VTS (Reverb. Mismatch)MFCC + VTS (Reverb. Matched)Subband − Multi−style (Anechoic)Subband − Multi−style (Reverb. Mismatch)Subband − Multi−style (Reverb. Matched)
Fig. 6:
Classification with the subband and VTS-compensated MFCCclassifiers trained under three different scenarios: anechoic training(squares), reverberant mismatched training (stars) and reverberantmatched training (circles). Classification results for the test data con-taminated with pink noise (top) and speech-babble noise (bottom), andlinear filter R ( e jω ) are shown. classifiers by setting λ = 1 / provides a reasonable compromisebetween classification performance achieved within both represen-tation domains i.e. subbands of acoustic waveforms and cepstralrepresentation. While the performance of the combined classifier with λ = 1 / degrades only slightly (approximately ) for SNRs abovea cross over point between -dB and -dB, it achieves relatively fargreater improvements in high noise. e.g. under the anechoic trainingregime, the combined classifier with λ = 1 / attains a and . improvement over the MFCC and subband classifiers at -dBSNR, respectively. Quantitatively similar conclusions apply in thereverberant mismatched training scenario as shown in Figure 7.IV. CONCLUSIONSIn this paper we studied an SVM front-end for robust speechrecognition that operates in frequency subbands of high-dimensionalacoustic waveforms. We addressed the issues of kernel design forsubband components of acoustic waveforms and the aggregationof the individual subband classifiers using ensemble methods. Theexperiments demonstrated that the subband classifiers outperform thecepstral classifiers in the presence of noise and linear filtering forSNRs below -dB. While the subband classifiers do not perform aswell as the MFCC classifiers in low noise conditions, major gainsacross all noise levels can be attained by a convex combination [17]. −18 −12 −6 0 6 12 18 Q 30405060708090100 SNR [dB] E RR O R [ % ] MFCC + VTS (Anechoic)Subband − Multi−style (Anechoic)Combination ( λ =0.5)Combination ( λ emp ) −18 −12 −6 0 6 12 18 Q 30405060708090100 SNR [dB] E RR O R [ % ] MFCC + VTS (Reverb. Mismatch)Subband − Multi−style (Reverb. Mismatch)Combination ( λ =0.5)Combination ( λ emp ) Fig. 7:
Comparison of the classification performances of the subbandand MFCC classifiers with their convex combination in the presenceof speech-babble noise under anechoic training (top) and reverberantmismatched training regimes (bottom). Results are shown for two differentsettings of the combination parameter, λ . This work primarily focused on comparison of different repre-sentations in terms of the robustness they provide. To this end,experiments were conducted on the TIMIT phoneme classificationtask. However, the results reported in this paper also have implicationsfor the construction of ASR systems. In future work, we plan toinvestigate extensions to the proposed technique in order to facilitatethe recognition of continuous speech. One straight-forward approachwould be to pre-process the speech signals using the combination ofthe subband and cepstral SVM classifiers, and error-correcting outputcodes and generate class-wise feature vectors for overlapping andextended frames of speech. These feature vectors can be extractedin a manner similar to the MFCC features. An HMM-based systemcan then be trained using these feature vectors for recognition ofcontinuous speech. Alternatively, the proposed technique can alsobe integrated with other approaches such as the hybrid phone-basedHMM-SVM architecture [42, 43] and the token-passing algorithm[44] for continuous speech recognition. In the former, a baselineHMM system would be required to perform a first pass throughthe test data and for each utterance, generate a set of possiblesegmentations into phonemes. The best segmentations can then re-scored by the combined SVM classifier to predict the final phonemesequence. This approach has provided improvements in recognitionperformance over HMM baselines on both small and large vocabularyrecognition tasks, even though the SVM classifiers were constructed solely from the cepstral representations [42, 43]. However, thisHMM-SVM hybrid solution can also limit the efficiency of SVMsdue to possible errors in the segmentation stage. On the other hand,a recognizer based solely on SVMs as discussed in [44] can alsoemployed which makes decisions at a frame level via SVMs anddetermines the chain of recognized phonemes and words using thetoken-passing algorithm. These extensions will be the subject of afuture study. R
REFERENCES

[1] S. B. Davis and P. Mermelstein, “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Trans. ASSP, vol. 28, pp. 357–366, 1980.
[2] H. Hermansky, “Perceptual Linear Predictive (PLP) Analysis of Speech,” J. Acoust. Soc. Amer., vol. 87, no. 4, pp. 1738–1752, April 1990.
[3] R. Lippmann, “Speech Recognition by Machines and Humans,” Speech Comm., vol. 22, no. 1, pp. 1–15, 1997.
[4] G. Miller and P. Nicely, “An Analysis of Perceptual Confusions among some English Consonants,” J. Acoust. Soc. Amer., vol. 27, no. 2, pp. 338–352, 1955.
[5] J. B. Allen, “How do humans process and recognize speech?” IEEE Trans. Speech and Audio Proc., vol. 2, no. 4, pp. 567–577, 1994.
[6] J. Sroka and L. Braida, “Human and Machine Consonant Recognition,” Speech Comm., vol. 45, no. 4, pp. 401–423, 2005.
[7] B. Meyer, M. Wächter, T. Brand, and B. Kollmeier, “Phoneme Confusions in Human and Automatic Speech Recognition,” Proc. INTERSPEECH, pp. 2740–2743, 2007.
[8] B. Atal, “Automatic Speech Recognition: a Communication Perspective,” Proc. ICASSP, pp. 457–460, 1999.
[9] S. D. Peters, P. Stubley, and J. Valin, “On the Limits of Speech Recognition in Noise,” Proc. ICASSP, pp. 365–368, 1999.
[10] H. Bourlard, H. Hermansky, and N. Morgan, “Towards Increasing Speech Recognition Error Rates,” Speech Comm., vol. 18, no. 3, pp. 205–231, 1996.
[11] K. K. Paliwal and L. D. Alsteris, “On the Usefulness of STFT Phase Spectrum in Human Listening Tests,” Speech Comm., vol. 45, no. 2, pp. 153–170, 2005.
[12] L. D. Alsteris and K. K. Paliwal, “Further Intelligibility Results from Human Listening Tests using the Short-Time Phase Spectrum,” Speech Comm., vol. 48, no. 6, pp. 727–736, 2006.
[13] O. Viikki and K. Laurila, “Cepstral Domain Segmental Feature Vector Normalization for Noise Robust Speech Recognition,” Speech Comm., vol. 25, pp. 133–147, 1998.
[14] C. Chen and J. Bilmes, “MVA Processing of Speech Features,” IEEE Trans. ASLP, vol. 15, no. 1, pp. 257–270, 2007.
[15] P. J. Moreno, B. Raj, and R. M. Stern, “A Vector Taylor Series Approach for Environment-Independent Speech Recognition,” Proc. ICASSP, pp. 733–736, 1996.
[16] ETSI standard doc., “Speech Processing, Transmission and Quality Aspects (STQ): Advanced Front-End Feature Extraction,” ETSI ES 202 050, 2002.
[17] J. Yousafzai, Z. Cvetković, P. Sollich, and B. Yu, “Combined Features and Kernel Design for Noise Robust Phoneme Classification Using Support Vector Machines,” to appear in IEEE Trans. ASLP, 2011.
[18] H. Fletcher, Speech and Hearing in Communication. New York: Van Nostrand, 1953.
[19] J. McAuley, J. Ming, D. Stewart, and P. Hanna, “Subband Correlation and Robust Speech Recognition,” IEEE Trans. Speech and Audio Proc., vol. 13, no. 5, pp. 956–964, 2005.
[20] J. Ming, P. Jancovic, and F. Smith, “Robust Speech Recognition Using Probabilistic Union Models,” IEEE Trans. Speech and Audio Proc., vol. 10, no. 6, pp. 403–414, Sep. 2002.
[21] P. McCourt, N. Harte, and S. Vaseghi, “Discriminative Multi-resolution Sub-band and Segmental Phonetic Model Combination,” IET Electronics Letters, vol. 36, no. 3, pp. 270–271, 2000.
[22] S. Thomas, S. Ganapathy, and H. Hermansky, “Recognition of Reverberant Speech Using Frequency Domain Linear Prediction,” IEEE Signal Process. Letters, vol. 15, pp. 681–684, 2008.
[23] S. Tibrewala and H. Hermansky, “Subband Based Recognition of Noisy Speech,” Proc. ICASSP, pp. 1255–1258, 1997.
[24] P. McCourt, S. Vaseghi, and N. Harte, “Multi-Resolution Cepstral Features for Phoneme Recognition across Speech Sub-Bands,” Proc. ICASSP, pp. 557–560, 1998.
[25] H. Bourlard and S. Dupont, “Subband-based Speech Recognition,” Proc. ICASSP, pp. 1251–1254, 1997.
[26] S. Okawa, E. Bocchieri, and A. Potamianos, “Multi-band Speech Recognition in Noisy Environments,” Proc. ICASSP, pp. 641–644, 1998.
[27] C. Cerisara, J.-P. Haton, J.-F. Mari, and D. Fohr, “A Recombination Model for Multi-band Speech Recognition,” Proc. ICASSP, vol. 2, pp. 717–720, 1998.
[28] J. Flanagan, J. Johnston, R. Zahn, and G. Elko, “Computer-Steered Microphone Arrays for Sound Transduction in Large Rooms,” J. Acoust. Soc. Amer., vol. 78, no. 11, pp. 1508–1518, 1985.
[29] M. Wu and D. Wang, “A Two-Stage Algorithm for One-Microphone Reverberant Speech Enhancement,” IEEE Trans. ASLP, vol. 14, pp. 774–784, 2006.
[30] H. Chang and J. Glass, “Hierarchical Large-Margin Gaussian Mixture Models for Phonetic Classification,” Proc. ASRU, pp. 272–275, 2007.
[31] D. Yu, L. Deng, and A. Acero, “Hidden Conditional Random Fields with Distribution Constraints for Phone Classification,” Proc. INTERSPEECH, pp. 676–679, 2009.
[32] F. Sha and L. K. Saul, “Large Margin Gaussian Mixture Modeling for Phonetic Classification and Recognition,” Proc. ICASSP, pp. 265–268, 2006.
[33] R. Rifkin, K. Schutte, M. Saad, J. Bouvrie, and J. Glass, “Noise Robust Phonetic Classification with Linear Regularized Least Squares and Second-Order Features,” Proc. ICASSP, pp. 881–884, 2007.
[34] A. Halberstadt and J. Glass, “Heterogeneous Acoustic Measurements for Phonetic Classification,” Proc. EuroSpeech, pp. 401–404, 1997.
[35] P. Clarkson and P. J. Moreno, “On the Use of Support Vector Machines for Phonetic Classification,” Proc. ICASSP, pp. 585–588, 1999.
[36] A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt, “Hidden Conditional Random Fields for Phone Classification,” Proc. INTERSPEECH, pp. 1117–1120, 2005.
[37] M. Layton and M. Gales, “Augmented Statistical Models for Speech Recognition,” Proc. ICASSP, pp. I-129–I-132, 2006.
[38] V. Pitsikalis and P. Maragos, “Analysis and Classification of Speech Signals by Generalized Fractal Dimension Features,” Speech Comm., vol. 51, pp. 1206–1223, 2009.
[39] S. Dusan, “On the Relevance of Some Spectral and Temporal Patterns for Vowel Classification,” Speech Comm., vol. 49, pp. 71–82, 2007.
[40] K. M. Indrebo, R. J. Povinelli, and M. T. Johnson, “Sub-banded Reconstructed Phase Spaces for Speech Recognition,” Speech Comm., vol. 48, no. 7, pp. 760–774, 2006.
[41] A. Halberstadt and J. Glass, “Heterogeneous Measurements and Multiple Classifiers for Speech Recognition,” Proc. ICSLP, pp. 995–998, 1998.
[42] A. Ganapathiraju, J. E. Hamaker, and J. Picone, “Applications of Support Vector Machines to Speech Recognition,” IEEE Trans. Signal Proc., vol. 52, no. 8, pp. 2348–2355, 2004.
[43] S. E. Krüger, M. Schafföner, M. Katz, E. Andelic, and A. Wendemuth, “Speech Recognition with Support Vector Machines in a Hybrid System,” Proc. INTERSPEECH, pp. 993–996, 2005.
[44] J. Padrell-Sendra, D. Martín-Iglesias, and F. Díaz-de-María, “Support Vector Machines for Continuous Speech Recognition,” Proc. EUSIPCO, 2006.
[45] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[46] N. Smith and M. Gales, “Speech Recognition using SVMs,” in Adv. Neural Inf. Process. Syst., vol. 14, 2002, pp. 1197–1204.
[47] J. Louradour, K. Daoudi, and F. Bach, “Feature Space Mahalanobis Sequence Kernels: Application to SVM Speaker Verification,” IEEE Trans. ASLP, vol. 15, no. 8, pp. 2465–2475, 2007.
[48] A. Sloin and D. Burshtein, “Support Vector Machine Training for Improved Hidden Markov Modeling,” IEEE Trans. Signal Proc., vol. 56, no. 1, pp. 172–188, 2008.
[49] T. Jaakkola and D. Haussler, “Exploiting Generative Models in Discriminative Classifiers,” in Adv. Neural Inf. Process. Syst., vol. 11, 1999, pp. 487–493.
[50] R. Solera-Ureña, D. Martín-Iglesias, A. Gallardo-Antolín, C. Peláez-Moreno, and F. Díaz-de-María, “Robust ASR using Support Vector Machines,” Speech Comm., vol. 49, no. 4, pp. 253–267, 2007.
[51] T. Dietterich and G. Bakiri, “Solving Multiclass Learning Problems via Error-Correcting Output Codes,” J. Artif. Intell. Res., vol. 2, pp. 263–286, 1995.
[52] R. Rifkin and A. Klautau, “In Defense of One-Vs-All Classification,” J. Mach. Learn. Res., vol. 5, pp. 101–141, 2004.
[53] M. Vetterli and J. Kovačević, Wavelets and Subband Coding. Englewood Cliffs, NJ: Prentice-Hall, 1995.
[54] S. Furui, “Speaker-Independent Isolated Word Recognition using Dynamic Features of Speech Spectrum,” IEEE Trans. ASSP, vol. 34, no. 1, pp. 52–59, 1986.
[55] T. Dietterich, “Ensemble Methods in Machine Learning,” Lecture Notes in Computer Science: Multiple Classifier Systems, pp. 1–15, 2000.
[56] L. Hansen and P. Salamon, “Neural Network Ensembles,” IEEE Trans. PAMI, vol. 12, no. 10, pp. 993–1001, 1990.
[57] D. Wolpert, “Stacked Generalization,” Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.
[58] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, and N. Dahlgren, “TIMIT Acoustic-Phonetic Continuous Speech Corpus,” Linguistic Data Consortium, 1993.
[59] K. F. Lee and H. W. Hon, “Speaker-Independent Phone Recognition Using Hidden Markov Models,” IEEE Trans. ASSP, vol. 37, no. 11, pp. 1641–1648, 1989.
[60] R. F. Voss and J. Clarke, “1/f Noise in Music: Music from 1/f Noise,” J. Acoust. Soc. Amer., vol. 63, no. 1, pp. 258–263, 1978.
[61] B. J. West and M. Shlesinger, “The Noise in Natural Phenomena,” American Scientist, vol. 78, no. 1, pp. 40–45, 1990.
[62] P. Grigolini, G. Aquino, M. Bologna, M. Lukovic, and B. J. West, “A Theory of 1/f Noise in Human Cognition,” Physica A: Stat. Mech. and its Appl., vol. 388, no. 19, pp. 4192–4204, 2009.
[63] “The ICSI Meeting Recorder Project - Room Responses,” online web resource.
[64] IEEE Trans. ASLP, vol. 14, no. 1, pp. 43–49, 2006.
[65] R. Lippmann and E. A. Martin, “Multi-Style Training for Robust Isolated-Word Speech Recognition,” Proc. ICASSP, pp. 705–708, 1987.
[66] M. Gales and S. Young, “Robust Continuous Speech Recognition using Parallel Model Combination,” IEEE Trans. SAP, vol. 4, pp. 352–359, Sept. 1996.
[67] J. Yousafzai, Z. Cvetković, and P. Sollich, “Towards Robust Phoneme Classification with Hybrid Features,” Proc. ISIT, pp. 1643–1647, 2010.
[68] Y. Ephraim and D. Malah, “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator,” IEEE Trans. ASSP, vol. ASSP-32, pp. 1109–1121, 1984.
[69] ——, “Speech Enhancement Using a Minimum Mean-Square Log-Spectral Amplitude Estimator,” IEEE Trans. ASSP, vol. ASSP-33, no. 2, pp. 443–445, 1985.