A Subband-Based SVM Front-End for Robust ASR
Jibran Yousafzai, Zoran Cvetkovic, Peter Sollich, Matthew Ager
aa r X i v : . [ c s . C L ] D ec A Subband-Based SVM Front-End for Robust ASR
Jibran Yousafzai † , Member, IEEE
Zoran Cvetkovi´c † , Senior Member, IEEE
Peter Sollich ‡ Matthew Ager ‡ Abstract — This work proposes a novel support vector machine (SVM)based robust automatic speech recognition (ASR) front-end that operateson an ensemble of the subband components of high-dimensional acousticwaveforms. The key issues of selecting the appropriate SVM kernels forclassification in frequency subbands and the combination of individualsubband classifiers using ensemble methods are addressed. The proposedfront-end is compared with state-of-the-art ASR front-ends in terms ofrobustness to additive noise and linear filtering. Experiments performedon the TIMIT phoneme classification task demonstrate the benefits of theproposed subband based SVM front-end: it outperforms the standardcepstral front-end in the presence of noise and linear filtering forsignal-to-noise ratio (SNR) below 12-dB. A combination of the proposedfront-end with a conventional front-end such as MFCC yields furtherimprovements over the individual front ends across the full range ofnoise levels.
Index Terms —Speech recognition, robustness, subbands, support vec-tor machines.
I. INTRODUCTIONAUTOMATIC speech recognition (ASR) systems suffer severeperformance degradation in the presence of environmental distortions,in particular additive and convolutive noise. Humans, on the otherhand, exhibit a very robust behavior in recognizing speech evenin extremely adverse conditions. The central premise behind thedesign of state-of-the-art ASR systems is that combining front-ends based on the non-linear compression of speech, such as Mel-Frequency Cepstral Coefficients (MFCC) [1] and Perceptual LinearPrediction (PLP) coeffcients [2], with appropriate language andcontext modelling techniques can bring the recognition performanceof ASR close to humans. However, the effectiveness of context andlanguage modelling depends critically on the accuracy with whichthe underlying sequence of elementary phonetic units is predicted [3],and this is where there are still significant performance gaps betweenhumans and ASR systems. Humans recognize isolated speech unitsabove the level of chance already at − -dB SNR, and significantlyabove it at − -dB SNR [4]. At such high noise levels, humanspeech recognition performance exceeds that of the state-of-the-art ASR systems by over an order of magnitude. Even in quietconditions, the machine phone error rates for nonsense syllables aresignificantly higher than human error rates [3, 5–7]. Although thereare a number of factors preventing the conventional ASR systems toreach the human benchmark, several studies [7–12] have attributedthe marked difference between human and machine performance tothe fundamental limitations of the ASR front-ends. These studiessuggest that the large amount of redundancy in speech signals, whichis removed in the process of the extraction of cepstral features suchas Mel-Frequency Cepstral Coefficients (MFCC) [1] and PerceptualLinear Prediction (PLP) coefficients [2], is in fact needed to copewith environmental distortions. Among these studies, the work onhuman speech perception [7, 9, 11, 12] has shown explicitly thatthe information reduction that takes place in the conventional front-ends leads to a severe degradation in human speech recognitionperformance and, furthermore, that in noisy environments there is The authors are with the Department of Informatics † and theDepartment of Mathematics ‡ at King’s College London (e-mail: { jibran.yousafzai, zoran.cvetkovic, peter.sollich, matthew.ager } @kcl.ac.uk). a high correlation between human and machine errors in recognitionof speech with distortions introduced by typical ASR front-endprocessing. Over the years, techniques such as cepstral mean-and-variance normalization (CMVN) [13, 14], vector Taylor series (VTS)compensation [15] and ETSI advanced front-end (AFE) [16] havebeen developed that aim to explicitly reduce the effects of noise on theshort-term spectra in order to make the ASR front-ends less sensitiveto noise. 
However, the distortion of the cepstral features caused byadditive noise and linear filtering critically depends on the speechsignal, filter characteristics, noise type and noise level in a complexfashion that makes effective feature compensation or adaptation veryintricate and not sufficiently effective [14].In our previous work we showed that using acoustic waveformsdirectly, without any compression or nonlinear transformation canimprove the robustness of ASR front-ends to additive noise [17].In this paper, we propose an ASR front-end derived from thedecomposition of speech into its frequency subbands, to achieveadditional robustness to additive noise as well as linear filtering.This approach draws its motivation primarily from the experimentsconducted by Fletcher [18], which suggest that the human decodingof linguistic messages is based on decisions within narrow frequencysubbands that are processed quite independently of each other. Thisreasoning further implies that accurate recognition in any subbandshould result in accurate recognition overall, regardless of the errorsin other subbands. While this theory has not been proved and somestudies on the subband correlation of speech signals [19, 20] haveeven put its validity into question, there are some technical reasonsfor considering classification in frequency subbands. First of all,decomposing speech into its frequency subbands can be beneficialsince it allows a better exploitation of the fact that certain subbandsmay inherently provide better separation of some phoneme classesthan others. Secondly, the effect of wideband noise in sufficientlynarrow subbands can be approximated as that of narrowband whitenoise and thus make the compensation of features be approximatelyindependent of the spectral characteristics of the additive noiseand linear filtering. Moreover, appropriate ensemble methods foraggregation of the decisions in individual frequency subbands canfacilitate selective de-emphasis of unreliable information, particularlyin the presence of narrowband noise.Previously, the subband approach has been used in [21–27] whichresulted in marginal improvements in recognition performance overits full band counterparts. Note that the front-ends employed in theprevious works were the subband-based variants of cepstral featuresor multi-resolution cepstral features. By contrast, our proposed front-end features are extracted from an ensemble of subband componentsof high-dimensional acoustic waveforms and thus retain more in-formation about speech that is potentially relevant to discriminationof phonetic units than the corresponding cepstral representations.In addition to investigation of robustness of the proposed front-endto additive noise, we also assess its robustness to linear filteringdue to room reverberation. This form of distortion causes temporalsmearing of short-term spectra which degrades the performance ofASR systems. This can be attributed primarily to the use of analysiswindows for feature extraction in the conventional front-ends such asMFCC that are much shorter than typical room impulse responses. Furthermore, the distortion caused by linear filtering is correlatedwith the underlying speech signal. Hence, conventional methods forrobust ASR that are tuned for recognition of data corrupted byadditive noise only will not be effective in reverberant environments.Several speech dereverberation techniques that rely on multi-channelrecordings of speech such as [28, 29] exist in the literature. 
However,these consideration extend beyond the scope of this paper and instead,standard single channel feature compensation methods for additivenoise and linear filtering such as VTS and CMVN compensation areused throughout this paper.Robustness of the proposed front-end to additive noise and linearfiltering is demonstrated by its comparison with the MFCC front-end on a phoneme classification task; this task remains important incomparing different methods and representations [21, 30–40]. Theimprovements achieved on the classification task can be expected toextend to continuous speech recognition tasks [41, 42] as SVMs havebeen employed in hybrid frameworks [42, 43] with hidden Markovmodels (HMMs) as well as in frame-based architectures using thetoken passing algorithm [44] for recognition of continuous speech.The results demonstrate the benefits of the subband classification interms of robustness to additive noise and linear filtering. The subband-waveform classifiers outperform even the MFCC classifiers trainedand tested under matched conditions for signal-to-noise ratios below6-dB. Furthermore, in classifying noisy reverberant speech, the sub-band classifier outperforms the MFCC classifier compensated usingVTS for all signal-to-noise ratios (SNRs) below a crossover pointbetween 12-dB and 6-dB. Finally, their convex combination yieldsfurther performance improvements over both individual classifiers.This paper is organized as follows: the proposed subband classifi-cation approach is described in Section II. Experimental results thatdemonstrate its robustness to additive noise and linear filtering arepresented in Section III. Finally, Section IV draws some conclusionsand suggests future directions of this work towards application of theproposed front-end in continuous speech recognition tasks.II. SUBBAND CLASSIFICATION USING SUPPORTVECTOR MACHINESSupport vector machines (SVMs) are receiving increasing attentionas a tool for speech recognition applications due to their goodgeneralization properties [17, 35, 42, 43, 45–48]. Here we use them inconjunction with the proposed subband-based representation aimingto improve the robustness of the standard cepstral front-end to noiseand filtering. To this end we construct a fixed-length representationthat could potentially be used as the front-end for a continuous speechrecognition systems based on e.g. hidden Markov models (HMMs)[42–44]. Dealing with variable phoneme length has been addressedby means of generative kernels such as Fisher kernels [47, 49] anddynamic time-warping kernels [50], but lies beyond the scope of thispaper. Hence, the features of the proposed front-end are derived fromfixed-length segments of acoustic waveforms of speech and theseare studied in comparison with the MFCC features derived from thesame speech segments. Several possible extensions of the proposedfront-end for application in continuous speech recognition tasks arehighlighted in Section IV and will be investigated in a future study.
A. Support Vector Machines
A binary SVM classifier estimates a decision surface that jointlymaximizes the margin between the two classes and minimizes themisclassification error on the training set. For a given training set ( x , . . . , x p ) with corresponding class labels ( y , . . . , y p ) , y i ∈{ +1 , − } , an SVM classifies a test point x by computing a score function, h ( x ) = p X i =1 α i y i K ( x , x i ) + b (1)where α i is the Lagrange multiplier corresponding to the i th trainingsample, x i , b is the classifier bias – these are optimized duringtraining – and K is a kernel function. The class label of x is thenpredicted as sgn ( h ( x )) . While the simplest kernel K ( x , ˜ x ) = h x , ˜ x i produces linear decision boundaries, in most real classification tasks,the data is not linearly separable. Nonlinear kernel functions implic-itly map data points to a high-dimensional feature space where thedata could potentially be linearly separable. Kernel design is thereforeeffectively equivalent to feature-space selection, and using an appro-priate kernel for a given classification task is crucial. Commonlyused is the polynomial kernel, K p ( x , ˜ x ) = (1 + h x , ˜ x i ) Θ , wherethe polynomial order Θ in K p is a hyper-parameter that is tunedto a particular classification problem. More sophisticated kernels canbe obtained by various combinations of basic SVM kernels. Herewe use a polynomial kernel for classification with cepstral features(MFCC) whereas classification with acoustic waveforms in frequencysubbands is performed using a custom-designed kernel described inthe following.For multiclass problems, binary SVMs are combined via error-correcting output codes (ECOC) methods [51, 52]. In this work,for an M -class problem we train N = M ( M − / binarypairwise classifiers, primarily to lower the computational complexityby training on only the relevant two classes of data. The trainingscheme can be captured in a coding matrix w mn ∈ { , , − } , i.e. classifier n is trained only on data from the two classes m for which w mn = 0 , with sgn( w mn ) as the class label. One then predicts fortest input x the class that minimizes the loss P Nn =1 χ ( w mn f n ( x )) where f n ( x ) is the output of the n th binary classifier and χ is aloss function. We experimented with a variety of loss functions,including hinge, Hamming, exponential and linear. The hinge lossfunction χ ( z ) = max(1 − z, performed best and is therefore usedthroughout.For classification in frequency subbands, each waveform x isprocessed through an S -channel maximally-decimated perfect recon-struction cosine modulated filter bank (CMFB) [53] and decomposedinto its subband components, x s , s = 1 , . . . , S . Several othersubband decompositions such as discrete wavelet transform, waveletpacket decomposition and discrete cosine transform also achievedcomparable, albeit somewhat inferior performance. A summary of theclassification results obtained with different subband decompositionsin quiet conditions is presented in Section III-B. The CMFB consistsof a set of orthonormal analysis filters g s [ k ] = 1 √ S g [ k ] cos (cid:18) s − S (2 k − S − π (cid:19) ,s = 1 , . . . , S, k = 1 , . . . , S, (2)where g [ k ] = √ π ( k − . / S ) , k = 1 , . . . , S , is a low-pass prototype filter. Such a filter bank implements an orthogonaltransform, hence the collection of the subband components is arepresentation of the original waveform in a different coordinatesystem [53]. 
The subband components x s [ n ] are thus given by x s [ n ] = X k x [ k ] g s [ nS − k ] . (3)A maximally-decimated filter bank was chosen primarily becausethe sub-sampling operation avoids introducing additional unneces-sary redundancies and thus limits the overall computational burden.However, we believe that redundant expansions of speech signals obtained using over-sampled filter banks could be advantageous toeffectively account for the shift invariance of speech.For classification in frequency subbands, an SVM kernel is con-structed by partly following steps from our previous work [17],which attempted to capture known invariances or express explicitlythe waveform qualities which are known to correlate with phonemeidentity. First, an even kernel is constructed from a baseline polyno-mial kernel K p to account for the sign-invariance of human speechperception as K e ( x s , x si ) = K ′ p ( x s , x si ) + K ′ p ( x s , − x si ) (4)where K ′ p is a modified polynomial kernel given by K ′ p ( x s , x si ) = K p (cid:18) x s k x s k , x si k x si k (cid:19) = (cid:18) (cid:28) x s k x s k , x si k x si k (cid:29)(cid:19) Θ . (5)Kernel K ′ p , which acts on normalized input vectors, will be usedas a baseline kernel for the acoustic waveforms. On the other hand,the standard polynomial kernel K p is used for classification withthe cepstral representations where feature standardization by cepstralmean-and-variance normalization (CMVN) [13] already ensures thatfeature vectors typically have unit norm.Next, the temporal dynamics of speech are explicitly taken intoaccount by means of features that capture the evolution of energyin individual subbands. To obtain these features, each subbandcomponent x s is first divided into T frames, x t,s , t = 1 , . . . , T ,and then a vector of their energies ω s is formed as, ω s = (cid:20) log (cid:13)(cid:13) x ,s (cid:13)(cid:13) , . . . , log (cid:13)(cid:13)(cid:13) x T,s (cid:13)(cid:13)(cid:13) (cid:21) . Finally, time differences [54] of ω s are evaluated to form thedynamic subband feature vector Ω s as Ω s = (cid:2) ω s ∆ ω s ∆ ω s (cid:3) .This dynamic subband feature vector Ω s is then combined with thecorresponding acoustic waveform subband component x s formingkernel K Ω given by K Ω ( x s , x si , Ω s , Ω si ) = K e ( x s , x si ) K p ( Ω s , Ω si ) , (6)where Ω si is the dynamic subband feature vector corresponding tothe s th subband component x si of the i -th training point x i . B. Ensemble Methods
For each binary classification problem, decomposing an acousticwaveform into its subband components produces an ensemble of S classifiers. The decision of the subband classifiers in the ensemble,given by f s ( x s , Ω s ) = X i α si y i K Ω ( x s , x si , Ω s , Ω si ) + b s , s = 1 , . . . , S (7)are then aggregated using ensemble methods to obtain the binaryclassification decision for a test waveform x . Here α si and b s arethe Lagrange multiplier corresponding to x si and the bias of the s th subband binary classifier.
1) Uniform Aggregation:
Under a uniform aggregation scheme,the decisions of the subband classifiers in the ensemble are assigneduniform weights. Majority voting is the simplest uniform aggregationscheme commonly used in machine learning. In our context it isequivalent to forming a meta-level score function as h ( x ) = S X s =1 sgn ( f s ( x s , Ω s )) , (8)then predicting the class label as y = sgn ( h ( x )) . In addition tothis conventional majority voting scheme, which maps the scores in individual subbands to the corresponding class labels ( ± ), we alsoconsidered various smooth squashing functions, e.g. sigmoidal, asalternatives to the sgn function in (8), and obtained similar results.To gain some intuition about the potential of ensemble methods suchas the majority voting in improving the classification performance,consider the ideal case when the errors of the individual subbandclassifiers in the ensemble are independent with error probability p < / . Under these conditions, a simple combinatorial argument showsthat the error probability p e of the majority voting scheme is givenby p e = S X s = ⌈ S/ ⌉ Ss ! p s (1 − p ) S − s . (9)where the largest contribution to the overall error is due to term with s = ⌈ S/ ⌉ . For a large ensemble cardinality S , this error probabilitycan be bounded as: p e < p ⌈ S/ ⌉ (1 − p ) S −⌈ S/ ⌉ S X s = ⌈ S/ ⌉ Ss ! ≈
12 (4 p (1 − p )) S/ . (10)Therefore, in ideal conditions, the ensemble error decreases expo-nentially in S even with this simple aggregation scheme [55, 56].However, it has been shown that there exists a correlation between thesubband components of speech and the resulting speech recognitionerrors in individual frequency subbands [19, 20]. As a result, themajority voting scheme may not yield considerable improvements inthe classification performance. Furthermore, the uniform aggregationschemes also suffer from a major drawback; they do not exploitthe differences in the relative importance of individual subbands indiscriminating among specific pairs of phonemes. To remedy this,we use stacked generalization [57] as discussed next, to explicitlylearn weighting functions specific to each pair of phonemes for non-uniform aggregation of the outputs of base-level SVMs.
2) Stacked Generalization:
Our practical implementation ofstacked generalization [57] consists of a hierarchical two-layer SVMarchitecture, where the outputs of subband base-level SVMs areaggregated by a meta-level linear SVM. The decision function ofthe meta-level SVM classifier is of the form h ( x ) = h f ( x ) , w i + v = X s w s f s ( x s , Ω s ) + v , (11)where f ( x ) = (cid:2) f ( x , Ω ) , . . . , f S ( x S , Ω S ) (cid:3) is the base-level SVMscore vector of the test waveform x , v is the classifier bias, and w = (cid:2) w , . . . , w S (cid:3) is the weight vector of the meta-level classifier.Note each of the binary classifiers will have its specific weight vectordetermined from an independent development/validation set { ˜ x j , ˜ y j } .Each weight vector can, therefore, be expressed as w = X j β j ˜ y j f (˜ x j ) , (12)where f (˜ x j ) = h f (˜ x j , ˜ Ω j ) , . . . , f S (˜ x Sj , ˜ Ω Sj ) i is the base-levelSVM score vector of the training waveform ˜ x j , and β j and ˜ y j are the Lagrange multiplier and class label corresponding to f (˜ x j ) ,respectively. While a base-level SVM assigns a weight to eachsupporting feature vector, stacked generalization effectively assignsan additional weight w s to each subband based on the performanceof the corresponding base-level subband classifier. Again, ECOCmethods are used to combine the meta-level binary classifiers formulticlass classification.An obvious advantage of the subband approach for ASR isthat the effect of environmental distortions in sufficiently narrowsubbands can be approximated as similar to that of a narrow-band white noise. This, in turn, facilitates the compensation of features to be independent of the spectral characteristics of theadditive and convolutive noise sources. In a preceding paper [17],we proposed an ASR front-end based on the full-band acousticwaveform representation of speech where a spectral shape adaptationof the features was performed in order to account for the varyingstrength of contamination of the frequency components due to thepresence of colored noise. In this work, compensation of the featuresis performed using standard approaches such as cepstral mean-and-variance normalization (CMVN) and vector Taylor series (VTS)methods which do not require any prior knowledge of the additiveand convolutive noise sources. Furthermore, we found that the stackedgeneralization also depends on the level of noise contaminating itstraining data. To this end, the weight vectors corresponding to thestacked classifiers can be tuned for classification in a particularenvironment by introducing similar distortion to its training data.In scenarios where a performance gain over a wide range of SNRsis desired, a multi-style training approach that offers a reasonablecompromise between various test conditions can also be employed.For instance, a meta-level classifier can be trained using the scorefeature vectors of noisy data or the score feature vectors of a mixtureof clean and noisy data.Note that since the dimension of the score feature vectors thatform the input to the stacked subband classifier ( S ) is very smallcompared to the typical MFCC or waveform feature vectors, onlya very limited amount of data is required to learn optimal weightsof the meta-level classifiers. As such, stacked generalization offersflexibility and some coarse frequency selectivity for the individualbinary classification problems, and can be particularly useful in de-emphasizing information from unreliable subbands. 
The experimentspresented in this paper show that the subband approach attains majorgains in classification performance over its full-band counterpart [17]as well as the state-of-the-art front-ends such as MFCC.III. EXPERIMENTAL RESULTS A. Experimental Setup
Experiments are performed on the ‘si’ (diverse) and ‘sx’ (compact)sentences of the TIMIT database [58]. The training set consistsof 3696 sentences from 168 different speakers. For testing we usethe core test set which consists of 192 sentences from 24 differentspeakers not included in the training set. The development set consistsof 1152 sentences uttered by 96 male and 48 female speakers notincluded in either the training or the core test set, with speakersfrom 8 different dialect regions. In training the meta-level subbandclassifiers, we use a small subset, randomly selecting an eighth ofthe data points in the complete TIMIT development set. The glottalstops /q/ are removed from the class labels and certain allophonesare grouped into their corresponding phoneme classes using thestandard Kai-Fu Lee clustering [59], resulting in a total of M = 48 phoneme classes and N = M ( M − / classifiers.Among these classes, there are 7 groups for which the contribution ofwithin-group confusions toward multiclass error is not counted, againfollowing standard practice [35, 59]. Initially, we experimented withdifferent values of the hyperparameters for the binary SVM classifiersbut decided to use fixed values for all classifiers as parameteroptimization had a large computational overhead but only a smallimpact on the multiclass classification error: the degree of K p is setto Θ = 6 and the penalty parameter (for slack variables in the SVMtraining algorithm) to C = 1 .To test the classification performance in noise, each TIMIT testsentence is normalized to unit energy per sample and then a noisesequence is added to the entire sentence to set the sentence-levelSNR. Hence for a given sentence-level SNR, signal-to-noise ratio at π rad/sample) M agn i t ude ( d B ) π rad/sample) P ha s e ( x π r ad i an s ) R(z) (Coloration=−0.5dB)R ′ (z) (Coloration=−0.9dB) Fig. 1:
Frequency response of the ICSI conference room filters withspectral coloration -0.5-dB and -0.9-dB. Here, spectral coloration isdefined as the ratio of the geometric mean to the arithmetic mean ofspectral magnitudes. R ( z ) is used to add reverb to the test data whereas R ′ ( z ) , a proxy filter recorded at a different location in the same room,is used for the training of cepstral and meta-level subband classifiers. the level of individual phonemes will vary widely. Both artificialnoise (white, pink) and recordings of real noise (speech-babble)from the NOISEX-92 database are used in our experiments. Whitenoise was selected due to its attractive theoretical interpretation asprobing in an isotropic manner the separation of phoneme classesin different representation domains. Pink noise was chosen because /f -like noise patterns are found in music melodies, fan and cockpitnoises, in nature etc. [60–62]. In order to further test the classificationperformance in the presence of linear filtering, noisy TIMIT sentencesare convolved with an impulse response with reverberation time T = 0 . sec. This impulse response is one that was measuredusing an Earthworks QTC1 microphone in the ICSI conferenceroom [63] populated with people; its magnitude response R ( e jω ) is shown in Figure 1, where we also show the spectrum of animpulse response corresponding to a different speaker position inthe same room, R ′ ( e jω ) . While the substantial difference betweenthese filters is evident from their spectra and spectral colorations(defined as a ratio of the geometric mean to the arithmetic mean ofspectral magnitude), R ′ ( e jω ) can be viewed as an approximation ofthe effect of the R ( e jω ) on the speech spectrum and is used in someof our experiments for training of the cepstral and meta-level subbandclassifiers in order to reduce the mismatch between training and testdata.To obtain the cepstral (MFCC) representation, each sentence isconverted into a sequence of 13 dimensional feature vectors, theirtime derivatives and second order derivatives which are combinedinto a sequence of 39 dimensional feature vectors. Then, T = 10 frames (with frame duration of ms and a frame rate of frames/sec) closest to the center of a phoneme are concatenated togive a representation in R . Noise compensation of the MFCCfeatures is performed via vector Taylor series (VTS) method whichhas been extensively used in recent literature and is consideredas state-of-the-art. This scheme estimates the distribution of noisyspeech given the distribution of clean speech, a segment of noisyspeech, and the Taylor series expansion that relates the noisy speechfeatures to the clean ones, and then uses it to predict the unobservedclean cepstral feature vectors. In our experiments, a Gaussian mixturemodel (GMM) with 64 mixture components was used to learn thedistribution of the Mel-log spectra of clean training data. Addition- ally, cepstral mean-and-variance normalization (CMVN) [13, 14] isperformed to standardize the cepstral features, fixing their range ofvariation for both training and test data. CMVN computes the meanand variance of the feature vectors across a sentence and standardizesthe features so that each has zero mean and a fixed variance. Thefollowing training-test scenarios are considered for classifiers withthe cepstral front-end:1. Anechoic training with VTS - training of the SVM classifiersis performed with anechoic clean speech and the test data iscompensated via VTS.2.
Reverberant training with VTS - training of the SVM clas-sifiers is performed with reverberant clean speech with featurecompensation of the test data via VTS. Two particular casesin this scenario are considered. (a)
The clean training data andthe noisy test data are convolved using the same linear filter, R ( e jω ) ). This case provides a lower bound on the classificationerror in the unlikely event when the exact knowledge of theconvolutive noise source is known. (b) This case investigates theeffects of a mismatch of the linear filter used for convolutionwith the training and test data. In particular, the data usedfor training of the SVM classifiers as well as learning ofthe distribution of log-spectra in VTS feature compensationis convolved with R ′ ( e jω ) while the test data is convolvedwith R ( e jω ) ). Since the exact knowledge of the linear filtercorrupting the test data is usually difficult to determine, thiscase offers a more practical solution to the problem and itsperformance is expected to lie between the brackets obtainedwith the two scenarios mentioned above i.e. anechoic training,and reverberant training and testing using the same filter.3. Matched training - In this scenario, both the training and testingconditions are identical. Again, this is an impractical target;nevertheless, we present the results (only in the presence ofadditive noise) as a reference, since this setup is considered togive the optimal achievable performance with cepstral features[64–66].Furthermore, note that the MFCC features of both training and testdata are standardized using CMVN [13] in all scenarios.Acoustic waveforms segments x are extracted from the TIMITsentences by applying a 100ms rectangular window at the centre ofeach phoneme and are then decomposed into subband components { x s } Ss =1 using a cosine-modulated filter bank (see (2)). We conductedexperiments to examine the effect of the number of filter bankchannels ( S ) on classification accuracy. Generally, decompositionof speech into wider subbands does not effectively capture thefrequency-specific dynamics of speech and thus results in relativelypoor performance. On the other hand, decomposition of speech insufficiently narrow subbands improves classification performance asdemonstrated in [22], but at the cost of an increase in the overallcomputational complexity. For the results presented in this paper, thenumber of filter bank channels is limited to in order to reduce thecomputational complexity. The dynamic subband feature vector, Ω s is computed by extracting T = 10 equal-length (25ms with an overlapof 10ms) frames around the centre of each phoneme thus yielding avector of dimension . These feature vectors are further standardizedwithin each sentence of TIMIT for the evaluation of kernel K Ω . Notethat the training of base-level SVM subband classifiers is alwaysperformed with clean data. The development subset is used fortraining of the meta-level subband classifiers as learning the optimalweights requires only a limited amount of data. Several scenarios areconsidered for training of the meta-level classifiers:1. Anechoic clean training - training the meta-level SVM clas-sifier with the base-level SVM score vectors obtained from anechoic clean data.2.
Anechoic multi-style training - training the meta-level SVMclassifier with the base-level SVM score vectors of anechoicdata containing a mixture of clean waveforms and waveformscorrupted by white noise at 0-dB SNR,3.
Reverberant multi-style training - training the meta-levelSVM classifier with the base-level SVM score vectors of re-verberant data containing a mixture of clean waveforms andwaveforms corrupted by white noise at 0-dB SNR. Similarto the MFCC training-test scenarios, two particular cases areconsidered here: (a) the development data for training as wellas the test data are convolved with the same filter R ( e jω ) , and (b) the development data for training is convolved with R ′ ( e jω ) whereas the test data is convolved with R ( e jω ) ,4. Matched training - training and testing with the meta-levelclassifier under identical noise level and type conditions. Resultsfor this scenario are shown only in the presence of additive noise.Next, we present the results of TIMIT phoneme classification withthe setup detailed above.
B. Results: Robustness to Additive Noise
First we compare various frequency decompositions and ensemblemethods for subband classification. A summary of their respectiveclassification errors in quiet condition is presented in Table I. Wefind the stacked generalization to yield significantly better resultsthan majority voting; it consistently achieves over improvementover majority voting for all subband decompositions consideredhere. Among these decompositions, classification with the 16-channelcosine-modulated filter bank achieves the largest improvement of . over the composite acoustic waveforms [17] and is thereforeselected for further experiments.TABLE I: Errors obtained with different subband decompositions [53](listed in the left column) and aggregation schemes for subband classifi-cation in quiet condition.
ERROR [ % ]Subband Analysis Maj. Voting Stack. Gen. Level-4 wavelet decomposition . . Level-4 wavelet packet decomposition . . DCT (16 uniform-width bands) . . . Composite Waveform [17] . Let us now consider classification of phonemes in the presence ofadditive noise. Robustness of the proposed method to both additivenoise and linear filtering is discussed in Section III-C. In Figure 2,we compare the classification in frequency subbands using ensemblemethods with composite acoustic waveform classification (resultsas reported in [17]) in the presence of white and pink noise. Thedashed curves correspond to subband classification using ensemblemethods i.e uniform combination (majority voting) and stacked gen-eralization with different training scenarios for meta-level classifiers(see Section III-A). The meta-level classifiers of multi-style stackedsubband classifier are trained according to scenario 2. The resultsshow that stacked generalization generally attains better performancethan uniform aggregation. The majority voting scheme also performspoorly in comparison to the composite acoustic waveforms across allSNRs. On the other hand the stacked subband classifier trained inquiet condition improves over the composite waveform classifier inlow noise conditions. But its performance then degrades relativelyquickly in high noise because its corresponding meta-level binary −18 −12 −6 0 6 12 18 Q 30405060708090100 SNR [dB] E RR O R [ % ] Composite WaveformSubband − Majority VotingStacked Subband − QuietStacked Subband − Multi−styleStacked Subband − Matched −18 −12 −6 0 6 12 18 Q 30405060708090100 SNR [dB] E RR O R [ % ] Composite WaveformSubband − Majority VotingStacked Subband − QuietStacked Subband − Multi−styleStacked Subband − Matched
Fig. 2:
Ensemble methods for aggregation of subband classifiers andtheir comparison with composite acoustic waveform classifiers (resultsas reported in [17]) in the presence of white noise (top) and pinknoise (bottom). The curves correspond to uniform combination (majorityvoting) and stacked generalization with different training scenarios forthe meta-level classifiers. The multi-style stacked subband classifier istrained only with the small development subset (one eighth randomlyselected score vectors from the development set) consisting of clean andwhite-noise (0-dB SNR) corrupted anechoic data. The classifiers are thentested on data corrupted with white noise (matched) and pink noise(mismatched). W e i gh t s ( M ean ± S t anda r d D e v i a t i on ) Majority VotingStacked Subband − QuietStacked Subband − Multi−style
Fig. 3:
Weights (mean ± standard deviation across N = 1128 binaryclassifiers) assigned to S = 16 subbands by the multi-style meta-levelclassifiers and by the meta-level classifiers trained in quiet conditions. classifiers are trained to assign weights to different subbands thatare tuned for classification in quiet. To improve the robustness toadditive noise, the meta-level classifiers can be trained on a mixtureof base-level SVM score vectors obtained from both clean and datacorrupted by white noise (0-dB SNR), as explained above. Figure 3shows the weights (mean ± standard deviation across N = 1128 binary classifiers) assigned to S = 16 subbands by the stackedclassifier with its metal-level binary classifiers trained in quiet andin multi-style conditions. It can be observed that relatively highweights are assigned to the low frequency subband components bythe multi-style training. This is reasonable as low frequency subbandshold a substantial portion of speech energy and can provide reliablediscriminatory information in the presence of wideband noise. Thelarge amount of variation in the assigned weights as indicated by theerror bars is consistent with the variation of speech data encounteredby the N = 1128 binary phoneme classifiers. It can be observedthat the multi-style subband classifiers consistently improves overthe composite waveform classifier as well as the stacked subbandclassifier trained in quiet condition. Overall, it achieves averageimprovements of . and . over the composite waveformclassifier in the presence of white (matched noise type) and pink(mismatched noise type) noise, respectively. As expected, the stackedsubband classifier trained in matched conditions finally outperformsthe other classifiers in all noise conditions as shown in Figure 2.Next, we compare the performance of the multi-style subband clas-sifier with the VTS-compensated MFCC classifier and the compositeacoustic waveform classifier [17] in the presence of additive white andpink noise. These results along with classification with the stackedsubband classifier and MFCC classifier in matched training-testconditions are presented in Figure 2. The results show that the stackedsubband classifier exhibits better classification performance than theVTS-compensated MFCC classifier for SNR below 12-dB whereasthe performance crossover between MFCC and composite acousticwaveform classifiers is between 6-dB and 0-dB SNR. The stackedsubband classifier achieves average improvements of . and . over the MFCC classifier across the range of SNRs considered in thepresence of white and pink noise, respectively. Moreover, and quiteremarkably, the stacked subband classifier also significantly improvesover the MFCC classifier trained and tested in matched conditionsfor SNRs below a crossover point between 6-dB and 0-dB SNR, eventhough its meta-level classifiers are trained only using clean data anddata corrupted by white noise at 0-dB SNR and the number of datapoints used to learn the optimal weights amounts only to a smallfraction of the data set used for training of the MFCC classifier inmatched conditions. In particular, an average improvement of . in the phoneme classification error is achieved by the multi-stylesubband classifier over the matched MFCC classifier for SNRs below6-dB in the presence of white noise.In [67] we showed that the MFCC classifiers suffer performancedegradation in case of a mismatch of the noise type between trainingand test data. 
On the other hand, the stacked subband classifierdegrades gracefully in a mismatched environment as shown in Figure4. This can be attributed to the decomposition of acoustic waveformsinto frequency subbands where the effect of wideband colored noiseon each binary subband classifier can be approximated as that of anarrow-band white noise. In comparison to the result reported in [33],where a . error was obtained at -dB SNR in pink noise using asecond-order regularized least squares algorithm (RLS2) trained usingMFCC feature vectors with variable length encoding, the proposedmethod achieves a relative improvement with a fixed lengthrepresentation in similar conditions.Figure 4 also shows a comparison of the stacked subband classifierwith the MFCC classifier trained and tested in matched conditions. −18 −12 −6 0 6 12 18 Q 2030405060708090100 SNR [dB] E RR O R [ % ] MFCC − VTSMFCC − MatchedComposite WaveformStacked Subband − Multi−styleStacked Subband − Matched−18 −12 −6 0 6 12 18 Q 2030405060708090100 SNR [dB] E RR O R [ % ] MFCC − VTSMFCC − MatchedComposite WaveformStacked Subband − Multi−styleStacked Subband − Matched
Fig. 4:
SVM classification in the subbands of acoustic waveforms andits comparison with MFCC and composite acoustic waveform classifiersin the presence of white noise (top) and pink noise (bottom). The multi-style stacked subband classifier is trained only with a small subset ofthe development data (one eighth randomly selected score vectors fromthe development set) consisting of clean and white-noise (0-dB SNR)corrupted data. In the matched training case, noise levels as well asnoise types of training and test data are identical for both MFCC andstacked subband classifiers.
The matched-condition subband classifier significantly outperformsthe matched MFCC classifier for SNRs below 6-dB. Around av-erage improvement is achieved by the subband classifier over MFCCclassifier for SNRs below 6-dB in the presence of both white and pinknoise. This suggests that the high-dimensional subband representationobtained from acoustic waveforms might provide a better separationof phoneme classes compared to cepstral representation.
C. Results: Robustness to Linear Filtering
We now consider classification in the presence of additive noise aswell as linear filtering. First, Figure 5 presents results of the ensemblesubband classification using stacked generalization with multipletraining-test scenarios (see Section III-A) in the presence of whiteand pink noise. To reiterate, three different scenarios are consideredfor training of the multi-style stacked subband classifier: one involvestraining the meta-level classifiers with the base-level SVM scorevectors of the development subset consisting of clean and white-noise (0-dB SNR) corrupted anechoic data, second involves trainingwith the score vectors of the same development data convolved with R ′ ( e jω ) (mismatched reverberant conditions) while the third involvestraining in matched reverberant conditions i.e. training with the same −18 −12 −6 0 6 12 18 Q 30405060708090100 SNR [dB] E RR O R [ % ] Quiet TrainingMulti−style (Anechoic)Multi−style (Reverberant Mismatch)Multi−style (Reverberant Matched)−18 −12 −6 0 6 12 18 Q 30405060708090100 SNR [dB] E RR O R [ % ] Quiet TrainingMulti−style (Anechoic)Multi−style (Reverberant Mismatch)Multi−style (Reverberant Matched)
Fig. 5:
Classification in frequency subbands using ensemble methods inthe presence of the linear filter R ( e jω ) with white noise (top) and pinknoise (bottom). The curves correspond to stacked generalization withdifferent training scenarios for meta-level subband classifier. development subset convolved with R ( e jω ) . These classifiers, whichare referred to as anechoic and reverberant multi-style subbandclassifiers (see Section III-A), are then tested on data corrupted bywhite, pink or speech-babble noise, and convolved with R ( e jω ) .Similar to our findings in the previous section, the results in Figure5 show that the anechoic multi-style subband classifier consistentlyimproves over the stacked subband classifier trained only in quietcondition. Moreover, the reverberant multi-style subband classifiers(both matched and mismatched) further reduce the mismatch withthe test data and hence exhibit more superior performances. Forinstance, in the presence of pink noise and linear filtering, the subbandclassifiers trained in mismatched and matched reverberant conditionsattain average improvements of and . across all SNRs overthe anechoic multi-style subband classifier, respectively. Note that anaccurate measurement of the linear filter corrupting the test data maybe difficult to obtain in practical scenarios. Nonetheless, classificationresults in matched reverberant condition are presented to determinea lower bound on the error. On the other hand, the mismatchedreverberant case can be considered as a more practical solution to theproblem and its performance is expected to lie between the bracketsobtained with the anechoic training and matched reverberant training.Figure 6 compares the classification performances of the subbandand VTS-compensated MFCC classifiers trained under three differentscenarios (see Section III-A) in the presence of linear filtering, andpink and speech-babble noise. The first is an agnostic (anechoic) case that does not rely on any information at all regarding the source ofthe convolutive noise R ( e jω ) , the second (reverberant mismatch case)employs a proxy reverberation filter R ′ ( e jω ) in order to reduce themismatch of the training and the reverberant test environments up to acertain degree, whereas the third (reverberant matched case) employsaccurate knowledge of the reverberation filter R ( e jω ) in the trainingof the MFCC classifiers and the meta-level subband classifiers.These training scenarios are respectively represented by squares,stars and circles in Figure 6. The results show that the comparisonsof the stacked subband classifiers and MFCC classifiers under thedifferent training regimes exhibit similar trends. Generally speaking,the MFCC classifier outperforms the corresponding subband classifierin quiet and low noise conditions however the latter yield significantimprovements in high noise conditions. For example, the anechoicsubband classifiers yields better classification performance than theanechoic MFCC classifier for SNRs below a crossover point between -dB and -dB. Quantitatively similar conclusions apply to thecomparative performances of the MFCC and subband classifiers inthe reverberant training scenarios. Under the three different trainingregimes and two different noise types, the subband classifiers attainan average improvement of . over the MFCC classifiers across allSNRs below -dB. Note that in the reverberant training scenarios,the MFCC classifier is trained with the complete TIMIT reverberanttraining set. 
On the other hand, the meta-level subband classifier istrained using the reverberant development subset with a number ofdata points less than of that in the TIMIT training set. Moreover,the dimension of the feature vectors that form the input to the meta-level classifiers is almost 24 times smaller than the MFCC featurevectors. To this end, the subband approach offers more flexibilityin terms of training and adaptation of the classifiers to a newenvironment.Since an obvious performance crossover between the subband andMFCC classifiers exists at moderate SNRs, we therefore considera convex combination of the scores of the SVM classifiers witha combination parameter λ as discussed in [17]. Here λ = 0 corresponds to the MFCC classification whereas λ = 1 correspondsto the subband classification. The combination approach was alsomotivated by the differences in the confusion matrices of the twoclassifiers (not shown here). This suggests that the errors of thesubband and MFCC classifiers may be independent up to a certaindegree and therefore a combination of the two may yield betterperformance than either of classifiers individually. Two differentvalues of the combination parameter λ are considered. First, thevalue of λ is set to / which corresponds to the arithmetic meanof the MFCC and subband SVM classifier scores. In the secondcase, we set the combination parameter λ to a function λ emp ( σ ) which approximates the optimal combination parameter values foran independent development set. This approximated function wasdetermined empirically in our previous experiments [17] and is givenby λ emp ( σ ) = η + ζ/ [1 + (cid:0) σ /σ (cid:1) ] , with η = 0 . , ζ = 0 . and σ = 0 . . Note that λ emp ( σ ) also requires an estimate of the noisevariance ( σ ) which was explicitly measured using the decision-directed estimation algorithm [68, 69].Figure 7 compares the classification performances of the subbandand MFCC classifiers with their convex combination in the presenceof speech-babble noise under anechoic and reverberant mismatchedtraining regimes. One can observe that the combined classificationwith λ emp consistently outperforms either of the individual classifiersacross all SNRs. For instance, under the anechoic training of theclassifiers, the combined classification with λ emp attains a . and . average improvement over the subband and MFCC classifiersrespectively, across all SNRs considered. Moreover, the combinedclassification via a simple averaging of the subband and MFCC −18 −12 −6 0 6 12 18 Q 2030405060708090100 SNR [dB] E RR O R [ % ] MFCC + VTS (Anechoic)MFCC + VTS (Reverb. Mismatch)MFCC + VTS (Reverb. Matched)Subband − Multi−style (Anechoic)Subband − Multi−style (Reverb. Mismatch)Subband − Multi−style (Reverb. Matched) −18 −12 −6 0 6 12 18 Q 2030405060708090100 SNR [dB] E RR O R [ % ] MFCC + VTS (Anechoic)MFCC + VTS (Reverb. Mismatch)MFCC + VTS (Reverb. Matched)Subband − Multi−style (Anechoic)Subband − Multi−style (Reverb. Mismatch)Subband − Multi−style (Reverb. Matched)
Fig. 6:
Classification with the subband and VTS-compensated MFCCclassifiers trained under three different scenarios: anechoic training(squares), reverberant mismatched training (stars) and reverberantmatched training (circles). Classification results for the test data con-taminated with pink noise (top) and speech-babble noise (bottom), andlinear filter R ( e jω ) are shown. classifiers by setting λ = 1 / provides a reasonable compromisebetween classification performance achieved within both represen-tation domains i.e. subbands of acoustic waveforms and cepstralrepresentation. While the performance of the combined classifier with λ = 1 / degrades only slightly (approximately ) for SNRs abovea cross over point between -dB and -dB, it achieves relatively fargreater improvements in high noise. e.g. under the anechoic trainingregime, the combined classifier with λ = 1 / attains a and . improvement over the MFCC and subband classifiers at -dBSNR, respectively. Quantitatively similar conclusions apply in thereverberant mismatched training scenario as shown in Figure 7.IV. CONCLUSIONSIn this paper we studied an SVM front-end for robust speechrecognition that operates in frequency subbands of high-dimensionalacoustic waveforms. We addressed the issues of kernel design forsubband components of acoustic waveforms and the aggregationof the individual subband classifiers using ensemble methods. Theexperiments demonstrated that the subband classifiers outperform thecepstral classifiers in the presence of noise and linear filtering forSNRs below -dB. While the subband classifiers do not perform aswell as the MFCC classifiers in low noise conditions, major gainsacross all noise levels can be attained by a convex combination [17]. −18 −12 −6 0 6 12 18 Q 30405060708090100 SNR [dB] E RR O R [ % ] MFCC + VTS (Anechoic)Subband − Multi−style (Anechoic)Combination ( λ =0.5)Combination ( λ emp ) −18 −12 −6 0 6 12 18 Q 30405060708090100 SNR [dB] E RR O R [ % ] MFCC + VTS (Reverb. Mismatch)Subband − Multi−style (Reverb. Mismatch)Combination ( λ =0.5)Combination ( λ emp ) Fig. 7:
Comparison of the classification performances of the subbandand MFCC classifiers with their convex combination in the presenceof speech-babble noise under anechoic training (top) and reverberantmismatched training regimes (bottom). Results are shown for two differentsettings of the combination parameter, λ . This work primarily focused on comparison of different repre-sentations in terms of the robustness they provide. To this end,experiments were conducted on the TIMIT phoneme classificationtask. However, the results reported in this paper also have implicationsfor the construction of ASR systems. In future work, we plan toinvestigate extensions to the proposed technique in order to facilitatethe recognition of continuous speech. One straight-forward approachwould be to pre-process the speech signals using the combination ofthe subband and cepstral SVM classifiers, and error-correcting outputcodes and generate class-wise feature vectors for overlapping andextended frames of speech. These feature vectors can be extractedin a manner similar to the MFCC features. An HMM-based systemcan then be trained using these feature vectors for recognition ofcontinuous speech. Alternatively, the proposed technique can alsobe integrated with other approaches such as the hybrid phone-basedHMM-SVM architecture [42, 43] and the token-passing algorithm[44] for continuous speech recognition. In the former, a baselineHMM system would be required to perform a first pass throughthe test data and for each utterance, generate a set of possiblesegmentations into phonemes. The best segmentations can then re-scored by the combined SVM classifier to predict the final phonemesequence. This approach has provided improvements in recognitionperformance over HMM baselines on both small and large vocabularyrecognition tasks, even though the SVM classifiers were constructed solely from the cepstral representations [42, 43]. However, thisHMM-SVM hybrid solution can also limit the efficiency of SVMsdue to possible errors in the segmentation stage. On the other hand,a recognizer based solely on SVMs as discussed in [44] can alsoemployed which makes decisions at a frame level via SVMs anddetermines the chain of recognized phonemes and words using thetoken-passing algorithm. These extensions will be the subject of afuture study. R
REFERENCES

[1] S. B. Davis and P. Mermelstein, “Comparison of Parametric Representations for Monosyllabic Word Recognition in Continuously Spoken Sentences,” IEEE Trans. ASSP, vol. 28, pp. 357–366, 1980.
[2] H. Hermansky, “Perceptual Linear Predictive (PLP) Analysis of Speech,” J. Acoust. Soc. Amer., vol. 87, no. 4, pp. 1738–1752, April 1990.
[3] R. Lippmann, “Speech Recognition by Machines and Humans,” Speech Comm., vol. 22, no. 1, pp. 1–15, 1997.
[4] G. Miller and P. Nicely, “An Analysis of Perceptual Confusions among some English Consonants,” J. Acoust. Soc. Amer., vol. 27, no. 2, pp. 338–352, 1955.
[5] J. B. Allen, “How do humans process and recognize speech?” IEEE Trans. Speech and Audio Proc., vol. 2, no. 4, pp. 567–577, 1994.
[6] J. Sroka and L. Braida, “Human and Machine Consonant Recognition,” Speech Comm., vol. 45, no. 4, pp. 401–423, 2005.
[7] B. Meyer, M. Wächter, T. Brand, and B. Kollmeier, “Phoneme Confusions in Human and Automatic Speech Recognition,” Proc. INTERSPEECH, pp. 2740–2743, 2007.
[8] B. Atal, “Automatic Speech Recognition: a Communication Perspective,” Proc. ICASSP, pp. 457–460, 1999.
[9] S. D. Peters, P. Stubley, and J. Valin, “On the Limits of Speech Recognition in Noise,” Proc. ICASSP, pp. 365–368, 1999.
[10] H. Bourlard, H. Hermansky, and N. Morgan, “Towards Increasing Speech Recognition Error Rates,” Speech Comm., vol. 18, no. 3, pp. 205–231, 1996.
[11] K. K. Paliwal and L. D. Alsteris, “On the Usefulness of STFT Phase Spectrum in Human Listening Tests,” Speech Comm., vol. 45, no. 2, pp. 153–170, 2005.
[12] L. D. Alsteris and K. K. Paliwal, “Further Intelligibility Results from Human Listening Tests using the Short-Time Phase Spectrum,” Speech Comm., vol. 48, no. 6, pp. 727–736, 2006.
[13] O. Viikki and K. Laurila, “Cepstral Domain Segmental Feature Vector Normalization for Noise Robust Speech Recognition,” Speech Comm., vol. 25, pp. 133–147, 1998.
[14] C. Chen and J. Bilmes, “MVA Processing of Speech Features,” IEEE Trans. ASLP, vol. 15, no. 1, pp. 257–270, 2007.
[15] P. J. Moreno, B. Raj, and R. M. Stern, “A Vector Taylor Series Approach for Environment-Independent Speech Recognition,” Proc. ICASSP, pp. 733–736, 1996.
[16] ETSI standard doc., “Speech Processing, Transmission and Quality Aspects (STQ): Advanced Front-End Feature Extraction,” ETSI ES 202 050, 2002.
[17] J. Yousafzai, Z. Cvetković, P. Sollich, and B. Yu, “Combined Features and Kernel Design for Noise Robust Phoneme Classification Using Support Vector Machines,” to appear in IEEE Trans. ASLP, 2011.
[18] H. Fletcher, Speech and Hearing in Communication. New York: Van Nostrand, 1953.
[19] J. McAuley, J. Ming, D. Stewart, and P. Hanna, “Subband Correlation and Robust Speech Recognition,” IEEE Trans. Speech and Audio Proc., vol. 13, no. 5, pp. 956–964, 2005.
[20] J. Ming, P. Jancovic, and F. Smith, “Robust Speech Recognition Using Probabilistic Union Models,” IEEE Trans. Speech and Audio Proc., vol. 10, no. 6, pp. 403–414, Sep. 2002.
[21] P. McCourt, N. Harte, and S. Vaseghi, “Discriminative Multi-resolution Sub-band and Segmental Phonetic Model Combination,” IET Electronics Letters, vol. 36, no. 3, pp. 270–271, 2000.
[22] S. Thomas, S. Ganapathy, and H. Hermansky, “Recognition of Reverberant Speech Using Frequency Domain Linear Prediction,” IEEE Signal Process. Letters, vol. 15, pp. 681–684, 2008.
[23] S. Tibrewala and H. Hermansky, “Subband Based Recognition of Noisy Speech,” Proc. ICASSP, pp. 1255–1258, 1997.
[24] P. McCourt, S. Vaseghi, and N. Harte, “Multi-Resolution Cepstral Features for Phoneme Recognition across Speech Sub-Bands,” Proc. ICASSP, pp. 557–560, 1998.
[25] H. Bourlard and S. Dupont, “Subband-based Speech Recognition,” Proc. ICASSP, pp. 1251–1254, 1997.
[26] S. Okawa, E. Bocchieri, and A. Potamianos, “Multi-band Speech Recognition in Noisy Environments,” Proc. ICASSP, pp. 641–644, 1998.
[27] C. Cerisara, J.-P. Haton, J.-F. Mari, and D. Fohr, “A Recombination Model for Multi-band Speech Recognition,” Proc. ICASSP, vol. 2, pp. 717–720, 1998.
[28] J. Flanagan, J. Johnston, R. Zahn, and G. Elko, “Computer-Steered Microphone Arrays for Sound Transduction in Large Rooms,” J. Acoust. Soc. Amer., vol. 78, no. 11, pp. 1508–1518, 1985.
[29] M. Wu and D. Wang, “A Two-Stage Algorithm for One-Microphone Reverberant Speech Enhancement,” IEEE Trans. ASLP, vol. 14, pp. 774–784, 2006.
[30] H. Chang and J. Glass, “Hierarchical Large-Margin Gaussian Mixture Models for Phonetic Classification,” Proc. ASRU, pp. 272–275, 2007.
[31] D. Yu, L. Deng, and A. Acero, “Hidden Conditional Random Fields with Distribution Constraints for Phone Classification,” Proc. INTERSPEECH, pp. 676–679, 2009.
[32] F. Sha and L. K. Saul, “Large Margin Gaussian Mixture Modeling for Phonetic Classification and Recognition,” Proc. ICASSP, pp. 265–268, 2006.
[33] R. Rifkin, K. Schutte, M. Saad, J. Bouvrie, and J. Glass, “Noise Robust Phonetic Classification with Linear Regularized Least Squares and Second-Order Features,” Proc. ICASSP, pp. 881–884, 2007.
[34] A. Halberstadt and J. Glass, “Heterogeneous Acoustic Measurements for Phonetic Classification,” Proc. EuroSpeech, pp. 401–404, 1997.
[35] P. Clarkson and P. J. Moreno, “On the Use of Support Vector Machines for Phonetic Classification,” Proc. ICASSP, pp. 585–588, 1999.
[36] A. Gunawardana, M. Mahajan, A. Acero, and J. C. Platt, “Hidden Conditional Random Fields for Phone Classification,” Proc. INTERSPEECH, pp. 1117–1120, 2005.
[37] M. Layton and M. Gales, “Augmented Statistical Models for Speech Recognition,” Proc. ICASSP, pp. I-129–I-132, 2006.
[38] V. Pitsikalis and P. Maragos, “Analysis and Classification of Speech Signals by Generalized Fractal Dimension Features,” Speech Comm., vol. 51, pp. 1206–1223, 2009.
[39] S. Dusan, “On the Relevance of Some Spectral and Temporal Patterns for Vowel Classification,” Speech Comm., vol. 49, pp. 71–82, 2007.
[40] K. M. Indrebo, R. J. Povinelli, and M. T. Johnson, “Sub-banded Reconstructed Phase Spaces for Speech Recognition,” Speech Comm., vol. 48, no. 7, pp. 760–774, 2006.
[41] A. Halberstadt and J. Glass, “Heterogeneous Measurements and Multiple Classifiers for Speech Recognition,” Proc. ICSLP, pp. 995–998, 1998.
[42] A. Ganapathiraju, J. E. Hamaker, and J. Picone, “Applications of Support Vector Machines to Speech Recognition,” IEEE Trans. Signal Proc., vol. 52, no. 8, pp. 2348–2355, 2004.
[43] S. E. Krüger, M. Schafföner, M. Katz, E. Andelic, and A. Wendemuth, “Speech Recognition with Support Vector Machines in a Hybrid System,” Proc. INTERSPEECH, pp. 993–996, 2005.
[44] J. Padrell-Sendra, D. Martín-Iglesias, and F. Díaz-de-María, “Support Vector Machines for Continuous Speech Recognition,” Proc. EUSIPCO, 2006.
[45] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.
[46] N. Smith and M. Gales, “Speech Recognition using SVMs,” in Adv. Neural Inf. Process. Syst., vol. 14, 2002, pp. 1197–1204.
[47] J. Louradour, K. Daoudi, and F. Bach, “Feature Space Mahalanobis Sequence Kernels: Application to SVM Speaker Verification,” IEEE Trans. ASLP, vol. 15, no. 8, pp. 2465–2475, 2007.
[48] A. Sloin and D. Burshtein, “Support Vector Machine Training for Improved Hidden Markov Modeling,” IEEE Trans. Signal Proc., vol. 56, no. 1, pp. 172–188, 2008.
[49] T. Jaakkola and D. Haussler, “Exploiting Generative Models in Discriminative Classifiers,” in Adv. Neural Inf. Process. Syst., vol. 11, 1999, pp. 487–493.
[50] R. Solera-Ureña, D. Martín-Iglesias, A. Gallardo-Antolín, C. Peláez-Moreno, and F. Díaz-de-María, “Robust ASR using Support Vector Machines,” Speech Comm., vol. 49, no. 4, pp. 253–267, 2007.
[51] T. Dietterich and G. Bakiri, “Solving Multiclass Learning Problems via Error-Correcting Output Codes,” J. Artif. Intell. Res., vol. 2, pp. 263–286, 1995.
[52] R. Rifkin and A. Klautau, “In Defense of One-Vs-All Classification,” J. Mach. Learn. Res., vol. 5, pp. 101–141, 2004.
[53] M. Vetterli and J. Kovačević, Wavelets and Subband Coding. Englewood Cliffs, NJ: Prentice-Hall, 1995.
[54] S. Furui, “Speaker-Independent Isolated Word Recognition using Dynamic Features of Speech Spectrum,” IEEE Trans. ASSP, vol. 34, no. 1, pp. 52–59, 1986.
[55] T. Dietterich, “Ensemble Methods in Machine Learning,” Lecture Notes in Computer Science: Multiple Classifier Systems, pp. 1–15, 2000.
[56] L. Hansen and P. Salamon, “Neural Network Ensembles,” IEEE Trans. PAMI, vol. 12, no. 10, pp. 993–1001, 1990.
[57] D. Wolpert, “Stacked Generalization,” Neural Networks, vol. 5, no. 2, pp. 241–259, 1992.
[58] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, and N. Dahlgren, “TIMIT Acoustic-Phonetic Continuous Speech Corpus,” Linguistic Data Consortium, 1993.
[59] K. F. Lee and H. W. Hon, “Speaker-Independent Phone Recognition Using Hidden Markov Models,” IEEE Trans. ASSP, vol. 37, no. 11, pp. 1641–1648, 1989.
[60] R. F. Voss and J. Clarke, “1/f Noise in Music: Music from 1/f Noise,” J. Acoust. Soc. Amer., vol. 63, no. 1, pp. 258–263, 1978.
[61] B. J. West and M. Shlesinger, “The Noise in Natural Phenomena,” American Scientist, vol. 78, no. 1, pp. 40–45, 1990.
[62] P. Grigolini, G. Aquino, M. Bologna, M. Lukovic, and B. J. West, “A Theory of 1/f Noise in Human Cognition,” Physica A: Stat. Mech. and its Appl., vol. 388, no. 19, pp. 4192–4204, 2009.
[63] “The ICSI Meeting Recorder Project - Room Responses,” online web resource.
[64] IEEE Trans. ASLP, vol. 14, no. 1, pp. 43–49, 2006.
[65] R. Lippmann and E. A. Martin, “Multi-Style Training for Robust Isolated-Word Speech Recognition,” Proc. ICASSP, pp. 705–708, 1987.
[66] M. Gales and S. Young, “Robust Continuous Speech Recognition using Parallel Model Combination,” IEEE Trans. SAP, vol. 4, pp. 352–359, Sept. 1996.
[67] J. Yousafzai, Z. Cvetković, and P. Sollich, “Towards Robust Phoneme Classification with Hybrid Features,” Proc. ISIT, pp. 1643–1647, 2010.
[68] Y. Ephraim and D. Malah, “Speech Enhancement Using a Minimum Mean-Square Error Short-Time Spectral Amplitude Estimator,” IEEE Trans. ASSP, vol. ASSP-32, pp. 1109–1121, 1984.
[69] ——, “Speech Enhancement Using a Minimum Mean-Square Log-Spectral Amplitude Estimator,” IEEE Trans. ASSP, vol. ASSP-33, no. 2, pp. 443–445, 1985.