Luis Buera
University of Zaragoza
Publications
Featured research published by Luis Buera.
IEEE Transactions on Audio, Speech, and Language Processing | 2007
Luis Buera; Eduardo Lleida; Antonio Miguel; Alfonso Ortega; Oscar Saz
In this paper, a set of feature vector normalization methods based on the minimum mean square error (MMSE) criterion and stereo data is presented. They include multi-environment model-based linear normalization (MEMLIN), polynomial MEMLIN (P-MEMLIN), multi-environment model-based histogram normalization (MEMHIN), and phoneme-dependent MEMLIN (PD-MEMLIN). These methods model the clean and noisy feature vector spaces using Gaussian mixture models (GMMs). Their objective is to learn a transformation between clean and noisy feature vectors associated with each pair of clean and noisy model Gaussians. The direct approach to learning the transformation uses stereo data, that is, noisy feature vectors and the corresponding clean feature vectors. In this paper, however, a non-stereo-data-based training procedure is also presented. The transformations can be modeled simply as a bias vector (MEMLIN), or by using a first-order polynomial (P-MEMLIN) or a nonlinear function based on histogram equalization (MEMHIN). Further improvements are obtained by using phoneme-dependent bias vector transformations (PD-MEMLIN). In PD-MEMLIN, the clean and noisy feature vector spaces are split into several phonemes, and each of them is modeled as a GMM. These methods achieve significant word error rate improvements over other methods with similar aims. The experimental results using the SpeechDat Car database show an average improvement in word error rate greater than 68% in all cases compared to the baseline when using the original clean acoustic models, and up to 83% when training acoustic models on the new normalized feature space.
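As a concrete illustration of the MMSE idea behind MEMLIN, the sketch below builds a single-environment toy in Python: all data, component parameters, and biases are invented, and the noisy-space GMM is assumed known rather than trained with EM as in the paper. It learns one bias vector per noisy Gaussian from stereo data and applies the posterior-weighted correction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stereo data: clean vectors from two Gaussian "components";
# noisy vectors are the clean ones shifted by a component-dependent bias.
n = 500
labels = rng.integers(0, 2, size=n)
means = np.array([[-3.0], [3.0]])
clean = means[labels] + rng.normal(scale=0.5, size=(n, 1))
true_bias = np.array([[1.5], [-2.0]])
noisy = clean + true_bias[labels] + rng.normal(scale=0.1, size=(n, 1))

# Noisy-space GMM parameters (assumed known here for simplicity).
gmm_means = means + true_bias        # noisy component means
gmm_var = 0.26                       # shared variance (0.5^2 + 0.1^2)

def posteriors(y):
    """p(s | y) for each noisy Gaussian s (equal priors)."""
    d2 = ((y[:, None, :] - gmm_means[None, :, :]) ** 2).sum(-1)
    logp = -0.5 * d2 / gmm_var
    logp -= logp.max(axis=1, keepdims=True)
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

# Training: per-Gaussian bias = posterior-weighted mean of (noisy - clean).
post = posteriors(noisy)
bias = (post[:, :, None] * (noisy - clean)[:, None, :]).sum(0)
bias /= post.sum(0)[:, None]

# MMSE normalization: x_hat = y - sum_s p(s|y) * bias_s
def normalize(y):
    return y - posteriors(y) @ bias

x_hat = normalize(noisy)
err_before = np.mean((noisy - clean) ** 2)
err_after = np.mean((x_hat - clean) ** 2)
print(err_before, err_after)  # MSE drops after normalization
```

The bias-vector form corresponds to basic MEMLIN; P-MEMLIN and MEMHIN would replace the per-Gaussian bias with a first-order polynomial or a histogram-equalization mapping.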
IEEE Transactions on Audio, Speech, and Language Processing | 2010
Luis Buera; Antonio Miguel; Oscar Saz; Alfonso Ortega; Eduardo Lleida
In this paper, an unsupervised data-driven robust speech recognition approach is proposed based on a joint feature vector normalization and acoustic model adaptation. Feature vector normalization reduces the acoustic mismatch between training and testing conditions by mapping the feature vectors towards the training space. Model adaptation modifies the parameters of the acoustic models to match the test space. However, since neither is optimal, both approaches use an intermediate space between training and testing spaces to map either the feature vectors or acoustic models. The joint optimization of both approaches provides a common intermediate space with a better match between normalized feature vectors and adapted acoustic models. In this paper, feature vector normalization is based on a minimum mean square error (MMSE) criterion. A class dependent multi-environment model linear normalization (CD-MEMLIN) based on two classes (silence/speech) with a cross probability model (CD-MEMLIN-CPM) is used. CD-MEMLIN-CPM assumes that each class of clean and noisy spaces can be modeled with a Gaussian mixture model (GMM), training a linear transformation for each pair of Gaussians in an unsupervised data-driven training process. This feature vector normalization maps the recognition space feature vector to a normalized space. The acoustic model adaptation maps the training space to the normalized space by defining a set of linear transformations over an expanded HMM-state space, compensating for those degradations that the feature vector normalization is not able to model, like rotations. Experiments have been carried out with the Spanish SpeechDat Car database and Aurora 2 databases using both the standard Mel-frequency cepstral coefficient (MFCC) and advanced ETSI front-ends. Consistent improvements were reached for both corpora and front-ends. 
Using the standard MFCC front-end, a 92.08% average improvement in WER was obtained for Spanish SpeechDat Car and a 69.75% average improvement for the clean-condition evaluation of Aurora 2, improving on the results achieved with the ETSI advanced front-end (83.28% and 67.41%, respectively). Using the ETSI advanced front-end with the proposed solution, a 75.47% average improvement was obtained for the clean-condition evaluation of the Aurora 2 database.
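The rotation-compensating model adaptation step can be sketched in a minimal way: assuming the normalized space differs from the training space only by a rotation (a toy assumption; the paper defines a set of transformations over an expanded HMM-state space), a linear transformation mapping one space onto the other is estimated by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy setup: normalized features still differ from the clean (training)
# space by a rotation that bias-style normalization cannot model.
theta = 0.3
R = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])
clean = rng.normal(size=(400, 2))
normalized = clean @ R.T  # rotated version of the clean features

# Unsupervised linear-regression estimate of the transform that maps the
# training space onto the normalized space.
A, *_ = np.linalg.lstsq(clean, normalized, rcond=None)

adapted = clean @ A  # adapted "model space"
print(np.allclose(adapted, normalized, atol=1e-8))  # True in this noiseless toy
```

In the actual system this transform adapts the acoustic model parameters, so decoding compares normalized frames against models moved into the same intermediate space.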
International Conference on Acoustics, Speech, and Signal Processing | 2004
Luis Buera; Eduardo Lleida; Antonio Miguel; Alfonso Ortega
A multi-environment adaptation technique based on minimum mean squared error estimation is proposed. MEMLIN (multi-environment model-based linear normalization) consists of a feature adaptation using stereo data and several predefined basic environments. The target of this algorithm is to learn the difference between clean and noisy feature vectors associated with a pair of Gaussians (one from a clean model and the other from a noisy model) for each basic environment. This knowledge, the associated Gaussians, the conditional probability between clean and noisy Gaussians, and the environment are the data used to compensate for the mismatch between clean and noisy vectors. This algorithm achieves substantial improvements over other techniques with similar aims. The experimental results with the SpeechDat Car database show an average improvement of more than 68% relative to the baseline across seven defined environments.
IEEE Transactions on Audio, Speech, and Language Processing | 2008
Antonio Miguel; Eduardo Lleida; Richard C. Rose; Luis Buera; Oscar Saz; Alfonso Ortega
The new model reduces the impact of local spectral and temporal variability by estimating a finite set of spectral and temporal warping factors which are applied to speech at the frame level. Optimum warping factors are obtained while decoding in a locally constrained search. The model involves augmenting the states of a standard hidden Markov model (HMM), providing an additional degree of freedom. It is argued in this paper that this represents an efficient and effective method for compensating local variability in speech which may have potential application to a broader array of speech transformations. The technique is presented in the context of existing methods for frequency warping-based speaker normalization for ASR. The new model is evaluated in clean and noisy task domains using subsets of the Aurora 2, the Spanish Speech-Dat-Car, and the TIDIGITS corpora. In addition, some experiments are performed on a Spanish language corpus collected from a population of speakers with a range of speech disorders. It has been found that, under clean or not severely degraded conditions, the new model provides improvements over the standard HMM baseline. It is argued that the framework of local warping is an effective general approach to providing more flexible models of speaker variability.
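A minimal sketch of the frame-level warping-factor search described above, with an invented smooth spectrum and a squared-distance score standing in for the HMM likelihood used in the actual constrained decoding:

```python
import numpy as np

def warp_spectrum(spec, alpha):
    """Resample a spectrum with frequency warping factor alpha."""
    n = len(spec)
    src = np.clip(np.arange(n) * alpha, 0, n - 1)
    return np.interp(src, np.arange(n), spec)

# Toy smooth spectrum with two formant-like bumps (invented for illustration).
freqs = np.arange(64.0)
reference = (np.exp(-((freqs - 20.0) ** 2) / 50.0)
             + 0.5 * np.exp(-((freqs - 45.0) ** 2) / 30.0))

# An "observed" frame from a speaker whose spectrum is stretched by 1.1.
observed = warp_spectrum(reference, 1 / 1.1)

# Finite set of candidate warping factors, searched per frame as in the
# augmented-state model.
alphas = np.arange(0.85, 1.16, 0.05)
scores = [np.sum((warp_spectrum(observed, a) - reference) ** 2)
          for a in alphas]
best = alphas[int(np.argmin(scores))]
print(round(best, 2))  # recovers the compensating warp, 1.1
```

In the paper the factor is chosen jointly with the state sequence inside the Viterbi search, rather than by an independent per-frame distance minimization as here.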
International Conference on Acoustics, Speech, and Signal Processing | 2008
Luis Buera; Jasha Droppo; Alex Acero
In this paper we present two new methods for speech enhancement based on the previously published fine pitch model (FPM) for voiced speech. The first method (FPM-NE) uses the FPM to produce a non-stationary noise estimate that can be used in any standard speech enhancement system; here, the FPM is used indirectly to perform speech enhancement. The second method (FPM-SE) uses the FPM directly to perform speech enhancement. We present a study of the behavior of the two models on the standard Aurora 2 task, and demonstrate an average word error rate reduction of over 45% relative to the multi-style baseline.
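As a hedged illustration of how a non-stationary noise estimate plugs into a standard enhancement system, the sketch below uses generic magnitude-domain spectral subtraction with an oracle noise track; it is not the FPM itself, which would derive the estimate from the fine pitch structure of voiced speech.

```python
import numpy as np

def spectral_subtract(noisy_mag, noise_mag, floor=0.05):
    """Magnitude-domain subtraction given a (possibly non-stationary)
    noise estimate, with a spectral floor to avoid negative values."""
    cleaned = noisy_mag - noise_mag
    return np.maximum(cleaned, floor * noisy_mag)

rng = np.random.default_rng(3)
speech = np.abs(np.sin(np.linspace(0, 20, 256)))  # toy magnitude envelope
noise = 0.3 + 0.1 * rng.random(256)               # non-stationary noise level
noisy = speech + noise

enhanced = spectral_subtract(noisy, noise)
print(np.mean((enhanced - speech) ** 2)
      < np.mean((noisy - speech) ** 2))  # True: closer to the clean envelope
```

Any estimator producing `noise_mag` per frame, such as FPM-NE, can drive this kind of back-end unchanged.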
IEEE Transactions on Audio, Speech, and Language Processing | 2011
Antonio Miguel; Alfonso Ortega; Luis Buera; Eduardo Lleida
Traditionally, in speech recognition, the hidden Markov model state emission probability distributions are associated with continuous random variables, using Gaussian mixtures. Complex multimodal inter-feature dependencies are not accurately captured by single Gaussians, which are unimodal distributions; mixtures of Gaussians can cover these complex cases, but only in a loose and inefficient way. Graphical models provide a precise and simple mechanism to model the dependencies among two or more variables. This paper proposes the use of discrete random variables as observations, with graphical models to extract the internal dependence structure in the feature vectors. Speech features are therefore quantized to a small number of levels in order to obtain a tractable model. These quantized speech features provide a mechanism to increase robustness against noise uncertainty. In addition, discrete random variables allow the learning of joint statistics of the observation densities. A method to estimate a graphical model with a constrained number of dependencies, a special kind of Bayesian network, is presented. Experimental results show that this modeling achieves better performance than standard baseline systems.
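A minimal sketch of the quantized-feature idea, assuming a two-dimensional toy and a chain-structured Bayesian network (one parent per feature); the data, number of levels, and smoothing are invented for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
L = 4  # small number of quantization levels

def quantize(x, levels=L):
    """Quantile-based quantization into a few discrete levels."""
    edges = np.quantile(x, np.linspace(0, 1, levels + 1)[1:-1])
    return np.digitize(x, edges)

# Toy 2-D feature vectors with inter-feature dependence.
f1 = rng.normal(size=2000)
f2 = 0.8 * f1 + 0.2 * rng.normal(size=2000)  # f2 depends on f1
q1, q2 = quantize(f1), quantize(f2)

# Chain Bayesian network p(q1) * p(q2 | q1), learned as smoothed counts.
p1 = np.bincount(q1, minlength=L) + 1.0
p1 /= p1.sum()
joint = np.ones((L, L))  # add-one smoothing
np.add.at(joint, (q1, q2), 1.0)
p2_given_1 = joint / joint.sum(axis=1, keepdims=True)

# Compare against a model that treats the two features as independent.
p2 = np.bincount(q2, minlength=L) + 1.0
p2 /= p2.sum()
dep = np.mean(np.log(p1[q1]) + np.log(p2_given_1[q1, q2]))
indep = np.mean(np.log(p1[q1]) + np.log(p2[q2]))
print(dep > indep)  # True: the dependency model fits correlated features better
```

The paper's estimation procedure selects which dependencies to keep under a constraint on their number; the fixed chain here is the simplest such structure.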
IEEE Automatic Speech Recognition and Understanding Workshop | 2007
Luis Buera; Antonio Miguel; Eduardo Lleida; Oscar Saz; Alfonso Ortega
An on-line unsupervised hybrid compensation technique is proposed to reduce the mismatch between training and testing conditions. It combines multi-environment model-based linear normalization with a cross-probability model based on GMMs (MEMLIN CPM) and a novel acoustic model adaptation method based on rotation transformations. A set of rotation transformations is estimated from clean and MEMLIN CPM-normalized training data by linear regression in an unsupervised process. In testing, each MEMLIN CPM-normalized frame is decoded using a modified Viterbi algorithm and expanded acoustic models, which are obtained from the reference models and the set of rotation transformations. To test the proposed solution, experiments with the Spanish SpeechDat Car database were carried out. MEMLIN CPM over standard ETSI front-end parameters reaches an average WER improvement of 83.89%, while the introduced hybrid solution raises this to 92.07%. The proposed hybrid technique was also tested on the Aurora 2 database, obtaining an average improvement of 68.88% with clean training.
IEEE Automatic Speech Recognition and Understanding Workshop | 2005
Luis Buera; Eduardo Lleida; Antonio Miguel; Alfonso Ortega
In a previous work, phoneme-dependent multi-environment model-based linear normalization, PD-MEMLIN, was presented and proved effective at compensating environment mismatch. Since PD-MEMLIN transformations have to be estimated from stereo data corpora, and the computational cost is high, two approaches are proposed: coefficient-progressive PD-MEMLIN, CPPD-MEMLIN, and blind PD-MEMLIN. The first consists of a partial normalization of the feature vector, reducing the computational cost, while blind PD-MEMLIN can be applied to any non-stereo data corpus; its transformations are estimated with an iterative technique from noisy data and a target clean speech model. Experiments with the SpeechDat Car database were carried out in order to study the behavior of the proposed techniques in a real acoustic environment. In the previous work, PD-MEMLIN with stereo data and normalizing 13 MFCC coefficients reached an improvement of 77.67%. In this paper, CPPD-MEMLIN with only 4 coefficients obtains an average improvement of 72.40%, and blind PD-MEMLIN obtains an average improvement of 73.96%.
Archive | 2007
Alfonso Ortega; Eduardo Lleida; Enrique Masgrau; Luis Buera; Antonio Miguel
This chapter presents a two-channel speech reinforcement system whose goal is to improve speech intelligibility inside cars. As the microphones pick up not only the voice of the speaker but also the reinforced speech coming from the loudspeakers, feedback paths appear. These feedback paths can make the system unstable, and acoustic echo cancellation is needed in order to avoid this. In a two-channel system, two system identifications must be performed for each channel: one open-loop and one closed-loop. Several methods have been proposed for echo suppression in open-loop systems such as hands-free systems. We propose the use of echo suppression filters specially designed for closed-loop subsystems, along with echo suppression filters for open-loop subsystems based on optimal filtering theory. The derivation of the optimal echo suppression filter needed in the closed-loop subsystem is also presented.
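The open-loop identification such a system must perform can be sketched with a standard NLMS adaptive filter; the feedback path, signals, and step size below are invented toys, and the chapter's closed-loop case is more involved.

```python
import numpy as np

rng = np.random.default_rng(5)

# Toy loudspeaker-to-microphone impulse response, used only to generate
# data; the canceller must identify it adaptively.
h = np.array([0.5, -0.3, 0.2, -0.1])
x = rng.normal(size=4000)            # reinforced loudspeaker signal
echo = np.convolve(x, h)[: len(x)]   # echo picked up by the microphone

# NLMS adaptive filter performing the open-loop identification.
w = np.zeros(len(h))
mu, eps = 0.5, 1e-6
for n in range(len(h), len(x)):
    xn = x[n - len(h) + 1 : n + 1][::-1]  # most recent input samples
    e = echo[n] - w @ xn                  # residual echo
    w += mu * e * xn / (xn @ xn + eps)    # normalized gradient step

print(np.round(w, 2))  # converges toward the true path h
```

With the echo estimate removed from the microphone signal, the residual can then be processed by the suppression filters the chapter derives.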
International Conference on Acoustics, Speech, and Signal Processing | 2006
Alfonso Ortega; Eduardo Lleida; Enrique Masgrau; Luis Buera; Antonio Miguel
This paper presents a two-channel speech reinforcement system for cars able to improve communication between the front and rear passengers. One of the problems of this kind of system is that it must operate in closed loop, as acoustic feedback paths appear due to the short distance between loudspeakers and microphones. These feedback paths can make the system unstable, and acoustic echo control is needed in order to ensure stability. The system must perform two plant identifications for each channel: one open-loop and one closed-loop. We propose the use of echo suppression filters specially designed for closed-loop subsystems, along with echo suppression filters for open-loop subsystems based on optimal filtering theory. Results on the performance of the proposed system are provided.