Publications


Featured research published by Oscar Saz.


IEEE Transactions on Audio, Speech, and Language Processing | 2007

Cepstral Vector Normalization Based on Stereo Data for Robust Speech Recognition

Luis Buera; Eduardo Lleida; Antonio Miguel; Alfonso Ortega; Oscar Saz

In this paper, a set of feature vector normalization methods based on the minimum mean square error (MMSE) criterion and stereo data is presented. They include multi-environment model-based linear normalization (MEMLIN), polynomial MEMLIN (P-MEMLIN), multi-environment model-based histogram normalization (MEMHIN), and phoneme-dependent MEMLIN (PD-MEMLIN). These methods model the clean and noisy feature vector spaces with Gaussian mixture models (GMMs), and their objective is to learn a transformation between clean and noisy feature vectors for each pair of clean and noisy model Gaussians. The direct way to learn the transformations is from stereo data, that is, noisy feature vectors and the corresponding clean feature vectors; in this paper, however, a training procedure that does not require stereo data is also presented. The transformations can be modeled as a bias vector (MEMLIN), a first-order polynomial (P-MEMLIN), or a nonlinear function based on histogram equalization (MEMHIN). Further improvements are obtained with phoneme-dependent bias vector transformations (PD-MEMLIN), where the clean and noisy feature vector spaces are split into phoneme classes and each class is modeled as a GMM. These methods achieve significant word error rate improvements over other approaches with similar aims. Experimental results on the SpeechDat Car database show an average word error rate improvement greater than 68% over the baseline in all cases when using the original clean acoustic models, and up to 83% when training acoustic models on the new normalized feature space.
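
As a rough illustration of the MEMLIN idea (not the authors' implementation), the sketch below fits GMMs to stereo clean/noisy features, learns a bias vector per pair of Gaussians, and removes the posterior-weighted bias at test time; the function names and settings are hypothetical.

```python
# Minimal MEMLIN-style sketch: bias vectors learned from stereo data.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_memlin(clean, noisy, n_components=8, seed=0):
    """clean, noisy: (T, D) time-aligned stereo feature vectors (e.g., MFCCs)."""
    gmm_noisy = GaussianMixture(n_components, random_state=seed).fit(noisy)
    gmm_clean = GaussianMixture(n_components, random_state=seed).fit(clean)
    p_n = gmm_noisy.predict_proba(noisy)                  # (T, Gn)
    p_c = gmm_clean.predict_proba(clean)                  # (T, Gc)
    w = np.einsum('tn,tc->tnc', p_n, p_c)                 # pair posteriors
    diff = noisy - clean                                  # (T, D)
    # Bias for each (noisy, clean) Gaussian pair: weighted mean difference.
    bias = np.einsum('tnc,td->ncd', w, diff) / (w.sum(0)[..., None] + 1e-10)
    cross = w.sum(0)
    cross /= cross.sum(1, keepdims=True) + 1e-10          # p(clean g | noisy g)
    return gmm_noisy, bias, cross

def normalize(noisy, gmm_noisy, bias, cross):
    """Estimate clean features by subtracting the expected bias."""
    p_n = gmm_noisy.predict_proba(noisy)
    return noisy - np.einsum('tn,nc,ncd->td', p_n, cross, bias)
```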


IEEE Transactions on Audio, Speech, and Language Processing | 2010

Unsupervised Data-Driven Feature Vector Normalization With Acoustic Model Adaptation for Robust Speech Recognition

Luis Buera; Antonio Miguel; Oscar Saz; Alfonso Ortega; Eduardo Lleida

In this paper, an unsupervised data-driven robust speech recognition approach is proposed, based on joint feature vector normalization and acoustic model adaptation. Feature vector normalization reduces the acoustic mismatch between training and testing conditions by mapping the feature vectors towards the training space, while model adaptation modifies the parameters of the acoustic models to match the test space. Since neither mapping is optimal on its own, both approaches use an intermediate space between the training and testing spaces, and their joint optimization provides a common intermediate space with a better match between normalized feature vectors and adapted acoustic models. The feature vector normalization is based on a minimum mean square error (MMSE) criterion: a class-dependent multi-environment model-based linear normalization (CD-MEMLIN) with two classes (silence/speech) and a cross-probability model (CD-MEMLIN-CPM). CD-MEMLIN-CPM assumes that each class of the clean and noisy spaces can be modeled with a Gaussian mixture model (GMM), training a linear transformation for each pair of Gaussians in an unsupervised data-driven process. This normalization maps the recognition-space feature vectors to a normalized space. The acoustic model adaptation then maps the training space to the normalized space by defining a set of linear transformations over an expanded HMM-state space, compensating for degradations that the feature vector normalization cannot model, such as rotations. Experiments were carried out on the Spanish SpeechDat Car and Aurora 2 databases using both the standard Mel-frequency cepstral coefficient (MFCC) and advanced ETSI front-ends, with consistent improvements for both corpora and front-ends. Using the standard MFCC front-end, a 92.08% average improvement in WER was obtained for Spanish SpeechDat Car and a 69.75% average improvement for the clean-condition evaluation of Aurora 2, surpassing the results obtained with the ETSI advanced front-end (83.28% and 67.41%, respectively). Using the ETSI advanced front-end with the proposed solution, a 75.47% average improvement was obtained for the clean-condition evaluation of Aurora 2.
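
For reference, the MMSE estimator that this family of normalizations approximates can be written in its generic MEMLIN form as below; the notation is assumed rather than quoted from the paper, and CD-MEMLIN-CPM further conditions each term on the silence/speech class and replaces the bias with a linear transform.

```latex
% Generic MEMLIN-style MMSE estimate (assumed notation): s_y and s_x index
% noisy and clean model Gaussians, r_{s_x,s_y} is the bias for the pair.
\hat{x} = \mathbb{E}\left[ x \mid y \right]
        \approx y - \sum_{s_y} \sum_{s_x} p(s_y \mid y)\, p(s_x \mid y, s_y)\, r_{s_x, s_y}
```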


Speech Communication | 2012

A prelingual tool for the education of altered voices

William Ricardo Rodríguez; Oscar Saz; Eduardo Lleida

This paper addresses the problem of computer-aided voice therapy for altered voices. The work proposes a set of free interactive activities, called PreLingua, for providing voice therapy to individuals with voice disorders. The activities train voice skills such as voice production, intensity, blow, vocal onset, phonation time, tone, and vocalic articulation for the Spanish language. Developing these interactive tools and the underlying speech technologies requires speech processing algorithms that are robust to the sources of speech variability characteristic of this population of speakers. One of the main problems addressed is how to reliably estimate formant frequencies in high-pitched speech (typical of children and women) and how to normalize these estimates independently of the characteristics of the speakers. Linear predictive coding, homomorphic analysis, and modeling of the vocal tract are the core of the speech processing techniques used to achieve such normalization through vocal tract length. The paper also presents the results of an experimental study in which PreLingua was applied to a population with voice disorders and pathologies in special education centers in Spain and Colombia. Promising results were obtained in this preliminary study after 12 weeks of therapy: a remarkable number of users improved their voice capabilities, as assessed by the educators' evaluations before and after the study and by the subjects' performance in the PreLingua activities. These results encourage further work in this direction, with the overall aim of adding functionality and robustness to the system.
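
The basic LPC formant analysis that such tools build on can be sketched as follows. This is the textbook autocorrelation method; the paper's contribution is making such estimates robust for high-pitched speech, which plain LPC handles poorly, and the parameter choices here are assumptions.

```python
# Sketch of LPC-based formant estimation (textbook autocorrelation method).
import numpy as np

def formants_lpc(frame, fs, order=None):
    """Estimate formant frequencies (Hz) of one windowed speech frame."""
    if order is None:
        order = 2 + fs // 1000                     # common rule of thumb
    frame = frame * np.hamming(len(frame))
    frame = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])  # pre-emphasis
    # Solve the Yule-Walker normal equations for the LPC coefficients.
    r = np.correlate(frame, frame, 'full')[len(frame) - 1:][:order + 1]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    # Formants are the resonances: angles of the complex roots of A(z).
    roots = np.roots(np.concatenate(([1.0], -a)))
    roots = roots[np.imag(roots) > 0]              # one of each conjugate pair
    freqs = np.angle(roots) * fs / (2 * np.pi)
    return np.sort(freqs[(freqs > 90) & (freqs < fs / 2 - 50)])
```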


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2008

E-inclusion technologies for the speech handicapped

Carlos Vaquero; Oscar Saz; Eduardo Lleida; William Ricardo Rodríguez

This paper addresses the problems that disabled people face when accessing the new systems and technologies available today. Speech technologies, especially helpful for motor-handicapped people, become unapproachable when these users also suffer speech impairments, widening the gap between them and society. As a way to include speech-impaired people in today's technological society, two lines of work have been carried out. On one hand, computer-aided speech therapy software has been developed for the speech training of children with different disabilities. This tool, available for free distribution, uses different state-of-the-art speech technologies to train different levels of language. As a result of this work, the software is currently used in several centers for special education, with very encouraging feedback about the capabilities of the system. On the other hand, research has been carried out on the use of automatic speech recognition (ASR) systems for the speech impaired, focusing on how current speaker adaptation techniques, used fruitfully in other tasks, cope with this specific kind of speech. Maximum A Posteriori (MAP) adaptation obtains a 60.61% improvement over the results of a baseline speaker-independent model.
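
For illustration, MAP adaptation of Gaussian means follows the standard relevance-factor recipe sketched below; this is the generic form, and tau and the statistics layout are assumptions rather than the paper's configuration.

```python
# Sketch of MAP mean adaptation for a GMM/HMM acoustic model.
import numpy as np

def map_adapt_means(means, posteriors, frames, tau=10.0):
    """Interpolate prior means with adaptation-data statistics.

    means:      (G, D) speaker-independent Gaussian means
    posteriors: (T, G) frame-level component posteriors on adaptation data
    frames:     (T, D) adaptation feature vectors
    tau:        relevance factor (hypothetical value)
    """
    n_g = posteriors.sum(axis=0)            # soft occupation count per Gaussian
    first = posteriors.T @ frames           # (G, D) weighted first-order sums
    # Large counts pull the mean toward the data; small counts keep the prior.
    return (tau * means + first) / (tau + n_g)[:, None]
```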


Ambient Media and Systems | 2008

Improving dialogue systems in a home automation environment

Raquel Justo; Oscar Saz; Víctor G. Guijarrubia; Antonio Miguel; M. Inés Torres; Eduardo Lleida

In this paper, a task of speech-based human-machine interaction is presented. The task consists of the use and control of a set of home appliances through a turn-based dialogue system. This work focuses on the first part of the dialogue system, the Automatic Speech Recognition (ASR) system, and pursues two lines of work to improve its performance. On one hand, the acoustic modeling required for ASR is improved via speaker adaptation techniques. On the other hand, the language modeling in the system is improved by the use of class-based language models. The results show the good performance of both techniques: the Word Error Rate (WER) drops from 5.81% to 0.99% with a close-talk microphone and from 14.53% to 1.52% with a lapel microphone. An important reduction is also achieved in the Category Error Rate (CER), which measures the ability of the ASR system to extract the semantic information of the uttered sentence, dropping from 6.13% and 15.32% to 1.29% and 1.32% for the two microphones used in the experiments.
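
A class-based language model factors word bigram probabilities through word classes, which lets a small home-automation corpus share statistics across, say, all appliance names. A minimal sketch, with a hypothetical class inventory:

```python
# Class-based bigram sketch: p(w_i | w_{i-1}) is factored as
# p(class(w_i) | class(w_{i-1})) * p(w_i | class(w_i)).
class ClassBigramLM:
    def __init__(self, word2class, class_bigram, word_given_class):
        self.word2class = word2class              # e.g. {"enciende": "ACTION"}
        self.class_bigram = class_bigram          # p(c_i | c_{i-1})
        self.word_given_class = word_given_class  # p(w | c)

    def prob(self, prev_word, word):
        c_prev = self.word2class[prev_word]
        c = self.word2class[word]
        return self.class_bigram[c_prev][c] * self.word_given_class[c][word]
```

Because counts are pooled at the class level, an appliance never seen after a given verb in training still receives a sensible probability.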


IEEE Transactions on Audio, Speech, and Language Processing | 2008

Capturing Local Variability for Speaker Normalization in Speech Recognition

Antonio Miguel; Eduardo Lleida; Richard C. Rose; Luis Buera; Oscar Saz; Alfonso Ortega

The new model reduces the impact of local spectral and temporal variability by estimating a finite set of spectral and temporal warping factors which are applied to speech at the frame level. Optimum warping factors are obtained while decoding in a locally constrained search. The model involves augmenting the states of a standard hidden Markov model (HMM), providing an additional degree of freedom. It is argued in this paper that this represents an efficient and effective method for compensating local variability in speech which may have potential application to a broader array of speech transformations. The technique is presented in the context of existing methods for frequency warping-based speaker normalization for ASR. The new model is evaluated in clean and noisy task domains using subsets of the Aurora 2, the Spanish SpeechDat Car, and the TIDIGITS corpora. In addition, some experiments are performed on a Spanish language corpus collected from a population of speakers with a range of speech disorders. It has been found that, under clean or not severely degraded conditions, the new model provides improvements over the standard HMM baseline. It is argued that the framework of local warping is an effective general approach to providing more flexible models of speaker variability.
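
A generic piecewise-linear frequency warp of the kind used in frequency-warping speaker normalization can be sketched as below; the paper's model searches a finite set of such factors per frame inside the decoder, which this standalone sketch does not reproduce, and the cut-off choice is an assumption.

```python
# Sketch: apply a frame-level warping factor to a magnitude spectrum.
import numpy as np

def warp_spectrum(mag, alpha, f_cut=0.85):
    """Warp a magnitude spectrum by factor alpha (e.g., 0.88..1.12).

    Frequencies below f_cut (fraction of Nyquist) are scaled by alpha;
    above it, the warp is linear so the band edge maps to itself.
    """
    n = len(mag)
    f = np.linspace(0.0, 1.0, n)                    # normalized frequency axis
    warped = np.where(
        f <= f_cut,
        alpha * f,
        alpha * f_cut + (1 - alpha * f_cut) / (1 - f_cut) * (f - f_cut),
    )
    # Resample the spectrum at the warped frequency positions.
    return np.interp(warped, f, mag)
```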


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2009

A study of pronunciation verification in a speech therapy application

Shou-Chun Yin; Richard C. Rose; Oscar Saz; Eduardo Lleida

Techniques are presented for detecting phoneme-level mispronunciations in utterances obtained from a population of children with speech impairments. The intended application is to use the resulting confidence measures to give patients feedback on the quality of their pronunciations during interactive speech therapy sessions. The pronunciation verification scenario involves presenting utterances of known words to a phonetic decoder and generating confusion networks from the resulting phone lattices. Confidence measures are derived from the posterior probabilities obtained from the confusion networks. Phoneme-level mispronunciation detection performance was significantly improved with respect to a baseline system by optimizing the acoustic and pronunciation models in the phonetic decoder and applying a nonlinear mapping to the confusion network posteriors.
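
A minimal sketch of the confidence computation, assuming the confusion network is available as one posterior distribution per slot and using a sigmoid as the nonlinear mapping; the paper's exact mapping and parameter values are not reproduced here.

```python
# Confusion-network confidence sketch for pronunciation verification.
import math

def phone_confidences(confusion_net, target_phones, a=8.0, b=0.5):
    """Score each canonical phone by its slot posterior.

    confusion_net: list of dicts, one per slot, phone -> posterior
    target_phones: canonical phone sequence of the prompted word
    a, b: parameters of the nonlinear mapping (hypothetical values)
    """
    scores = []
    for slot, phone in zip(confusion_net, target_phones):
        p = slot.get(phone, 0.0)                          # posterior of target
        scores.append(1.0 / (1.0 + math.exp(-a * (p - b))))  # sigmoid mapping
    return scores  # low scores flag likely mispronunciations
```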


EURASIP Journal on Advances in Signal Processing | 2009

Analysis of acoustic features in speakers with cognitive disorders and speech impairments

Oscar Saz; Javier Damián Simón; W.-Ricardo Rodríguez; Eduardo Lleida; Carlos Vaquero

This work presents the results of an analysis of the acoustic features (formants and the three suprasegmental features: tone, intensity, and duration) of vowel production in a group of 14 young speakers with different kinds of speech impairments due to physical and cognitive disorders. A corpus of unimpaired children's speech is used to determine reference values for these features within the same domain as the impaired speakers, namely 57 isolated words. The signal processing to extract formant and pitch values is based on a Linear Prediction Coefficient (LPC) analysis of the segments labeled as vowels by a Hidden Markov Model (HMM) based Viterbi forced alignment; intensity and duration are also derived from this automated segmentation. The main conclusion of the work is that the intelligibility of vowel production is lowered in impaired speakers even when the vowel is perceived as correct by human labelers. The decrease in intelligibility is due to a 30% increase in confusability in the formant map, a 50% reduction in the discriminative power of energy between stressed and unstressed vowels, and a 50% increase in the standard deviation of vowel length. On the other hand, impaired speakers keep good control of tone in the production of stressed and unstressed vowels.
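
The kind of per-vowel statistics behind a formant-map confusability analysis can be sketched as follows; the data layout is hypothetical, and the paper's own confusability measure is not reproduced.

```python
# Sketch: per-vowel centroid and spread in the F1-F2 plane.
import numpy as np

def vowel_formant_stats(formants, labels):
    """formants: (N, 2) array of (F1, F2) in Hz; labels: length-N vowel tags."""
    labels = np.array(labels)
    stats = {}
    for v in sorted(set(labels)):
        sel = formants[labels == v]
        stats[v] = {
            "mean": sel.mean(axis=0),   # centroid in the F1-F2 plane
            "std": sel.std(axis=0),     # spread; larger spread = more overlap
            "n": len(sel),
        }
    return stats
```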


Archive | 2008

Speech Technology Applied to Children with Speech Disorders

William Ricardo Rodríguez Dueñas; Carlos Vaquero; Oscar Saz; Eduardo Lleida

This paper introduces a software application for speech therapy in Spanish based on the use of speech technology. The objective of this work is to help children with different speech impairments improve their communication skills. Speech technology provides methods that can help children who suffer from speech disorders to develop pre-language and language skills. For pre-language development, the application works on speech activity detection and intensity control; for language, it works on three levels of language: phonological, semantic, and syntactic. The application is designed to let those who suffer from speech disorders train their communication capabilities in an easy and entertaining way. This tool is available for free distribution.
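
The pre-language activities rest on simple energy-based speech activity detection and intensity tracking, which can be sketched as below; the frame sizes and threshold are assumptions.

```python
# Sketch: frame-level intensity curve and energy-based voice activity.
import numpy as np

def frame_activity(signal, fs, frame_ms=25, hop_ms=10, thresh_db=-35.0):
    frame = int(fs * frame_ms / 1000)
    hop = int(fs * hop_ms / 1000)
    n = 1 + max(0, (len(signal) - frame) // hop)
    rms = np.array([
        np.sqrt(np.mean(signal[i * hop:i * hop + frame] ** 2) + 1e-12)
        for i in range(n)
    ])
    level_db = 20 * np.log10(rms / (np.max(rms) + 1e-12))  # level re: peak
    return level_db, level_db > thresh_db   # intensity curve, activity mask
```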


Odyssey 2016 | 2016

The Sheffield language recognition system in NIST LRE 2015.

Raymond W. M. Ng; Mauro Nicolao; Oscar Saz; Madina Hasan; Bhusan Chettri; Mortaza Doulaty; Tan Lee; Thomas Hain

The Speech and Hearing Research Group of the University of Sheffield submitted a fusion language recognition system to NIST LRE 2015. It combines three language classifiers: two acoustic-based classifiers, using i-vectors and a tandem DNN language recogniser respectively, and a phonotactic language recogniser. Two sets of training data, of approximately 170 and 300 hours, were composed for LR training. Using the larger training set, the primary Sheffield LR system gives a min DCF of 32.44 on the official LRE 2015 evaluation data. In a post-evaluation enhancement, i-vectors were extracted from the bottleneck features of an English DNN, reducing the min DCF to 29.20.
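
For context, a minimal sketch of a binary min DCF computation is shown below; the official LRE 2015 cost averages over target/non-target language pairs and differs in detail, so this is only the generic form with assumed cost parameters.

```python
# Sketch: minimum detection cost over all score thresholds.
import numpy as np

def min_dcf(scores, labels, p_target=0.5, c_miss=1.0, c_fa=1.0):
    """scores: detection scores; labels: 1 = target language, 0 = non-target."""
    order = np.argsort(scores)
    labels = np.asarray(labels)[order]
    n_tar, n_non = labels.sum(), (1 - labels).sum()
    # Sweep thresholds between sorted scores: misses accumulate from below,
    # false alarms shrink from above.
    p_miss = np.concatenate(([0], np.cumsum(labels))) / n_tar
    p_fa = np.concatenate(([n_non], n_non - np.cumsum(1 - labels))) / n_non
    dcf = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
    return dcf.min()
```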

Collaboration


Dive into Oscar Saz's collaborations.

Top Co-Authors

Luis Buera, University of Zaragoza
Thomas Hain, University of Sheffield
Madina Hasan, University of Sheffield