Hynek Boril
University of Texas at Dallas
Publications
Featured research published by Hynek Boril.
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2011
Hynek Boril; John H. L. Hansen
Adverse environments impact the performance of automatic speech recognition systems in two ways: directly, by introducing acoustic mismatch between the speech signal and acoustic models, and indirectly, by affecting the way speakers communicate in order to maintain intelligibility over noise (Lombard effect). An increasing number of studies have analyzed the Lombard effect with respect to speech production and perception, yet limited attention has been paid to its impact on speech systems, especially in a larger-vocabulary context. This study presents large vocabulary speech material captured in the recently acquired portion of the UT-Scope database, produced in several types and levels of simulated background noise (highway, crowd, pink). The impact of noisy background variations on speech parameters is studied together with the effects on automatic speech recognition. Front-end cepstral normalization utilizing a modified RASTA filter is proposed and shown to improve recognition performance in a side-by-side evaluation with several common and state-of-the-art normalization algorithms.
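The RASTA-style temporal filtering mentioned above can be sketched as follows. This is a minimal illustration of the conventional RASTA band-pass filter applied to a single cepstral coefficient trajectory, not the modified variant proposed in the paper; the pole value is the commonly quoted default.

```python
import numpy as np

def rasta_filter(c, pole=0.94):
    """Band-pass filter one cepstral trajectory (classic RASTA sketch).

    c: 1-D array, one cepstral coefficient over time frames.
    Transfer function: H(z) = 0.1*(2 + z^-1 - z^-3 - 2 z^-4) / (1 - pole*z^-1)
    The FIR part differentiates (suppressing slow channel effects); the
    single pole reintegrates, yielding a band-pass modulation filter.
    """
    num = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR differentiator
    out = np.zeros_like(c, dtype=float)
    for n in range(len(c)):
        acc = 0.0
        for k, b in enumerate(num):
            if n - k >= 0:
                acc += b * c[n - k]
        out[n] = acc + (pole * out[n - 1] if n > 0 else 0.0)
    return out
```

Because the filter has a zero at DC, a constant offset in the trajectory (e.g., a fixed channel mismatch) decays to zero at the output.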
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2014
Shabnam Ghaffarzadegan; Hynek Boril; John H. L. Hansen
This study focuses on acoustic variations in speech introduced by whispering, and proposes several strategies to improve robustness of automatic speech recognition of whispered speech with neutral-trained acoustic models. In the analysis part, differences in neutral and whispered speech captured in the UT-Vocal Effort II corpus are studied in terms of energy, spectral slope, and formant center frequency and bandwidth distributions in silence, voiced, and unvoiced speech signal segments. In the part dedicated to speech recognition, several strategies involving front-end filter bank redistribution, cepstral dimensionality reduction, and lexicon expansion for alternative pronunciations are proposed. The proposed neutral-trained system employing redistributed filter bank and reduced features provides a 7.7% absolute WER reduction over the baseline system trained on neutral speech, and a 1.3% reduction over a baseline system with whisper-adapted acoustic models.
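Filter bank redistribution, as referenced above, amounts to moving the center frequencies of the triangular analysis filters. A minimal sketch: the same triangular-filter constructor fed with standard mel-spaced centers versus an alternative spacing that concentrates filters at low frequencies. The square-law warping used for the alternative spacing is an illustrative assumption, not the paper's actual redistribution.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filterbank(centers_hz, n_fft, sr):
    """Triangular filters over FFT bins; centers_hz includes both band edges."""
    bins = np.floor((n_fft + 1) * np.asarray(centers_hz) / sr).astype(int)
    fb = np.zeros((len(centers_hz) - 2, n_fft // 2 + 1))
    for i in range(1, len(centers_hz) - 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        for b in range(l, c):          # rising slope
            fb[i - 1, b] = (b - l) / max(c - l, 1)
        for b in range(c, r):          # falling slope
            fb[i - 1, b] = (r - b) / max(r - c, 1)
    return fb

sr, n_fft, n_filt = 16000, 512, 20
# Standard mel spacing of the 22 edge/center points for 20 filters
mel_pts = mel_to_hz(np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_filt + 2))
# A "redistributed" variant: square-law warping packs filters low in frequency
low_pts = (sr / 2) * np.linspace(0, 1, n_filt + 2) ** 2
fb_mel = triangular_filterbank(mel_pts, n_fft, sr)
fb_low = triangular_filterbank(low_pts, n_fft, sr)
```

Applying either matrix to a power spectrum (matrix-vector product) yields the filter bank energies from which cepstra are computed.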
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2015
Shabnam Ghaffarzadegan; Hynek Boril; John H. L. Hansen
The lack of available large corpora of transcribed whispered speech is one of the major roadblocks for development of successful whisper recognition engines. Our recent study has introduced a Vector Taylor Series (VTS) approach to pseudo-whisper sample generation which requires availability of only a small number of real whispered utterances to produce large amounts of whisper-like samples from easily accessible transcribed neutral recordings. The pseudo-whisper samples were found particularly effective in adapting a neutral-trained recognizer to whisper. Our current study explores the use of denoising autoencoders (DAE) for pseudo-whisper sample generation. Two types of generative models are investigated - one which produces pseudo-whispered cepstral vectors on a frame basis and another which generates pseudo-whisper statistics of whole phone segments. It is shown that the DAE approach considerably reduces word error rates of the baseline system as well as the system adapted on real whisper samples. The DAE approach provides competitive results to the VTS-based method while cutting its computational overhead nearly in half.
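The frame-based generative model described above can be illustrated with a small numpy sketch: a one-hidden-layer network trained to map "neutral" feature frames to "whispered" counterparts. Everything here is a stand-in, not the paper's system: the data are synthetic (a fixed linear distortion plus noise rather than UT-Vocal Effort II recordings), and the architecture and hyperparameters are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy paired data: "whisper" frames simulated as a linear distortion of
# "neutral" frames plus noise (real neutral/whisper pairs play this role).
D, N, H = 13, 2000, 32
neutral = rng.normal(size=(N, D))
A_true = 0.6 * np.eye(D) + rng.normal(scale=0.1, size=(D, D))
whisper = neutral @ A_true + 0.1 * rng.normal(size=(N, D))

# One-hidden-layer DAE-style net: tanh encoder, linear decoder, plain
# full-batch gradient descent on mean-squared error.
W1 = rng.normal(scale=0.1, size=(D, H)); b1 = np.zeros(H)
W2 = rng.normal(scale=0.1, size=(H, D)); b2 = np.zeros(D)
lr = 0.1
for epoch in range(500):
    h = np.tanh(neutral @ W1 + b1)      # encode a neutral frame
    out = h @ W2 + b2                   # decode a pseudo-whisper frame
    err = out - whisper
    gW2 = h.T @ err / N; gb2 = err.mean(0)
    dh = (err @ W2.T) * (1.0 - h ** 2)  # backprop through tanh
    gW1 = neutral.T @ dh / N; gb1 = dh.mean(0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2

pseudo = np.tanh(neutral @ W1 + b1) @ W2 + b2
mse_model = float(((pseudo - whisper) ** 2).mean())
mse_identity = float(((neutral - whisper) ** 2).mean())
```

A trained model of this form can convert any transcribed neutral frame into a pseudo-whisper frame, which is the property the paper exploits for adaptation-data generation.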
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2012
Seyed Omid Sadjadi; Hynek Boril; John H. L. Hansen
The performance of automatic speech recognition is known to deteriorate in the presence of room reverberation and variation of vocal effort in speakers. This study considers robustness of several state-of-the-art front-end feature extraction and normalization strategies to these sources of speech signal variability in the context of large vocabulary continuous speech recognition (LVCSR). A speech database recorded in an anechoic room, capturing modal speech and speech produced at different levels of vocal effort, is reverberated using measured room impulse responses and utilized in the evaluations. It is shown that the combination of recently introduced mean Hilbert envelope coefficients (MHEC) and a normalization strategy combining cepstral gain normalization and modified RASTA filtering (CGN_RASTALP) provides considerable recognition performance gains for reverberant modal and high vocal effort speech.
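The core quantity behind MHEC-style features is the per-frame mean of the subband Hilbert envelope. A simplified numpy sketch: the Hilbert envelope is obtained from the FFT-based analytic signal, and brick-wall FFT bands stand in for the Gammatone filter bank used in the actual MHEC front end (an assumption for brevity; the framing parameters are also illustrative).

```python
import numpy as np

def analytic_envelope(x):
    """Hilbert envelope |x + j*H{x}| via the FFT analytic-signal method."""
    n = len(x)
    X = np.fft.fft(x)
    h = np.zeros(n)
    h[0] = 1.0
    if n % 2 == 0:
        h[n // 2] = 1.0
        h[1:n // 2] = 2.0      # double positive frequencies
    else:
        h[1:(n + 1) // 2] = 2.0
    return np.abs(np.fft.ifft(X * h))

def mean_envelope_features(x, sr, band_edges, frame=400, hop=160):
    """Per-frame mean subband Hilbert envelope (simplified MHEC-style
    front end: brick-wall FFT bands instead of a Gammatone bank)."""
    feats = []
    for lo, hi in band_edges:
        X = np.fft.rfft(x)
        f = np.fft.rfftfreq(len(x), 1.0 / sr)
        band = np.fft.irfft(np.where((f >= lo) & (f < hi), X, 0), n=len(x))
        env = analytic_envelope(band)
        feats.append([env[i:i + frame].mean()
                      for i in range(0, len(x) - frame + 1, hop)])
    return np.array(feats).T   # shape: (n_frames, n_bands)
```

Because the envelope tracks slow amplitude modulation rather than fine spectral structure, such features tend to degrade more gracefully under reverberation than conventional spectral magnitudes.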
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2010
Mahnoosh Mehrabani; Hynek Boril; John H. L. Hansen
Dialect variations of a language have a severe impact on the performance of speech systems. Therefore, knowing how close or diverse dialects are in a given language space provides useful information to predict, or improve, system performance when there is a mismatch between training and test data. Distance measures have been used in several applications of speech processing. However, apart from phonetic measures, little if any work has been done on dialect distance measurement. This study explores differences in pitch movement microstructure among dialects. A method of dialect distance assessment based on pitch patterns modeled progressively from pitch contour primitives is proposed. The presented method does not require any manual labeling and is text-independent. The KL divergence is employed to compare the resulting statistical models. The proposed scheme is evaluated on a corpus of Arabic dialects, and shown to be consistent with results from a spectral-based dialect classification system. Finally, a perceptual evaluation shows that the proposed objective approach correlates well with subjective distances.
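The model comparison step above uses KL divergence between the statistical models of pitch patterns. Assuming Gaussian-distributed pitch-pattern parameters (an illustrative simplification of the paper's models), the KL divergence has a closed form, and symmetrizing it yields a proper two-way distance:

```python
import numpy as np

def kl_gauss(mu0, var0, mu1, var1):
    """KL( N(mu0, var0) || N(mu1, var1) ) for univariate Gaussians."""
    return 0.5 * (np.log(var1 / var0) + (var0 + (mu0 - mu1) ** 2) / var1 - 1.0)

def symmetric_kl(mu0, var0, mu1, var1):
    """KL divergence is asymmetric; summing both directions gives a
    symmetric dissimilarity suitable as a dialect distance."""
    return kl_gauss(mu0, var0, mu1, var1) + kl_gauss(mu1, var1, mu0, var0)
```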
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013
Qian Zhang; Hynek Boril; John H. L. Hansen
Phonotactic modeling has become a widely used means for speaker, language, and dialect recognition. This paper explores variations to supervector pre-processing for phone recognition-support vector machine (PRSVM) based dialect identification. The aspects studied are: (i) normalization of supervector dimensions in the pre-squashing stage, (ii) impact of alternative squashing functions, and (iii) N-gram selection for supervector dimensionality reduction. In (i) and (ii), we find that several alternatives to commonly used approaches can provide moderate, yet consistent performance improvements. In (iii), a newly proposed dialect salience measure is applied in supervector dimension selection and compared to a common N-gram frequency based selection. The results show a strong correlation between dialect salience and N-gram frequency of occurrence. The evaluations in this study are conducted on a corpus of Chinese dialects, a Pan-Arabic corpus, and a set of Arabic CTS corpora.
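The supervector pipeline described above can be sketched in a few lines: raw N-gram counts become relative frequencies, a squashing function compresses their dynamic range, and the result is length-normalized for a linear-kernel SVM. The square-root squashing and the frequency-based dimension selection shown here are common baseline choices, not necessarily the paper's best-performing variants.

```python
import numpy as np

def ngram_supervector(counts, squash=np.sqrt):
    """Squashed, length-normalized N-gram supervector for one utterance.

    counts: 1-D array of N-gram occurrence counts. Square-root squashing
    is one common choice; the paper compares several alternatives.
    """
    p = counts / max(counts.sum(), 1)   # relative N-gram frequencies
    v = squash(p)                       # pre-kernel squashing
    return v / np.linalg.norm(v)        # unit length for a linear kernel

def select_top_ngrams(count_matrix, k):
    """Dimensionality reduction by keeping the k most frequent N-gram
    dimensions (the paper's dialect-salience measure is an alternative
    selection criterion). count_matrix: (utterances, n_ngrams)."""
    order = np.argsort(count_matrix.sum(axis=0))[::-1][:k]
    return np.sort(order)
```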
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2009
Hynek Boril; John H. L. Hansen
When exposed to environmental noise, speakers adjust their speech production to maintain intelligible communication. This phenomenon, called Lombard effect (LE), is known to considerably impact the performance of automatic speech recognition (ASR) systems. In this study, novel frequency and cepstral domain equalizations that reduce the impact of LE on ASR are proposed. Short-time spectra of LE speech are transformed towards neutral ASR models in a maximum likelihood fashion. Dynamics of cepstral coefficients are normalized to a constant range using quantile estimations. The algorithms are incorporated in a recognizer employing a codebook of noisy acoustic models. In a recognition task on connected Czech digits presented in various levels of background car noise, the resulting system provides an absolute reduction in word error rate (WER) on 10 dB SNR data of 8.7% and 37.7% for female neutral and LE speech, and of 8.7% and 32.8% for male neutral and LE speech when compared to the baseline system employing perceptual linear prediction (PLP) coefficients and cepstral mean and variance normalization.
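The quantile-based normalization of cepstral dynamics mentioned above can be sketched as follows: each cepstral dimension is rescaled so that its inter-quantile range matches a fixed target, counteracting the widened cepstral dynamics of Lombard speech. The specific quantiles and target are illustrative assumptions; the paper's exact scheme may differ.

```python
import numpy as np

def quantile_range_normalize(C, lo=0.05, hi=0.95, target=1.0):
    """Scale each cepstral dimension so its (lo, hi) quantile range equals
    `target`. C: (frames, dims) matrix of cepstral coefficients.
    Quantile estimates are more robust to outlier frames than min/max."""
    q_lo = np.quantile(C, lo, axis=0)
    q_hi = np.quantile(C, hi, axis=0)
    span = np.maximum(q_hi - q_lo, 1e-8)   # guard against degenerate dims
    return (C - q_lo) / span * target
```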
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2010
Sulyman Amuda; Hynek Boril; Abhijeet Sangwan; John H. L. Hansen
In this study, we introduce the UISpeech corpus which consists of Nigerian-accented English audio-visual data. The corpus captures the linguistic diversity of Nigeria with data collected from native speakers of Yoruba, Hausa, Igbo, Tiv, Fulani, and other languages. The UISpeech corpus comprises isolated word recordings and read speech utterances. The new corpus is intended to provide a unique opportunity to apply and expand speech processing techniques to a limited resource language. Acoustic-phonetic differences between American English (AE) and Nigerian English (NE) are studied in terms of pronunciation variations, vowel locations in the formant space, and distances between AE-trained acoustic models and models adapted to NE. A strong impact of the AE-NE acoustic mismatch on automatic speech recognition (ASR) is observed. A combination of model adaptation and extension of the AE lexicon with newly established NE pronunciation variants is shown to substantially improve performance of the AE-trained ASR system in the new NE task. This study represents a first step towards incorporating speech technology for Nigerian English.
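One simple way to quantify a distance between acoustic models, as referenced above, is the Bhattacharyya distance between the Gaussian components of two models. The use of Bhattacharyya distance here is an illustrative choice, not necessarily the measure used in the paper:

```python
import numpy as np

def bhattacharyya_gauss(mu0, cov0, mu1, cov1):
    """Bhattacharyya distance between two multivariate Gaussians,
    e.g., corresponding AE-trained and NE-adapted model components."""
    cov = 0.5 * (cov0 + cov1)
    diff = mu0 - mu1
    term_mean = 0.125 * diff @ np.linalg.solve(cov, diff)
    term_cov = 0.5 * np.log(np.linalg.det(cov) /
                            np.sqrt(np.linalg.det(cov0) * np.linalg.det(cov1)))
    return term_mean + term_cov
```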
IEEE Transactions on Audio, Speech, and Language Processing | 2010
Hynek Boril; John H. L. Hansen
International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013
Taufiq Hasan; Seyed Omid Sadjadi; Gang Liu; Navid Shokouhi; Hynek Boril; John H. L. Hansen