Xinhui Zhou
University of Maryland, College Park
Publications
Featured research published by Xinhui Zhou.
international conference on acoustics, speech, and signal processing | 2012
Daniel Garcia-Romero; Xinhui Zhou; Carol Y. Espy-Wilson
We present a multicondition training strategy for Gaussian Probabilistic Linear Discriminant Analysis (PLDA) modeling of i-vector representations of speech utterances. The proposed approach uses a multicondition set to train a collection of individual subsystems that are tuned to specific conditions. A final verification score is obtained by combining the individual scores according to the posterior probability of each condition given the trial at hand. The performance of our approach is demonstrated on a subset of the interview data of NIST SRE 2010. Significant robustness to the adverse noise and reverberation conditions included in the multicondition training set is obtained. The system is also shown to generalize to unseen conditions.
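A minimal sketch of the score combination described above: each condition-specific PLDA subsystem produces a score, and the final score is the posterior-weighted sum over conditions. The function names and the softmax-over-log-likelihoods condition model are illustrative assumptions, not the paper's exact implementation.

```python
# Minimal sketch of condition-posterior score fusion (names and the simple
# condition-likelihood model are illustrative, not the paper's exact setup).
import numpy as np


def fuse_scores(subsystem_scores, condition_log_likelihoods):
    """Combine per-condition PLDA scores weighted by the posterior of each condition.

    subsystem_scores: (n_conditions,) verification scores, one per condition-specific PLDA.
    condition_log_likelihoods: (n_conditions,) log-likelihood of the trial under each
        condition model (e.g., a Gaussian fit to that condition's training data).
    """
    # Posterior over conditions via a softmax of the log-likelihoods (uniform prior assumed).
    log_post = condition_log_likelihoods - np.logaddexp.reduce(condition_log_likelihoods)
    posteriors = np.exp(log_post)
    # Final score is the posterior-weighted sum of the individual subsystem scores.
    return float(np.dot(posteriors, subsystem_scores))


# Toy usage: three conditions (e.g., clean, noisy, reverberant).
scores = np.array([2.1, -0.3, 0.8])
loglik = np.array([-120.0, -118.5, -119.0])
print(fuse_scores(scores, loglik))
```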
ieee automatic speech recognition and understanding workshop | 2011
Xinhui Zhou; Daniel Garcia-Romero; Ramani Duraiswami; Carol Y. Espy-Wilson; Shihab A. Shamma
Mel-frequency cepstral coefficients (MFCC) have been dominantly used in speaker recognition as well as in speech recognition. However, based on theories in speech production, some speaker characteristics associated with the structure of the vocal tract, particularly the vocal tract length, are reflected more in the high frequency range of speech. This insight suggests that a linear scale in frequency may provide some advantages in speaker recognition over the mel scale. Based on two state-of-the-art speaker recognition back-end systems (one Joint Factor Analysis system and one Probabilistic Linear Discriminant Analysis system), this study compares the performance of MFCC and LFCC (linear frequency cepstral coefficients) in the NIST SRE (Speaker Recognition Evaluation) 2010 extended-core task. Our results in SRE10 show that, while they are complementary to each other, LFCC consistently outperforms MFCC, mainly due to its better performance in the female trials. This can be explained by the relatively shorter vocal tract in females and the resulting higher formant frequencies in speech. LFCC benefits more in female speech by better capturing the spectral characteristics in the high frequency region. In addition, our results show some advantage of LFCC over MFCC in reverberant speech. LFCC is as robust as MFCC in babble noise, but not in white noise. It is concluded that LFCC should be more widely used, at least for the female trials, by the mainstream speaker recognition community.
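The only difference between the two front-ends is the spacing of the triangular filterbank: mel-warped for MFCC, linear for LFCC; the rest of the pipeline (power spectrum, filterbank energies, log, DCT) is identical. The sketch below illustrates this contrast; the filter count, frame length, and sample rate are illustrative defaults, not the configuration used in the paper.

```python
# Sketch contrasting mel- vs. linear-spaced filterbanks for cepstral features.
# Hedged: filter counts and frame settings are illustrative, not the paper's configuration.
import numpy as np
from scipy.fft import dct


def triangular_filterbank(edge_hz, n_fft, sr):
    """Build triangular filters whose band edges are given in Hz."""
    bins = np.floor((n_fft + 1) * np.asarray(edge_hz) / sr).astype(int)
    fb = np.zeros((len(edge_hz) - 2, n_fft // 2 + 1))
    for i in range(1, len(edge_hz) - 1):
        l, c, r = bins[i - 1], bins[i], bins[i + 1]
        fb[i - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fb[i - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return fb


def cepstra(frame, sr=16000, n_filters=24, n_ceps=13, scale="mel"):
    """Compute MFCC- or LFCC-style coefficients for one frame; only the
    center-frequency spacing differs between the two feature types."""
    n_fft = len(frame)
    power = np.abs(np.fft.rfft(frame)) ** 2
    if scale == "mel":                      # mel-spaced band edges -> MFCC
        mel = np.linspace(0, 2595 * np.log10(1 + (sr / 2) / 700), n_filters + 2)
        edges = 700 * (10 ** (mel / 2595) - 1)
    else:                                   # linearly spaced band edges -> LFCC
        edges = np.linspace(0, sr / 2, n_filters + 2)
    fb = triangular_filterbank(edges, n_fft, sr)
    log_energy = np.log(fb @ power + 1e-10)
    return dct(log_energy, norm="ortho")[:n_ceps]


frame = np.random.randn(512)                # stand-in for a windowed speech frame
mfcc, lfcc = cepstra(frame, scale="mel"), cepstra(frame, scale="linear")
```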
international conference on acoustics, speech, and signal processing | 2013
Oldrich Plchot; Spyros Matsoukas; Pavel Matejka; Najim Dehak; Jeff Z. Ma; Sandro Cumani; Ondrej Glembek; Hynek Hermansky; Sri Harish Reddy Mallidi; Nima Mesgarani; Richard M. Schwartz; Mehdi Soufifar; Zheng-Hua Tan; Samuel Thomas; Bing Zhang; Xinhui Zhou
This paper describes the speaker identification (SID) system developed by the Patrol team for the first phase of the DARPA RATS (Robust Automatic Transcription of Speech) program, which seeks to advance state-of-the-art detection capabilities on audio from highly degraded communication channels. We present results using multiple SID systems differing mainly in the algorithm used for voice activity detection (VAD) and feature extraction. We show that (a) unsupervised VAD performs as well as supervised methods in terms of downstream SID performance, (b) noise-robust feature extraction methods such as CFCCs outperform MFCC front-ends on noisy audio, and (c) fusion of multiple systems provides 24% relative improvement in EER compared to the single best system when using a novel SVM-based fusion algorithm that uses side information such as gender, language, and channel ID.
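As a rough illustration of score-level fusion with side information (not a reproduction of the paper's SVM-based fusion algorithm), the sketch below appends one-hot encoded gender/language/channel attributes to the per-trial subsystem score vector and trains a linear SVM on synthetic trials; all data and category labels are made up for the example.

```python
# Simplified illustration of SVM score fusion with side information; a generic
# sketch of the idea using scikit-learn, not the paper's fusion algorithm.
import numpy as np
from sklearn.preprocessing import OneHotEncoder
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Toy data: per-trial scores from 3 SID subsystems plus categorical side information.
n_trials = 200
subsystem_scores = rng.normal(size=(n_trials, 3))
side_info = rng.choice(["f_eng_ch1", "m_eng_ch2", "f_spa_ch1"], size=(n_trials, 1))
labels = rng.integers(0, 2, size=n_trials)            # 1 = target trial, 0 = non-target

# Append one-hot encoded side information (gender/language/channel) to the score vector.
enc = OneHotEncoder()
features = np.hstack([subsystem_scores, enc.fit_transform(side_info).toarray()])

# A linear SVM then learns how to weight the subsystems given the trial's side information.
fusion = SVC(kernel="linear").fit(features, labels)
fused_scores = fusion.decision_function(features)     # fused verification scores
```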
international conference on acoustics, speech, and signal processing | 2009
Vikramjit Mitra; I. Yücel Özbek; Hosung Nam; Xinhui Zhou; Carol Y. Espy-Wilson
In this paper we present a technique for obtaining Vocal Tract (VT) time functions from the acoustic speech signal. Knowledge-based Acoustic Parameters (APs) are extracted from the speech signal and a pertinent subset is used to obtain the mapping between them and the VT time functions. Eight vocal tract constriction variables were considered in this study: five constriction degree variables, lip aperture (LA), tongue body (TBCD), tongue tip (TTCD), velum (VEL), and glottis (GLO), and three constriction location variables, lip protrusion (LP), tongue tip (TTCL), and tongue body (TBCL). The TAsk Dynamics Application model (TADA [1]) is used to create a synthetic speech dataset along with its corresponding VT time functions. We explore Support Vector Regression (SVR) followed by Kalman smoothing to achieve the mapping between the APs and the VT time functions.
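A minimal sketch of this pipeline: a support vector regressor predicts one vocal tract variable frame by frame from the APs, and a simple random-walk Kalman (RTS) smoother removes frame-to-frame jitter. The toy data, kernel settings, and noise variances below are assumptions for illustration only, not the paper's configuration.

```python
# Sketch: SVR mapping from acoustic parameters (APs) to one VT time function,
# followed by a random-walk Kalman filter + RTS smoother. All data are synthetic.
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(1)

# Toy training data: 500 frames of 10 acoustic parameters -> lip aperture (LA) trajectory.
T, n_aps = 500, 10
aps = rng.normal(size=(T, n_aps))
la_true = np.cumsum(rng.normal(scale=0.05, size=T))           # smooth articulator motion
la_obs = la_true + 0.3 * aps[:, 0]                             # tie the target to the APs

svr = SVR(kernel="rbf", C=10.0, epsilon=0.01).fit(aps, la_obs)
la_pred = svr.predict(aps)                                     # frame-wise SVR estimates


def rts_smooth(y, q=1e-3, r=1e-2):
    """Random-walk Kalman filter plus Rauch-Tung-Striebel smoother for a 1-D trajectory."""
    n = len(y)
    xf, pf = np.zeros(n), np.zeros(n)                          # filtered mean / variance
    x, p = y[0], 1.0
    for t in range(n):
        p = p + q                                              # predict
        k = p / (p + r)                                        # Kalman gain
        x = x + k * (y[t] - x)                                 # update with SVR output
        p = (1 - k) * p
        xf[t], pf[t] = x, p
    xs = xf.copy()
    for t in range(n - 2, -1, -1):                             # backward RTS pass
        g = pf[t] / (pf[t] + q)
        xs[t] = xf[t] + g * (xs[t + 1] - xf[t])
    return xs


la_smooth = rts_smooth(la_pred)                                # smoothed VT time function
```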
Journal of the Acoustical Society of America | 2004
Xinhui Zhou; Zhaoyan Zhang; Carol Y. Espy-Wilson
A Matlab-based computer program for vocal tract acoustic response calculation (VTAR) has been developed. Based on a frequency-domain vocal tract model [Z. Zhang and C. Espy-Wilson, J. Acoust. Soc. Am. (2004)], VTAR is able to model various complex sounds such as nasals, rhotics, and liquids. With input in the form of vocal tract cross-sectional area functions, VTAR calculates the vocal tract acoustic response function and the formant frequencies and bandwidths. The user-friendly interface allows direct data input for defined categories: vowels, nasals, nasalized sounds, consonants, laterals, and rhotics. The program also provides an interface for input and modification of arbitrary vocal tract geometry configurations, which is ideal for research applications. [Work supported by NIH Grant 1 R01 DC05250-01.]
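VTAR itself is a Matlab program built on a more elaborate frequency-domain model (with losses and side branches for nasals and laterals); the sketch below shows only the textbook lossless concatenated-tube version of the same idea, computing an acoustic response from an area function via chain (transmission) matrices. The two-tube area function and the constants are illustrative, not taken from VTAR.

```python
# Minimal lossless chain-matrix sketch of a vocal tract acoustic response computed
# from a cross-sectional area function. A textbook simplification, not VTAR itself.
import numpy as np

RHO, C = 1.14e-3, 3.5e4          # air density (g/cm^3) and sound speed (cm/s)


def vocal_tract_response(areas_cm2, lengths_cm, freqs_hz):
    """|U_lips / U_glottis| for a lossless tube model with an ideal open lip end."""
    h = np.zeros_like(freqs_hz, dtype=float)
    for i, f in enumerate(freqs_hz):
        k = 2 * np.pi * f / C                        # wavenumber
        # Chain matrix from glottis to lips: product of per-section matrices.
        K = np.eye(2, dtype=complex)
        for a, l in zip(areas_cm2, lengths_cm):
            z = RHO * C / a                          # characteristic impedance of section
            m = np.array([[np.cos(k * l), 1j * z * np.sin(k * l)],
                          [1j * np.sin(k * l) / z, np.cos(k * l)]])
            K = K @ m
        # With lip pressure ~ 0 (ideal open termination): U_lips / U_glottis = 1 / K[1, 1].
        h[i] = np.abs(1.0 / K[1, 1])
    return h


# Two-tube /a/-like configuration: narrow pharyngeal cavity, wide oral cavity.
areas = np.array([1.0] * 9 + [7.0] * 8)              # cm^2 per 1-cm section
lengths = np.ones_like(areas)                         # 1 cm sections, ~17 cm total
freqs = np.arange(50, 5000, 10, dtype=float)
response = vocal_tract_response(areas, lengths, freqs)
# Formant estimates = local maxima of the response curve.
peaks = freqs[1:-1][(response[1:-1] > response[:-2]) & (response[1:-1] > response[2:])]
print(peaks[:4])
```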
Journal of the Acoustical Society of America | 2013
Xinhui Zhou; Jonghye Woo; Maureen Stone; Jerry L. Prince; Carol Y. Espy-Wilson
Magnetic resonance imaging has been widely used in speech production research. Often only one image stack (sagittal, axial, or coronal) is used for vocal tract modeling. As a result, complementary information from other available stacks is not utilized. To overcome this, a recently developed super-resolution technique was applied to integrate three orthogonal low-resolution stacks into one isotropic volume. The results on vowels show that the super-resolution volume produces better vocal tract visualization than any of the low-resolution stacks. Its derived area functions generally produce formant predictions closer to the ground truth, particularly for those formants sensitive to area perturbations at constrictions.
international conference on acoustics, speech, and signal processing | 2012
Daniel Garcia-Romero; Xinhui Zhou; Dmitry N. Zotkin; Balaji Vasan Srinivasan; Yuancheng Luo; Sriram Ganapathy; Samuel Thomas; Sridhar Krishna Nemala; Garimella S. V. S. Sivaram; Majid Mirbagheri; Sri Harish Reddy Mallidi; Thomas Janu; Padmanabhan Rajan; Nima Mesgarani; Mounya Elhilali; Hynek Hermansky; Shihab A. Shamma; Ramani Duraiswami
In recent years, there have been significant advances in the field of speaker recognition that have resulted in very robust recognition systems. The primary focus of many recent developments has shifted to the problem of recognizing speakers in adverse conditions, e.g., in the presence of noise or reverberation. In this paper, we present the UMD-JHU speaker recognition system applied to the NIST 2010 SRE task. The novel aspects of our system are: 1) Improved performance on trials involving different vocal effort via the use of linear-scale features; 2) Expected improved recognition performance in the presence of reverberation and noise via the use of frequency domain perceptual linear predictor and cortical features; 3) A new discriminative kernel partial least squares (KPLS) framework that complements the state-of-the-art JFA and PLDA back-ends to aid in better overall recognition; and 4) Acceleration of the JFA, PLDA and KPLS back-ends via distributed computing. The individual components of the system and the fused system are compared against a baseline JFA system and results reported by SRI and MIT-LL on SRE 2010.
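As a rough stand-in for the discriminative KPLS back-end (the paper's exact formulation is not reproduced here), the sketch below applies scikit-learn's linear PLS regression to an RBF kernel matrix computed over synthetic "i-vectors", which is the basic kernel-PLS construction: replace the raw feature matrix with a kernel matrix and regress toward the target labels. All data, dimensions, and the kernel width are assumptions.

```python
# Hedged sketch of kernel partial least squares (KPLS) scoring on synthetic i-vectors:
# linear PLS regression applied to an RBF kernel matrix rather than raw features.
import numpy as np
from sklearn.cross_decomposition import PLSRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(2)

# Synthetic "i-vectors": 400 training vectors of dimension 100, binary target labels.
X_train = rng.normal(size=(400, 100))
y_train = rng.integers(0, 2, size=400).astype(float)
X_test = rng.normal(size=(50, 100))

# KPLS = ordinary PLS regression applied to the kernel matrix instead of raw features.
K_train = rbf_kernel(X_train, X_train, gamma=1e-2)
K_test = rbf_kernel(X_test, X_train, gamma=1e-2)

kpls = PLSRegression(n_components=10).fit(K_train, y_train)
scores = kpls.predict(K_test).ravel()      # higher score -> more target-like trial
```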
Journal of the Acoustical Society of America | 2007
Xinhui Zhou; Carol Y. Espy-Wilson; Mark Tiede; Suzanne Boyce
The North American rhotic liquid has two well‐known and maximally distinct articulatory variants, the classic retroflex tongue posture and the classic bunched tongue posture. The evidence for acoustic difference between them is reexamined using magnetic resonance images of the vocal tracts from two similar‐sized subjects with different tongue postures of sustained /r/. Three‐dimensional finite element analysis is performed to investigate the acoustic wave propagation property inside the vocal tract, the acoustic response, and the area function extraction based on pressure isosurfaces. Sensitivity functions are studied for formant‐cavity affiliation. It is revealed that these two variants have similar patterns of F1–F3 and zero frequency. However, the retroflex variant is predicted to have a larger difference between F4 and F5 than the bunched one (1400 Hz versus 700 Hz). This difference can be explained by the geometric differences between them, in particular, the shorter, narrower, and more forward palat...
international conference on acoustics, speech, and signal processing | 2010
Xinhui Zhou; Carol Y. Espy-Wilson; Mark Tiede; Suzanne Boyce
The production of lateral sounds generally involves a linguo-alveolar contact and one or two lateral channels along the parasagittal sides of the tongue. The acoustic effect of these articulatory features is not clearly understood. In this study, we compare two productions of /l/ in American English by one subject, one for a dark /l/ and the other for a light /l/. Three-dimensional vocal tract models derived from the magnetic resonance images were analyzed. It was shown that zeros in the vocal tract acoustic response are produced in the F3-F5 region in both /l/ productions, but the number of zeros and their frequencies are affected by the length of the linguo-alveolar contact and by the presence or absence of lateral linguopalatal contacts. The dark /l/ has one zero below 5 kHz, produced by the cross mode posterior to the linguo-alveolar contact, while the light /l/ has three zeros below 5 kHz, produced by the asymmetrical lateral channels, the supralingual cavity, and the cross mode posterior to the linguo-alveolar contact.
Journal of the Acoustical Society of America | 2016
Carol Y. Espy-Wilson; Xinhui Zhou; Ali Akrouf
In general, the commercial evaluation of speech for telecommunication is focused on what is called “speech quality” rather than intelligibility per se. This talk will discuss the current state of the art in the context of the objective test model 3QUEST (3-fold Quality Evaluation of Speech in Telecommunications) developed by HEAD acoustics. 3QUEST is an objective evaluation of transmitted speech with background noise, and we will discuss the new ETSI standard that it is based on. The particular improvement relative to previous objective methods (PESQ and POLQA) is that the influence of different background noises is taken into account and that three mean opinion score (MOS) values are calculated, one each for the speech-only regions of the signal (SMOS), the noise-only regions of the signal (NMOS), and the total signal (GMOS). These scores allow for a more meaningful statement about the causes of the impression of quality. Using examples, we will discuss how signal noise, intelligibility, and quality are interrelated as signals are affected by various parameters.