Phillip L. De Leon
New Mexico State University
Publications
Featured research published by Phillip L. De Leon.
International Conference on Acoustics, Speech, and Signal Processing | 2011
Phillip L. De Leon; Inma Hernaez; Ibon Saratxaga; Michael Pucher; Junichi Yamagishi
In this paper, we present new results from our research into the vulnerability of a speaker verification (SV) system to synthetic speech. We use an HMM-based speech synthesizer, which creates synthetic speech for a targeted speaker through adaptation of a background model, together with both GMM-UBM and support vector machine (SVM) SV systems. Using 283 speakers from the Wall Street Journal (WSJ) corpus, our SV systems have a 0.35% EER. When the systems are tested with synthetic speech generated from speaker models derived from the WSJ corpus, over 91% of the matched claims are accepted. We propose the use of relative phase shift (RPS) in order to detect synthetic speech and develop a GMM-based synthetic speech classifier (SSC). Using the SSC, we are able to correctly classify human speech in 95% of tests and synthetic speech in 88% of tests, thus significantly reducing the vulnerability.
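A minimal sketch of the GMM-based synthetic speech classifier idea described above, assuming feature extraction (e.g., RPS-based features) is done elsewhere; the feature arrays, mixture sizes, and decision threshold below are illustrative placeholders, not the authors' configuration:

```python
# Sketch: GMM-based synthetic speech classifier (SSC). One GMM models
# features from human speech, another models features from synthetic
# speech; a test utterance is labeled by comparing average per-frame
# log-likelihoods. The random arrays are placeholders for real features.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
human_feats = rng.normal(0.0, 1.0, size=(2000, 20))   # placeholder human features
synth_feats = rng.normal(0.5, 1.2, size=(2000, 20))   # placeholder synthetic features

gmm_human = GaussianMixture(n_components=16, covariance_type="diag").fit(human_feats)
gmm_synth = GaussianMixture(n_components=16, covariance_type="diag").fit(synth_feats)

def classify(utterance_feats):
    """Label an utterance by its average log-likelihood ratio."""
    llr = gmm_human.score(utterance_feats) - gmm_synth.score(utterance_feats)
    return "human" if llr > 0.0 else "synthetic"

print(classify(rng.normal(0.0, 1.0, size=(300, 20))))  # expect "human"
```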
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Phillip L. De Leon; Michael Pucher; Junichi Yamagishi; Inma Hernaez; Ibon Saratxaga
In this paper, we evaluate the vulnerability of speaker verification (SV) systems to synthetic speech. The SV systems are based on either the Gaussian mixture model–universal background model (GMM-UBM) or support vector machine (SVM) using GMM supervectors. We use a hidden Markov model (HMM)-based text-to-speech (TTS) synthesizer, which can synthesize speech for a target speaker using small amounts of training data through model adaptation of an average voice or background model. Although the SV systems have a very low equal error rate (EER), when tested with synthetic speech generated from speaker models derived from the Wall Street Journal (WSJ) speech corpus, over 81% of the matched claims are accepted. This result suggests vulnerability in SV systems and thus a need to accurately detect synthetic speech. We propose a new feature based on relative phase shift (RPS), demonstrate reliable detection of synthetic speech, and show how this classifier can be used to improve security of SV systems.
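For illustration, a small sketch of how a relative phase shift feature might be computed for a single voiced frame, following the standard definition ψ_k = φ_k − k·φ_1; the f0 value is assumed to come from a separate pitch tracker, and the projection-based harmonic phase estimate is a simplification:

```python
# Sketch: relative phase shift (RPS) for one voiced frame, given f0.
# The phase of each harmonic is estimated by projecting the windowed
# frame onto a complex exponential at k*f0; RPS then removes the linear
# phase contribution of the fundamental: psi_k = phi_k - k * phi_1.
import numpy as np

def rps(frame, f0, fs, n_harmonics=10):
    n = np.arange(len(frame))
    win = np.hanning(len(frame))
    phases = []
    for k in range(1, n_harmonics + 1):
        basis = np.exp(-2j * np.pi * k * f0 * n / fs)
        phases.append(np.angle(np.sum(win * frame * basis)))
    phases = np.array(phases)
    k = np.arange(1, n_harmonics + 1)
    return np.angle(np.exp(1j * (phases - k * phases[0])))  # wrapped to (-pi, pi]

# Example: synthetic voiced frame at f0 = 120 Hz with quadratic harmonic phases
fs, f0 = 16000.0, 120.0
t = np.arange(512) / fs
frame = sum(np.cos(2 * np.pi * k * f0 * t + 0.3 * k**2) for k in range(1, 6))
print(rps(frame, f0, fs, n_harmonics=5))
```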
International Conference on Acoustics, Speech, and Signal Processing | 2010
Phillip L. De Leon; Vijendra Raj Apsingekar; Michael Pucher; Junichi Yamagishi
In this paper, we investigate imposture using synthetic speech. Although this problem was first examined over a decade ago, dramatic improvements in both speaker verification (SV) and speech synthesis have renewed interest in it. We use an HMM-based speech synthesizer which creates synthetic speech for a targeted speaker through adaptation of a background model. We use two SV systems: a standard GMM-UBM system and a newer SVM-based system. Our results show that when the systems are tested with human speech, there are zero false acceptances and zero false rejections. However, when the systems are tested with synthesized speech, all claims for the targeted speaker are accepted while all other claims are rejected. We propose a two-step process for detecting synthesized speech in order to prevent this imposture. Overall, while SV systems have impressive accuracy, even with the proposed detector, high-quality synthetic speech will lead to an unacceptably high false acceptance rate.
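A rough sketch of the GMM-UBM verification decision underlying the above, assuming placeholder features; warm-starting EM from the UBM parameters stands in here for proper MAP adaptation, and the threshold is illustrative:

```python
# Sketch: GMM-UBM speaker verification. A universal background model
# (UBM) is trained on pooled background speech; a target-speaker model
# is derived from it (warm-started EM as a stand-in for MAP adaptation);
# a claim is accepted when the average-frame log-likelihood ratio
# exceeds a threshold. All features below are placeholders.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
background = rng.normal(size=(5000, 20))              # pooled background features
speaker_data = rng.normal(0.3, 1.0, size=(500, 20))   # target-speaker features

ubm = GaussianMixture(n_components=32, covariance_type="diag").fit(background)
spk = GaussianMixture(n_components=32, covariance_type="diag",
                      weights_init=ubm.weights_, means_init=ubm.means_,
                      precisions_init=ubm.precisions_, max_iter=5).fit(speaker_data)

def verify(test_feats, threshold=0.0):
    """Accept the identity claim if the LLR clears the threshold."""
    return (spk.score(test_feats) - ubm.score(test_feats)) > threshold

print(verify(rng.normal(0.3, 1.0, size=(300, 20))))   # expect True
```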
Handbook of Biometric Anti-Spoofing | 2014
Nicholas W. D. Evans; Tomi Kinnunen; Junichi Yamagishi; Zhizheng Wu; Federico Alegre; Phillip L. De Leon
Progress in the development of spoofing countermeasures for automatic speaker recognition is less advanced than equivalent work related to other biometric modalities. This chapter outlines the potential for even state-of-the-art automatic speaker recognition systems to be spoofed. While the use of a multitude of different datasets, protocols, and metrics complicates the meaningful comparison of different vulnerabilities, we review previous work related to impersonation, replay, speech synthesis, and voice conversion spoofing attacks. The chapter also presents an analysis of early work to develop spoofing countermeasures. The literature shows that there is significant potential for automatic speaker verification systems to be spoofed, that significant further work is required to develop generalised countermeasures, that there is a need for standard datasets, evaluation protocols, and metrics, and that greater emphasis should be placed on text-dependent scenarios.
IEEE Transactions on Audio, Speech, and Language Processing | 2016
Zhizheng Wu; Phillip L. De Leon; Ali Khodabakhsh; Simon King; Zhen-Hua Ling; Daisuke Saito; Bryan Stewart; Tomoki Toda; Mirjam Wester; Junichi Yamagishi
In this paper, we present a systematic study of the vulnerability of automatic speaker verification to a diverse range of spoofing attacks. We start with a thorough analysis of the spoofing effects of five speech synthesis and eight voice conversion systems, and the vulnerability of three speaker verification systems under those attacks. We then introduce a number of countermeasures to prevent spoofing attacks from both known and unknown attackers. Known attackers are spoofing systems whose output was used to train the countermeasures, while an unknown attacker is a spoofing system whose output was not available to the countermeasures during training. Finally, we benchmark automatic systems against human performance on both speaker verification and spoofing detection tasks.
Data Compression Conference | 2011
Laura E. Boucheron; Phillip L. De Leon; Steven Sandoval
In this paper, we propose a low bit-rate speech codec based on a hybrid scalar/vector quantization of the mel-frequency cepstral coefficients (MFCCs). We begin by showing that if a high-resolution mel-frequency cepstrum (MFC) is computed, good-quality speech reconstruction is possible from the MFCCs despite the lack of explicit phase information. By evaluating the contribution toward speech quality that individual MFCCs make and applying appropriate quantization, our results show that the perceptual evaluation of speech quality (PESQ) scores of the MFCC-based codec match those of the state-of-the-art MELPe codec at 600 bps and exceed those of the CELP codec at 2000–4000 bps coding rates. The main advantage of the proposed codec is in distributed speech recognition (DSR), since speech features based on MFCCs can be obtained directly from code words, thus eliminating additional decoding and feature-extraction stages.
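The reconstruction premise can be illustrated with librosa's MFCC inversion, which recovers a mel spectrogram from the MFCCs and fills in the missing phase via Griffin-Lim; the file name and parameter choices here are illustrative, not the codec's actual configuration:

```python
# Sketch: reconstruct speech from MFCCs alone. mfcc_to_mel inverts the
# DCT; mel_to_audio inverts the mel filterbank and estimates phase with
# Griffin-Lim. "speech.wav" is a placeholder input file.
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=8000)                     # placeholder file
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=40, n_mels=80)   # high-resolution MFC
mel = librosa.feature.inverse.mfcc_to_mel(mfcc, n_mels=80)
y_hat = librosa.feature.inverse.mel_to_audio(mel, sr=sr)
sf.write("speech_reconstructed.wav", y_hat, sr)
```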
International Conference on Digital Signal Processing | 2009
Aditi Akula; Vijendra Raj Apsingekar; Phillip L. De Leon
Speaker recognition systems tend to degrade if the training and testing conditions differ significantly. Such situations may arise from the use of different microphones, telephone and mobile handsets, or different acoustic conditions. Recently, the effect of room acoustics on speaker identification (SI) has been investigated, and it has been shown that a loss in accuracy results when using clean training signals and reverberated testing signals. Various techniques, such as dereverberation, the use of multiple microphones, and compensation, have been proposed to minimize the mismatch and thereby increase SI accuracy. In this paper, we propose to use a Gaussian mixture model–universal background model (GMM-UBM), with the previously proposed multiple-speaker-model approach, to compensate for the acoustic mismatch. Using this approach, SI accuracy improves over conventional GMM-based SI systems in the presence of room reverberation.
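A bare-bones sketch of GMM-based speaker identification as a point of reference; the paper's multiple-speaker-model approach would add an inner maximum over per-condition models for each speaker, and the features below are placeholders:

```python
# Sketch: GMM speaker identification. One GMM per enrolled speaker; the
# test utterance is assigned to the model with the highest average
# log-likelihood. Enrollment data here is synthetic placeholder data.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(2)
enrollment = {f"spk{i}": rng.normal(i * 0.2, 1.0, size=(800, 20)) for i in range(5)}
models = {s: GaussianMixture(n_components=8, covariance_type="diag").fit(x)
          for s, x in enrollment.items()}

def identify(test_feats):
    """Return the enrolled speaker whose model best explains the utterance."""
    return max(models, key=lambda s: models[s].score(test_feats))

print(identify(rng.normal(0.6, 1.0, size=(200, 20))))   # expect "spk3"
```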
Speech Communication | 2011
Vijendra Raj Apsingekar; Phillip L. De Leon
Among the various proposed score normalizations, T- and Z-norm are most widely used in speaker verification systems. The main idea in these normalizations is to reduce the variations in impostor scores in order to improve accuracy. These normalizations require selection of a set of cohort models or utterances in order to estimate the impostor score distribution. In this paper, we investigate basing this selection on recently proposed speaker model clusters (SMCs). We evaluate this approach using the NTIMIT and NIST-2002 corpora and compare against T- and Z-norm using other cohort selection methods. We also propose three new normalization techniques, Δ-, ΔT-, and TC-norm, which also use SMCs to estimate the normalization parameters. Our results show that we can lower the equal error rate and minimum decision cost function with fewer cohort models using SMC-based score normalization approaches.
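For reference, T-norm itself is a simple standardization of the claimed-speaker score by cohort score statistics; SMC-based selection changes only how the cohort is chosen. A minimal sketch with illustrative scores:

```python
# Sketch: T-norm score normalization. The test utterance is scored
# against a set of cohort (impostor) models, and the claimed-speaker
# score is standardized by the cohort score mean and deviation.
import numpy as np

def t_norm(claim_score, cohort_scores):
    """Standardize the claim score against the cohort score distribution."""
    mu, sigma = cohort_scores.mean(), cohort_scores.std()
    return (claim_score - mu) / sigma

cohort = np.array([-1.2, -0.8, -1.5, -0.9, -1.1])   # illustrative cohort scores
print(t_norm(0.4, cohort))                           # normalized claim score
```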
Power Systems Computation Conference | 2016
Milan Biswal; Yifan Hao; Phillip Chen; Sukumar M. Brahma; Huiping Cao; Phillip L. De Leon
Event identification is one of numerous applications being researched for phasor measurement unit (PMU) data. This application is intended to improve visualization of power system events and to support protection and control, including verification of relay operation to detect misoperations. This paper uses field data as well as simulation data to test a large variety of features, using two well-known classifiers on a common dataset, in order to find the most suitable features for disturbance data recorded by PMUs. The approach also uses data from only one PMU instead of data from multiple PMUs, as used by researchers so far, thus significantly reducing the data to be processed. It is shown that simple observation-based features capturing the shape and statistics of disturbance waveforms work better than some well-known features derived from domain transformations. The classification accuracy and speed achieved with these features are shown to be satisfactory and suitable for the intended applications.
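A sketch of the kind of observation-based shape and statistics features the paper favors for single-PMU disturbance records; the specific features, labels, and classifier below are plausible stand-ins, not the authors' exact feature set:

```python
# Sketch: simple shape/statistics features of a PMU disturbance
# waveform, followed by an off-the-shelf classifier. The random
# waveforms and event labels are placeholders.
import numpy as np
from scipy.stats import skew, kurtosis
from sklearn.ensemble import RandomForestClassifier

def disturbance_features(x):
    """Observation-based features: extrema, spread, shape, sharpest change."""
    dx = np.diff(x)
    return np.array([x.min(), x.max(), np.ptp(x), x.std(),
                     skew(x), kurtosis(x), np.abs(dx).max(),
                     np.argmax(np.abs(dx)) / len(x)])   # location of sharpest change

rng = np.random.default_rng(3)
X = np.array([disturbance_features(rng.normal(size=600)) for _ in range(100)])
y = rng.integers(0, 3, size=100)                        # placeholder event classes
clf = RandomForestClassifier(n_estimators=100).fit(X, y)
print(clf.predict(X[:5]))
```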
IEEE Automatic Speech Recognition and Understanding Workshop | 2015
Steven Sandoval; Phillip L. De Leon; Julie M. Liss
In recent work, we presented mathematical theory and algorithms for time-frequency analysis of non-stationary signals. In that work, we generalized the definition of the Hilbert spectrum by using a superposition of complex AM-FM components parameterized by instantaneous amplitude (IA) and instantaneous frequency (IF). Using our Hilbert spectral analysis (HSA) approach, the IA and IF estimates can be far more accurate at revealing underlying signal structure than prior approaches to time-frequency analysis. In this paper, we apply HSA to speech and compare the results to both narrowband and wideband spectrograms. We demonstrate how the AM-FM components, assumed to be intrinsic mode functions, align well with the energy concentrations of the spectrograms and highlight fine structure present in the Hilbert spectrum. As an example, we show never-before-seen intra-glottal pulse phenomena that are not readily apparent in other analyses. Such fine-scale analyses may have application in speech-based medical diagnosis and automatic speech recognition (ASR) for pathological speakers.
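The building block behind such analysis can be sketched for a single AM-FM component using the analytic signal; the paper's HSA operates on a superposition of such components after decomposition, which this sketch does not attempt:

```python
# Sketch: instantaneous amplitude (IA) and instantaneous frequency (IF)
# of one AM-FM component via the analytic signal. The test signal is a
# 3 Hz amplitude-modulated chirp whose IF sweeps 200 -> 300 Hz.
import numpy as np
from scipy.signal import hilbert

fs = 8000.0
t = np.arange(0, 1, 1 / fs)
x = (1 + 0.5 * np.cos(2 * np.pi * 3 * t)) * np.cos(2 * np.pi * (200 * t + 50 * t**2))

z = hilbert(x)                                   # analytic signal
ia = np.abs(z)                                   # instantaneous amplitude
phase = np.unwrap(np.angle(z))
inst_freq = np.diff(phase) * fs / (2 * np.pi)    # instantaneous frequency (Hz)
print(ia[:3], inst_freq[:3])
```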