Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Jason W. Pelecanos is active.

Publication


Featured research published by Jason W. Pelecanos.


International Conference on Acoustics, Speech, and Signal Processing | 2011

Feature normalization for speaker verification in room reverberation

Sriram Ganapathy; Jason W. Pelecanos; Mohamed Kamal Omar

The performance of a typical speaker verification system degrades significantly in reverberant environments. This degradation is partly due to the conventional feature extraction/compensation techniques that use analysis windows which are much shorter than typical room impulse responses. In this paper, we present a feature extraction technique which estimates long-term envelopes of speech in narrow sub-bands using frequency domain linear prediction (FDLP). When speech is corrupted by reverberation, the long-term sub-band envelopes are convolved in time with those of the room impulse response function. To a first-order approximation, gain normalization of these envelopes in the FDLP model suppresses the room reverberation artifacts. Experiments are performed on the 8 core conditions of the NIST 2008 speaker recognition evaluation (SRE). In these experiments, the FDLP features provide significant improvements on the interview microphone conditions (relative improvements of 20–30%) over the corresponding baseline system with MFCC features.
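The gain-normalization idea lends itself to a compact sketch. Below is a minimal, illustrative FDLP envelope estimator in Python, assuming a single sub-band signal `x`; the model order, the frequency grid, and the omission of the paper's sub-band decomposition and long analysis windows are all simplifications, not the authors' implementation.

```python
# Illustrative FDLP sketch (not the authors' implementation): linear
# prediction applied to the DCT of a signal yields an all-pole model
# of its temporal envelope; fixing the model gain to unity normalizes
# away a multiplicative (reverberation-like) scaling of that envelope.
import numpy as np
from scipy.fftpack import dct
from scipy.linalg import solve_toeplitz

def fdlp_envelope(x, order=40):
    """Gain-normalized FDLP temporal envelope of a 1-D sub-band signal."""
    c = dct(x, norm='ortho')                                 # frequency-domain view
    r = np.correlate(c, c, 'full')[len(c)-1:len(c)+order]    # autocorrelation, lags 0..order
    a = solve_toeplitz(r[:order], -r[1:order+1])             # Yule-Walker LP coefficients
    a = np.concatenate(([1.0], a))
    # Evaluate |1/A|^2 on a uniform grid; the unit numerator (instead of
    # the estimated LP gain) is the gain normalization step.
    grid = np.exp(-1j * np.outer(np.arange(order + 1),
                                 np.linspace(0.0, np.pi, len(x))))
    return 1.0 / np.abs(a @ grid) ** 2
```

Setting the gain to unity, rather than keeping the estimated value, is what discards the envelope scaling that reverberation introduces, to a first-order approximation.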


International Conference on Acoustics, Speech, and Signal Processing | 2010

A novel approach to detecting non-native speakers and their native language

Mohamed Kamal Omar; Jason W. Pelecanos

Speech contains valuable information regarding the traits of speakers. This paper investigates two aspects of this information. The first is automatic detection of non-native speakers and their native language on relatively large data sets. We present several experiments which show how our system outperforms the best published results on both the Fisher database and the foreign-accented English (FAE) database for detecting non-native speakers and their native language respectively. Such performance is achieved by using an SVM-based classifier with ASR-based features integrated with a novel universal background model (UBM) obtained by clustering the Gaussian components of an ASR acoustic model. The second aspect of this work is to utilize the detected speaker characteristics within a speaker recognition system to improve its performance.
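As a hedged sketch of the UBM construction described above, one could cluster the Gaussian means of an ASR acoustic model with k-means and train an SVM on per-utterance features; `asr_means` and the random arrays below are placeholders, not the paper's data or pipeline.

```python
# Sketch only: UBM components from k-means over ASR Gaussian means,
# plus a linear SVM for native/non-native detection. All inputs are
# stand-ins for illustration.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import SVC

def cluster_ubm(asr_means, n_components=512):
    """Collapse an ASR acoustic model's Gaussian means into UBM centers."""
    km = KMeans(n_clusters=n_components, n_init=4, random_state=0).fit(asr_means)
    return km.cluster_centers_

X = np.random.randn(200, 512)       # one fixed-length vector per utterance (placeholder)
y = np.random.randint(0, 2, 200)    # 1 = non-native, 0 = native (placeholder labels)
clf = SVC(kernel='linear').fit(X, y)
```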


arXiv: Sound | 2016

The IBM 2016 Speaker Recognition System

Seyed Omid Sadjadi; Sriram Ganapathy; Jason W. Pelecanos

In this paper we describe the recent advancements made in the IBM i-vector speaker recognition system for conversational speech. In particular, we identify key techniques that contribute to significant improvements in performance of our system, and quantify their contributions. The techniques include: 1) a nearest-neighbor discriminant analysis (NDA) approach that is formulated to alleviate some of the limitations associated with the conventional linear discriminant analysis (LDA) that assumes Gaussian class-conditional distributions, 2) the application of speaker- and channel-adapted features, which are derived from an automatic speech recognition (ASR) system, for speaker recognition, and 3) the use of a deep neural network (DNN) acoustic model with a large number of output units (~10k senones) to compute the frame-level soft alignments required in the i-vector estimation process. We evaluate these techniques on the NIST 2010 speaker recognition evaluation (SRE) extended core conditions involving telephone and microphone trials. Experimental results indicate that: 1) the NDA is more effective (up to 35% relative improvement in terms of EER) than the traditional parametric LDA for speaker recognition, 2) when compared to raw acoustic features (e.g., MFCCs), the ASR speaker-adapted features provide gains in speaker recognition performance, and 3) increasing the number of output units in the DNN acoustic model (i.e., increasing the senone set size from 2k to 10k) provides consistent improvements in performance (for example from 37% to 57% relative EER gains over our baseline GMM i-vector system). To our knowledge, results reported in this paper represent the best performances published to date on the NIST SRE 2010 extended core tasks.
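The third technique, DNN-based soft alignments, reduces to a short computation: senone posteriors replace GMM occupancies when accumulating the Baum-Welch statistics that i-vector estimation consumes. A minimal sketch, with illustrative shapes:

```python
# Zeroth- and first-order Baum-Welch statistics with DNN senone
# posteriors used as frame-level soft alignments (illustrative shapes).
import numpy as np

def bw_stats(feats, posteriors):
    """feats: (T, D) acoustic frames; posteriors: (T, C) senone posteriors."""
    N = posteriors.sum(axis=0)      # zeroth-order: expected frame counts, (C,)
    F = posteriors.T @ feats        # first-order: posterior-weighted feature sums, (C, D)
    return N, F
```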


IEEE Signal Processing Letters | 2013

Using Polynomial Kernel Support Vector Machines for Speaker Verification

Sibel Yaman; Jason W. Pelecanos

In this letter, we propose a discriminative modeling approach for the speaker verification problem that uses polynomial kernel support vector machines (PK-SVMs). The proposed approach is rooted in an equivalence relationship between the state-of-the-art probabilistic linear discriminant analysis (PLDA) and second degree polynomial kernel methods. We present two techniques for overcoming the memory and computational challenges that PK-SVMs pose. The first of these, a kernel evaluation simplification trick, eliminates the need to explicitly compute dot products for a huge number of training samples. The second technique makes use of the massively parallel processing power of modern graphical processing units. We performed experiments on the Phase I speaker verification track of the DARPA-sponsored Robust Automatic Transcription of Speech (RATS) program. We found that, in the multi-session enrollment experiments, second degree PK-SVMs outperformed PLDA across all tasks in terms of the official evaluation metric, and third and fourth degree PK-SVMs provided a performance improvement over the second degree PK-SVMs. Furthermore, for the “30s-30s” task, a linear score combination between the PLDA and PK-SVM based systems provided 27% improvement relative to the PLDA baseline in terms of the official evaluation metric.
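For orientation, a plain second-degree polynomial-kernel SVM of the kind discussed can be set up in a few lines; this sketch omits the paper's kernel-evaluation simplification and GPU parallelization, and the arrays are stand-ins for real i-vectors.

```python
# Minimal polynomial-kernel SVM sketch (stand-in data; not the paper's
# optimized training procedure).
import numpy as np
from sklearn.svm import SVC

X = np.random.randn(500, 400)             # placeholder i-vectors
y = np.random.randint(0, 2, 500)          # 1 = target-speaker trials (placeholder)
svm = SVC(kernel='poly', degree=2, coef0=1.0).fit(X, y)
scores = svm.decision_function(np.random.randn(10, 400))  # verification scores
```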


Lecture Notes in Computer Science | 2001

Revisiting Carl Bildt's Impostor: Would a Speaker Verification System Foil Him?

Kirk P. H. Sullivan; Jason W. Pelecanos

Impostors pose a potential threat to security systems that rely on human identification and verification based on voice alone and to security systems that make use of computer audio-based person authentication systems. This paper presents a case-study, which explores these issues using recordings of a high quality professional impersonation of a well-known Swedish politician. These recordings were used in the role of impostor in the experiments reported here. The experiments using human listeners showed that an impostor who can closely imitate the speech of the target voice can result in confusion and, therefore, can pose a threat to security systems that rely on human identification and verification. In contrast, an established Gaussian mixture model based speaker identification system was employed to distinguish the recordings. It was shown that the recognition engine was capable of classifying the mimic attacks more appropriately.
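One standard GMM scoring recipe of the kind this study relies on compares a target-speaker model against a background model via an average log-likelihood ratio; the sketch below uses placeholder features and is not the authors' 2001 system.

```python
# GMM target-vs-background scoring sketch (placeholder features).
import numpy as np
from sklearn.mixture import GaussianMixture

target = GaussianMixture(n_components=32, random_state=0).fit(np.random.randn(2000, 20))
backgr = GaussianMixture(n_components=32, random_state=0).fit(np.random.randn(2000, 20))

test = np.random.randn(300, 20)                  # features of the test recording
llr = target.score(test) - backgr.score(test)    # average log-likelihood ratio
accept = llr > 0.0                               # zero threshold, for illustration
```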


International Conference on Acoustics, Speech, and Signal Processing | 2008

Intersession variability compensation for language detection

Xi Zhou; J. Navratil; Jason W. Pelecanos; Ganesh N. Ramaswamy; Thomas S. Huang

Gaussian mixture models (GMM) have become one of the standard acoustic approaches for Language Detection. These models are typically incorporated to produce a log-likelihood ratio (LLR) verification statistic. In this framework, the intersession variability within each language becomes an adverse factor degrading the accuracy. To address this problem, we formulate the LLR as a function of the GMM parameters concatenated into normalized mean supervectors, and estimate the distribution of each language in this (high dimensional) supervector space. The goal is to de-emphasize the directions with the largest intersession variability. We compare this method with two other popular intersession variability compensation methods known as Nuisance Attribute Projection (NAP) and Within-Class Covariance Normalization (WCCN). Experiments on the NIST LRE 2003 and NIST LRE 2005 speech corpora show that the presented technique reduces the error by 50% relative to the baseline, and performs competitively with the NAP and WCCN approaches. Fusion results with a phonotactic component are also presented.
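De-emphasizing the high intersession-variability directions can be sketched as a NAP-style projection: estimate the within-language scatter in supervector space and project out its top eigenvectors. The weighting details of the actual compared methods are omitted, so treat this as an assumption-laden illustration.

```python
# NAP-style sketch: remove the k directions of largest within-language
# (intersession) variability from supervector space. Illustrative only.
import numpy as np

def nap_projection(sv, labels, k=10):
    """sv: (N, D) supervectors; labels: (N,) language ids; returns a (D, D) projector."""
    D = sv.shape[1]
    W = np.zeros((D, D))
    for lang in np.unique(labels):
        centered = sv[labels == lang] - sv[labels == lang].mean(axis=0)
        W += centered.T @ centered               # within-language scatter
    _, vecs = np.linalg.eigh(W)                  # eigenvalues ascending
    U = vecs[:, -k:]                             # top-k nuisance directions
    return np.eye(D) - U @ U.T                   # projector that removes them
```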


International Conference on Acoustics, Speech, and Signal Processing | 2016

Online speaker diarization using adapted i-vector transforms

Weizhong Zhu; Jason W. Pelecanos

Many speaker diarization systems operate in an off-line mode. Such systems typically find homogeneous segments and then cluster these segments according to speaker. Algorithms such as bottom-up clustering, k-means, or spectral clustering generally require the registration of all segments before clustering can begin. However, for real-time applications such as with multi-person voice interactive systems, there is a need to perform online speaker assignment in a strict left-to-right fashion. In this paper we propose a novel Maximum a Posteriori (MAP) adapted transform within an i-vector speaker diarization framework that operates in a strict left-to-right fashion. Previous work by the community has shown that the principal components of variation of fixed dimensional i-vectors learned across segments tend to indicate a strong basis by which to separate speakers. However, determining this basis can be problematic when there are few segments or when operating in an online manner. The proposed method blends the prior with the estimated subspace as more i-vectors are observed. Given oracle SAD segments, with adaptation we achieve 3.2% speaker diarization error for a strict left-to-right constraint on the LDC Callhome English Corpus compared to 4.8% without adaptation.
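The blending idea can be sketched as a MAP-style interpolation between a prior i-vector covariance and the covariance of the segments seen so far; the relevance factor `tau` and the two-dimensional subspace below are illustrative assumptions, not the paper's exact transform.

```python
# MAP-style blend of a prior covariance with the empirical covariance
# of the segment i-vectors observed so far; the leading eigenvectors
# give speaker-separating directions. tau and dim are illustrative.
import numpy as np

def adapted_subspace(prior_cov, ivecs, tau=16.0, dim=2):
    """ivecs: (n, D) array of i-vectors seen so far; returns (D, dim) basis."""
    n = len(ivecs)
    if n > 1:
        alpha = n / (n + tau)                    # weight grows with evidence
        cov = alpha * np.cov(ivecs, rowvar=False) + (1.0 - alpha) * prior_cov
    else:
        cov = prior_cov                          # fall back to the prior
    _, vecs = np.linalg.eigh(cov)
    return vecs[:, -dim:]                        # principal separating axes
```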


International Conference on Acoustics, Speech, and Signal Processing | 2015

Nearest neighbor based i-vector normalization for robust speaker recognition under unseen channel conditions

Weizhong Zhu; Seyed Omid Sadjadi; Jason W. Pelecanos

Many state-of-the-art speaker recognition engines use i-vectors to represent variable-length acoustic signals in a fixed low-dimensional total variability subspace. While such systems perform well under seen channel conditions, their performance greatly degrades under unseen channel scenarios. Accordingly, rapid adaptation of i-vector systems to unseen conditions has recently attracted significant research effort from the community. To mitigate this mismatch, in this paper we propose nearest neighbor based i-vector mean normalization (NN-IMN) and i-vector smoothing (IS) for unsupervised adaptation to unseen channel conditions within a state-of-the-art i-vector/PLDA speaker verification framework. A major advantage of the approach is its ability to handle multiple unseen channels without explicit retraining or clustering. Our observations on the DARPA Robust Automatic Transcription of Speech (RATS) speaker recognition task suggest that part of the distortion caused by an unseen channel may be modeled as an offset in the i-vector space. Hence, the proposed nearest neighbor based normalization technique is formulated to compensate for such a shift. Experimental results with the NN based normalized i-vectors indicate that, on average, we can recover 46% of the total performance degradation due to unseen channel conditions.
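Read literally, the offset-compensation idea admits a very short sketch: subtract from each test i-vector the mean of its nearest neighbors drawn from an unlabeled background pool. The neighborhood size and the pool itself are assumptions for illustration, and the companion smoothing (IS) step is not shown.

```python
# Nearest-neighbor i-vector mean normalization sketch (illustrative k
# and background pool; the i-vector smoothing step is omitted).
import numpy as np

def nn_imn(test_ivec, background, k=50):
    """Subtract the mean of the k nearest background i-vectors."""
    dists = np.linalg.norm(background - test_ivec, axis=1)
    nn = background[np.argsort(dists)[:k]]
    return test_ivec - nn.mean(axis=0)           # remove the estimated channel offset
```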


International Conference on Acoustics, Speech, and Signal Processing | 2015

Nearest neighbor discriminant analysis for language recognition

Seyed Omid Sadjadi; Jason W. Pelecanos; Sriram Ganapathy

Many state-of-the-art i-vector based voice biometric systems use linear discriminant analysis (LDA) as a post-processing stage to increase the computational efficiency in the back-end via dimensionality reduction, as well as annihilate the undesired (noisy) directions in the total variability subspace. The traditional approach for computing the LDA transform uses parametric representations for both intra- and inter-class scatter matrices that are based on the Gaussian distribution assumption. However, it is known that the actual distribution of i-vectors may not necessarily be Gaussian, particularly in the presence of noise and channel distortions. In addition, the rank of the LDA projection (i.e., the maximum number of available discriminant bases) is limited to the number of classes minus 1. Accordingly, language recognition tasks on noisy data that involve only a few language classes receive limited or no benefit from the LDA post-processing. Motivated by this observation, we present an alternative non-parametric discriminant analysis (NDA) technique that measures both the within- and between-language variation on a local basis using the nearest neighbor rule. The effectiveness of the NDA method is evaluated in the context of noisy language recognition tasks using speech material from the DARPA Robust Automatic Transcription of Speech (RATS) program. Experimental results indicate that NDA is more effective than the traditional parametric LDA for language recognition under noisy and channel degraded conditions.
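The core of NDA is replacing global class means with local, nearest-neighbor ones when building the between-class scatter. A stripped-down sketch follows; the sample weighting function of the full NDA formulation is deliberately omitted, so this is a simplified illustration rather than the evaluated method.

```python
# Simplified nonparametric between-class scatter: each sample is
# contrasted with the mean of its k nearest neighbors from rival
# classes (NDA's per-sample weighting is omitted here).
import numpy as np

def nda_between_scatter(X, y, k=5):
    """X: (N, D) i-vectors; y: (N,) language labels; returns (D, D) scatter."""
    Sb = np.zeros((X.shape[1], X.shape[1]))
    for xi, yi in zip(X, y):
        rivals = X[y != yi]
        d = np.linalg.norm(rivals - xi, axis=1)
        local_mean = rivals[np.argsort(d)[:k]].mean(axis=0)  # local rival mean
        diff = (xi - local_mean)[:, None]
        Sb += diff @ diff.T
    return Sb
```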


International Conference on Acoustics, Speech, and Signal Processing | 2016

Speaker age estimation on conversational telephone speech using senone posterior based i-vectors

Seyed Omid Sadjadi; Sriram Ganapathy; Jason W. Pelecanos

Automatic age estimation from speech has a variety of applications including natural human-computer interaction, targeted advertising, customer-agent pairing in call centers, and forensics, to mention a few. Recently, the use of i-vectors has shown promise for automatic age estimation. In this paper, we adopt a phonetically-aware i-vector extractor for the age estimation problem. Such senone i-vector based schemes have demonstrated success in the speaker recognition field. Fixed-length and low-dimensional i-vectors are first conditioned through a linear discriminant analysis (LDA) transform, and then used to train a support vector regression (SVR) model. Additionally, in contrast to previous work, we employ the use of the logarithm of the age as the target in training the SVR to further penalize estimation errors for younger speakers compared with older speakers. The proposed system is evaluated using telephony speech material extracted from the NIST SRE 2008 and 2010 evaluation corpora. Experimental results indicate solid age estimation performance with a mean absolute error (MAE) of 4.7 years for both male and female speakers on the NIST SRE 2010 telephony test set.
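A hedged sketch of the described pipeline: LDA-conditioned i-vectors feeding a support vector regressor trained on log-age, so that a given absolute error costs more for younger speakers. LDA needs discrete classes, so the sketch bins ages into 5-year groups; that binning, and all the data below, are assumptions of this illustration, not the paper's setup.

```python
# LDA conditioning + SVR on log(age) sketch (placeholder data; the
# age binning for LDA is an assumption of this illustration).
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.svm import SVR

ivecs = np.random.randn(1000, 400)               # placeholder i-vectors
ages = np.random.randint(18, 70, 1000)           # placeholder speaker ages

lda = LinearDiscriminantAnalysis(n_components=10)
Z = lda.fit_transform(ivecs, ages // 5)          # 5-year age bins as classes

svr = SVR(kernel='rbf').fit(Z, np.log(ages))     # log target penalizes errors
preds = np.exp(svr.predict(Z[:5]))               # on younger speakers more; back to years
```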

Collaboration


Dive into Jason W. Pelecanos's collaborations.

Top Co-Authors


Sridha Sridharan

Queensland University of Technology


Seyed Omid Sadjadi

University of Texas at Dallas


Robert J. Vogt

Queensland University of Technology
