Robbie Vogt
Queensland University of Technology
Publications
Featured research published by Robbie Vogt.
Computer Speech & Language | 2008
Robbie Vogt; Sridha Sridharan
This article describes a general and powerful approach to modelling mismatch in speaker recognition by including an explicit session term in the Gaussian mixture speaker modelling framework. Under this approach, the Gaussian mixture model (GMM) that best represents the observations of a particular recording is the combination of the true speaker model with an additional session-dependent offset constrained to lie in a low-dimensional subspace representing session variability. A novel and efficient model training procedure is proposed in this work to perform the simultaneous optimisation of the speaker model and session variables required for speaker training. Using an iterative approach similar to the Gauss–Seidel method for solving linear systems, this procedure greatly reduces the memory and computational resources required by a direct solution. Extensive experimentation demonstrates that the explicit session modelling provides up to a 68% reduction in detection cost over a standard GMM-based system, as well as significant improvements over a system utilising feature mapping. The approach is shown to be effective on the corpora of recent National Institute of Standards and Technology (NIST) Speaker Recognition Evaluations, which exhibit different session mismatch conditions.
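The training procedure above is likened to the Gauss–Seidel method for linear systems: unknowns are updated one at a time, each update reusing the freshest values of the others, so no direct matrix solve is required. As background, a minimal sketch of Gauss–Seidel itself (a generic solver, not the paper's speaker/session optimisation):

```python
import numpy as np

def gauss_seidel(A, b, iters=100):
    """Solve A x = b by sweeping one unknown at a time.

    Each update reuses the freshest values of the other unknowns,
    so only one copy of x is stored and no matrix inverse is formed.
    """
    n = len(b)
    x = np.zeros(n)
    for _ in range(iters):
        for i in range(n):
            # Residual of row i, excluding the diagonal term.
            s = A[i] @ x - A[i, i] * x[i]
            x[i] = (b[i] - s) / A[i, i]
    return x

# Diagonally dominant system, so the iteration converges.
A = np.array([[4.0, 1.0], [2.0, 5.0]])
b = np.array([1.0, 2.0])
x = gauss_seidel(A, b)
```

The appeal for large models is the same as in the paper: memory stays linear in the number of unknowns, at the cost of iterating to convergence.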
international conference on acoustics, speech, and signal processing | 2006
Robbie Vogt; Sridha Sridharan
Presented is an approach to modelling session variability for GMM-based text-independent speaker verification that incorporates a constrained session variability component in both the training and testing procedures. The proposed technique reduces data labelling requirements, removes the discrete categorisation needed by previous techniques, and provides superior performance. Experiments on Mixer conversational telephony data show improvements of as much as 46% in equal error rate over a baseline system. In this paper the algorithm used for the enrolment procedure is described in detail. Results are also presented investigating the response of the technique to short test utterances and varying session subspace dimensions.
international conference on acoustics, speech, and signal processing | 2012
Ahilan Kanagasundaram; David Dean; Robbie Vogt; Mitchell McLaren; Sridha Sridharan; Michael Mason
This paper introduces the Weighted Linear Discriminant Analysis (WLDA) technique, based upon the weighted pairwise Fisher criterion, for the purposes of improving i-vector speaker verification in the presence of high inter-session variability. By taking advantage of the speaker discriminative information that is available in the distances between pairs of speakers clustered in the development i-vector space, the WLDA technique is shown to provide an improvement in speaker verification performance over traditional Linear Discriminant Analysis (LDA) approaches. A similar approach is also taken to extend the recently developed Source Normalised LDA (SNLDA) into Weighted SNLDA (WSNLDA) which, similarly, shows an improvement in speaker verification performance in both matched and mismatched enrolment/verification conditions. Based upon the results presented within this paper using the NIST 2008 Speaker Recognition Evaluation dataset, we believe that both WLDA and WSNLDA are viable as replacement techniques to improve the performance of LDA and SNLDA-based i-vector speaker verification.
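The core idea of WLDA is that the between-class scatter is built from pairs of class means, with each pair weighted by a function of their separation, so that close (easily confused) speaker pairs dominate the criterion. A minimal sketch of such a weighted pairwise scatter; the specific weighting function below is an illustrative assumption, not the paper's exact choice:

```python
import numpy as np

def weighted_between_scatter(means, weight_fn):
    """Weighted pairwise between-class scatter matrix.

    Rather than measuring each class mean against the global mean
    (as in plain LDA), every *pair* of class means contributes,
    scaled by a weight depending on their distance.
    """
    dim = means.shape[1]
    S = np.zeros((dim, dim))
    for i in range(len(means)):
        for j in range(i + 1, len(means)):
            d = means[i] - means[j]
            S += weight_fn(np.linalg.norm(d)) * np.outer(d, d)
    return S

# Three toy speaker means: two nearly confusable, one far away.
means = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]])
# Down-weight distant pairs so the confusable pair dominates.
S = weighted_between_scatter(means, lambda d: 1.0 / d**2)
```

With the unweighted choice `weight_fn = lambda d: 1.0` this reduces to an ordinary pairwise between-class scatter, which is the sense in which WLDA generalises LDA.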
international conference on acoustics, speech, and signal processing | 2009
Roy Wallace; Robbie Vogt; Sridha Sridharan
While spoken term detection (STD) systems based on word indices provide good accuracy, there are several practical applications where it is infeasible or too costly to employ an LVCSR engine. An STD system is presented, which is designed to incorporate a fast phonetic decoding front-end and be robust to decoding errors whilst still allowing for rapid search speeds. This goal is achieved through monophone open-loop decoding coupled with fast hierarchical phone lattice search. Results demonstrate that an STD system that is designed with the constraint of a fast and simple phonetic decoding front-end requires a compromise to be made between search speed and search accuracy.
international conference on acoustics, speech, and signal processing | 2010
Roy Wallace; Robbie Vogt; Brendan Baker; Sridha Sridharan
This paper introduces a novel technique to directly optimise the Figure of Merit (FOM) for phonetic spoken term detection. The FOM is a popular measure of STD accuracy, making it an ideal candidate for use as an objective function. A simple linear model is introduced to transform the phone log-posterior probabilities output by a phone classifier to produce enhanced log-posterior features that are more suitable for the STD task. Direct optimisation of the FOM is then performed by training the parameters of this model using a nonlinear gradient descent algorithm. Substantial FOM improvements of 11% relative are achieved on held-out evaluation data, demonstrating the generalisability of the approach.
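The recipe above has two parts: a linear model applied to phone log-posterior features, and gradient-based training of that model against a task objective. A minimal sketch of the recipe's shape, using a toy differentiable objective (a logistic log-likelihood) as a stand-in, since the FOM objective and its gradient are specific to the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for phone log-posterior features and keyword labels.
X = rng.normal(size=(200, 8))
y = (X[:, 0] + 0.5 * X[:, 1] > 0).astype(float)

# Linear model producing "enhanced" scores from the input features.
w = np.zeros(8)

def objective(w):
    # Differentiable surrogate objective (NOT the FOM itself).
    s = 1.0 / (1.0 + np.exp(-(X @ w)))
    return np.mean(y * np.log(s + 1e-9) + (1 - y) * np.log(1 - s + 1e-9))

# Plain gradient ascent on the surrogate; the paper instead applies
# a nonlinear gradient descent algorithm directly to the FOM.
for _ in range(300):
    s = 1.0 / (1.0 + np.exp(-(X @ w)))
    grad = X.T @ (y - s) / len(y)
    w += 0.5 * grad
```

The point of the sketch is the structure: once the task metric is expressed as a differentiable function of a linear transform's parameters, standard gradient methods apply.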
international conference on acoustics, speech, and signal processing | 2009
Mitchell McLaren; Brendan Baker; Robbie Vogt; Sridha Sridharan
The problem of background dataset selection in SVM-based speaker verification is addressed through the proposal of a new data-driven selection technique. Based on support vector selection, the proposed approach introduces a method to individually assess the suitability of each candidate impostor example for use in the background dataset. The technique can then produce a refined background dataset by selecting only the most informative impostor examples. Improvements of 13% in min. DCF and 10% in EER were found on the SRE 2006 development corpus when using the proposed method over the best heuristically chosen set. The technique was also shown to generalise to the unseen NIST 2008 SRE corpus.
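The selection idea is that impostor examples which become support vectors are the informative ones, so the background set can be refined down to them. A minimal sketch under simplifying assumptions: a linear SVM trained with Pegasos-style sub-gradient updates (an illustrative stand-in for the paper's SVM back-end), with impostors inside the margin kept as the refined background:

```python
import numpy as np

rng = np.random.default_rng(1)

# Target speaker examples (+1) and a large candidate impostor pool (-1).
target = rng.normal(loc=+2.0, size=(20, 5))
impostors = rng.normal(loc=-2.0, size=(200, 5))
X = np.vstack([target, impostors])
y = np.r_[np.ones(20), -np.ones(200)]

# Pegasos-style sub-gradient training of a linear SVM.
lam, w = 0.1, np.zeros(5)
for t in range(1, 5001):
    i = rng.integers(len(y))
    eta = 1.0 / (lam * t)
    if y[i] * (w @ X[i]) < 1:
        w = (1 - eta * lam) * w + eta * y[i] * X[i]
    else:
        w = (1 - eta * lam) * w

# Impostors on or inside the margin act as support vectors; keep only
# these informative examples as the refined background dataset.
margins = -(impostors @ w)          # label is -1, so margin = -w.x
refined = impostors[margins < 1.0]
```

In practice the assessment would be repeated across many target speakers so each impostor example's overall suitability can be scored, but the margin test above is the per-model building block.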
international conference on signal processing and communication systems | 2008
David Wang; Robbie Vogt; Michael Mason; Sridha Sridharan
This paper presents a novel technique for segmenting an audio stream into homogeneous regions according to speaker identities, background noise, music, environmental and channel conditions. Audio segmentation is useful in audio diarization systems, which aim to annotate an input audio stream with information that attributes temporal regions of the audio to their specific sources. The segmentation method introduced in this paper uses the Generalized Likelihood Ratio (GLR), computed between two adjacent sliding windows over preprocessed speech. This approach is inspired by the popular segmentation method proposed in the pioneering work of Chen and Gopalakrishnan, which uses the Bayesian information criterion (BIC) with an expanding search window; this paper identifies and addresses the shortcomings associated with that approach. The proposed segmentation strategy is evaluated on the 2002 Rich Transcription (RT-02) evaluation dataset, achieving a miss rate of 19.47% and a false alarm rate of 16.94% at the optimal threshold.
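The GLR statistic compares the likelihood of modelling the two adjacent windows separately against modelling their union with a single model; a large value suggests a change point between them. A minimal sketch using single full-covariance Gaussians per window (an illustrative modelling choice, not necessarily the paper's exact configuration):

```python
import numpy as np

def glr(x, y):
    """Generalised Likelihood Ratio between two adjacent windows.

    Separate Gaussian fits for each window versus one Gaussian fit
    over the merged data; larger values suggest a change point.
    """
    def loglik(z):
        mu = z.mean(0)
        # Small jitter keeps the covariance invertible.
        cov = np.cov(z.T) + 1e-6 * np.eye(z.shape[1])
        diff = z - mu
        _, logdet = np.linalg.slogdet(cov)
        quad = np.einsum('ij,jk,ik->i', diff, np.linalg.inv(cov), diff)
        return -0.5 * (len(z) * (logdet + z.shape[1] * np.log(2 * np.pi))
                       + quad.sum())
    return loglik(x) + loglik(y) - loglik(np.vstack([x, y]))

rng = np.random.default_rng(2)
# Same source in both windows vs. a clear change between windows.
same = glr(rng.normal(size=(100, 3)), rng.normal(size=(100, 3)))
diff = glr(rng.normal(size=(100, 3)), rng.normal(loc=3.0, size=(100, 3)))
```

Sliding this statistic along the stream and thresholding its peaks yields candidate segment boundaries.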
IEEE Transactions on Audio, Speech, and Language Processing | 2011
Roy Wallace; Brendan Baker; Robbie Vogt; Sridha Sridharan
This paper proposes to improve spoken term detection (STD) accuracy by optimizing the figure of merit (FOM). In this paper, the index takes the form of a phonetic posterior-feature matrix. Accuracy is improved by formulating STD as a discriminative training problem and directly optimizing the FOM, through its use as an objective function to train a transformation of the index. The outcome of indexing is then a matrix of enhanced posterior-features that are directly tailored for the STD task. The technique is shown to improve the FOM by up to 13% on held-out data. Additional analysis explores the effect of the technique on phone recognition accuracy, examines the actual values of the learned transform, and demonstrates that using an extended training data set results in further improvement in the FOM.
international conference on biometrics | 2007
Mitchell McLaren; Robbie Vogt; Sridha Sridharan
This paper demonstrates that modelling session variability during GMM training can improve the performance of a GMM supervector SVM speaker verification system. Recently, a method of modelling session variability in GMM-UBM systems has led to significant improvements when the training and testing conditions are subject to session effects. In this work, session variability modelling is applied during the extraction of GMM supervectors prior to SVM speaker model training and classification. Experiments performed on the NIST 2005 corpus show major improvements over the baseline GMM supervector SVM system.
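A GMM supervector is formed by adapting the UBM means towards a recording's features and stacking the adapted means into one long vector, which then serves as the SVM input. A minimal sketch of that extraction, using relevance-MAP adaptation of the means only and hard nearest-mean alignment in place of the usual soft posteriors (both simplifications for illustration):

```python
import numpy as np

def supervector(ubm_means, frames, r=16.0):
    """Stack MAP-adapted GMM means into a single supervector.

    Only the means are adapted (weights/covariances untouched);
    r is the relevance factor controlling adaptation strength.
    """
    # Hard alignment: assign each frame to its nearest UBM mean
    # (a simplification of soft posterior responsibilities).
    d2 = ((frames[:, None, :] - ubm_means[None]) ** 2).sum(-1)
    comp = d2.argmin(1)
    means = ubm_means.copy()
    for c in range(len(ubm_means)):
        sel = frames[comp == c]
        n = len(sel)
        if n:
            alpha = n / (n + r)   # data-count-dependent interpolation
            means[c] = alpha * sel.mean(0) + (1 - alpha) * ubm_means[c]
    return means.ravel()

rng = np.random.default_rng(3)
ubm = rng.normal(size=(4, 2))          # 4 components, 2-dim features
sv = supervector(ubm, rng.normal(size=(50, 2)))
```

In the paper's system, session variability modelling is applied during this extraction step, before the supervectors reach SVM training and classification.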
international conference on biometrics | 2009
Brendan Baker; Robbie Vogt; Mitchell McLaren; Sridha Sridharan
This paper presents Scatter Difference Nuisance Attribute Projection (SD-NAP) as an enhancement to NAP for SVM-based speaker verification. While standard NAP may inadvertently remove desirable speaker variability, SD-NAP explicitly de-emphasises this variability by incorporating a weighted version of the between-class scatter into the NAP optimisation criterion. Experimental evaluation of SD-NAP with a variety of SVM systems on the 2006 and 2008 NIST SRE corpora demonstrate that SD-NAP provides improved verification performance over standard NAP in most cases, particularly at the EER operating point.
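Standard NAP, which SD-NAP extends, projects away the strongest directions of within-speaker (session) variability before SVM training. A minimal sketch of that baseline projection; the SD-NAP weighting of between-class scatter into the criterion is not shown:

```python
import numpy as np

def nap_projection(vectors, labels, k):
    """Standard NAP: remove the top-k nuisance directions.

    Nuisance directions are the leading eigenvectors of the
    within-class (session) scatter of the training vectors.
    """
    # Centre each speaker's session vectors about the speaker mean.
    centred = np.vstack([
        vectors[labels == s] - vectors[labels == s].mean(0)
        for s in np.unique(labels)
    ])
    Sw = centred.T @ centred
    eigvals, eigvecs = np.linalg.eigh(Sw)
    # U spans the k strongest within-speaker variability directions.
    U = eigvecs[:, np.argsort(eigvals)[::-1][:k]]
    return np.eye(vectors.shape[1]) - U @ U.T   # P = I - U U^T

rng = np.random.default_rng(4)
labels = np.repeat(np.arange(5), 10)      # 5 speakers, 10 sessions each
nuisance = rng.normal(size=(50, 1)) * np.array([[3.0, 0.0, 0.0]])
X = rng.normal(size=(50, 3)) + nuisance   # strong nuisance on axis 0
P = nap_projection(X, labels, k=1)
```

The concern SD-NAP addresses is visible in the criterion: nothing in the within-class scatter stops a removed direction from also carrying speaker-discriminative variability, which is why SD-NAP de-emphasises such directions via a weighted between-class scatter term.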