Themos Stafylakis
National Technical University of Athens
Publications
Featured research published by Themos Stafylakis.
Pattern Recognition | 2010
Vassilis Papavassiliou; Themos Stafylakis; Vassilios Katsouros; George Carayannis
Two novel approaches to extracting text lines and words from handwritten documents are presented. The line segmentation algorithm is based on locating the optimal succession of text and gap areas within vertical zones by applying the Viterbi algorithm. A text-line separator drawing technique is then applied and, finally, the connected components are assigned to text lines. Word segmentation is based on a gap metric that exploits the objective function of a soft-margin linear SVM separating successive connected components. The algorithms were tested on the benchmark datasets of the ICDAR07 handwriting segmentation contest and outperformed the participating algorithms.
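To make the gap metric concrete, here is a minimal sketch (not the authors' code, and with assumed feature and threshold conventions): for each pair of adjacent connected components on a line, a soft-margin linear SVM is fitted to the two components' pixel coordinates, and the attained value of the SVM objective serves as the gap measure; a low objective indicates a wide, cleanly separable gap.

```python
# Hypothetical sketch of the SVM-based gap metric; not the paper's code.
import numpy as np
from sklearn.svm import LinearSVC

def gap_metric(cc_left: np.ndarray, cc_right: np.ndarray, C: float = 1.0) -> float:
    """cc_left, cc_right: (N, 2) pixel coordinates of two adjacent components."""
    X = np.vstack([cc_left, cc_right])
    y = np.hstack([-np.ones(len(cc_left)), np.ones(len(cc_right))])
    svm = LinearSVC(C=C, loss="hinge").fit(X, y)
    w, b = svm.coef_.ravel(), svm.intercept_[0]
    hinge = np.maximum(0.0, 1.0 - y * (X @ w + b)).sum()
    # Soft-margin objective 0.5*||w||^2 + C*hinge: low values mean the two
    # components are separated by a wide, clean gap.
    return 0.5 * float(w @ w) + C * hinge
```

Gaps whose metric falls below a tuned threshold would then mark word boundaries; the remaining gaps are treated as within-word.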
international conference on acoustics, speech, and signal processing | 2013
Patrick Kenny; Themos Stafylakis; Pierre Ouellet; Jahangir Alam; Pierre Dumouchel
The duration of speech segments has traditionally been controlled in the NIST speaker recognition evaluations, so researchers working in this framework have been relieved of the responsibility of dealing with the duration variability that arises in practical applications. The fixed-dimensional i-vector representation of speech utterances is ideal for working under such controlled conditions, and ignoring the fact that i-vectors extracted from short utterances are less reliable than those extracted from long utterances leads to a very simple formulation of the speaker recognition problem. However, a more realistic approach is needed to handle duration variability properly. In this paper, we show how to quantify the uncertainty associated with the i-vector extraction process and propagate it into a PLDA classifier. We evaluated this approach using test sets derived from the NIST 2010 core and extended core conditions by randomly truncating the utterances in the female telephone-speech trials so that the durations of all enrollment and test utterances lay in the range of 3-60 seconds, and we found that it led to substantial improvements in accuracy. Although the likelihood ratio computation for speaker verification is more computationally expensive than in the standard i-vector/PLDA classifier, it is still quite modest, as it reduces to computing the probability density functions of two full-covariance Gaussians (irrespective of the number of utterances used to enroll a speaker).
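The scoring step can be illustrated with a simplified two-covariance PLDA sketch (an assumption for illustration; the paper's exact formulation differs in detail): the per-utterance i-vector posterior covariances produced by the extractor are added to the within-speaker covariance, and the verification score is the difference of two full-covariance Gaussian log-densities.

```python
# Simplified sketch of uncertainty-aware PLDA scoring; illustrative only.
import numpy as np
from scipy.stats import multivariate_normal

def llr(x1, x2, B, W, C1, C2):
    """x1, x2: enrollment/test i-vectors; B, W: between/within-speaker
    covariances; C1, C2: i-vector posterior covariances (the uncertainty)."""
    d = len(x1)
    pair = np.hstack([x1, x2])
    zero = np.zeros((d, d))
    # Same speaker: a shared latent identity couples the two i-vectors.
    S_same = np.block([[B + W + C1, B], [B, B + W + C2]])
    # Different speakers: independent identities, no cross-covariance.
    S_diff = np.block([[B + W + C1, zero], [zero, B + W + C2]])
    return (multivariate_normal.logpdf(pair, np.zeros(2 * d), S_same)
            - multivariate_normal.logpdf(pair, np.zeros(2 * d), S_diff))
```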
international conference on acoustics, speech, and signal processing | 2014
Vishwa Gupta; Patrick Kenny; Pierre Ouellet; Themos Stafylakis
State-of-the-art speaker recognition systems are based on the i-vector representation of speech segments. In this paper we show how this representation can be used to perform blind speaker adaptation of a hybrid DNN-HMM speech recognition system, and we report excellent results on a French-language audio transcription task. The implementation is very simple. An audio file is first diarized and each speaker cluster is represented by an i-vector. Acoustic feature vectors are augmented with the corresponding i-vectors before being presented to the DNN. (The same i-vector is used for all acoustic feature vectors aligned with a given speaker.) This supplementary information improves the DNN's ability to discriminate between phonetic events in a speaker-independent way without any modification to the DNN training algorithms. We report results on the ETAPE 2011 transcription task and show that i-vector-based speaker adaptation is effective irrespective of whether cross-entropy or sequence training is used. For cross-entropy training we obtained a word error rate (WER) reduction from 22.16% to 20.67%, whereas for sequence training the WER drops from 19.93% to 18.40%.
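The augmentation itself amounts to concatenating the speaker's i-vector onto every frame; a minimal sketch with illustrative shapes and names (not the paper's implementation):

```python
# Minimal sketch of i-vector speaker adaptation of DNN input features.
import numpy as np

def augment_frames(frames: np.ndarray, ivector: np.ndarray) -> np.ndarray:
    """frames: (T, F) acoustic features for one speaker cluster;
    ivector: (D,) i-vector of that cluster."""
    tiled = np.tile(ivector, (frames.shape[0], 1))  # same i-vector per frame
    return np.hstack([frames, tiled])               # (T, F + D) DNN input
```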
IEEE Transactions on Audio, Speech, and Language Processing | 2014
Mohammed Senoussaoui; Patrick Kenny; Themos Stafylakis; Pierre Dumouchel
Speaker clustering is a crucial step in speaker diarization. The short duration of speech segments in telephone dialogues and the absence of prior information on the number of clusters dramatically increase the difficulty of diarizing spontaneous telephone conversations. We propose a simple iterative Mean Shift algorithm based on the cosine distance to perform speaker clustering under these conditions. Two variants of the cosine-distance Mean Shift are compared in an exhaustive practical study. We report state-of-the-art results, as measured by the Diarization Error Rate and the number of detected speakers, on the LDC CallHome telephone corpus.
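A minimal sketch of mean shift with a cosine kernel on length-normalized i-vectors (the flat-kernel window, threshold, and fixed iteration count are assumptions; the paper compares two variants of the cosine-distance Mean Shift):

```python
# Illustrative cosine-distance mean shift for speaker clustering.
import numpy as np

def cosine_mean_shift(X: np.ndarray, threshold: float = 0.5, iters: int = 50):
    """X: (N, D) i-vectors, one per speech segment."""
    X = X / np.linalg.norm(X, axis=1, keepdims=True)
    modes = X.copy()
    for _ in range(iters):
        sims = modes @ X.T                          # cosine similarities
        weights = (sims > threshold).astype(float)  # flat-kernel window
        modes = weights @ X                         # shift toward the mean
        norms = np.linalg.norm(modes, axis=1, keepdims=True)
        modes = modes / np.maximum(norms, 1e-12)    # stay on the unit sphere
    return modes  # near-identical rows indicate one speaker cluster
```

Segments whose modes converge to (numerically) the same vector are grouped, which also yields the estimated number of speakers.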
international conference on acoustics, speech, and signal processing | 2008
Themos Stafylakis; Vassilis Papavassiliou; Vassilios Katsouros; George Carayannis
This paper addresses the problem of automatic text-line and word segmentation in handwritten document images. Two novel approaches are presented, one for each task. For text-line segmentation a Viterbi algorithm is proposed, while an SVM-based metric is adopted to locate words within each text line. The overall algorithm was tested in the ICDAR2007 handwriting segmentation contest and showed highly promising results.
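The Viterbi step can be sketched as two-state decoding over the row ink-density profile of a vertical zone; the emission and transition scores below are illustrative assumptions, not the paper's statistics:

```python
# Illustrative two-state (gap/text) Viterbi over a zone's row densities.
import numpy as np

def viterbi_text_gap(density: np.ndarray, p_stay: float = 0.95):
    """density: (R,) fraction of ink pixels in each pixel row of the zone."""
    emit = np.stack([np.log(1e-6 + 1.0 - density),   # state 0: gap
                     np.log(1e-6 + density)])        # state 1: text
    trans = np.log(np.array([[p_stay, 1 - p_stay],
                             [1 - p_stay, p_stay]]))
    R = density.shape[0]
    score = emit[:, 0].copy()
    back = np.zeros((2, R), dtype=int)
    for t in range(1, R):
        cand = score[:, None] + trans      # cand[i, j]: from state i to j
        back[:, t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + emit[:, t]
    path = [int(score.argmax())]
    for t in range(R - 1, 0, -1):
        path.append(int(back[path[-1], t]))
    return path[::-1]  # 0/1 label per row; 1->0 transitions end a text line
```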
international conference on acoustics, speech, and signal processing | 2013
Mohammed Senoussaoui; Patrick Kenny; Pierre Dumouchel; Themos Stafylakis
Speaker clustering is an important task in many applications, such as speaker diarization and speech recognition. It can be performed within a single multi-speaker recording (diarization) or across a set of different recordings. In this work we are interested in the former case and propose a simple iterative Mean Shift (MS) algorithm to deal with this problem. Traditionally, the MS algorithm is based on the Euclidean distance; we propose to use the cosine distance instead, yielding a new version of the MS algorithm. We report results, as measured by speaker and cluster impurities, on the NIST SRE 2008 datasets.
international conference on acoustics, speech, and signal processing | 2014
Patrick Kenny; Themos Stafylakis; Pierre Ouellet; Md. Jahangir Alam
We discuss the limitations of the i-vector representation of speech segments in speaker recognition and explain how Joint Factor Analysis (JFA) can serve as an alternative feature extractor in a variety of ways. Building on the work of Zhao and Dong, we implemented a variational Bayes treatment of JFA which accommodates adaptation of universal background models (UBMs) in a natural way. This allows us to experiment with several types of features for speaker recognition: speaker factors and diagonal factors, in addition to i-vectors, extracted with and without UBM adaptation in each case. We found that, in text-independent speaker verification experiments on NIST data, extracting i-vectors with UBM adaptation led to a 10% reduction in equal error rate, although performance did not improve consistently over the whole DET curve. We achieved a further 10% reduction (with a similar inconsistency) by using speaker factors extracted with UBM adaptation as features. In text-dependent speaker recognition experiments on RSR2015 data, we were able to achieve very good performance using, as a feature extractor, a JFA model with diagonal factors but no speaker factors. Contrary to standard practice, this JFA model was configured to model speaker-phrase combinations (rather than speakers) and was trained on utterances of very short duration (rather than whole recording sessions). We also present a variant of the length normalization trick, inspired by uncertainty propagation, which leads to substantial gains in performance over the whole DET curve.
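The length-normalization variant itself is not spelled out in the abstract. As a loudly-hedged illustration only, one plausible uncertainty-inspired reading counts the trace of the i-vector posterior covariance toward the squared length, shrinking i-vectors from short, uncertain utterances; the exact form used in the paper may differ:

```python
# Hypothetical uncertainty-aware length normalization; standard length
# normalization is recovered when cov is None.
import numpy as np

def length_normalize(w: np.ndarray, cov: np.ndarray | None = None) -> np.ndarray:
    sq = w @ w if cov is None else w @ w + np.trace(cov)  # assumed variant
    return w / np.sqrt(sq)
```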
international conference on acoustics, speech, and signal processing | 2015
Patrick Kenny; Themos Stafylakis; Jahangir Alam; Marcel Kockmann
This paper introduces a new formulation of Joint Factor Analysis (JFA) for text-dependent speaker recognition based on left-to-right modeling with tied-mixture HMMs. It accommodates many different ways of extracting multiple features to characterize speakers (features may or may not be HMM state-dependent, they may be modeled with subspace or factorial priors, and these priors may be imputed from text-dependent or text-independent background data). We feed these features to a new, trainable classifier for text-dependent speaker recognition in a manner broadly analogous to the i-vector/PLDA cascade in text-independent speaker recognition. We have evaluated this approach on a challenging proprietary dataset consisting of telephone recordings of short English and Urdu pass-phrases collected in Pakistan. By fusing results obtained with multiple front ends, equal error rates of around 2% are achievable.
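Score-level fusion of multiple front ends is commonly implemented as a linear logistic-regression combiner trained on held-out trials; the sketch below shows that generic recipe, not the paper's specific fusion:

```python
# Generic score-level fusion via logistic regression (illustrative).
import numpy as np
from sklearn.linear_model import LogisticRegression

def fuse(scores_dev, labels_dev, scores_eval):
    """scores_*: (n_trials, n_frontends) raw scores, one column per front end;
    labels_dev: 0/1 target/non-target labels for the development trials."""
    fuser = LogisticRegression().fit(scores_dev, labels_dev)
    return fuser.decision_function(scores_eval)  # fused, calibrated log-odds
```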
international conference on acoustics, speech, and signal processing | 2014
David Martinez; Lukas Burget; Themos Stafylakis; Yun Lei; Patrick Kenny; Eduardo Lleida
Recently, a new version of i-vector modeling has been proposed for noise-robust speaker recognition, in which the nonlinear function that relates clean and noisy cepstral coefficients is approximated by a first-order vector Taylor series (VTS). In this paper we propose to replace the first-order VTS with an unscented transform in which, unlike VTS, the nonlinear function is not applied to the clean model parameters directly but to a set of sampled sigma points. The resulting points in the transformed space are then used to calculate the model parameters. At very low signal-to-noise ratios, improvements in equal error rate of about 7% for a clean backend and 14.50% for a multistyle backend are obtained.
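A generic unscented transform, contrasted with the first-order VTS above, propagates sigma points of the clean distribution through the nonlinearity and re-estimates the mean and covariance from the transformed points (the scaling parameter below is a conventional default, not the paper's setting):

```python
# Generic unscented transform of a Gaussian (mu, Sigma) through f.
import numpy as np

def unscented_transform(mu, Sigma, f, kappa=0.0):
    d = len(mu)
    L = np.linalg.cholesky((d + kappa) * Sigma)
    # 2d + 1 sigma points: the mean plus symmetric spreads along L's columns.
    pts = np.vstack([mu, mu + L.T, mu - L.T])
    w = np.full(2 * d + 1, 1.0 / (2.0 * (d + kappa)))
    w[0] = kappa / (d + kappa)
    Y = np.array([f(p) for p in pts])
    mean = w @ Y
    diff = Y - mean
    cov = (w[:, None] * diff).T @ diff
    return mean, cov  # transformed (noisy-model) mean and covariance
```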
IEEE Transactions on Audio, Speech, and Language Processing | 2016
Themos Stafylakis; Patrick Kenny; Md. Jahangir Alam; Marcel Kockmann
We reformulate Joint Factor Analysis so that it can serve as a feature extractor for text-dependent speaker recognition. The new formulation is based on left-to-right modeling with tied-mixture HMMs, and it is designed to deal with problems such as the inadequacy of subspace methods in modeling speaker-phrase variability, UBM mismatches that arise as a result of variable phonetic content, and the need to exploit text-independent resources in text-dependent speaker recognition. We pass the features extracted by factor analysis to a trainable backend which plays a role analogous to that of PLDA in the i-vector/PLDA cascade in text-independent speaker recognition. We evaluate these methods on a proprietary dataset consisting of English and Urdu pass-phrases collected in Pakistan. By using both text-independent and text-dependent data for training and by fusing results obtained with multiple front ends at the score level, we achieved equal error rates of around 1.3% and 2% on the English and Urdu portions of this task, respectively.