Waad Ben Kheder
University of Avignon
Publications
Featured research published by Waad Ben Kheder.
International Conference on Acoustics, Speech, and Signal Processing | 2015
Waad Ben Kheder; Driss Matrouf; Jean-François Bonastre; Moez Ajili; Pierre-Michel Bousquet
State-of-the-art speaker recognition systems achieve very good results in clean conditions, but their performance degrades considerably in noisy environments. To address this strong limitation, this work removes the noisy part of an i-vector directly in the i-vector space. Our approach offers the advantage of operating only at the i-vector extraction level, leaving the other steps of the system unchanged. A maximum a posteriori (MAP) procedure is applied to obtain a clean version of the noisy i-vectors, taking advantage of prior knowledge about the clean i-vector distribution. To perform this MAP estimation, Gaussian assumptions are made over the clean and noise i-vector distributions. Operating on NIST 2008 data, we show a relative improvement of up to 60% compared with the baseline system. Our approach also outperforms the “multi-style” backend training technique. The efficiency of the proposed method comes at the price of a relatively high computational cost; we conclude with some ideas to improve this aspect.
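The MAP estimation described above can be sketched as follows, assuming the additive model y = x + n in the i-vector space with Gaussian distributions for both clean i-vectors and noise. All symbol and function names here are illustrative, not taken from the paper's implementation:

```python
import numpy as np

def map_clean_ivector(y, mu_x, Sigma_x, mu_n, Sigma_n):
    """MAP estimate of a clean i-vector x from a noisy observation y = x + n,
    assuming Gaussian priors x ~ N(mu_x, Sigma_x) and n ~ N(mu_n, Sigma_n).

    The posterior of x given y is Gaussian; its mode (= mean) combines the
    prior and the noise-shifted observation, weighted by their precisions.
    """
    P_x = np.linalg.inv(Sigma_x)          # precision of the clean i-vector prior
    P_n = np.linalg.inv(Sigma_n)          # precision of the noise distribution
    post_cov = np.linalg.inv(P_x + P_n)   # posterior covariance
    return post_cov @ (P_x @ mu_x + P_n @ (y - mu_n))
```

With equal isotropic covariances, the estimate is simply the midpoint between the prior mean and the noise-compensated observation, which matches the intuition of pulling noisy i-vectors back toward the clean distribution.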
International Conference on Statistical Language and Speech Processing | 2014
Waad Ben Kheder; Driss Matrouf; Pierre-Michel Bousquet; Jean-François Bonastre; Moez Ajili
In the last few years, the use of i-vectors along with a generative back-end has become the new standard in speaker recognition. An i-vector is a compact representation of a speaker utterance extracted from a low-dimensional total variability subspace. Although current speaker recognition systems achieve very good results in clean training and test conditions, performance degrades considerably in noisy environments. Compensation of the noise effect is currently a research subject of major importance. As far as we know, there has been no serious attempt to treat the noise problem directly in the i-vector space without relying on data distributions computed on a prior domain. This paper proposes a full-covariance Gaussian modeling of the clean i-vector and noise distributions in the i-vector space, then introduces a technique to estimate a clean i-vector given the noisy version and the noise density function using a MAP approach. Based on NIST data, we show that it is possible to improve the baseline system performance by up to 60%. A noise-adding tool is used to simulate a real-world noisy environment at different signal-to-noise ratio levels.
Conference of the International Speech Communication Association | 2016
Waad Ben Kheder; Driss Matrouf; Moez Ajili; Jean-François Bonastre
Speaker recognition with short utterances is highly challenging. The use of i-vectors in SR systems has become a standard in recent years, and many algorithms have been developed to deal with the short-utterance problem. This paper presents a new technique based on jointly modeling the i-vectors corresponding to short utterances and those of long utterances. The joint distribution is estimated using a large number of i-vector pairs (from short and long utterances) corresponding to the same session. The obtained distribution is then integrated into an MMSE estimator in the test phase to compute an “improved” version of short-utterance i-vectors. We show that this technique can deal with duration mismatch and that it achieves up to 40% relative improvement in EER on NIST data. We also apply this technique to the recently published SITW database and show that it yields a 25% relative EER improvement compared to regular PLDA scoring.
Conference of the International Speech Communication Association | 2016
Waad Ben Kheder; Driss Matrouf; Moez Ajili; Jean-François Bonastre
Additive noise is one of the main challenges for automatic speaker recognition, and several compensation techniques have been proposed to deal with this problem. In this paper, we present a new “data-driven” denoising technique operating in the i-vector space, based on a joint modeling of clean and noisy i-vectors. The joint distribution is estimated using a large set of i-vector pairs (clean i-vectors and their artificially generated noisy versions), then integrated into an MMSE estimator in the test phase to compute a “cleaned-up” version of noisy test i-vectors. We show that this algorithm achieves up to 80% relative improvement in EER. We also present a version of the proposed algorithm that can compensate multiple “unseen” noises. We test this technique on the recently published SITW database and show a significant gain over the baseline system performance.
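A minimal sketch of such a joint-Gaussian MMSE clean-up: the joint distribution of stacked (clean, noisy) i-vector pairs is estimated empirically, and the conditional mean E[x | y] serves as the cleaned-up i-vector. Function names and data shapes are hypothetical:

```python
import numpy as np

def fit_joint_gaussian(X_clean, Y_noisy):
    """Estimate a full joint Gaussian over stacked (clean, noisy) i-vector
    pairs. Rows of X_clean and Y_noisy are paired i-vectors (the noisy
    versions generated artificially from the clean recordings)."""
    Z = np.hstack([X_clean, Y_noisy])       # one row per pair: [x, y]
    return Z.mean(axis=0), np.cov(Z, rowvar=False)

def mmse_clean_up(y, mu, Sigma, d):
    """MMSE estimate E[x | y] under the joint Gaussian; d is the i-vector
    dimension. Uses the standard Gaussian conditional-mean formula."""
    mu_x, mu_y = mu[:d], mu[d:]
    S_xy = Sigma[:d, d:]                    # cross-covariance of x and y
    S_yy = Sigma[d:, d:]                    # covariance of noisy i-vectors
    return mu_x + S_xy @ np.linalg.solve(S_yy, y - mu_y)
```

Because the estimator only needs paired training examples, the same code applies unchanged to the short/long-utterance pairing of the previous paper: only the source of the pairs differs.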
International Conference on Acoustics, Speech, and Signal Processing | 2017
Moez Ajili; Jean-François Bonastre; Waad Ben Kheder; Solange Rossato; Juliette Kahn
Forensic Voice Comparison (FVC) increasingly uses the likelihood ratio (LR) to indicate whether the evidence supports the prosecution (same-speaker) or defense (different-speakers) hypothesis. Nevertheless, the LR suffers from practical limitations, due both to the estimation process itself and to a lack of knowledge about the reliability of this (practical) estimation process. This is particularly true when FVC is performed using Automatic Speaker Recognition (ASR) systems. Indeed, the LR estimation performed by ASR systems does not take into account several factors, such as speaker-intrinsic characteristics (the “speaker factor”), the amount of information involved in the comparison, or the phonological content. This article focuses on the impact of phonological content on FVC involving two different speakers, and more precisely on the potential implication of a specific phonemic category in wrongful-conviction cases (innocent people sent to prison). We show that even though the vast majority of speaker pairs (more than 90%) are well discriminated, a few pairs are difficult to distinguish. For the best-discriminated pairs, all the phonemic content plays a positive role in speaker discrimination, while for the worst pairs, nasals have a negative effect and lead to confusion between speakers.
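As an illustration of the score-based LR idea discussed above: an ASR comparison score is converted into a likelihood ratio by evaluating its density under the same-speaker and different-speakers hypotheses. This is a simplified sketch with hypothetical Gaussian calibration parameters, not the actual estimator of any FVC system:

```python
import math

def likelihood_ratio(score, mu_tar, sd_tar, mu_non, sd_non):
    """Score-based LR: the evidence score is modeled with a Gaussian density
    under each hypothesis. LR > 1 supports the prosecution (same-speaker)
    hypothesis; LR < 1 supports the defense (different-speakers) hypothesis."""
    def gauss(x, mu, sd):
        return math.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))
    return gauss(score, mu_tar, sd_tar) / gauss(score, mu_non, sd_non)
```

The article's point is that this number alone hides reliability factors (speaker factor, amount of information, phonological content), so two comparisons with the same LR can carry very different evidential weight.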
Computer Speech & Language | 2017
Waad Ben Kheder; Driss Matrouf; Pierre-Michel Bousquet; Jean-François Bonastre; Moez Ajili
Highlights: we use a normal distribution model for both clean and noisy i-vectors; we use an additive model of the noise in the i-vector space; we use a MAP estimator to clean up noisy i-vectors based on the clean i-vector and noise distributions in the i-vector space. Since the i-vector paradigm was introduced in the field of speaker recognition, many techniques have been proposed to deal with additive noise within this framework. Due to the complexity of its effect in the i-vector space, much effort has gone into dealing with noise in other domains (speech enhancement, feature compensation, robust i-vector extraction and robust scoring). As far as we know, there has been no serious attempt to handle the noise problem directly in the i-vector space without relying on data distributions computed on a prior domain. The aim of this paper is twofold. First, it proposes a full-covariance Gaussian modeling of the clean i-vector and noise distributions in the i-vector space and introduces a technique to estimate a clean i-vector given the noisy version and the noise density function using the MAP approach. Based on NIST data, we show that it is possible to improve the baseline system performance by up to 60%. Second, in order to make this algorithm usable in a real application and to reduce the computational time needed by i-MAP, we propose an extension that builds a noise distribution database in the i-vector space in an off-line step and uses it later in the test phase. We show that comparable results can be achieved with this approach (up to 57% relative EER improvement) given a sufficiently large noise distribution database.
Conference of the International Speech Communication Association | 2016
Mohamed Morchid; Mohamed Bouaziz; Waad Ben Kheder; Killian Janod; Pierre-Michel Bousquet; Richard Dufour; Georges Linarès
Performance of spoken language understanding applications declines when spoken documents are automatically transcribed in noisy conditions, due to high Word Error Rates (WER). To improve robustness to transcription errors, recent solutions propose mapping these automatic transcriptions into a latent space. These studies have compared classical topic-based representations such as Latent Dirichlet Allocation (LDA), supervised LDA and author-topic (AT) models. An original compact representation, called c-vector, was recently introduced to work around the tricky choice of the number of latent topics in these topic-based representations. Moreover, c-vectors increase the robustness of document classification to transcription errors by compacting different LDA representations of the same speech document into a reduced space, thus compensating for most of the noise in the document representation. The main drawback of this method is the number of sub-tasks needed to build the c-vector space. This paper proposes both to improve this compact representation (c-vector) of spoken documents and to reduce the number of needed sub-tasks, using an original framework based on a robust low-dimensional space of features from a set of AT models, called the Latent Topic-based Sub-space (LTS). In comparison to LDA, the AT model considers not only the dialogue content (words) but also the class related to the document. Experiments are conducted on the DECODA corpus, containing speech conversations from the call center of the RATP Paris transportation company. Results show that the original LTS representation outperforms the best previous compact representation (c-vector), with a substantial gain of more than 2.5% in terms of correctly labeled conversations.
Odyssey 2016 | 2016
Waad Ben Kheder; Driss Matrouf; Moez Ajili; Jean-François Bonastre
The i-vector framework has witnessed great success in speaker recognition (SR) in recent years. The feature extraction process is central in SR systems, and many features have been developed over the years to improve recognition performance. In this paper, we present a new feature representation that borrows a concept initially developed in computer vision to characterize textures, called Local Binary Patterns (LBP). We explore the use of LBP as features for speaker recognition and show that using them as descriptors for cepstral-coefficient dynamics (replacing ∆ and ∆∆ in the regular MFCC representation) results in more efficient features, yielding up to 15% relative improvement over the baseline system performance in both clean and noisy conditions. Keywords: local binary patterns, feature extraction, i-vector.
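A basic 8-neighbour LBP encoder over a 2-D matrix of cepstral coefficients (frames by coefficients) might look like the following sketch; the exact descriptor variant used in the paper may differ:

```python
import numpy as np

def lbp_2d(M):
    """Basic 8-neighbour Local Binary Pattern codes for a 2-D matrix.
    Each interior cell gets an 8-bit code: one bit per neighbour, set when
    the neighbour is >= the centre value. The code thus describes the local
    'texture' of the matrix, independent of its absolute level."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros((M.shape[0] - 2, M.shape[1] - 2), dtype=np.uint8)
    centre = M[1:-1, 1:-1]
    for bit, (di, dj) in enumerate(offsets):
        neigh = M[1 + di:M.shape[0] - 1 + di, 1 + dj:M.shape[1] - 1 + dj]
        codes |= (neigh >= centre).astype(np.uint8) << bit
    return codes
```

Applied along the frame axis of a cepstrogram, such codes capture local temporal dynamics of the cepstral coefficients, which is the role ∆ and ∆∆ play in the standard MFCC pipeline.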
Odyssey 2016 | 2016
Waad Ben Kheder; Driss Matrouf; Moez Ajili; Jean-François Bonastre
Dealing with additive noise in the i-vector space can be challenging due to the complexity of its effect in that space. Several compensation techniques have been proposed in recent years to either remove the noise effect by setting up a noise model in the i-vector space or build better scoring techniques that take environmental perturbations into account. We recently presented an efficient Bayesian cleaning technique operating in the i-vector domain, named I-MAP, that improves the baseline system performance by up to 60%. This technique is based on Gaussian models for the clean and noise i-vector distributions. After the I-MAP transformation, these hypotheses are probably less correct. For this reason, we propose to apply another MMSE-based approach that uses the Kabsch algorithm. For a given noise, it estimates the best translation vector and rotation matrix between a set of training noisy i-vectors and their clean counterparts based on an RMSD criterion. This transformation is then applied to noisy test i-vectors in order to remove the noise effect. We show that applying the Kabsch algorithm achieves a 40% relative improvement in EER compared to the baseline system performance and that, when combined with I-MAP and repeated iteratively, it reaches 85% relative improvement. Keywords: i-vector, additive noise, Kabsch algorithm, I-MAP.
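The Kabsch step described above, estimating the best rotation and translation between paired noisy and clean i-vectors under an RMSD criterion, can be sketched as follows. This is a generic Kabsch implementation, not the paper's code:

```python
import numpy as np

def kabsch(P, Q):
    """Least-RMSD rotation R and translation t mapping rows of P (noisy
    i-vectors) onto rows of Q (their clean counterparts): Q ~= P @ R + t.
    Standard Kabsch: centre both sets, take the SVD of the cross-covariance,
    and correct for reflections so R is a proper rotation (det = +1)."""
    Pc, Qc = P - P.mean(axis=0), Q - Q.mean(axis=0)
    U, _, Vt = np.linalg.svd(Pc.T @ Qc)
    d = np.sign(np.linalg.det(U @ Vt))          # -1 would mean a reflection
    D = np.diag([1.0] * (P.shape[1] - 1) + [d])
    R = U @ D @ Vt
    t = Q.mean(axis=0) - P.mean(axis=0) @ R
    return R, t
```

At test time, a noisy i-vector y would be mapped to y @ R + t using the transformation learned for the matching noise condition.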
IEEE Transactions on Audio, Speech, and Language Processing | 2018
Waad Ben Kheder; Driss Matrouf; Moez Ajili; Jean-François Bonastre
The past decade has witnessed a significant improvement in speaker recognition (SR) performance with the introduction of the i-vector framework. Despite these advances, the performance of SR systems suffers considerably in the presence of acoustic nuisances and variabilities. In this paper, we develop a data-driven nuisance compensation technique in the i-vector space without referring to the effects of the targeted nuisances in the temporal domain. This approach is nonparametric, as it does not assume a specific relationship between a “good” version of an i-vector and its corrupted version. Instead, our algorithm directly models the joint distribution of both representations (the good i-vector and its corrupted version) and takes advantage of the reproducibility of acoustic corruptions to generate the corrupted i-vectors. We then build an MMSE estimator that computes an improved version of a corrupted test i-vector, given this joint distribution. Experiments are carried out on the NIST SRE 2010 and Speakers in the Wild (SITW) databases, where the proposed algorithm is used to deal with additive noise and short utterances. Our technique is shown to be efficient, improving the baseline system performance in terms of equal error rate by up to 70% on known test noises and up to 65% in the context of unseen noises using a generic model. It also proves effective in the context of duration mismatch, reaching up to 40% relative improvement on short utterances using multiple models corresponding to different durations, and up to 36% on test segments of arbitrary duration.