
Publications


Featured research published by Rahim Saeidi.


IEEE Transactions on Audio, Speech, and Language Processing | 2012

Low-Variance Multitaper MFCC Features: A Case Study in Robust Speaker Verification

Tomi Kinnunen; Rahim Saeidi; Filip Sedlak; Kong Aik Lee; Johan Sandberg; Maria Hansson-Sandsten; Haizhou Li

In speech and audio applications, the short-term signal spectrum is often represented using mel-frequency cepstral coefficients (MFCCs) computed from a windowed discrete Fourier transform (DFT). Windowing reduces spectral leakage, but the variance of the spectrum estimate remains high. An elegant extension of the windowed DFT is the so-called multitaper method, which uses multiple time-domain windows (tapers) with frequency-domain averaging. Multitapers have received little attention in speech processing even though they produce low-variance features. In this paper, we propose the multitaper method for MFCC extraction with a practical focus. We first provide a detailed statistical analysis of MFCC bias and variance using autoregressive process simulations on the TIMIT corpus. For speaker verification experiments on the NIST 2002 and 2008 SRE corpora, we consider three Gaussian mixture model based classifiers: universal background model (GMM-UBM), support vector machine (GMM-SVM) and joint factor analysis (GMM-JFA). Multitapers improve MinDCF over the baseline windowed DFT by a relative 20.4% (GMM-SVM) and 13.7% (GMM-JFA) on the interview-interview condition in NIST 2008. The GMM-JFA system further reduces MinDCF by 18.7% on the telephone data. With these improvements, and generally noncritical parameter selection, multitaper MFCCs are a viable candidate for replacing conventional MFCCs.
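
To make the multitaper idea concrete, here is a minimal Python sketch (the frame length, taper count and time-bandwidth product are assumed values, not the paper's tuning) that contrasts the multitaper spectrum estimate with the conventional single-window DFT; the usual mel filterbank, log and DCT steps of MFCC extraction would follow either estimate unchanged.

```python
# Minimal sketch of multitaper vs. windowed-DFT spectrum estimation.
# Assumes numpy/scipy; n_tapers and nw are illustrative settings.
import numpy as np
from scipy.signal.windows import dpss

def multitaper_spectrum(frame, n_tapers=6, nw=3.0, nfft=512):
    """Average eigenspectra from DPSS (Slepian) tapers to reduce variance."""
    tapers, ratios = dpss(len(frame), nw, n_tapers, return_ratios=True)
    # One periodogram per taper, then an eigenvalue-weighted average.
    spectra = np.abs(np.fft.rfft(tapers * frame, n=nfft)) ** 2
    return np.average(spectra, axis=0, weights=ratios)

def windowed_spectrum(frame, nfft=512):
    """Conventional single-window (Hamming) estimate for comparison."""
    return np.abs(np.fft.rfft(np.hamming(len(frame)) * frame, n=nfft)) ** 2
```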


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Duration mismatch compensation for i-vector based speaker recognition systems

Taufiq Hasan; Rahim Saeidi; John H. L. Hansen; David A. van Leeuwen

Speaker recognition systems trained on long-duration utterances are known to perform significantly worse when short test segments are encountered. To address this mismatch, we analyze the effect of duration variability on the phoneme distributions of speech utterances and on i-vector length. We demonstrate that, as utterance duration is decreased, the number of detected unique phonemes and the i-vector length approach zero in a logarithmic and a non-linear fashion, respectively. Treating duration variability as additive noise in the i-vector space, we propose three different strategies for its compensation: i) multi-duration training of the Probabilistic Linear Discriminant Analysis (PLDA) model, ii) score calibration using log duration as a Quality Measure Function (QMF), and iii) multi-duration PLDA training with synthesized short-duration i-vectors. Experiments are designed based on the 2012 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) protocol with varying test utterance duration. Experimental results demonstrate the effectiveness of the proposed schemes on short-duration test conditions, especially with the QMF calibration approach.
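
As a hedged sketch of strategy ii) above: a linear calibration that augments the raw score with log duration as a QMF, trained here with logistic regression. The arrays `scores`, `durations` and `labels` are hypothetical development-set data, and the paper's exact calibration recipe may differ.

```python
# Sketch of QMF-based score calibration with log duration as a quality measure.
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_qmf_calibration(scores, durations, labels):
    """Fit calibration weights on development trials (labels: 1 = target)."""
    X = np.column_stack([scores, np.log(durations)])
    return LogisticRegression().fit(X, labels)

def calibrate(model, scores, durations):
    """Calibrated score: w0*score + w1*log(duration) + bias."""
    X = np.column_stack([scores, np.log(durations)])
    return X @ model.coef_.ravel() + model.intercept_[0]
```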


IEEE Signal Processing Letters | 2010

Temporally Weighted Linear Prediction Features for Tackling Additive Noise in Speaker Verification

Rahim Saeidi; Jouni Pohjalainen; Tomi Kinnunen; Paavo Alku

Text-independent speaker verification under additive noise corruption is considered. In the popular mel-frequency cepstral coefficient (MFCC) front-end, the conventional Fourier-based spectrum estimation is substituted with weighted linear predictive methods, which have earlier shown success in noise-robust speech recognition. Two temporally weighted variants of linear predictive modeling are introduced to speaker verification and compared to the FFT, which is normally used in computing MFCCs, and to conventional linear prediction. The effect of speech enhancement (spectral subtraction) on system performance with each of the four feature representations is also investigated. Experiments on the NIST 2002 SRE corpus indicate that the accuracies of the conventional and proposed features are close to each other on clean data. For factory noise at 0 dB SNR, the baseline FFT and the better of the proposed features give EERs of 17.4% and 15.6%, respectively. These accuracies improve to 11.6% and 11.2%, respectively, when spectral subtraction is included as a preprocessing method. The new features hold promise for noise-robust speaker verification.
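
The temporal weighting idea can be sketched as follows, assuming a short-time-energy (STE) weight computed over the p samples preceding each prediction instant; the order and weighting window are illustrative, and the stabilized (SWLP) variant is omitted.

```python
# Minimal sketch of temporally weighted linear prediction (WLP).
import numpy as np

def wlp(x, p=20):
    """Weighted LP coefficients via weighted covariance normal equations."""
    n = np.arange(p, len(x))
    # STE weight: energy of the p samples preceding each prediction instant.
    w = np.array([np.sum(x[i - p:i] ** 2) for i in n])
    # Delayed-sample matrix: column k holds x[n - 1 - k] (lags 1..p).
    X = np.column_stack([x[n - 1 - k] for k in range(p)])
    R = X.T @ (w[:, None] * X)      # weighted normal equations
    r = X.T @ (w * x[n])
    return np.linalg.solve(R, r)    # predictor coefficients a_1..a_p
```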


IEEE Signal Processing Letters | 2013

Iterative Closed-Loop Phase-Aware Single-Channel Speech Enhancement

Pejman Mowlaee; Rahim Saeidi

Many short-time Fourier transform (STFT) based single-channel speech enhancement algorithms focus on estimating the clean speech spectral amplitude from the noisy observed signal in order to suppress the additive noise. To this end, they utilize the noisy amplitude information and the corresponding a priori and a posteriori SNRs, while employing the observed noisy phase when reconstructing the enhanced speech signal. This paper presents two contributions: i) reconsidering the relation between the phase group-delay deviation and the phase deviation, and ii) proposing a closed-loop single-channel speech enhancement approach that estimates both the amplitude and phase spectra of the speech signal. To this end, we combine a group-delay based phase estimator with a phase-aware amplitude estimator in a closed-loop design. Our experimental results on various noise scenarios show considerable improvement in objective perceived signal quality obtained by the proposed iterative phase-aware approach compared to conventional Wiener filtering, which uses the noisy phase in signal reconstruction.
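
The closed-loop structure can be sketched roughly as below. Note that the phase update here is a simple re-analysis of the time-domain reconstruction (a Griffin-Lim-style stand-in), NOT the group-delay phase estimator of the paper, and `noise_psd` is an assumed per-bin noise power estimate.

```python
# Skeleton of an iterative amplitude/phase enhancement loop (simplified).
import numpy as np
from scipy.signal import stft, istft

def closed_loop_enhance(noisy, noise_psd, fs, n_iter=5, nperseg=512):
    f, t, Y = stft(noisy, fs, nperseg=nperseg)
    # ML a priori SNR and the resulting Wiener amplitude estimate.
    snr_prior = np.maximum(np.abs(Y) ** 2 / noise_psd[:, None] - 1.0, 1e-3)
    amp = (snr_prior / (1.0 + snr_prior)) * np.abs(Y)
    phase = np.angle(Y)                       # start from the noisy phase
    for _ in range(n_iter):
        _, s = istft(amp * np.exp(1j * phase), fs, nperseg=nperseg)
        _, _, S = stft(s, fs, nperseg=nperseg)
        k = min(S.shape[1], amp.shape[1])     # guard against frame-count drift
        phase[:, :k] = np.angle(S)[:, :k]     # refined phase estimate
    return s
```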


International Conference on Biometrics (ICB) | 2013

The 2013 speaker recognition evaluation in mobile environment

Elie Khoury; B. Vesnicer; Javier Franco-Pedroso; Ricardo Paranhos Velloso Violato; Z. Boulkenafet; L. M. Mazaira Fernandez; Mireia Diez; J. Kosmala; Houssemeddine Khemiri; T. Cipr; Rahim Saeidi; Manuel Günther; J. Zganec-Gros; R. Zazo Candil; Flávio Olmos Simões; M. Bengherabi; A. Alvarez Marquina; Mikel Penagarikano; Alberto Abad; M. Boulayemen; Petr Schwarz; D.A. van Leeuwen; J. Gonzalez-Dominguez; M. Uliani Neto; E. Boutellaa; P. Gómez Vilda; Amparo Varona; Dijana Petrovska-Delacrétaz; Pavel Matejka; Joaquin Gonzalez-Rodriguez

This paper evaluates the performance of the twelve primary systems submitted to the evaluation on speaker verification in the context of a mobile environment using the MOBIO database. The mobile environment provides a challenging and realistic test-bed for current state-of-the-art speaker verification techniques. Results in terms of equal error rate (EER), half total error rate (HTER) and detection error trade-off (DET) confirm that the best performing systems are based on total variability modeling, and are the fusion of several sub-systems. Nevertheless, the good old UBM-GMM based systems are still competitive. The results also show that the use of additional data for training as well as gender-dependent features can be helpful.
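
For reference, a minimal sketch of the EER metric used in the comparison above: sweep a decision threshold over the sorted scores and take the operating point where the false-acceptance and false-rejection rates cross. Here `scores` and `labels` are hypothetical trial scores and ground truth (1 = target trial).

```python
# Equal error rate from raw trial scores.
import numpy as np

def equal_error_rate(scores, labels):
    order = np.argsort(scores)[::-1]               # threshold sweep, high to low
    labels = np.asarray(labels)[order]
    far = np.cumsum(1 - labels) / max(np.sum(labels == 0), 1)   # false accepts
    frr = 1.0 - np.cumsum(labels) / max(np.sum(labels == 1), 1) # false rejects
    i = np.argmin(np.abs(far - frr))               # crossing point
    return (far[i] + frr[i]) / 2.0
```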


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Accent recognition using i-vector, Gaussian Mean Supervector and Gaussian posterior probability supervector for spontaneous telephone speech

Mohamad Hasan Bahari; Rahim Saeidi; Hugo Van hamme; David A. van Leeuwen

In this paper, three utterance modeling approaches, namely the Gaussian Mean Supervector (GMS), the i-vector and the Gaussian Posterior Probability Supervector (GPPS), are applied to the accent recognition problem. For each utterance modeling method, three different classifiers, namely the Support Vector Machine (SVM), the Naive Bayesian Classifier (NBC) and the Sparse Representation Classifier (SRC), are employed to find suitable matches between the utterance modeling schemes and the classifiers. The evaluation database is formed from English utterances of speakers whose native languages are Russian, Hindi, American English, Thai, Vietnamese and Cantonese. These utterances are drawn from the National Institute of Standards and Technology (NIST) 2008 Speaker Recognition Evaluation (SRE) database. The results show that GPPS and i-vector are more effective than GMS in this accent recognition task. It is also concluded that, among the employed classifiers, the best matches for the i-vector and GPPS are the SVM and the SRC, respectively.
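
As a rough illustration of one of the three utterance models, the sketch below builds a Gaussian Mean Supervector by MAP-adapting the means of a universal background model (UBM) to a single utterance and stacking them. sklearn's GaussianMixture stands in for the UBM, and the relevance factor r = 16 is an assumed value.

```python
# Sketch of Gaussian Mean Supervector (GMS) extraction via MAP mean adaptation.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_mean_supervector(ubm: GaussianMixture, features, r=16.0):
    post = ubm.predict_proba(features)              # (frames, components)
    n_c = post.sum(axis=0)                          # zeroth-order statistics
    f_c = post.T @ features                         # first-order statistics
    alpha = (n_c / (n_c + r))[:, None]              # data-dependent weight
    means = alpha * (f_c / np.maximum(n_c, 1e-8)[:, None]) \
            + (1.0 - alpha) * ubm.means_            # MAP-adapted means
    return means.ravel()                            # stacked supervector
```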


IEEE Transactions on Audio, Speech, and Language Processing | 2013

Quality Measure Functions for Calibration of Speaker Recognition Systems in Various Duration Conditions

Miranti Indar Mandasari; Rahim Saeidi; Mitchell McLaren; David A. van Leeuwen

This paper investigates the effect of utterance duration on the calibration of a modern i-vector speaker recognition system with probabilistic linear discriminant analysis (PLDA) modeling. A calibration approach using quality measure functions (QMFs) is proposed to include duration in the calibration transformation. Extensive experiments are performed to evaluate the robustness of the proposed calibration approach to conditions unseen in the training of the calibration parameters. Using the latest NIST corpora for evaluation, the results highlight the importance of considering quality metrics such as duration when calibrating the scores of automatic speaker recognition systems.


IEEE Transactions on Audio, Speech, and Language Processing | 2012

A Joint Approach for Single-Channel Speaker Identification and Speech Separation

Pejman Mowlaee; Rahim Saeidi; Mads Græsbøll Christensen; Zheng-Hua Tan; Tomi Kinnunen; Pasi Fränti; Søren Holdt Jensen

In this paper, we present a novel system for joint speaker identification and speech separation. For speaker identification, a single-channel algorithm is proposed which provides an estimate of the signal-to-signal ratio (SSR) as a by-product. For speech separation, we propose a sinusoidal model-based algorithm consisting of a double-talk/single-talk detector followed by a minimum mean square error estimator of sinusoidal parameters, which finds optimal codevectors from pre-trained speaker codebooks. In evaluating the proposed system, we start from a situation where we have prior information on codebook indices, speaker identities and the SSR level, and then, by relaxing these assumptions one by one, we demonstrate the efficiency of the proposed fully blind system. In contrast to previous studies that mostly focus on automatic speech recognition (ASR) accuracy, we report objective and subjective results as well. The results show that the proposed system performs as well as the best of the state of the art in terms of perceived quality, while its speaker identification and automatic speech recognition performance is generally lower. It outperforms the state of the art in terms of intelligibility, showing that the ASR results are not conclusive. The proposed method achieves, on average, 52.3% ASR accuracy, 41.2 points in MUSHRA and 85.9% speech intelligibility.
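
A minimal sketch of the sinusoidal-model front end described above: per-frame spectral peak picking that yields (amplitude, frequency, phase) triplets. The peak budget and framing are illustrative, and the paper's MMSE codevector search over speaker codebooks is omitted.

```python
# Sketch of sinusoidal parameter extraction by spectral peak picking.
import numpy as np
from scipy.signal import find_peaks

def sinusoidal_params(frame, fs, n_peaks=20, nfft=1024):
    spec = np.fft.rfft(np.hanning(len(frame)) * frame, n=nfft)
    mag = np.abs(spec)
    idx, _ = find_peaks(mag)                          # local spectral maxima
    idx = idx[np.argsort(mag[idx])[::-1][:n_peaks]]   # keep strongest peaks
    freqs = idx * fs / nfft                           # bin index -> Hz
    return mag[idx], freqs, np.angle(spec[idx])       # (amp, freq, phase)
```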


IEEE Signal Processing Letters | 2012

Regularized All-Pole Models for Speaker Verification Under Noisy Environments

Cemal Hanilçi; Tomi Kinnunen; Figen Ertaş; Rahim Saeidi; Jouni Pohjalainen; Paavo Alku

Regularization of linear prediction based mel-frequency cepstral coefficient (MFCC) extraction in speaker verification is considered. Commonly, MFCCs are extracted from the discrete Fourier transform (DFT) spectrum of speech frames. In this paper, the DFT spectrum estimate is replaced with the recently proposed regularized linear prediction (RLP) method. Regularization of the temporally weighted variants, weighted LP (WLP) and stabilized WLP (SWLP), which have earlier shown success in speech and speaker recognition, is also introduced. A novel type of double autocorrelation (DAC) lag windowing is also proposed to enhance robustness. Experiments on the NIST 2002 corpus indicate that the regularized all-pole methods (RLP, RWLP and RSWLP) yield large improvements in recognition accuracy under additive factory and babble noise conditions in terms of both equal error rate (EER) and minimum detection cost function (MinDCF).
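
To make the regularization idea concrete, here is a hedged sketch that applies ridge-style diagonal loading to the autocorrelation (Yule-Walker) equations of standard linear prediction; this generic penalty stands in for, and is not identical to, the RLP/RWLP penalties studied in the paper.

```python
# Sketch of all-pole modeling with diagonally loaded Yule-Walker equations.
import numpy as np
from scipy.linalg import solve_toeplitz

def regularized_lp(x, p=20, lam=1e-2):
    r = np.correlate(x, x, mode='full')[len(x) - 1:len(x) + p]  # lags 0..p
    col = np.r_[r[0] * (1.0 + lam), r[1:p]]   # loaded first column of R
    return solve_toeplitz(col, r[1:p + 1])    # predictor coefficients a_1..a_p
```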


IEEE Signal Processing Letters | 2010

Multitaper Estimation of Frequency-Warped Cepstra With Application to Speaker Verification

Johan Sandberg; Maria Hansson-Sandsten; Tomi Kinnunen; Rahim Saeidi; Patrick Flandrin; Pierre Borgnat

Usually, mel-frequency cepstral coefficients are estimated either from a periodogram or from a windowed periodogram. We state a general estimator which also includes multitaper estimators. We propose approximations of the variance and bias of the estimate of each coefficient and, using Monte Carlo computations, demonstrate that the approximations are accurate. Using the proposed formulas, the peak-matched multitaper estimator is shown to have a low mean square error (squared bias + variance) on speech-like processes. It also performs slightly better in the NIST 2006 speaker verification task compared to the Hamming window conventionally used in this context.
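
The Monte Carlo methodology mentioned above can be sketched as follows: draw realizations of a synthetic AR process, compute cepstra from a windowed periodogram, and measure the per-coefficient variance empirically. The AR coefficients, trial counts and the omission of mel frequency warping are all simplifications.

```python
# Monte Carlo estimate of per-coefficient cepstral variance on an AR(2) process.
import numpy as np
from scipy.signal import lfilter
from scipy.fft import dct

def mc_cepstral_variance(n_trials=500, n=400, n_ceps=13, nfft=512, seed=0):
    rng = np.random.default_rng(seed)
    ceps = []
    for _ in range(n_trials):
        x = lfilter([1.0], [1.0, -1.3, 0.6], rng.standard_normal(n))  # AR(2)
        spec = np.abs(np.fft.rfft(np.hamming(n) * x, n=nfft)) ** 2
        ceps.append(dct(np.log(spec + 1e-12), norm='ortho')[:n_ceps])
    return np.var(np.array(ceps), axis=0)   # empirical variance per coefficient
```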

Collaboration


Dive into Rahim Saeidi's collaborations.

Top Co-Authors

Tomi Kinnunen | University of Eastern Finland
Pejman Mowlaee | Graz University of Technology
Pasi Fränti | University of Eastern Finland