
Publication


Featured research published by Sahidullah.


Digital Signal Processing | 2016

Local spectral variability features for speaker verification

Sahidullah; Tomi Kinnunen

Speaker verification techniques neglect the short-time variation in the feature space even though it contains speaker-related attributes. We propose a simple method to capture and characterize this spectral variation through the eigenstructure of the sample covariance matrix. This covariance is computed using a sliding window over spectral features. The newly formulated feature vectors representing local spectral variations are used with classical and state-of-the-art speaker recognition systems. Results on multiple speaker recognition evaluation corpora reveal that eigenvectors weighted with their normalized singular values are useful in representing local covariance information. We have also shown that local variability features can be extracted using mel-frequency cepstral coefficients (MFCCs) as well as using three recently developed features: frequency domain linear prediction (FDLP), mean Hilbert envelope coefficients (MHECs) and power-normalized cepstral coefficients (PNCCs). Since the information conveyed in the proposed feature is complementary to the standard short-term features, we apply different fusion techniques. We observe considerable relative improvements in speaker verification accuracy in combined mode on text-independent (NIST SRE) and text-dependent (RSR2015) speech corpora. We have obtained up to 12.28% relative improvement in speaker recognition accuracy on text-independent corpora, and in experiments on text-dependent corpora we have achieved up to 40% relative reduction in EER. To sum up, combining local covariance information with the traditional cepstral features holds promise as an additional speaker cue in both text-independent and text-dependent recognition.
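The sliding-window covariance idea described above can be sketched as follows; the window length, the number of retained eigenvectors, and all names are illustrative assumptions, not the paper's exact settings:

```python
import numpy as np

def local_variability_features(cepstra, win=20, top_k=3):
    """Slide a window over frame-level features, eigendecompose the sample
    covariance of each window, and keep the leading eigenvectors weighted
    by their normalized eigenvalues (the singular values of a PSD matrix)."""
    n_frames, dim = cepstra.shape
    feats = []
    for start in range(0, n_frames - win + 1):
        block = cepstra[start:start + win]
        cov = np.cov(block, rowvar=False)           # dim x dim sample covariance
        vals, vecs = np.linalg.eigh(cov)            # ascending eigenvalues
        order = np.argsort(vals)[::-1][:top_k]      # leading eigenpairs
        weights = vals[order] / vals[order].sum()   # normalized singular values
        feats.append((vecs[:, order] * weights).T.reshape(-1))
    return np.array(feats)

# toy usage: 100 frames of 13-dim "MFCCs" yield one vector per window position
rng = np.random.default_rng(0)
fv = local_variability_features(rng.standard_normal((100, 13)))
```

Each output vector concatenates the weighted top eigenvectors, so its dimensionality is `top_k * dim`.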


Conference of the International Speech Communication Association | 2016

Utterance Verification for Text-Dependent Speaker Recognition: A Comparative Assessment Using the RedDots Corpus

Tomi Kinnunen; Sahidullah; Ivan Kukanov; Héctor Delgado; Massimiliano Todisco; Achintya Kumar Sarkar; Nicolai Bæk Thomsen; Ville Hautamäki; Nicholas W. D. Evans; Zheng-Hua Tan

Text-dependent automatic speaker verification naturally calls for the simultaneous verification of speaker identity and spoken content. These two tasks can be achieved with automatic speaker verification (ASV) and utterance verification (UV) technologies. While both have been addressed previously in the literature, a treatment of simultaneous speaker and utterance verification with a modern, standard database is so far lacking. This is despite the burgeoning demand for voice biometrics in a plethora of practical security applications. With the goal of improving overall verification performance, this paper reports different strategies for simultaneous ASV and UV in the context of short-duration, text-dependent speaker verification. Experiments performed on the recently released RedDots corpus are reported for three different ASV systems and four different UV systems. Results show that the combination of utterance verification with automatic speaker verification is (almost) universally beneficial with significant performance improvements being observed.
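Verification performance in studies like this is conventionally summarized by the equal error rate (EER); a minimal brute-force sketch of computing it from target and non-target trial scores:

```python
import numpy as np

def eer(target_scores, nontarget_scores):
    """Equal error rate: the operating point where the false-rejection rate
    (target trials scored below the threshold) meets the false-acceptance
    rate (non-target trials scored at or above it). Brute-force sweep."""
    targets = np.asarray(target_scores, dtype=float)
    nontargets = np.asarray(nontarget_scores, dtype=float)
    best_gap, best_eer = np.inf, None
    for thr in np.sort(np.concatenate([targets, nontargets])):
        frr = np.mean(targets < thr)
        far = np.mean(nontargets >= thr)
        if abs(frr - far) < best_gap:
            best_gap, best_eer = abs(frr - far), (frr + far) / 2
    return best_eer

# perfectly separated toy scores: no threshold error
toy_eer = eer([2.0, 3.0, 4.0], [-1.0, 0.0, 1.0])
```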


Speech Communication | 2016

Spoofing detection goes noisy

Cemal Hanilçi; Tomi Kinnunen; Sahidullah; Aleksandr Sizov

Highlights: For the first time, TTS and VC attack detection under additive noise is studied. Various front-ends for synthetic speech detection under additive noise are systematically analyzed on the ASVspoof 2015 database. Relative phase shift (RPS) features perform better than the other features considered in the clean condition. Mel-frequency cepstral coefficients (MFCCs) and subband spectral centroid magnitude (SCMC) features are the best two techniques among seven different front-ends under noisy conditions. The standard GMM performs better than i-vector PLDA in both clean and noisy conditions.

Automatic speaker verification (ASV) technology is recently finding its way to end-user applications for secure access to personal data, smart services or physical facilities. Similar to other biometric technologies, speaker verification is vulnerable to spoofing attacks where an attacker masquerades as a particular target speaker via impersonation, replay, text-to-speech (TTS) or voice conversion (VC) techniques to gain illegitimate access to the system. We focus on TTS and VC, which represent the most flexible, high-end spoofing attacks. Most of the prior studies on synthesized or converted speech detection report their findings using high-quality clean recordings. Meanwhile, the performance of spoofing detectors in the presence of additive noise, an important consideration in practical ASV implementations, remains largely unknown. To this end, our study provides a comparative analysis of existing state-of-the-art, off-the-shelf synthetic speech detectors under additive noise contamination with a special focus on front-end processing, which has been found critical. Our comparison includes eight acoustic feature sets, five related to spectral magnitude and three to spectral phase information. All the methods contain a number of internal control parameters.
Except for feature post-processing steps (deltas and cepstral mean normalization), which we optimized for each method, we fix the internal control parameters to their default values from the literature and compare all the variants using the exact same dimensionality and back-end system. In addition to the eight feature sets, we consider two alternative classifier back-ends: Gaussian mixture model (GMM) and i-vector, the latter with both cosine scoring and probabilistic linear discriminant analysis (PLDA) scoring. Our extensive analysis on the recent ASVspoof 2015 challenge provides new insights into the robustness of the spoofing detectors. Firstly, unlike in most other speech processing tasks, all the compared spoofing detectors break down even at relatively high signal-to-noise ratios (SNRs) and fail to generalize to noisy conditions even if performing excellently on clean data. This indicates both the difficulty of the task and the ease with which the methods over-fit. Secondly, speech enhancement pre-processing is not found helpful. Thirdly, the GMM back-end generally outperforms the more involved i-vector back-end. Fourthly, concerning the compared features, the mel-frequency cepstral coefficient (MFCC) and subband spectral centroid magnitude coefficient (SCMC) features perform the best on average, though the winning method depends on SNR and noise type. Finally, a study with two score fusion strategies shows that combining different feature-based systems improves recognition accuracy for known and unknown attacks in both clean and noisy conditions. In particular, simple score averaging fusion, as opposed to weighted fusion with logistic loss weight optimization, was found to work better on average. For clean speech, it provides 88% and 28% relative improvements over the best standalone features for known and unknown spoofing techniques, respectively. If we consider the best score fusion of just two features, then RPS serves as a complementary agent to one of the magnitude features. To sum up, our study reveals a significant gap in the performance of state-of-the-art spoofing detectors between clean and noisy conditions.
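The simple score-averaging fusion found to work best can be sketched as follows; the per-system z-normalization before averaging is our assumption, since the abstract does not specify how scores are calibrated before fusion:

```python
import numpy as np

def average_fusion(score_lists):
    """Equal-weight score averaging across detectors. Each system's scores
    are z-normalized first so that systems with different score scales
    contribute equally (our assumption, not stated in the abstract)."""
    normed = []
    for s in score_lists:
        s = np.asarray(s, dtype=float)
        normed.append((s - s.mean()) / s.std())
    return np.mean(normed, axis=0)

# two hypothetical countermeasure systems scoring the same five trials
sys_a = [2.0, -1.0, 0.5, 3.0, -2.0]
sys_b = [20.0, -5.0, 3.0, 25.0, -15.0]
fused = average_fusion([sys_a, sys_b])
```

Because z-normalization is order-preserving, trials ranked highest by both systems stay highest after fusion.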


IEEE India Conference | 2015

Performance comparison of speaker recognition systems in presence of duration variability

Arnab Poddar; Sahidullah; Goutam Saha

The performance of a speaker recognition system is highly dependent on the amount of speech data used in training and testing. In this paper, we compare the performance of two different speaker recognition systems in the presence of utterance duration variability. The first system is based on state-of-the-art total variability modeling (also known as the i-vector system), whereas the other is a classical speaker recognition system based on the Gaussian mixture model with universal background model (GMM-UBM). We have conducted extensive experiments for different cases of length mismatch on two NIST corpora: NIST SRE 2008 and NIST SRE 2010. Our study reveals that the relative improvement of the total variability based system gradually drops with the reduction in test utterance length. We also observe that if the speakers are enrolled with a sufficient amount of training data, the GMM-UBM system outperforms the i-vector system for very short test utterances.
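For reference, the classical GMM-UBM score mentioned above is the average frame-level log-likelihood ratio between a speaker model and the universal background model; a toy diagonal-covariance sketch:

```python
import numpy as np

def diag_gmm_loglik(X, weights, means, variances):
    """Per-frame log-likelihood under a diagonal-covariance GMM."""
    comps = []
    for w, m, v in zip(weights, means, variances):
        m, v = np.asarray(m, float), np.asarray(v, float)
        ll = -0.5 * (np.sum((X - m) ** 2 / v, axis=1)
                     + np.sum(np.log(2 * np.pi * v)))
        comps.append(np.log(w) + ll)
    return np.logaddexp.reduce(comps, axis=0)

def gmm_ubm_score(X, speaker, ubm):
    """Average frame-level log-likelihood ratio: speaker model vs. UBM."""
    return float(np.mean(diag_gmm_loglik(X, *speaker) - diag_gmm_loglik(X, *ubm)))

# toy single-component models in 2-D; every frame sits at the speaker's mean
X = np.full((50, 2), 2.0)
speaker = ([1.0], [[2.0, 2.0]], [[1.0, 1.0]])
ubm = ([1.0], [[0.0, 0.0]], [[1.0, 1.0]])
score = gmm_ubm_score(X, speaker, ubm)
```

With the toy numbers above, every frame contributes a ratio of exactly 4 nats, so the average score is 4.0.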


Spoken Language Technology Workshop | 2016

Further optimisations of constant Q cepstral processing for integrated utterance and text-dependent speaker verification

Héctor Delgado; Massimiliano Todisco; Sahidullah; Achintya Kumar Sarkar; Nicholas W. D. Evans; Tomi Kinnunen; Zheng-Hua Tan

Many authentication applications involving automatic speaker verification (ASV) demand robust performance using short-duration, fixed or prompted text utterances. Text constraints not only reduce the phone-mismatch between enrolment and test utterances, which generally leads to improved performance, but also provide an ancillary level of security. This can take the form of explicit utterance verification (UV). An integrated UV + ASV system should then verify access attempts which contain not just the expected speaker, but also the expected text content. This paper presents such a system and introduces new features which are used for both UV and ASV tasks. Based upon multi-resolution, spectro-temporal analysis and when fused with more traditional parameterisations, the new features not only generally outperform Mel-frequency cepstral coefficients, but also are shown to be complementary when fusing systems at score level. Finally, the joint operation of UV and ASV greatly decreases false acceptances for unmatched text trials.
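The multi-resolution, spectro-temporal analysis referred to here is constant Q cepstral processing. A rough sketch of the underlying idea (cepstra from a geometrically spaced filterbank, giving octave-proportional frequency resolution) follows; real CQCC features are computed from a true constant-Q transform with uniform resampling, and the frequency range, filter count and dimensionalities below are illustrative assumptions:

```python
import numpy as np

def geometric_cepstra(power_spectrum, sr, n_filters=20, n_ceps=12):
    """Log filterbank energies from geometrically spaced triangular filters,
    decorrelated by a DCT-II -- a crude stand-in for constant-Q cepstra."""
    n_bins = len(power_spectrum)
    freqs = np.linspace(0, sr / 2, n_bins)
    edges = np.geomspace(50.0, sr / 2, n_filters + 2)  # geometric spacing
    fb = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        lo, mid, hi = edges[i], edges[i + 1], edges[i + 2]
        up = (freqs - lo) / (mid - lo)
        down = (hi - freqs) / (hi - mid)
        fb[i] = np.clip(np.minimum(up, down), 0, None)  # triangular response
    log_e = np.log(fb @ power_spectrum + 1e-10)
    # DCT-II basis, keeping the first n_ceps coefficients
    n = np.arange(n_filters)
    dct = np.cos(np.pi / n_filters * (n[None, :] + 0.5)
                 * np.arange(n_ceps)[:, None])
    return dct @ log_e

# toy magnitude-squared spectrum: 257 FFT bins at a 16 kHz sampling rate
rng = np.random.default_rng(1)
ceps = geometric_cepstra(rng.random(257) + 0.1, 16000)
```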


International Conference on Acoustics, Speech, and Signal Processing | 2017

RedDots replayed: A new replay spoofing attack corpus for text-dependent speaker verification research

Tomi Kinnunen; Sahidullah; Mauro Falcone; Luca Costantini; Rosa González Hautamäki; Dennis Alexander Lehmann Thomsen; Achintya Kumar Sarkar; Zheng-Hua Tan; Héctor Delgado; Massimiliano Todisco; Nicholas W. D. Evans; Ville Hautamäki; Kong Aik Lee

This paper describes a new database for the assessment of automatic speaker verification (ASV) vulnerabilities to spoofing attacks. In contrast to other recent data collection efforts, the new database has been designed to support the development of replay spoofing countermeasures tailored towards the protection of text-dependent ASV systems from replay attacks in the face of variable recording and playback conditions. Derived from the re-recording of the original RedDots database, the effort is aligned with that in text-dependent ASV and thus well positioned for future assessments of replay spoofing countermeasures, not just in isolation, but in integration with ASV. The paper describes the database design and re-recording, a protocol and some early spoofing detection results. The new “RedDots Replayed” database is publicly available through a creative commons license.


International Conference on Acoustics, Speech, and Signal Processing | 2017

Generalization of spoofing countermeasures: A case study with ASVspoof 2015 and BTAS 2016 corpora

Dipjyoti Paul; Sahidullah; Goutam Saha

Voice-based biometric systems are highly prone to spoofing attacks. Recently, various countermeasures have been developed for detecting different kinds of attacks such as replay, speech synthesis (SS) and voice conversion (VC). Most of the existing studies are conducted with a specific training set defined by the evaluation protocol. However, in realistic scenarios, selecting appropriate training data is an open challenge for the system administrator. Motivated by this practical concern, this work investigates the generalization capability of spoofing countermeasures in restricted training conditions where speech from a broad attack type is left out of the training database. We demonstrate that different spoofing types have considerably different generalization capabilities. For this study, we analyze the performance using two kinds of features: mel-frequency cepstral coefficients (MFCCs), which are considered as the baseline, and the recently proposed constant Q cepstral coefficients (CQCCs). The experiments are conducted with a standard Gaussian mixture model - maximum likelihood (GMM-ML) classifier on two recently released spoofing corpora, ASVspoof 2015 and BTAS 2016, including cross-corpora performance analysis. Feature-level analysis suggests that both static and dynamic coefficients of spectral features are important for detecting spoofing attacks in real-life conditions.
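The dynamic coefficients highlighted in the final sentence are conventionally computed with the standard delta regression over neighboring frames; a sketch:

```python
import numpy as np

def deltas(feats, N=2):
    """Dynamic (delta) coefficients via the standard regression formula
    d_t = sum_{n=1..N} n * (c_{t+n} - c_{t-n}) / (2 * sum_{n=1..N} n^2),
    with edge frames replicated at the boundaries."""
    padded = np.pad(feats, ((N, N), (0, 0)), mode='edge')
    T = feats.shape[0]
    out = np.zeros_like(feats, dtype=float)
    for n in range(1, N + 1):
        out += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return out / (2 * sum(n * n for n in range(1, N + 1)))

# on a linear ramp the interior deltas equal the slope (1.0 per frame)
ramp = np.arange(10, dtype=float).reshape(-1, 1)
d = deltas(ramp)
```

Static and delta streams are then typically concatenated frame-by-frame before modeling.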


Pattern Recognition and Machine Intelligence | 2017

An Adaptive i-Vector Extraction for Speaker Verification with Short Utterance

Arnab Poddar; Sahidullah; Goutam Saha

A prime challenge in automatic speaker verification (ASV) is to improve performance with short speech segments. The variability and uncertainty of the intermediate model parameters associated with the state-of-the-art i-vector based ASV system increase considerably for short durations. To compensate for this increased variability, we propose an adaptive approach for the estimation of model parameters. The pre-estimated universal background model (UBM) parameters are used for adaptation, and the speaker models, i.e., i-vectors, are generated with the proposed adapted parameters. The proposed approach considerably outperforms the conventional i-vector based system on the publicly available NIST SRE 2010 speech corpus, especially for short durations, as required in real-world applications.
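The abstract does not spell out the adaptation algorithm, but the classic mechanism for adapting pre-estimated UBM parameters with limited data is relevance-MAP estimation, where a component's mean moves away from the UBM prior only in proportion to how much data it sees. A sketch of mean-only adaptation, offered as an illustration of that mechanism rather than the paper's exact method:

```python
import numpy as np

def map_adapt_means(X, ubm_weights, ubm_means, ubm_vars, r=16.0):
    """Relevance-MAP adaptation of diagonal-covariance UBM means toward an
    utterance's sufficient statistics; r is the relevance factor."""
    ubm_means = np.asarray(ubm_means, float)
    ubm_vars = np.asarray(ubm_vars, float)
    # posterior responsibilities of each UBM component for each frame
    logp = []
    for w, m, v in zip(ubm_weights, ubm_means, ubm_vars):
        ll = -0.5 * np.sum((X - m) ** 2 / v + np.log(2 * np.pi * v), axis=1)
        logp.append(np.log(w) + ll)
    logp = np.array(logp)                                  # (C, T)
    gamma = np.exp(logp - np.logaddexp.reduce(logp, axis=0))
    n_c = gamma.sum(axis=1)                                # zeroth-order stats
    f_c = gamma @ X                                        # first-order stats
    alpha = (n_c / (n_c + r))[:, None]                     # data-driven weight
    return alpha * (f_c / np.maximum(n_c, 1e-12)[:, None]) + (1 - alpha) * ubm_means

# two 1-D components; all frames sit near the second component, so only
# that component's mean moves appreciably toward the data
X = np.full((100, 1), 2.5)
adapted = map_adapt_means(X, [0.5, 0.5], [[-2.0], [2.0]], [[1.0], [1.0]])
```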


Speech Communication | 2017

Acoustical and perceptual study of voice disguise by age modification in speaker verification

Rosa González Hautamäki; Sahidullah; Ville Hautamäki; Tomi Kinnunen

The task of speaker recognition is feasible when the speakers are co-operative or wish to be recognized. While modern automatic speaker verification (ASV) systems and some listeners are good at recognizing speakers from modal, unmodified speech, the task becomes notoriously difficult in situations of deliberate voice disguise when the speaker aims at masking his or her identity. We approach voice disguise from the perspective of acoustical and perceptual analysis using a self-collected corpus of 60 native Finnish speakers (31 female, 29 male) producing utterances in normal, intended young and intended old voice modes. The normal voices form a starting point and we are interested in studying how the two disguise modes impact the acoustical parameters and perceptual speaker similarity judgments. First, we study the effect of disguise as a relative change in fundamental frequency (F0) and formant frequencies (F1 to F4) from modal to disguised utterances. Next, we investigate whether or not speaker comparisons that are deemed easy or difficult by a modern ASV system have a similar difficulty level for the human listeners. Further, we study affecting factors from listener-related self-reported information that may explain a particular listener's success or failure in speaker similarity assessment. Our acoustic analysis reveals a systematic increase in the relative change in mean F0 for the intended young voices, while for the intended old voices the relative change is less prominent in most cases. Concerning the formants F1 through F4, 29% (for male) and 30% (for female) of the utterances did not exhibit a significant change in any formant value, while the remaining ∼70% of utterances had significant changes in at least one formant. Our listening panel consists of 70 listeners, 32 native and 38 non-native, who listened to 24 utterance pairs selected using rankings produced by an ASV system.
The results indicate that speaker pairs categorized as easy by our ASV system were also easy for the average listener. Similarly, the listeners made more errors in the difficult trials. The listening results indicate that target (same speaker) trials were more difficult for the non-native group, while the performance for the non-target pairs was similar for both native and non-native groups.
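The relative-change measure used throughout the acoustic analysis is straightforward; the values below are illustrative, not measurements from the corpus:

```python
def relative_change(modal_hz, disguised_hz):
    """Relative change of an acoustic parameter (e.g. mean F0) from the
    modal to the disguised voice mode, as a percentage of the modal value."""
    return 100.0 * (disguised_hz - modal_hz) / modal_hz

# a hypothetical speaker raising mean F0 from 120 Hz to 180 Hz
# when producing an intended 'young' voice
change = relative_change(120.0, 180.0)
```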


IEEE India Conference | 2015

Optimization of cepstral features for robust lung sound classification

Nandini Sengupta; Sahidullah; Goutam Saha

Detection of lung abnormalities by characterizing lung sounds is a primary step in clinical examination by a pulmonologist. This work focuses on the utilization of cepstral features for lung sound analysis and classification. The proposed method incorporates statistical properties of cepstral features along with artificial neural network (ANN) based classification. Experimental results indicate that the proposed features outperform the wavelet-based features and conventional mel-frequency cepstral features. Further analyses have been performed on the proposed features to experimentally optimize the frame size and feature dimensionality. We also optimize the number of hidden-layer nodes to improve robustness. We have found that the optimized features perform better over a wide range of signal-to-noise ratio (SNR) values.
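One common way to turn variable-length frame-level cepstra into the fixed-length statistical inputs an ANN expects is mean-and-deviation pooling; a sketch under that assumption (the paper's exact statistics may differ):

```python
import numpy as np

def stat_pool(cepstra):
    """Collapse frame-level cepstra (frames x dims) into a single vector of
    per-dimension means and standard deviations, suitable as a fixed-size
    input for an ANN classifier regardless of recording length."""
    return np.concatenate([cepstra.mean(axis=0), cepstra.std(axis=0)])

# 200 frames of 13-dim cepstra become one 26-dim input vector
rng = np.random.default_rng(0)
x = stat_pool(rng.standard_normal((200, 13)))
```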

Collaboration


Dive into Sahidullah's collaborations.

Top Co-Authors

Tomi Kinnunen, University of Eastern Finland
Junichi Yamagishi, National Institute of Informatics
Goutam Saha, Indian Institute of Technology Kharagpur
Ville Hautamäki, University of Eastern Finland