Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Taufiq Hasan is active.

Publication


Featured research published by Taufiq Hasan.


IEEE Signal Processing Magazine | 2015

Speaker Recognition by Machines and Humans: A tutorial review

John H. L. Hansen; Taufiq Hasan

Identifying a person by his or her voice is an important human trait most take for granted in natural human-to-human interaction/communication. Speaking to someone over the telephone usually begins by identifying who is speaking and, at least in cases of familiar speakers, a subjective verification by the listener that the identity is correct and the conversation can proceed. Automatic speaker-recognition systems have emerged as an important means of verifying identity in many e-commerce applications as well as in general business interactions, forensics, and law enforcement. Human experts trained in forensic speaker recognition can perform this task even better by examining a set of acoustic, prosodic, and linguistic characteristics of speech in a general approach referred to as structured listening. Techniques in forensic speaker recognition have been developed for many years by forensic speech scientists and linguists to help reduce any potential bias or preconceived understanding as to the validity of an unknown audio sample and a reference template from a potential suspect. Experienced researchers in signal processing and machine learning continue to develop automatic algorithms to effectively perform speaker recognition, with ever-improving performance, to the point where automatic systems start to perform on par with human listeners. In this article, we review the literature on speaker recognition by machines and humans, with an emphasis on prominent speaker-modeling techniques that have emerged in the last decade for automatic systems. We discuss different aspects of automatic systems, including voice-activity detection (VAD), features, speaker models, standard evaluation data sets, and performance metrics. Human speaker recognition is discussed in two parts: the first part involves forensic speaker-recognition methods, and the second illustrates how a naïve listener performs this task from a neuroscience perspective. We conclude this review with a comparative study of human versus machine speaker recognition and attempt to point out strengths and weaknesses of each.
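
Among the performance metrics the review discusses, the equal error rate (EER) is the most common. As a minimal illustration independent of the article, here is one standard way to compute EER from hypothetical target and impostor score arrays (all values below are made up):

```python
import numpy as np

def compute_eer(target_scores, impostor_scores):
    """Equal error rate: the operating point where the false-accept and
    false-reject rates cross as the decision threshold is swept."""
    thresholds = np.sort(np.concatenate([target_scores, impostor_scores]))
    far = np.array([(impostor_scores >= t).mean() for t in thresholds])
    frr = np.array([(target_scores < t).mean() for t in thresholds])
    idx = np.argmin(np.abs(far - frr))
    return (far[idx] + frr[idx]) / 2.0

rng = np.random.default_rng(0)
eer = compute_eer(rng.normal(2.0, 1.0, 1000),   # synthetic target scores
                  rng.normal(0.0, 1.0, 1000))   # synthetic impostor scores
print(f"EER: {eer:.2%}")
```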


IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

Duration mismatch compensation for i-vector based speaker recognition systems

Taufiq Hasan; Rahim Saeidi; John H. L. Hansen; David A. van Leeuwen

Speaker recognition systems trained on long-duration utterances are known to perform significantly worse when short test segments are encountered. To address this mismatch, we analyze the effect of duration variability on the phoneme distributions of speech utterances and on i-vector length. We demonstrate that, as utterance duration is decreased, the number of detected unique phonemes and the i-vector length approach zero in a logarithmic and a non-linear fashion, respectively. Treating duration variability as additive noise in the i-vector space, we propose three different strategies for its compensation: i) multi-duration training of the Probabilistic Linear Discriminant Analysis (PLDA) model, ii) score calibration using log duration as a Quality Measure Function (QMF), and iii) multi-duration PLDA training with synthesized short-duration i-vectors. Experiments are designed based on the 2012 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) protocol with varying test utterance duration. Experimental results demonstrate the effectiveness of the proposed schemes on short-duration test conditions, especially with the QMF calibration approach.
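
The QMF calibration strategy described above augments the raw recognition score with log duration before learning an affine calibration map. A minimal sketch of that idea, with entirely synthetic development scores and durations, and logistic regression standing in for whatever calibration back-end the authors actually used:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 2000
# Hypothetical development trials: raw PLDA scores, test durations (s), labels.
durations = rng.uniform(3.0, 60.0, n)
labels = rng.integers(0, 2, n)
raw = 2.0 * labels + rng.normal(0.0, 1.0, n) - 0.02 * (60.0 - durations)

# QMF calibration: augment the score with log duration, then learn an
# affine mapping over [score, log duration].
X = np.column_stack([raw, np.log(durations)])
cal = LogisticRegression().fit(X, labels)

# Calibrated score for a new trial with raw score 1.3 and a 10 s test segment.
new_trial = np.array([[1.3, np.log(10.0)]])
print(cal.decision_function(new_trial))
```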


IEEE Transactions on Audio, Speech, and Language Processing | 2011

A Study on Universal Background Model Training in Speaker Verification

Taufiq Hasan; John H. L. Hansen

State-of-the-art Gaussian mixture model (GMM)-based speaker recognition/verification systems utilize a universal background model (UBM), which typically requires extensive resources to train, especially if multiple channel and microphone categories are considered. In this study, a systematic analysis of speaker verification system performance is considered for which the UBM data is selected and purposefully altered in different ways, including variation in the amount of data, the sub-sampling structure of the feature frames, and variation in the number of speakers. An objective measure is formulated from the UBM covariance matrix and is found to be highly correlated with system performance, both when the data amount is varied while keeping the UBM data set constant and when the number of UBM speakers is increased while keeping the data amount constant. The advantages of feature sub-sampling for improving UBM training speed are also discussed, and a novel and effective phonetic distance-based frame selection method is developed. The sub-sampling methods presented are shown to retain baseline equal error rate (EER) system performance using only 1% of the original UBM data, resulting in a drastic reduction in UBM training computation time. This, in theory, dispels the myth of “There's no data like more data” for the purpose of UBM construction. With respect to the UBM speakers, the effect of systematically controlling the number of training (UBM) speakers versus overall system performance is analyzed. It is shown experimentally that increasing the inter-speaker variability in the UBM data while keeping the overall data size constant gradually improves system performance. Finally, two alternative speaker selection methods based on different speaker diversity measures are presented. Using the proposed schemes, it is shown that by selecting a diverse set of UBM speakers, the baseline system performance can be retained using less than 30% of the original UBM speakers.
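
As a rough illustration of the sub-sampling result, the sketch below fits a small diagonal-covariance GMM-UBM on a random 1% subset of pooled feature frames. It shows only naive random sub-sampling on synthetic data, not the paper's phonetic distance-based frame selection:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
frames = rng.normal(size=(200_000, 39))  # stand-in for pooled MFCC frames

# Keep roughly 1% of frames, the rate at which the paper reports baseline
# EER being retained (here via naive random selection).
keep = rng.choice(len(frames), size=len(frames) // 100, replace=False)

ubm = GaussianMixture(n_components=64, covariance_type='diag', max_iter=20)
ubm.fit(frames[keep])  # trains in a small fraction of the full-data time
```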


IEEE Transactions on Audio, Speech, and Language Processing | 2013

Acoustic Factor Analysis for Robust Speaker Verification

Taufiq Hasan; John H. L. Hansen

Factor analysis based channel mismatch compensation methods for speaker recognition are based on the assumption that speaker/utterance-dependent Gaussian Mixture Model (GMM) mean super-vectors can be constrained to reside in a lower dimensional subspace. This approach does not consider the fact that conventional acoustic feature vectors also reside in a lower dimensional manifold of the feature space when feature covariance matrices contain close-to-zero eigenvalues. In this study, based on observations of the covariance structure of acoustic features, we propose a factor analysis modeling scheme in the acoustic feature space instead of the super-vector space and derive a mixture-dependent feature transformation. We demonstrate how this single linear transformation performs feature dimensionality reduction, de-correlation, normalization, and enhancement at once. The proposed transformation is shown to be closely related to signal subspace based speech enhancement schemes. In contrast to traditional front-end mixture-dependent feature transformations, where feature alignment is performed using the highest scoring mixture, the proposed transformation is integrated within the speaker recognition system using a probabilistic feature alignment technique, which nullifies the need for regenerating the features or retraining the Universal Background Model (UBM). Incorporating the proposed method with a state-of-the-art i-vector and Gaussian Probabilistic Linear Discriminant Analysis (PLDA) framework, we perform evaluations on the National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) 2010 core telephone and microphone tasks. The experimental results demonstrate the superiority of the proposed scheme compared to both full-covariance and diagonal-covariance UBM based systems. Simple equal-weight fusion of the baseline and proposed systems also yields significant performance gains.
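
To make the mixture-dependent transformation concrete, the sketch below fits one factor analyzer per GMM mixture and projects each frame with its own mixture's model. Hard alignment and synthetic features are simplifications made for the sketch; the paper integrates a probabilistic alignment into the recognizer instead:

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(0)
frames = rng.normal(size=(5000, 39))  # stand-in acoustic feature frames

gmm = GaussianMixture(n_components=8, covariance_type='diag').fit(frames)
assign = gmm.predict(frames)  # hard alignment (simplification; see above)

# One factor analyzer per mixture component, fit on its aligned frames.
fas = [FactorAnalysis(n_components=20).fit(frames[assign == c])
       for c in range(gmm.n_components)]

# Mixture-dependent transform: project each frame with its own analyzer.
reduced = np.empty((len(frames), 20))
for c in range(gmm.n_components):
    reduced[assign == c] = fas[c].transform(frames[assign == c])
```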


IEEE Signal Processing Letters | 2009

Suppression of Residual Noise From Speech Signals Using Empirical Mode Decomposition

Taufiq Hasan; Md. Kamrul Hasan

This letter presents a novel and effective method for suppressing residual noise from enhanced speech signals, applied as a second-stage post-filtering technique using empirical mode decomposition. The method significantly improves speech listening quality while simultaneously improving objective quality indices. Listening test results demonstrate the superiority of the proposed scheme compared to well-known noise suppression and perceptual filtering methods.
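
A minimal sketch of EMD-based post-filtering, using the PyEMD package on a synthetic signal: the enhanced speech is decomposed into intrinsic mode functions (IMFs), and the highest-frequency mode, where residual noise tends to concentrate, is attenuated before reconstruction. The fixed gain below is an illustrative assumption, not the letter's actual suppression rule:

```python
import numpy as np
from PyEMD import EMD  # pip install EMD-signal

fs = 8000
t = np.arange(fs) / fs
rng = np.random.default_rng(0)
# Stand-in for an enhanced speech signal that still carries residual noise.
enhanced = np.sin(2 * np.pi * 200 * t) + 0.05 * rng.normal(size=fs)

imfs = EMD()(enhanced)  # intrinsic mode functions, highest frequency first

# Residual noise tends to concentrate in the first (highest-frequency) IMFs;
# attenuating them is a crude stand-in for the letter's suppression rule.
gains = np.ones(len(imfs))
gains[0] = 0.2          # illustrative attenuation factor, not from the paper
cleaned = (gains[:, None] * imfs).sum(axis=0)
```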


IEEE Transactions on Audio, Speech, and Language Processing | 2014

Maximum Likelihood Acoustic Factor Analysis Models for Robust Speaker Verification in Noise

Taufiq Hasan; John H. L. Hansen

Recent speaker recognition/verification systems generally utilize an utterance-dependent fixed-dimensional vector as the feature input to Bayesian classifiers. These vectors, known as i-Vectors, are lower dimensional representations of Gaussian Mixture Model (GMM) mean super-vectors adapted from a Universal Background Model (UBM) using speech utterance features, and are extracted utilizing a Factor Analysis (FA) framework. This method is based on the assumption that the speaker-dependent information resides in a lower dimensional sub-space. In this study, we utilize a mixture of Acoustic Factor Analyzers (AFA) to model the acoustic features instead of a GMM-UBM. Following our previously proposed AFA technique (“Acoustic factor analysis for robust speaker verification,” by Hasan and Hansen, IEEE Trans. Audio, Speech, Lang. Process., vol. 21, no. 4, April 2013), this model is based on the assumption that the speaker-relevant information lies in a lower dimensional subspace of the multi-dimensional feature space localized by the mixture components. Unlike our previous method, here we train the AFA-UBM model directly from the data using an Expectation-Maximization (EM) algorithm. This method shows improved robustness to noise as the nuisance dimensions are removed in each EM iteration. Two variants of the AFA model are considered, utilizing an isotropic and a diagonal covariance residual term, respectively. The method is integrated within a standard i-Vector system where the hidden variables of the model, termed acoustic factors, are utilized as the input for total variability modeling. Experimental results obtained on the 2012 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) core-extended trials indicate the effectiveness of the proposed strategy in both clean and noisy conditions.
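
The acoustic factors fed to total variability modeling are the latent variables of a factor analysis model. For a single component with loading matrix Λ, mean μ, and residual covariance Ψ, standard FA gives the posterior mean E[z|x] = Λᵀ(ΛΛᵀ + Ψ)⁻¹(x − μ). A numpy sketch with made-up parameters, showing the diagonal-covariance variant:

```python
import numpy as np

rng = np.random.default_rng(0)
d, q = 39, 20                            # feature and factor dims (assumed)
Lam = 0.3 * rng.normal(size=(d, q))      # loading matrix of one AFA component
Psi = np.diag(rng.uniform(0.5, 1.5, d))  # diagonal residual covariance variant
mu = rng.normal(size=d)                  # component mean

x = rng.normal(size=d)                   # one aligned acoustic feature frame
# Posterior mean of the latent acoustic factors:
#   E[z|x] = Lam^T (Lam Lam^T + Psi)^{-1} (x - mu)
z = Lam.T @ np.linalg.solve(Lam @ Lam.T + Psi, x - mu)

# Isotropic variant: replace Psi with sigma**2 * np.eye(d).
```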


IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2010

A novel feature sub-sampling method for efficient universal background model training in speaker verification

Taufiq Hasan; Yun Lei; Aravind Chandrasekaran; John H. L. Hansen

Speaker recognition/verification systems require a universal background model (UBM), whose training typically demands extensive resources, especially if new channel domains are considered. In this study, we propose an effective and computationally efficient algorithm for training the UBM for speaker verification. A novel method based on the Euclidean distance between features is developed for effective sub-sampling of potential training feature vectors. Using only about 1.5 seconds of data from each development utterance, the proposed UBM training method drastically reduces the computation time while improving, or at least retaining, the original speaker verification system performance. While methods such as factor analysis can mitigate some of the issues associated with channel/microphone/environmental mismatch, the proposed rapid UBM training scheme offers a viable alternative for building environment-dependent UBMs quickly.
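
One simple way to realize distance-based sub-sampling is a greedy pass that keeps a frame only if it lies far enough from the previously kept frame. The greedy rule and the threshold below are illustrative assumptions, not necessarily the paper's exact criterion:

```python
import numpy as np

def subsample_frames(frames, threshold):
    """Greedy pass: keep a frame only if it is farther than `threshold`
    (Euclidean distance) from the most recently kept frame."""
    kept = [frames[0]]
    for f in frames[1:]:
        if np.linalg.norm(f - kept[-1]) > threshold:
            kept.append(f)
    return np.array(kept)

rng = np.random.default_rng(0)
utterance = rng.normal(size=(1000, 39))      # stand-in feature frames
selected = subsample_frames(utterance, 8.0)  # threshold is illustrative
print(len(selected), "of", len(utterance), "frames kept")
```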


EURASIP Journal on Advances in Signal Processing | 2013

Multi-modal highlight generation for sports videos using an information-theoretic excitability measure

Taufiq Hasan; Hynek Bořil; Abhijeet Sangwan; John H. L. Hansen

The ability to detect and organize ‘hot spots’ representing areas of excitement within video streams is a challenging research problem when techniques rely exclusively on video content. A generic method for sports video highlight selection is presented in this study which leverages both video/image structure and audio/speech properties. Processing begins by partitioning the video into small segments and extracting several multi-modal features from each segment. Excitability is computed based on the likelihood of the segmental features residing in certain regions of their joint probability density function space which are considered both exciting and rare. The proposed measure is used to rank-order the partitioned segments to compress the overall video sequence and produce a contiguous set of highlights. Experiments are performed on baseball videos, drawing on signal processing advancements for excitement assessment in the commentators’ speech and using audio energy, slow-motion replay, scene-cut density, and motion activity as features. A detailed analysis of the correlation between user excitability and various speech production parameters is conducted, and an effective scheme is designed to estimate the excitement level of the commentator’s speech from the sports videos. Subjective evaluation of excitability and ranking of video segments demonstrate a higher correlation with the proposed measure compared to well-established techniques, indicating the effectiveness of the overall approach.
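
The core idea, ranking segments by how rare their joint feature configuration is, can be sketched by fitting a density model over per-segment features and scoring negative log-likelihood. The feature set and the way rarity and excitement are combined below are stand-ins, not the paper's information-theoretic measure:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Hypothetical per-segment features: [audio energy, scene-cut density,
# motion activity, commentator excitement score].
segments = rng.normal(size=(500, 4))

density = GaussianMixture(n_components=4).fit(segments)
rarity = -density.score_samples(segments)  # low joint likelihood = rare
excitement = segments[:, 0]                # audio energy as a crude proxy

# Crude combination for the sketch; the paper derives a principled
# information-theoretic excitability measure instead.
excitability = rarity + excitement
highlight_order = np.argsort(-excitability)  # best candidates first
```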


IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2015

Automatic broadcast news summarization via rank classifiers and crowdsourced annotation

Srinivas Parthasarathy; Taufiq Hasan

Extractive speech summarization methods generally operate as a binary classifier deciding whether a sentence belongs to the summary or not. However, it is well known that even human annotators often do not agree on which sentences to select for a summary. In this paper, we take a probabilistic view of the summarization ground truth and assume that sentences selected more frequently by annotators are of higher importance. Using a large summary dataset obtained through crowdsourcing, we empirically show that a sentence's selection frequency is inversely related to its summarization rank. Consequently, we model the relative importance between sentences using a rank-based classifier. Additionally, we utilize an extended paralinguistic feature set that has not previously been used for speech summarization. Lexical and structural features are also included. Support Vector Machines (SVMs) are used as both the baseline binary classifier and the rank classifier. Experimental evaluations show that the proposed approach outperforms traditional binary classifiers with respect to various ROUGE summarization metrics at different summarization compression ratios (CR).
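
A rank classifier of this kind is commonly trained via the pairwise transform: differences between feature vectors of sentences with different annotation frequencies become binary training examples. A sketch with synthetic features and selection counts (the paper's actual feature set is far richer):

```python
import numpy as np
from sklearn.svm import LinearSVC

rng = np.random.default_rng(0)
feats = rng.normal(size=(200, 10))   # hypothetical per-sentence features
freq = rng.integers(0, 6, size=200)  # how many annotators picked each sentence

# Pairwise transform: a linear rank classifier can be trained as a binary
# SVM on feature differences between sentences of unequal importance.
X_pairs, y_pairs = [], []
for i in range(len(feats)):
    for j in range(i + 1, len(feats)):
        if freq[i] != freq[j]:
            X_pairs.append(feats[i] - feats[j])
            y_pairs.append(1 if freq[i] > freq[j] else -1)

ranker = LinearSVC(max_iter=5000).fit(np.array(X_pairs), np.array(y_pairs))

# Rank sentences by the learned linear scoring function.
importance = feats @ ranker.coef_.ravel()
summary_order = np.argsort(-importance)
```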


IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2016

Automatic composition of broadcast news summaries using rank classifiers trained with acoustic and lexical features

Taufiq Hasan; Mohammed Abdelwahab; Srinivas Parthasarathy; Carlos Busso; Yang Liu

Research on automatic speech summarization typically focuses on optimizing objective evaluation criteria, such as the ROUGE metric, which depend on word and phrase overlaps between automatically and manually generated summary documents. However, the actual quality of a speech summarizer largely depends on how end-users perceive the audio output. This work focuses on the task of composing summarized audio streams with the aim of improving the quality and interest perceived by the end-user. First, using crowd-sourced summary annotations on a broadcast news corpus, we train a rank-SVM classifier to learn the relative importance of each sentence in a news story. Acoustic, lexical, and structural features are used for training. In addition, we investigate the perceived emotion level in each sentence to aid the summarizer in selecting interesting sentences, yielding an emotion-aware summarizer. Next, we propose several methods to combine these sentences to generate a compressed audio stream. Subjective evaluations are performed to assess the quality of the generated summaries on the following criteria: interest, abruptness, informativeness, attractiveness, and overall quality. The results indicate that users are most sensitive to the linguistic coherence and continuity of the audio stream.
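
One simple way to reduce the abruptness listeners are sensitive to is to crossfade adjacent selected sentences when composing the audio stream. The sketch below is a generic crossfade on synthetic audio, offered as an assumption about one plausible composition method rather than any of the paper's actual methods:

```python
import numpy as np

def crossfade_concat(clips, fs=16000, fade_ms=50):
    """Concatenate audio clips with a short linear crossfade at each join."""
    n = int(fs * fade_ms / 1000)
    ramp = np.linspace(0.0, 1.0, n)
    out = clips[0].astype(float)
    for clip in clips[1:]:
        clip = clip.astype(float)
        out[-n:] = out[-n:] * (1.0 - ramp) + clip[:n] * ramp
        out = np.concatenate([out, clip[n:]])
    return out

rng = np.random.default_rng(0)
sentences = [rng.normal(size=16000 * 2) for _ in range(3)]  # stand-in audio
summary_audio = crossfade_concat(sentences)
```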

Collaboration


Dive into Taufiq Hasan's collaborations.

Top Co-Authors

John H. L. Hansen
University of Texas at Dallas

Hynek Bořil
University of Texas at Dallas

Abhijeet Sangwan
University of Texas at Dallas

Gang Liu
University of Texas at Dallas

Seyed Omid Sadjadi
University of Texas at Dallas

Md. Kamrul Hasan
Bangladesh University of Engineering and Technology

Keith W. Godin
University of Texas at Dallas