Afsaneh Asaei
Idiap Research Institute
Publications
Featured research published by Afsaneh Asaei.
IEEE Signal Processing Magazine | 2014
Michael B. McCoy; Volkan Cevher; Quoc Tran Dinh; Afsaneh Asaei; Luca Baldassarre
Source separation, or demixing, is the process of extracting multiple components entangled within a signal. Contemporary signal processing presents a host of difficult source separation problems, from interference cancellation to background subtraction, blind deconvolution, and even dictionary learning. Despite the recent progress in each of these applications, advances in high-throughput sensor technology place demixing algorithms under pressure to accommodate extremely high-dimensional signals, separate an ever larger number of sources, and cope with more sophisticated signal and mixing models. These difficulties are exacerbated by the need for real-time action in automated decision-making systems.
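As a toy illustration of the demixing problem surveyed above, the following is a minimal sketch, not the article's method: two components, one sparse in time and one sparse in the DCT domain, are untangled from their sum by l1 minimization over a concatenated dictionary, solved with plain ISTA. Signal sizes and the regularization weight are illustrative assumptions.

```python
import numpy as np
from scipy.fft import idct

rng = np.random.default_rng(0)
n = 256

# Two structured components: x is sparse in time, y is sparse in the DCT domain.
x = np.zeros(n); x[rng.choice(n, 5, replace=False)] = 3 * rng.standard_normal(5)
c_true = np.zeros(n); c_true[rng.choice(n, 5, replace=False)] = 3 * rng.standard_normal(5)
D = idct(np.eye(n), norm="ortho", axis=0)   # orthonormal inverse-DCT basis
y = D @ c_true
z = x + y                                   # observed mixture

# Demix by solving min 0.5*||z - A w||^2 + lam*||w||_1 with A = [I, D] via ISTA.
A = np.hstack([np.eye(n), D])
L = np.linalg.norm(A, 2) ** 2               # Lipschitz constant of the gradient
lam, w = 0.1, np.zeros(2 * n)
for _ in range(500):
    w = w + A.T @ (z - A @ w) / L
    w = np.sign(w) * np.maximum(np.abs(w) - lam / L, 0.0)  # soft threshold

x_hat, y_hat = w[:n], D @ w[n:]
print("mixture residual:", np.linalg.norm(z - x_hat - y_hat))
print("relative error on x:", np.linalg.norm(x - x_hat) / np.linalg.norm(x))
```

Recovery succeeds here because the spike and DCT dictionaries are mutually incoherent, which is exactly the geometric condition the article discusses.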
IEEE Transactions on Audio, Speech, and Language Processing | 2014
Afsaneh Asaei; Mohammad Golbabaee; Volkan Cevher
We tackle the speech separation problem by modeling the acoustics of reverberant chambers. Our approach exploits structured sparsity models to perform speech recovery and room acoustic modeling from recordings of concurrent unknown sources. The speakers are assumed to lie on a two-dimensional plane, and the multipath channel is characterized using the image model. We propose an algorithm for room geometry estimation that relies on localizing the early images of the speakers by sparse approximation of the spatial spectrum of the virtual sources in a free-space model. The images are then clustered by exploiting the low-rank structure of the spectro-temporal components belonging to each source. This enables us to identify the early support of the room impulse response function and its unique map to the room geometry. To further resolve the ambiguity of the reflection ratios, we propose a novel formulation of the reverberation model and estimate the absorption coefficients through convex optimization, exploiting a joint sparsity model formulated upon the spatio-spectral sparsity of concurrent speech representation. The acoustic parameters are then incorporated to separate the individual speech signals through either structured sparse recovery or inverse filtering of the acoustic channels. Experiments conducted on real data recordings of spatially stationary sources demonstrate the effectiveness of the proposed approach for speech separation and recognition.
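A minimal sketch of the first step described above, localizing sources (and, by extension, their early images) by sparse approximation of the spatial spectrum over a grid of candidate positions under a free-space model. The geometry, frequency, and grid are assumptions, and a plain orthogonal matching pursuit stands in for the paper's structured solver.

```python
import numpy as np

rng = np.random.default_rng(1)
c, f = 343.0, 1000.0                     # speed of sound (m/s), frequency (Hz)
k = 2 * np.pi * f / c                    # wavenumber

mics = rng.uniform(0.0, 3.0, size=(12, 2))        # 12 mics in a 3 m x 3 m plane
gx, gy = np.meshgrid(np.linspace(0, 3, 12), np.linspace(0, 3, 12))
grid = np.column_stack([gx.ravel(), gy.ravel()])  # coarse grid of candidate cells

def steering(src):
    """Free-space Green's function from a candidate position to every mic."""
    d = np.linalg.norm(mics - src, axis=1)
    return np.exp(-1j * k * d) / d

A = np.column_stack([steering(g) for g in grid])
A /= np.linalg.norm(A, axis=0)           # unit-norm columns

true_idx = rng.choice(grid.shape[0], 3, replace=False)  # 3 sources/early images
z = A[:, true_idx] @ (2.0 + rng.random(3))              # narrowband snapshot

# Orthogonal matching pursuit over the spatial grid.
support, residual = [], z.copy()
for _ in range(3):
    support.append(int(np.argmax(np.abs(A.conj().T @ residual))))
    sub = A[:, support]
    coef, *_ = np.linalg.lstsq(sub, z, rcond=None)
    residual = z - sub @ coef

print("true cells:     ", sorted(true_idx.tolist()))
print("recovered cells:", sorted(support))
```

With this coarse grid and well-separated sources the support is typically recovered; a finer grid raises the coherence between neighboring columns, which is where the structured models of the paper become important.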
2011 Joint Workshop on Hands-free Speech Communication and Microphone Arrays | 2011
Mohammad Javad Taghizadeh; Philip N. Garner; Hamid Reza Abutalebi; Afsaneh Asaei
Two of the major challenges in microphone-array-based adaptive beamforming, speech enhancement, and distant speech recognition are robust and accurate source localization and voice activity detection. This paper introduces a spatial-gradient steered response power with phase transform (SRP-PHAT) method capable of localizing competing speakers in overlapping conditions. We further investigate the behavior of the SRP function and theoretically characterize a fixed point in its search space for the diffuse noise field. We call this fixed point the null position in the SRP search space. Building on this evidence, we propose a technique for multichannel voice activity detection (MVAD) based on detecting a maximum power corresponding to the null position. The gradient SRP-PHAT in tandem with the MVAD forms an integrated framework for multi-source localization and voice activity detection. Experiments carried out on real data recordings show that this framework is very effective in practical applications of hands-free communication.
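A sketch of plain SRP-PHAT (without the spatial-gradient extension of the paper): GCC-PHAT is computed per microphone pair and sampled at the candidate TDOAs of each grid point. The room size, sampling rate, grid, and mic layout are all illustrative assumptions.

```python
import numpy as np
from itertools import combinations

def srp_phat(x, mics, grid, fs, c=343.0):
    """SRP-PHAT map: x is (n_mics, n_samples), mics (n_mics, 2), grid (n_pts, 2)."""
    n = x.shape[1]
    X = np.fft.rfft(x, axis=1)
    power = np.zeros(len(grid))
    for i, j in combinations(range(len(mics)), 2):
        cross = X[i] * np.conj(X[j])
        gcc = np.fft.irfft(cross / (np.abs(cross) + 1e-12), n=n)  # phase transform
        tdoa = (np.linalg.norm(grid - mics[i], axis=1)
                - np.linalg.norm(grid - mics[j], axis=1)) / c
        power += gcc[np.round(tdoa * fs).astype(int) % n]  # sample GCC at TDOAs
    return power

# Toy usage: one noise source in a 2 m x 2 m plane, mics at the corners (assumed).
fs, n = 16000, 2048
rng = np.random.default_rng(2)
s = rng.standard_normal(n)
mics = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
src = np.array([1.2, 0.7])
S, w = np.fft.rfft(s), 2 * np.pi * np.fft.rfftfreq(n, d=1 / fs)
x = np.stack([np.fft.irfft(S * np.exp(-1j * w * t), n=n)      # fractional delays
              for t in np.linalg.norm(mics - src, axis=1) / 343.0])
gx, gy = np.meshgrid(np.linspace(0, 2, 21), np.linspace(0, 2, 21))
grid = np.column_stack([gx.ravel(), gy.ravel()])
print("estimated position:", grid[np.argmax(srp_phat(x, mics, grid, fs))])
```

The null position of the paper is the grid point this map settles on under diffuse noise alone, which is what makes a peak there usable as a speech-absence indicator.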
International Conference on Acoustics, Speech, and Signal Processing | 2011
Afsaneh Asaei; Volkan Cevher
We leverage recent algorithmic advances in compressive sensing and propose a novel source separation algorithm for efficient recovery of convolutive speech mixtures in the spectro-temporal domain. Compared to common sparse component analysis techniques, our approach fully exploits structured sparsity models to obtain substantial improvement over the existing state of the art. We evaluate our method for separation and recognition of a target speaker in a multi-party scenario. Our results provide compelling evidence of the effectiveness of sparse recovery formulations in speech recognition.
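The gain from structured sparsity can be seen in a toy compressive recovery: coefficients that are active in blocks are recovered better by a group-l1 penalty than by plain l1. The sizes, block layout, and penalty weight are assumptions; both penalties are handled by the same proximal-gradient loop.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, bsize = 128, 48, 8                     # ambient dim, measurements, block size
blocks = np.arange(n).reshape(-1, bsize)     # non-overlapping blocks

w = np.zeros(n)
for b in rng.choice(len(blocks), 2, replace=False):   # 2 active blocks
    w[blocks[b]] = rng.standard_normal(bsize)

A = rng.standard_normal((m, n)) / np.sqrt(m)
z = A @ w

def prox_grad(prox, lam, iters=2000):
    """Proximal gradient for 0.5*||z - A u||^2 + lam * penalty(u)."""
    L = np.linalg.norm(A, 2) ** 2
    u = np.zeros(n)
    for _ in range(iters):
        u = prox(u + A.T @ (z - A @ u) / L, lam / L)
    return u

def soft(v, t):                              # prox of plain l1
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def group_soft(v, t):                        # prox of block (group-l1) penalty
    out = v.copy()
    for idx in blocks:
        nrm = np.linalg.norm(v[idx])
        out[idx] = 0.0 if nrm <= t else v[idx] * (1 - t / nrm)
    return out

for name, prox in [("plain l1    ", soft), ("block-sparse", group_soft)]:
    w_hat = prox_grad(prox, lam=0.01)
    print(name, "relative error:", np.linalg.norm(w - w_hat) / np.linalg.norm(w))
```

With only 48 measurements for 16 nonzeros, the block penalty typically wins because it spends its "budget" on 16 blocks rather than 128 individual coordinates.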
International Conference on Acoustics, Speech, and Signal Processing | 2014
Afsaneh Asaei; Mohammad Javad Taghizadeh; Volkan Cevher
In this paper, the problem of multiple speaker localization via speech separation based on model-based sparse recovery is studied. We compare and contrast computational sparse optimization methods incorporating harmonicity and block structures as well as autoregressive dependencies underlying the spectrographic representation of speech signals. The results demonstrate the effectiveness of a block sparse Bayesian learning framework incorporating autoregressive correlations in achieving highly accurate localization performance. Furthermore, significant improvement is obtained using an ad hoc microphone set-up for data acquisition compared to a compact microphone array.
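In block sparse Bayesian learning, autoregressive dependencies within a block are commonly encoded as an AR(1) intra-block correlation matrix. A minimal sketch of that prior follows; the AR coefficient r and block length d are illustrative assumptions (in practice they are estimated from the data).

```python
import numpy as np
from scipy.linalg import toeplitz

r, d = 0.9, 8                         # assumed AR(1) coefficient and block length
B = toeplitz(r ** np.arange(d))       # B[i, j] = r**|i-j|, intra-block correlation
# Each active block is modeled as zero-mean Gaussian with covariance gamma_b * B,
# where gamma_b controls block activity and B captures the AR correlation.
print(np.round(B, 3))
```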
Signal Processing | 2009
Afsaneh Asaei; Mohammad Javad Taghizadeh; Marjan Bahrololum; Mohammed Ghanbari
This paper proposes a joint verification-localization structure based on split-band analysis of the speech signal and the mixed voicing level. To address the problems arising in reverberant acoustic environments, a new fundamental frequency estimation algorithm is proposed based on high-resolution spectral estimation. In the reconstruction of the distorted speech, this information is utilized to reduce the side effects of acoustic noise on the voiced parts. A speaker verification system examines the features of the reconstructed speech in order to authorize the speaker before localization. This procedure prevents localization and beamforming for non-speech and, especially, unwanted speakers in multi-speaker scenarios. The verification is implemented with a Gaussian mixture model, and a new filtering scheme is proposed, based on the voicing likelihood of each frequency band measured in the previous steps, for efficient localization of the authorized speaker. The performance of the proposed VSL (verified speaker localization) front-end is evaluated in various reverberant and noisy environments. The VSL is utilized in the development of distant-talking automatic speech recognition with a microphone array, where the system can lock onto a specific source and hence the recognition quality improves noticeably.
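The verification stage can be sketched as a GMM likelihood-ratio test: a speaker model is scored against a universal background model, and the utterance is accepted if the ratio exceeds a threshold. Here sklearn's GaussianMixture stands in for the paper's GMM, the features are synthetic, and the threshold is an assumption.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(4)
dim = 12                                           # MFCC-like feature dim (assumed)

# Synthetic stand-ins for background and target-speaker training features.
bg_train = rng.standard_normal((2000, dim))
spk_train = rng.standard_normal((400, dim)) + 0.8  # shifted: speaker-specific

ubm = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(bg_train)
spk = GaussianMixture(n_components=8, covariance_type="diag",
                      random_state=0).fit(spk_train)

def verify(features, threshold=0.0):
    # Average log-likelihood ratio: speaker model vs. background model.
    llr = spk.score(features) - ubm.score(features)
    return llr, llr > threshold

genuine = rng.standard_normal((200, dim)) + 0.8    # utterance from the speaker
impostor = rng.standard_normal((200, dim))
print("genuine: ", verify(genuine))                # positive LLR -> accept
print("impostor:", verify(impostor))               # negative LLR -> reject
```

Only accepted utterances would then be passed to the localization and beamforming stages, which is the gating behavior the paper describes.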
Speech Communication | 2016
Pranay Dighe; Afsaneh Asaei; Hervé Bourlard
Highlights:
- Automatic speech recognition can be cast as a realization of compressive sensing.
- Posterior probabilities are suitable features for exemplar-based sparse modeling.
- Posterior-based sparse representation meets the statistical speech recognition formalism.
- Dictionary learning reduces the collection size of exemplars and improves performance.
- Collaborative hierarchical sparsity exploits temporal information in continuous speech.

In this paper, a compressive sensing (CS) perspective on exemplar-based speech processing is proposed. Relying on an analytical relationship between the CS formulation and statistical speech recognition (hidden Markov models, HMMs), the automatic speech recognition (ASR) problem is cast as recovery of a high-dimensional sparse word representation from the observed low-dimensional acoustic features. The acoustic features are exemplars obtained from (deep) neural network sub-word conditional posterior probabilities. Low-dimensional word manifolds are learned using these sub-word posterior exemplars and exploited to construct a linguistic dictionary for sparse representation of word posteriors. Dictionary learning has been found to be a principled way to alleviate the need for a huge collection of exemplars, as required in conventional exemplar-based approaches, while still improving the performance. Context appending and collaborative hierarchical sparsity are used to exploit the sequential and group structure underlying the word sparse representation. This formulation leads to a posterior-based sparse modeling approach to speech recognition. The potential of the proposed approach is demonstrated on isolated word (Phonebook corpus) and continuous speech (Numbers corpus) recognition tasks.
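A sketch of the dictionary-learning step: sub-word posterior exemplars are compressed into a small linguistic dictionary, and test posteriors are sparsely coded against it. The sklearn routines and all dimensions are stand-ins for the paper's setup, and the exemplars here are synthetic.

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning

rng = np.random.default_rng(5)
n_subwords, n_words, n_exemplars = 40, 10, 500     # assumed sizes

# Synthetic sub-word posterior exemplars: rows sum to 1, grouped by word.
word_templates = rng.dirichlet(np.ones(n_subwords) * 0.1, size=n_words)
labels = rng.integers(0, n_words, n_exemplars)
exemplars = np.clip(word_templates[labels]
                    + 0.02 * rng.standard_normal((n_exemplars, n_subwords)),
                    1e-6, None)
exemplars /= exemplars.sum(axis=1, keepdims=True)

# Learn a compact dictionary instead of storing every exemplar.
dl = DictionaryLearning(n_components=n_words, transform_algorithm="lasso_lars",
                        transform_alpha=0.05, random_state=0).fit(exemplars)

# Sparse code of a test posterior: ideally one dominant atom per word.
code = dl.transform(exemplars[0:1])
print("active atoms:", np.nonzero(np.abs(code[0]) > 1e-6)[0])
```

The point mirrored from the paper is that 10 learned atoms replace 500 stored exemplars while the sparse code still identifies the word.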
International Conference on Acoustics, Speech, and Signal Processing | 2010
Afsaneh Asaei; Benjamin Picart
Class posterior distributions have recently been used quite successfully in automatic speech recognition (ASR), either for frame- or phone-level classification or as acoustic features, which can be further exploited (usually after some "ad hoc" transformations) in different classifiers (e.g., in Gaussian-mixture-based HMMs). In the present paper, we show preliminary results suggesting that it may be possible to perform speech recognition without explicit subword unit (phone) classification or likelihood estimation, simply by answering the question of whether two acoustic (posterior) vectors belong to the same subword unit class or not. We first exhibit specific properties of the posterior acoustic space before showing how those properties can be exploited to reach very high performance in deciding (based on an appropriate trained distance metric and hypothesis testing approaches) whether two posterior vectors belong to the same class or not. Correct decision rates as high as 90% are reported on the TIMIT database, before reporting kNN phone classification rates.
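The pairwise decision can be sketched with a divergence between posterior vectors and a threshold calibrated by hypothesis testing. The symmetric KL divergence and the threshold below are assumptions, not the trained distance metric of the paper.

```python
import numpy as np

def sym_kl(p, q, eps=1e-10):
    """Symmetric Kullback-Leibler divergence between two posterior vectors."""
    p, q = p + eps, q + eps
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def same_class(p, q, threshold=0.5):
    # Small divergence -> the vectors likely come from the same subword class.
    return sym_kl(p, q) < threshold

rng = np.random.default_rng(6)
# Two sharply peaked posteriors from the same class, one from another class.
a = rng.dirichlet([10, 1, 1, 1])
b = rng.dirichlet([10, 1, 1, 1])
c = rng.dirichlet([1, 1, 10, 1])
print(same_class(a, b), same_class(a, c))   # expect True, False
```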
International Conference on Acoustics, Speech, and Signal Processing | 2016
Pranay Dighe; Gil Luyet; Afsaneh Asaei
We propose to model the acoustic space of deep neural network (DNN) class-conditional posterior probabilities as a union of low-dimensional subspaces. To that end, the training posteriors are used for dictionary learning and sparse coding. Sparse representation of the test posteriors using this dictionary enables projection to the space of the training data. Relying on the fact that the intrinsic dimensions of the posterior subspaces are indeed very small and the matrix of all posteriors belonging to a class has very low rank, we demonstrate how low-dimensional structures enable further enhancement of the posteriors and rectify the spurious errors due to mismatch conditions. The enhanced acoustic modeling method leads to improvements in a continuous speech recognition task using the hybrid DNN-HMM (hidden Markov model) framework in both clean and noisy conditions, where up to 15.4% relative reduction in word error rate (WER) is achieved.
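The low-rank side of this idea can be sketched by projecting a noisy test posterior onto the span of the top singular vectors of a class's training posteriors. The rank, dimensions, and synthetic data are assumptions; the paper additionally uses sparse coding over a learned dictionary.

```python
import numpy as np

rng = np.random.default_rng(7)
n_classes, n_frames, rank = 30, 400, 3       # assumed sizes and subspace rank

# Training posteriors of one class occupy a low-dimensional subspace.
basis_true = rng.dirichlet(np.ones(n_classes), size=rank)   # (rank, n_classes)
weights = rng.dirichlet(np.ones(rank), size=n_frames)       # (n_frames, rank)
train = weights @ basis_true                 # low-rank matrix of posteriors

U, s, Vt = np.linalg.svd(train, full_matrices=False)
P = Vt[:rank].T @ Vt[:rank]                  # projector onto top-`rank` row space

# A mismatched test posterior with spurious mass gets rectified by projection.
clean = weights[0] @ basis_true
test = clean + 0.05 * rng.standard_normal(n_classes)
enhanced = np.clip(test @ P, 0, None)
enhanced /= enhanced.sum()                   # renormalize to a valid posterior
print("error before:", np.linalg.norm(test - clean))
print("error after: ", np.linalg.norm(enhanced - clean))
```

Projection removes the component of the perturbation orthogonal to the class subspace, which is the "rectifying spurious errors" effect the abstract refers to.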
International Conference on Acoustics, Speech, and Signal Processing | 2012
Afsaneh Asaei; Mike E. Davies; Volkan Cevher
We cast under-determined convolutive speech separation as sparse approximation of the spatial spectra of the mixing sources. In this framework, we compare and contrast the major practical algorithms for structured sparse recovery of speech signals. Specific attention is paid to characterization of the measurement matrix. We first propose how it can be identified using the image model of the multipath effect, where the acoustic parameters are estimated by localizing a speaker and its images in a free-space model. We further study the circumstances in which the coherence of the projections induced by the microphone array design tends to affect the recovery performance.
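Coherence of the projections can be checked directly: below is a sketch computing the mutual coherence of a column-normalized free-space measurement matrix for two array layouts. The geometry, frequency, and grid are assumptions; the comparison of a compact versus a spread (ad hoc) layout illustrates the design effect studied in the paper.

```python
import numpy as np

def mutual_coherence(A):
    """Largest absolute inner product between distinct unit-norm columns."""
    A = A / np.linalg.norm(A, axis=0)
    G = np.abs(A.conj().T @ A)
    np.fill_diagonal(G, 0.0)
    return G.max()

rng = np.random.default_rng(8)
c, f = 343.0, 1000.0
k = 2 * np.pi * f / c
gx, gy = np.meshgrid(np.linspace(0, 3, 12), np.linspace(0, 3, 12))
grid = np.column_stack([gx.ravel(), gy.ravel()])   # candidate source positions

def measurement_matrix(mics):
    # Free-space Green's function from every grid cell to every microphone.
    d = np.linalg.norm(grid[None, :, :] - mics[:, None, :], axis=2)
    return np.exp(-1j * k * d) / d

compact = np.array([[1.5 + 0.05 * i, 1.5] for i in range(8)])  # 5 cm spacing
spread = rng.uniform(0, 3, size=(8, 2))                        # ad hoc layout
for name, m in [("compact array", compact), ("ad hoc array ", spread)]:
    print(name, "coherence:", round(mutual_coherence(measurement_matrix(m)), 3))
```

Lower coherence for the spread layout means the columns of the measurement matrix are easier to tell apart, which is the mechanism behind the recovery gains reported for ad hoc arrays elsewhere in this list.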