Network


Latest external collaboration at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Pegah Ghahremani is active.

Publication


Featured research published by Pegah Ghahremani.


international conference on acoustics, speech, and signal processing | 2014

A pitch extraction algorithm tuned for automatic speech recognition

Pegah Ghahremani; Bagher BabaAli; Daniel Povey; Korbinian Riedhammer; Jan Trmal; Sanjeev Khudanpur

In this paper we present an algorithm that produces pitch and probability-of-voicing estimates for use as features in automatic speech recognition systems. These features give large performance improvements on tonal languages for ASR systems, and even substantial improvements for non-tonal languages. Our method, which we are calling the Kaldi pitch tracker (because we are adding it to the Kaldi ASR toolkit), is a highly modified version of the getf0 (RAPT) algorithm. Unlike the original getf0, we do not make a hard decision as to whether any given frame is voiced or unvoiced; instead, we assign a pitch even to unvoiced frames while constraining the pitch trajectory to be continuous. Our algorithm also produces a quantity that can be used as a probability-of-voicing measure; it is based on the normalized autocorrelation measure that our pitch extractor uses. We present results on data from various languages in the BABEL project, and show a large improvement over systems without tonal features and systems where pitch and POV information was obtained from SAcC or getf0.
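
As a rough illustration of the core idea, the sketch below computes a normalized autocorrelation (NCCF) per frame over candidate lags and then picks a continuous lag trajectory with a dynamic-programming search, assigning a pitch even to unvoiced frames and using the best NCCF value as a probability-of-voicing proxy. The frame sizes, lag range, and transition cost are illustrative assumptions, not the Kaldi implementation.

```python
import numpy as np

def nccf(frame, lag):
    """Normalized cross-correlation of a frame with itself shifted by `lag`."""
    n = len(frame) - lag
    a, b = frame[:n], frame[lag:lag + n]
    denom = np.sqrt(np.dot(a, a) * np.dot(b, b)) + 1e-10
    return np.dot(a, b) / denom

def track_pitch(signal, sr=16000, frame_ms=40, hop_ms=10,
                min_f0=50, max_f0=400, transition_cost=0.1):
    """Toy continuous pitch tracker: per-frame NCCF over candidate lags,
    then a Viterbi-style search that penalizes large lag jumps.
    Returns (f0, pov) per frame; pov is the best NCCF value, a rough
    probability-of-voicing proxy."""
    frame_len = int(sr * frame_ms / 1000)
    hop = int(sr * hop_ms / 1000)
    lags = np.arange(int(sr / max_f0), int(sr / min_f0) + 1)

    # Per-frame NCCF for every candidate lag.
    frames = [signal[i:i + frame_len]
              for i in range(0, len(signal) - frame_len, hop)]
    scores = np.array([[nccf(f, l) for l in lags] for f in frames])

    # Dynamic programming: maximize NCCF minus a continuity penalty.
    T, L = scores.shape
    best = scores[0].copy()
    back = np.zeros((T, L), dtype=int)
    # jump[j, i] is the cost of moving from lag j to lag i between frames.
    jump = transition_cost * np.abs(lags[None, :] - lags[:, None]) / lags.mean()
    for t in range(1, T):
        cand = best[:, None] - jump          # shape (L_prev, L_cur)
        back[t] = np.argmax(cand, axis=0)
        best = cand[back[t], np.arange(L)] + scores[t]

    # Backtrace the continuous lag path, even through unvoiced frames.
    path = np.zeros(T, dtype=int)
    path[-1] = int(np.argmax(best))
    for t in range(T - 1, 0, -1):
        path[t - 1] = back[t, path[t]]
    f0 = sr / lags[path]
    pov = scores[np.arange(T), path]
    return f0, pov
```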


conference of the international speech communication association | 2016

Purely Sequence-Trained Neural Networks for ASR Based on Lattice-Free MMI.

Daniel Povey; Vijayaditya Peddinti; Daniel Galvez; Pegah Ghahremani; Vimal Manohar; Xingyu Na; Yiming Wang; Sanjeev Khudanpur

In this paper we describe a method to perform sequence-discriminative training of neural network acoustic models without the need for frame-level cross-entropy pre-training. We use the lattice-free version of the maximum mutual information (MMI) criterion: LF-MMI. To make its computation feasible we use a phone n-gram language model in place of the word language model. To further reduce its space and time complexity we compute the objective function using neural network outputs at one third the standard frame rate. These changes enable us to perform the computation for the forward-backward algorithm on GPUs. Further, the reduced output frame rate also provides a significant speed-up during decoding. We present results on 5 different LVCSR tasks with training data ranging from 100 to 2100 hours. Models trained with LF-MMI provide a relative word error rate reduction of ∼11.5% over those trained with the cross-entropy objective function, and ∼8% over those trained with the cross-entropy and sMBR objective functions. A further reduction of ∼2.5% relative can be obtained by fine-tuning these models with the word-lattice-based sMBR objective function.
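
To make the MMI criterion concrete, the toy sketch below computes log p(reference) - log p(all sequences) with the forward algorithm over a small numerator graph (the reference phone sequence with self-loops) and a denominator graph weighted by a phone bigram. The graphs, the bigram, and the frame counts are stand-ins for illustration; Kaldi's FST construction, frame subsampling, and GPU computation are omitted.

```python
import numpy as np
from scipy.special import logsumexp

def forward_logprob(log_obs, log_trans, log_init, log_final):
    """Forward algorithm in the log domain.
    log_obs: (T, S) per-frame state log-likelihoods (e.g. NN outputs),
    log_trans: (S, S) log transition weights,
    log_init / log_final: (S,) start and end weights."""
    alpha = log_init + log_obs[0]
    for t in range(1, len(log_obs)):
        alpha = logsumexp(alpha[:, None] + log_trans, axis=0) + log_obs[t]
    return logsumexp(alpha + log_final)

def mmi_objective(log_obs, ref_states, phone_bigram_logprobs):
    """Toy MMI objective: log p(reference paths) - log p(all paths)."""
    T, S = log_obs.shape
    N = len(ref_states)

    # Numerator: left-to-right graph over the reference states with self-loops.
    num_obs = log_obs[:, ref_states]                  # (T, N)
    num_trans = np.full((N, N), -np.inf)
    idx = np.arange(N)
    num_trans[idx, idx] = np.log(0.5)                 # self-loop
    num_trans[idx[:-1], idx[:-1] + 1] = np.log(0.5)   # advance
    num_init = np.full(N, -np.inf); num_init[0] = 0.0
    num_final = np.full(N, -np.inf); num_final[-1] = 0.0
    log_num = forward_logprob(num_obs, num_trans, num_init, num_final)

    # Denominator: any state sequence, weighted by the phone bigram.
    den_init = np.full(S, -np.log(S))
    den_final = np.zeros(S)
    log_den = forward_logprob(log_obs, phone_bigram_logprobs, den_init, den_final)

    return log_num - log_den

# Usage with a toy 3-phone inventory and random "network outputs":
log_obs = np.log(np.random.dirichlet(np.ones(3), size=50))   # (T=50, S=3)
bigram = np.log(np.full((3, 3), 1.0 / 3))
print(mmi_objective(log_obs, ref_states=[0, 1, 2, 1], phone_bigram_logprobs=bigram))
```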


spoken language technology workshop | 2016

Deep neural network-based speaker embeddings for end-to-end speaker verification

David Snyder; Pegah Ghahremani; Daniel Povey; Daniel Garcia-Romero; Yishay Carmiel; Sanjeev Khudanpur

In this study, we investigate an end-to-end text-independent speaker verification system. The architecture consists of a deep neural network that takes a variable length speech segment and maps it to a speaker embedding. The objective function separates same-speaker and different-speaker pairs, and is reused during verification. Similar systems have recently shown promise for text-dependent verification, but we believe that this is unexplored for the text-independent task. We show that given a large number of training speakers, the proposed system outperforms an i-vector baseline in equal error-rate (EER) and at low miss rates. Relative to the baseline, the end-to-end system reduces EER by 13% average and 29% pooled across test conditions. The fused system achieves a reduction of 32% average and 38% pooled.
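
The sketch below illustrates the general shape of such a system: a network maps a variable-length feature sequence to a fixed embedding via temporal pooling, and a pairwise objective separates same-speaker from different-speaker pairs using a similarity score that is reused at verification time. The layer sizes, pooling, and logistic pairwise loss are assumptions for illustration, not the architecture in the paper.

```python
import torch
import torch.nn as nn

class SpeakerEmbedder(nn.Module):
    """Maps a variable-length feature sequence (T, feat_dim) to a fixed
    embedding via frame-level layers followed by mean pooling over time."""
    def __init__(self, feat_dim=40, hidden=256, embed_dim=128):
        super().__init__()
        self.frame_net = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.proj = nn.Linear(hidden, embed_dim)

    def forward(self, feats):                 # feats: (T, feat_dim)
        h = self.frame_net(feats)             # (T, hidden)
        return self.proj(h.mean(dim=0))       # (embed_dim,)

def pair_loss(emb_a, emb_b, same_speaker, scale, bias):
    """Pairwise logistic loss on a similarity score; the same score can be
    reused for verification (accept if score exceeds a threshold)."""
    score = scale * torch.dot(emb_a, emb_b) + bias
    target = torch.tensor(1.0 if same_speaker else 0.0)
    return nn.functional.binary_cross_entropy_with_logits(score, target)

# Usage: train on pairs of segments (random tensors stand in for features).
model = SpeakerEmbedder()
scale = nn.Parameter(torch.tensor(1.0))
bias = nn.Parameter(torch.tensor(0.0))
opt = torch.optim.Adam(list(model.parameters()) + [scale, bias], lr=1e-3)

seg_a, seg_b = torch.randn(300, 40), torch.randn(250, 40)
loss = pair_loss(model(seg_a), model(seg_b), same_speaker=True,
                 scale=scale, bias=bias)
opt.zero_grad(); loss.backward(); opt.step()
```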


spoken language technology workshop | 2014

A keyword search system using open source software

Jan Trmal; Guoguo Chen; Daniel Povey; Sanjeev Khudanpur; Pegah Ghahremani; Xiaohui Zhang; Vimal Manohar; Chunxi Liu; Aren Jansen; Dietrich Klakow; David Yarowsky; Florian Metze

This paper provides an overview of a speech-to-text (STT) and keyword search (KWS) system architecture built primarily on top of the Kaldi toolkit and expands on a few highlights. The system was developed as part of the research efforts of the Radical team while participating in the IARPA Babel program. Our aim was to develop a general system pipeline which could be easily and rapidly deployed in any language, independently of the language's script and of its phonological and linguistic features.
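
The paper itself is a system overview; as a rough illustration of the keyword-search side, the sketch below builds an inverted index from word-level lattice output, represented here as simple (word, start, end, posterior) tuples, and looks up keywords in it. The data layout, threshold, and scoring are assumptions for illustration, not the Kaldi KWS pipeline.

```python
from collections import defaultdict

def build_index(utterance_hits):
    """utterance_hits: {utt_id: [(word, start_s, end_s, posterior), ...]}
    drawn from decoded lattices. Returns word -> list of candidate hits."""
    index = defaultdict(list)
    for utt_id, hits in utterance_hits.items():
        for word, start, end, post in hits:
            index[word.lower()].append((utt_id, start, end, post))
    return index

def search(index, keyword, threshold=0.1):
    """Return putative hits for a single-word keyword, best first,
    keeping only those above a posterior threshold."""
    hits = [h for h in index.get(keyword.lower(), []) if h[3] >= threshold]
    return sorted(hits, key=lambda h: -h[3])

# Usage with made-up lattice output:
lattices = {
    "utt1": [("hello", 0.4, 0.8, 0.93), ("world", 0.9, 1.3, 0.71)],
    "utt2": [("hello", 2.1, 2.5, 0.35)],
}
index = build_index(lattices)
print(search(index, "hello"))   # [('utt1', 0.4, 0.8, 0.93), ('utt2', 2.1, 2.5, 0.35)]
```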


conference of the international speech communication association | 2016

Acoustic Modelling from the Signal Domain Using CNNs.

Pegah Ghahremani; Vimal Manohar; Daniel Povey; Sanjeev Khudanpur

Most speech recognition systems use spectral features based on fixed filters, such as MFCC and PLP. In this paper, we show that it is possible to achieve state-of-the-art results by making the feature extractor a part of the network and jointly optimizing it with the rest of the network. The basic approach is to start with a convolutional layer that operates on the signal (say, with a step size of 1.25 milliseconds), aggregate the filter outputs over a portion of the time axis using a network-in-network architecture, and then down-sample to every 10 milliseconds for use by the rest of the network. We find that, unlike some previous work on learned feature extractors, the objective function converges as fast as for a network based on traditional features. Because we found that iVector adaptation is less effective in this framework, we also experiment with a different adaptation method that is part of the network, where activation statistics over a medium time span (around a second) are computed at intermediate layers. We find that the resulting 'direct-from-signal' network is competitive with our state-of-the-art networks based on conventional features with iVector adaptation.
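
A minimal PyTorch sketch of the front-end idea: a 1-D convolution over the raw waveform with a step of 1.25 ms (20 samples at 16 kHz), a small network-in-network-style aggregation over time, and downsampling to a 10 ms frame rate for the rest of the acoustic model. Filter counts, kernel sizes, and the log compression are illustrative assumptions, not the exact configuration in the paper.

```python
import torch
import torch.nn as nn

class RawWaveformFrontEnd(nn.Module):
    """Learned feature extractor on the raw signal: filters applied every
    1.25 ms, aggregated and downsampled to one output every 10 ms."""
    def __init__(self, num_filters=100, kernel_ms=25.0, sr=16000, out_dim=40):
        super().__init__()
        stride = int(sr * 1.25e-3)              # 20 samples = 1.25 ms
        kernel = int(sr * kernel_ms * 1e-3)     # 400 samples = 25 ms
        self.conv = nn.Conv1d(1, num_filters, kernel_size=kernel, stride=stride)
        # 1x1 convolutions act as a small network-in-network over the filter
        # outputs; the final stride of 8 takes 1.25 ms steps down to 10 ms.
        self.nin = nn.Sequential(
            nn.Conv1d(num_filters, num_filters, kernel_size=1), nn.ReLU(),
            nn.Conv1d(num_filters, out_dim, kernel_size=9, stride=8, padding=4),
            nn.ReLU())

    def forward(self, wav):                     # wav: (batch, samples)
        x = self.conv(wav.unsqueeze(1))         # (batch, filters, T at 1.25 ms)
        x = torch.log1p(torch.abs(x))           # compress the dynamic range
        return self.nin(x)                      # (batch, out_dim, T at 10 ms)

# One second of 16 kHz audio -> roughly 100 frames of learned features.
frontend = RawWaveformFrontEnd()
feats = frontend(torch.randn(2, 16000))
print(feats.shape)                              # torch.Size([2, 40, 98])
```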


international conference on acoustics, speech, and signal processing | 2016

Linearly augmented deep neural network

Pegah Ghahremani; Jasha Droppo; Michael L. Seltzer

Deep neural networks (DNNs) are a powerful tool for many large vocabulary continuous speech recognition (LVCSR) tasks. Training a very deep network is a challenging problem and pre-training techniques are needed in order to achieve the best results. In this paper, we propose a new type of network architecture, the Linearly Augmented Deep Neural Network (LA-DNN). This type of network augments each non-linear layer with a linear connection from layer input to layer output. The resulting LA-DNN model eliminates the need for pre-training, addresses the vanishing-gradient problem for deep networks, has higher capacity in modeling linear transformations, trains significantly faster than a normal DNN, and produces better acoustic models. The proposed model has been evaluated on the TIMIT phoneme recognition and AMI speech recognition tasks. Experimental results show that LA-DNN models can have 70% fewer parameters than a DNN while still improving accuracy: the smaller LA-DNN models improve phone accuracy on TIMIT by 2% absolute and word accuracy on AMI by 1.7% absolute.
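
A minimal sketch of the layer-level idea: each non-linear layer is augmented with a linear path from its input to its output, so the layer computes a linear transform of the input plus the usual non-linearity, and gradients can flow through the linear path in a very deep stack. The dimensions, sigmoid non-linearity, and bias-free linear path are assumptions for illustration.

```python
import torch
import torch.nn as nn

class LinearlyAugmentedLayer(nn.Module):
    """One LA-DNN-style layer: output = linear(x) + nonlinear(x).
    The linear path gives gradients a direct route from output to input,
    which is what makes deep stacks trainable without pre-training."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear_path = nn.Linear(in_dim, out_dim, bias=False)
        self.nonlinear_path = nn.Sequential(nn.Linear(in_dim, out_dim), nn.Sigmoid())

    def forward(self, x):
        return self.linear_path(x) + self.nonlinear_path(x)

# A deep stack of such layers can be trained from random initialization.
model = nn.Sequential(
    *[LinearlyAugmentedLayer(256, 256) for _ in range(12)],
    nn.Linear(256, 40))
out = model(torch.randn(8, 256))
print(out.shape)    # torch.Size([8, 40])
```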


international conference on acoustics, speech, and signal processing | 2016

Self-stabilized deep neural network

Pegah Ghahremani; Jasha Droppo

Deep neural network models have been successfully applied to many tasks such as image labeling and speech recognition. Mini-batch stochastic gradient descent is the most prevalent method for training these models. A critical part of successfully applying this method is choosing appropriate initial values, as well as local and global learning rate scheduling algorithms. In this paper, we present a method which is less sensitive to the choice of initial values, works better than popular learning rate adjustment algorithms, and speeds the convergence of model parameters. We show that with the self-stabilized DNN method we no longer require initial learning rate tuning, and training converges quickly with a fixed global learning rate. The proposed method provides promising results over the conventional DNN structure, with a better convergence rate.
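
As a hedged sketch of one common self-stabilization formulation (not necessarily the exact one in the paper), each layer's input is multiplied by a single learned positive scalar, parameterized through a softplus so the scale stays positive and starts at one. This effectively gives every layer its own learned scale, reducing sensitivity to initialization and to the global learning rate.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfStabilizer(nn.Module):
    """Multiplies its input by one learned positive scalar per layer
    (a sketch of the self-stabilizer idea; assumptions, not the paper's
    exact formulation)."""
    def __init__(self):
        super().__init__()
        # softplus(beta0) == 1.0 at initialization, so training starts unscaled.
        beta0 = math.log(math.e - 1.0)
        self.beta = nn.Parameter(torch.tensor(beta0))

    def forward(self, x):
        return F.softplus(self.beta) * x

def stabilized_layer(in_dim, out_dim):
    return nn.Sequential(SelfStabilizer(), nn.Linear(in_dim, out_dim), nn.ReLU())

# A deep stack trained with one fixed global learning rate.
model = nn.Sequential(stabilized_layer(40, 512),
                      stabilized_layer(512, 512),
                      stabilized_layer(512, 512),
                      nn.Linear(512, 100))
opt = torch.optim.SGD(model.parameters(), lr=0.1)   # fixed global rate
```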


international conference on acoustics, speech, and signal processing | 2017

An empirical evaluation of zero resource acoustic unit discovery

Chunxi Liu; Jinyi Yang; Ming Sun; Santosh Kesiraju; Alena Rott; Lucas Ondel; Pegah Ghahremani; Najim Dehak; Lukas Burget; Sanjeev Khudanpur

Acoustic unit discovery (AUD) is the process of automatically identifying a categorical acoustic unit inventory from speech and producing the corresponding acoustic unit tokenizations. AUD provides an important avenue for unsupervised acoustic model training in a zero-resource setting where expert-provided linguistic knowledge and transcribed speech are unavailable. To further facilitate the zero-resource AUD process, in this paper we demonstrate that acoustic feature representations can be significantly improved by (i) performing linear discriminant analysis (LDA) in an unsupervised, self-trained fashion, and (ii) leveraging the resources of other languages by building a multilingual bottleneck (BN) feature extractor that gives effective cross-lingual generalization. Moreover, we perform comprehensive evaluations of AUD efficacy on multiple downstream speech applications; their correlated performance suggests that AUD evaluations are feasible using different alternative language resources when only a subset of these evaluation resources may be available in typical zero-resource applications.
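
The sketch below illustrates the "self-trained" LDA idea in the unsupervised setting: pseudo-labels obtained by clustering the frames stand in for the missing transcriptions, and LDA is fit against them to produce a discriminative projection. KMeans, scikit-learn's LDA, and the cluster/dimension counts are assumptions for illustration; the paper's actual tokenizer and labels may differ.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

def self_trained_lda(features, num_pseudo_classes=100, lda_dim=40):
    """Unsupervised, self-trained LDA: cluster labels serve as pseudo-classes,
    and LDA is estimated against them to project the acoustic features."""
    pseudo_labels = KMeans(n_clusters=num_pseudo_classes, n_init=5,
                           random_state=0).fit_predict(features)
    lda = LinearDiscriminantAnalysis(n_components=lda_dim)
    return lda.fit_transform(features, pseudo_labels), lda

# Usage with stand-in features (e.g. stacked MFCC or bottleneck frames):
feats = np.random.randn(5000, 120)
projected, lda = self_trained_lda(feats)
print(projected.shape)    # (5000, 40)
```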


conference of the international speech communication association | 2017

The Kaldi OpenKWS System: Improving Low Resource Keyword Search.

Jan Trmal; Matthew Wiesner; Vijayaditya Peddinti; Xiaohui Zhang; Pegah Ghahremani; Yiming Wang; Vimal Manohar; Hainan Xu; Daniel Povey; Sanjeev Khudanpur


2017 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) | 2017

Investigation of transfer learning for ASR using LF-MMI trained neural networks

Pegah Ghahremani; Vimal Manohar; Hossein Hadian; Daniel Povey; Sanjeev Khudanpur

Collaboration


Dive into Pegah Ghahremani's collaboration.

Top Co-Authors

Daniel Povey (Johns Hopkins University)
Vimal Manohar (Johns Hopkins University)
Jan Trmal (University of West Bohemia)
Najim Dehak (Massachusetts Institute of Technology)
Chunxi Liu (Johns Hopkins University)
Xiaohui Zhang (Johns Hopkins University)
Yiming Wang (Johns Hopkins University)