Ruhi Sarikaya
Microsoft
Publications
Featured research published by Ruhi Sarikaya.
IEEE Transactions on Audio, Speech, and Language Processing | 2014
Ruhi Sarikaya; Geoffrey E. Hinton; Anoop Deoras
Applications of Deep Belief Nets (DBN) to various problems have been the subject of a number of recent studies, ranging from image classification and speech recognition to audio classification. In this study we apply DBNs to a natural language understanding problem. The recent surge of activity in this area was largely spurred by the development of a greedy layer-wise pretraining method that uses an efficient learning algorithm called Contrastive Divergence (CD). CD allows DBNs to learn a multi-layer generative model from unlabeled data, and the features discovered by this model are then used to initialize a feed-forward neural network which is fine-tuned with backpropagation. We compare a DBN-initialized neural network to three widely used text classification algorithms: Support Vector Machines (SVM), Boosting, and Maximum Entropy (MaxEnt). The plain DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models. However, using additional unlabeled data for DBN pre-training and combining DBN-based learned features with the original features provides significant gains over SVMs, which, in turn, performed better than both MaxEnt and Boosting.
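For readers who want the pretraining step made concrete, below is a minimal sketch of one Contrastive Divergence (CD-1) update for a single binary RBM layer of a DBN, assuming numpy; all names and hyperparameters are illustrative, not taken from the paper.

    import numpy as np

    def sigmoid(x):
        return 1.0 / (1.0 + np.exp(-x))

    def cd1_step(v0, W, b_vis, b_hid, lr=0.01, rng=np.random.default_rng(0)):
        """One CD-1 update for a binary RBM (illustrative sketch)."""
        # Positive phase: hidden activations driven by the data.
        h0_prob = sigmoid(v0 @ W + b_hid)
        h0 = (rng.random(h0_prob.shape) < h0_prob).astype(float)
        # Negative phase: one step of Gibbs sampling ("reconstruction").
        v1_prob = sigmoid(h0 @ W.T + b_vis)
        h1_prob = sigmoid(v1_prob @ W + b_hid)
        # CD-1 approximation to the log-likelihood gradient.
        W += lr * (v0.T @ h0_prob - v1_prob.T @ h1_prob) / v0.shape[0]
        b_vis += lr * (v0 - v1_prob).mean(axis=0)
        b_hid += lr * (h0_prob - h1_prob).mean(axis=0)
        return W, b_vis, b_hid

Stacking such layers greedily and then fine-tuning the resulting feed-forward network with backpropagation is the DBN recipe the abstract refers to.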
Meeting of the Association for Computational Linguistics | 2006
Imed Zitouni; Jeffrey S. Sorensen; Ruhi Sarikaya
Short vowels and other diacritics are not part of written Arabic scripts. Exceptions are made for important political and religious texts and in scripts for beginning students of Arabic. Scripts without diacritics have considerable ambiguity because many words with different diacritic patterns appear identical in a diacritic-less setting. We propose in this paper a maximum entropy approach for restoring diacritics in a document. The approach can easily integrate and make effective use of diverse types of information; the model we propose integrates a wide array of lexical, segment-based, and part-of-speech tag features. The combination of these feature types leads to a state-of-the-art diacritization model. Using a publicly available corpus (LDC's Arabic Treebank Part 3), we achieve a diacritic error rate of 5.1%, a segment error rate of 8.5%, and a word error rate of 17.3%. In the case-ending-less setting, we obtain a diacritic error rate of 2.2%, a segment error rate of 4.0%, and a word error rate of 7.2%.
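As a rough illustration of the modeling approach, a maximum entropy classifier over such features behaves like multinomial logistic regression; the sketch below assumes scikit-learn, and the feature templates are simplified stand-ins for the paper's lexical, segment-based, and part-of-speech features.

    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    def diacritic_features(segments, pos_tags, i):
        """Toy lexical, segment, and POS features for position i."""
        return {
            "seg": segments[i],
            "prev_seg": segments[i - 1] if i > 0 else "<s>",
            "next_seg": segments[i + 1] if i + 1 < len(segments) else "</s>",
            "pos": pos_tags[i],
            "prefix2": segments[i][:2],
            "suffix2": segments[i][-2:],
        }

    # Multinomial logistic regression is the standard MaxEnt formulation.
    model = make_pipeline(DictVectorizer(), LogisticRegression(max_iter=1000))
    # X = [diacritic_features(segs, tags, i) for each position], y = diacritic
    # label per segment; then model.fit(X, y) and model.predict(X_new).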
IEEE Automatic Speech Recognition and Understanding Workshop | 2013
Puyang Xu; Ruhi Sarikaya
We describe a joint model for intent detection and slot filling based on convolutional neural networks (CNN). The proposed architecture can be viewed as a neural network (NN) version of the triangular CRF model (TriCRF), in which the intent label and the slot sequence are modeled jointly and their dependencies are exploited. Our slot filling component is a globally normalized CRF-style model, as opposed to the left-to-right models in recent NN-based slot taggers. Its features are automatically extracted through CNN layers and shared by the intent model. We show that our slot model component generates state-of-the-art results, outperforming CRF significantly. Our joint model outperforms the standard TriCRF by 1% absolute for both intent and slot. On a number of other domains, our joint model achieves 0.7-1% and 0.9-2.1% absolute gains over the independent modeling approach for intent and slot, respectively.
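A minimal sketch of this kind of shared-CNN joint architecture follows, assuming PyTorch; layer sizes and names are illustrative, and the TriCRF-style global normalization is only indicated in a comment rather than implemented.

    import torch
    import torch.nn as nn

    class ConvJointModel(nn.Module):
        """Shared CNN features feeding an intent classifier and a
        CRF-style slot tagger (illustrative sketch, not the paper's code)."""
        def __init__(self, vocab, emb=100, hid=200, n_intents=10, n_slots=20):
            super().__init__()
            self.embed = nn.Embedding(vocab, emb)
            self.conv = nn.Conv1d(emb, hid, kernel_size=3, padding=1)
            self.intent_out = nn.Linear(hid, n_intents)
            self.slot_out = nn.Linear(hid, n_slots)
            # Transition scores between adjacent slot labels (CRF component).
            self.trans = nn.Parameter(torch.zeros(n_slots, n_slots))

        def forward(self, tokens):                 # tokens: (batch, seq)
            h = torch.relu(self.conv(self.embed(tokens).transpose(1, 2)))
            h = h.transpose(1, 2)                  # (batch, seq, hid)
            intent_logits = self.intent_out(h.max(dim=1).values)
            slot_emissions = self.slot_out(h)      # per-token label scores
            # A TriCRF-style loss would globally normalize slot_emissions
            # together with self.trans and the intent label via the
            # forward algorithm, rather than scoring tokens left to right.
            return intent_logits, slot_emissions, self.trans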
IEEE Signal Processing Letters | 2000
Ruhi Sarikaya; John H. L. Hansen
This letter investigates the impact of stress on monophone speech recognition accuracy and proposes a new set of acoustic parameters based on high resolution wavelet analysis. The two parameter schemes are entitled wavelet packet parameters (WPP) and subband-based cepstral parameters (SBC). The performance of these features is compared to traditional Mel-frequency cepstral coefficients (MFCC) for stressed speech monophone recognition. The stressed speaking styles considered are neutral, angry, loud, and Lombard effect speech from the SUSAS database. An overall monophone recognition improvement of 20.4% and 17.2% is achieved for loud and angry stressed speech, with a corresponding increase in the neutral monophone rate of 9.9% over MFCC parameters.
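The SBC idea, subband energies from a wavelet packet tree decorrelated by a DCT in analogy to MFCC computation, can be sketched as follows; this assumes the pywt and scipy packages, and the wavelet choice, depth, and normalization are illustrative rather than the paper's exact recipe.

    import numpy as np
    import pywt
    from scipy.fftpack import dct

    def subband_cepstral(frame, wavelet="db4", level=4, n_ceps=13):
        """Subband-based cepstral parameters for one speech frame
        (illustrative sketch of the SBC idea)."""
        # Wavelet packet decomposition splits the frame into 2**level subbands.
        wp = pywt.WaveletPacket(frame, wavelet=wavelet, maxlevel=level)
        nodes = wp.get_level(level, order="freq")
        # Log energy per subband, analogous to log mel-filterbank energies.
        log_e = np.log(np.array([np.sum(n.data ** 2) + 1e-10 for n in nodes]))
        # Decorrelate with a DCT, as in MFCC computation.
        return dct(log_e, norm="ortho")[:n_ceps]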
International Conference on Acoustics, Speech, and Signal Processing | 2005
Ruhi Sarikaya; Agustin Gravano; Yuqing Gao
The paper addresses a critical problem in deploying a spoken dialog system (SDS). One of the main bottlenecks of SDS deployment for a new domain is data sparseness in building a statistical language model. Our goal is to devise an efficient method to build a reliable language model for a new SDS. We consider the worst, yet quite common, scenario where only a small amount (~1.7K utterances) of domain-specific data is available for the target domain. We present a new method that exploits external static text resources that are collected for other speech recognition tasks as well as dynamic text resources acquired from the World Wide Web (WWW). We show that language models built using external resources can be used jointly with a limited in-domain (baseline) language model to obtain significant improvements in speech recognition accuracy. Combining language models built using external resources with the in-domain language model provides over 20% reduction in WER over the baseline in-domain language model. Equivalently, we achieve almost the same level of performance by having ten times as much in-domain data (17K utterances).
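One common way to combine such models, and a plausible reading of "used jointly" here, is linear interpolation with the mixture weight tuned on held-out in-domain data; the sketch below is a generic illustration, not the paper's exact combination scheme.

    import math

    def interpolate(p_in, p_ext, lam):
        """Linear interpolation of two language model probabilities."""
        return lam * p_in + (1.0 - lam) * p_ext

    def heldout_perplexity(heldout, lm_in, lm_ext, lam):
        """Perplexity of the interpolated model on held-out (history, word)
        pairs; lm_in/lm_ext are callables returning P(word | history)."""
        log_sum = 0.0
        for history, word in heldout:
            log_sum += math.log(interpolate(lm_in(history, word),
                                            lm_ext(history, word), lam))
        return math.exp(-log_sum / len(heldout))

    # Pick the weight that minimizes held-out perplexity:
    # best_lam = min((k / 20 for k in range(1, 20)),
    #                key=lambda l: heldout_perplexity(dev, lm_in, lm_ext, l))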
International Conference on Acoustics, Speech, and Signal Processing | 2011
Ruhi Sarikaya; Geoffrey E. Hinton; Bhuvana Ramabhadran
This paper considers the application of Deep Belief Nets (DBNs) to natural language call routing. DBNs have been successfully applied to a number of tasks, including image, audio, and speech classification, thanks to the recent discovery of an efficient learning technique. DBNs learn a multi-layer generative model from unlabeled data, and the features discovered by this model are then used to initialize a feed-forward neural network which is fine-tuned with backpropagation. We compare a DBN-initialized neural network to three widely used text classification algorithms: Support Vector Machines (SVM), Boosting, and Maximum Entropy (MaxEnt). The DBN-based model gives a call-routing classification accuracy that is equal to the best of the other models even though it currently uses an impoverished representation of the input.
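For context, the two strongest conventional baselines mentioned here, SVM and MaxEnt, can be reproduced in spirit with off-the-shelf tools; the snippet below assumes scikit-learn and uses toy placeholder data, not the paper's call-routing corpus.

    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline
    from sklearn.svm import LinearSVC

    # Placeholder utterances and route labels for illustration only.
    utterances = ["i want to pay my bill", "cancel my account please"]
    routes = ["billing", "cancellation"]

    for clf in (LinearSVC(), LogisticRegression(max_iter=1000)):  # SVM, MaxEnt
        model = make_pipeline(TfidfVectorizer(), clf)
        model.fit(utterances, routes)
        print(type(clf).__name__, model.predict(["help with my bill"]))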
IEEE Signal Processing Letters | 2001
Bryan L. Pellom; Ruhi Sarikaya; John H. L. Hansen
This paper describes two effective algorithms that reduce the computational complexity of state likelihood computation in mixture-based Gaussian speech recognition systems. We consider a baseline recognition system that uses nearest-neighbor search and partial distance elimination (PDE) to compute state likelihoods. The first algorithm exploits the high dependence exhibited among subsequent feature vectors to predict the best-scoring mixture for each state. The method, termed best mixture prediction (BMP), leads to further speed improvement in the PDE technique. The second technique, termed feature component reordering (FCR), takes advantage of the variable contribution levels made to the final distortion score by each dimension of the feature and mean space vectors. The combination of the two techniques with PDE reduces the time for likelihood computation by 29.8% over the baseline. The algorithms are shown to yield the same accuracy level without further memory requirements on the November 1992 ARPA Wall Street Journal (WSJ) task.
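A minimal sketch of partial distance elimination for Gaussian scoring appears below, with diagonal unit covariances assumed for brevity; names are illustrative, and the comments mark where BMP and FCR would plug in.

    import numpy as np

    def pde_best_mixture(x, means, neg_log_weights):
        """Find the best-scoring mixture component, abandoning a component
        as soon as its partial distance exceeds the best score so far."""
        best_score, best_m = np.inf, -1
        for m in range(means.shape[0]):
            score = neg_log_weights[m]
            # FCR would reorder the dimensions so that components contributing
            # most to the distance are accumulated first, triggering
            # elimination earlier; BMP would seed best_score with the mixture
            # that won on the previous frame.
            for d in range(x.shape[0]):
                score += (x[d] - means[m, d]) ** 2
                if score >= best_score:   # partial distance elimination
                    break
            else:
                best_score, best_m = score, m
        return best_m, best_score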
IEEE Automatic Speech Recognition and Understanding Workshop | 2009
Stanley F. Chen; Lidia Mangu; Bhuvana Ramabhadran; Ruhi Sarikaya; Abhinav Sethy
In [1], we show that a novel class-based language model, Model M, and regularized minimum discrimination information (rMDI) models outperform comparable methods on moderate amounts of Wall Street Journal data. Both of these methods are motivated by the observation that shrinking the sum of parameter magnitudes in an exponential language model tends to improve performance [2]. In this paper, we investigate whether these shrinkage-based techniques also perform well on larger training sets and on other domains. First, we explain why good performance on large data sets is uncertain, by showing that gains relative to a baseline n-gram model tend to decrease as training set size increases. Next, we evaluate several methods for data/model combination with Model M and rMDI models on limited-scale domains, to uncover which techniques should work best on large domains. Finally, we apply these methods on a variety of medium-to-large-scale domains covering several languages, and show that Model M consistently provides significant gains over existing language models for state-of-the-art systems in both speech recognition and machine translation.
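To make the shrinkage observation concrete: the regularized training objectives in this line of work are, up to notation, of the l1+l2 form below (our paraphrase; see [2] for the exact formulation), where performance tends to improve as the sum of the parameter magnitudes |lambda_i| is kept small.

    \mathcal{O}(\Lambda) = \frac{1}{D} \sum_{j=1}^{D} \log p_\Lambda(y_j \mid x_j)
        - \alpha \sum_i |\lambda_i| - \frac{1}{2\sigma^2} \sum_i \lambda_i^2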
Theory of Computing Systems / Mathematical Systems Theory | 2006
Yuqing Gao; Bowen Zhou; Ruhi Sarikaya; Mohamed Afify; Hong-Kwang Kuo; Weizhong Zhu; Yonggang Deng; Charles Prosser; Wei Zhang; Laurent Besacier
In this paper, we describe IBM MASTOR, a speech-to-speech translation system that can translate spontaneous free-form speech in real time on both laptops and hand-held PDAs. Challenges include speech recognition and machine translation in adverse environments, a lack of training data and linguistic resources for under-studied languages, and the need to rapidly develop capabilities for new languages. Another challenge is designing algorithms and building models in a scalable manner so that they perform well even on memory- and CPU-constrained hand-held computers. We describe our approaches, experience, and success in building working free-form S2S systems that can handle two language pairs (including a low-resource language).
International Conference on Acoustics, Speech, and Signal Processing | 2002
Brian Kingsbury; George Saon; Lidia Mangu; Mukund Padmanabhan; Ruhi Sarikaya
We report on the system IBM fielded in the second SPeech In Noisy Environments (SPINE-2) evaluation, conducted by the Naval Research Laboratory in October 2001. The key components of the system include an HMM-based automatic segmentation module using a novel set of LDA-transformed voicing and energy features, a multiple-pass decoding strategy that uses several speaker- and environment-normalization operations to deal with the highly variable acoustics of the evaluation, the combination of hypotheses from decoders operating on three distinct acoustic feature sets, and a class-based language model that uses both the SPINE-1 and SPINE-2 training data to estimate reliable probabilities for the new SPINE-2 vocabulary.
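The class-based language model component can be illustrated with the standard class-bigram factorization; the sketch below assumes hard word-to-class assignments and is not the fielded system's exact model.

    def class_bigram_prob(w_prev, w, word2class, p_class_trans, p_word_in_class):
        """Class-based bigram: P(w | w_prev) decomposes into a class
        transition term and a class membership term (illustrative)."""
        c_prev, c = word2class[w_prev], word2class[w]
        return p_class_trans[(c_prev, c)] * p_word_in_class[(w, c)]

    # Classes pool counts across words, so probabilities for rarely seen
    # SPINE-2 vocabulary can be estimated from SPINE-1 data via their classes.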