Ebru Arisoy
IBM
Publications
Featured research published by Ebru Arisoy.
international conference on acoustics, speech, and signal processing | 2013
Tara N. Sainath; Brian Kingsbury; Vikas Sindhwani; Ebru Arisoy; Bhuvana Ramabhadran
While Deep Neural Networks (DNNs) have achieved tremendous success for large vocabulary continuous speech recognition (LVCSR) tasks, training of these networks is slow. One reason is that DNNs have a large number of trainable parameters (typically 10-50 million). Because networks are trained with a large number of output targets to achieve good performance, the majority of these parameters are in the final weight layer. In this paper, we propose a low-rank matrix factorization of the final weight layer. We apply this low-rank technique to DNNs for both acoustic modeling and language modeling. We show, on three different LVCSR tasks ranging from 50 to 400 hours, that a low-rank factorization reduces the number of parameters of the network by 30-50%. This results in roughly an equivalent reduction in training time, without a significant loss in final recognition accuracy, compared to a full-rank representation.
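The core idea of the factorization can be sketched in a few lines of NumPy. This is an illustrative toy, not the paper's code: the layer sizes `h`, `o` and rank `r` are hypothetical, and the comparison only counts parameters in the final layer.

```python
import numpy as np

# Replace the final hidden-to-output weight matrix W (h x o)
# with a rank-r product A @ B, as in the low-rank factorization idea.
h, o, r = 1024, 10000, 128                  # hypothetical layer sizes and rank

rng = np.random.default_rng(0)
W_full = rng.standard_normal((h, o))        # full-rank final layer
A = rng.standard_normal((h, r))             # h x r factor
B = rng.standard_normal((r, o))             # r x o factor

hidden = rng.standard_normal((1, h))        # one hidden activation vector
logits_low_rank = hidden @ A @ B            # same output shape as hidden @ W_full

full_params = h * o                         # parameters in the full layer
low_rank_params = r * (h + o)               # parameters in the factored layer
print(logits_low_rank.shape, full_params, low_rank_params)
```

With these (made-up) sizes the factored layer has roughly 86% fewer parameters than the full layer; the paper's reported 30-50% figure is for the whole network, where other layers are unchanged.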
ACM Transactions on Speech and Language Processing | 2007
Mathias Creutz; Teemu Hirsimäki; Mikko Kurimo; Antti Puurula; Janne Pylkkönen; Vesa Siivola; Matti Varjokallio; Ebru Arisoy; Murat Saraclar; Andreas Stolcke
We explore the use of morph-based language models in large-vocabulary continuous-speech recognition systems across four so-called morphologically rich languages: Finnish, Estonian, Turkish, and Egyptian Colloquial Arabic. The morphs are subword units discovered in an unsupervised, data-driven way using the Morfessor algorithm. By estimating n-gram language models over sequences of morphs instead of words, the quality of the language model is improved through better vocabulary coverage and reduced data sparsity. Standard word models suffer from high out-of-vocabulary (OOV) rates, whereas the morph models can recognize previously unseen word forms by concatenating morphs. It is shown that the morph models do perform fairly well on OOVs without compromising the recognition accuracy on in-vocabulary words. The Arabic experiment constitutes the only exception since here the standard word model outperforms the morph model. Differences in the datasets and the amount of data are discussed as a plausible explanation.
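The OOV-coverage argument can be illustrated with a toy segmenter. The morph lexicon and the Turkish-like example word below are made up, and a real system would use Morfessor's learned lexicon rather than greedy longest-match splitting.

```python
# Toy morph lexicon (illustrative only; not Morfessor output).
morphs = {"ev", "ler", "imiz", "de", "kitap", "lik"}

def segment(word, lexicon):
    """Greedy longest-match split of a word into known morphs."""
    out, i = [], 0
    while i < len(word):
        for j in range(len(word), i, -1):
            if word[i:j] in lexicon:
                out.append(word[i:j])
                i = j
                break
        else:
            return None          # cannot be covered -> still OOV
    return out

# "evlerimizde" is absent from a word vocabulary, but is covered
# by concatenating in-vocabulary morphs:
print(segment("evlerimizde", morphs))   # ['ev', 'ler', 'imiz', 'de']
```

An n-gram model estimated over such morph sequences can then assign a probability to the unseen word form, which a word-level model could not.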
IEEE Transactions on Audio, Speech, and Language Processing | 2009
Ebru Arisoy; Dogan Can; Siddika Parlak; Hasim Sak; Murat Saraclar
This paper summarizes our recent efforts for building a Turkish Broadcast News transcription and retrieval system. The agglutinative nature of Turkish leads to a high number of out-of-vocabulary (OOV) words which in turn lower automatic speech recognition (ASR) accuracy. This situation compromises the performance of speech retrieval systems based on ASR output. Therefore, using a word-based ASR system is not adequate for transcribing speech in Turkish. To alleviate this problem, various sub-word-based recognition units are utilized. These units solve the OOV problem with moderate-size vocabularies and, in terms of recognition accuracy, perform even better than a 500K-word vocabulary. As a novel approach, the interaction between recognition units, words and sub-words, and discriminative training is explored. Sub-word models benefit from discriminative training more than word models do, especially in the discriminative language modeling framework. For speech retrieval, a spoken term detection system based on automata indexation is utilized. As with transcription, retrieval performance is measured under various schemes incorporating words and sub-words. Best results are obtained using a cascade of word and sub-word indexes together with term-specific thresholding.
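The cascade of indexes can be sketched schematically. Everything below is hypothetical (the paper indexes automata built from lattices, not flat hit lists): the point is just the control flow of querying the word index first and backing off to a sub-word index, with a per-term score threshold.

```python
# Hypothetical hit lists: term -> [(time_in_seconds, detection_score), ...].
word_index = {"ankara": [(12.4, 0.95), (80.1, 0.40)]}
subword_index = {"an+ka+ra": [(12.4, 0.90)], "is+tan+bul": [(55.0, 0.70)]}

def detect(term, subword_form, threshold):
    """Query the word index; if the term is absent (e.g. OOV for the
    word-based system), fall back to its sub-word decomposition."""
    hits = word_index.get(term) or subword_index.get(subword_form, [])
    return [(t, s) for t, s in hits if s >= threshold]

print(detect("ankara", "an+ka+ra", 0.5))      # found in the word index
print(detect("istanbul", "is+tan+bul", 0.5))  # OOV word found via sub-words
```

Making `threshold` depend on the query term is the "term-specific thresholding" the abstract refers to.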
language and technology conference | 2006
Mikko Kurimo; Antti Puurula; Ebru Arisoy; Vesa Siivola; Teemu Hirsimäki; Janne Pylkkönen; Tanel Alumäe; Murat Saraclar
It is practically impossible to build a word-based lexicon for speech recognition in agglutinative languages that would cover all the relevant words. The problem is that words are generally built by concatenating several prefixes and suffixes to the word roots. Together with compounding and inflections this leads to millions of different, but still frequent word forms. Due to inflections, ambiguity and other phenomena, it is also not trivial to automatically split the words into meaningful parts. Rule-based morphological analyzers can perform this splitting, but due to the handcrafted rules, they also suffer from an out-of-vocabulary problem. In this paper we apply a recently proposed fully automatic and rather language and vocabulary independent way to build sub-word lexica for three different agglutinative languages. We demonstrate the language portability as well by building a successful large vocabulary speech recognizer for each language and show superior recognition performance compared to the corresponding word-based reference systems.
IEEE Transactions on Audio, Speech, and Language Processing | 2014
Ebru Arisoy; Stanley F. Chen; Bhuvana Ramabhadran; Abhinav Sethy
Neural Network Language Models (NNLMs) have achieved very good performance in large-vocabulary continuous speech recognition (LVCSR) systems. Because decoding with NNLMs is very computationally expensive, there is interest in developing methods to approximate NNLMs with simpler language models that are suitable for fast decoding. In this work, we propose an approximate method for converting a feedforward NNLM into a back-off n-gram language model that can be used directly in existing LVCSR decoders. We convert NNLMs of increasing order to pruned back-off language models, using lower-order models to constrain the n-grams allowed in higher-order models. In experiments on Broadcast News data, we find that the resulting back-off models retain the bulk of the gain achieved by NNLMs over conventional n-gram language models, and give significant accuracy improvements as compared to existing methods for converting NNLMs to back-off models. In addition, the proposed approach can be applied to any type of non-back-off language model to enable efficient decoding.
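The conversion can be sketched on a toy scale. A stand-in probability table plays the role of the feedforward NNLM here, and the renormalization scheme is a simplified illustration of the back-off construction, not the paper's exact algorithm.

```python
vocab = ["the", "cat", "sat"]

def nnlm_prob(word, history):
    """Stand-in for an NNLM softmax over the vocabulary given `history`."""
    table = {("the",): {"cat": 0.6, "sat": 0.3, "the": 0.1}}
    return table[history][word]

unigram = {w: 1.0 / len(vocab) for w in vocab}   # lower-order back-off model

def convert_bigrams(history, allowed):
    """Store only the `allowed` bigrams explicitly (the lower-order model
    constrains which n-grams are kept); route the remaining probability
    mass through the back-off path to the unigram model."""
    explicit = {w: nnlm_prob(w, history) for w in allowed}
    leftover = 1.0 - sum(explicit.values())
    unseen_mass = sum(unigram[w] for w in vocab if w not in allowed)
    bow = leftover / unseen_mass                 # back-off weight for `history`
    return explicit, bow

explicit, bow = convert_bigrams(("the",), {"cat", "sat"})
total = sum(explicit.values()) + bow * unigram["the"]
print(round(total, 6))   # 1.0: the pruned back-off model stays normalized
```

The resulting explicit probabilities and back-off weights are exactly what an ARPA-format model stores, which is why the converted model can be loaded by an unmodified LVCSR decoder.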
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Ebru Arisoy; Murat Saraclar; Brian Roark; Izhak Shafran
This paper focuses on integrating linguistically motivated and statistically derived information into language modeling. We use discriminative language models (DLMs) as a complementary approach to the conventional n-gram language models to benefit from discriminatively trained parameter estimates for overlapping features. In our DLM approach, relevant information is encoded as features. Feature weights are discriminatively trained using training examples and used to re-rank the N-best hypotheses of the baseline automatic speech recognition (ASR) system. In addition to presenting a more complete picture of previously proposed feature sets that extract implicit information available at lexical and sub-lexical levels using both linguistic and statistical approaches, this paper attempts to incorporate semantic information in the form of topic-sensitive features. We explore linguistic features to incorporate complex morphological and syntactic language characteristics of Turkish, an agglutinative language with rich morphology, into language modeling. We also apply DLMs to our sub-lexical-based ASR system where the vocabulary is composed of sub-lexical units. Obtaining implicit linguistic information from sub-lexical hypotheses is not as straightforward as it is from word hypotheses, so we use statistical methods to derive useful information from sub-lexical units. DLMs with linguistic and statistical features yield significant, 0.8%-1.1% absolute, improvements over our baseline word-based and sub-word-based ASR systems. The explored features can be easily extended to DLMs for other languages.
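The reranking step itself is a linear model over feature counts. The sketch below is schematic: the n-gram feature map is a minimal example of the feature sets the paper describes, and the weights are hand-set for illustration, whereas the paper trains them discriminatively from reference transcriptions.

```python
from collections import Counter

def features(hyp):
    """Toy feature map: unigram and bigram presence counts."""
    words = hyp.split()
    return Counter(["uni:" + w for w in words] +
                   ["bi:" + a + "_" + b for a, b in zip(words, words[1:])])

# Hand-set weights for illustration; in the DLM these are learned.
weights = {"bi:ev_de": 1.5, "uni:evde": -0.5}

def rerank(nbest):
    """Pick the hypothesis maximizing baseline ASR score + feature score."""
    def score(item):
        hyp, asr_score = item
        return asr_score + sum(weights.get(f, 0.0) * c
                               for f, c in features(hyp).items())
    return max(nbest, key=score)

nbest = [("evde kal", -10.0), ("ev de kal", -10.5)]
print(rerank(nbest))   # ('ev de kal', -10.5): the features flip the ranking
```

Any information source, lexical, morphological, syntactic, or topic-based, can be plugged in by extending `features`, which is the sense in which the approach encodes "relevant information as features".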
international conference on acoustics, speech, and signal processing | 2010
Ebru Arisoy; Murat Saraclar; Brian Roark; Izhak Shafran
This paper investigates syntactic and sub-lexical features in Turkish discriminative language models (DLMs). DLM is a feature-based language modeling approach that reranks the ASR output with discriminatively trained feature parameters. Syntactic information is incorporated into the DLM as part-of-speech (PoS) tag n-gram features and head-to-head dependency relations. Sub-lexical units are first utilized as language modeling units in the baseline recognizer. Then, sub-lexical features are used to rerank the sub-lexical hypotheses. We explore features, similar to syntactic features, on sub-lexical units to reveal the implicit morpho-syntactic information conveyed by these units. We find that DLM yields more improvement for sub-lexical units than for words. Basic sub-lexical n-gram features result in a 0.6% reduction over the baseline, and morpho-syntactic features yield an additional 0.4% reduction on the test set.
international conference on acoustics, speech, and signal processing | 2015
Ebru Arisoy; Abhinav Sethy; Bhuvana Ramabhadran; Stanley F. Chen
Recurrent neural network language models have enjoyed great success in speech recognition, partially due to their ability to model longer-distance context than word n-gram models. In recurrent neural networks (RNNs), contextual information from past inputs is modeled with the help of recurrent connections at the hidden layer, while Long Short-Term Memory (LSTM) neural networks are RNNs that contain units that can store values for arbitrary amounts of time. While conventional unidirectional networks predict outputs from only past inputs, one can build bidirectional networks that also condition on future inputs. In this paper, we propose applying bidirectional RNNs and LSTM neural networks to language modeling for speech recognition. We discuss issues that arise when utilizing bidirectional models for speech, and compare unidirectional and bidirectional models on an English Broadcast News transcription task. We find that bidirectional RNNs significantly outperform unidirectional RNNs, but bidirectional LSTMs do not provide any further gain over their unidirectional counterparts.
international conference on acoustics, speech, and signal processing | 2013
Ebru Arisoy; Stanley F. Chen; Bhuvana Ramabhadran; Abhinav Sethy
Neural network language models (NNLMs) have achieved very good performance in large-vocabulary continuous speech recognition (LVCSR) systems. Because decoding with NNLMs is computationally expensive, there is interest in developing methods to approximate NNLMs with simpler language models that are suitable for fast decoding. In this work, we propose an approximate method for converting a feedforward NNLM into a back-off n-gram language model that can be used directly in existing LVCSR decoders. We convert NNLMs of increasing order to pruned back-off language models, using lower-order models to constrain the n-grams allowed in higher-order models. In experiments on Broadcast News data, we find that the resulting back-off models retain the bulk of the gain achieved by NNLMs over conventional n-gram language models, and give accuracy improvements as compared to existing methods for converting NNLMs to back-off models. In addition, the proposed approach can be applied to any type of non-back-off language model to enable efficient decoding.
IEEE Transactions on Audio, Speech, and Language Processing | 2009
Ebru Arisoy; Murat Saraclar
This paper presents two-pass speech recognition techniques to handle the out-of-vocabulary (OOV) problem in Turkish newspaper content transcription. OOV words are assumed to be replaced by acoustically "similar" in-vocabulary (IV) words during decoding. Therefore, the first pass recognition lattice is used as the prior knowledge to adapt the vocabulary and the search space for the second pass. Vocabulary adaptation and lattice extension are performed with words similar to the hypothesis lattice words. These words are selected from a fallback vocabulary using distance functions that take the agglutinative language characteristics of Turkish into account. Morphology-based and phonetic-distance-based similarity functions respectively yield 1.9% and 4.6% absolute accuracy improvements. Statistical sub-word units are also utilized to handle the OOV problem encountered in the word-based system. Using sub-words alleviates the OOV problem and improves the recognition accuracy - OOV accuracy improved from 0% to 60.2%. However, this introduces ungrammatical items to the recognition output. Since automatically derived sub-word units do not provide explicit morphological features, the lattice extension strategy is modified to correct these ungrammatical items. Lattice extension for sub-words reduces the word error rate to 32.3% from 33.9%. This improvement is statistically significant at p=0.002 as measured by the NIST MAPSSWE significance test.
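The fallback-vocabulary selection step can be sketched with plain edit distance as a stand-in for the paper's morphology-based and phonetic-distance-based similarity functions. The vocabulary and example word below are made up for illustration.

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

fallback_vocab = ["evlerde", "evlerinde", "kitaplarda"]
lattice_word = "evlere"     # hypothesized IV word standing in for an OOV

# Words close to the lattice word are added to the second-pass vocabulary.
candidates = [w for w in fallback_vocab if edit_distance(lattice_word, w) <= 3]
print(candidates)   # ['evlerde', 'evlerinde']
```

The second pass then decodes with this adapted vocabulary and the correspondingly extended lattice, giving the recognizer a chance to recover the true OOV word.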