
Publications


Featured research published by Andrej Ljolje.


Archive | 1996

Automatic Generation of Detailed Pronunciation Lexicons

Michael Riley; Andrej Ljolje

We explore different ways of “spelling” a word in a speech recognizer’s lexicon and how to obtain those spellings. In particular, we compare using as the source of sub-word units for which we build acoustic models (1) a coarse phonemic representation, (2) a single, fine phonetic realization, and (3) multiple phonetic realizations with associated likelihoods. We describe how we obtain these different pronunciations from text-to-speech systems and from procedures that build decision trees trained on phonetically-labeled corpora. We evaluate these methods applied to speech recognition with the DARPA Resource Management (RM) and the North American Business News (NAB) tasks. For the RM task (with perplexity 60 grammar), we obtain 93.4% word accuracy using phonemic pronunciations, 94.1% using a single phonetic pronunciation per word, and 96.3% using multiple phonetic pronunciations per word with associated likelihoods. For the NAB task (with 60K vocabulary and 34M 1–5 grams), we obtain 87.3% word accuracy with phonemic pronunciations and 90.0% using multiple phonetic pronunciations.
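The third option, multiple phonetic realizations with associated likelihoods, can be sketched as a weighted lexicon consulted at decode time. The word, phone symbols, and weights below are illustrative assumptions, not values from the paper:

```python
import math

# Hypothetical weighted lexicon: each word maps to alternative phonetic
# realizations with associated probabilities (option 3 in the abstract).
LEXICON = {
    "data": [
        (("d", "ey", "dx", "ax"), 0.7),  # flapped-/t/ realization
        (("d", "ae", "dx", "ax"), 0.3),
    ],
}

def best_variant(word, acoustic_logprobs):
    """Pick the most likely realization of a word by combining each
    variant's pronunciation log-prior with its acoustic log-likelihood,
    the way a decoder weights alternative 'spellings'."""
    scored = [
        (math.log(prob) + acoustic_logprobs[i], variant)
        for i, (variant, prob) in enumerate(LEXICON[word])
    ]
    return max(scored)[1]
```

With equal acoustic scores the prior decides; a sufficiently better acoustic match for the rarer variant overrides it.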


international conference on acoustics, speech, and signal processing | 1991

Automatic segmentation and labeling of speech

Andrej Ljolje; Michael Riley

The authors investigate an automatic approach to segmentation of labeled speech, and to labeling and segmentation of speech when only the orthographic transcription is available. The technique is based on a phone recognition system that uses a trigram phonotactic model, gamma-distribution phone duration models, and a spectral model with five different structures for phone models of varying contextual dependencies. The alignment of speech with a given phone sequence is performed as a very constrained phone recognition task, with the phonotactic model based only on the given phone sequence. When only an orthographic transcription is provided, a classification-tree-based prediction of the most likely phone realizations is used as the input network for the phone recognizer. The maximum likelihood phone sequence is then treated as the true phone sequence and its segment boundaries are compared with the reference boundaries.
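The constrained-recognition view of alignment reduces, in its simplest form, to a Viterbi pass over a fixed phone sequence. This is a minimal sketch under strong simplifying assumptions (no duration or phonotactic models, hypothetical per-frame log-likelihoods), not the paper's recognizer:

```python
def align(phones, frame_scores):
    """Forced alignment: given a fixed phone sequence and per-frame
    log-likelihoods for each phone (frame_scores[t][phone]), find the
    left-to-right segmentation maximizing total log-likelihood.
    Returns, for each frame, the index of the phone it is assigned to."""
    T, N = len(frame_scores), len(phones)
    NEG = float("-inf")
    dp = [[NEG] * N for _ in range(T)]
    back = [[0] * N for _ in range(T)]
    dp[0][0] = frame_scores[0][phones[0]]
    for t in range(1, T):
        for i in range(N):
            stay = dp[t - 1][i]                       # remain in phone i
            advance = dp[t - 1][i - 1] if i > 0 else NEG  # enter phone i
            best = max(stay, advance)
            if best == NEG:
                continue
            back[t][i] = i if stay >= advance else i - 1
            dp[t][i] = best + frame_scores[t][phones[i]]
    # Trace back the segment boundaries from the final phone.
    path, i = [], N - 1
    for t in range(T - 1, -1, -1):
        path.append(i)
        i = back[t][i]
    return path[::-1]
```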


IEEE Transactions on Signal Processing | 1991

Development of an acoustic-phonetic hidden Markov model for continuous speech recognition

Andrej Ljolje; Stephen E. Levinson

The techniques used to develop an acoustic-phonetic hidden Markov model, the problems associated with representing the whole acoustic-phonetic structure, the characteristics of the model, and how it performs as a phonetic decoder for recognition of fluent speech are discussed. The continuous variable duration model was trained using 450 sentences of fluent speech, each of which was spoken by a single speaker, and segmented and labeled using a fixed number of phonemes, each of which has a direct correspondence to the states of the model. The inherent variability of each phoneme is modeled as the observable random process of the Markov chain, while the phonotactic model of the unobservable phonetic sequence is represented by the state transition matrix of the hidden Markov model. The model assumes that the observed spectral data were generated by a Gaussian source. However, an analysis of the data shows that the spectra for most of the phonemes are not normally distributed and that an alternative representation would be beneficial.


Computer Speech & Language | 1994

High accuracy phone recognition using context clustering and quasi-triphonic models

Andrej Ljolje

A new phone recognizer has been implemented which extends the (phonotactic) decoding constraint to sequences of three phones. It is based on a structure similar to a second-order ergodic hidden Markov model (HMM). This kind of model assumes a direct correspondence between the model states and phones; thus constraints on possible state sequences are equivalent to phonotactic constraints. Very high coverage by both left and right context-dependent phone models has been achieved using two methods. The first assumes that some contexts have the same or very similar effect on the phone in question, so they are merged into the same contextual class. The outcome is a set of 19 left context classes and 18 right context classes. The second assumes that the left context mostly influences the beginning of a phone, whereas the right context influences the end of the phone. Each phone (a state in an ergodic HMM) is represented by a sequence of three probability density functions (pdfs), which is similar to a three-state left-to-right HMM. We generate acoustic models such that the first pdf in the model is conditioned on the left context, the middle pdf is context independent (or it can also be context dependent), and the last pdf is conditioned on the right context. A large number of such quasi-triphonic acoustic models can be generated, thus providing good triphone coverage for a given task while efficiently utilizing the available training data. The current implementations of the recognizer have been applied to the DARPA Resource Management (RM) task, to demonstrate the feasibility of performing phone (not phoneme) recognition using an untranscribed database, and to the TIMIT database, for comparison with existing phone recognition systems. Since true phone sequences for the training utterances are not available for the RM database, they are estimated from text using a phone realization classification tree trained on the TIMIT database transcriptions. The estimates of the true phone sequences are used in training the models and in generating reference phone sequences for scoring. The best phone recognition match between the most likely path through the classification tree and the phone recognizer output for the DARPA February 89 test set was 80.5% accurate and 84.0% correct. The best result obtained using the same recognizer structure on the TIMIT database is 69.4% accurate and 74.8% correct, a significant improvement over the best published result when both are reduced to the same phone set.
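The two context-merging ideas, merged context classes and position-dependent conditioning, can be illustrated as a small pdf-naming scheme. The class tables below are tiny stand-ins for the paper's 19 left and 18 right context classes:

```python
# Illustrative context classes: contexts with similar effects share a class.
LEFT_CLASS = {"p": "labial", "b": "labial", "t": "alveolar", "d": "alveolar"}
RIGHT_CLASS = {"iy": "front-vowel", "ih": "front-vowel", "uw": "back-vowel"}

def quasi_triphone(left, phone, right):
    """Build the three-pdf sequence for a phone: the first pdf is keyed on
    the left context class, the middle pdf is context independent, and the
    last pdf is keyed on the right context class."""
    return (
        f"{phone}/begin:{LEFT_CLASS.get(left, 'other')}",
        f"{phone}/mid",
        f"{phone}/end:{RIGHT_CLASS.get(right, 'other')}",
    )
```

Because /p/ and /b/ fall in the same left class, the triphones p-ae+iy and b-ae+iy share all three pdfs, which is how the scheme stretches limited training data over many triphones.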


international conference on acoustics, speech, and signal processing | 1990

Estimation of hidden Markov model parameters by minimizing empirical error rate

Andrej Ljolje; Yariv Ephraim; Lawrence R. Rabiner

An approach is studied for designing a set of acoustic models for speech recognition applications that results in a minimal empirical error rate for a given decoder and training data. In an evaluation of the system on an isolated word recognition task, hidden Markov models (HMMs) are used to characterize the probability density functions of the acoustic signals from the different words in the vocabulary. Decoding is performed by applying the maximum a posteriori decision rule to the acoustic models. The HMMs are estimated by minimizing a differentiable cost function, which approximates the empirical error rate function, using the steepest descent method. The HMMs designed by the minimum empirical error rate approach were used in multispeaker recognition of the English E-set words and compared to models designed by the standard maximum-likelihood estimation approach. The approach increased recognition accuracy from 68.2% to 76.2% on the training set and from 53.4% to 56.4% on an independent set of test data.
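The core idea, replacing the non-differentiable 0/1 error count with a smooth surrogate and descending its gradient, can be sketched on a toy problem. Scalar "models" (one mean per class) and numerical gradients stand in for full HMMs here; this is an assumption-laden illustration, not the paper's estimator:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def smoothed_error(mus, data):
    """Differentiable approximation of the empirical error rate: a sigmoid
    of the score difference between the competing and the correct model.
    Scores here are negative squared distances to each class mean."""
    err = 0.0
    for x, label in data:
        s = [-(x - m) ** 2 for m in mus]
        err += sigmoid(s[1 - label] - s[label])
    return err / len(data)

def train(data, mus, lr=0.05, steps=200):
    """Steepest descent on the smoothed error (numerical gradients for
    brevity; the paper differentiates the cost analytically)."""
    eps = 1e-5
    for _ in range(steps):
        grads = []
        for i in range(len(mus)):
            up, dn = mus[:], mus[:]
            up[i] += eps
            dn[i] -= eps
            grads.append((smoothed_error(up, data) - smoothed_error(dn, data)) / (2 * eps))
        mus = [m - lr * g for m, g in zip(mus, grads)]
    return mus
```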


Computer Speech & Language | 1994

The importance of cepstral parameter correlations in speech recognition

Andrej Ljolje

In this work we demonstrate that explicit modeling of correlations between spectral parameters in speech recognition improves speech models both in terms of their descriptive power (higher likelihoods) and their classification accuracy. Most large-vocabulary speech recognition systems are based on some form of hidden Markov models (HMMs) modeling sub-word speech segments, with the segments most often represented by short-term spectra. In this work we employ three-state left-to-right phone models and LPC cepstral parameters, including their first- and second-order time differentials, and investigate the importance of modeling correlations between cepstral parameters for high-accuracy phone recognition. Several different types of distributions for each HMM state are compared. The simplest uses a single multivariate Gaussian distribution with a full covariance matrix. The next uses a weighted mixture of multivariate Gaussian distributions with diagonal covariances; it models parameter correlations implicitly rather than explicitly. The most elaborate model employs a mixture of Gaussian distributions, like the previous model, but in addition uses a parameter-space rotation specific to each HMM state; it thus explicitly models parameter correlations in exactly the same way as the simplest model, which uses a single distribution per state. The highest phone accuracy on the DARPA Resource Management task Feb 89 test set is obtained using the most elaborate model, with mixtures and space rotation: 82.4% phone accuracy. The next best result, 80.8% phone accuracy, was achieved using single distributions, which also explicitly model parameter correlations. The worst result, 78.7% phone accuracy, was obtained using distributions that only implicitly model parameter correlations. These results clearly demonstrate the importance of explicitly modeling parameter correlations for improving speech recognition performance.
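The gap between explicit and implicit correlation modeling can be seen in a two-dimensional Gaussian worked by hand: a point lying along the correlation direction of the data scores higher under the full-covariance density than under the same density with its off-diagonal terms zeroed. The numbers below are illustrative, not cepstral statistics from the paper:

```python
import math

def loglik_full(x, mean, cov):
    """Log-likelihood of a 2-D point under a Gaussian with a full 2x2
    covariance matrix, worked by hand (no linear-algebra library)."""
    a, b = cov[0]
    c, d = cov[1]
    det = a * d - b * c
    dx = [x[0] - mean[0], x[1] - mean[1]]
    # inverse of [[a, b], [c, d]] is [[d, -b], [-c, a]] / det
    quad = (d * dx[0] ** 2 - (b + c) * dx[0] * dx[1] + a * dx[1] ** 2) / det
    return -0.5 * (quad + math.log(det) + 2 * math.log(2 * math.pi))

def loglik_diag(x, mean, cov):
    """The same Gaussian with off-diagonal covariance terms zeroed:
    correlations are ignored, as in a diagonal-covariance mixture
    component before any parameter-space rotation."""
    return loglik_full(x, mean, [[cov[0][0], 0.0], [0.0, cov[1][1]]])
```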


human language technology | 1989

Speaker independent phonetic transcription of fluent speech for large vocabulary speech recognition

Stephen E. Levinson; Mark Liberman; Andrej Ljolje; L. G. Miller

Speaker independent phonetic transcription of fluent speech is performed using an ergodic continuously variable duration hidden Markov model (CVDHMM) to represent the acoustic, phonetic and phonotactic structure of speech. An important property of the model is that each of its fifty-one states is uniquely identified with a single phonetic unit. Thus, for any spoken utterance, a phonetic transcription is obtained from a dynamic programming (DP) procedure for finding the state sequence of maximum likelihood. A model has been constructed based on 4020 sentences from the TIMIT database. When tested on 180 different sentences from this database, phonetic accuracy was observed to be 56% with 9% insertions. A speaker dependent version of the model was also constructed. The transcription algorithm was then combined with lexical access and parsing routines to form a complete recognition system. When tested on sentences from the DARPA resource management task spoken over the local switched telephone network, phonetic accuracy of 64% with 8% insertions and word accuracy of 87% with 3% insertions was measured. This system is presently operating in an on-line mode over the local switched telephone network in less than ten times real time on an Alliant FX-80.


human language technology | 1989

Continuous speech recognition from phonetic transcription

Stephen E. Levinson; Andrej Ljolje

Previous research by the authors has been directed toward phonetic transcription of fluent speech. We have applied our techniques to speech recognition on the DARPA Resource Management Task. In order to perform speech recognition, however, the phonetic transcription must be interpreted as a sequence of words. A central component of this process is lexical access for which a novel method is proposed.


international conference on acoustics, speech, and signal processing | 1988

Large vocabulary speech recognition using a hidden Markov model for acoustic/phonetic classification

Stephen E. Levinson; Andrej Ljolje; L. G. Miller

Experiments with a speech recognition system are reported. The system comprises an acoustic/phonetic decoder, a lexical access mechanism and a syntax analyzer. The acoustic, phonetic and lexical processing are based on a continuously variable duration hidden Markov model (CVDHMM). The syntactic component is based on the Cocke-Kasami-Younger (CKY) parser and a context-free covering grammar of English. Lexical items are represented in terms of the 43 phonetic units. In recognition tests conducted on a separate data set, a 70% correct recognition rate on phonetic units in fluent speech was observed. In two additional tests on isolated words, a 40% word recognition rate was observed with the complete 52,000-word lexicon. When the vocabulary size was reduced to 1040 words, the recognition rate improved to 80%. After syntax analysis the word recognition rate rose to 90%.


Computer Speech & Language | 1987

Recognition of isolated prosodic patterns using Hidden Markov Models

Andrej Ljolje; Frank Fallside

The use of hidden Markov models with continuous asymmetric probability density functions for modelling prosodic patterns in isolated utterances is described. Finite Gaussian mixtures were used in each state of the models, which describe fundamental frequency, the fundamental frequency time derivative, energy and smoothed energy. Six models were generated using recordings of 16 monosyllabic words spoken by five speakers per model, representing four basic intonational features: falls, rises, fall-rises and rise-falls. Falls and rises were each described by two models, one for high and the other for low pitch in the speaker's natural pitch range. The recognition results based on these models clearly show the ability of hidden Markov models to model some aspects of the underlying prosodic structure.

Collaboration


Andrej Ljolje's top co-authors and their affiliations.


Diamantino Caseiro

Technical University of Lisbon


Antonio Moreno-Daniel

Georgia Institute of Technology
