Egidio P. Giachin
Bell Labs
Publications
Featured research published by Egidio P. Giachin.
Computer Speech & Language | 1992
Chin-Hui Lee; Egidio P. Giachin; Lawrence R. Rabiner; Roberto Pieraccini; Aaron E. Rosenberg
Abstract We report on some recent improvements to an HMM-based, continuous speech recognition system which is being developed at AT&T Bell Laboratories. These advances, which include the incorporation of inter-word, context-dependent units and an improved feature analysis, lead to a recognition system which gives a 95% word accuracy for speaker-independent recognition of the 1000-word DARPA resource management task using the standard word-pair grammar (with a perplexity of about 60). It will be shown that the incorporation of inter-word units into training results in better acoustic models of word juncture coarticulation and gives a 20% reduction in error rate. The effect of an improved set of spectral and log-energy features is to further reduce the word error rate by about 30%. Since we use a continuous density HMM to characterize each subword unit, it is simple and straightforward to add new features to the feature vector (initially a 24-element vector, consisting of 12 cepstral and 12 delta cepstral coefficients). We investigate augmenting the feature vector with 12 second difference (delta-delta) cepstral coefficients and with first (delta) and second difference (delta-delta) log energies, thereby giving a 38-element feature vector. Additional error rate reductions of 11% and 18% were achieved, respectively. With the improved acoustic modeling of subword units, the overall error rate reduction was over 42%. We also found that the spectral vectors, corresponding to the same speech unit, behave differently statistically, depending on whether they are at word boundaries or within a word. The results suggest that intra-word and inter-word units should be modeled independently, even when they appear in the same context. Using a set of subword units which included variants for intra-word and inter-word, context-dependent phones, an additional decrease of about 6–10% in word error rate resulted.
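The 38-element feature vector described in the abstract (12 cepstra, 12 delta cepstra, 12 delta-delta cepstra, plus delta and delta-delta log energy) can be sketched roughly as follows. This is an illustrative reconstruction, not the paper's actual feature-analysis code; the regression window size is an assumption.

```python
import numpy as np

def add_dynamic_features(cepstra, log_energy, window=2):
    """Augment 12 cepstral coefficients per frame with delta and
    delta-delta cepstra, plus delta and delta-delta log energy,
    yielding a 38-element vector per frame (12 + 12 + 12 + 1 + 1).
    The +/- `window` regression window is illustrative."""
    def delta(x):
        # regression-style difference over +/- `window` frames,
        # with edge padding so the output keeps the input length
        pad = [(window, window)] + [(0, 0)] * (x.ndim - 1)
        padded = np.pad(x, pad, mode="edge")
        n = len(x)
        num = sum(k * (padded[window + k:n + window + k] -
                       padded[window - k:n + window - k])
                  for k in range(1, window + 1))
        return num / (2 * sum(k * k for k in range(1, window + 1)))

    d1 = delta(cepstra)       # 12 delta cepstra
    d2 = delta(d1)            # 12 delta-delta cepstra
    e1 = delta(log_energy)    # delta log energy
    e2 = delta(e1)            # delta-delta log energy
    return np.hstack([cepstra, d1, d2, e1[:, None], e2[:, None]])
```

Note that, as in the abstract, the static log energy itself is not part of the vector: only its first and second differences are appended to the 36 cepstral features.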
Computer Speech & Language | 1991
Egidio P. Giachin; Aaron E. Rosenberg; Chin-Hui Lee
Abstract Words uttered in isolation are pronounced differently than when they are uttered in continuous speech, a major cause being coarticulation at word junctures. Between-word context-dependent phones have been proposed to provide a more precise phonetic representation of word junctures. This technique permits one to accurately model “soft” pronunciation changes (changes in which a phone undergoes a comparatively small alteration). However, “hard” pronunciation changes (changes in which a phone is completely deleted or replaced by a different phone) are much less frequent and hence cannot be modeled adequately due to the lack of training material. To overcome this problem we use a set of phonological rules to redefine word junctures, specifying how to replace or delete the boundary phones according to the neighboring phones. No new speech units are required, thus avoiding most of the training issues. Results, which are evaluated on the 991-word speaker-independent DARPA task, show that phonological rules are effective in providing corrective capability at low computational cost.
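The rule-based juncture rewriting described above can be sketched as a lookup on boundary-phone pairs. The rule table below is hypothetical and purely illustrative; the actual phonological rules used in the paper are not reproduced here.

```python
# Hypothetical rule table: (final phone of word A, initial phone of
# word B) -> replacement for A's boundary phone (None = delete it).
RULES = {
    ("t", "y"): "ch",   # palatalization, e.g. "want you" (illustrative)
    ("d", "y"): "jh",   # e.g. "did you" (illustrative)
    ("t", "d"): None,   # boundary /t/ deleted before /d/ (illustrative)
}

def apply_junction_rules(word_a, word_b):
    """Rewrite the juncture between two phone sequences by replacing
    or deleting the boundary phone, without introducing any new
    speech units (so no extra training material is needed)."""
    a, b = list(word_a), list(word_b)
    key = (a[-1], b[0])
    if key in RULES:
        repl = RULES[key]
        if repl is None:
            a.pop()        # "hard" change: boundary phone deleted
        else:
            a[-1] = repl   # "hard" change: boundary phone replaced
    return a + b
```

Because the rewritten junctures reuse the existing phone inventory, the recognizer's unit set and its training procedure are left untouched, which is the point made in the abstract.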
human language technology | 1990
Chin-Hui Lee; Egidio P. Giachin; Lawrence R. Rabiner; Roberto Pieraccini; Aaron E. Rosenberg
We report on some recent improvements to an HMM-based, continuous speech recognition system which is being developed at AT&T Bell Laboratories. These advances, which include the incorporation of inter-word, context-dependent units and an improved feature analysis, lead to a recognition system which achieves better than 95% word accuracy for speaker-independent recognition of the 1000-word DARPA resource management task using the standard word-pair grammar (with a perplexity of about 60). It will be shown that the incorporation of inter-word units into training results in better acoustic models of word juncture coarticulation and gives a 20% reduction in error rate. The effect of an improved set of spectral and log energy features is to further reduce word error rate by about 30%. We also found that the spectral vectors, corresponding to the same speech unit, behave differently statistically, depending on whether they are at word boundaries or within a word. The results suggest that intra-word and inter-word units should be modeled independently, even when they appear in the same context. Using a set of sub-word units which included variants for intra-word and inter-word, context-dependent phones, an additional decrease of about 10% in word error rate resulted.
human language technology | 1990
Roberto Pieraccini; Chin-Hui Lee; Egidio P. Giachin; Lawrence R. Rabiner
Most large-vocabulary speech recognition systems consist of a training algorithm and a recognition structure, which is essentially a search for the best path through a rather large decoding network. Although the performance of the recognizer is crucially tied to the details of the training procedure, it is absolutely essential that the recognition structure be efficient in terms of computation and memory, and accurate in terms of actually determining the best path through the lattice, so that a wide range of training (sub-word unit creation) strategies can be efficiently evaluated in a reasonable time period. We have considered an architecture in which we incorporate several well known procedures (beam search, compiled network, etc.) with some new ideas (stacks of active network nodes, likelihood computation on demand, guided search, etc.) to implement a search procedure which maintains the accuracy of the full search but which can decode a single sentence in about one minute of computing time (about 20 times real time) on a vectorized, concurrent processor. The ways in which we have realized this significant computational reduction are described in this paper.
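Two of the ideas named above, beam pruning over a set of active network nodes and likelihood computation on demand, can be sketched in a single decoding step. This is a generic illustration of the techniques, not the paper's vectorized implementation; the network representation and beam width are assumptions.

```python
import math

def beam_search_step(active, arcs, frame_loglik, beam=10.0):
    """One time-frame of beam-pruned Viterbi decoding.
    `active`  : dict node -> best log score so far (the set of
                currently active network nodes)
    `arcs`    : dict node -> list of (next_node, unit) transitions
    `frame_loglik(unit)` : acoustic log-likelihood of `unit` for this
                frame, computed on demand and cached so each unit is
                scored at most once per frame."""
    cache = {}
    def loglik(unit):
        if unit not in cache:      # likelihood computation on demand
            cache[unit] = frame_loglik(unit)
        return cache[unit]

    new_active = {}
    for node, score in active.items():
        for nxt, unit in arcs.get(node, []):
            s = score + loglik(unit)
            if s > new_active.get(nxt, -math.inf):
                new_active[nxt] = s        # Viterbi max over paths
    best = max(new_active.values())
    # prune every node more than `beam` below the best hypothesis
    return {n: s for n, s in new_active.items() if s >= best - beam}
```

Caching the acoustic scores matters because many arcs in a compiled network share the same unit, so scoring on demand avoids evaluating densities for units no active path reaches.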
Computer Speech & Language | 1992
Egidio P. Giachin; Chin-Hui Lee; Lawrence R. Rabiner; Aaron E. Rosenberg; Roberto Pieraccini
Abstract Word juncture coarticulation is one of the major sources of acoustic variability for initial and final word segments when spoken in fluent speech. One way to improve characterization of word pronunciations in continuous speech is to include inter-word contexts in lexical representations, similar to the way intra-word contexts are utilized. In this paper we investigate the issues related to the modeling of this set of inter-word, context-dependent units using continuous density hidden Markov models. Under such a modeling framework, it is usually required to have enough training tokens available for each unit in order to reliably estimate the parameters of the unit. Therefore, each context-dependent unit is included in the set of units to be modeled only when its frequency of occurrence in the training data exceeds a prescribed threshold. Testing such a unit selection and modeling strategy on the DARPA resource management task, it was found that the incorporation of inter-word units gave a 15–25% word error reduction compared to the baseline continuous speech recognition system using only intra-word units.
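The frequency-threshold unit selection described in the abstract amounts to counting context occurrences in the training data and keeping only the sufficiently frequent ones. A minimal sketch, with an illustrative threshold value and a hypothetical triphone naming convention:

```python
from collections import Counter

def select_units(phone_contexts, threshold=20):
    """Keep a context-dependent unit only if it occurs at least
    `threshold` times in the training data; rarer units fall back
    to simpler (e.g. context-independent) models. The threshold
    and the "left-center+right" context labels are illustrative."""
    counts = Counter(phone_contexts)
    kept = {u for u, c in counts.items() if c >= threshold}
    backoff = set(counts) - kept
    return kept, backoff
```

Only units in `kept` get dedicated continuous-density HMMs; everything in `backoff` is modeled by a less specific unit, which is what keeps parameter estimates reliable.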
international conference on acoustics, speech, and signal processing | 1991
Roberto Pieraccini; Chin-Hui Lee; Egidio P. Giachin; Lawrence R. Rabiner
The authors provide a detailed description of all aspects of the implementation of a large-vocabulary, speaker-independent, continuous speech recognizer used as a tool for the development of recognition algorithms based on hidden Markov models (HMMs) and Viterbi decoding. The complexity of HMM recognizers is greatly increased by the introduction of detailed context-dependent units for representing interword coarticulation. A vectorized representation of the data structures involved in the decoding process, along with compilation of the connection information among temporally consecutive words and an efficient implementation of the beam search pruning, has led to a speedup of the algorithm of about one order of magnitude. A guided search can be used during a tuning phase for obtaining a speedup of more than three times. An average recognition time of about 25 s per sentence, although far from real time, allows one to perform a series of training experiments and to tune the recognition system parameters in order to obtain high word accuracy on complex recognition tasks such as the DARPA resource management task.
Archive | 1992
Roberto Pieraccini; Chin-Hui Lee; Egidio P. Giachin; Lawrence R. Rabiner
In this paper, we present an efficient data structure for implementing a continuous, large vocabulary, speech recognizer. The recognition system is based on hidden Markov models of phonetic units for representing both intraword and interword context dependent phones. Due to the large number of connections present in the decoding network, the structure of the recognizer must be carefully designed in order to perform experiments in a reasonable amount of computing time.
International Journal of Pattern Recognition and Artificial Intelligence | 1994
Paolo Baggia; Luciano Fissore; Egidio P. Giachin; Giorgio Micca; Claudio Rullent; Pietro Laface
This paper describes a Continuous Speech Understanding System that allows information services to be accessed through the telephone line. It accepts queries within a restricted semantic domain, expressed in free but syntactically correct natural language, with a lexicon of the order of 800 words. In the implementation described here, a user can access an electronic mailbox or a train information service through a PABX telephone line. The architecture of the system is based on two main modules that represent and use different knowledge sources. A speaker-independent recognition module generates, for each utterance, a lattice of word hypotheses which is the interface to an understanding module that performs the syntactic and semantic analysis. The recognition module is based on Hidden Markov Models of subword units, and performs the acoustic decoding process according to a beam search strategy. The understanding module finds the most likely sequence of words and represents its meaning in a format which facilitates the access to a database. It makes use of a modified caseframe analysis guided by the word hypotheses scores. Experiments were performed with 600 sentences from 10 speakers on the E-Mail application task. Using 15 Gaussian mixtures per state, a word accuracy of 75.7% was obtained with a test vocabulary of 787 words and no linguistic constraints. Linguistic processing of the corresponding lattices achieved a sentence understanding rate of 82%.
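The interface between the two modules, a lattice of scored word hypotheses handed to a score-guided linguistic analysis, can be illustrated with a toy best-path search over the lattice. This is a deliberately simplified stand-in: the paper's understanding module uses a modified caseframe analysis, not plain dynamic programming, and the hypothesis format below is an assumption.

```python
def best_lattice_path(hyps, T):
    """`hyps`: word hypotheses as (start_frame, end_frame, word,
    log_score) tuples over an utterance of T frames. Returns the
    highest-scoring sequence of contiguous hypotheses spanning
    frames 0..T (empty list if no full-coverage path exists)."""
    best = {0: (0.0, [])}          # frame -> (best score, word sequence)
    for start, end, word, score in sorted(hyps, key=lambda h: h[1]):
        if start in best:
            s, path = best[start]
            cand = (s + score, path + [word])
            if end not in best or cand[0] > best[end][0]:
                best[end] = cand
    return best.get(T, (None, []))[1]
```

A real understanding module would additionally check that the chosen sequence is syntactically and semantically well formed, which is where the caseframe analysis comes in.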
conference of the international speech communication association | 1991
Egidio P. Giachin; Chin-Hui Lee; Lawrence R. Rabiner; Aaron E. Rosenberg; Roberto Pieraccini
conference of the international speech communication association | 1995
Loreia Moisa; Egidio P. Giachin