Publication


Featured research published by Jay G. Wilpon.


IEEE Transactions on Acoustics, Speech, and Signal Processing | 1990

Automatic recognition of keywords in unconstrained speech using hidden Markov models

Jay G. Wilpon; Lawrence R. Rabiner; Chin-Hui Lee; E. R. Goldman

The modifications made to a connected word speech recognition algorithm based on hidden Markov models (HMMs) which allow it to recognize words from a predefined vocabulary list spoken in an unconstrained fashion are described. The novelty of this approach is that statistical models of both the actual vocabulary word and the extraneous speech and background are created. An HMM-based connected word recognition system is then used to find the best sequence of background, extraneous speech, and vocabulary word models for matching the actual input. Word recognition accuracies of 99.3% on purely isolated speech (i.e., only vocabulary items and background noise were present) and 95.1% when the vocabulary word was embedded in unconstrained extraneous speech were obtained for the five-word vocabulary using the proposed recognition algorithm.
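
In modern terms this is the "keyword plus filler model" approach: decode each frame as background, extraneous speech, or a keyword, and report a detection when the best state path enters the keyword model. Below is a minimal sketch, not the paper's implementation: the single-state models, toy 1-D Gaussian scorers, and transition probabilities are invented for illustration, whereas the paper uses full multi-state word HMMs.

```python
import numpy as np

def viterbi(log_obs, log_trans):
    """log_obs: (T, S) per-frame log-likelihoods; log_trans: (S, S)."""
    T, S = log_obs.shape
    delta = np.full((T, S), -np.inf)
    psi = np.zeros((T, S), dtype=int)
    delta[0] = log_obs[0]                          # uniform initial distribution
    for t in range(1, T):
        scores = delta[t - 1][:, None] + log_trans  # (from, to)
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) + log_obs[t]
    path = [int(delta[-1].argmax())]
    for t in range(T - 1, 0, -1):                   # backtrace
        path.append(int(psi[t][path[-1]]))
    return path[::-1]

# States 0=background, 1=extraneous speech, 2=keyword (one state per model
# here purely for brevity).
labels = ["background", "filler", "keyword"]
rng = np.random.default_rng(0)
means = np.array([0.0, 2.0, 5.0])                  # toy 1-D acoustic models
frames = np.concatenate([rng.normal(0, 1, 30),     # background
                         rng.normal(5, 1, 20),     # keyword
                         rng.normal(2, 1, 30)])    # extraneous speech
log_obs = -0.5 * (frames[:, None] - means[None, :]) ** 2   # unit-variance Gaussians
log_trans = np.log(np.full((3, 3), 0.01) + np.eye(3) * 0.97)

path = viterbi(log_obs, log_trans)
print("keyword detected:", any(labels[s] == "keyword" for s in path))
```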


international conference on acoustics, speech, and signal processing | 1988

High performance connected digit recognition using hidden Markov models

Lawrence R. Rabiner; Jay G. Wilpon; Frank K. Soong

The authors use an enhanced analysis feature set consisting of both instantaneous and transitional spectral information and test the hidden-Markov-model (HMM)-based connected-digit recognizer in speaker-trained, multispeaker, and speaker-independent modes. For the evaluation, both a 50-talker connected-digit database recorded over local, dialed-up telephone lines, and the Texas Instruments, 225-adult-talker, connected-digits database are used. Using these databases, the performance achieved was 0.35, 1.65, and 1.75% string error rates for known-length strings, for speaker-trained, multispeaker, and speaker-independent modes, respectively, and 0.78, 2.85, and 2.94% string error rates for unknown-length strings of up to seven digits in length for the three modes. Several experiments were carried out to determine the best set of conditions (e.g., training, recognition, parameters, etc.) for recognition of digits. The results and the interpretation of these experiments are described. >
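
The "transitional spectral information" in the feature set corresponds to what are now called delta features. A generic sketch of appending regression-based deltas to the instantaneous cepstra follows; the window size K and the normalization are common choices, not necessarily the paper's exact regression.

```python
import numpy as np

def add_deltas(cepstra, K=2):
    """cepstra: (T, D) frames; returns (T, 2*D) with delta features appended."""
    T, D = cepstra.shape
    padded = np.pad(cepstra, ((K, K), (0, 0)), mode="edge")  # replicate edges
    # Least-squares slope over a +/-K frame window:
    num = sum(k * (padded[K + k : K + k + T] - padded[K - k : K - k + T])
              for k in range(1, K + 1))
    denom = 2 * sum(k * k for k in range(1, K + 1))
    return np.hstack([cepstra, num / denom])

frames = np.random.randn(100, 12)   # e.g. 12 cepstral coefficients per frame
features = add_deltas(frames)
print(features.shape)               # (100, 24)
```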


Proceedings of the IEEE | 2000

Speech and language processing for next-millennium communications services

Richard V. Cox; Candace A. Kamm; Lawrence R. Rabiner; Juergen Schroeter; Jay G. Wilpon

In the future, the world of telecommunications will be vastly different from what it is today. The driving force will be the seamless integration of real-time communications (e.g. voice, video, music, etc.) and data into a single network, with ubiquitous access to that network anywhere, anytime, and by a wide range of devices. The only currently available ubiquitous access device to the network is the telephone, and the only ubiquitous user access mode is spoken voice commands and natural-language dialogues with machines. In the future, new access devices and modes will augment speech in this role, but they are unlikely to supplant the telephone and access by speech anytime soon. Speech technologies have progressed to the point where they are now viable for a broad range of communications services, including: compression of speech for use over wired and wireless networks; speech synthesis, recognition, and understanding for dialogue access to information, people, and messaging; and speaker verification for secure access to information and services. The paper provides brief overviews of these technologies, discusses some of the unique properties of wireless, plain old telephone service, and Internet protocol networks that make voice communication and control problematic, and describes the types of voice services available in the past and today, as well as those that we foresee becoming available over the next several years.


Computer Speech & Language | 1990

Acoustic modeling for large vocabulary speech recognition

Chin-Hui Lee; Lawrence R. Rabiner; Roberto Pieraccini; Jay G. Wilpon

The field of large vocabulary, continuous-speech recognition has advanced to the point where there are several systems capable of attaining between 90 and 95% word accuracy for speaker-independent recognition of a 1000-word vocabulary, spoken fluently, for a task with a perplexity (average word branching factor) of about 60. There are several factors which account for the high performance achieved by these systems, including the use of hidden Markov model (HMM) methodology, the use of context-dependent sub-word units, the representation of between-word phonemic variations, and the use of corrective training techniques to emphasize differences between acoustically similar words in the vocabulary. In this paper we describe one of the large vocabulary speech-recognition systems being investigated at AT&T Bell Laboratories and discuss the methods used to provide high word-recognition accuracy. In particular, we focus on the techniques used to provide the acoustic models of the sub-word units (both context-independent and context-dependent units), and discuss the resulting system performance as a function of the type of acoustic modeling used.
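
A minimal illustration of the context-dependent sub-word units mentioned above: each phone is relabeled with its left and right neighbors (a triphone), so that a separate acoustic model can be trained per context. The "left-phone+right" notation and the "sil" padding are illustrative conventions, not necessarily those of the AT&T system.

```python
def to_triphones(phones):
    """Relabel each phone with its immediate left and right context."""
    units = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "sil"
        right = phones[i + 1] if i < len(phones) - 1 else "sil"
        units.append(f"{left}-{p}+{right}")
    return units

print(to_triphones(["k", "ae", "t"]))
# ['sil-k+ae', 'k-ae+t', 'ae-t+sil']
```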


international conference on acoustics, speech, and signal processing | 1996

A study of speech recognition for children and the elderly

Jay G. Wilpon; Claus Jacobsen

Although children and the elderly have obvious needs for voice-operated interfaces, hardly anything is known about the performance of current automatic speech recognition technology for these user groups. In this paper we report the results of a thorough investigation into this field using a connected digit recognizer and a major telephone speech database. One would generally assume that recognizing the speech of these groups is only a matter of having enough, sufficiently representative training data. This turns out to be true only as long as the speakers fall in the age range from 15 to approximately 70 years. Outside this range the error rates increase dramatically, even with balanced amounts of training data. For males, the lower limit is very sharp and can be attributed to the change of pitch frequency during puberty. For females, the lower limit is gradual and caused only by the slowly changing dimensions of the vocal tract. For both genders, the upper limit is very gradual and can possibly be attributed to changes in the glottis area and in the internal control loops of the human articulatory system. The paper presents some supporting evidence for these assertions and gives results for various attempts to improve the performance. Recognition of speech from children and the elderly will require much more research if we are to fully understand the characteristics of these age groups and their impact on current and future speech recognition systems.


IEEE Transactions on Acoustics, Speech, and Signal Processing | 1985

A modified K-means clustering algorithm for use in isolated word recognition

Jay G. Wilpon; Lawrence R. Rabiner

Studies of isolated word recognition systems have shown that a set of carefully chosen templates can be used to bring the performance of speaker-independent systems up to that of systems trained to the individual speaker. The earliest work in this area used a sophisticated set of pattern recognition algorithms in a human-interactive mode to create the set of templates (multiple patterns) for each word in the vocabulary. Not only was this procedure time consuming, but it was impossible to reproduce exactly because it was highly dependent on decisions made by the experimenter. Subsequent work led to an automatic clustering procedure which, given only a set of clustering parameters, clustered patterns with the same performance as the previously developed supervised algorithms. The one drawback of the automatic procedure was that the specification of the input parameter set was found to be somewhat dependent on the vocabulary type and the size of the population to be clustered. Since a naive user of such a statistical clustering algorithm could not, in general, be expected to know how to choose the word clustering parameters, even this automatic clustering algorithm was not appropriate for a completely general word recognition system. The purpose of this paper is to present a clustering algorithm based on a standard K-means approach which requires no user parameter specification. Experimental data show that this new algorithm performs as well as or better than the previously used clustering techniques when tested as part of a speaker-independent isolated word recognition system.
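
For intuition, here is a bare-bones K-means iteration over word patterns, assuming fixed-length feature vectors and Euclidean distance for simplicity. The paper clusters time-warped templates under a DTW-style distance, and its modification removes the need for user-chosen clustering parameters; neither refinement is shown here.

```python
import numpy as np

def kmeans(X, k, iters=20, seed=0):
    """Plain K-means: X is (N, D); returns (centers, assignments)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), k, replace=False)]   # random initial centers
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None, :], axis=2)  # (N, k)
        assign = d.argmin(axis=1)
        for j in range(k):                              # recompute centroids
            if np.any(assign == j):
                centers[j] = X[assign == j].mean(axis=0)
    return centers, assign

rng = np.random.default_rng(1)
tokens = np.vstack([rng.normal(0, 1, (40, 8)),   # two groups of word tokens
                    rng.normal(4, 1, (40, 8))])
centers, assign = kmeans(tokens, k=2)
print(np.bincount(assign))                       # roughly [40 40]
```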


international conference on acoustics, speech, and signal processing | 1989

HMM clustering for connected word recognition

Lawrence R. Rabiner; Chin-Hui Lee; Biing-Hwang Juang; Jay G. Wilpon

The authors describe an HMM (hidden Markov model) clustering procedure and discuss its application to connected-word systems and to large-vocabulary recognition based on phonelike units. It is shown that the conventional approach of maximizing likelihood is easily implemented but does not work well in practice, as it tends to give improved models of tokens for which the initial model was generally quite good, but does not improve tokens which are poorly represented by the initial model. The authors have developed a splitting procedure which initializes each new cluster (statistical model) by splitting off all tokens in the training set which were poorly represented by the current set of models. This procedure is highly efficient and gives excellent recognition performance in connected-word tasks. In particular, for speaker-independent connected-digit recognition, using two HMM-clustered models, the recognition performance is as good as or better than previous results using 4-6 models/digit obtained from template-based clustering.
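
A sketch of the splitting rule described above, under invented stand-ins: score(model, token) represents an HMM log-likelihood, and the threshold is an illustrative choice. Tokens poorly represented by every current model are split off to seed a new cluster.

```python
import numpy as np

def split_poor_tokens(models, tokens, score, threshold):
    """Return (tokens kept in existing clusters, tokens seeding a new cluster)."""
    best = np.array([max(score(m, t) for m in models) for t in tokens])
    poor = best < threshold            # poorly represented by every model
    kept = [t for t, p in zip(tokens, poor) if not p]
    new_cluster = [t for t, p in zip(tokens, poor) if p]
    return kept, new_cluster

# Toy stand-ins: "models" are mean vectors, tokens are vectors, and the
# score is a negative squared distance (higher = better fit).
score = lambda m, t: -float(np.sum((t - m) ** 2))
models = [np.zeros(4)]
tokens = [np.zeros(4) + 0.1, np.ones(4) * 3.0, np.zeros(4) - 0.2]
kept, seed = split_poor_tokens(models, tokens, score, threshold=-1.0)
print(len(kept), len(seed))            # 2 1
```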


Journal of the Acoustical Society of America | 1980

A simplified, robust training procedure for speaker-trained, isolated word recognition systems

Lawrence R. Rabiner; Jay G. Wilpon

One of the most important operations in isolated word speech recognition systems is the method used to obtain the word reference templates. For speaker-trained systems, the techniques that have been used include casual training, averaging, and use of statistical pattern recognition clustering methods. In a recent study, Rabiner and Wilpon showed that the statistical techniques, when combined with the technique of averaging the autocorrelation coefficients of all tokens within the cluster, provided a reliable, robust set of reference templates. The only drawback to this method was the extensive, burdensome training required for the statistical analysis. Since the statistical training method could not be used in most practical situations, techniques were investigated for obtaining a simplified, robust training procedure which would incorporate many of the ideas of the statistical approach. Such a training method is described in this paper. The advantages of this new training procedure, over (previously used...
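
A sketch of the averaging idea referenced above: given several tokens of the same word already time-aligned to a common length (the alignment itself, e.g. by dynamic time warping, is omitted), average their per-frame autocorrelation coefficients into one reference template. The frame size and analysis order below are illustrative choices.

```python
import numpy as np

def autocorr(frame, order=8):
    """Autocorrelation coefficients r[0..order] of one waveform frame."""
    return np.array([np.dot(frame[:len(frame) - k], frame[k:])
                     for k in range(order + 1)])

def average_template(aligned_tokens, order=8):
    """aligned_tokens: list of (T, N) arrays of waveform frames."""
    per_token = [np.stack([autocorr(f, order) for f in tok])
                 for tok in aligned_tokens]
    return np.mean(per_token, axis=0)            # (T, order + 1)

rng = np.random.default_rng(0)
tokens = [rng.standard_normal((30, 240)) for _ in range(5)]  # 5 repetitions
template = average_template(tokens)
print(template.shape)                            # (30, 9)
```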


international conference on acoustics, speech, and signal processing | 1992

Modeling state durations in hidden Markov models for automatic speech recognition

Padma Ramesh; Jay G. Wilpon

Hidden Markov modeling (HMM) techniques have been used successfully for connected speech recognition in the last several years. In the traditional HMM algorithms, the probability of the duration of a state decreases exponentially with time, which is not appropriate for representing the temporal structure of speech. Non-parametric modeling of duration using semi-Markov chains does accomplish the task, but with a large increase in computational complexity. Applying a postprocessing state duration penalty after Viterbi decoding adds very little computation but does not affect the forward recognition path. The authors present a way of modeling state durations in HMMs using time-dependent state transitions. This inhomogeneous HMM (IHMM) does increase the computation by a small amount but reduces recognition error rates by 14-25%. A suboptimal implementation of this scheme is also presented that requires no more computation than the traditional HMM and still reduces errors by 14-22% on a variety of databases.
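
The exponential decay mentioned above follows from the self-loop: with a fixed self-transition probability a, the probability of occupying a state for exactly d frames is the geometric P(d) = a^(d-1)(1-a), which is maximal at d = 1, unlike real speech segments. The sketch below contrasts this with a time-dependent self-loop a(d), the IHMM idea; the particular a(d) schedule is an illustrative choice, not the paper's.

```python
import numpy as np

d = np.arange(1, 21)

# Homogeneous HMM: fixed self-transition -> geometric duration distribution.
a = 0.8
geometric = a ** (d - 1) * (1 - a)            # monotonically decreasing

# Inhomogeneous HMM: self-loop probability a(d) decays with time in state,
# so the implied duration distribution can peak at a typical duration.
a_t = 1.0 / (1.0 + np.exp(0.8 * (d - 8)))     # illustrative a(d) schedule
stay = np.concatenate([[1.0], np.cumprod(a_t)[:-1]])   # P(still in state at d)
inhomogeneous = stay * (1 - a_t)              # P(leave exactly at d)

for name, p in [("geometric", geometric), ("inhomogeneous", inhomogeneous)]:
    print(f"{name}: mode at d = {d[np.argmax(p)]}")
```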


international conference on acoustics, speech, and signal processing | 1992

A speech understanding system based on statistical representation of semantics

Roberto Pieraccini; Evelyne Tzoukermann; Zakhar Gorelov; Jean-Luc Gauvain; Esther Levin; Chin-Hui Lee; Jay G. Wilpon

An understanding system, designed for both speech and text input, has been implemented based on statistical representation of task-specific semantic knowledge. The core of the system is the conceptual decoder, which extracts the words and their association to the conceptual structure of the task directly from the acoustic signal. The conceptual information, which is also used to clarify the English sentences, is encoded following a statistical paradigm. A template generator and an SQL (structured query language) translator process the sentence and produce SQL code for querying a relational database. Results of the system on the official DARPA test are given.
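
A highly simplified sketch of the final stage described above: decoded concept/value pairs are filled into an SQL template for querying a relational database. The table and column names are invented for illustration and are not the system's actual schema.

```python
def to_sql(concepts):
    """Turn decoded concept/value pairs into a simple SQL query."""
    where = " AND ".join(f"{k} = '{v}'" for k, v in concepts.items())
    return f"SELECT * FROM flights WHERE {where};"

# e.g. concepts decoded from "show me flights from Boston to San Francisco"
print(to_sql({"origin": "BOS", "destination": "SFO"}))
# SELECT * FROM flights WHERE origin = 'BOS' AND destination = 'SFO';
```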

Collaboration


Dive into Jay G. Wilpon's collaborations.

Top Co-Authors

Chin-Hui Lee

Georgia Institute of Technology