Publications


Featured research published by James K. Baker.


IEEE Transactions on Acoustics, Speech, and Signal Processing | 1975

The DRAGON system--An overview

James K. Baker

This paper briefly describes the major features of the DRAGON speech understanding system. DRAGON makes systematic use of a general abstract model to represent each of the knowledge sources necessary for automatic recognition of continuous speech. The model--that of a probabilistic function of a Markov process--is very flexible and leads to features which allow DRAGON to function despite high error rates from individual knowledge sources. Repeated use of a simple abstract model produces a system which is simple in structure, but powerful in capabilities.
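
The "probabilistic function of a Markov process" named here is what is now called a hidden Markov model. As a minimal sketch of that abstract model (the toy probabilities and function names below are illustrative, not from the paper), the forward recursion evaluates the probability of an observation sequence under such a model:

    import numpy as np

    def forward(obs, pi, A, B):
        """P(obs) under a hidden Markov model with initial state
        probabilities pi, transition matrix A, and emission
        probabilities B[state, symbol]."""
        alpha = pi * B[:, obs[0]]          # absorb the first observation
        for o in obs[1:]:
            alpha = (alpha @ A) * B[:, o]  # propagate, then absorb next symbol
        return alpha.sum()

    pi = np.array([0.6, 0.4])
    A  = np.array([[0.7, 0.3],
                   [0.4, 0.6]])
    B  = np.array([[0.5, 0.5],
                   [0.1, 0.9]])
    print(forward([0, 1, 1], pi, A, B))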


Journal of the Acoustical Society of America | 1986

Speech recognition apparatus and method

James K. Baker; Paul G. Bamberg; Mark F. Sidell; Robert Roth

A system is disclosed for recognizing a pattern in a collection of data given a context of one or more other patterns previously identified. Preferably the system is a speech recognition system, the patterns are words, and the collection of data is a sequence of acoustic frames. During the processing of each of a plurality of frames, for each word in an active vocabulary, the system updates a likelihood score representing the probability of a match between the word and the frame, combines a language model score based on one or more previously recognized words with that likelihood score, and prunes the word from the active vocabulary if the combined score is below a threshold. A rapid match is made between the frames and each word of an initial vocabulary to determine which words should originally be placed in the active vocabulary. Preferably the system enables an operator to confirm the system's best guess as to the spoken word merely by speaking another word, to indicate that an alternate guess by the system is correct by typing a key associated with that guess, and to indicate that neither the best guess nor the alternate guesses are correct by typing yet another key. The system includes other features, including ones for determining where among the frames to look for the start of speech, and a special hardware processor for computing likelihood scores.
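
A minimal sketch of the per-frame update and pruning step described above, assuming hypothetical stand-ins for the acoustic and language model scores (none of the names come from the patent):

    import math

    def prune_frame(active, frame_loglik, lm_logprob, beam):
        """active maps word -> accumulated log score. frame_loglik and
        lm_logprob stand in for the acoustic match and language model
        scores; words falling more than `beam` below the best combined
        score are pruned from the active vocabulary."""
        for word in active:
            active[word] += frame_loglik(word) + lm_logprob(word)
        best = max(active.values())
        return {w: s for w, s in active.items() if s >= best - beam}

    active = {"hello": 0.0, "yellow": -1.0}
    active = prune_frame(active,
                         frame_loglik=lambda w: -0.5 if w == "hello" else -3.0,
                         lm_logprob=lambda w: math.log(0.5),
                         beam=2.0)
    print(active)  # "yellow" falls below the beam and is pruned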


Journal of the Acoustical Society of America | 1987

Parallel pattern verifier with dynamic time warping

James K. Baker; Janet M. Baker

A speech recognition system is disclosed which employs a network of elementary local decision modules for matching an observed time-varying speech pattern against all possible time warpings of the stored prototype patterns. For each elementary speech segment, an elementary recognizer provides a score indicating the degree of correlation of the input speech segment with stored spectral patterns. Each local decision module receives the results of the elementary recognizer and, at the same time, receives an input from selected ones of the other local decision modules. Each local decision module specializes in a particular node in the network, where each node scores how well the input segment of speech matches a particular sound segment in the words spoken. Each local decision module takes the prior decisions for all preceding sound segments, which are input from the other local decision modules, and selects the locally optimum time warping to be permitted. By this selection technique, each speech segment is stretched or compressed by an arbitrary, nonlinear function determined by the interconnections of the other local decision modules to a particular local decision module. Each local decision module includes an accumulator memory which stores the logarithmic probability of the current observation, conditional upon the internal event specified by the word to be matched or the identifier of the particular pattern that corresponds to the subject node. For each observation, these probabilities are computed and loaded into the accumulator memories of all the modules, and the result of the locally optimum time warping, representing the accumulated score of the network path to a node, is chosen for the word with the highest probability.
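
The local decision rule amounts to a dynamic-time-warping recursion: each node keeps the best accumulated log probability over the time warpings its interconnections permit. A sketch under the common stay/advance/skip convention (the array names and the specific warping constraints are assumptions, not taken from the patent):

    import numpy as np

    def dtw_match(local_logprob):
        """local_logprob[t, j] is the log probability that frame t
        matches prototype segment j; returns the best accumulated
        score for a complete match of the pattern."""
        T, J = local_logprob.shape
        acc = np.full((T, J), -np.inf)
        acc[0, 0] = local_logprob[0, 0]
        for t in range(1, T):
            for j in range(J):
                prev = max(
                    acc[t - 1, j],                             # stretch: stay on segment j
                    acc[t - 1, j - 1] if j >= 1 else -np.inf,  # advance one segment
                    acc[t - 1, j - 2] if j >= 2 else -np.inf,  # compress: skip a segment
                )
                acc[t, j] = prev + local_logprob[t, j]
        return acc[T - 1, J - 1]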


Journal of the Acoustical Society of America | 1990

Method for speech analysis and speech recognition

Paul G. Bamberg; James K. Baker; Laurence S. Gillick; Robert Roth

A method of speech analysis calculates one or more difference parameters for each of a sequence of acoustic frames, where each difference parameter is a function of the difference between an acoustic parameter in one frame and an acoustic parameter in a nearby frame. The method is used in speech recognition which compares the difference parameters of each frame against acoustic models representing speech units, where each speech-unit model has a model of the difference parameters associated with the frames of its speech unit. The difference parameters can be slope parameters or energy difference parameters. Slope parameters are derived by finding the difference between the energy of a given spectral parameter of a given frame and the energy, in a nearby frame, of a spectral parameter associated with a different frequency band. The resulting parameter indicates the extent to which the frequency of energy in the part of the spectrum represented by the given parameter is going up or going down. Energy difference parameters are calculated as a function of the difference between a given spectral parameter in one frame and a spectral parameter in a nearby frame representing the same frequency band. In one embodiment of the invention, dynamic programming compares the difference parameters of a sequence of frames to be recognized against a sequence of dynamic programming elements associated with each of a plurality of speech-unit models. In another embodiment of the invention, each speech-unit model represents one phoneme, and the speech-unit models for a plurality of phonemes are compared against individual frames, to associate with each such frame the one or more phonemes whose models compare most closely with it.
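
A sketch of the two kinds of difference parameters, computed on a matrix of log spectral energies frames[t, band] (the layout, the lag of two frames, and the use of the adjacent band are assumptions for illustration):

    import numpy as np

    def energy_difference(frames, band, lag=2):
        """Same frequency band, nearby frame: indicates whether the
        energy in that band is rising or falling over time."""
        return frames[lag:, band] - frames[:-lag, band]

    def slope_parameter(frames, band, lag=2):
        """Adjacent band in a nearby frame: indicates whether the
        energy in this part of the spectrum is moving up or down
        in frequency (band + 1 must exist)."""
        return frames[lag:, band + 1] - frames[:-lag, band]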


Journal of the Acoustical Society of America | 1972

More Visible Speech

Janet M. Baker; James K. Baker; Jerome Y. Lettvin

This paper presents a method for visual display of acoustical waveforms with which different phonemes are more readily distinguished than with spectrograms. The visual display is based on the interval between successive up crossings of the zero axis in the waveform. Although zero crossings and up crossings have been used by several investigators since Licklider's studies demonstrating that zero‐crossing information is sufficient for intelligibility of speech, most investigators have ignored an essential property of up‐crossing analysis. Analysis of up crossings is a time domain technique and, as such, allows perfect resolution of events in time, a resolution which is lost if the up‐crossing data are averaged over intervals of time. Such precise time resolution is critical for the recognition of certain distinguishing features in various consonants. These features permit a visual display in which the phonemes can easily be distinguished, even in connected speech. Furthermore, the important features for d...
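
A sketch of the underlying measurement: the intervals between successive up crossings of the zero axis, kept sample-exact rather than averaged (the sampling rate and test tone below are illustrative):

    import numpy as np

    def upcrossing_intervals(x):
        """Indices where x crosses zero going upward, and the
        intervals between successive up crossings, in samples."""
        up = np.where((x[:-1] < 0) & (x[1:] >= 0))[0] + 1
        return up, np.diff(up)

    t = np.arange(320) / 16000.0            # 20 ms at 16 kHz
    x = np.sin(2 * np.pi * 200 * t)         # 200 Hz tone
    idx, intervals = upcrossing_intervals(x)
    print(intervals)                        # ~80 samples, i.e. a 5 ms period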


Journal of the Acoustical Society of America | 1982

Unifying dynamic programming methods

James K. Baker

Dynamic programming has come to be widely used and accepted in automatic speech recognition. However, two different but similar applications have often been described more in terms of their differences than their similarities. On the one hand, dynamic programming is used to find the best nonlinear dynamic time warping to align two instances of a word. On the other hand, dynamic programming may be used to find the best state sequence for a hidden Markov process. Not only are these procedures essentially equivalent, but significant generalization comes from an explicit unification. Dynamic programming may be used not only to align two instances of a word, but also to align an instance of a word with an arbitrary finite state model for the word, or even to align two arbitrary models. Multiple instances of a word may contribute to a single model, and multiple passes on a finite set of training data can be used to further refine word models.
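
The unification can be made concrete: Viterbi alignment of a frame sequence against an arbitrary finite-state model covers both cases. With a strictly left-to-right chain of states it reduces to nonlinear time warping between two instances; with a general transition matrix it is the best-state-sequence computation for a hidden Markov process. A sketch with assumed array names, in the log domain:

    import numpy as np

    def viterbi_align(log_emit, log_trans, log_init):
        """log_emit[t, s]: log P(frame t | state s);
        log_trans[s, s2]: log transition score from s to s2;
        log_init[s]: log score for starting in state s.
        Returns the score of the best alignment path."""
        T, S = log_emit.shape
        score = log_init + log_emit[0]
        for t in range(1, T):
            # best predecessor for each state, then absorb frame t
            score = (score[:, None] + log_trans).max(axis=0) + log_emit[t]
        return score.max()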


Journal of the Acoustical Society of America | 1980

Two processor continuous speech recognition system

Peter F. Brown; James C. Spohrer; James K. Baker

A research‐oriented continuous speech recognition system based on separating the grammar level from the word level processing in order to exploit a two processor architecture has been developed by the Continuous Speech Recognition Group at Dialog Systems, Inc. A high speed array processor performs acoustic pattern matching and word level dynamic programming, while a general purpose mini‐computer performs sentence level dynamic programming, overhead, and I/O functions. Software components that allow graph generation, sentence level syntax generation, model training, and model evaluation will also be discussed in the framework of this parallel processor system.


Journal of the Acoustical Society of America | 1975

On the similarity of noisy phonetic strings produced by different words

James K. Baker

In a speech recognition system with an acoustic processor which attempts to automatically estimate a phonetic transcription, it is necessary to know the similarity of the probability distributions of phonetic strings when different words are spoken and input to the acoustic processor. Let $a \in A$, $a = a_1 a_2 a_3 \ldots a_n$, represent an arbitrary phonetic string. Define the similarity between the words $W_1$ and $W_2$ by $S = \sum_{a} \Pr(a \mid W_1)\,\Pr(a \mid W_2)$. The number of terms in the sum defining $S$ grows exponentially with the length of the words $W_1$ and $W_2$. However, if the nodes of the phonological graphs for $W_1$ and $W_2$ are properly ordered, $S$ can be calculated inductively by a generalization of the computations used in modeling a probabilistic function of a Markov process. The number of computations is approximately the product of the number of arcs in $W_1$ times the number of arcs in $W_2$.
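
The inductive computation can be sketched directly: sweep pairs of nodes of the two phonological graphs in topological order, pairing arcs that emit the same phone, so the total work is on the order of the product of the arc counts. The graph encoding below (arcs as (src, dst, phone, prob) tuples over integer nodes) is an assumption for illustration:

    from collections import defaultdict

    def similarity(arcs1, end1, arcs2, end2):
        """S = sum over phonetic strings a of Pr(a|W1) * Pr(a|W2),
        for acyclic word graphs whose integer nodes are numbered in
        topological order, with node 0 the start and endK the end."""
        out1, out2 = defaultdict(list), defaultdict(list)
        for arc in arcs1:
            out1[arc[0]].append(arc)
        for arc in arcs2:
            out2[arc[0]].append(arc)
        S = defaultdict(float)   # S[u, v]: mass of matched path pairs at (u, v)
        S[0, 0] = 1.0
        for u in range(end1 + 1):
            for v in range(end2 + 1):
                if S[u, v] == 0.0:
                    continue
                for _, d1, p1, pr1 in out1[u]:
                    for _, d2, p2, pr2 in out2[v]:
                        if p1 == p2:            # arcs must emit the same phone
                            S[d1, d2] += S[u, v] * pr1 * pr2
        return S[end1, end2]

    w1 = [(0, 1, "a", 1.0), (1, 2, "b", 1.0)]
    w2 = [(0, 1, "a", 1.0), (1, 2, "b", 0.7), (1, 2, "p", 0.3)]
    print(similarity(w1, 2, w2, 2))   # 0.7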


Journal of the Acoustical Society of America | 1974

DRAGON speech understanding system

James K. Baker

Through consistent use of a single theoretical model, the DRAGON system is implemented much more easily than comparable speech recognition systems. Speech is modeled as a hierarchy of probabilistic functions of Markov processes. Various sources of knowledge—acoustic, lexical, phonological, syntactic, and semantic—are represented in a conceptual hierarchy: each is a probabilistic function of the sources of knowledge which are deeper in the hierarchy. The knowledge is integrated into a network representing an overall Markov process. Recognition of a given utterance is then formulated as finding the path through this network which maximizes the probability of the given acoustic observations. Several efficient computational schemes exist for finding such a maximizing path.
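
One such computational scheme is the Viterbi recursion, which scores the maximizing path and recovers it by backtracking. A minimal sketch over the integrated network, with assumed array names and log-domain scores:

    import numpy as np

    def best_path(log_emit, log_trans, log_init):
        """Returns the maximum-probability state sequence for the
        observations, given log emission, transition, and initial
        scores for the network."""
        T, S = log_emit.shape
        score = log_init + log_emit[0]
        back = np.zeros((T, S), dtype=int)
        for t in range(1, T):
            cand = score[:, None] + log_trans   # every predecessor choice
            back[t] = cand.argmax(axis=0)       # remember the best one
            score = cand.max(axis=0) + log_emit[t]
        path = [int(score.argmax())]
        for t in range(T - 1, 0, -1):           # walk the backpointers
            path.append(int(back[t, path[-1]]))
        return path[::-1]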


Journal of the Acoustical Society of America | 1974

Comparative Visual Displays of Time and Frequency Domain Information in Connected Speech

Janet M. Baker; Robert W. Ramsey; Mark Lloyd Miller; James K. Baker; Christopher Cooper

Despite the well‐known redundancy of speech, certain kinds of acoustic information are uniquely or better seen in either the frequency or time domain. Some of these differences can be clearly seen between different visual displays of digital analyses automatically performed in both domains. In the frequency domain, these include digital spectrograms [Robert Ramsey] and spectrum plots [Christopher Cooper and Robert Ramsey], and in the time domain, pitch‐synchronous plots [James Baker], instantaneous‐frequency plots [Janet Baker], and the waveform [Mark Miller] itself. These displays are produced on a Xerox Graphic Printer as hard copy and on a CRT screen for interaction. For example, formants and other acoustic features of steady and quasi‐steady states, e.g., stressed vowels, are more apparent when studied in the frequency domain. However, transient events, often encompassing only one or several cycles, are seen exclusively in the time domain. Such events, readily seen in the speech waveform, occur at certain ...
