Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Stephen A. Zahorian is active.

Publication


Featured research published by Stephen A. Zahorian.


IEEE Transactions on Speech and Audio Processing | 1995

The challenge of spoken language systems: Research directions for the nineties

Ron Cole; L. Hirschman; L. Atlas; M. Beckman; Alan W. Biermann; M. Bush; Mark A. Clements; L. Cohen; Oscar N. Garcia; B. Hanson; Hynek Hermansky; S. Levinson; Kathleen R. McKeown; Nelson Morgan; David G. Novick; Mari Ostendorf; Sharon L. Oviatt; Patti Price; Harvey F. Silverman; J. Spiitz; Alex Waibel; Cliff Weinstein; Stephen A. Zahorian; Victor W. Zue

A spoken language system combines speech recognition, natural language processing, and human interface technology. It functions by recognizing the person's words, interpreting the sequence of words to obtain a meaning in terms of the application, and providing an appropriate response back to the user. Potential applications of spoken language systems range from simple tasks, such as retrieving information from an existing database (traffic reports, airline schedules), to interactive problem solving tasks involving complex planning and reasoning (travel planning, traffic routing), to support for multilingual interactions. We examine eight key areas in which basic research is needed to produce spoken language systems: (1) robust speech recognition; (2) automatic training and adaptation; (3) spontaneous speech; (4) dialogue models; (5) natural language response generation; (6) speech synthesis and speech generation; (7) multilingual systems; and (8) interactive multimodal systems. In each area, we identify key research challenges, the infrastructure needed to support research, and the expected benefits. We conclude by reviewing the need for multidisciplinary research, for development of shared corpora and related resources, for computational support, and for rapid communication among researchers. The successful development of this technology will increase accessibility of computers to a wide range of users, will facilitate multinational communication and trade, and will create new research specialties and jobs in this rapidly expanding area.


Journal of the Acoustical Society of America | 1993

Spectral‐shape features versus formants as acoustic correlates for vowels

Stephen A. Zahorian; Amir J. Jagharghi

The first three formants, i.e., the first three spectral prominences of the short-time magnitude spectrum, have been the most commonly used acoustic cues for vowels ever since the work of Peterson and Barney [J. Acoust. Soc. Am. 24, 175-184 (1952)]. However, spectral-shape features, which encode the global smoothed spectrum, provide a more complete spectral description and therefore might be even better acoustic correlates for vowels. In this study, automatic vowel classification experiments were used to compare formants and spectral-shape features for monophthongal vowels spoken in the context of isolated CVC words, under a variety of conditions. The roles of static and time-varying information for vowel discrimination were also compared. Spectral shape was encoded using the coefficients in a cosine expansion of the nonlinearly scaled magnitude spectrum. Under almost all conditions investigated, in the absence of fundamental frequency (F0) information, automatic vowel classification based on spectral-shape features was superior to that based on formants. If F0 was used as an additional feature, vowel classification based on spectral-shape features was still superior to that based on formants, but the differences between the two feature sets were reduced. It was also found that the pattern of perceptual confusions was more closely correlated with errors in automatic classification obtained from spectral-shape features than with classification errors from formants. Therefore, it is concluded that spectral-shape features are a more complete set of acoustic correlates for vowel identity than are formants. In comparing static and time-varying features, static features were the most important for vowel discrimination, but feature trajectories were valuable secondary sources of information.
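The cosine-expansion encoding described above lends itself to a compact sketch. The following is an illustration only, not the authors' implementation: the bark-style frequency warp, the log amplitude scale, and all parameter values are assumptions chosen for concreteness.

```python
# Illustrative sketch: spectral shape encoded as low-order cosine-expansion
# coefficients of a nonlinearly scaled magnitude spectrum.
import numpy as np

def spectral_shape_features(frame, sr, n_coeffs=10, n_bins=64):
    """Return the first n_coeffs cosine coefficients of the warped log spectrum."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sr)
    # Assumed bark-like frequency warp (the paper specifies a nonlinear scale).
    bark = 6.0 * np.arcsinh(freqs / 600.0)
    warped = np.interp(np.linspace(bark[0], bark[-1], n_bins), bark, spectrum)
    log_amp = np.log(warped + 1e-10)           # nonlinear amplitude scale
    k = np.arange(n_coeffs)[:, None]
    n = (np.arange(n_bins) + 0.5)[None, :]
    basis = np.cos(np.pi * k * n / n_bins)     # DCT-II style cosine basis
    return basis @ log_amp / n_bins
```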


Journal of the Acoustical Society of America | 1976

Vibrotactile frequency for encoding a speech parameter

Martin Rothenberg; Ronald T. Verrillo; Stephen A. Zahorian; Michael L. Brachman; Stanley J. Bolanowski

Frequency of vibration has not been widely used as a parameter for encoding speech-derived information on the skin. Where it has been used, the frequencies employed have not necessarily been compatible with the capabilities of the tactile channel, and no determination was made of the information transmitted by the frequency variable, as differentiated from other parameters used simultaneously, such as duration, amplitude, and location. However, several investigators have shown that difference limens for vibration frequency may be small enough to make stimulus frequency useful in encoding a speech-derived parameter such as the fundamental frequency of voiced speech. In the studies reported here, measurements have been made of the frequency discrimination ability of the volar forearm, using both sinusoidal and pulse waveforms. Stimulus configurations included the constant-frequency vibrations used by other laboratories as well as frequency-modulated (warbled) stimulus patterns. The frequency of a warbled stimulus was designed to have temporal variations analogous to those found in speech. The results suggest that it may be profitable to display the fundamental frequency of voiced speech on the skin as vibratory frequency, though it might be desirable to recode fundamental frequency into a frequency range more closely matched to the skin's capability.
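A warbled stimulus of the kind described is simply a frequency-modulated sinusoid. The sketch below is hypothetical; the carrier frequency, modulation depth, and warble rate are illustrative values, not those used in the study.

```python
# Hypothetical sketch of a "warbled" (frequency-modulated) vibrotactile stimulus.
import numpy as np

def warbled_stimulus(dur=1.0, sr=8000, f_center=50.0, f_dev=10.0, warble_rate=3.0):
    """Sinusoid whose frequency varies slowly, loosely analogous to voice F0."""
    t = np.arange(int(dur * sr)) / sr
    inst_freq = f_center + f_dev * np.sin(2 * np.pi * warble_rate * t)
    phase = 2 * np.pi * np.cumsum(inst_freq) / sr   # integrate instantaneous frequency
    return np.sin(phase)
```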


international conference on acoustics, speech, and signal processing | 1991

Text-independent talker identification with neural networks

L. Rudasi; Stephen A. Zahorian

The authors introduce a novel method for partitioning a large classification problem using N*(N-1)/2 binary pair classifiers. The binary pair classifier has been applied to a speaker identification problem using neural networks for the binary classifiers. The binary partitioned approach was used to develop an identification system for the 47 male speakers belonging to the Northern dialect region of the TIMIT database. The system performs with 100% accuracy in a text-independent mode when trained with about 9 to 14 s of speech and tested with 8 s of speech. The partitioned approach performs comparably to, or even better than, a single large neural network. For large values of N (>10), the partitioned approach requires only a fraction of the training time required for a single large network. For N=47, the training time for the partitioned network would be about two orders of magnitude less than for the single large network.
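The partitioning scheme itself is straightforward to sketch. The following is a minimal illustration of the N*(N-1)/2 decomposition with majority voting; `train_binary` is a hypothetical hook for any two-class learner (the paper uses neural networks), not an API from the paper.

```python
# Sketch of binary-pair partitioning: one two-class model per pair of classes,
# combined into an N-way decision by vote counting.
from itertools import combinations
import numpy as np

def train_pairwise(X, y, classes, train_binary):
    """train_binary(X, y01) -> predict function (hypothetical hook)."""
    models = {}
    for a, b in combinations(classes, 2):            # N*(N-1)/2 pairs
        mask = (y == a) | (y == b)
        models[(a, b)] = train_binary(X[mask], (y[mask] == b).astype(int))
    return models

def predict(x, models, classes):
    votes = {c: 0 for c in classes}
    for (a, b), model in models.items():
        votes[b if model(x) else a] += 1             # each pair casts one vote
    return max(votes, key=votes.get)                 # N-way decision by majority
```

For N=47 speakers this yields 47*46/2 = 1081 small pairwise networks, each trained on only two speakers' data, which is the source of the training-time savings claimed above.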


Journal of the Acoustical Society of America | 1991

Dynamic spectral shape features as acoustic correlates for initial stop consonants

Zaki B. Nossair; Stephen A. Zahorian

A comprehensive investigation of two acoustic feature sets for English stop consonants spoken in syllable initial position was conducted to determine the relative invariance of the features that cue place and voicing. The features evaluated were overall spectral shape, encoded as the cosine transform coefficients of the nonlinearly scaled amplitude spectrum, and formants. In addition, features were computed both for the static case, i.e., from one 25-ms frame starting at the burst, and for the dynamic case, i.e., as parameter trajectories over several frames of speech data. All features were evaluated with speaker-independent automatic classification experiments using the data from 15 speakers to train the classifier and the data from 15 different speakers for testing. The primary conclusions from these experiments, as measured via automatic recognition rates, are as follows: (1) spectral shape features are superior to both formants and formants plus amplitudes; (2) features extracted from the dynamic sp...
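One plausible reading of the "dynamic case" is a low-order cosine expansion of each feature's trajectory over several frames, mirroring the spectral encoding. The sketch below is an assumption for illustration, not the paper's exact parameterization.

```python
# Illustrative sketch: "dynamic" features as a low-order cosine expansion of
# per-frame feature trajectories over a multi-frame segment.
import numpy as np

def trajectory_features(frames_features, n_time_coeffs=3):
    """frames_features: (n_frames, n_static) array of per-frame features.
    Returns a (n_time_coeffs * n_static,) vector describing the trajectories."""
    n_frames, _ = frames_features.shape
    k = np.arange(n_time_coeffs)[:, None]
    n = (np.arange(n_frames) + 0.5)[None, :]
    basis = np.cos(np.pi * k * n / n_frames)     # cosine basis over time
    return (basis @ frames_features).ravel() / n_frames
```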


international conference on acoustics, speech, and signal processing | 2002

Yet Another Algorithm for Pitch Tracking

Kavita Kasi; Stephen A. Zahorian

In this paper, we present a pitch detection algorithm that is extremely robust for both high quality and telephone speech. The kernel method for this algorithm is the “NCCF or Normalized Cross Correlation” reported by David Talkin [1]. Major innovations include: processing of both the original acoustic signal and a nonlinearly processed version of the signal to partially restore very weak F0 components; intelligent peak picking to select multiple F0 candidates and assign merit factors; and incorporation of highly robust pitch contours obtained from smoothed versions of low-frequency portions of spectrograms. Dynamic programming is used to find the “best” pitch track among all the candidates, using both local and transition costs. We evaluated our algorithm using the Keele pitch extraction reference database as “ground truth” for both “high quality” and “telephone” speech. For both types of speech, the error rates obtained are lower than the lowest reported in the literature.
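The NCCF at the core of the algorithm is the cross correlation of a frame with a lagged copy of itself, normalized by the energies of the two segments. A minimal sketch follows; the window length and lag range are illustrative, and this omits the peak picking, merit factors, and dynamic programming stages.

```python
# Sketch of the normalized cross-correlation function (NCCF).
# x must contain at least max_lag + win samples.
import numpy as np

def nccf(x, min_lag, max_lag, win):
    """Return NCCF values for lags min_lag..max_lag over one analysis window."""
    out = np.zeros(max_lag - min_lag + 1)
    e0 = np.dot(x[:win], x[:win])
    for i, k in enumerate(range(min_lag, max_lag + 1)):
        seg = x[k:k + win]
        out[i] = np.dot(x[:win], seg) / np.sqrt(e0 * np.dot(seg, seg) + 1e-10)
    return out  # peaks near 1.0 mark strong F0 period candidates
```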


international conference on acoustics, speech, and signal processing | 1997

Phone classification with segmental features and a binary-pair partitioned neural network classifier

Stephen A. Zahorian; Peter L. Silsbee; Xihong Wang

This paper presents methods and experimental results for phonetic classification using 39 phone classes and the NIST recommended training and test sets for NTIMIT and TIMIT. Spectral/temporal features that represent the smoothed trajectory of FFT-derived speech spectra over 300-ms intervals are used for the analysis. Classification tests are made with both a binary-pair partitioned (BPP) neural network system (one neural network for each of the 741 pairs of phones) and a single large neural network. The classification accuracy is very similar for the two types of networks, but the BPP method has the advantage of a much shorter training time. The best results obtained (77% for TIMIT and 67.4% for NTIMIT) compare favorably to the best results reported in the literature for this task.


IEEE Transactions on Speech and Audio Processing | 1999

A partitioned neural network approach for vowel classification using smoothed time/frequency features

Stephen A. Zahorian; Zaki B. Nossair

A novel pattern classification technique and a new feature extraction method are described and tested for vowel classification. The pattern classification technique partitions an N-way classification task into N*(N-1)/2 two-way classification tasks. Each two-way classification task is performed using a neural network classifier that is trained to discriminate the two members of one pair of categories. Multiple two-way classification decisions are then combined to form an N-way decision. Some of the advantages of the new classification approach include the partitioning of the task, which allows independent feature and classifier optimization for each pair of categories; lowered sensitivity of classification performance to network parameters; a reduction in the amount of training data required; and the potential for superior performance relative to a single large network. The features described in this paper, closely related to the cepstral coefficients and delta cepstra commonly used in speech analysis, are developed using a unified mathematical framework which allows arbitrary nonlinear frequency, amplitude, and time scales to compactly represent the spectral/temporal characteristics of speech. This classification approach, combined with a feature ranking algorithm which selected the 35 most discriminative spectral/temporal features for each vowel pair, resulted in 71.5% accuracy for classification of 16 vowels extracted from the TIMIT database. These results, significantly higher than other published results for the same task, illustrate the potential for the methods presented in this paper.
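The per-pair feature ranking is one place where the pairwise partitioning pays off. The paper's exact ranking criterion is not reproduced here; the sketch below uses a Fisher-style discriminability score as one common stand-in.

```python
# Hedged sketch of per-pair feature ranking (criterion is an assumption):
# score each feature by (class-mean gap)^2 / pooled variance, keep the best.
import numpy as np

def top_features_for_pair(Xa, Xb, n_keep=35):
    """Xa, Xb: (samples, features) arrays for the two classes of one pair."""
    gap = (Xa.mean(axis=0) - Xb.mean(axis=0)) ** 2
    pooled = Xa.var(axis=0) + Xb.var(axis=0) + 1e-10
    return np.argsort(gap / pooled)[::-1][:n_keep]   # indices of top features
```

Because each of the 16*15/2 = 120 vowel-pair networks gets its own 35 features, each pair can use exactly the cues that separate those two vowels.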


international conference on acoustics, speech, and signal processing | 2002

A new robust algorithm for isolated word endpoint detection

Lingyun Gu; Stephen A. Zahorian

Teager energy and energy-entropy features are two approaches that have recently been used to locate the endpoints of an utterance. However, each of them has some drawbacks for speech in noisy environments. This paper proposes a novel method that combines these two approaches to locate endpoint intervals, yet makes the final decision based on energy, which requires far less time than the feature-based methods. After the algorithm description, an experimental evaluation is presented, comparing the automatically determined endpoints with those determined by skilled personnel. It is shown that the accuracy of this algorithm is quite satisfactory.
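For reference, the two ingredients named above can be sketched as follows. The discrete Teager energy operator is standard; the energy-entropy form below is one common variant, assumed here for illustration rather than taken from the paper.

```python
# Sketch of the two cues: discrete Teager energy and spectral energy entropy.
import numpy as np

def teager_energy(x):
    """Psi[x(n)] = x(n)^2 - x(n-1) * x(n+1)."""
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def energy_entropy(frame, n_bands=8):
    """Entropy of the frame's energy distribution across spectral bands
    (illustrative form; low entropy suggests speech-like energy concentration)."""
    spec = np.abs(np.fft.rfft(frame)) ** 2
    bands = np.array([b.sum() for b in np.array_split(spec, n_bands)])
    p = bands / (bands.sum() + 1e-10)
    return -np.sum(p * np.log(p + 1e-10))
```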


IEEE Transactions on Speech and Audio Processing | 2001

Signal modeling for high-performance robust isolated word recognition

Montri Karnjanadecha; Stephen A. Zahorian

This paper describes speech signal modeling techniques that are well suited to high-performance and robust isolated word recognition. We present new techniques for incorporating spectral/temporal information as a function of the temporal position within each word. In particular, spectral/temporal parameters are computed using variable-length blocks with variable spacing between blocks. We tested features computed with these methods on an alphabet recognition task based on the ISOLET database. The hidden Markov model toolkit (HTK) was used to implement the isolated word recognizer with whole-word HMM models. The best accuracy achieved for speaker-independent alphabet recognition, using 50 features, was 97.9%, which represents a new benchmark for this task. We also tested these methods with deliberate signal degradation using additive Gaussian noise and telephone band limiting and found that recognition degrades gracefully, and to a smaller degree than for control cases based on MFCC coefficients and delta cepstra terms.
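One way to realize position-dependent blocks is a warped placement of block boundaries over the word. The sketch below is an assumption for illustration only: the power-law warp and the choice to pack shorter blocks toward the word onset are not claimed to be the paper's scheme.

```python
# Hypothetical sketch of variable-length, variably spaced analysis blocks.
import numpy as np

def block_boundaries(n_frames, n_blocks=10, warp=1.5):
    """(start, end) frame indices; warp > 1 packs shorter blocks toward the
    word onset (assumed placement, chosen only to illustrate the idea)."""
    edges = np.round(np.linspace(0.0, 1.0, n_blocks + 1) ** warp * n_frames).astype(int)
    return [(edges[i], max(edges[i + 1], edges[i] + 1)) for i in range(n_blocks)]
```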

Collaboration


Dive into Stephen A. Zahorian's collaborations.

Top Co-Authors

Jiang Wu

Binghamton University
