

Publications


Featured research published by Melvyn J. Hunt.


International Conference on Acoustics, Speech, and Signal Processing | 1989

A comparison of several acoustic representations for speech recognition with degraded and undegraded speech

Melvyn J. Hunt; C. Lefebvre

Several acoustic representations have been compared in speaker-dependent and speaker-independent connected- and isolated-word recognition tests with undegraded speech and with speech degraded by adding white noise and by applying a 6-dB/octave spectral tilt. The representations comprised the output of an auditory model; cepstrum coefficients derived from an FFT-based mel-scale filter bank, with various weighting schemes applied to the coefficients; cepstrum coefficients augmented with measures of their rates of change with time; and sets of linear discriminant functions derived from the filter-bank output, called IMELDA. The model outperformed the cepstrum representations except in noise-free connected-word tests, where it had a high insertion rate. The best cepstrum weighting scheme was derived from within-class variances. Its behavior may explain the empirical adjustments found necessary with other schemes. IMELDA outperformed all other representations in all conditions and is computationally simple.
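
As an aside for readers unfamiliar with the construction, IMELDA amounts to classical multiple discriminant analysis applied to filter-bank log-energies. A minimal sketch of that step, assuming labelled frames, a toy channel count, and a generic eigenvector solution; the paper's actual classes and dimensions are not given here.

```python
# Hypothetical IMELDA-style discriminants from mel filter-bank outputs.
# Shapes and the toy data are assumptions, not details from the paper.
import numpy as np

def linear_discriminants(frames, labels, n_out=8):
    """frames: (N, D) filter-bank log-energies; labels: (N,) class ids.
    Returns a (D, n_out) projection maximizing between-class over
    within-class scatter (classical multiple discriminant analysis)."""
    D = frames.shape[1]
    mean_all = frames.mean(axis=0)
    S_w = np.zeros((D, D))  # within-class scatter
    S_b = np.zeros((D, D))  # between-class scatter
    for c in np.unique(labels):
        X = frames[labels == c]
        mu = X.mean(axis=0)
        S_w += (X - mu).T @ (X - mu)
        d = (mu - mean_all)[:, None]
        S_b += len(X) * (d @ d.T)
    # Solve S_w^{-1} S_b v = lambda v; keep the top n_out discriminants.
    evals, evecs = np.linalg.eig(np.linalg.solve(S_w, S_b))
    order = np.argsort(evals.real)[::-1][:n_out]
    return evecs.real[:, order]

# Toy usage: 1000 frames of 20-channel filter-bank output, 10 classes.
rng = np.random.default_rng(0)
frames = rng.normal(size=(1000, 20))
labels = rng.integers(0, 10, size=1000)
W = linear_discriminants(frames, labels)
compact = frames @ W  # (1000, 8) IMELDA-style features
```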


International Conference on Acoustics, Speech, and Signal Processing | 1988

Speaker dependent and independent speech recognition experiments with an auditory model

Melvyn J. Hunt; C. Lefebvre

The performance of an auditory model has been compared with that of a conventional filterbank mel-cepstrum representation in speaker-dependent and speaker-independent spoken digit recognition tests. The model produces two outputs: one sensitive to voicing and onsets, and the other sensitive to formant structure and showing two-tone suppression. Linear discriminant analysis has been used to combine the outputs into eight coefficients. Undegraded, noisy, and spectrally tilted male speech was tested with a quasi-isolated-word system. A subset of the tests were repeated with a connected-word system, and with undegraded female speech. In all cases the model performed better than the conventional representation. With degraded speech the differences were large.


Journal of the Acoustical Society of America | 1979

A statistical approach to metrics for word and syllable recognition

Melvyn J. Hunt

Time-warping pattern-comparison algorithms are widely used in speech recognition. Two words or syllables being compared are described by a series of time frames, each containing values of a set of acoustic parameters. After time alignment, the squared distance between the patterns is summed over the parameters within a frame and then across frames. The sum obtained is assumed to be proportional to the log probability of the two patterns having the same identity. This assumption is generally invalid, but it may be made substantially true by analyzing the variability between different examples of the same syllable and adjusting the metric accordingly. Variability is estimated both as a function of frame position within the syllable and as a function of the acoustic parameters. In the latter case, within- and between-class covariance matrices can be estimated and standard linear discriminant analysis methods applied. This permits the combination of disparate acoustic parameters into a single distance measure. In ...
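
The adjustment the abstract describes can be pictured as scaling each parameter's squared difference by its estimated within-class variance, so the summed distance behaves like a negative Gaussian log-likelihood. A minimal sketch under that reading; the parameter count and toy data are assumptions.

```python
# Variance-weighted frame metric: squared parameter differences are
# scaled by per-parameter within-class variance. Toy data is assumed.
import numpy as np

def weighted_distance(a, b, within_var):
    """a, b: (T, D) time-aligned parameter tracks; within_var: (D,)
    variances estimated from repeated examples of the same syllable."""
    return np.sum((a - b) ** 2 / within_var)

rng = np.random.default_rng(1)
within_var = rng.uniform(0.5, 2.0, size=12)    # per-parameter variability
x = rng.normal(size=(40, 12))                  # one aligned token
y = x + rng.normal(scale=np.sqrt(within_var), size=(40, 12))
print(weighted_distance(x, y, within_var))     # ~ T * D on average
```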


International Conference on Acoustics, Speech, and Signal Processing | 1987

Speech recognition using an auditory model with pitch-synchronous analysis

Melvyn J. Hunt; C. Lefebvre

An auditory model with two-tone suppression has previously been shown to perform better in speech recognition experiments than a conventional filterbank representation, particularly with noisy or distorted speech. It was, however, known to have several defects, including an uneven response across the spectrum and a tendency to detect harmonics of F0 rather than F1. We show that instants of glottal excitation can be derived from the model even with noisy speech. By using this information to carry out pitch-synchronous analysis in a slightly modified model, the problem of interaction with harmonics of F0 can be solved. An analysis of the behavior of the model leads to a specification of a class of processes showing two-tone suppression and hence to a redesigned model avoiding the known defects. The pitch-synchronous analysis is then no longer necessary, but the robust indication of excitation points may have other uses. Spectrograms from the old and new models illustrate the improvements obtained.
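
A minimal sketch of what pitch-synchronous analysis means in practice, assuming the glottal excitation instants are already available: each window is centred on an excitation point and sized to the local pitch period, so the analysis does not beat against harmonics of F0. The signal and window shaping below are illustrative assumptions, not the model's internals.

```python
# Pitch-synchronous short-term spectra from assumed excitation instants.
import numpy as np

def pitch_synchronous_spectra(signal, instants, n_fft=512):
    spectra = []
    for i in range(len(instants) - 1):
        period = instants[i + 1] - instants[i]     # local pitch period
        start = max(instants[i] - period // 2, 0)  # window centred on instant
        frame = signal[start:start + period]
        frame = frame * np.hanning(len(frame))     # taper one period
        spectra.append(np.abs(np.fft.rfft(frame, n_fft)))
    return np.array(spectra)

fs = 10000
t = np.arange(fs) / fs
signal = np.sin(2 * np.pi * 120 * t)               # toy voiced signal
instants = np.arange(0, fs, fs // 120)             # one mark per period
S = pitch_synchronous_spectra(signal, instants)
print(S.shape)                                     # one spectrum per period
```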


Pattern Recognition Letters | 1987

Delayed decisions in speech recognition—the case of formants

Melvyn J. Hunt

Consciously designed message-bearing signals are generally well suited to bottom-up decoding: signals are first segmented into low-level units, the units are classified, and the higher-level information is then deduced. Evidence from the reading of printed text suggests that humans may not use such a strategy even on signals apparently well suited to it, and that spontaneous modes of communication - handwriting and speech - are quite unsuited to the strategy. It is argued that the most effective algorithms for automatic speech recognition derive their effectiveness from an ability to delay low-level decisions (such as segmental identities and boundaries) until higher-level decisions (such as word identities) have been made. A case is made for the representation of speech for recognition purposes in terms of the frequencies of the vocal tract resonances (formants). The fact that formant frequencies have not hitherto been widely used in speech recognition is ascribed to their resembling other low-level features in that they too cannot be reliably extracted and labeled prior to some higher-level decisions. An algorithm is presented that allows decisions on formant identities to be delayed and made contingent on higher-level decisions. A technique is described for deriving the cost functions used in this algorithm from the statistical properties of formants, and some practical applications of the algorithm are briefly described.
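
The delayed-decision idea can be made concrete with a small dynamic program: every spectral peak in every frame remains a candidate, and formant identities are assigned only once a globally cheapest labelling is known. The cost weights and candidate peaks below are illustrative assumptions; the paper derives its cost functions from formant statistics.

```python
# Delayed formant labelling via dynamic programming over candidate peaks.
# Costs mix deviation from an expected range with frame-to-frame continuity.
import numpy as np

def track_formant(candidates, expected=500.0, w_cont=1.0, w_prior=0.2):
    """candidates: list of 1-D arrays of peak frequencies per frame.
    Returns one frequency per frame, decided only after all frames."""
    T = len(candidates)
    cost = [w_prior * np.abs(candidates[0] - expected)]
    back = []
    for t in range(1, T):
        # continuity cost between every candidate pair across frames
        trans = w_cont * np.abs(candidates[t][:, None] - candidates[t - 1][None, :])
        total = trans + cost[-1][None, :]
        back.append(total.argmin(axis=1))
        cost.append(total.min(axis=1) + w_prior * np.abs(candidates[t] - expected))
    # trace back the globally cheapest labelling
    idx = int(np.argmin(cost[-1]))
    path = [candidates[-1][idx]]
    for t in range(T - 2, -1, -1):
        idx = back[t][idx]
        path.append(candidates[t][idx])
    return np.array(path[::-1])

frames = [np.array([310.0, 720.0]), np.array([300.0, 1500.0]),
          np.array([330.0, 690.0])]
print(track_formant(frames))  # picks the smooth low track: 310, 300, 330
```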


International Conference on Acoustics, Speech, and Signal Processing | 1986

Speech recognition using a cochlear model

Melvyn J. Hunt; C. Lefebvre

At the 1984 IEEE ICASSP meeting, Seneff described a computational model of the peripheral auditory system consisting of a bank of digital filters followed by compression and half-wave rectification stages and by a set of generalized synchrony detectors (GSDs) that respond to coherence in the signal at the center frequency of the channel. We have added adjacent-channel cross-correlation and modified the GSD. This results in improved sensitivity to formants in noise and allows human frequency masking measurements to be replicated quantitatively. When the output of the model is used in a speech recognition task it shows an advantage over a conventional filter-bank representation both with undistorted speech and in the presence of noise and linear distortion. Spectrograms generated from the model are presented both for artificially degraded speech and for speech recorded in flight in a helicopter and a fighter/trainer.
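
For orientation, a generalized synchrony detector compares a channel's output with itself delayed by one period of the channel's centre frequency; coherent (periodic) energy at that frequency drives the response up. The sketch follows the commonly cited formulation in spirit only; the exact constants and normalization are assumptions.

```python
# A toy generalized synchrony detector: ratio of in-phase to out-of-phase
# energy at a one-CF-period delay. Constants here are assumptions.
import numpy as np

def synchrony(channel, cf, fs, eps=1e-6):
    """channel: 1-D output of one cochlear filter; cf: centre freq (Hz)."""
    delay = int(round(fs / cf))            # one CF period in samples
    x, xd = channel[delay:], channel[:-delay]
    return np.mean(np.abs(x + xd)) / (np.mean(np.abs(x - xd)) + eps)

fs = 16000
t = np.arange(fs) / fs
tone = np.sin(2 * np.pi * 1000 * t)        # coherent at a 1 kHz channel
noise = np.random.default_rng(2).normal(size=fs)
print(synchrony(tone, 1000, fs))           # large: signal repeats at CF
print(synchrony(noise, 1000, fs))          # near 1: no synchrony
```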


International Conference on Acoustics, Speech, and Signal Processing | 1988

Evaluating the performance of connected-word speech recognition systems

Melvyn J. Hunt

Outputs of connected-word recognizers may contain substitution, deletion and insertion errors, and their interpretation is not trivial. Simulations show that the commonly used dynamic programming word-sequence matching algorithm has serious shortcomings as an evaluation method at low performance levels, though it is generally reliable at high performance levels. The strategy of comparing input and output words in strict sequence is found to have little to recommend it. A method using word end-point information, which provides precise, detailed performance analyses, is described. Tests with real data confirm the reliability of the end-point method and the presence of positive bias in performance estimates from the word-sequence matching method.
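
The word-sequence matching method under discussion is the familiar minimum-edit-distance alignment, read back as substitution, deletion and insertion counts. A minimal sketch with unit costs (the costs and example strings are assumptions):

```python
# Minimum-edit-distance word alignment, tallied as (subs, dels, ins).
def align_counts(ref, hyp):
    R, H = len(ref), len(hyp)
    d = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1): d[i][0] = i          # all deletions
    for j in range(1, H + 1): d[0][j] = j          # all insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    # trace back one optimal alignment and tally error types
    i, j, subs, dels, ins = R, H, 0, 0, 0
    while i > 0 or j > 0:
        if i and j and d[i][j] == d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            subs += ref[i - 1] != hyp[j - 1]
            i -= 1; j -= 1
        elif i and d[i][j] == d[i - 1][j] + 1:
            dels += 1; i -= 1
        else:
            ins += 1; j -= 1
    return subs, dels, ins

print(align_counts("one two three four".split(),
                   "one three four five".split()))
# -> (0, 1, 1): "two" deleted, "five" inserted
```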


International Conference on Acoustics, Speech, and Signal Processing | 1984

Time alignment of natural speech to synthetic speech

Melvyn J. Hunt

A capacity to carry out reliable automatic time alignment of synthetic speech to naturally produced speech offers potential benefits in speech recognition and speaker recognition as well as in synthesis itself. Phrase alignment experiments are described that indicate that alignment to synthetic speech is more difficult than alignment of speech from two natural speakers. An artificial speech recognition experiment is introduced as a convenient means of assessing alignment accuracy. By this measure, alignment accuracy is found to be improved considerably by applying certain speaker adaptation transformations to the synthetic speech, by modifying the spectrum similarity metric, and by generating the synthetic spectra directly from the control parameters using simplified excitation spectra. The improvements seem to reach a limit, however, at a level below that found between natural speakers. It is conjectured that further improvement requires modifications to the synthesis rules themselves.
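
The alignment itself is dynamic time warping over spectral frames. A minimal sketch with a plain Euclidean frame distance; since the paper experiments with modified similarity metrics, the metric and toy frames here are assumptions.

```python
# Dynamic time warping between two frame sequences, Euclidean distance.
import numpy as np

def dtw_path(A, B):
    """A: (Ta, D) natural frames; B: (Tb, D) synthetic frames.
    Returns the warping path as (i, j) frame pairs."""
    Ta, Tb = len(A), len(B)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            d = np.linalg.norm(A[i - 1] - B[j - 1])
            D[i, j] = d + min(D[i - 1, j - 1], D[i - 1, j], D[i, j - 1])
    # backtrack the cheapest monotonic alignment
    path, i, j = [], Ta, Tb
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([D[i - 1, j - 1], D[i - 1, j], D[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]

rng = np.random.default_rng(3)
nat = rng.normal(size=(30, 12))                        # toy natural frames
syn = np.repeat(nat[::2], 2, axis=0) + 0.1 * rng.normal(size=(30, 12))
print(dtw_path(nat, syn)[:5])                          # start of the path
```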


Journal of the Acoustical Society of America | 1981

Speaker adaptation for word‐based speech recognition systems

Melvyn J. Hunt

This work is aimed at enhancing the speaker-independent performance of word-based speech recognition systems by rapidly and automatically deducing general characteristics of the current speaker and using them to derive speaker-normalizing transforms. DP matching is used to align and compare corresponding frames of the incoming speech and reference vocabulary. A single transform is then computed for all voiced speech and another for all unvoiced speech. Each transform consists of a linear filtering component and, optionally, a constrained frequency shift. Experiments have been carried out with twenty male and female, native and non-native English speakers each producing 150 digits. Adaptation on all 150 digits reduces recognition errors by a factor of three (4.5% to 1.5%). With adaptation on just three randomly selected digits, the reduction factor is two. Frequency shifting is useful only when the amount of adaptation material is large and the reference speech is not exclusively from the same sex as the current speaker.
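
One plausible reading of the linear filtering component is a single least-squares linear map fitted to time-aligned frame pairs and applied to all frames of the relevant class (voiced or unvoiced). The fitting method and toy data below are assumptions about one reasonable realisation, not the paper's exact procedure.

```python
# Hypothetical speaker-normalizing transform: least-squares linear map
# fitted on DP-aligned frame pairs. Data and method are assumptions.
import numpy as np

def fit_transform(speaker_frames, reference_frames):
    """Both (N, D), already time-aligned (e.g. by DP matching).
    Returns (A, b) with reference ~= speaker @ A + b."""
    X = np.hstack([speaker_frames, np.ones((len(speaker_frames), 1))])
    W, *_ = np.linalg.lstsq(X, reference_frames, rcond=None)
    return W[:-1], W[-1]

rng = np.random.default_rng(4)
ref = rng.normal(size=(200, 16))                       # reference spectra
spk = ref @ np.diag(rng.uniform(0.8, 1.2, 16)) + 0.3   # a "tilted" speaker
A, b = fit_transform(spk, ref)
print(np.allclose(spk @ A + b, ref, atol=1e-6))        # recovers the mapping
```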


International Conference on Acoustics, Speech, and Signal Processing | 1983

Further experiments in text-independent speaker recognition over communications channels

Melvyn J. Hunt

Experiments are described in automatic, text-independent speaker recognition using three databases: good quality read speech, conversations over simulated telephone links, and conversations over real telephone links. A recognition system is evaluated on this material using a set of features which were believed to have some resistance to transmission degradations, namely, F0 statistics and statistics of low-order cepstrum coefficient variation. Performance is reasonable on the first two databases but poor on the telephone speech. A new set of features based on the frequencies of peaks in the short-term smoothed spectrum is found to perform better on the telephone speech, presumably because of its greater resistance to noise and nonlinear distortions. A computer simulation of the recognition experiments is described. The results of the simulation indicate that performance estimates from recognition experiments should be allowed wide error tolerances, and they illustrate the danger of trying too many features on the same database.
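
The first feature set the abstract tests consists of long-term statistics pooled over an utterance. A minimal sketch of F0 statistics plus statistics of low-order cepstrum-coefficient variation, with the particular summary statistics and toy tracks chosen as assumptions.

```python
# Long-term, text-independent features: F0 statistics and statistics of
# cepstrum-coefficient variation. Statistic choices are assumptions.
import numpy as np

def utterance_features(f0_track, cepstra):
    """f0_track: (T,) Hz, 0 for unvoiced; cepstra: (T, C) low-order coeffs."""
    voiced = f0_track[f0_track > 0]
    delta = np.diff(cepstra, axis=0)            # frame-to-frame variation
    return np.concatenate([
        [voiced.mean(), voiced.std()],          # F0 statistics
        delta.mean(axis=0), delta.std(axis=0),  # variation statistics
    ])

rng = np.random.default_rng(5)
f0 = np.where(rng.random(300) > 0.3, rng.normal(120, 15, 300), 0.0)
cep = rng.normal(size=(300, 4))
print(utterance_features(f0, cep).shape)        # (2 + 4 + 4,)
```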

Collaboration


Dive into Melvyn J. Hunt's collaboration.

Top Co-Authors

C. Lefebvre

National Research Council
