Paul Mermelstein
Nortel
Publications
Featured research published by Paul Mermelstein.
Journal of the Acoustical Society of America | 1996
Vasu Iyengar; Rafi Rabipour; Paul Mermelstein; Brian R. Shelton
A speech bandwidth extension method and apparatus analyzes narrowband speech sampled at 8 kHz using LPC analysis to determine its spectral shape and inverse filtering to extract its excitation signal. The excitation signal is interpolated to a sampling rate of 16 kHz and analyzed for pitch control and power level. A white noise generated wideband signal is then filtered to provide a synthesized wideband excitation signal. The narrowband shape is determined and compared to templates in respective vector quantizer codebooks, to select respective highband shape and gain. The synthesized wideband excitation signal is then filtered to provide a highband signal which is, in turn, added to the narrowband signal, interpolated to the 16 kHz sample rate, to produce an artificial wideband signal. The apparatus may be implemented on a digital signal processor chip.
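One concrete step of the pipeline above, interpolating the 8 kHz narrowband signal to a 16 kHz sampling rate, can be sketched as zero-stuffing followed by a half-band low-pass filter. The filter length and window below are illustrative assumptions, not values taken from the patent.

```python
import numpy as np

def interpolate_2x(x):
    """Raise the sampling rate from 8 kHz to 16 kHz: insert a zero between
    consecutive samples, then low-pass at the old Nyquist frequency (4 kHz)."""
    up = np.zeros(2 * len(x))
    up[::2] = x
    taps = 63  # illustrative FIR length
    n = np.arange(taps) - (taps - 1) // 2
    h = np.sinc(n / 2.0) * np.hamming(taps)  # windowed-sinc half-band filter
    h *= 2.0 / np.sum(h)  # gain of 2 restores amplitude lost to zero-stuffing
    return np.convolve(up, h, mode="same")
```

The highband signal synthesized from the filtered excitation would then be added to the output of this step.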
Journal of the Acoustical Society of America | 1975
Paul Mermelstein
As a first step toward automatic phonetic analysis of speech, one desires to segment the signal into syllable‐sized units. Experiments were conducted in automatic segmentation techniques for continuous, reading‐rate speech to derive such units. A new segmentation algorithm is described that assesses the significance of a loudness minimum as a potential syllabic boundary from the difference between the convex hull of the loudness function and the loudness function itself. Tested on roughly 400 syllables of continuous text, the algorithm results in 6.9% syllables missed and 2.6% extra syllables relative to a nominal, slow‐speech syllable count. It is suggested that inclusion of alternative fluent‐form syllabifications for multisyllabic words and the use of phonological rules for predicting syllabic contractions can further improve agreement between predicted and experimental syllable counts. Subject Classification: 70.40, 70.60.
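The hull-versus-loudness test described above can be sketched as follows, assuming the loudness contour is given as an array in dB. The recursive splitting scheme and the threshold value are illustrative assumptions, not details from the paper.

```python
import numpy as np

def upper_hull(x):
    """Upper convex hull of the sequence x, evaluated at every index."""
    x = np.asarray(x, dtype=float)
    hull = []  # indices of hull vertices, built as a monotone chain
    for i in range(len(x)):
        while len(hull) >= 2:
            i0, i1 = hull[-2], hull[-1]
            # drop i1 if it lies on or below the chord from i0 to i
            if (x[i1] - x[i0]) * (i - i0) <= (x[i] - x[i0]) * (i1 - i0):
                hull.pop()
            else:
                break
        hull.append(i)
    return np.interp(np.arange(len(x)), hull, x[hull])

def syllable_boundaries(loudness, threshold_db=2.0):
    """Mark a loudness minimum as a syllabic boundary when it falls at least
    threshold_db below the convex hull, then recurse on each side."""
    def split(lo, hi, out):
        seg = loudness[lo:hi]
        if len(seg) < 3:
            return
        dip = upper_hull(seg) - seg
        k = int(np.argmax(dip))
        if dip[k] >= threshold_db and 0 < k < len(seg) - 1:
            out.append(lo + k)
            split(lo, lo + k, out)
            split(lo + k + 1, hi, out)
    out = []
    split(0, len(loudness), out)
    return sorted(out)
```

A shallow dip between two loudness peaks is ignored, while a dip that falls well below the hull spanning the peaks is declared a boundary, which is the significance test the abstract describes.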
Journal of the Acoustical Society of America | 1976
Paul Mermelstein
Difference limens (DL) of formant frequencies were measured for two steady‐state vowels and the same vowels in symmetric stop‐consonant contexts. The stimuli were generated using a computer‐programmed synthesizer, and the formant‐frequency parameters were adjusted to be steady or symmetric transition functions around the temporal center of the syllable. The DL for the time‐varying consonant–vowel–consonant (CVC) stimuli were found to be significantly larger than those for the steady‐state vowels. In some cases the DL for the second formant was found to be larger in the direction of expected formant shift due to consonantal coarticulation than in the reverse direction. For CV or VC stimuli the increase in vowel‐formant DL is reduced. The difference in DL values in and out of context has, at least partially, an auditory origin. However, the phonetic decoding of the CVC stimuli may also contribute to the loss of vowel‐quality information.
Journal of the Acoustical Society of America | 1979
Paul Mermelstein
Two objective measures of speech quality, the traditional signal to quantization noise ratio (SNR) and a segmental SNR, are compared from the point of view of their ability to predict subjective preference ratings of PCM and ADPCM coded speech. The SNR shows an 8‐dB bias for PCM coding, i.e., the SNR of the PCM has to exceed the SNR of ADPCM by 8 dB before the two are judged to be equally preferable. In contrast, the segmental SNR shows no such bias. The segmental SNR is considered a better perceptual model since it evaluates the quantization noise with respect to the energy in each underlying speech segment.
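The two measures compared above can be sketched as follows; the frame length (20 ms at 8 kHz) and the skipping of silent or error-free frames are assumptions for illustration, not details from the paper.

```python
import numpy as np

def snr_db(clean, coded):
    """Conventional long-term SNR over the whole utterance."""
    noise = clean - coded
    return 10.0 * np.log10(np.sum(clean ** 2) / np.sum(noise ** 2))

def segmental_snr_db(clean, coded, frame=160):
    """Segmental SNR: the mean of per-frame SNRs, so quantization noise is
    judged against the energy of each underlying speech segment."""
    vals = []
    for i in range(0, len(clean) - frame + 1, frame):
        c = clean[i:i + frame]
        n = c - coded[i:i + frame]
        e_c, e_n = np.sum(c ** 2), np.sum(n ** 2)
        if e_c > 0.0 and e_n > 0.0:  # skip degenerate frames (assumption)
            vals.append(10.0 * np.log10(e_c / e_n))
    return float(np.mean(vals))
```

With noise of constant power, a loud frame followed by a quiet one pulls the long-term SNR toward the loud frame, while the segmental SNR weights both frames equally; that is the property the paper credits for the absence of bias.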
Journal of the Acoustical Society of America | 1975
Paul Mermelstein
The acoustic manifestation of nasal murmurs is significantly context dependent. To what extent can the class of nasals be automatically detected without prior detailed knowledge of the segmental context? This contribution reports on the characterization of the spectral change accompanying the transition between vowel and nasal for the purpose of automatic detection of nasal murmurs. The speech is first segmented into syllable‐sized units, the voiced sonorant region within the syllable is delimited, and the points of maximal spectral change on either side of the syllabic peak are hypothesized to be potential nasal transitions. Four simply extractible acoustic parameters, the relative energy change in the frequency bands 0–1, 1–2, and 2–5 kHz, and the frequency centroid of the 0–500‐Hz band at four points in time spaced 12.8 msec apart are used to represent the dynamic transition. Categorization of the transitions using multivariate statistics on some 524 transition segments from data of two speakers resulted in a 91% correct nasal/non‐nasal decision rate.
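The four parameters named above can be computed per analysis frame roughly as below; the sampling rate, FFT size, and the normalization of band energies by total frame energy are assumptions made for illustration.

```python
import numpy as np

def transition_features(frames, sr=10000, n_fft=256):
    """Per-frame band energies (0-1, 1-2, 2-5 kHz, relative to total) and the
    spectral centroid of the 0-500 Hz band, for frames spaced 12.8 ms apart."""
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    feats = []
    for frame in frames:
        spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
        total = np.sum(spec) + 1e-12
        row = [np.sum(spec[(freqs >= lo) & (freqs < hi)]) / total
               for (lo, hi) in [(0, 1000), (1000, 2000), (2000, 5000)]]
        low = freqs < 500
        row.append(float(np.sum(freqs[low] * spec[low]) /
                         (np.sum(spec[low]) + 1e-12)))
        feats.append(row)
    return np.array(feats)  # one row per time point
```

The energy change across the hypothesized transition is then the row-to-row difference of the first three columns.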
Journal of the Acoustical Society of America | 1982
Mamoru Nakatsui; Paul Mermelstein
The ultimate performance measure for evaluating voice communication systems is the subjective quality of the received speech. Modern digital speech‐coding techniques achieve high intelligibility and significant transmission economies. The high level of speech intelligibility is a necessary but insufficient condition for user acceptance of the systems. Quality, as well, must meet acceptability criteria. However, no adequate single measure of overall speech quality has yet been developed. This work takes a utilitarian approach in attempting to satisfy the urgent requirement for a practical measurement method. The subjective speech‐to‐noise‐ratio (SNR), derived from the forced‐choice pair‐comparison test using the psychometric analysis procedure commonly used in the method of constants, is evaluated. A speech signal degraded by varying amounts of multiplicative white noise is selected as the reference system in the test. Seven types of digital speech coders are simulated and evaluated in this study, including log‐PCM, ADM, ADPCM coders with variable or fixed predictor, APC, residual‐excited and pitch‐excited LP coders (RELP and LPC). Thirteen configurations of these coders covering the transmission bit rates of 2.4 to 64 kb/s are included. Pair‐comparison tests were conducted in two separate sessions 14 months apart using different groups of speakers and listeners. The subjective SNR estimated from 13 coder configurations ranges from 7 to 40 dB and well represents overall speech quality in a single dimension. No significant speaker and listener variation is found for a wide range of waveform coders. The subjective SNR estimate is found to be highly reproducible with different speakers and listeners. Arbitrary selection of as few as five listeners yields a stable subjective SNR estimate for the waveform coders. On the other hand, highly significant listener variation is found for the narrow‐band digital vocoders (RELP and LPC). 
This listener variability reflects a limitation of the measure that may prevent its extension to vocoded speech whose distortions differ significantly from those of the reference speech.
Journal of the Acoustical Society of America | 1965
Paul Mermelstein; M. R. Schroeder
Approximate formant frequencies for a given vocal‐tract configuration can be calculated by integrating Webster's horn equation. No general methods are available for the solution of the inverse problem, that of determining the cross‐sectional area from the formant frequencies. Previous attempts involved approximations of the tract shape in terms of a small number of parameters of a given model based on the physical constraints acting on the vocal tract [see, for example, K. N. Stevens and A. S. House, J. Acoust. Soc. Am. 27, 484–493 (1955)]. The procedure presented here eliminates the necessity for such a priori assumptions. The logarithm of the area function is represented by a spatial Fourier cosine series of a symmetrical tract of length twice that of the original tract. Starting with a uniform tract closed at one end, a steepest‐descent search in the multidimensional space of Fourier coefficients is found to converge rapidly to a configuration yielding low‐order formant frequencies nearly identical to those specified. Results are presented for articulations of sustained vowels calculated from the first three formant frequencies and represented by three Fourier coefficients. The area functions so obtained are found to be good smoothed approximations to those derived from x‐ray data.
Journal of the Acoustical Society of America | 1978
Steven B. Davis; Paul Mermelstein
Several recent investigations have hypothesized that syllable‐sized segments may be more appropriate units than phoneme‐sized segments for use in continuous speech recognition systems. The significant acoustic information in these segments may be represented by a variety of parameters, some obtained by linear predictive techniques [e.g., linear prediction coefficients (LPC), reflection coefficients (RC), and cepstral coefficients (LPCC)], others by spectral techniques [e.g., linear‐frequency cepstral coefficients (LFCC) and mel‐frequency cepstral coefficients (MFCC)]. This study compared the performance of these parameters in a word identification test. Two male speakers produced 57 sentences in each of two sessions, and 676 tokens of 52 CVC words in a variety of syntactic positions were manually segmented. For each speaker, a symmetric dynamic warping technique was used for time registration of half of the data to form composite templates and for comparisons of the remaining data against the templates. T...
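The mel-frequency cepstral coefficients named above are computed essentially as follows: a power spectrum is passed through triangular filters spaced evenly on the mel scale, and the log filter energies are cosine-transformed. The filter count, frequency range, and number of coefficients below are common defaults, not the exact values of the study.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mfcc_frame(frame, sr=8000, n_filters=20, n_ceps=13, n_fft=256):
    """MFCC of one frame: power spectrum -> triangular filters spaced evenly
    on the mel scale -> log energies -> discrete cosine transform."""
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    # filter edge frequencies, equally spaced in mel
    edges = mel_to_hz(np.linspace(0.0, hz_to_mel(sr / 2.0), n_filters + 2))
    log_e = np.empty(n_filters)
    for j in range(n_filters):
        lo, mid, hi = edges[j], edges[j + 1], edges[j + 2]
        rising = (freqs - lo) / (mid - lo)
        falling = (hi - freqs) / (hi - mid)
        w = np.clip(np.minimum(rising, falling), 0.0, None)  # triangular weights
        log_e[j] = np.log(np.sum(w * spec) + 1e-12)
    k = np.arange(n_ceps)[:, None]
    n = np.arange(n_filters)[None, :]
    dct = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters))  # DCT-II basis
    return dct @ log_e
```

The mel warping concentrates filters at low frequencies, which is what distinguishes the MFCC from the linear-frequency cepstral coefficients (LFCC) it was compared against.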
Journal of the Acoustical Society of America | 1967
Paul Mermelstein
A previously suggested method for smoothing the narrow‐band spectra of voiced speech segments by “low‐pass filtering” the logarithm of the power spectrum [M. R. Schroeder and A. M. Noll, “Recent Studies in Speech Research at Bell Telephone Laboratories. I,” Paper A21 in Proceedings of the Fifth International Congress on Acoustics, Liege, Belgium, 1965, D. E. Commins, Ed. (Georges Thone, Liege, 1966)] has been found inadequate for high‐pitched female voices. Low‐pass filtering, with a cutoff “quefrency” (independent variable in the Fourier‐transform domain of power spectra) below the pitch‐period value, may result in elimination of peaks due to the formant structure. An approximation to the spectrum fine structure (due to the quasiperiodic excitation of the vocal tract) is generated by requiring its Fourier cosine transform to match that of the log power spectrum of the speech signal in the immediate neighborhood of the pitch peak, but to have no other significant quefrency components. By subtraction of th...
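The baseline "low-pass filtering of the log power spectrum" that the abstract finds inadequate amounts to zeroing cepstral components above a cutoff quefrency. A minimal sketch, with the cutoff and FFT size as assumptions:

```python
import numpy as np

def liftered_log_spectrum(x, sr=8000, cutoff_s=0.002, n_fft=512):
    """Smooth the log power spectrum by keeping only cepstral ('quefrency')
    components below cutoff_s seconds. This fails for high-pitched voices
    when the cutoff falls below the pitch period, the problem the paper
    addresses by modeling the pitch peak explicitly."""
    windowed = x * np.hanning(len(x))
    logspec = np.log(np.abs(np.fft.rfft(windowed, n_fft)) ** 2 + 1e-12)
    ceps = np.fft.irfft(logspec)          # real cepstrum (length n_fft)
    cutoff = int(cutoff_s * sr)
    lifter = np.zeros_like(ceps)
    lifter[:cutoff] = 1.0
    lifter[-(cutoff - 1):] = 1.0          # keep the symmetric mirror bins
    return np.fft.rfft(ceps * lifter).real
```

For a high-pitched voice the pitch peak in the cepstrum moves toward low quefrencies, so any cutoff below it also removes formant detail, as the abstract notes.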
Journal of the Acoustical Society of America | 1971
Paul Mermelstein
An extension of Flanagan's model for vocal‐cord oscillations has been used to excite a transmission‐line analog of the vocal tract and thereby synthesize voiced and unvoiced stops. Postulation of a two‐spring vibrating system allows the lateral compression of the vibrating vocal‐cord mass to be a nonlinear function of the force exerted by the air on the cords. When the supraglottal pressure is low, e.g., for an unconstricted vocal tract, the vibrating system is essentially linear and reduces to that proposed by Flanagan. When the tract is constricted, e.g., for voiced stops, the nonlinearities introduced serve to limit expansion of the glottal area in spite of the higher air pressure on the vocal cords. Unvoiced stops are modeled by an active separation of the vocal cords which switches the oscillator from an oscillatory to a stationary mode. Examples of simulations of intervocalic voiced and unvoiced stops are given.