
Publication


Featured research published by Steven Greenberg.


Speech Communication | 1998

Robust speech recognition using the modulation spectrogram

Brian Kingsbury; Nelson Morgan; Steven Greenberg

The performance of present-day automatic speech recognition (ASR) systems is seriously compromised by levels of acoustic interference (such as additive noise and room reverberation) representative of real-world speaking conditions. Studies on the perception of speech by human listeners suggest that recognizer robustness might be improved by focusing on temporal structure in the speech signal that appears as low-frequency (below 16 Hz) amplitude modulations in subband channels following critical-band frequency analysis. A speech representation that emphasizes this temporal structure, the “modulation spectrogram”, has been developed. Visual displays of speech produced with the modulation spectrogram are relatively stable in the presence of high levels of background noise and reverberation. Using the modulation spectrogram as a front end for ASR provides a significant improvement in performance on highly reverberant speech. When the modulation spectrogram is used in combination with log-RASTA-PLP (log RelAtive SpecTrAl Perceptual Linear Predictive analysis), performance over a range of noisy and reverberant conditions is significantly improved, suggesting that the use of multiple representations is another promising method for improving the robustness of ASR systems.
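As a rough illustration of the kind of front end described above, the sketch below computes band envelopes and keeps only their slow (below 16 Hz) amplitude modulations. The mel-spaced filterbank standing in for critical-band analysis, the window sizes, and the filter order are assumptions chosen for illustration, not the published front end.

```python
# Minimal sketch of a modulation-spectrogram-style front end (assumptions:
# a mel-spaced filterbank stands in for critical-band analysis, and a
# Butterworth low-pass keeps envelope modulations below 16 Hz).
import numpy as np
from scipy.signal import stft, butter, sosfiltfilt

def modulation_spectrogram(x, fs, n_bands=18, hop_s=0.010):
    nperseg = int(0.025 * fs)                      # 25 ms analysis window
    hop = int(hop_s * fs)
    f, t, Z = stft(x, fs, nperseg=nperseg, noverlap=nperseg - hop)
    power = np.abs(Z) ** 2

    # Group FFT bins into n_bands channels equally spaced on a mel axis.
    mel = 2595.0 * np.log10(1.0 + f / 700.0)
    edges = np.linspace(mel[0], mel[-1], n_bands + 1)
    idx = np.clip(np.digitize(mel, edges) - 1, 0, n_bands - 1)
    band_env = np.sqrt(np.stack([power[idx == b].sum(axis=0)
                                 for b in range(n_bands)]))

    # Retain only the slow (<16 Hz) amplitude modulations of each envelope.
    sos = butter(4, 16.0, btype="low", fs=1.0 / hop_s, output="sos")
    return sosfiltfilt(sos, band_env, axis=1), t
```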


Journal of Phonetics | 2003

Temporal properties of spontaneous speech—a syllable-centric perspective

Steven Greenberg; Hannah Carvey; Leah Hitchcock; Shuangyu Chang

Temporal properties of the speech signal are of potentially great importance for understanding spoken language and may provide significant insight into the manner in which listeners process spoken language with so little apparent effort. It is the thesis of this study that durational properties of phonetic segments differentially reflect the amount of information contained within a syllable, and that syllable prominence is an indirect measure of linguistic entropy. The ability to understand spoken language appears to depend on a broad distribution of syllable duration, ranging between 50 and 400 ms (for American English), which is reflected in the modulation spectrum of the acoustic signal. The upper branch of the modulation spectrum (6–20 Hz) reflects unstressed syllables, while the lower branch (<5 Hz) represents mostly heavily stressed syllables. Low-pass filtering the modulation spectrum reduces the intelligibility of spoken sentences in a manner consistent with the differential contribution of stressed and unstressed syllables to understanding spoken language. The origins of this phenomenon are investigated in terms of the durational properties of phonetic segments contained in a corpus of spontaneous American English telephone dialogues (SWITCHBOARD). Forty-five minutes of this material was manually annotated with respect to stress accent, and the relation between accent level and segmental duration was examined. Statistical analysis indicates that much of the temporal variation observed at the syllabic and phonetic-segment levels can be accounted for in terms of two basic parameters: (1) stress-accent pattern and (2) position of the segment within the syllable. Segments are generally longest in heavily stressed syllables and shortest in syllables without stress. However, the magnitude of accent’s impact on duration varies as a function of syllable position. Duration of the nucleus is heavily affected by stress-accent level—heavily stressed nuclei are, on average, twice as long as their unstressed counterparts, while the duration of the onset is also significantly sensitive to stress, but to a lesser degree. In contrast, stress has relatively little impact on coda duration. This pattern of durational variation suggests that the vocalic nucleus absorbs much of the impact of stress accent.
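The kind of breakdown the statistical analysis describes (mean segment duration by stress-accent level and position within the syllable) can be illustrated with a toy grouping like the one below. The records are invented examples, not SWITCHBOARD data, and the field names are hypothetical.

```python
# Hypothetical illustration of grouping segment durations by stress level
# and syllable position, then reporting mean duration per group.
from collections import defaultdict

segments = [
    # (syllable position, stress level, duration in ms) -- made-up values
    ("onset",   "unstressed", 55), ("nucleus", "unstressed", 60),
    ("onset",   "stressed",   70), ("nucleus", "stressed",  125),
    ("coda",    "unstressed", 65), ("coda",    "stressed",   70),
]

sums = defaultdict(lambda: [0.0, 0])
for position, stress, dur_ms in segments:
    acc = sums[(position, stress)]
    acc[0] += dur_ms
    acc[1] += 1

for (position, stress), (total, n) in sorted(sums.items()):
    print(f"{position:8s} {stress:11s} mean duration = {total / n:.0f} ms")
```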


International Conference on Acoustics, Speech, and Signal Processing | 1997

The modulation spectrogram: in pursuit of an invariant representation of speech

Steven Greenberg; Brian Kingsbury

Understanding the human ability to reliably process and decode speech across a wide range of acoustic conditions and speaker characteristics is a fundamental challenge for current theories of speech perception. Conventional speech representations such as the sound spectrogram emphasize many spectro-temporal details that are not directly germane to the linguistic information encoded in the speech signal and which consequently do not display the perceptual stability characteristic of human listeners. We propose a new representational format, the modulation spectrogram, that discards much of the spectro-temporal detail in the speech signal and instead focuses on the underlying, stable structure incorporated in the low-frequency portion of the modulation spectrum distributed across critical-band-like channels. We describe the representation and illustrate its stability with color-mapped displays and with results from automatic speech recognition experiments.
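One way to make the stability claim above concrete is to correlate a representation computed from clean speech with the same representation computed after adding noise; a higher correlation suggests a more stable representation. The sketch below is a rough check of that kind. The noise level, the correlation metric, and the plain-spectrogram baseline are arbitrary choices for illustration, not the paper's evaluation.

```python
# Rough representational-stability check: correlate clean vs. noisy versions
# of a representation. Compare a plain spectrogram against, e.g., the
# modulation_spectrogram sketch given earlier in this listing.
import numpy as np
from scipy.signal import stft

def representation_stability(clean, fs, represent, snr_db=0.0, seed=0):
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(clean.shape)
    noise *= np.std(clean) / np.std(noise) / (10.0 ** (snr_db / 20.0))
    r_clean = represent(clean, fs)
    r_noisy = represent(clean + noise, fs)
    return np.corrcoef(r_clean.ravel(), r_noisy.ravel())[0, 1]

def plain_spectrogram(x, fs):
    _, _, Z = stft(x, fs, nperseg=int(0.025 * fs))
    return np.abs(Z)

# Example usage (x is a 1-D speech waveform sampled at 16 kHz):
#   representation_stability(x, 16000, plain_spectrogram)
#   representation_stability(x, 16000, lambda x, fs: modulation_spectrogram(x, fs)[0])
```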


International Conference on Acoustics, Speech, and Signal Processing | 1998

Incorporating information from syllable-length time scales into automatic speech recognition

Su-Lin Wu; E.D. Kingsbury; Nelson Morgan; Steven Greenberg

Including information distributed over intervals of syllabic duration (100-250 ms) may greatly improve the performance of automatic speech recognition (ASR) systems. ASR systems primarily use representations and recognition units covering phonetic durations (40-100 ms). Humans certainly use information at phonetic time scales, but results from psychoacoustics and psycholinguistics highlight the crucial role of the syllable, and syllable-length intervals, in speech perception. We compare the performance of three ASR systems: a baseline system that uses phone-scale representations and units, an experimental system that uses a syllable-oriented front-end representation and syllabic units for recognition, and a third system that combines the phone-scale and syllable-scale recognizers by merging and rescoring N-best lists. Using the combined recognition system, we observed an improvement in word error rate for telephone-bandwidth, continuous numbers from 6.8% to 5.5% on a clean test set, and from 27.8% to 19.6% on a reverberant test set, over the baseline phone-based system.
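The combination step described above (merging N-best lists from the phone-scale and syllable-scale recognizers and rescoring them) can be sketched as below. The weights, back-off penalty, and score values are arbitrary toy choices; the real systems combined full recognizer scores.

```python
# Toy sketch of merging and rescoring two N-best lists with a weighted
# combination of per-recognizer log scores (all values hypothetical).
def merge_and_rescore(nbest_phone, nbest_syllable, w_phone=0.5):
    """Each N-best list maps a hypothesis string to a log score."""
    merged = {}
    for hyp in set(nbest_phone) | set(nbest_syllable):
        # Back off to a large penalty when one recognizer lacks the hypothesis.
        s_p = nbest_phone.get(hyp, -1e3)
        s_s = nbest_syllable.get(hyp, -1e3)
        merged[hyp] = w_phone * s_p + (1.0 - w_phone) * s_s
    return max(merged, key=merged.get), merged

best, scores = merge_and_rescore(
    {"one two three": -12.0, "one too three": -13.5},
    {"one two three": -10.0, "won two three": -11.0},
)
print(best)  # "one two three" under these made-up scores
```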


International Conference on Acoustics, Speech, and Signal Processing | 2005

Landmark-based speech recognition: report of the 2004 Johns Hopkins summer workshop

Mark Hasegawa-Johnson; James Baker; Sarah Borys; Ken Chen; Emily Coogan; Steven Greenberg; Katrin Kirchhoff; Karen Livescu; Srividya Mohan; Jennifer Muller; M. Kemal Sönmez; Tianyu Wang

Three research prototype speech recognition systems are described, all of which use recently developed methods from artificial intelligence (specifically, support vector machines (SVMs), dynamic Bayesian networks, and maximum entropy classification) in order to implement, in the form of an ASR system, current theories of human speech perception and phonology. All systems begin with a high-dimensional multiframe acoustic-to-distinctive-feature transformation, implemented using SVMs trained to detect and classify acoustic-phonetic landmarks. Distinctive feature probabilities estimated by the SVMs are then integrated using one of three pronunciation models: a dynamic programming algorithm that assumes canonical pronunciation of each word, a dynamic Bayesian network implementation of articulatory phonology, or a discriminative pronunciation model trained using the methods of maximum entropy classification. Log probability scores computed by these models are then combined, using log-linear combination, with other word scores available in the lattice output of a first-pass recognizer, and the resulting combination score is used to compute a second-pass speech recognition output.
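The final rescoring step, a log-linear combination of per-hypothesis scores, can be illustrated as below. The score names, values, and weights are hypothetical placeholders; the workshop systems combined scores from a real first-pass lattice and trained the weights rather than fixing them by hand.

```python
# Illustrative log-linear combination of word-hypothesis scores (log domain).
def loglinear_score(scores, weights):
    """scores and weights: dicts keyed by score name."""
    return sum(weights[name] * scores[name] for name in weights)

hypothesis_scores = {
    "acoustic_lm": -42.0,          # first-pass lattice score (hypothetical value)
    "landmark_canonical": -7.5,    # dynamic-programming pronunciation model
    "landmark_artic_phon": -6.8,   # articulatory-phonology DBN model
    "landmark_maxent": -7.1,       # discriminative (maximum entropy) model
}
weights = {"acoustic_lm": 1.0, "landmark_canonical": 0.3,
           "landmark_artic_phon": 0.3, "landmark_maxent": 0.4}
print(loglinear_score(hypothesis_scores, weights))
```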


International Conference on Acoustics, Speech, and Signal Processing | 1997

Integrating syllable boundary information into speech recognition

Su-Lin Wu; Michael L. Shire; Steven Greenberg; Nelson Morgan

We examine the proposition that knowledge of the timing of syllabic onsets may be useful in improving the performance of speech recognition systems. A method of estimating the location of syllable onsets derived from the analysis of energy trajectories in critical band channels has been developed, and a syllable-based decoder has been designed and implemented that incorporates this onset information into the speech recognition process. For a small, continuous speech recognition task the addition of artificial syllabic onset information (derived from advance knowledge of the word transcriptions) lowers the word error rate by 38%. Incorporating acoustically-derived syllabic onset information reduces the word error rate by 10% on the same task. The latter experiment has highlighted representational issues on coordinating acoustic and lexical syllabifications, a topic we are beginning to explore.
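A rough sketch of onset estimation from energy trajectories is given below: sum smoothed frame energy, differentiate to find rises, and pick well-separated peaks. The filterbank treatment (total frame energy rather than per-critical-band trajectories), the smoothing cutoff, and the peak-picking rule are all assumptions, not the published algorithm.

```python
# Rough syllable-onset estimator from smoothed energy rises (assumptions noted above).
import numpy as np
from scipy.signal import stft, butter, sosfiltfilt, find_peaks

def syllable_onsets(x, fs, hop_s=0.010):
    nperseg = int(0.025 * fs)
    hop = int(hop_s * fs)
    f, t, Z = stft(x, fs, nperseg=nperseg, noverlap=nperseg - hop)
    energy = (np.abs(Z) ** 2).sum(axis=0)          # total energy per frame

    # Smooth to roughly syllable rate (<8 Hz) and keep positive energy rises.
    sos = butter(2, 8.0, btype="low", fs=1.0 / hop_s, output="sos")
    env = sosfiltfilt(sos, energy)
    rise = np.clip(np.diff(env, prepend=env[0]), 0.0, None)

    peaks, _ = find_peaks(rise, distance=int(0.10 / hop_s))  # at least 100 ms apart
    return t[peaks]                                 # candidate onset times (s)
```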


International Conference on Acoustics, Speech, and Signal Processing | 1998

Speech intelligibility in the presence of cross-channel spectral asynchrony

Takayuki Arai; Steven Greenberg

The spectrum of spoken sentences was partitioned into quarter-octave channels and the onset of each channel shifted in time relative to the others so as to desynchronize spectral information across the frequency axis. Human listeners are remarkably tolerant of cross-channel spectral asynchrony induced in this fashion. Speech intelligibility remains relatively unimpaired until the average asynchrony spans three or more phonetic segments. Such perceptual robustness is correlated with the magnitude of the low-frequency (3-6 Hz) modulation spectrum and thus highlights the importance of syllabic segmentation and analysis for robust processing of spoken language. High-frequency channels (>1.5 kHz) play a particularly important role when the spectral asynchrony is sufficiently large as to significantly reduce the power in the low-frequency modulation spectrum (analogous to acoustic reverberation) and may thereby account for the deterioration of speech intelligibility among the hearing impaired under conditions of acoustic interference (such as background noise and reverberation) characteristic of the real world.
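The desynchronization manipulation described above can be sketched as follows: split the signal into quarter-octave bands and delay each band by a random amount before recombining. The filterbank design, band limits, and shift distribution are assumptions for illustration, not the experiment's exact parameters.

```python
# Sketch of cross-channel spectral asynchrony: quarter-octave bands, each
# delayed by a random onset shift before summation.
import numpy as np
from scipy.signal import butter, sosfilt

def desynchronize(x, fs, f_lo=100.0, f_hi=6000.0, max_shift_s=0.14, seed=0):
    rng = np.random.default_rng(seed)
    f_hi = min(f_hi, 0.45 * fs)                         # stay below Nyquist
    n_bands = int(np.floor(4 * np.log2(f_hi / f_lo)))   # quarter-octave spacing
    edges = f_lo * 2.0 ** (np.arange(n_bands + 1) / 4.0)
    max_shift = int(max_shift_s * fs)
    out = np.zeros(len(x) + max_shift)
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        band = sosfilt(sos, x)
        shift = rng.integers(0, max_shift + 1)           # random onset delay per band
        out[shift:shift + len(x)] += band
    return out
```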


Journal of the Acoustical Society of America | 2007

Integration efficiency for speech perception within and across sensory modalities by normal-hearing and hearing-impaired individuals.

Ken W. Grant; Jennifer B. Tufts; Steven Greenberg

In face-to-face speech communication, the listener extracts and integrates information from the acoustic and optic speech signals. Integration occurs within the auditory modality (i.e., across the acoustic frequency spectrum) and across sensory modalities (i.e., across the acoustic and optic signals). The difficulties experienced by some hearing-impaired listeners in understanding speech could be attributed to losses in the extraction of speech information, the integration of speech cues, or both. The present study evaluated the ability of normal-hearing and hearing-impaired listeners to integrate speech information within and across sensory modalities in order to determine the degree to which integration efficiency may be a factor in the performance of hearing-impaired listeners. Auditory-visual nonsense syllables consisting of eighteen medial consonants surrounded by the vowel [a] were processed into four nonoverlapping acoustic filter bands between 300 and 6000 Hz. A variety of one, two, three, and four filter-band combinations were presented for identification in auditory-only and auditory-visual conditions; a visual-only condition was also included. Integration efficiency was evaluated using a model of optimal integration. Results showed that normal-hearing and hearing-impaired listeners integrated information across the auditory and visual sensory modalities with a high degree of efficiency, independent of differences in auditory capabilities. However, across-frequency integration for auditory-only input was less efficient for hearing-impaired listeners. These individuals exhibited particular difficulty extracting information from the highest frequency band (4762-6000 Hz) when speech information was presented concurrently in the next lower-frequency band (1890-2381 Hz). Results suggest that integration of speech information within the auditory modality, but not across auditory and visual modalities, affects speech understanding in hearing-impaired listeners.
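The study evaluated efficiency against a specific model of optimal integration, which is not reproduced here. As a generic illustration only, the sketch below combines auditory and visual evidence for consonant identity under an assumption that the two modalities contribute independent information; all likelihood values are invented.

```python
# Generic independent-cues fusion (NOT the study's model):
#   P(c | A, V) is proportional to P(A | c) * P(V | c) * P(c)
import numpy as np

def fuse(p_a_given_c, p_v_given_c, prior):
    joint = p_a_given_c * p_v_given_c * prior
    return joint / joint.sum()

consonants = ["b", "d", "g"]
p_a = np.array([0.5, 0.3, 0.2])    # hypothetical auditory likelihoods
p_v = np.array([0.7, 0.2, 0.1])    # hypothetical visual likelihoods
prior = np.ones(3) / 3             # uniform prior over the response set
print(dict(zip(consonants, fuse(p_a, p_v, prior).round(3))))
```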


Journal of the Acoustical Society of America | 1987

Responses of auditory-nerve fibers to multiple-tone complexes

Li Deng; C. Daniel Geisler; Steven Greenberg

To relate level-dependent properties of auditory-nerve-fiber responses to nasal consonant-vowels to the basic frequency-selective and suppressive properties of the fibers, multitone complexes, with the amplitude of a single (probe) component incremented, were used as stimuli. Quantitative relations were obtained between the systematic increase of fiber synchrony to the probe tone and the decrease of synchrony to the characteristic frequency (CF), as the amplitude of the probe tone was increased. When such relations are interpreted as a measure of fiber frequency selectivity based on a relative synchrony criterion, a breadth of frequency tuning is obtained, at a multitone sound-pressure level of 70 dB SPL, which is generally broader than that of the fiber's threshold tuning curve. Quantitative comparisons with the same fibers' responses to the nasal speech sounds indicate that a fiber's speech responses share some common features with its probe-tone responses.
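Fiber synchrony to a stimulus component is commonly quantified by the synchronization index (vector strength) of the spike times at that component's frequency. The sketch below shows that standard computation; the spike times are invented, and the paper's specific analysis windows and criteria are not reproduced.

```python
# Vector strength of spike times relative to a stimulus component frequency.
import numpy as np

def vector_strength(spike_times_s, freq_hz):
    phases = 2.0 * np.pi * freq_hz * np.asarray(spike_times_s)
    return np.abs(np.exp(1j * phases).mean())   # 1 = perfect phase locking, 0 = none

spikes = np.sort(np.random.default_rng(0).uniform(0.0, 0.5, size=200))
print(vector_strength(spikes, 1000.0))          # near 0 for these random spike times
```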


Journal of the Acoustical Society of America | 1998

Speech intelligibility is highly tolerant of cross‐channel spectral asynchrony

Steven Greenberg; Takayuki Arai

A detailed auditory analysis of the short‐term acoustic spectrum is generally considered essential for understanding spoken language. This assumption is called into question by the results of an experiment in which the spectrum of spoken sentences (from the TIMIT corpus) was partitioned into quarter‐octave channels and the onset of each channel shifted in time relative to the others so as to desynchronize spectral information across the frequency plane. Intelligibility of sentential material (as measured in terms of word accuracy) is unaffected by a (maximum) onset jitter of 80 ms or less and remains high (>75%) even for jitter intervals of 140 ms. Only when the jitter imposed across channels exceeds 220 ms does intelligibility fall below 50%. These results imply that the cues required to understand spoken language are not optimally specified in the short‐term spectral domain, but may rather be based on some other set of representational cues such as the modulation spectrogram [S. Greenberg and B. Kingsbury, Proc. IEEE ICASSP (1997), pp. 1647–1650]. Consistent with this hypothesis is the fact that intelligibility (as a function of onset‐jitter interval) is highly correlated with the magnitude of the modulation spectrum between 3 and 8 Hz.
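The correlate cited above is the magnitude of the modulation spectrum between 3 and 8 Hz. A rough way to compute that quantity from a speech waveform is sketched below; the envelope extraction (Hilbert magnitude, downsampled to 100 Hz) and the normalization are assumptions, not the study's exact analysis.

```python
# Fraction of envelope modulation magnitude falling in the 3-8 Hz range.
import numpy as np
from scipy.signal import hilbert, resample_poly

def modulation_magnitude_3_8hz(x, fs, env_fs=100):
    env = np.abs(hilbert(x))                       # broadband amplitude envelope
    env = resample_poly(env, env_fs, int(fs))      # envelope sampled at env_fs Hz
    spec = np.abs(np.fft.rfft(env - env.mean()))
    freqs = np.fft.rfftfreq(len(env), d=1.0 / env_fs)
    band = (freqs >= 3.0) & (freqs <= 8.0)
    return spec[band].sum() / max(spec.sum(), 1e-12)
```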

Collaboration


Dive into Steven Greenberg's collaborations.

Top Co-Authors

Shuangyu Chang, University of California
Nelson Morgan, University of California
Leah Hitchcock, International Computer Science Institute
Su-Lin Wu, University of California
James T. Marsh, University of California
Rosaria Silipo, University of California