Mark A. Clements
Georgia Institute of Technology
Publications
Featured research published by Mark A. Clements.
IEEE Transactions on Signal Processing | 1991
John H. L. Hansen; Mark A. Clements
The basis of an improved form of iterative speech enhancement for single-channel inputs is sequential maximum a posteriori estimation of the speech waveform and its all-pole parameters, followed by imposition of constraints upon the sequence of speech spectra. The approaches impose intraframe and interframe constraints on the input speech signal. Properties of the line spectral pair representation of speech allow for an efficient and direct procedure for applying many of the constraints. Substantial improvement over the unconstrained method is observed in a variety of domains. Informal listener quality evaluation tests and objective speech quality measures demonstrate the technique's effectiveness for additive white Gaussian noise. A consistent terminating point of the iterative technique is shown. The current systems result in substantially improved speech quality and linear predictive coding (LPC) parameter estimation with only a minor increase in computational requirements. The algorithms are evaluated with respect to improving automatic recognition of speech in the presence of additive noise and are shown to outperform other enhancement methods in this application.
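To make the constrained iteration concrete, the following is a minimal Python sketch of the idea, not the published algorithm: a recursive smoothing of all-pole envelopes across frames stands in for the paper's line spectral pair constraints, and the noise spectrum is estimated naively from the first few frames. Function names and parameter values are illustrative assumptions.

```python
import numpy as np

def lpc_envelope(x, order, nfft):
    """All-pole power envelope via autocorrelation LPC (Levinson-Durbin)."""
    r = np.correlate(x, x, "full")[len(x) - 1:][: order + 1]
    a = np.zeros(order + 1); a[0] = 1.0
    e = r[0] + 1e-12
    for i in range(1, order + 1):
        k = -(r[i] + a[1:i] @ r[i - 1:0:-1]) / e
        a[1:i + 1] += k * a[0:i][::-1]
        e *= 1.0 - k * k
    return e / np.maximum(np.abs(np.fft.rfft(a, nfft)) ** 2, 1e-12)

def constrained_iterative_enhance(noisy, n_iter=4, order=10,
                                  flen=256, hop=128, alpha=0.8):
    """Iterative Wiener filtering; recursive inter-frame smoothing of the
    all-pole envelopes stands in for the paper's LSP constraints."""
    win = np.hanning(flen)
    n_frames = 1 + (len(noisy) - flen) // hop
    # crude noise PSD estimate from the first five frames (assumed speech-free)
    noise_psd = np.mean([np.abs(np.fft.rfft(noisy[t*hop:t*hop+flen] * win))**2
                         for t in range(5)], axis=0)
    est = noisy.copy()
    for _ in range(n_iter):
        env = np.array([lpc_envelope(est[t*hop:t*hop+flen] * win, order, flen)
                        for t in range(n_frames)])
        for t in range(1, n_frames):              # inter-frame constraint
            env[t] = alpha * env[t-1] + (1 - alpha) * env[t]
        out = np.zeros_like(est)
        wsum = np.zeros_like(est) + 1e-12
        for t in range(n_frames):
            S = np.fft.rfft(noisy[t*hop:t*hop+flen] * win)
            gain = env[t] / (env[t] + noise_psd)  # Wiener gain per bin
            out[t*hop:t*hop+flen] += np.fft.irfft(gain * S, flen) * win
            wsum[t*hop:t*hop+flen] += win ** 2
        est = out / wsum
    return est
```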
IEEE Transactions on Speech and Audio Processing | 1995
Ron Cole; L. Hirschman; L. Atlas; M. Beckman; Alan W. Biermann; M. Bush; Mark A. Clements; L. Cohen; Oscar N. Garcia; B. Hanson; Hynek Hermansky; S. Levinson; Kathleen R. McKeown; Nelson Morgan; David G. Novick; Mari Ostendorf; Sharon L. Oviatt; Patti Price; Harvey F. Silverman; J. Spitz; Alex Waibel; Cliff Weinstein; Stephen A. Zahorian; Victor W. Zue
A spoken language system combines speech recognition, natural language processing and human interface technology. It functions by recognizing the person's words, interpreting the sequence of words to obtain a meaning in terms of the application, and providing an appropriate response back to the user. Potential applications of spoken language systems range from simple tasks, such as retrieving information from an existing database (traffic reports, airline schedules), to interactive problem-solving tasks involving complex planning and reasoning (travel planning, traffic routing), to support for multilingual interactions. We examine eight key areas in which basic research is needed to produce spoken language systems: (1) robust speech recognition; (2) automatic training and adaptation; (3) spontaneous speech; (4) dialogue models; (5) natural language response generation; (6) speech synthesis and speech generation; (7) multilingual systems; and (8) interactive multimodal systems. In each area, we identify key research challenges, the infrastructure needed to support research, and the expected benefits. We conclude by reviewing the need for multidisciplinary research, for development of shared corpora and related resources, for computational support, and for rapid communication among researchers. The successful development of this technology will increase accessibility of computers to a wide range of users, will facilitate multinational communication and trade, and will create new research specialties and jobs in this rapidly expanding area.
IEEE Transactions on Biomedical Engineering | 2008
Elliot Moore; Mark A. Clements; John W. Peifer; Lydia Weisser
This work is motivated by the current lack of objective tools for the clinical analysis of emotional disorders. The study examines a broad range of objectively measurable features for use in discriminating depressed speech. Analysis is based on features related to prosodics, the vocal tract, and parameters extracted directly from the glottal waveform. Discrimination of the depressed speech was based on a feature selection strategy utilizing the following combinations of feature domains: prosodic measures alone, prosodic and vocal tract measures, prosodic and glottal measures, and all three domains. The combination of glottal and prosodic features produced better discrimination overall than the combination of prosodic and vocal tract features. Analysis of the discriminating feature sets used in the study clearly indicates that glottal descriptors are vital components of vocal affect analysis.
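The feature-domain comparison can be sketched in a few lines; the abstract does not name a classifier, so linear discriminant analysis with cross-validation stands in, and the feature matrices below are random placeholders rather than real prosodic, vocal tract, or glottal measurements.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 120  # hypothetical number of utterances
domains = {"prosodic":    rng.normal(size=(n, 8)),   # placeholder features
           "vocal_tract": rng.normal(size=(n, 10)),
           "glottal":     rng.normal(size=(n, 6))}
y = rng.integers(0, 2, size=n)  # 1 = depressed speech (placeholder labels)

# the four domain combinations compared in the study
combos = [("prosodic",), ("prosodic", "vocal_tract"),
          ("prosodic", "glottal"), ("prosodic", "vocal_tract", "glottal")]
for combo in combos:
    X = np.hstack([domains[d] for d in combo])
    acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
    print(f"{'+'.join(combo):32s} CV accuracy: {acc:.3f}")
```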
Journal of the Acoustical Society of America | 1995
Kathleen E. Cummings; Mark A. Clements
The problems of automatic recognition and synthesis of multistyle speech have become important topics of research in recent years. This paper reports an extensive investigation of the variations that occur in the glottal excitation of eleven commonly encountered speech styles. Glottal waveforms were extracted from utterances of non-nasalized vowels for two speakers for each of the eleven speaking styles. The extracted waveforms were parametrized into four duration-related and two slope-related values. Using these six parameters, the glottal waveforms from the eleven styles of speech were analyzed both qualitatively and quantitatively. The glottal waveforms from each style of speech have been shown to be significantly and identifiably different from all other styles, thereby confirming the importance of the glottal waveform in conveying speech style information and in causing speech waveform variations. The degree of variation in styled glottal waveforms has been shown to be consistent when trained on one speaker and compared with another.
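The six-parameter description can be illustrated with a sketch like the one below; the paper's exact parameter definitions are not given in the abstract, so plausible duration and slope measures over one extracted glottal-flow cycle are assumed.

```python
import numpy as np

def glottal_cycle_params(g, fs):
    """Four duration-related and two slope-related values for one
    extracted glottal-flow cycle (illustrative definitions only)."""
    t_peak = int(np.argmax(g))                       # instant of maximum flow
    dg = np.gradient(g) * fs                         # flow derivative
    t_close = t_peak + int(np.argmin(dg[t_peak:]))   # steepest closing point
    T = len(g) / fs
    return {
        "open_phase":    t_peak / fs,                # opening duration
        "return_phase":  (t_close - t_peak) / fs,    # closing duration
        "closed_phase":  T - t_close / fs,           # rest of the cycle
        "period":        T,
        "opening_slope": dg[:t_peak].max() if t_peak > 0 else 0.0,
        "closing_slope": dg[t_peak:].min(),
    }

# crude one-cycle pulse as a stand-in for an extracted waveform
fs = 16000
t = np.arange(int(0.008 * fs)) / fs
g = np.maximum(np.sin(np.pi * t / 0.006), 0.0) ** 2
print(glottal_cycle_params(g, fs))
```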
Signal Processing | 1988
John H. L. Hansen; Mark A. Clements
This work addresses the problem of automatic speech recognition in noisy, stressful environments. The main contributions include a comprehensive and unified investigation which revealed new and statistically reliable acoustic correlates of speech under stress, the formulation of a new class of constrained iterative speech enhancement algorithms, and the achievement of robust automatic speech recognition through the development of speech enhancement and stress compensation programs. The first goal of improving recognition of speech produced under stressful conditions was accomplished through extensive investigations revealing new and statistically reliable acoustic correlates of speech under stress. Analysis was performed on (i) speech with simulated stress, (ii) speech from stress-inducing workload tasks or speech in noise, and (iii) speech produced under actual stress or emotional conditions. Characteristics from five speech production domains were addressed (pitch, glottal source, duration, intensity, and vocal-tract shaping). Statistical evaluation ascertained the reliability of variation in the average, variability, and distribution of each speech parameter as an indicator of stress. A new class of constrained iterative speech enhancement algorithms was formulated for the purpose of improving recognition performance in noisy environments. The new approaches apply inter- and intra-frame spectral constraints in the estimation procedure to ensure optimum speech quality across all speech classes. Constraints are applied based on the presence of perceptually important speech characteristics obtained during the enhancement procedure. The algorithms are preferable to existing techniques in several respects: (i) they result in substantially improved speech quality and parameter estimation over past techniques for additive white noise distortion, (ii) they have been extended and shown to perform well on non-stationary colored noise, and (iii) they possess a more consistent terminating criterion which was previously unavailable. The final goal of robust recognition in noisy, stressful environments was addressed through the formulation of enhancement and stress compensation preprocessors. Enhancement preprocessors were shown to improve recognition performance for neutral speech over past enhancement techniques for all signal-to-noise ratios considered. Stress compensation algorithms are shown to reduce stress effects prior to recognition. Finally, combined speech enhancement and stress compensation preprocessing is shown to be extremely effective in reducing and even eliminating effects caused by stress and noise for robust automatic recognition.
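The statistical evaluation described here, testing each speech parameter's average, variability, and distribution as an indicator of stress, can be pictured with a small sketch; the specific tests below are assumptions, since the abstract does not name them.

```python
import numpy as np
from scipy import stats

def stress_correlate_tests(neutral, stressed):
    """Test one speech-production parameter (e.g. per-frame pitch) for
    stress-related differences in average, variability, and distribution.
    The choice of tests is illustrative, not taken from the source."""
    return {
        "average":      stats.ttest_ind(neutral, stressed).pvalue,
        "variability":  stats.levene(neutral, stressed).pvalue,
        "distribution": stats.ks_2samp(neutral, stressed).pvalue,
    }

rng = np.random.default_rng(0)
neutral = rng.normal(120.0, 10.0, 500)    # placeholder pitch values (Hz)
stressed = rng.normal(140.0, 25.0, 500)
print(stress_correlate_tests(neutral, stressed))
```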
computer vision and pattern recognition | 2013
James M. Rehg; Gregory D. Abowd; Agata Rozga; Mario Romero; Mark A. Clements; Stan Sclaroff; Irfan A. Essa; Opal Ousley; Yin Li; Chanho Kim; Hrishikesh Rao; Jonathan C. Kim; Liliana Lo Presti; Jianming Zhang; Denis Lantsman; Jonathan Bidwell; Zhefan Ye
We introduce a new problem domain for activity recognition: the analysis of children's social and communicative behaviors based on video and audio data. We specifically target interactions between children aged 1-2 years and an adult. Such interactions arise naturally in the diagnosis and treatment of developmental disorders such as autism. We introduce a new publicly available dataset containing over 160 sessions of a 3-5 minute child-adult interaction. In each session, the adult examiner followed a semi-structured play interaction protocol which was designed to elicit a broad range of social behaviors. We identify the key technical challenges in analyzing these behaviors, and describe methods for decoding the interactions. We present experimental results that demonstrate the potential of the dataset to drive interesting research questions, and show preliminary results for multi-modal activity recognition.
EURASIP Journal on Advances in Signal Processing | 2002
Xiaozheng Zhang; Charles C. Broun; Russell M. Mersereau; Mark A. Clements
There has been growing interest in introducing speech as a new modality into the human-computer interface (HCI). Motivated by the multimodal nature of speech, the visual component is considered to yield information that is not always present in the acoustic signal and enables improved system performance over acoustic-only methods, especially in noisy environments. In this paper, we investigate the usefulness of visual speech information in HCI-related applications. We first introduce a new algorithm for automatically locating the mouth region by using color and motion information and segmenting the lip region by making use of both color and edge information based on Markov random fields. We then derive a relevant set of visual speech parameters and incorporate them into a recognition engine. We present various visual feature performance comparisons to explore their impact on the recognition accuracy, including the lip inner contour and the visibility of the tongue and teeth. By using a common visual feature set, we demonstrate two applications that exploit speechreading in joint audio-visual speech signal processing tasks: speech recognition and speaker verification. The experimental results based on two databases demonstrate that the visual information is highly effective for improving recognition performance over a variety of acoustic noise levels.
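The mouth-localization step can be caricatured in a few lines; the sketch below combines a lip-redness cue with frame-difference motion, whereas the actual system uses color and edge information in a Markov random field for the final lip segmentation. The thresholds and the redness measure are assumptions.

```python
import numpy as np

def locate_mouth(frame_rgb, prev_rgb, color_thresh=0.15, motion_thresh=0.05):
    """Bounding box of a candidate mouth region from color + motion cues."""
    f = frame_rgb.astype(np.float64) / 255.0
    p = prev_rgb.astype(np.float64) / 255.0
    redness = f[..., 0] - 0.5 * (f[..., 1] + f[..., 2])    # lips: red vs. skin
    moving = np.abs(f - p).mean(axis=-1) > motion_thresh   # mouth moves in speech
    mask = (redness > color_thresh) & moving
    if not mask.any():
        return None
    ys, xs = np.nonzero(mask)
    return xs.min(), ys.min(), xs.max(), ys.max()
```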
Medical Engineering & Physics | 2002
Robert W. Morris; Mark A. Clements
This paper investigates a method for the real-time reconstruction of normal speech from whispers. This system could be used by aphonic individuals as a voice prosthesis. It could also provide improved verbal communication when normal speech is not appropriate. The normal speech is synthesized using the mixed excitation linear prediction model. Differences between whispered and phonated speech are discussed, and methods for estimating the parameters of this model from whispered speech for real-time synthesis are proposed. This includes smoothing the noisy linear prediction spectra, modifying the formants, and synthesizing the excitation signal. Trade-offs between computational complexity, delay, and accuracy of the different methods are discussed.
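Two of the estimation steps can be sketched briefly. The mixed excitation below follows the general MELP idea (pulse train plus noise), but for whispers the pitch and voicing level must simply be imposed, and the coefficient-track median filter is only a stand-in for the paper's spectral smoothing.

```python
import numpy as np
from scipy.signal import lfilter, medfilt

def smooth_lpc_tracks(a_frames, k=5):
    """Median-filter each LPC coefficient across frames to tame the noisy
    whisper spectra. (Illustrative: smoothing raw coefficients can give
    unstable filters; a domain like line spectral frequencies is safer.)"""
    return np.apply_along_axis(lambda c: medfilt(c, k), 0, np.asarray(a_frames))

def mixed_excitation(n, f0, fs, voicing=0.7):
    """MELP-style excitation: pulse train plus noise. No pitch or voicing is
    measurable in whispers, so both values here must be chosen."""
    period = int(fs / f0)
    pulses = np.zeros(n); pulses[::period] = 1.0
    noise = np.random.randn(n) / np.sqrt(period)
    return voicing * pulses + (1.0 - voicing) * noise

def synth_frame(lpc_a, gain, f0, fs, n):
    """One frame of phonated output from whisper-derived LPC coefficients."""
    return gain * lfilter([1.0], lpc_a, mixed_excitation(n, f0, fs))
```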
International Journal of Speech Technology | 2002
Peter S. Cardillo; Mark A. Clements; Michael S. Miller
A new technique is presented for searching digital audio at the word/phrase level. Unlike previous methods based upon Large Vocabulary Continuous Speech Recognition (LVCSR, with inherent problems of closed vocabulary and high word error rate), phonetic searching combines high speed and accuracy, supports open vocabulary, imposes a low penalty for new words, permits phonetic and inexact spelling, enables user-determined depth of search, and is amenable to parallel execution for highly scalable deployment. A detailed comparison of accuracy between phonetic searching and one popular embodiment of LVCSR is presented along with other operating characteristics of the new technique. The current implementation for Digital Media Asset Management (DMAM) is described along with suggested applications in other domains.
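The open-vocabulary, inexact-match behavior can be illustrated with a toy approximate substring search over a decoded phone stream; real phonetic search scores phone lattices or posteriors rather than a single best phone sequence, so this is only a sketch of the matching idea.

```python
def phonetic_search(phones, query, max_cost=1):
    """End indices in `phones` where `query` matches with edit cost
    <= max_cost (Sellers' approximate substring matching)."""
    m, n = len(query), len(phones)
    prev = [0] * (n + 1)            # row 0 all zeros: match may start anywhere
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1,                      # skip query phone
                         cur[j - 1] + 1,                   # skip stream phone
                         prev[j - 1] + (query[i-1] != phones[j-1]))
        prev = cur
    return [j - 1 for j in range(1, n + 1) if prev[j] <= max_cost]

stream = "hh eh l ow w er l d".split()
print(phonetic_search(stream, "eh l ow".split()))  # -> [2, 3, 4]; exact hit ends at 3
```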
IEEE Transactions on Speech and Audio Processing | 1997
Michael W. Macon; Mark A. Clements
Although sinusoidal models have been shown to be useful for time-scale and pitch modification of voiced speech, objectionable artifacts often arise when such models are applied to unvoiced speech. This article presents a sinusoidal model-based speech modification algorithm that preserves the natural character of unvoiced speech sounds after pitch and time-scale modification, eliminating commonly encountered artifacts. This advance is accomplished via a perceptually motivated modulation of the sinusoidal component phases that mitigates artifacts in the reconstructed signal after time-scale and pitch modification.
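A caricature of the phase treatment: during time-scale modification, each sinusoidal component's phase track receives a random modulation so that stretched unvoiced sounds keep a noise-like character instead of turning tonal. The paper derives a perceptually motivated modulation; the low-frequency jitter below is purely illustrative.

```python
import numpy as np

def stretch_frame(amps, freqs, phases, stretch, fs, flen, depth=0.8):
    """Time-stretch one sinusoidal-model frame, jittering component phases."""
    n = int(flen * stretch)
    t = np.arange(n) / fs
    rng = np.random.default_rng(1)
    out = np.zeros(n)
    for a, f, ph in zip(amps, freqs, phases):
        # random slow phase modulation, one draw per component
        jitter = depth * rng.standard_normal() * np.sin(2 * np.pi * 20.0 * t)
        out += a * np.cos(2 * np.pi * f * t + ph + jitter)
    return out
```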