Robert G. Malkin
Carnegie Mellon University
Publications
Featured research published by Robert G. Malkin.
CLEaR | 2006
Andrey Temko; Robert G. Malkin; Christian Zieger; Dusan Macho; Climent Nadeu; Maurizio Omologo
In this paper, we present the results of the Acoustic Event Detection (AED) and Classification (AEC) evaluations carried out in February 2006 by the three participating partners from the CHIL project. The primary evaluation task was AED on the testing portions of the isolated sound databases and seminar recordings produced in CHIL. Additionally, a secondary AEC evaluation task was designed using only the isolated sound databases. The set of meeting-room acoustic event classes and the metrics were agreed upon by the three partners, and ELDA was in charge of the scoring task. The various systems developed for the AED and AEC tasks, together with their results, are presented.
international conference on acoustics, speech, and signal processing | 2002
Chiori Hori; Sadaoki Furui; Robert G. Malkin; Hua Yu; Alex Waibel
This paper reports an automatic speech summarization method and experimental results on English broadcast news speech. In our proposed method, a set of words maximizing a summarization score, which indicates the appropriateness of a summary, is extracted from automatically transcribed speech. This extraction is performed using a Dynamic Programming (DP) technique according to a target compression ratio. We have previously tested the performance of our method on Japanese broadcast news speech; since the method is based on a statistical approach, it can be applied to any language. In this paper, English broadcast news speech transcribed by a speech recognizer is automatically summarized. To apply our method to English, we modify the model that estimates word concatenation probabilities based on a dependency structure in the original speech, given by a Stochastic Dependency Context Free Grammar (SDCFG). A summarization method for multiple utterances using a two-level DP technique is also proposed.
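The DP extraction step lends itself to a compact illustration. Below is a minimal sketch (not the authors' implementation): it assumes each transcribed word already carries a significance score and that pairwise word-concatenation scores are available as stand-ins for the SDCFG-based linguistic scores, and it selects the word sequence of the target length with the highest total score while preserving word order.

```python
# Minimal sketch of DP word extraction under a target compression ratio.
# "sig" and "concat" are hypothetical stand-ins for the per-word significance
# and word-concatenation scores described in the paper.
import numpy as np

def summarize(words, sig, concat, ratio=0.4):
    """Pick round(ratio * len(words)) words maximizing the total score."""
    n = len(words)
    m = max(1, round(ratio * n))
    # dp[k, j]: best score of a (k+1)-word summary whose last word is j
    dp = np.full((m, n), -np.inf)
    back = np.zeros((m, n), dtype=int)
    dp[0, :] = sig                                  # summaries of length 1
    for k in range(1, m):
        for j in range(k, n):
            cand = dp[k - 1, :j] + concat[:j, j]    # extend from any earlier word
            best = int(np.argmax(cand))
            dp[k, j] = cand[best] + sig[j]
            back[k, j] = best
    # Trace back from the best final word to recover the selected indices.
    j = int(np.argmax(dp[m - 1]))
    picked = [j]
    for k in range(m - 1, 0, -1):
        j = back[k, j]
        picked.append(j)
    return [words[i] for i in reversed(picked)]

words = "the president said the economy is growing quickly this year".split()
sig = np.random.rand(len(words))                    # toy scores for illustration
concat = np.random.rand(len(words), len(words))
print(summarize(words, sig, concat, ratio=0.4))
```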
ACM Transactions on Multimedia Computing, Communications, and Applications | 2007
Datong Chen; Jie Yang; Robert G. Malkin; Howard D. Wactlar
Social interaction plays an important role in our daily lives. It is one of the most important indicators of physical or mental changes in aging patients. In this article, we investigate the problem of detecting social interaction patterns of patients in a skilled nursing facility using audio/visual records. Our studies consist of both a “Wizard of Oz” style study and an experimental study of various sensors and detection models for detecting and summarizing social interactions among aging patients and caregivers. We first simulate plausible sensors using human labeling on top of audio and visual data collected from a skilled nursing facility. The most useful sensors and robust detection models are determined using the simulated sensors. We then present the implementation of some real sensors based on video and audio analysis techniques and evaluate the performance of these implementations in detecting interactions. We conclude the article with discussions and future work.
international conference on acoustics, speech, and signal processing | 2005
Robert G. Malkin; Alex Waibel
Many mobile devices and applications can act in context-sensitive ways, but rely on explicit human action for context awareness. It would be preferable if our devices were able to attain context awareness without human intervention. One important aspect of user context is environment. We present a novel method for classifying environment types based on acoustic signals. This method makes use of linear autoencoding neural networks, and is motivated by the observation that biological coding systems seem to be heavily influenced by the statistics of their environments. We show that the autoencoder method achieved a lower error rate than a standard Gaussian mixture model on a representative sample task, and that a linear combination of autoencoders and GMMs yielded better performance than either alone.
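As a rough illustration of the reconstruction-error idea (not the authors' code), the sketch below trains one linear autoencoder per environment class, realized here as a truncated PCA, and labels a test vector with the class whose autoencoder reconstructs it with the least error. Acoustic feature extraction is assumed to have been done already, and the class names and dimensions are invented. The paper additionally combines the autoencoder scores linearly with GMM likelihoods; that fusion step is omitted here.

```python
# Minimal sketch: one linear autoencoder (truncated PCA) per environment class,
# classification by lowest reconstruction error. Illustration only.
import numpy as np

class LinearAutoencoder:
    def __init__(self, n_components):
        self.n_components = n_components

    def fit(self, X):
        self.mean_ = X.mean(axis=0)
        # Principal directions give the optimal linear encoder/decoder pair.
        _, _, vt = np.linalg.svd(X - self.mean_, full_matrices=False)
        self.W_ = vt[: self.n_components]
        return self

    def reconstruction_error(self, X):
        Z = (X - self.mean_) @ self.W_.T            # encode
        Xhat = Z @ self.W_ + self.mean_             # decode
        return ((X - Xhat) ** 2).sum(axis=1)

def classify(models, X):
    # Lower reconstruction error means a better match to that environment.
    errors = np.stack([m.reconstruction_error(X) for m in models.values()])
    labels = list(models.keys())
    return [labels[i] for i in errors.argmin(axis=0)]

rng = np.random.default_rng(0)
train = {"office": rng.normal(0, 1, (200, 13)),     # toy acoustic feature vectors
         "street": rng.normal(3, 1, (200, 13))}
models = {c: LinearAutoencoder(4).fit(X) for c, X in train.items()}
print(classify(models, rng.normal(3, 1, (5, 13))))  # expect mostly "street"
```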
international conference on multimodal interfaces | 2002
Brad A. Myers; Robert G. Malkin; Michael Bett; Alex Waibel; Ben Bostwick; Robert C. Miller; Jie Yang; Matthias Denecke; Edgar Seemann; Jie Zhu; Choon Hong Peck; Dave Kong; Jeffrey Nichols; Bill Scherlis
We describe our system which facilitates collaboration using multiple modalities, including speech, handwriting, gestures, gaze tracking, direct manipulation, large projected touch-sensitive displays, laser pointer tracking, regular monitors with a mouse and keyboard, and wireless networked handhelds. Our system allows multiple, geographically dispersed participants to simultaneously and flexibly mix different modalities using the right interface at the right time on one or more machines. We discuss each of the modalities provided, how they were integrated in the system architecture, and how the user interface enabled one or more people to flexibly use one or more devices.
international conference on multimodal interfaces | 2004
Datong Chen; Robert G. Malkin; Jie Yang
In this paper, we propose a multimodal system for detecting human activity and interaction patterns in a nursing home. Activities of groups of people are first treated as interaction patterns between any pair of partners and are then further broken down into individual activities and behavior events using a multi-level context hierarchy graph. The graph is implemented as a dynamic Bayesian network that statistically models the multi-level concepts. We have developed a coarse-to-fine prototype system to illustrate the proposed concept, and experimental results demonstrate the feasibility of the proposed approaches. The objective of this research is to automatically create concise and comprehensive reports of patients' activities and behaviors to support physicians and caregivers in a nursing facility.
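A minimal sketch of filtering in a two-level dynamic Bayesian network of the kind described above, purely for illustration: the state spaces, transition tables, and observation model below are invented. The top level represents an interaction pattern, the lower level an individual activity, and the joint belief is updated recursively from discrete observations.

```python
# Toy two-level DBN: interaction pattern I_t -> activity A_t -> observation O_t.
# All probability tables are random placeholders, not learned parameters.
import numpy as np

n_inter, n_act, n_obs = 2, 3, 4
rng = np.random.default_rng(1)

def norm(a, axis): return a / a.sum(axis=axis, keepdims=True)

P_I = norm(rng.random((n_inter, n_inter)), 1)        # P(I_t | I_{t-1})
P_A = norm(rng.random((n_inter, n_act, n_act)), 2)   # P(A_t | I_t, A_{t-1})
P_O = norm(rng.random((n_act, n_obs)), 1)            # P(O_t | A_t)

def filter_step(belief, obs):
    """belief[i, a] = P(I, A | observations so far); one recursive update."""
    new = np.zeros((n_inter, n_act))
    for i in range(n_inter):
        for a in range(n_act):
            # Transition factor over all previous (interaction, activity) states.
            trans = P_I[:, i][:, None] * P_A[i, :, a][None, :]
            new[i, a] = (belief * trans).sum() * P_O[a, obs]
    return new / new.sum()

belief = np.full((n_inter, n_act), 1.0 / (n_inter * n_act))
for obs in [0, 2, 2, 1]:                             # a toy observation sequence
    belief = filter_step(belief, obs)
print("P(interaction pattern):", belief.sum(axis=1))
```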
EURASIP Journal on Advances in Signal Processing | 2003
Chiori Hori; Sadaoki Furui; Robert G. Malkin; Hua Yu; Alex Waibel
This paper proposes a statistical approach to automatic speech summarization. In our method, a set of words maximizing a summarization score indicating the appropriateness of summarization is extracted from automatically transcribed speech and then concatenated to create a summary. The extraction process is performed using a dynamic programming (DP) technique based on a target compression ratio. In this paper, we demonstrate how an English news broadcast transcribed by a speech recognizer is automatically summarized. We adapted our method, which was originally proposed for Japanese, to English by modifying the model for estimating word concatenation probabilities based on a dependency structure in the original speech given by a stochastic dependency context free grammar (SDCFG). We also propose a method of summarizing multiple utterances using a two-level DP technique. The automatically summarized sentences are evaluated by summarization accuracy based on a comparison with a manual summary of speech that has been correctly transcribed by human subjects. Our experimental results indicate that the method we propose can effectively extract relatively important information and remove redundant and irrelevant information from English news broadcasts.
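The two-level DP for multiple utterances can be pictured as a budget-allocation problem; the sketch below is again an illustration rather than the authors' code. An inner table scores the best k-word summary of each utterance, simplified here to the sum of the top-k word significance scores, and an outer DP distributes the total word budget across utterances.

```python
# Toy two-level DP: inner per-utterance summary scores, outer budget allocation.
import numpy as np

def best_inner_scores(sig, max_k):
    """score[k] = best achievable score keeping exactly k words of one utterance."""
    top = np.sort(sig)[::-1]
    return np.concatenate(([0.0], np.cumsum(top[:max_k])))

def allocate(utterance_sigs, budget):
    """Return per-utterance word counts maximizing total score under the budget."""
    tables = [best_inner_scores(s, min(len(s), budget)) for s in utterance_sigs]
    dp = np.full(budget + 1, -np.inf)
    dp[0] = 0.0
    choice = []
    for table in tables:
        new = np.full(budget + 1, -np.inf)
        pick = np.zeros(budget + 1, dtype=int)
        for b in range(budget + 1):
            for k in range(min(len(table) - 1, b) + 1):
                cand = dp[b - k] + table[k]
                if cand > new[b]:
                    new[b], pick[b] = cand, k
        dp = new
        choice.append(pick)
    # Trace back how many words each utterance contributes.
    counts, b = [], budget
    for pick in reversed(choice):
        counts.append(pick[b])
        b -= pick[b]
    return list(reversed(counts))

sigs = [np.random.rand(12), np.random.rand(8), np.random.rand(15)]  # toy scores
print(allocate(sigs, budget=10))   # e.g. [4, 2, 4]
```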
international conference on acoustics, speech, and signal processing | 1998
Hua Yu; C. Clark; Robert G. Malkin; Alex Waibel
We describe our early exploration of automatic recognition of conversational speech in meetings, for use in automatic summarizers and browsers that produce meeting minutes effectively and rapidly. To achieve optimal performance, we started from two different baseline English recognizers, adapted them to meeting conditions, and tested the resulting performance. The data were found to be highly disfluent (conversational human-to-human speech), noisy (due to lapel microphones and the environment), and overlapped with background noise, resulting in error rates so far comparable to those on the CallHome conversational database (40-50% WER). A meeting browser is presented that allows the user to search and skim highlights from a meeting efficiently despite the recognition errors.
CLEaR | 2006
Robert G. Malkin
We describe the CLEAR 2006 acoustic environment classification evaluation and the CMU system used in the evaluation. Environment classification is a critical technology for the CHIL Connector service [1] in that Connector relies on maintaining awareness of user state to make intelligent decisions about the optimal times, places, and methods to deal with requests for human-to-human communication. Environment is an important aspect of user state with respect to this problem; humans may be more or less able to deal with voice or text communications depending on whether they are, for instance, in an office, a car, a cafe, or a cinema. We unfortunately cannot rely on the availability of the full CHIL sensor suite when users are not in the CHIL room; hence, we are motivated to explore the use of the only sensor which is reliably available on every mobile communication device: the microphone.
international conference on multimodal interfaces | 2006
Robert G. Malkin; Datong Chen; Jie Yang; Alex Waibel
Context-aware computer systems are characterized by the ability to consider user state information in their decision logic. One example application of context-aware computing is the smart mobile telephone. Ideally, a smart mobile telephone should be able to consider both social factors (i.e., known relationships between contactor and contactee) and environmental factors (i.e., the contactee's current locale and activity) when deciding how to handle an incoming request for communication. Toward providing this kind of user state information and improving the ability of the mobile phone to handle calls intelligently, we present work on inferring environmental factors from sensory data and using this information to predict user interruptibility. Specifically, we learn the structure and parameters of a user state model from continuous ambient audio and from visual information in periodic still images, and attempt to associate the learned states with user-reported interruptibility levels. We report experimental results using this technique on real data and show how such an approach can allow for adaptation to specific user preferences.
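A minimal sketch of the association step (not the authors' system): discrete user states are learned from ambient features without supervision, and each state is then tied to the mean of the user-reported interruptibility labels observed alongside it. Feature extraction from audio and images is assumed to be done already, and scikit-learn's k-means stands in for the structure and parameter learning described in the paper.

```python
# Toy pipeline: unsupervised user states from ambient features, then a mapping
# from learned state to mean reported interruptibility. Data are synthetic.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
features = rng.normal(size=(500, 20))       # ambient audio/visual feature vectors
reported = rng.integers(1, 6, size=500)     # 1 (do not interrupt) .. 5 (interruptible)

# 1. Learn discrete user states from the sensory stream (unsupervised).
states = KMeans(n_clusters=4, n_init=10, random_state=0).fit(features)
assignments = states.labels_

# 2. Associate each learned state with the reported interruptibility level.
state_interruptibility = np.array(
    [reported[assignments == s].mean() for s in range(4)])

# 3. Predict interruptibility for a new observation from its nearest state.
def predict(x):
    s = states.predict(x.reshape(1, -1))[0]
    return state_interruptibility[s]

print(predict(rng.normal(size=20)))
```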