Andreas Kathol
SRI International
Publications
Featured research published by Andreas Kathol.
Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge | 2014
Vikramjit Mitra; Elizabeth Shriberg; Mitchell McLaren; Andreas Kathol; Colleen Richey; Dimitra Vergyri; Martin Graciarena
Though depression is a common mental health problem with significant impact on human society, it often goes undetected. We explore a diverse set of features based only on spoken audio to understand which features correlate with self-reported depression scores according to the Beck depression rating scale. These features, many of which are novel for this task, include (1) estimated articulatory trajectories during speech production, (2) acoustic characteristics, (3) acoustic-phonetic characteristics and (4) prosodic features. Features are modeled using a variety of approaches, including support vector regression, a Gaussian backend and decision trees. We report results on the AVEC-2014 depression dataset and find that individual systems range from 9.18 to 11.87 in root mean squared error (RMSE), and from 7.68 to 9.99 in mean absolute error (MAE). Initial fusion brings further improvement; fusion and feature selection work is still in progress.
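To make the modeling and scoring setup concrete, here is a minimal sketch, assuming scikit-learn and synthetic feature vectors, of support vector regression against depression scores evaluated with RMSE and MAE. It illustrates the metric computation only; it is not the authors' feature extraction, Gaussian backend, decision trees, or fusion.

```python
# Minimal sketch (not the authors' system): regress self-reported depression
# scores from per-session feature vectors with SVR, score with RMSE and MAE.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 40))    # placeholder session-level features
y_train = rng.uniform(0, 45, size=80)  # placeholder depression scores
X_test = rng.normal(size=(20, 40))
y_test = rng.uniform(0, 45, size=20)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.5))
model.fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))
mae = mean_absolute_error(y_test, pred)
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}")
```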
international conference on acoustics, speech, and signal processing | 2009
Murat Akbacak; Horacio Franco; Michael W. Frandsen; Saša Hasan; Huda Jameel; Andreas Kathol; Shahram Khadivi; Xin Lei; Arindam Mandal; Saab Mansour; Kristin Precoda; Colleen Richey; Dimitra Vergyri; Wen Wang; Mei Yang; Jing Zheng
We summarize recent progress on SRI's IraqComm™ Iraqi Arabic-English two-way speech-to-speech translation system. In the past year we made substantial developments in our speech recognition and machine translation technology, leading to significant improvements in both accuracy and speed of the IraqComm system. On the 2008 NIST evaluation dataset our two-way speech-to-text (S2T) system achieved 6% to 8% absolute improvement in BLEU in both directions, compared to our previous year's system [1].
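For readers unfamiliar with the metric, the sketch below shows how such a BLEU comparison between two system outputs can be computed. The sacrebleu library and the toy sentences are assumptions for illustration, not the NIST evaluation tooling.

```python
# Hypothetical sketch of a BLEU comparison of the kind behind the reported gains.
import sacrebleu

references = [["he went to the market yesterday", "please wait here"]]  # toy data
old_system = ["he go to market yesterday", "please waiting here"]
new_system = ["he went to the market yesterday", "please wait here"]

bleu_old = sacrebleu.corpus_bleu(old_system, references).score
bleu_new = sacrebleu.corpus_bleu(new_system, references).score
print(f"old={bleu_old:.1f}  new={bleu_new:.1f}  absolute gain={bleu_new - bleu_old:.1f}")
```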
international conference on acoustics, speech, and signal processing | 2013
Necip Fazil Ayan; Arindam Mandal; Michael W. Frandsen; Jing Zheng; Peter Blasco; Andreas Kathol; Frédéric Béchet; Benoit Favre; Alex Marin; Tom Kwiatkowski; Mari Ostendorf; Luke Zettlemoyer; Philipp Salletmayr; Julia Hirschberg; Svetlana Stoyanchev
We present a novel approach for improving communication success between users of speech-to-speech translation systems by automatically detecting errors in the output of automatic speech recognition (ASR) and statistical machine translation (SMT) systems. Our approach initiates system-driven targeted clarification about errorful regions in user input and repairs them given user responses. Our system has been evaluated by unbiased subjects in live mode, and results show improved success of communication between users of the system.
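A rough sketch of the underlying idea follows, with invented confidence scores and a made-up question template standing in for the paper's actual detectors: low-confidence regions in an ASR hypothesis are grouped and turned into targeted clarification prompts.

```python
# Hypothetical sketch: flag low-confidence (errorful) regions in an ASR hypothesis
# and generate a targeted clarification question about them. The threshold and
# question template are illustrative assumptions.
from typing import List, Tuple

def find_errorful_regions(words: List[Tuple[str, float]], threshold: float = 0.5):
    """Group consecutive low-confidence words into suspect regions."""
    regions, current = [], []
    for word, conf in words:
        if conf < threshold:
            current.append(word)
        elif current:
            regions.append(" ".join(current))
            current = []
    if current:
        regions.append(" ".join(current))
    return regions

def clarification_question(region: str) -> str:
    return f'I did not catch the part "{region}". Could you repeat or rephrase it?'

asr_output = [("send", 0.95), ("the", 0.92), ("convoy", 0.31), ("to", 0.90),
              ("the", 0.88), ("checkpoint", 0.42)]
for region in find_errorful_regions(asr_output):
    print(clarification_question(region))
```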
north american chapter of the association for computational linguistics | 2004
Kristin Precoda; Horacio Franco; Ascander Dost; Michael W. Frandsen; John Fry; Andreas Kathol; Colleen Richey; Susanne Z. Riehemann; Dimitra Vergyri; Jing Zheng; Christopher Culy
This paper describes a prototype system for near-real-time spontaneous, bidirectional translation between spoken English and Pashto, a language presenting many technological challenges because of its lack of resources, including both data and expert knowledge. Development of the prototype is ongoing, and we propose to demonstrate a fully functional version which shows the basic capabilities, though not yet their final depth and breadth.
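As a rough illustration only, the skeleton below shows the kind of two-way pipeline such a system implements (speech recognition, then translation, then synthesis in each direction). Every function here is a placeholder, not a component of the English-Pashto prototype.

```python
# Hypothetical skeleton of a two-way speech-to-speech translation loop.
def recognize(audio, lang):     # speech -> text (placeholder)
    return "<transcript>"

def translate(text, src, tgt):  # text -> text (placeholder)
    return "<translation>"

def synthesize(text, lang):     # text -> speech (placeholder)
    return b"<audio>"

def translate_turn(audio, src, tgt):
    """One conversational turn: recognize, translate, and speak the result."""
    transcript = recognize(audio, src)
    translation = translate(transcript, src, tgt)
    return synthesize(translation, tgt)

# English speaker's turn, then Pashto speaker's reply.
out_ps = translate_turn(b"<english audio>", src="en", tgt="ps")
out_en = translate_turn(b"<pashto audio>", src="ps", tgt="en")
```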
international conference on acoustics, speech, and signal processing | 2008
Andreas Kathol; Gökhan Tür
Understanding multi-party meetings involves tasks such as dialog act segmentation and tagging, action item extraction, and summarization. In this paper we introduce a new task for multi-party meetings: extracting question/answer pairs. This is a practical application for further processing such as summarization. We propose a method based on discriminative classification of individual sentences as questions and answers via lexical, speaker, and dialog act tag information, followed by a contextual optimization via Markov models. Our results indicate that it is possible to outperform a non-trivial baseline using dialog act tag information. More specifically, our method achieves a 13% relative improvement over the baseline for the task of detecting answers in meetings.
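A hedged sketch of the two-stage idea follows, with made-up classifier posteriors and transition probabilities standing in for the trained models: per-sentence question/answer scores are smoothed over the sequence with a Viterbi pass.

```python
# Rough sketch: per-sentence classifier scores followed by a Markov-model pass.
# The posteriors and transition matrix are invented illustrations.
import numpy as np

LABELS = ["other", "question", "answer"]

# Hypothetical per-sentence posteriors (rows: sentences; columns follow LABELS).
emission = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.4, 0.1, 0.5],
    [0.8, 0.1, 0.1],
])

# Transition matrix encoding that an answer tends to follow a question.
transition = np.array([
    [0.6, 0.3, 0.1],   # from other
    [0.2, 0.2, 0.6],   # from question
    [0.5, 0.3, 0.2],   # from answer
])

def viterbi(emission, transition):
    n, k = emission.shape
    score = np.log(emission[0])
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + np.log(transition) + np.log(emission[t])[None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [LABELS[i] for i in reversed(path)]

print(viterbi(emission, transition))
```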
international conference on acoustics, speech, and signal processing | 2011
Arindam Mandal; Dimitra Vergyri; Murat Akbacak; Colleen Richey; Andreas Kathol
In this work, we compare several known approaches for multilingual acoustic modeling for three languages, Dari, Farsi, and Pashto, which are of recent geopolitical interest. We demonstrate that we can train a single multilingual acoustic model for these languages and achieve recognition accuracy close to that of monolingual (or language-dependent) models. When only a small amount of training data is available for each of these languages, the multilingual model may even outperform the monolingual ones. We also explore adapting the multilingual model to target-language data; the adapted models achieve improved automatic speech recognition (ASR) performance compared to the monolingual models by 3% relative word error rate (WER), for both large and small amounts of training data.
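As a small worked example of the reported figure, relative WER improvement is the reduction in WER divided by the baseline WER; the numbers below are illustrative only, not from the paper.

```python
# Worked example of a "3% relative WER" comparison with illustrative numbers.
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# A hypothetical monolingual baseline at 40.0% WER and an adapted multilingual
# model at 38.8% WER corresponds to a 3% relative improvement.
print(f"{relative_wer_improvement(40.0, 38.8):.1f}% relative")
```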
spoken language technology workshop | 2016
Chris Bartels; Wen Wang; Vikramjit Mitra; Colleen Richey; Andreas Kathol; Dimitra Vergyri; Harry Bratt; Chiachi Hung
This work addresses lexical unit discovery for languages without (usable) written resources. Previous work has addressed this problem using entirely unsupervised methodologies. Our approach, in contrast, investigates the use of linguistic and speaker knowledge, which are often available even if text resources are not. We create a framework that benefits from such resources, not assuming orthographic representations and avoiding generation of word-level transcriptions. We adapt a universal phone recognizer to the target language and use it to convert audio into a searchable phone string for lexical unit discovery via fuzzy sub-string matching. Linguistic knowledge is used to constrain phone recognition output and to constrain lexical unit discovery on the phone recognizer output.
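A minimal sketch of the search step follows, assuming difflib as a stand-in matcher and invented phone symbols: a query phone sequence is slid over the recognizer's phone string and near-matches are kept.

```python
# Hypothetical illustration of fuzzy sub-string matching over a phone string.
from difflib import SequenceMatcher

def fuzzy_find(query, phones, min_ratio=0.8):
    """Slide a window over the phone sequence and keep near-matches of the query."""
    hits = []
    w = len(query)
    for start in range(len(phones) - w + 1):
        window = phones[start:start + w]
        ratio = SequenceMatcher(None, query, window).ratio()
        if ratio >= min_ratio:
            hits.append((start, window, round(ratio, 2)))
    return hits

phone_string = "sil b a l u t a k i sil m a l u t a".split()
query = "b a l u t".split()
print(fuzzy_find(query, phone_string, min_ratio=0.6))
```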
international conference on acoustics, speech, and signal processing | 2017
Jennifer Smith; Andreas Tsiartas; Elizabeth Shriberg; Andreas Kathol; Adrian R. Willoughby; Massimiliano de Zambotti
Interactive voice technologies can leverage biosignals, such as heart rate (HR), to infer the psychophysiological state of the user. Voice-based detection of HR is attractive because it does not require additional sensors. We predict HR from speech using the SRI BioFrustration Corpus. In contrast to previous studies, we use continuous spontaneous speech as input. Results using random forests show modest but significant effects on HR prediction. We further explore the effects on HR of speaking itself, and contrast the effects when interactions induce neutral versus frustrated responses from users. Results reveal that regardless of the user's emotional state, HR tends to increase while the user is engaged in speaking to a dialog system relative to a silent region right before speech, and that this effect is greater when the subject is expressing frustration. We also find that the user's HR does not recover to pre-speaking levels as quickly after frustrated speech as it does after neutral speech. Implications and future directions are discussed.
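A minimal sketch of the modeling step with synthetic data, assuming scikit-learn, is shown below; feature extraction from the BioFrustration Corpus is not shown and the numbers are meaningless placeholders.

```python
# Minimal sketch: predict heart rate from speech-derived features with a random forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))   # placeholder speech features per segment
y = 60 + 20 * rng.random(200)    # placeholder HR values in beats per minute

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
rf = RandomForestRegressor(n_estimators=200, random_state=1)
rf.fit(X_tr, y_tr)
print("MAE (bpm):", round(mean_absolute_error(y_te, rf.predict(X_te)), 2))
```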
computer vision and pattern recognition | 2017
Robert C. Bolles; J. Brian Burns; Martin Graciarena; Andreas Kathol; Aaron Lawson; Mitchell McLaren; Thomas Mensink
This paper is part of a larger effort to detect manipulations of video by searching for and combining the evidence of multiple types of inconsistencies between the audio and visual channels. Here, we focus on inconsistencies between the type of scenes detected in the audio and visual modalities (e.g., audio indoor, small room versus visual outdoor, urban), and inconsistencies in speaker identity tracking over a video given audio speaker features and visual face features (e.g., a voice change, but no talking face change). The scene inconsistency task was complicated by mismatches in the categories used in current visual scene and audio scene collections. To deal with this, we employed a novel semantic mapping method. The speaker identity inconsistency process was challenged by the complexity of comparing face tracks and audio speech clusters, requiring a novel method of fusing these two sources. Our progress on both tasks was demonstrated on two collections of tampered videos.
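A hypothetical sketch of the speaker-identity consistency idea follows: compare the points where the audio speaker identity changes with the points where the visible face changes, and flag voice changes with no nearby face change. The label sequences and tolerance are invented; this is not the paper's fusion method.

```python
# Hypothetical sketch of an audio/visual speaker-identity consistency check.
def change_points(labels):
    """Times (indices) where a label sequence switches identity."""
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]

def unexplained_voice_changes(audio_speakers, face_tracks, tolerance=1):
    voice = change_points(audio_speakers)
    face = change_points(face_tracks)
    return [t for t in voice if not any(abs(t - f) <= tolerance for f in face)]

audio_speakers = ["spk1"] * 5 + ["spk2"] * 5   # per-second audio speaker clusters
face_tracks    = ["faceA"] * 10                # same talking face throughout
print(unexplained_voice_changes(audio_speakers, face_tracks))  # -> [5]
```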
conference of the international speech communication association | 2016
Andreas Kathol; Elizabeth Shriberg; Massimiliano de Zambotti
We introduce the SRI CLEO (Conversational Language about Everyday Objects) Speaker-State Corpus of speech, video, and biosignals. The goal of the corpus is to provide insight into the speech and physiological changes resulting from subtle, context-based influences on affect and cognition. Speakers were prompted by collections of pictures of neutral everyday objects and were instructed to provide speech related to any subset of the objects for a preset period of time (120 or 180 seconds depending on task). The corpus provides signals for 43 speakers under four different speaker-state conditions: (1) neutral and emotionally charged audiovisual background; (2) cognitive load; (3) time pressure; and (4) various acted emotions. Unlike previous studies that have linked speaker state to the content of the speaking task itself, the CLEO prompts remain largely pragmatically, semantically, and affectively neutral across all conditions. This framework enables more direct comparisons across both conditions and speakers. The corpus also includes more traditional speaker tasks involving reading and free-form reporting of neutral and emotionally charged content. The recorded biosignals include skin conductance, respiration, blood pressure, and ECG. The corpus is in the final stages of processing and will be made available to the research community.