Andreas Kathol
SRI International
Publications
Featured research published by Andreas Kathol.
Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge | 2014
Vikramjit Mitra; Elizabeth Shriberg; Mitchell McLaren; Andreas Kathol; Colleen Richey; Dimitra Vergyri; Martin Graciarena
Though depression is a common mental health problem with significant impact on human society, it often goes undetected. We explore a diverse set of features based only on spoken audio to understand which features correlate with self-reported depression scores according to the Beck depression rating scale. These features, many of which are novel for this task, include (1) estimated articulatory trajectories during speech production, (2) acoustic characteristics, (3) acoustic-phonetic characteristics and (4) prosodic features. Features are modeled using a variety of approaches, including support vector regression, a Gaussian backend and decision trees. We report results on the AVEC-2014 depression dataset and find that individual systems range from 9.18 to 11.87 in root mean squared error (RMSE), and from 7.68 to 9.99 in mean absolute error (MAE). Initial fusion brings further improvement; fusion and feature selection work is still in progress.
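To make the modeling and scoring setup concrete, here is a minimal sketch, assuming scikit-learn and synthetic feature vectors, of support vector regression against depression scores evaluated with RMSE and MAE. It illustrates the metric computation only; it is not the authors' feature extraction, Gaussian backend, decision trees, or fusion.

```python
# Minimal sketch (not the authors' system): regress self-reported depression
# scores from per-session feature vectors with SVR, score with RMSE and MAE.
import numpy as np
from sklearn.svm import SVR
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import mean_squared_error, mean_absolute_error

rng = np.random.default_rng(0)
X_train = rng.normal(size=(80, 40))    # placeholder session-level features
y_train = rng.uniform(0, 45, size=80)  # placeholder depression scores
X_test = rng.normal(size=(20, 40))
y_test = rng.uniform(0, 45, size=20)

model = make_pipeline(StandardScaler(), SVR(kernel="rbf", C=1.0, epsilon=0.5))
model.fit(X_train, y_train)
pred = model.predict(X_test)

rmse = np.sqrt(mean_squared_error(y_test, pred))
mae = mean_absolute_error(y_test, pred)
print(f"RMSE={rmse:.2f}  MAE={mae:.2f}")
```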
international conference on acoustics, speech, and signal processing | 2009
Murat Akbacak; Horacio Franco; Michael W. Frandsen; Saša Hasan; Huda Jameel; Andreas Kathol; Shahram Khadivi; Xin Lei; Arindam Mandal; Saab Mansour; Kristin Precoda; Colleen Richey; Dimitra Vergyri; Wen Wang; Mei Yang; Jing Zheng
We summarize recent progress on SRI's IraqComm™ Iraqi Arabic-English two-way speech-to-speech translation system. In the past year we made substantial developments in our speech recognition and machine translation technology, leading to significant improvements in both accuracy and speed of the IraqComm system. On the 2008 NIST evaluation dataset our two-way speech-to-text (S2T) system achieved 6% to 8% absolute improvement in BLEU in both directions, compared to our previous year's system [1].
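For readers unfamiliar with the metric, the sketch below shows how such a BLEU comparison between two system outputs can be computed. The sacrebleu library and the toy sentences are assumptions for illustration, not the NIST evaluation tooling.

```python
# Hypothetical sketch of a BLEU comparison of the kind behind the reported gains.
import sacrebleu

references = [["he went to the market yesterday", "please wait here"]]  # toy data
old_system = ["he go to market yesterday", "please waiting here"]
new_system = ["he went to the market yesterday", "please wait here"]

bleu_old = sacrebleu.corpus_bleu(old_system, references).score
bleu_new = sacrebleu.corpus_bleu(new_system, references).score
print(f"old={bleu_old:.1f}  new={bleu_new:.1f}  absolute gain={bleu_new - bleu_old:.1f}")
```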
international conference on acoustics, speech, and signal processing | 2013
Necip Fazil Ayan; Arindam Mandal; Michael W. Frandsen; Jing Zheng; Peter Blasco; Andreas Kathol; Frédéric Béchet; Benoit Favre; Alex Marin; Tom Kwiatkowski; Mari Ostendorf; Luke Zettlemoyer; Philipp Salletmayr; Julia Hirschberg; Svetlana Stoyanchev
We present a novel approach for improving communication success between users of speech-to-speech translation systems by automatically detecting errors in the output of automatic speech recognition (ASR) and statistical machine translation (SMT) systems. Our approach initiates system-driven targeted clarification about errorful regions in user input and repairs them given user responses. Our system has been evaluated by unbiased subjects in live mode, and results show improved success of communication between users of the system.
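A rough sketch of the underlying idea follows, with invented confidence scores and a made-up question template standing in for the paper's actual detectors: low-confidence regions in an ASR hypothesis are grouped and turned into targeted clarification prompts.

```python
# Hypothetical sketch: flag low-confidence (errorful) regions in an ASR hypothesis
# and generate a targeted clarification question about them. The threshold and
# question template are illustrative assumptions.
from typing import List, Tuple

def find_errorful_regions(words: List[Tuple[str, float]], threshold: float = 0.5):
    """Group consecutive low-confidence words into suspect regions."""
    regions, current = [], []
    for word, conf in words:
        if conf < threshold:
            current.append(word)
        elif current:
            regions.append(" ".join(current))
            current = []
    if current:
        regions.append(" ".join(current))
    return regions

def clarification_question(region: str) -> str:
    return f'I did not catch the part "{region}". Could you repeat or rephrase it?'

asr_output = [("send", 0.95), ("the", 0.92), ("convoy", 0.31), ("to", 0.90),
              ("the", 0.88), ("checkpoint", 0.42)]
for region in find_errorful_regions(asr_output):
    print(clarification_question(region))
```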
north american chapter of the association for computational linguistics | 2004
Kristin Precoda; Horacio Franco; Ascander Dost; Michael W. Frandsen; John Fry; Andreas Kathol; Colleen Richey; Susanne Z. Riehemann; Dimitra Vergyri; Jing Zheng; Christopher Culy
This paper describes a prototype system for near-real-time spontaneous, bidirectional translation between spoken English and Pashto, a language presenting many technological challenges because of its lack of resources, including both data and expert knowledge. Development of the prototype is ongoing, and we propose to demonstrate a fully functional version which shows the basic capabilities, though not yet their final depth and breadth.
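As a rough illustration only, the skeleton below shows the kind of two-way pipeline such a system implements (speech recognition, then translation, then synthesis in each direction). Every function here is a placeholder, not a component of the English-Pashto prototype.

```python
# Hypothetical skeleton of a two-way speech-to-speech translation loop.
def recognize(audio, lang):     # speech -> text (placeholder)
    return "<transcript>"

def translate(text, src, tgt):  # text -> text (placeholder)
    return "<translation>"

def synthesize(text, lang):     # text -> speech (placeholder)
    return b"<audio>"

def translate_turn(audio, src, tgt):
    """One conversational turn: recognize, translate, and speak the result."""
    transcript = recognize(audio, src)
    translation = translate(transcript, src, tgt)
    return synthesize(translation, tgt)

# English speaker's turn, then Pashto speaker's reply.
out_ps = translate_turn(b"<english audio>", src="en", tgt="ps")
out_en = translate_turn(b"<pashto audio>", src="ps", tgt="en")
```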
international conference on acoustics, speech, and signal processing | 2008
Andreas Kathol; Gökhan Tür
Understanding multi-party meetings involves tasks such as dialog act segmentation and tagging, action item extraction, and summarization. In this paper we introduce a new task for multi-party meetings: extracting question/answer pairs. This is a practical application for further processing such as summarization. We propose a method based on discriminative classification of individual sentences as questions and answers via lexical, speaker, and dialog act tag information, followed by a contextual optimization via Markov models. Our results indicate that it is possible to outperform a non-trivial baseline using dialog act tag information. More specifically, our method achieves a 13% relative improvement over the baseline for the task of detecting answers in meetings.
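A hedged sketch of the two-stage idea follows, with made-up classifier posteriors and transition probabilities standing in for the trained models: per-sentence question/answer scores are smoothed over the sequence with a Viterbi pass.

```python
# Rough sketch: per-sentence classifier scores followed by a Markov-model pass.
# The posteriors and transition matrix are invented illustrations.
import numpy as np

LABELS = ["other", "question", "answer"]

# Hypothetical per-sentence posteriors (rows: sentences; columns follow LABELS).
emission = np.array([
    [0.7, 0.2, 0.1],
    [0.2, 0.7, 0.1],
    [0.4, 0.1, 0.5],
    [0.8, 0.1, 0.1],
])

# Transition matrix encoding that an answer tends to follow a question.
transition = np.array([
    [0.6, 0.3, 0.1],   # from other
    [0.2, 0.2, 0.6],   # from question
    [0.5, 0.3, 0.2],   # from answer
])

def viterbi(emission, transition):
    n, k = emission.shape
    score = np.log(emission[0])
    back = np.zeros((n, k), dtype=int)
    for t in range(1, n):
        cand = score[:, None] + np.log(transition) + np.log(emission[t])[None, :]
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(n - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    return [LABELS[i] for i in reversed(path)]

print(viterbi(emission, transition))
```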
international conference on acoustics, speech, and signal processing | 2011
Arindam Mandal; Dimitra Vergyri; Murat Akbacak; Colleen Richey; Andreas Kathol
In this work, we compare several known approaches for multilingual acoustic modeling for three languages, Dari, Farsi, and Pashto, which are of recent geopolitical interest. We demonstrate that we can train a single multilingual acoustic model for these languages and achieve recognition accuracy close to that of monolingual (or language-dependent) models. When only a small amount of training data is available for each of these languages, the multilingual model may even outperform the monolingual ones. We also explore adapting the multilingual model to target-language data; the adapted models achieve improved automatic speech recognition (ASR) performance compared to the monolingual models by 3% relative word error rate (WER), for both large and small amounts of training data.
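As a small worked example of the reported figure, relative WER improvement is the reduction in WER divided by the baseline WER; the numbers below are illustrative only, not from the paper.

```python
# Worked example of a "3% relative WER" comparison with illustrative numbers.
def relative_wer_improvement(baseline_wer: float, new_wer: float) -> float:
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# A hypothetical monolingual baseline at 40.0% WER and an adapted multilingual
# model at 38.8% WER corresponds to a 3% relative improvement.
print(f"{relative_wer_improvement(40.0, 38.8):.1f}% relative")
```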
spoken language technology workshop | 2016
Chris Bartels; Wen Wang; Vikramjit Mitra; Colleen Richey; Andreas Kathol; Dimitra Vergyri; Harry Bratt; Chiachi Hung
This work addresses lexical unit discovery for languages without (usable) written resources. Previous work has addressed this problem using entirely unsupervised methodologies. Our approach, in contrast, investigates the use of linguistic and speaker knowledge, which are often available even if text resources are not. We create a framework that benefits from such resources, not assuming orthographic representations and avoiding generation of word-level transcriptions. We adapt a universal phone recognizer to the target language and use it to convert audio into a searchable phone string for lexical unit discovery via fuzzy sub-string matching. Linguistic knowledge is used to constrain phone recognition output and to constrain lexical unit discovery on the phone recognizer output.
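A minimal sketch of the search step follows, assuming difflib as a stand-in matcher and invented phone symbols: a query phone sequence is slid over the recognizer's phone string and near-matches are kept.

```python
# Hypothetical illustration of fuzzy sub-string matching over a phone string.
from difflib import SequenceMatcher

def fuzzy_find(query, phones, min_ratio=0.8):
    """Slide a window over the phone sequence and keep near-matches of the query."""
    hits = []
    w = len(query)
    for start in range(len(phones) - w + 1):
        window = phones[start:start + w]
        ratio = SequenceMatcher(None, query, window).ratio()
        if ratio >= min_ratio:
            hits.append((start, window, round(ratio, 2)))
    return hits

phone_string = "sil b a l u t a k i sil m a l u t a".split()
query = "b a l u t".split()
print(fuzzy_find(query, phone_string, min_ratio=0.6))
```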
international conference on acoustics, speech, and signal processing | 2017
Jennifer Smith; Andreas Tsiartas; Elizabeth Shriberg; Andreas Kathol; Adrian R. Willoughby; Massimiliano de Zambotti
Interactive voice technologies can leverage biosignals, such as heart rate (HR), to infer the psychophysiological state of the user. Voice-based detection of HR is attractive because it does not require additional sensors. We predict HR from speech using the SRI BioFrustration Corpus. In contrast to previous studies, we use continuous spontaneous speech as input. Results using random forests show modest but significant effects on HR prediction. We further explore the effects on HR of speaking itself, and contrast the effects when interactions induce neutral versus frustrated responses from users. Results reveal that regardless of the user's emotional state, HR tends to increase while the user is engaged in speaking to a dialog system relative to a silent region right before speech, and that this effect is greater when the subject is expressing frustration. We also find that the user's HR does not recover to pre-speaking levels as quickly after frustrated speech as it does after neutral speech. Implications and future directions are discussed.
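A minimal sketch of the modeling step with synthetic data, assuming scikit-learn, is shown below; feature extraction from the BioFrustration Corpus is not shown and the numbers are meaningless placeholders.

```python
# Minimal sketch: predict heart rate from speech-derived features with a random forest.
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 30))   # placeholder speech features per segment
y = 60 + 20 * rng.random(200)    # placeholder HR values in beats per minute

X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25, random_state=1)
rf = RandomForestRegressor(n_estimators=200, random_state=1)
rf.fit(X_tr, y_tr)
print("MAE (bpm):", round(mean_absolute_error(y_te, rf.predict(X_te)), 2))
```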
computer vision and pattern recognition | 2017
Robert C. Bolles; J. Brian Burns; Martin Graciarena; Andreas Kathol; Aaron Lawson; Mitchell McLaren; Thomas Mensink
This paper is part of a larger effort to detect manipulations of video by searching for and combining the evidence of multiple types of inconsistencies between the audio and visual channels. Here, we focus on inconsistencies between the type of scenes detected in the audio and visual modalities (e.g., audio indoor, small room versus visual outdoor, urban), and inconsistencies in speaker identity tracking over a video given audio speaker features and visual face features (e.g., a voice change, but no talking face change). The scene inconsistency task was complicated by mismatches in the categories used in current visual scene and audio scene collections. To deal with this, we employed a novel semantic mapping method. The speaker identity inconsistency process was challenged by the complexity of comparing face tracks and audio speech clusters, requiring a novel method of fusing these two sources. Our progress on both tasks was demonstrated on two collections of tampered videos.
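A hypothetical sketch of the speaker-identity consistency idea follows: compare the points where the audio speaker identity changes with the points where the visible face changes, and flag voice changes with no nearby face change. The label sequences and tolerance are invented; this is not the paper's fusion method.

```python
# Hypothetical sketch of an audio/visual speaker-identity consistency check.
def change_points(labels):
    """Times (indices) where a label sequence switches identity."""
    return [t for t in range(1, len(labels)) if labels[t] != labels[t - 1]]

def unexplained_voice_changes(audio_speakers, face_tracks, tolerance=1):
    voice = change_points(audio_speakers)
    face = change_points(face_tracks)
    return [t for t in voice if not any(abs(t - f) <= tolerance for f in face)]

audio_speakers = ["spk1"] * 5 + ["spk2"] * 5   # per-second audio speaker clusters
face_tracks    = ["faceA"] * 10                # same talking face throughout
print(unexplained_voice_changes(audio_speakers, face_tracks))  # -> [5]
```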
conference of the international speech communication association | 2016
Andreas Kathol; Elizabeth Shriberg; Massimiliano de Zambotti
We introduce the SRI CLEO (Conversational Language about Everyday Objects) Speaker-State Corpus of speech, video, and biosignals. The goal of the corpus is to provide insight into the speech and physiological changes resulting from subtle, context-based influences on affect and cognition. Speakers were prompted by collections of pictures of neutral everyday objects and were instructed to provide speech related to any subset of the objects for a preset period of time (120 or 180 seconds depending on task). The corpus provides signals for 43 speakers under four different speaker-state conditions: (1) neutral and emotionally charged audiovisual background; (2) cognitive load; (3) time pressure; and (4) various acted emotions. Unlike previous studies that have linked speaker state to the content of the speaking task itself, the CLEO prompts remain largely pragmatically, semantically, and affectively neutral across all conditions. This framework enables more direct comparisons across both conditions and speakers. The corpus also includes more traditional speaker tasks involving reading and free-form reporting of neutral and emotionally charged content. The recorded biosignals include skin conductance, respiration, blood pressure, and ECG. The corpus is in the final stages of processing and will be made available to the research community.