
Publications


Featured research published by Kornel Laskowski.


Annual Meeting of the Special Interest Group on Discourse and Dialogue (SIGDIAL) | 2008

Modeling Vocal Interaction for Text-Independent Participant Characterization in Multi-Party Conversation

Kornel Laskowski; Mari Ostendorf; Tanja Schultz

An important task in automatic conversation understanding is the inference of social structure governing participant behavior. We explore the dependence between several social dimensions, including assigned role, gender, and seniority, and a set of low-level features descriptive of talkspurt deployment in a multiparticipant context. Experiments conducted on two large, publicly available meeting corpora suggest that our features are quite useful in predicting these dimensions, excepting gender. The classification experiments we present exhibit a relative error rate reduction of 37% to 67% compared to choosing the majority class.
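
The abstract does not enumerate the low-level features themselves; the sketch below illustrates the kind of talkspurt-deployment statistics such a system might compute from a binary speech-activity matrix. The matrix format, frame size, and the specific statistics are our assumptions, not the paper's feature set.

```python
import numpy as np

def talkspurt_features(activity, frame_s=0.1):
    """Per-participant talkspurt-deployment statistics.

    activity : (T, K) binary matrix, activity[t, k] == 1 iff
               participant k vocalizes in frame t (assumed format).
    """
    T, K = activity.shape
    feats = []
    for k in range(K):
        a = activity[:, k]
        others = activity.sum(axis=1) - a
        # Talkspurt boundaries: 0->1 (onset) and 1->0 (offset) transitions.
        edges = np.diff(np.concatenate(([0], a, [0])))
        starts, ends = np.flatnonzero(edges == 1), np.flatnonzero(edges == -1)
        durs = (ends - starts) * frame_s
        feats.append({
            "speech_fraction": float(a.mean()),
            "mean_spurt_s": float(durs.mean()) if durs.size else 0.0,
            "spurts_per_min": 60.0 * durs.size / (T * frame_s),
            # Share of own speech frames during which someone else also speaks.
            "overlap_fraction": float((a * (others > 0)).sum() / max(a.sum(), 1)),
        })
    return feats

rng = np.random.default_rng(0)
toy = (rng.random((600, 3)) < np.array([0.4, 0.2, 0.1])).astype(int)
for k, f in enumerate(talkspurt_features(toy)):
    print(f"participant {k}: {f}")
```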


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2008

An instantaneous vector representation of delta pitch for speaker-change prediction in conversational dialogue systems

Kornel Laskowski; Jens Edlund; Mattias Heldner

As spoken dialogue systems become deployed in increasingly complex domains, they face rising demands on the naturalness of interaction. We focus on system responsiveness, aiming to mimic human-like dialogue flow control by predicting speaker changes as observed in real human-human conversations. We derive an instantaneous vector representation of pitch variation and show that it is amenable to standard acoustic modeling techniques. Using a small amount of automatically labeled data, we train models which significantly outperform current state-of-the-art pause-only systems, and replicate to within 1% absolute the performance of our previously published hand-crafted baseline. The new system additionally offers scope for run-time control over the precision or recall of locations at which to speak.
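
As a rough illustration of an instantaneous vector representation of pitch variation, the sketch below correlates the magnitude spectra of the left and right halves of a single frame under a bank of spectral dilation factors, so that rising or falling pitch changes which dilation correlates best. The dilation grid, windowing, and normalization are assumptions made for this sketch; the published representation uses a carefully designed filterbank.

```python
import numpy as np

def pitch_variation_vector(frame, n_coeff=7):
    """Correlate left- and right-half magnitude spectra of one frame
    under a bank of dilation factors; the best-matching dilation
    reflects the instantaneous rate of pitch change."""
    n = len(frame) // 2
    left = frame[:n] * np.hanning(n)
    right = frame[n:2 * n] * np.hanning(n)
    FL, FR = np.abs(np.fft.rfft(left)), np.abs(np.fft.rfft(right))
    bins = np.arange(FL.size)
    out = []
    for rho in np.linspace(0.9, 1.1, n_coeff):      # assumed dilation grid
        FLd = np.interp(bins * rho, bins, FL, left=0.0, right=0.0)
        denom = np.linalg.norm(FLd) * np.linalg.norm(FR) + 1e-12
        out.append(float(FLd @ FR) / denom)
    return np.array(out)                            # one coefficient per dilation

# Toy input: a 32 ms chirp rising from 120 Hz to 150 Hz at 16 kHz.
sr = 16000
t = np.arange(0, 0.032, 1.0 / sr)
chirp = np.sin(2 * np.pi * (120 * t + 0.5 * (150 - 120) / 0.032 * t ** 2))
print(pitch_variation_vector(chirp))   # peak shifts away from the center
```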


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2006

Unsupervised Learning of Overlapped Speech Model Parameters For Multichannel Speech Activity Detection in Meetings

Kornel Laskowski; Tanja Schultz

The study of meetings, and of multi-party conversation in general, is currently the focus of much attention, calling for more robust and more accurate speech activity detection systems. We present a novel multichannel speech activity detection algorithm which explicitly models the overlap incurred by participants taking turns at speaking. Parameters for overlapped speech states are estimated during decoding, in an unsupervised manner, by combining knowledge from other states observed in the same meeting. We demonstrate on the NIST Rich Transcription Spring 2004 data set that the new system almost halves the number of frames missed by a competitive algorithm within regions of overlapped speech. The overall speech detection error on unseen data is reduced by 36% relative.
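
A toy illustration of the parameter-combination idea: derive the observation model of an overlapped state from the models of states already observed in the same meeting. The energy-additivity and crosstalk assumptions below are ours, for illustration only, not the paper's estimator.

```python
import numpy as np

def overlap_state_mean(mu_speech, mu_nonspeech, speakers_on):
    """Toy parameter combination for an overlapped-speech state.

    mu_speech[k]    : mean log-energy on channel k when only k speaks
    mu_nonspeech[k] : mean log-energy on channel k when no one speaks
    speakers_on     : set of simultaneously active speakers

    Assumption (ours): per-channel energies of simultaneous speakers
    add in the linear domain, so the overlap state's mean log-energy
    is the log of the summed linear-domain means.
    """
    K = len(mu_speech)
    mu = np.array(mu_nonspeech, dtype=float)
    for ch in range(K):
        linear = np.exp(mu_nonspeech[ch])
        for k in speakers_on:
            # Crosstalk: speaker k also raises energy on other channels.
            gain = 1.0 if k == ch else 0.1     # assumed crosstalk factor
            linear += gain * np.exp(mu_speech[k])
        mu[ch] = np.log(linear)
    return mu

mu_speech = [3.0, 2.8, 3.2]       # per-channel log-energy, solo speech
mu_nonspeech = [0.5, 0.4, 0.6]
print(overlap_state_mean(mu_speech, mu_nonspeech, {0, 2}))
```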


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2011

A single-port non-parametric model of turn-taking in multi-party conversation

Kornel Laskowski; Jens Edlund; Mattias Heldner

The taking of turns to speak is an intrinsic property of conversation. It is expected that models of taking turns, providing a prior distribution over conversational form, can reduce the perplexity of what is attended to and processed by spoken dialogue systems. We propose a single-port model of multi-party turn-taking which allows conversants to behave independently but to condition their behavior on the past of the entire group. The model performs at least as well as an existing multi-port model on perplexity over subsequent speech activity. We quantify the effect of longer histories and more distant future horizons, and argue that the framework has the potential to inform the design and behavior of spoken dialogue systems.
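
A minimal non-parametric sketch in the spirit of the model: one shared, participant-independent conditional probability table, estimated by counting, that lets each conversant's next speech/non-speech state depend on its own recent history and on the group's. The history encoding and smoothing are our simplifications of the paper's model.

```python
import numpy as np
from collections import defaultdict

def train_single_port(meetings, H=3):
    """One shared table keyed by (own history, others-active history);
    'single-port' here means the same model is applied to every
    participant, while the key still conditions on the whole group."""
    counts = defaultdict(lambda: [0, 0])          # key -> [n_silent, n_spoke]
    for act in meetings:                          # act: (T, K) binary array
        T, K = act.shape
        for k in range(K):
            for t in range(H, T):
                own = tuple(int(v) for v in act[t - H:t, k])
                others = tuple(int(act[t - h - 1].sum() - act[t - h - 1, k] > 0)
                               for h in range(H))
                counts[(own, others)][int(act[t, k])] += 1
    return counts

def prob_speak(counts, own, others, alpha=0.5):
    n0, n1 = counts[(own, others)]
    return (n1 + alpha) / (n0 + n1 + 2 * alpha)   # add-alpha smoothing

rng = np.random.default_rng(1)
meetings = [(rng.random((500, 4)) < 0.3).astype(int) for _ in range(5)]
table = train_single_port(meetings)
# P(continue speaking | spoke for the last 3 frames, group otherwise silent):
print(prob_speak(table, own=(1, 1, 1), others=(0, 0, 0)))
```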


Machine Learning for Multimodal Interaction (MLMI) | 2008

Detection of Laughter-in-Interaction in Multichannel Close-Talk Microphone Recordings of Meetings

Kornel Laskowski; Tanja Schultz

Laughter is a key element of human-human interaction, occurring surprisingly frequently in multi-party conversation. In meetings, laughter accounts for almost 10% of vocalization effort by time, and is known to be relevant for topic segmentation and the automatic characterization of affect. We present a system for the detection of laughter, and its attribution to specific participants, which relies on simultaneously decoding the vocal activity of all participants given multi-channel recordings. The proposed framework allows us to disambiguate laughter and speech not only acoustically, but also by constraining the number of simultaneous speakers and the number of simultaneous laughers independently, since participants tend to take turns speaking but laugh together. We present experiments on 57 hours of meeting data, containing almost 11,000 unique instances of laughter.
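
The interaction constraint can be pictured as a pruning of the joint vocal-activity state space, as in the sketch below: cap the number of simultaneous speakers while leaving the number of simultaneous laughers essentially unconstrained. State names and ceilings here are assumptions for illustration.

```python
from itertools import product

def joint_states(K, max_speakers=2, max_laughers=None):
    """Enumerate joint per-participant vocal-activity states under
    separate ceilings on simultaneous speakers and laughers: people
    tend to take turns speaking but laugh together."""
    if max_laughers is None:
        max_laughers = K                     # laughing together is allowed
    keep = []
    for state in product(("sil", "spk", "lgh"), repeat=K):
        if state.count("spk") <= max_speakers and state.count("lgh") <= max_laughers:
            keep.append(state)
    return keep

K = 5
unconstrained = 3 ** K
constrained = joint_states(K)
print(f"{unconstrained} joint states -> {len(constrained)} after constraints")
```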


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2009

Contrasting emotion-bearing laughter types in multiparticipant vocal activity detection for meetings

Kornel Laskowski

The detection of laughter in conversational interaction presents a significant challenge in meeting understanding, primarily because laughter is predictive of the emotional state of participants. We present evidence which suggests that ignoring unvoiced laughter improves the prediction of emotional involvement in collocated speech, making a case for the distinction between voiced and unvoiced laughter during laughter detection. Our experiments show that the exclusion of unvoiced laughter during laughter model training, as well as its explicit modeling, leads to detection scores for voiced laughter which are much higher than those otherwise obtained for all laughter. Furthermore, duration modeling is shown to be a more effective means of improving precision than interaction modeling through joint-participant decoding. Taken together, the final detection F-scores we present for voiced laughter on our development set comprise a 20% reduction of error relative to F-scores for all laughter reported in previous work, and 6% and 22% relative reductions in error on two larger datasets unseen during development.
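
A minimal sketch of duration modeling of the kind contrasted here: enforce a minimum duration on detected laughter by expanding the laughter state into a left-to-right chain of sub-states before Viterbi decoding, which suppresses implausibly short detections. The two-class setup and uniform transition weights are assumptions, not the paper's configuration.

```python
import numpy as np

def min_duration_decode(ll_other, ll_laugh, min_frames=3):
    """Viterbi over {other, laughter} where the laughter state is a
    chain of min_frames sub-states, forcing every laughter bout to
    last at least min_frames frames."""
    T = len(ll_other)
    S = 1 + min_frames                # state 0: other; 1..min: laughter chain
    NEG = -1e30
    delta = np.full((T, S), NEG)
    back = np.zeros((T, S), dtype=int)
    delta[0, 0], delta[0, 1] = ll_other[0], ll_laugh[0]
    preds = {0: [0, S - 1], 1: [0]}   # legal predecessors per state
    preds.update({s: [s - 1] for s in range(2, S - 1)})
    preds[S - 1] = [S - 2, S - 1] if S > 2 else [0, S - 1]
    for t in range(1, T):
        for s in range(S):
            p = max(preds[s], key=lambda q: delta[t - 1, q])
            back[t, s] = p
            emit = ll_other[t] if s == 0 else ll_laugh[t]
            delta[t, s] = delta[t - 1, p] + emit
    # Backtrace to a binary laughter decision per frame.
    path = np.zeros(T, dtype=int)
    s = int(np.argmax(delta[-1]))
    for t in range(T - 1, -1, -1):
        path[t] = 1 if s > 0 else 0
        s = back[t, s]
    return path

# Frame 2 is an isolated one-frame laughter spike; frames 5-8 are a bout.
ll_laugh = np.array([-2, -2, -0.1, -2, -2, -0.1, -0.1, -0.1, -0.1, -2.0])
ll_other = np.array([-0.1, -0.1, -2, -0.1, -0.1, -2, -2, -2, -2, -0.1])
print(min_duration_decode(ll_other, ll_laugh))   # spike suppressed, bout kept
```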


Machine Learning for Multimodal Interaction (MLMI) | 2006

The ISL RT-06S speech-to-text system

Christian Fügen; Shajith Ikbal; Florian Kraft; Kenichi Kumatani; Kornel Laskowski; John W. McDonough; Mari Ostendorf; Sebastian Stüker; Matthias Wölfel

This paper describes the 2006 lecture and conference meeting speech-to-text system developed at the Interactive Systems Laboratories (ISL) for the individual head-mounted microphone (IHM), single distant microphone (SDM), and multiple distant microphone (MDM) conditions, which was evaluated in the RT-06S Rich Transcription Meeting Evaluation sponsored by the US National Institute of Standards and Technology (NIST). We describe the principal differences between our current system and those submitted in previous years, namely improved acoustic and language models, cross-adaptation between systems with different front-ends and phoneme sets, and the use of various automatic speech segmentation algorithms.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2009

Modeling instantaneous intonation for speaker identification using the fundamental frequency variation spectrum

Kornel Laskowski; Qin Jin

In recent years, the field of automatic speaker identification has begun to exploit high-level sources of speaker-discriminative information, in addition to traditional models of spectral shape. These sources include pronunciation models, prosodic dynamics, pitch, pause, and duration features, phone streams, and conversational interaction. As part of this broader thrust, we explore a new frame-level vector representation of the instantaneous change in fundamental frequency, known as fundamental frequency variation (FFV). The FFV spectrum consists of 7 continuous coefficients, and can be directly modeled in a standard Gaussian mixture model (GMM) framework. Our experiments indicate that FFV features contain useful information for discriminating among speakers, and that model-space combination of FFV and cepstral features outperforms cepstral features alone. In particular, our results on 16 kHz Wall Street Journal data show relative reductions in error rate of 54% and 40% for female and male speakers, respectively.
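
A minimal sketch of the modeling step, assuming synthetic 7-dimensional FFV-like vectors and scikit-learn: train one GMM per speaker and identify by average frame log-likelihood. Component counts and the data are placeholders; the paper additionally combines FFV with cepstral models in model space.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_speaker_gmms(ffv_by_speaker, n_components=8):
    """One diagonal-covariance GMM per speaker over 7-dim vectors."""
    return {spk: GaussianMixture(n_components=n_components,
                                 covariance_type="diag",
                                 random_state=0).fit(X)
            for spk, X in ffv_by_speaker.items()}

def identify(gmms, X):
    """Choose the speaker whose GMM assigns the test frames the
    highest average log-likelihood."""
    return max(gmms, key=lambda spk: gmms[spk].score(X))

# Synthetic stand-ins for per-frame FFV vectors of two speakers.
rng = np.random.default_rng(2)
train = {"A": rng.normal(0.0, 0.3, size=(400, 7)),
         "B": rng.normal(0.4, 0.3, size=(400, 7))}
gmms = train_speaker_gmms(train)
test = rng.normal(0.4, 0.3, size=(50, 7))
print(identify(gmms, test))    # expected: "B"
```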


IEEE Spoken Language Technology Workshop (SLT) | 2008

Modeling vocal interaction for text-independent detection of involvement hotspots in multi-party meetings

Kornel Laskowski

Indexing, retrieval, and summarization in recordings of meetings have, to date, focused largely on the propositional content of what participants say. Although objectively relevant, such content may not be the sole or even the main aim of potential system users. Instead, users may be interested in information bearing on conversation flow. We explore the automatic detection of one example of such information, namely that of hotspots defined in terms of participant involvement. Our proposed system relies exclusively on low-level vocal activity features, and yields a classification accuracy of 84%, representing a 39% reduction of error relative to a baseline which selects the majority class.
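
For concreteness, the stated numbers can be cross-checked: an accuracy of 84% with a 39% relative error reduction implies a majority-class baseline of roughly 74% accuracy.

```python
acc_system, rel_reduction = 0.84, 0.39
err_system = 1.0 - acc_system                      # 0.16
err_baseline = err_system / (1.0 - rel_reduction)  # ~0.262
print(f"implied majority-class baseline accuracy: {1.0 - err_baseline:.1%}")
# -> implied majority-class baseline accuracy: 73.8%
```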


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2010

Comparing the contributions of context and prosody in text-independent dialog act recognition

Kornel Laskowski; Elizabeth Shriberg

Automatic segmentation and classification of dialog acts (DAs; e.g., statements versus questions) is important for spoken language understanding (SLU). While most systems have relied on word and word boundary information, interest in privacy-sensitive applications and non-ASR-based processing requires an approach that is text-independent. We propose a framework for employing both speech/non-speech-based (“contextual”) features and prosodic features, and apply it to DA segmentation and classification in multiparty meetings. We find that: (1) contextual features are better for recognizing turn edge DA types and DA boundary types, while prosodic features are better for finding floor mechanisms and backchannels; (2) the two knowledge sources are complementary for most of the DA types studied; and (3) the performance of the resulting system approaches that achieved using oracle lexical information for several DA types. These results suggest that there is significant promise in text-independent features for DA recognition, and possibly for other SLU tasks, particularly when words are not available.
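
One simple way to picture the complementarity of the two knowledge sources is score-level fusion of their per-class posteriors; the log-linear combination and equal weighting below are our assumptions, not the paper's method.

```python
import numpy as np

def fuse_posteriors(p_context, p_prosody, w=0.5):
    """Log-linear interpolation of per-class posteriors from the two
    knowledge sources, renormalized to sum to one."""
    logp = w * np.log(p_context + 1e-12) + (1 - w) * np.log(p_prosody + 1e-12)
    p = np.exp(logp - logp.max())
    return p / p.sum()

# Hypothetical posteriors over four DA types:
# [statement, question, backchannel, floor mechanism]
p_ctx = np.array([0.50, 0.20, 0.10, 0.20])   # contextual model
p_pro = np.array([0.20, 0.20, 0.50, 0.10])   # prosodic model
print(fuse_posteriors(p_ctx, p_pro))
```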

Collaboration


Dive into Kornel Laskowski's collaborations.

Top Co-Authors

Jens Edlund (Royal Institute of Technology)
Susanne Burger (Carnegie Mellon University)
Qin Jin (Renmin University of China)
Mari Ostendorf (University of Washington)
Alex Waibel (Karlsruhe Institute of Technology)
Matthias Wölfel (Karlsruhe Institute of Technology)
John W. McDonough (Carnegie Mellon University)