David A. van Leeuwen
Radboud University Nijmegen
Publications
Featured research published by David A. van Leeuwen.
Speech Communication | 2007
Khiet Phuong Truong; David A. van Leeuwen
Emotions can be recognized from audible paralinguistic cues in speech. By detecting these paralinguistic cues, which can consist of laughter, a trembling voice, coughs, changes in the intonation contour, etc., information about the speaker's state and emotion can be revealed. This paper describes the development of a gender-independent laugh detector with the aim of enabling automatic emotion recognition. Different types of features (spectral, prosodic) for laughter detection were investigated using different classification techniques (Gaussian Mixture Models, Support Vector Machines, Multi-Layer Perceptrons) often used in language and speaker recognition. Classification experiments were carried out with short pre-segmented speech and laughter segments extracted from the ICSI Meeting Recorder Corpus (with a mean duration of approximately 2 s). Equal error rates of around 3% were obtained when tested on speaker-independent speech data. We found that a fusion of classifiers based on Gaussian Mixture Models and classifiers based on Support Vector Machines increases discriminative power. We also found that a fusion of classifiers that use spectral features and classifiers that use prosodic information usually improves discrimination between laughter and speech. Our acoustic measurements showed differences between laughter and speech in mean pitch and in the ratio of the durations of unvoiced to voiced portions, which indicates that these prosodic features are indeed useful for discriminating laughter from speech.
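As a rough illustration of the fusion idea above, the following minimal sketch trains a GMM-based and an SVM-based laughter detector and fuses their scores with linear logistic regression. All data handling, feature extraction and variable names (X_train, y_train, etc.) are assumptions for illustration, not the paper's actual implementation.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression

def gmm_llr_scores(X_train, y_train, X_test, n_components=8):
    """Log-likelihood ratio of a laughter GMM versus a speech GMM."""
    gmm_laugh = GaussianMixture(n_components=n_components).fit(X_train[y_train == 1])
    gmm_speech = GaussianMixture(n_components=n_components).fit(X_train[y_train == 0])
    return gmm_laugh.score_samples(X_test) - gmm_speech.score_samples(X_test)

def svm_scores(X_train, y_train, X_test):
    """Signed distance to the SVM decision boundary as a detection score."""
    svm = SVC(kernel="rbf", gamma="scale").fit(X_train, y_train)
    return svm.decision_function(X_test)

def train_fusion(dev_score_lists, y_dev):
    """Linear logistic-regression fusion trained on development scores."""
    return LogisticRegression().fit(np.column_stack(dev_score_lists), y_dev)

# Fused scores for the test trials:
# fuser = train_fusion([gmm_dev_scores, svm_dev_scores], y_dev)
# fused = fuser.decision_function(np.column_stack([gmm_test, svm_test]))
```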
international conference on acoustics, speech, and signal processing | 2013
Taufiq Hasan; Rahim Saeidi; John H. L. Hansen; David A. van Leeuwen
Speaker recognition systems trained on long-duration utterances are known to perform significantly worse when short test segments are encountered. To address this mismatch, we analyze the effect of duration variability on the phoneme distributions of speech utterances and on i-vector length. We demonstrate that, as utterance duration is decreased, the number of detected unique phonemes and the i-vector length approach zero in a logarithmic and a non-linear fashion, respectively. Treating duration variability as additive noise in the i-vector space, we propose three different strategies for its compensation: i) multi-duration training of the Probabilistic Linear Discriminant Analysis (PLDA) model, ii) score calibration using log duration as a Quality Measure Function (QMF), and iii) multi-duration PLDA training with synthesized short-duration i-vectors. Experiments are designed based on the 2012 National Institute of Standards and Technology (NIST) Speaker Recognition Evaluation (SRE) protocol with varying test utterance duration. Experimental results demonstrate the effectiveness of the proposed schemes on short-duration test conditions, especially with the QMF calibration approach.
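A minimal sketch of strategy ii), assuming raw scores and test-segment durations for a set of labeled development trials: the calibrated score is an affine combination of the raw score and log duration, trained with logistic regression. Variable names are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def train_qmf_calibration(scores, durations_s, labels):
    """Fit s_cal = w0 + w1*s + w2*log(d) on target/non-target trials."""
    X = np.column_stack([scores, np.log(durations_s)])
    return LogisticRegression().fit(X, labels)

def calibrate(model, scores, durations_s):
    """Return calibrated scores on a log-likelihood-ratio-like scale."""
    return model.decision_function(np.column_stack([scores, np.log(durations_s)]))
```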
international conference on acoustics, speech, and signal processing | 2013
Mohamad Hasan Bahari; Rahim Saeidi; Hugo Van hamme; David A. van Leeuwen
In this paper, three utterance modelling approaches, namely the Gaussian Mean Supervector (GMS), the i-vector and the Gaussian Posterior Probability Supervector (GPPS), are applied to the accent recognition problem. For each utterance modelling method, three different classifiers, namely the Support Vector Machine (SVM), the Naive Bayesian Classifier (NBC) and the Sparse Representation Classifier (SRC), are employed to identify suitable matches between the utterance modelling schemes and the classifiers. The evaluation database is formed from English utterances of speakers whose native languages are Russian, Hindi, American English, Thai, Vietnamese and Cantonese. These utterances are drawn from the National Institute of Standards and Technology (NIST) 2008 Speaker Recognition Evaluation (SRE) database. The results show that GPPS and i-vector are more effective than GMS in this accent recognition task. It is also concluded that, among the employed classifiers, the best matches for the i-vector and GPPS are the SVM and the SRC, respectively.
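A minimal sketch of the best-performing i-vector/SVM pairing, assuming i-vectors have already been extracted; length normalization and the linear kernel are common choices in this setting rather than details taken from the paper.

```python
import numpy as np
from sklearn.preprocessing import normalize
from sklearn.svm import SVC

def train_accent_svm(ivectors, accent_labels):
    """Multi-class SVM over length-normalized i-vectors."""
    return SVC(kernel="linear").fit(normalize(ivectors), accent_labels)

def predict_accents(clf, ivectors):
    """Predict one accent label per test i-vector."""
    return clf.predict(normalize(ivectors))
```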
IEEE Transactions on Audio, Speech, and Language Processing | 2013
Miranti Indar Mandasari; Rahim Saeidi; Mitchell McLaren; David A. van Leeuwen
This paper investigates the effect of utterance duration on the calibration of a modern i-vector speaker recognition system with probabilistic linear discriminant analysis (PLDA) modeling. A calibration approach that deals with these effects by including duration in the calibration transformation via quality measure functions (QMFs) is proposed. Extensive experiments are performed to evaluate the robustness of the proposed calibration approach for conditions unseen in the training of the calibration parameters. Using the latest NIST corpora for evaluation, the results highlight the importance of considering quality metrics such as duration when calibrating the scores of automatic speaker recognition systems.
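For context, calibration quality in speaker recognition is often summarized by the Cllr metric; the sketch below computes it under the convention that scores are log-likelihood ratios in natural-log units. The abstract does not name its evaluation metric, so this choice is an illustrative assumption.

```python
import numpy as np

def cllr(target_llrs, nontarget_llrs):
    """Average cost of log-likelihood-ratio scores, in bits."""
    tar = np.asarray(target_llrs, dtype=float)
    non = np.asarray(nontarget_llrs, dtype=float)
    c_tar = np.mean(np.log2(1.0 + np.exp(-tar)))   # cost of low target scores
    c_non = np.mean(np.log2(1.0 + np.exp(non)))    # cost of high non-target scores
    return 0.5 * (c_tar + c_non)
```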
international conference on acoustics, speech, and signal processing | 2011
Marijn Huijbregts; Mitchell McLaren; David A. van Leeuwen
In this paper we present a method for automatically generating acoustic sub-word units that can substitute conventional phone models in a query-by-example spoken term detection system. We generate the sub-word units with a modified version of our speaker diarization system. Given a speech recording, the original diarization system generates a set of speaker models in an unsupervised manner without the need for training or development data. Modifying the diarization system to process the speech of a single speaker and decreasing the minimum segment duration constraint allows us to detect speaker-dependent sub-word units. For the task of query-by-example spoken term detection, we show that the proposed system performs well on both broadcast and non-broadcast recordings, unlike a conventional phone-based system trained solely on broadcast data. Mean average precisions of 0.28 and 0.38 were obtained on broadcast news and on a set of war veteran interviews, respectively.
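As a reference for the reported numbers, here is a minimal sketch of the mean average precision metric, assuming each query yields a ranked list of binary relevance judgments (1 = correct detection, 0 = false alarm); names are illustrative.

```python
import numpy as np

def average_precision(relevance):
    """AP of one ranked result list of 0/1 relevance flags."""
    flags = np.asarray(relevance, dtype=float)
    if flags.sum() == 0:
        return 0.0
    prec_at_rank = np.cumsum(flags) / (np.arange(len(flags)) + 1.0)
    return float((prec_at_rank * flags).sum() / flags.sum())

def mean_average_precision(ranked_lists):
    """Mean of per-query average precisions."""
    return float(np.mean([average_precision(r) for r in ranked_lists]))
```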
international conference on acoustics, speech, and signal processing | 2012
Miranti Indar Mandasari; Mitchell McLaren; David A. van Leeuwen
Motivated by the application of speaker recognition in the forensic domain, this paper presents a study of the noise robustness of several automatic speaker recognition approaches, ranging from simple dot-scoring and a standard i-vector system with cosine distance scoring to a state-of-the-art i-vector Probabilistic Linear Discriminant Analysis (PLDA) system. Using the recent NIST 2010 Speaker Recognition Evaluation (SRE) data, the systems are analyzed under added-noise conditions over a range of signal-to-noise ratios. Various experiments were conducted to study the influence of the noise on speech activity detection and Wiener filtering in the front-end of the system.
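A minimal sketch of the added-noise condition: scaling a noise signal so that, mixed with clean speech, it yields a requested signal-to-noise ratio. The paper's actual noise types and SNR grid are not reproduced here; names are illustrative.

```python
import numpy as np

def add_noise_at_snr(speech, noise, snr_db):
    """Mix noise into speech at the requested SNR (dB), power-based."""
    noise = np.resize(noise, speech.shape)            # loop/trim to length
    p_speech = np.mean(speech.astype(float) ** 2)
    p_noise = np.mean(noise.astype(float) ** 2)
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```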
Engineering Applications of Artificial Intelligence | 2014
Mohamad Hasan Bahari; Mitchell McLaren; Hugo Van hamme; David A. van Leeuwen
In this paper, a new approach for age estimation from speech signals based on i-vectors is proposed. In this method, each utterance is modeled by its corresponding i-vector. Then, a Within-Class Covariance Normalization (WCCN) technique is used for session variability compensation. Finally, least squares support vector regression (LSSVR) is applied to estimate the age of speakers. The proposed method is trained and tested on telephone conversations of the National Institute of Standards and Technology (NIST) 2010 and 2008 speaker recognition evaluation databases. Evaluation results show that the proposed method yields a significantly lower mean absolute error and a higher Pearson correlation coefficient between chronological speaker age and estimated speaker age compared to different conventional schemes. The relative improvements in mean absolute error and correlation coefficient over our best baseline system are around 5% and 2%, respectively. Finally, the effects of two major factors influencing the proposed age estimation system, namely utterance length and spoken language, are analyzed. A sketch of the pipeline follows the highlights below.
Highlights:
- A new approach for age estimation from speech signals based on i-vectors is proposed.
- Utterances are modeled using the i-vector framework.
- Within-class covariance normalization is used for session variability compensation.
- Least squares support vector regression is applied to estimate the age of speakers.
- The proposed method significantly improves on conventional schemes.
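A rough sketch of the pipeline: WCCN estimated from speaker-labeled training i-vectors, followed by support vector regression. scikit-learn offers no least-squares SVR, so standard epsilon-SVR stands in for LSSVR here; all names are illustrative assumptions.

```python
import numpy as np
from scipy.linalg import cholesky, inv
from sklearn.svm import SVR

def wccn_matrix(ivectors, speaker_ids):
    """Cholesky factor B with B B^T = W^{-1}; apply as x -> x @ B."""
    dim = ivectors.shape[1]
    W = np.zeros((dim, dim))
    speakers = np.unique(speaker_ids)
    for s in speakers:
        Xs = ivectors[speaker_ids == s]
        Xs = Xs - Xs.mean(axis=0)                 # center per speaker
        W += Xs.T @ Xs / len(Xs)
    W /= len(speakers)                            # mean within-class covariance
    return cholesky(inv(W), lower=True)

# B = wccn_matrix(ivec_train, spk_train)
# reg = SVR().fit(ivec_train @ B, ages_train)
# age_pred = reg.predict(ivec_test @ B)
```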
Proceedings of the international workshop on Human-centered multimedia | 2007
Willem A. Melder; K.P. Truong; Marten Den Uyl; David A. van Leeuwen; Mark A. Neerincx; Lodewijk R. Loos; B. Stock Plum
In this paper, we present a multimodal affective mirror that senses and elicits laughter. Currently, the mirror contains a vocal and a facial affect-sensing module, a component that fuses the output of these two modules to achieve a user-state assessment, a user-state transition model, and a component that presents audiovisual affective feedback intended to keep or bring the user in the intended state. Interaction with this intelligent interface involves a full cyclic process of sensing, interpreting, reacting, sensing (of the reaction effects), interpreting, and so on. The intention of the mirror is to evoke positive emotions, to make people laugh and to increase the laughter. The first user experience tests showed that users exhibit cooperative behavior, resulting in mutual user-mirror action-reaction cycles. Most users enjoyed the interaction with the mirror and were immersed in the experience.
Speech Communication | 2012
Khiet Phuong Truong; David A. van Leeuwen; Franciska de Jong
The differences between self-reported and observed emotion have only marginally been investigated in the context of speech-based automatic emotion recognition. We address this issue by comparing self-reported emotion ratings to observed emotion ratings and look at how differences between these two types of ratings affect the development and performance of automatic emotion recognizers developed with these ratings. A dimensional approach to emotion modeling is adopted: the ratings are based on continuous arousal and valence scales. We describe the TNO-Gaming Corpus that contains spontaneous vocal and facial expressions elicited via a multiplayer videogame and that includes emotion annotations obtained via self-report and observation by outside observers. Comparisons show that there are discrepancies between self-reported and observed emotion ratings which are also reflected in the performance of the emotion recognizers developed. Using Support Vector Regression in combination with acoustic and textual features, recognizers of arousal and valence are developed that can predict points in a 2-dimensional arousal-valence space. The results of these recognizers show that the self-reported emotion is much harder to recognize than the observed emotion, and that averaging ratings from multiple observers improves performance.
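A minimal sketch of such a recognizer: one support vector regressor per dimension, mapping per-utterance feature vectors to points in the arousal-valence plane. Feature extraction and all names are assumed for illustration, not taken from the paper.

```python
import numpy as np
from sklearn.svm import SVR

def train_av_recognizer(X, arousal, valence):
    """One SVR per emotion dimension."""
    return SVR().fit(X, arousal), SVR().fit(X, valence)

def predict_av(models, X):
    """Return an (n, 2) array of (arousal, valence) predictions."""
    svr_a, svr_v = models
    return np.column_stack([svr_a.predict(X), svr_v.predict(X)])
```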
international conference on foundations of augmented cognition | 2007
Khiet Phuong Truong; David A. van Leeuwen; Mark A. Neerincx
Two unobtrusive modalities for automatic emotion recognition are discussed: speech and facial expressions. First, an overview is given of emotion recognition studies based on a combination of speech and facial expressions. We will identify difficulties concerning data collection, data fusion, system evaluation and emotion annotation that one is most likely to encounter in emotion recognition research. Further, we identify some of the possible applications for emotion recognition such as health monitoring or e-learning systems. Finally, we will discuss the growing need for developing agreed standards in automatic emotion recognition research.