Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Stavros Petridis is active.

Publication


Featured research published by Stavros Petridis.


International Conference on Acoustics, Speech, and Signal Processing | 2008

Audiovisual discrimination between laughter and speech

Stavros Petridis; Maja Pantic

Past research on automatic laughter detection has focused mainly on audio-based detection. Here we present an audiovisual approach to distinguishing laughter from speech, and we show that integrating information from audio and video leads to improved reliability compared to single-modal approaches. We also investigated the level at which audiovisual information should be fused for the best performance. When tested on 96 audiovisual sequences depicting spontaneously displayed (as opposed to posed) laughter and speech episodes, the proposed audiovisual feature-level approach achieved an 86.9% recall rate with 76.7% precision.
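
For illustration, the feature-level (early) fusion described here can be sketched as follows: per-episode audio and visual descriptors are concatenated into a single vector before one classifier is trained. This is a minimal sketch on synthetic data; the feature names, dimensionalities, and classifier choice are assumptions, not the paper's exact setup.

```python
# Minimal sketch of feature-level (early) audiovisual fusion for
# laughter-vs-speech discrimination. Feature names and the classifier
# choice are illustrative, not the exact setup of the paper.
import numpy as np
from sklearn.neural_network import MLPClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

# Stand-in features: per-episode audio (e.g. spectral/prosodic) and
# video (e.g. facial-point) descriptors for N episodes.
N = 96
audio_feats = rng.normal(size=(N, 12))   # hypothetical audio descriptors
video_feats = rng.normal(size=(N, 20))   # hypothetical visual descriptors
labels = rng.integers(0, 2, size=N)      # 1 = laughter, 0 = speech

# Feature-level fusion: concatenate the modalities before classification.
fused = np.hstack([audio_feats, video_feats])

clf = MLPClassifier(hidden_layer_sizes=(32,), max_iter=1000, random_state=0)
scores = cross_val_score(clf, fused, labels, cv=5, scoring="recall")
print("mean recall on synthetic data:", scores.mean())
```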


Image and Vision Computing | 2013

The MAHNOB Laughter database

Stavros Petridis; Brais Martinez; Maja Pantic

Laughter is clearly an audiovisual event, consisting of the laughter vocalization and of facial activity, mainly around the mouth and sometimes in the upper face. A major obstacle in studying the audiovisual aspects of laughter is the lack of suitable data. For this reason, the majority of past research on laughter classification/detection has focused on audio-only approaches. A few audiovisual studies exist which use audiovisual data from existing corpora of recorded meetings. The main problem with such data is that they usually contain large head movements which make audiovisual analysis very difficult. In this work, we present a new publicly available audiovisual database, the MAHNOB Laughter database, suitable for studying laughter. It contains 22 subjects who were recorded while watching stimulus material, using two microphones, a video camera and a thermal camera. The primary goal was to elicit laughter, but posed smiles, posed laughter, and speech were recorded as well. In total, 180 sessions are available with a total duration of 3 h and 49 min. There are 563 laughter episodes, 849 speech utterances, 51 posed laughs, 67 speech-laugh episodes and 167 other vocalizations annotated in the database. We also report baseline experiments for audio, visual and audiovisual approaches for laughter-vs-speech discrimination, as well as further experiments on discrimination between voiced laughter, unvoiced laughter and speech. These results suggest that the combination of audio and visual information is beneficial in the presence of acoustic noise and helps discriminate between voiced laughter episodes and speech utterances. Finally, we report preliminary experiments on laughter-vs-speech discrimination based on thermal images.


IEEE Transactions on Multimedia | 2011

Audiovisual Discrimination Between Speech and Laughter: Why and When Visual Information Might Help

Stavros Petridis; Maja Pantic

Past research on automatic laughter classification/detection has focused mainly on audio-based approaches. Here we present an audiovisual approach to distinguishing laughter from speech, and we show that integrating the information from the audio and video channels may lead to improved performance over single-modal approaches. Both channels consist of two streams (cues): facial expressions and head pose for video, and cepstral and prosodic features for audio. Two types of experiments were performed: 1) subject-independent cross-validation on the AMI dataset and 2) cross-database experiments on the AMI and SAL datasets. We experimented with different combinations of cues, the most informative being the combination of facial expressions, cepstral, and prosodic features. Our results suggest that the performance of the audiovisual approach is better on average than that of single-modal approaches. The addition of visual information produces better results for female subjects. When the training conditions are less diverse in terms of head movements than the testing conditions (training on the SAL dataset, testing on the AMI dataset), no improvement is observed from adding visual information. On the other hand, when the training conditions are similar (cross-validation on the AMI dataset) or more diverse (training on the AMI dataset, testing on the SAL dataset) in terms of head movements than the testing conditions, adding visual information to audio yields an absolute increase of about 3% in the F1 rate for laughter.


International Conference on Multimedia and Expo | 2009

Is this joke really funny? Judging the mirth by audiovisual laughter analysis

Stavros Petridis; Maja Pantic

This paper presents the results of an empirical study suggesting that, while laughter is a very good indicator of amusement, the kind of laughter (unvoiced laughter vs. voiced laughter) is correlated with the mirth of the laughter and could potentially be used to judge the actual hilarity of the stimulus joke. For this study, an automated method for audiovisual analysis of laughter episodes exhibited while watching movie clips or observing the behaviour of a conversational agent has been developed. The audio and visual features, based on spectral properties of the acoustic signal and facial expressions respectively, have been integrated using feature-level fusion, resulting in a multimodal approach to distinguishing voiced laughter from unvoiced laughter and speech. The classification accuracy of such a system tested on spontaneous laughter episodes is 74%. Finally, preliminary results are presented which provide evidence that unvoiced laughter can be interpreted as less gleeful than voiced laughter, and consequently the detection of these two types of laughter can be used to label multimedia content as mildly funny or very funny, respectively.


International Conference on Multimodal Interfaces | 2009

Static vs. dynamic modeling of human nonverbal behavior from multiple cues and modalities

Stavros Petridis; Hatice Gunes; Sebastian Kaltwang; Maja Pantic

Human nonverbal behavior recognition from multiple cues and modalities has attracted a lot of interest in recent years. Despite this interest, many questions, including the type of feature representation, the choice of static vs. dynamic classification schemes, the number and type of cues or modalities to use, and the optimal way of fusing these, remain open. This paper compares frame-based vs. window-based feature representation and employs static vs. dynamic classification schemes for two distinct problems in the field of automatic human nonverbal behavior analysis: multicue discrimination between posed and spontaneous smiles from facial expressions, head and shoulder movements, and audiovisual discrimination between laughter and speech. Single-cue and single-modality results are compared to multicue and multimodal results by employing Neural Networks, Hidden Markov Models (HMMs), and 2- and 3-chain coupled HMMs. Subject-independent experimental evaluation shows that: 1) for both static and dynamic classification, fusing data coming from multiple cues and modalities proves useful to the overall task of recognition, 2) the type of feature representation has a direct impact on classification performance, and 3) static classification is comparable to dynamic classification both for multicue discrimination between posed and spontaneous smiles and for audiovisual discrimination between laughter and speech.
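
To make the static vs. dynamic contrast concrete, here is a minimal sketch on synthetic sequences: a frame-level classifier with per-sequence voting (static) versus one HMM per class scored by log-likelihood (dynamic). The models, feature dimensions, and data are placeholders, not the configurations used in the paper, and the sketch assumes the hmmlearn package is available.

```python
# Sketch contrasting static (frame-level) vs. dynamic (sequence-level)
# classification; models and data are placeholders, not the paper's setup.
import numpy as np
from sklearn.linear_model import LogisticRegression
from hmmlearn.hmm import GaussianHMM   # assumes hmmlearn is installed

rng = np.random.default_rng(0)

def make_sequences(n_seq, mean, length=40, dim=6):
    """Synthetic feature sequences for one class."""
    return [rng.normal(loc=mean, size=(length, dim)) for _ in range(n_seq)]

class_a = make_sequences(20, mean=0.0)   # e.g. "speech"
class_b = make_sequences(20, mean=0.7)   # e.g. "laughter"

# --- Static approach: classify every frame, then vote per sequence.
X = np.vstack(class_a + class_b)
y = np.concatenate([np.zeros(sum(len(s) for s in class_a)),
                    np.ones(sum(len(s) for s in class_b))])
static_clf = LogisticRegression(max_iter=1000).fit(X, y)

def static_predict(seq):
    return int(static_clf.predict(seq).mean() > 0.5)   # majority vote

# --- Dynamic approach: one HMM per class, pick the higher likelihood.
def fit_hmm(seqs):
    hmm = GaussianHMM(n_components=3, covariance_type="diag", n_iter=50)
    hmm.fit(np.vstack(seqs), lengths=[len(s) for s in seqs])
    return hmm

hmm_a, hmm_b = fit_hmm(class_a), fit_hmm(class_b)

def dynamic_predict(seq):
    return int(hmm_b.score(seq) > hmm_a.score(seq))

test = make_sequences(5, mean=0.7)        # held-out "laughter" sequences
print([static_predict(s) for s in test], [dynamic_predict(s) for s in test])
```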


International Conference on Acoustics, Speech, and Signal Processing | 2016

Deep complementary bottleneck features for visual speech recognition

Stavros Petridis; Maja Pantic

Deep bottleneck features (DBNFs) have been used successfully in the past for acoustic speech recognition from audio. However, research on extracting DBNFs for visual speech recognition is very limited. In this work, we present an approach to extract deep bottleneck visual features based on deep autoencoders. To the best of our knowledge, this is the first work that extracts DBNFs for visual speech recognition directly from pixels. We first train a deep autoencoder with a bottleneck layer in order to reduce the dimensionality of the image. Then the autoencoder's decoding layers are replaced by classification layers, which make the bottleneck features more discriminative. Discrete Cosine Transform (DCT) features are also appended in the bottleneck layer during training in order to make the bottleneck features complementary to DCT features. Long Short-Term Memory (LSTM) networks are used to model the temporal dynamics, and the performance is evaluated on the OuluVS and AVLetters databases. The extracted complementary DBNFs in combination with DCT features achieve the best performance, resulting in an absolute improvement of up to 5% over the DCT baseline.
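
The PyTorch sketch below illustrates the overall pipeline described above: a bottleneck encoder over mouth-image pixels whose output is concatenated with DCT features and passed to an LSTM classifier. It omits the autoencoder pretraining and decoder-replacement stages, and all layer sizes are assumptions rather than the paper's configuration.

```python
# Rough PyTorch sketch: bottleneck encoder over pixels, concatenated with
# DCT features, fed to an LSTM. Layer sizes are illustrative assumptions.
import torch
import torch.nn as nn

class BottleneckEncoder(nn.Module):
    """Encoder half of an autoencoder ending in a small bottleneck layer."""
    def __init__(self, n_pixels=32 * 32, bottleneck=50):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_pixels, 1000), nn.ReLU(),
            nn.Linear(1000, 500), nn.ReLU(),
            nn.Linear(500, bottleneck),
        )
    def forward(self, x):          # x: (batch, time, n_pixels)
        return self.net(x)

class BottleneckDCTLSTM(nn.Module):
    """Classify sequences from bottleneck features appended to DCT features."""
    def __init__(self, n_pixels=32 * 32, bottleneck=50, n_dct=50, n_classes=10):
        super().__init__()
        self.encoder = BottleneckEncoder(n_pixels, bottleneck)
        self.lstm = nn.LSTM(bottleneck + n_dct, 128, batch_first=True)
        self.out = nn.Linear(128, n_classes)
    def forward(self, pixels, dct):
        feats = torch.cat([self.encoder(pixels), dct], dim=-1)
        h, _ = self.lstm(feats)
        return self.out(h[:, -1])  # classify from the last time step

# Shape check on random data (batch of 4 sequences, 30 frames each).
model = BottleneckDCTLSTM()
logits = model(torch.randn(4, 30, 32 * 32), torch.randn(4, 30, 50))
print(logits.shape)   # torch.Size([4, 10])
```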


Conference on Image and Video Retrieval | 2008

Fusion of audio and visual cues for laughter detection

Stavros Petridis; Maja Pantic

Past research on automatic laughter detection has focused mainly on audio-based detection. Here we present an audio-visual approach to distinguishing laughter from speech, and we show that integrating the information from the audio and video channels leads to improved performance over single-modal approaches. Each channel consists of two streams (cues): facial expressions and head movements for video, and spectral and prosodic features for audio. We used decision-level fusion to integrate the information from the two channels and experimented with the SUM rule and a neural network as the integration functions. The results indicate that even a simple linear function such as the SUM rule achieves very good performance in audiovisual fusion. We also experimented with different combinations of cues, the most informative being the facial expressions and the spectral features. The best combination of cues is the integration of facial expressions, spectral and prosodic features when a neural network is used as the fusion method. When tested on 96 audiovisual sequences depicting spontaneously displayed (as opposed to posed) laughter and speech episodes, in a person-independent way, the proposed audiovisual approach achieves over 90% recall and over 80% precision.
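
The SUM-rule fusion mentioned above can be illustrated in a few lines: the per-cue posterior probabilities are simply averaged and the fused score is thresholded. The cue names and probability values below are illustrative only, not results from the paper.

```python
# Minimal numpy sketch of decision-level fusion with the SUM rule: average
# the per-cue posterior probabilities and pick the higher class.
import numpy as np

# Hypothetical laughter probabilities from four single-cue classifiers for
# one test segment: facial expressions, head movements, spectral features,
# prosodic features.
p_laughter = np.array([0.82, 0.55, 0.90, 0.61])

# SUM rule: the fused score is the (equal-weight) average of the cue scores.
fused = p_laughter.mean()
label = "laughter" if fused > 0.5 else "speech"
print(f"fused P(laughter) = {fused:.2f} -> {label}")
```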


International Conference on Acoustics, Speech, and Signal Processing | 2011

Audiovisual classification of vocal outbursts in human conversation using Long-Short-Term Memory networks

Florian Eyben; Stavros Petridis; Björn W. Schuller; Georgios Tzimiropoulos; Stefanos Zafeiriou; Maja Pantic

We investigate the classification of non-linguistic vocalisations with a novel audiovisual approach, using Long Short-Term Memory (LSTM) Recurrent Neural Networks as highly successful dynamic sequence classifiers. The evaluation database is this year's Paralinguistic Challenge's Audiovisual Interest Corpus of natural human-to-human conversation. For video-based analysis we compare shape- and appearance-based features, which are fused early with typical audio descriptors. The results show significant improvements of LSTM networks over a static approach based on Support Vector Machines. More importantly, we show a significant gain in performance when fusing audio and visual shape features.


International Conference on Acoustics, Speech, and Signal Processing | 2017

End-to-end visual speech recognition with LSTMs

Stavros Petridis; Zuwei Li; Maja Pantic

Traditional visual speech recognition systems consist of two stages, feature extraction and classification. Recently, several deep learning approaches have been presented which automatically extract features from the mouth images and aim to replace the feature extraction stage. However, research on joint learning of features and classification is very limited. In this work, we present an end-to-end visual speech recognition system based on Long Short-Term Memory (LSTM) networks. To the best of our knowledge, this is the first model which simultaneously learns to extract features directly from the pixels and perform classification, and which also achieves state-of-the-art performance in visual speech classification. The model consists of two streams which extract features directly from the mouth and difference images, respectively. The temporal dynamics in each stream are modelled by an LSTM, and the fusion of the two streams takes place via a Bidirectional LSTM (BLSTM). An absolute improvement of 9.7% over the baseline is reported on the OuluVS2 database, and of 1.5% on the CUAVE database when compared with other methods which use a similar visual front-end.
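
A rough PyTorch sketch of the two-stream idea follows: one LSTM over raw mouth images, one over difference images, with a bidirectional LSTM fusing the two streams. The image size, hidden sizes, and number of classes are assumptions, not the paper's exact architecture.

```python
# PyTorch sketch of a two-stream visual speech model: raw-image stream,
# difference-image stream, BLSTM fusion. Dimensions are assumptions.
import torch
import torch.nn as nn

class TwoStreamVSR(nn.Module):
    def __init__(self, n_pixels=44 * 50, hidden=256, n_classes=10):
        super().__init__()
        self.raw_lstm = nn.LSTM(n_pixels, hidden, batch_first=True)
        self.diff_lstm = nn.LSTM(n_pixels, hidden, batch_first=True)
        self.fusion = nn.LSTM(2 * hidden, hidden, batch_first=True,
                              bidirectional=True)
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, frames):                  # frames: (batch, time, pixels)
        diffs = frames[:, 1:] - frames[:, :-1]  # temporal difference images
        h_raw, _ = self.raw_lstm(frames[:, 1:])
        h_diff, _ = self.diff_lstm(diffs)
        h, _ = self.fusion(torch.cat([h_raw, h_diff], dim=-1))
        return self.out(h[:, -1])               # predict from the last step

model = TwoStreamVSR()
print(model(torch.randn(2, 29, 44 * 50)).shape)   # torch.Size([2, 10])
```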


Pattern Recognition Letters | 2015

The MAHNOB Mimicry Database

Sanjay Bilakhia; Stavros Petridis; Antinus Nijholt; Maja Pantic

We present an audiovisual dataset for investigation of mimicry behaviour. We report baseline performances from per-session mimicry classification experiments. Performance is session-dependent, due to variability in subject expressiveness. Current mimicry classification methods need more development for spontaneous data. People mimic verbal and nonverbal expressions and behaviour of their counterparts in various social interactions. Research in psychology and social sciences has shown that mimicry has the power to influence social judgment and various social behaviours, including negotiation and debating, courtship, empathy and helping behaviour. Hence, automatic recognition of mimicry behaviour would be a valuable tool in various domains, and especially in negotiation skills enhancement and medical help provision training. In this work, we present the MAHNOB mimicry database, a set of fully synchronised, multi-sensory, audiovisual recordings of naturalistic dyadic interactions, suitable for investigation of mimicry and negotiation behaviour. The database contains 11 h of recordings, split over 54 sessions of dyadic interactions between 12 confederates and their 48 counterparts, being engaged either in a socio-political discussion or negotiating a tenancy agreement. To provide a benchmark for efforts in machine understanding of mimicry behaviour, we report a number of baseline experiments based on visual data only. Specifically, we consider face and head movements, and report on binary classification of video sequences into mimicry and non-mimicry categories based on the following widely-used methodologies: two similarity-based methods (cross correlation and time warping), and a state-of-the-art temporal classifier (Long Short Term Memory Recurrent Neural Network). The best reported results are session-dependent, and affected by the sparsity of positive examples in the data. This suggests that there is much room for improvement upon the reported baseline experiments.
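
As an illustration of the cross-correlation baseline mentioned above, the sketch below classifies a dyadic segment as mimicry when the peak normalised cross-correlation between the two subjects' motion signals exceeds a threshold. The signals, lag window, and threshold are illustrative assumptions, not the paper's protocol.

```python
# Small numpy sketch of a cross-correlation mimicry baseline: compare two
# subjects' motion signals over a range of lags and threshold the peak.
import numpy as np

def peak_xcorr(a, b, max_lag=25):
    """Peak normalised cross-correlation of two 1-D signals over +/- max_lag."""
    a = (a - a.mean()) / (a.std() + 1e-8)
    b = (b - b.mean()) / (b.std() + 1e-8)
    n = len(a)
    scores = [np.dot(a[max(0, -lag):n - max(0, lag)],
                     b[max(0, lag):n - max(0, -lag)]) / n
              for lag in range(-max_lag, max_lag + 1)]
    return max(scores)

rng = np.random.default_rng(0)
subject_a = rng.normal(size=200)                   # e.g. head pitch over time
subject_b = np.roll(subject_a, 10) + 0.3 * rng.normal(size=200)  # delayed copy

score = peak_xcorr(subject_a, subject_b)
print("mimicry" if score > 0.5 else "non-mimicry", round(score, 2))
```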

Collaboration


Dive into Stavros Petridis's collaborations.

Top Co-Authors

Maja Pantic

Imperial College London

Jie Shen

Imperial College London

Zuwei Li

Imperial College London

Hatice Gunes

University of Cambridge