Mihalis A. Nicolaou
Imperial College London
Publications
Featured research published by Mihalis A. Nicolaou.
IEEE Transactions on Affective Computing | 2011
Mihalis A. Nicolaou; Hatice Gunes; Maja Pantic
Past research in analysis of human affect has focused on recognition of prototypic expressions of six basic emotions based on posed data acquired in laboratory settings. Recently, there has been a shift toward subtle, continuous, and context-specific interpretations of affective displays recorded in naturalistic and real-world settings, and toward multimodal analysis and recognition of human affect. Converging with this shift, this paper presents, to the best of our knowledge, the first approach in the literature that: 1) fuses facial expression, shoulder gesture, and audio cues for dimensional and continuous prediction of emotions in valence and arousal space, 2) compares the performance of two state-of-the-art machine learning techniques applied to the target problem, bidirectional Long Short-Term Memory neural networks (BLSTM-NNs) and Support Vector Machines for Regression (SVR), and 3) proposes an output-associative fusion framework that incorporates correlations and covariances between the emotion dimensions. The proposed approach is evaluated on spontaneous SAL data from four subjects using subject-dependent leave-one-sequence-out cross-validation. The experimental results show that: 1) on average, BLSTM-NNs outperform SVR due to their ability to learn past and future context, 2) the proposed output-associative fusion framework outperforms feature-level and model-level fusion by modeling and learning correlations and patterns between the valence and arousal dimensions, and 3) the proposed system reliably reproduces the valence and arousal ground truth obtained from human coders.
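A minimal sketch of the output-associative idea on toy data: per-dimension regressors are trained first, and a second-stage regressor then consumes both dimensions' initial predictions over a temporal window. The synthetic data, window size, and sklearn SVR (standing in for the paper's BLSTM-NN/SVR setup) are illustrative assumptions, not the authors' configuration.

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
T, D = 500, 20
X = rng.normal(size=(T, D))                    # toy fused audio-visual features
w_v, w_a = rng.normal(size=D), rng.normal(size=D)
valence = np.tanh(X @ w_v)
arousal = np.tanh(0.5 * X @ w_v + X @ w_a)     # the two dimensions share structure

# Stage 1: an independent regressor per emotion dimension.
pred = np.column_stack([SVR().fit(X, y).predict(X) for y in (valence, arousal)])

# Stage 2: the output-associative regressor sees *both* dimensions'
# predictions over a temporal window, exploiting their correlations.
win = 5
pad = np.pad(pred, ((win, win), (0, 0)), mode="edge")
Z = np.stack([pad[t:t + 2 * win + 1].ravel() for t in range(T)])
oa_valence = SVR().fit(Z, valence).predict(Z)
print("stage-1 corr:           ", np.corrcoef(pred[:, 0], valence)[0, 1])
print("output-associative corr:", np.corrcoef(oa_valence, valence)[0, 1])
```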
Computer Vision and Pattern Recognition | 2016
George Trigeorgis; Patrick Snape; Mihalis A. Nicolaou; Epameinondas Antonakos; Stefanos Zafeiriou
Cascaded regression has recently become the method of choice for solving non-linear least squares problems such as deformable image alignment. Given a sizeable training set, cascaded regression learns a set of generic rules that are sequentially applied to minimise the least squares problem. Despite the success of cascaded regression for problems such as face alignment and head pose estimation, there are several shortcomings arising in the strategies proposed thus far. Specifically, (a) the regressors are learnt independently, (b) the descent directions may cancel one another out, and (c) handcrafted features (e.g., HOG, SIFT) are mainly used to drive the cascade, which may be sub-optimal for the task at hand. In this paper, we propose a combined and jointly trained convolutional recurrent neural network architecture that allows the training of an end-to-end system alleviating the aforementioned drawbacks. The recurrent module facilitates the joint optimisation of the regressors by assuming the cascades form a nonlinear dynamical system, in effect fully utilising the information between all cascade levels by introducing a memory unit that shares information across all levels. The convolutional module allows the network to extract features that are specialised for the task at hand and are experimentally shown to outperform hand-crafted features. We show that applying the proposed architecture to the problem of face alignment results in a strong improvement over the current state-of-the-art.
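To make the architectural idea concrete, here is a minimal PyTorch sketch of a convolutional recurrent cascade for shape refinement: learned convolutional features replace hand-crafted descriptors, and a shared recurrent memory couples the cascade steps so the regressors are optimised jointly. All layer sizes, the step count, and the module names are placeholders, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ConvRecurrentCascade(nn.Module):
    def __init__(self, n_landmarks=68, steps=4, hidden=128):
        super().__init__()
        self.steps = steps
        self.features = nn.Sequential(           # learned, task-specific features
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(8), nn.Flatten(),
            nn.Linear(16 * 8 * 8, hidden),
        )
        self.rnn = nn.GRUCell(hidden + 2 * n_landmarks, hidden)
        self.delta = nn.Linear(hidden, 2 * n_landmarks)  # per-step shape update

    def forward(self, img, init_shape):
        feat = self.features(img)
        shape = init_shape                        # (B, 2 * n_landmarks)
        h = torch.zeros(img.size(0), self.rnn.hidden_size, device=img.device)
        for _ in range(self.steps):               # jointly trained cascade
            h = self.rnn(torch.cat([feat, shape], dim=1), h)
            shape = shape + self.delta(h)         # descent step driven by memory
        return shape

model = ConvRecurrentCascade()
out = model(torch.randn(2, 1, 64, 64), torch.zeros(2, 136))
print(out.shape)  # torch.Size([2, 136])
```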
IEEE International Conference on Automatic Face and Gesture Recognition | 2011
Mihalis A. Nicolaou; Hatice Gunes; Maja Pantic
Many problems in machine learning and computer vision consist of predicting multi-dimensional output vectors given a specific set of input features. In many of these problems, there exist inherent temporal and spatial dependencies between the output vectors, as well as repeating output patterns and input-output associations, that can provide more robust and accurate predictors when modelled properly. With this intrinsic motivation, we propose a novel Output-Associative Relevance Vector Machine (OA-RVM) regression framework that augments traditional RVM regression by learning non-linear input and output dependencies. Instead of depending solely on the input patterns, OA-RVM models output structure and covariances within a predefined temporal window, thus capturing past, current and future context. As a result, output patterns manifested in the training data are captured within a formal probabilistic framework, and subsequently used during inference. As a proof of concept, we target the highly challenging problem of dimensional and continuous prediction of emotions from naturalistic facial expressions. We demonstrate the advantages of the proposed OA-RVM regression by performing both subject-dependent and subject-independent experiments using the SAL database. The experimental results show that OA-RVM regression outperforms traditional RVM and SVM regression approaches in prediction accuracy, generating more robust and accurate models.
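A rough sketch of the output-associative design under stated assumptions: sklearn's ARDRegression over an RBF kernel expansion stands in for the RVM (both place ARD priors over a kernel basis), and the design matrix is augmented with a temporal window of initial output estimates so that output structure is modelled alongside the inputs. Data and window size are toy choices.

```python
import numpy as np
from sklearn.linear_model import ARDRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(1)
T, D, win = 300, 10, 3
X = rng.normal(size=(T, D))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=T)   # toy continuous emotion trace

K = rbf_kernel(X, X)                              # kernel basis, one column per sample
initial = ARDRegression().fit(K, y).predict(K)    # RVM-style sparse-Bayesian fit

# Augment the design with past/current/future initial output estimates.
pad = np.pad(initial, win, mode="edge")
W = np.stack([pad[t:t + 2 * win + 1] for t in range(T)])
Z = np.hstack([K, W])
oa = ARDRegression().fit(Z, y).predict(Z)
print("RMSE initial:           ", np.sqrt(np.mean((initial - y) ** 2)))
print("RMSE output-associative:", np.sqrt(np.mean((oa - y) ** 2)))
```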
International Conference on Pattern Recognition | 2010
Mihalis A. Nicolaou; Hatice Gunes; Maja Pantic
This paper focuses on audio-visual (using facial expression, shoulder and audio cues) classification of spontaneous affect, utilising generative models for classification: (i) Maximum Likelihood Classification, under the assumption that the generative model structure in the classifier is correct, and (ii) Likelihood Space Classification, under the assumption that the generative model structure may be incorrect, so that classification performance can be improved by projecting the outputs of the generative classifiers onto a likelihood space and then applying discriminative classifiers. Experiments are conducted utilising Hidden Markov Models for single-cue classification, and 2- and 3-chain coupled Hidden Markov Models for fusing multiple cues and modalities. For discriminative classification, we utilise Support Vector Machines. Results show that Likelihood Space Classification (91.76%) improves upon Maximum Likelihood Classification (79.1%). Thereafter, we introduce the concept of fusion in the likelihood space, which is shown to outperform the typically used model-level fusion, attaining a classification accuracy of 94.01% and further improving upon all previous results.
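A minimal illustration of likelihood-space classification, with Gaussian mixtures standing in for the paper's (coupled) HMMs: each class receives a generative model, and instead of taking the likelihood arg-max, the vector of per-class log-likelihoods is fed to a discriminative SVM. Data and model sizes are toy assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(2)
X0 = rng.normal(loc=0.0, size=(200, 5))
X1 = rng.normal(loc=0.7, size=(200, 5))
X, y = np.vstack([X0, X1]), np.repeat([0, 1], 200)

# One generative model per class (a GMM here, not a coupled HMM).
models = [GaussianMixture(2, random_state=0).fit(X[y == c]) for c in (0, 1)]

# Maximum Likelihood Classification: arg-max over class log-likelihoods.
L = np.column_stack([m.score_samples(X) for m in models])  # the likelihood space
ml_acc = np.mean(L.argmax(axis=1) == y)

# Likelihood Space Classification: a discriminative boundary in that space,
# compensating for possible generative-model misspecification.
ls_acc = np.mean(SVC().fit(L, y).predict(L) == y)
print(f"ML: {ml_acc:.3f}  likelihood-space SVM: {ls_acc:.3f}")
```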
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2014
Mihalis A. Nicolaou; Vladimir Pavlovic; Maja Pantic
Fusing multiple continuous expert annotations is a crucial problem in machine learning and computer vision, particularly when dealing with uncertain and subjective tasks related to affective behavior. Inspired by the concept of inferring shared and individual latent spaces in Probabilistic Canonical Correlation Analysis (PCCA), we propose a novel generative model that discovers temporal dependencies on the shared/individual spaces (Dynamic Probabilistic CCA, DPCCA). To accommodate the temporal lags that are prominent amongst continuous annotations, we further introduce a latent warping process, leading to the DPCCA with Time Warpings (DPCTW) model. Finally, we propose two supervised variants of DPCCA/DPCTW which incorporate inputs (i.e., visual or audio features), both in a generative (SG-DPCCA) and a discriminative manner (SD-DPCCA). We show that the resulting family of models (i) can be used as a unifying framework for solving the problems of temporal alignment and fusion of multiple annotations in time, (ii) can automatically rank and filter annotations based on latent posteriors or other model statistics, and (iii) by incorporating dynamics, modeling annotation-specific biases, noise estimation, time warping and supervision, DPCTW outperforms state-of-the-art methods for both the aggregation of multiple, yet imperfect, expert annotations and the alignment of affective behavior.
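As a drastically simplified stand-in for the shared/individual decomposition (no dynamics and no time warping), the sketch below fuses several noisy, biased annotations of one latent affect trace with a one-factor probabilistic model; the annotator parameters and the use of sklearn's FactorAnalysis are illustrative assumptions, not the DPCCA model itself.

```python
import numpy as np
from sklearn.decomposition import FactorAnalysis

rng = np.random.default_rng(3)
t = np.linspace(0, 6 * np.pi, 400)
truth = np.sin(t) * np.exp(-t / 15)              # latent affect trace
A = np.column_stack([                            # 4 imperfect annotators:
    a * truth + b + s * rng.normal(size=t.size)  # scale, bias, noise
    for a, b, s in [(1.0, 0.0, 0.2), (0.8, 0.3, 0.3),
                    (1.2, -0.2, 0.25), (0.9, 0.1, 0.4)]
])

fa = FactorAnalysis(n_components=1).fit(A)       # shared factor + individual noise
shared = fa.transform(A).ravel()                 # posterior over the shared space
shared *= np.sign(np.corrcoef(shared, truth)[0, 1])  # resolve sign ambiguity
print("mean annotator corr:",
      np.mean([np.corrcoef(A[:, j], truth)[0, 1] for j in range(4)]))
print("fused corr:         ", np.corrcoef(shared, truth)[0, 1])
```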
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2016
Yannis Panagakis; Mihalis A. Nicolaou; Stefanos Zafeiriou; Maja Pantic
Recovering the correlated and individual components of two, possibly temporally misaligned, sets of data is a fundamental task in disciplines such as image, vision, and behavior computing, with application to problems such as multi-modal fusion (via the correlated components), predictive analysis, and clustering (via the individual ones). Here, we study the extraction of correlated and individual components under real-world conditions, namely i) the presence of gross non-Gaussian noise and ii) temporally misaligned data. In this light, we propose a method for the Robust Correlated and Individual Component Analysis (RCICA) of two sets of data in the presence of gross, sparse errors. We furthermore extend RCICA to handle temporal incongruities arising in the data. To this end, two suitable optimization problems are solved. The generality of the proposed methods is demonstrated by applying them to four applications, namely i) heterogeneous face recognition, ii) multi-modal feature fusion for human behavior analysis (i.e., audio-visual prediction of interest and conflict), iii) face clustering, and iv) the temporal alignment of facial expressions. Experimental results on two synthetic and seven real-world datasets indicate the robustness and effectiveness of the proposed methods on these application domains, outperforming other state-of-the-art methods in the field.
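The gross-error handling in this line of work builds on decompositions of the form M = L + S (low-rank structure plus sparse errors). Below is a minimal principal component pursuit sketch via the inexact augmented Lagrange multiplier method, shown as a generic building block rather than the paper's RCICA solver; the step sizes follow common defaults.

```python
import numpy as np

def shrink(X, tau):
    """Soft-thresholding (proximal operator of the l1 norm)."""
    return np.sign(X) * np.maximum(np.abs(X) - tau, 0.0)

def rpca(M, lam=None, iters=200, tol=1e-7):
    """Decompose M into low-rank L and sparse S via inexact ALM."""
    m, n = M.shape
    lam = 1.0 / np.sqrt(max(m, n)) if lam is None else lam
    S, Y = np.zeros_like(M), np.zeros_like(M)
    mu = 1.25 / np.linalg.norm(M, 2)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(M - S + Y / mu, full_matrices=False)
        L = (U * shrink(s, 1.0 / mu)) @ Vt        # singular value thresholding
        S = shrink(M - L + Y / mu, lam / mu)      # isolate sparse gross errors
        R = M - L - S
        Y += mu * R
        mu *= 1.5
        if np.linalg.norm(R) / np.linalg.norm(M) < tol:
            break
    return L, S

rng = np.random.default_rng(4)
clean = rng.normal(size=(60, 3)) @ rng.normal(size=(3, 40))   # rank-3 data
M = clean.copy()
mask = rng.random(M.shape) < 0.05
M[mask] += 10 * rng.normal(size=mask.sum())       # non-Gaussian gross noise
L, S = rpca(M)
print("relative recovery error:",
      np.linalg.norm(L - clean) / np.linalg.norm(clean))
```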
IEEE Journal of Selected Topics in Signal Processing | 2017
Panagiotis Tzirakis; George Trigeorgis; Mihalis A. Nicolaou; Björn W. Schuller; Stefanos Zafeiriou
Automatic affect recognition is a challenging task due to the various modalities through which emotions can be expressed. Applications can be found in many domains, including multimedia retrieval and human–computer interaction. In recent years, deep neural networks have been used with great success in determining emotional states. Inspired by this success, we propose an emotion recognition system using the auditory and visual modalities. To capture the emotional content of various styles of speaking, robust features need to be extracted. To this end, we utilize a convolutional neural network (CNN) to extract features from the speech, while for the visual modality a deep residual network of 50 layers is used. Besides feature extraction, the learning algorithm also needs to be insensitive to outliers while being able to model the context; to tackle this, long short-term memory networks are utilized. The system is then trained in an end-to-end fashion where, by also taking advantage of the correlations between the streams, we manage to significantly outperform, in terms of concordance correlation coefficient, traditional approaches based on auditory and visual handcrafted features for the prediction of spontaneous and natural emotions on the RECOLA database of the AVEC 2016 research challenge on emotion recognition.
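Since performance here is measured with the concordance correlation coefficient (CCC), a minimal NumPy implementation of the metric may help; the toy prediction below is illustrative.

```python
import numpy as np

def ccc(x, y):
    """Concordance correlation coefficient between two 1-D traces."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

rng = np.random.default_rng(5)
truth = np.sin(np.linspace(0, 10, 200))
pred = 0.9 * truth + 0.05 + 0.1 * rng.normal(size=200)
print(f"CCC: {ccc(truth, pred):.3f}")
```

Unlike Pearson correlation, CCC also penalises scale and location shifts between prediction and ground truth, which is why it is frequently used (negated) as a training objective in this line of work.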
Computer Analysis of Human Behaviour | 2011
Hatice Gunes; Mihalis A. Nicolaou; Maja Pantic
Human affective behavior is multimodal, continuous and complex. Despite major advances within the affective computing research field, modeling, analyzing, interpreting and responding to human affective behavior still remains a challenge for automated systems, as affect and emotions are complex constructs with fuzzy boundaries and substantial individual differences in expression and experience [7]. Therefore, affective and behavioral computing researchers have recently invested increased effort in exploring how to best model, analyze and interpret the subtlety, complexity and continuity (represented along a continuum, e.g., from −1 to +1) of affective behavior in terms of latent dimensions (e.g., arousal, power and valence) and appraisals, rather than in terms of a small number of discrete emotion categories (e.g., happiness and sadness). This chapter aims to (i) give a brief overview of the existing efforts and the major accomplishments in modeling and analysis of emotional expressions in dimensional and continuous space while focusing on open issues and new challenges in the field, and (ii) introduce a representative approach for multimodal continuous analysis of affect from voice and face, and provide experimental results using the audiovisual Sensitive Artificial Listener (SAL) Database of natural interactions. The chapter concludes by posing a number of questions that highlight the significant issues in the field, and by extracting potential answers to these questions from the relevant literature.

The chapter is organized as follows. Section 10.2 describes theories of emotion. Section 10.3 provides details on the affect dimensions employed in the literature, as well as how emotions are perceived from the visual, audio and physiological modalities. Section 10.4 summarizes how current technology has been developed, in terms of data acquisition and annotation, and automatic analysis of affect in continuous space, bringing forth a number of issues that need to be taken into account when applying a dimensional approach to emotion recognition: determining the duration of emotions for automatic analysis, modeling the intensity of emotions, determining the baseline, dealing with high inter-subject expression variation, defining optimal strategies for fusion of multiple cues and modalities, and identifying appropriate machine learning techniques and evaluation measures. Section 10.5 presents our representative system, which fuses vocal and facial expression cues for dimensional and continuous prediction of emotions in valence and arousal space by employing bidirectional Long Short-Term Memory neural networks (BLSTM-NN), and introduces an output-associative fusion framework that incorporates correlations between the emotion dimensions to further improve continuous affect prediction. Section 10.6 concludes the chapter.
International Conference on Computer Vision | 2013
Lazaros Zafeiriou; Mihalis A. Nicolaou; Stefanos Zafeiriou; Symeon Nikitidis; Maja Pantic
A recently introduced latent feature learning technique for the analysis of time-varying dynamic phenomena is so-called Slow Feature Analysis (SFA). SFA is a deterministic component analysis technique for multi-dimensional sequences which, by minimizing the variance of a first-order approximation of the input signal's time derivative, finds uncorrelated projections that extract slowly varying features ordered by their temporal consistency and constancy. In this paper, we propose a number of extensions to both the deterministic and the probabilistic SFA optimization frameworks. In particular, we derive a novel deterministic SFA algorithm that is able to identify linear projections extracting the common slowest-varying features of two or more sequences. In addition, we propose an Expectation Maximization (EM) algorithm to perform inference in a probabilistic formulation of SFA, and similarly extend it to handle two or more time-varying data sequences. Moreover, we demonstrate that the probabilistic SFA (EMSFA) algorithm, which discovers the common slowest-varying latent space of multiple sequences, can be combined with dynamic time warping techniques for robust sequence time alignment. The proposed SFA algorithms were applied to facial behavior analysis, demonstrating their usefulness and appropriateness for this task.
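A minimal sketch of deterministic linear SFA for a single sequence (the multi-sequence and EM variants proposed in the paper are not reproduced): after whitening, the projections minimising the variance of the first-order temporal difference are the minor eigenvectors of the difference covariance. The toy mixed signals are assumptions for illustration.

```python
import numpy as np

def sfa(X, n_out=2):
    """Linear SFA: slowest-varying uncorrelated projections of X (T x D)."""
    X = X - X.mean(axis=0)
    # Whiten so the unit-variance/decorrelation constraints reduce
    # the problem to an ordinary eigendecomposition.
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    Z = X @ (Vt.T / s) * np.sqrt(len(X))
    dZ = np.diff(Z, axis=0)                     # first-order time derivative
    evals, evecs = np.linalg.eigh(dZ.T @ dZ / len(dZ))
    return Z @ evecs[:, :n_out], evals[:n_out]  # slowest features first

rng = np.random.default_rng(6)
t = np.linspace(0, 4 * np.pi, 500)
slow, fast = np.sin(t), np.sin(25 * t)
mix = np.column_stack([slow, fast]) @ rng.normal(size=(2, 2))  # mixed signals
Y, _ = sfa(mix, n_out=1)
print("corr with slow source:", abs(np.corrcoef(Y.ravel(), slow)[0, 1]))
```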
International Conference on Acoustics, Speech, and Signal Processing | 2014
Mihalis A. Nicolaou; Yannis Panagakis; Stefanos Zafeiriou; Maja Pantic
The problem of automatically estimating the interest level of a subject has been gaining attention among researchers, mostly due to the vast applicability of interest detection. In this work, we obtain a set of continuous interest annotations for the SEMAINE database, which we also analyse in terms of emotion dimensions such as valence and arousal. Most importantly, we propose a robust variant of Canonical Correlation Analysis (RCCA) for performing audio-visual fusion, which we apply to the prediction of interest. RCCA recovers a low-rank subspace which captures the correlations of the fused modalities, while isolating gross errors in the data without making any assumptions regarding Gaussianity. We experimentally show that RCCA is more appropriate than other standard fusion techniques (such as ℓ2-CCA and feature-level fusion), since it captures interactions between modalities while also decontaminating the obtained subspace from the errors that are dominant in real-world problems.
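For contrast with the proposed method, a sketch of the ℓ2-CCA fusion baseline mentioned above (RCCA's robust, gross-error-isolating variant is not reproduced here): audio and visual features are projected into a correlated subspace with sklearn's CCA, and the projections are concatenated for downstream prediction. All data, dimensions, and the ridge predictor are synthetic placeholders.

```python
import numpy as np
from sklearn.cross_decomposition import CCA
from sklearn.linear_model import Ridge

rng = np.random.default_rng(7)
T = 400
shared = rng.normal(size=(T, 3))                 # latent correlated content
audio = shared @ rng.normal(size=(3, 20)) + 0.3 * rng.normal(size=(T, 20))
video = shared @ rng.normal(size=(3, 30)) + 0.3 * rng.normal(size=(T, 30))
interest = shared @ rng.normal(size=3)           # toy continuous interest trace

cca = CCA(n_components=3).fit(audio, video)
A, V = cca.transform(audio, video)               # maximally correlated projections
fused = np.hstack([A, V])                        # correlated-subspace fusion

for name, feats in [("feature-level fusion", np.hstack([audio, video])),
                    ("CCA fusion         ", fused)]:
    pred = Ridge().fit(feats, interest).predict(feats)
    print(name, "corr:", np.corrcoef(pred, interest)[0, 1])
```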