Heysem Kaya
Namik Kemal University
Publications
Featured research published by Heysem Kaya.
international conference on acoustics, speech, and signal processing | 2014
Heysem Kaya; Florian Eyben; Albert Ali Salah; Björn W. Schuller
In this study we make use of Canonical Correlation Analysis (CCA) based feature selection for continuous depression recognition from speech. Besides its common use in multi-modal/multi-view feature extraction, CCA can easily be employed as a feature selector. We introduce several novel CCA based filter (ranking) methods, showing their relations to previous work. We test the suitability of the proposed methods on the AVEC 2013 dataset under the ACM MM 2013 Challenge protocol. Using 17% of the features, we obtain a relative improvement of 30% over the challenge's test-set baseline Root Mean Square Error.
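For illustration, a minimal sketch of a CCA based filter: features are ranked by the magnitude of their weight on the first canonical variate and the top fraction is kept. The data shapes, the placement of the 17% cut-off, and the use of scikit-learn's CCA are assumptions, not the authors' exact formulation.

```python
# Minimal sketch of a CCA-based filter for feature ranking (hypothetical data shapes).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 100))   # acoustic features: 200 samples x 100 dims (placeholder)
y = rng.standard_normal((200, 1))     # continuous depression score per sample (placeholder)

cca = CCA(n_components=1)
cca.fit(X, y)

# Rank features by the magnitude of their weight on the first canonical variate.
scores = np.abs(cca.x_weights_[:, 0])
ranking = np.argsort(scores)[::-1]
k = int(0.17 * X.shape[1])            # keep ~17% of the features, as in the abstract
selected = ranking[:k]
print("selected feature indices:", selected[:10])
```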
international conference on multimodal interfaces | 2015
Heysem Kaya; Furkan Gürpınar; Sadaf Afshar; Albert Ali Salah
This paper presents our contribution to the ACM ICMI 2015 Emotion Recognition in the Wild Challenge (EmotiW 2015). We participate in both the static facial expression (SFEW) and audio-visual emotion recognition challenges. In both challenges, we use a set of visual descriptors and their early and late fusion schemes. For AFEW, we also exploit a set of popular spatio-temporal modeling alternatives and carry out multi-modal fusion. For classification, we employ two least squares regression based learners that have been shown to be fast and accurate on former EmotiW Challenge corpora. Specifically, we use Partial Least Squares Regression (PLS) and Kernel Extreme Learning Machines (ELM), which is closely related to Kernel Regularized Least Squares. We use a General Procrustes Analysis (GPA) based alignment for face registration. By employing different alignments, descriptor types, video modeling strategies and classifiers, we diversify learners to improve the final fusion performance. Test set accuracies reached in both challenges are approximately 25% above the respective baselines in relative terms.
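As an illustration of one of the least squares learners named above, here is a hedged sketch of PLS regression used as a classifier with one-hot targets; the feature dimensions, class count and number of PLS components are placeholders, not the paper's configuration.

```python
# Hedged sketch: PLS regression as a classifier via one-hot target regression.
import numpy as np
from sklearn.cross_decomposition import PLSRegression

rng = np.random.default_rng(1)
X_train = rng.standard_normal((300, 256))      # video descriptors (placeholder)
y_train = rng.integers(0, 7, size=300)         # 7 emotion classes (placeholder)
T = np.eye(7)[y_train]                         # one-hot target matrix

pls = PLSRegression(n_components=20)
pls.fit(X_train, T)

X_test = rng.standard_normal((50, 256))
pred = np.argmax(pls.predict(X_test), axis=1)  # predicted class = arg max of regressed scores
```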
computer vision and pattern recognition | 2016
Furkan Gürpınar; Heysem Kaya; Hamdi Dibeklioglu; Albert Ali Salah
We propose a two-level system for apparent age estimation from facial images. Our system first classifies samples into overlapping age groups. Within each group, the apparent age is estimated with local regressors, whose outputs are then fused for the final estimate. We use a deformable parts model based face detector, and features from a pretrained deep convolutional network. Kernel extreme learning machines are used for classification. We evaluate our system on the ChaLearn Looking at People 2016 - Apparent Age Estimation challenge dataset, and report a normalized error score of 0.3740 on the sequestered test set.
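A simplified sketch of the two-level idea follows: classify into overlapping age groups, regress within each group, and fuse the local estimates by group probabilities. The group boundaries and the logistic regression / ridge models are stand-ins for illustration, not the authors' configuration (which uses kernel extreme learning machines).

```python
# Illustrative two-level age estimation: group classification, local regression, probabilistic fusion.
import numpy as np
from sklearn.linear_model import LogisticRegression, Ridge

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 128))                 # deep face features (placeholder)
age = rng.uniform(0, 80, size=500)                  # apparent age labels (placeholder)

groups = [(0, 30), (20, 50), (40, 80)]              # overlapping age ranges (hypothetical)
labels = np.array([min(i for i, (lo, hi) in enumerate(groups) if lo <= a <= hi) for a in age])

clf = LogisticRegression(max_iter=1000).fit(X, labels)          # level 1: group classifier
regs = [Ridge(alpha=1.0).fit(X[(age >= lo) & (age <= hi)],      # level 2: per-group regressors
                             age[(age >= lo) & (age <= hi)])
        for lo, hi in groups]

X_new = rng.standard_normal((5, 128))
proba = clf.predict_proba(X_new)                    # group membership probabilities
local = np.column_stack([r.predict(X_new) for r in regs])
estimate = (proba * local).sum(axis=1)              # probability-weighted fusion of local estimates
```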
Proceedings of the 4th International Workshop on Audio/Visual Emotion Challenge | 2014
Heysem Kaya; Fazilet Çilli; Albert Ali Salah
This paper presents our work on the ACM MM Audio Visual Emotion Corpus 2014 (AVEC 2014) using the baseline features in accordance with the challenge protocol. For prediction, we use Canonical Correlation Analysis (CCA) in the affect sub-challenge (ASC) and the Moore-Penrose generalized inverse (MPGI) in the depression sub-challenge (DSC). The video baseline provides histograms of Local Gabor Binary Patterns from Three Orthogonal Planes (LGBP-TOP) features. Based on our preliminary experiments on the AVEC 2013 challenge data, we focus on the inner facial regions that correspond to the eyes and mouth area. We obtain an ensemble of regional linear regressors via CCA and MPGI. We also enrich the 2014 baseline set with Local Phase Quantization (LPQ) features extracted from faces detected and tracked with the IntraFace toolkit. Combining both representations in a CCA ensemble approach, on the challenge test set we reach an average Pearson's Correlation Coefficient (PCC) of 0.3932, outperforming the ASC test set baseline PCC of 0.1966. On the DSC, combining modality-specific MPGI based ensemble systems, we reach a Root Mean Square Error (RMSE) of 9.61.
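A minimal sketch of regression via the Moore-Penrose generalized inverse (MPGI) with an appended bias term; the feature dimensions and targets are illustrative only, not the regional ensemble used in the paper.

```python
# Minimal sketch: least-squares regression weights via the Moore-Penrose pseudo-inverse.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((400, 96))             # regional video features (placeholder)
Y = rng.standard_normal((400, 1))              # depression score targets (placeholder)

Xa = np.hstack([X, np.ones((X.shape[0], 1))])  # append a bias column
W = np.linalg.pinv(Xa) @ Y                     # MPGI solution of the least-squares problem

X_test = rng.standard_normal((10, 96))
pred = np.hstack([X_test, np.ones((10, 1))]) @ W
```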
Image and Vision Computing | 2017
Heysem Kaya; Furkan Gürpınar; Albert Ali Salah
Multimodal recognition of affective states is a difficult problem, unless the recording conditions are carefully controlled. For recognition “in the wild”, large variances in face pose and illumination, cluttered backgrounds, occlusions, audio and video noise, and subtle expression cues are some of the issues to address. In this paper, we describe a multimodal approach for video-based emotion recognition in the wild. We propose using summarizing functionals of complementary visual descriptors for video modeling. These features include deep convolutional neural network (CNN) based features obtained via transfer learning, for which we illustrate the importance of flexible registration and fine-tuning. Our approach combines audio and visual features with least squares regression based classifiers and weighted score level fusion. We report state-of-the-art results on the EmotiW Challenge for “in the wild” facial expression recognition. Our approach scales to other problems, and achieved the top ranking in the ChaLearn-LAP First Impressions Challenge 2016 on video clips collected in the wild.
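A hedged sketch of weighted score-level fusion: per-modality class scores are normalized per sample and combined with modality weights (e.g. tuned on a validation set). The normalization scheme and the weights below are assumptions for illustration, not the exact scheme from the paper.

```python
# Hedged sketch of weighted score-level fusion across modalities.
import numpy as np

def fuse(scores_per_modality, weights):
    """Combine per-modality class-score matrices with modality weights."""
    fused = np.zeros_like(scores_per_modality[0])
    for s, w in zip(scores_per_modality, weights):
        s = s - s.min(axis=1, keepdims=True)             # shift scores to non-negative per sample
        s = s / (s.sum(axis=1, keepdims=True) + 1e-12)   # sum-normalize per sample
        fused += w * s
    return fused.argmax(axis=1)

rng = np.random.default_rng(4)
audio = rng.standard_normal((20, 7))     # audio classifier scores, 7 emotions (placeholder)
video = rng.standard_normal((20, 7))     # video classifier scores (placeholder)
labels = fuse([audio, video], weights=[0.4, 0.6])   # weights assumed tuned on validation data
```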
international conference on speech and computer | 2015
Elena E. Lyakso; Olga V. Frolova; Evgeniya Dmitrieva; Aleksei Grigorev; Heysem Kaya; Albert Ali Salah; Alexey Karpov
We present the first child emotional speech corpus in Russian, called “EmoChildRu”, which contains audio materials from 3–7 year old children. The database includes over 20,000 recordings (approx. 30 hours), collected from 100 children. Recordings were carried out in three controlled settings, created to elicit different emotional states in the children: playing with a standard set of toys; repetition of words from a toy parrot in a game-store setting; and watching a cartoon and retelling the story. This corpus is designed to study how the emotional state is reflected in the characteristics of voice and speech, and to study the formation of emotional states in ontogenesis. A portion of the corpus is annotated for three emotional states (discomfort, neutral, comfort). Additional data include brain activity measurements (original EEG, evoked potentials records), the results of adult listeners' analysis of child speech, questionnaires, and descriptions of dialogues. The paper reports two child emotional speech analysis experiments on the corpus: by adult listeners (humans) and by an automatic classifier (machine), respectively. Automatic classification results are very similar to human perception, although the accuracy is below 55% for both, showing the difficulty of child emotion recognition from speech under naturalistic conditions.
IEEE Signal Processing Letters | 2015
Heysem Kaya; Tuğçe Özkaptan; Albert Ali Salah; Fikret S. Gürgen
Computational paralinguistics deals with the meaning underlying verbal messages, which is of interest in manifold applications ranging from intelligent tutoring systems to affect-sensitive robots. The state-of-the-art pipeline of paralinguistic speech analysis relies on brute-force feature extraction, and the features need to be tailored to the task at hand. In this work, we extend a recent discriminative projection based feature selection method, using the power of stochasticity to overcome local minima and to reduce computational complexity. The proposed approach assigns weights both to feature groups and to individual features in many randomly selected contexts, and then combines them for a final ranking. The efficacy of the proposed method is demonstrated on a recent paralinguistic challenge corpus for detecting the level of conflict in dyadic and group conversations. We advance the state of the art on this corpus using the INTERSPEECH 2013 Challenge protocol.
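A simplified sketch of the randomized ranking idea: fit a discriminative linear model on many randomly sampled feature subsets ("contexts"), accumulate absolute weights per feature, and rank by the combined score. The ridge classifier, subset size and iteration count are placeholders, and the group-level weighting described above is omitted here.

```python
# Simplified sketch: randomized-context feature ranking with a discriminative linear model.
import numpy as np
from sklearn.linear_model import RidgeClassifier

rng = np.random.default_rng(5)
X = rng.standard_normal((300, 500))          # acoustic functionals (placeholder)
y = rng.integers(0, 2, size=300)             # conflict vs. no-conflict labels (placeholder)

scores = np.zeros(X.shape[1])
counts = np.zeros(X.shape[1])
for _ in range(100):                         # 100 random contexts (placeholder count)
    idx = rng.choice(X.shape[1], size=50, replace=False)   # random feature subset
    clf = RidgeClassifier(alpha=1.0).fit(X[:, idx], y)
    scores[idx] += np.abs(clf.coef_).ravel() # accumulate absolute discriminative weights
    counts[idx] += 1

ranking = np.argsort(scores / np.maximum(counts, 1))[::-1]  # final feature ranking
```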
international conference on multimodal interfaces | 2014
Heysem Kaya; Albert Ali Salah
This paper presents our contribution to the ACM ICMI 2014 Emotion Recognition in the Wild Challenge and Workshop. The proposed system utilizes Extreme Learning Machines (ELM) for modeling modality-specific features and combines the scores for the final prediction. State-of-the-art results in acoustic and visual emotion recognition are obtained using either deep Neural Networks (DNN) or Support Vector Machines (SVM). The ELM paradigm is proposed as a fast and accurate alternative to these two popular machine learning methods. Benefiting from the fast learning advantage of ELM, we carry out extensive tests on the data using moderate computational resources. In the video modality, we test combinations of regional visual features obtained from the inner face. In the audio modality, we carry out tests to enhance training with other emotional corpora. We further investigate the suitability of several recently proposed feature selection approaches for pruning the acoustic features. In our study, the best results for both modalities are obtained with Kernel ELM rather than basic ELM. On the challenge test set, we obtain classification accuracies of 37.84%, 39.07% and 44.23% for audio, video and multimodal fusion, respectively.
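A minimal sketch of a kernel ELM classifier, which takes the same closed form as kernel regularized least squares: beta = (K + I/C)^-1 T with one-hot targets T. The RBF kernel, its bandwidth and the data are assumptions for illustration.

```python
# Minimal sketch of a kernel ELM / kernel regularized least-squares classifier.
import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(6)
X_train = rng.standard_normal((200, 88))     # acoustic features (placeholder)
y_train = rng.integers(0, 7, size=200)       # 7 emotion classes (placeholder)
T = np.eye(7)[y_train]                       # one-hot target matrix

C, gamma = 1.0, 0.01                         # regularization and RBF bandwidth (placeholders)
K = rbf_kernel(X_train, X_train, gamma=gamma)
beta = np.linalg.solve(K + np.eye(len(K)) / C, T)   # closed-form output weights

X_test = rng.standard_normal((30, 88))
pred = np.argmax(rbf_kernel(X_test, X_train, gamma=gamma) @ beta, axis=1)
```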
acm multimedia | 2014
Heysem Kaya; Albert Ali Salah
This paper presents our work on the ACM MM Audio Visual Emotion Corpus 2013 (AVEC 2013) depression recognition sub-challenge using the baseline features in accordance with the challenge protocol. We use Canonical Correlation Analysis for audio-visual fusion as well as for covariate extraction for the target task. The video baseline provides histograms of local phase quantization features extracted from 4x4=16 regions of the detected face. We summarize the video features over segments of length 20 seconds using mode and range functionals. We observe that range functional features, which measure the variance tendency, provide statistically significantly higher canonical correlation than mode functional features, which measure the mean tendency. Moreover, when audio-visual features are used with a varying number of covariates per region, the regions consistently found to be the best are those corresponding to the two eyes and the right part of the mouth.
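An illustrative sketch of summarizing frame-level video features over fixed-length segments with mode and range functionals; the segment length in frames and the feature dimensionality are placeholders, not the paper's settings.

```python
# Illustrative sketch: segment-level mode and range functionals over frame features.
import numpy as np
from scipy import stats

def summarize(frames, seg_len=500):          # e.g. 500 frames ~ 20 s at 25 fps (assumption)
    segments = []
    for start in range(0, len(frames), seg_len):
        seg = frames[start:start + seg_len]
        mode = np.ravel(stats.mode(seg, axis=0).mode)   # mean-tendency functional
        rng_ = seg.max(axis=0) - seg.min(axis=0)        # variance-tendency (range) functional
        segments.append(np.concatenate([mode, rng_]))
    return np.vstack(segments)

frames = np.random.default_rng(7).integers(0, 10, size=(2000, 64)).astype(float)  # placeholder
seg_features = summarize(frames)             # one summarized feature vector per segment
```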
conference of the international speech communication association | 2016
Heysem Kaya; Alexey Karpov
The field of Computational Paralinguistics is rapidly growing and is of interest in various application domains ranging from biomedical engineering to forensics. The INTERSPEECH ComParE challenge series has a field-leading role, introducing novel problems with a common benchmark protocol for comparability. In this work, we tackle all three ComParE 2016 Challenge corpora (Native Language, Sincerity and Deception), benefiting from multi-level normalization of features followed by fast and robust kernel learning methods. Moreover, we employ computer vision inspired low-level descriptor representation methods such as Fisher vector encoding. After nonlinear preprocessing, the obtained Fisher vectors are kernelized and mapped to target variables by classifiers based on Kernel Extreme Learning Machines and Partial Least Squares regression. We finally combine the predictions of models trained on the widely used functional-based descriptor encoding (openSMILE features) with those obtained from the Fisher vector encoding. In preliminary experiments, our approach significantly outperformed the baseline systems for the Native Language and Sincerity sub-challenges on both the development and test sets.
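A simplified Fisher vector sketch (gradients with respect to the GMM means only), followed by power and L2 normalization; the component count and descriptors are placeholders, and a full encoding typically also includes weight and variance gradient terms.

```python
# Simplified Fisher vector encoding (mean-gradient terms only) with power and L2 normalization.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(8)
frames = rng.standard_normal((1000, 40))     # low-level descriptors, e.g. frame features (placeholder)
gmm = GaussianMixture(n_components=8, covariance_type='diag', random_state=0).fit(frames)

def fisher_vector(X, gmm):
    gamma = gmm.predict_proba(X)             # posteriors, shape (T, K)
    T = X.shape[0]
    parts = []
    for k in range(gmm.n_components):
        diff = (X - gmm.means_[k]) / np.sqrt(gmm.covariances_[k])
        g_mu = (gamma[:, k:k + 1] * diff).sum(axis=0) / (T * np.sqrt(gmm.weights_[k]))
        parts.append(g_mu)
    fv = np.concatenate(parts)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))   # power (signed square-root) normalization
    return fv / (np.linalg.norm(fv) + 1e-12) # L2 normalization

fv = fisher_vector(rng.standard_normal((200, 40)), gmm)  # one utterance -> one Fisher vector
```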