A. Tanju Erdem
Eastman Kodak Company
Publication
Featured research published by A. Tanju Erdem.
Speech Communication | 2011
Elif Bozkurt; Engin Erzin; Çigˇdem Erogˇlu Erdem; A. Tanju Erdem
In this paper, we propose novel spectrally weighted mel-frequency cepstral coefficient (WMFCC) features for emotion recognition from speech. The idea is based on the fact that formant locations carry emotion-related information, and therefore critical spectral bands around formant locations can be emphasized during the calculation of MFCC features. The spectral weighting is derived from the normalized inverse harmonic mean function of the line spectral frequency (LSF) features, which are known to be localized around formant frequencies. The above approach can be considered an early data fusion of spectral content and formant location information. We also investigate methods for late decision fusion of unimodal classifiers. We evaluate the proposed WMFCC features together with the standard spectral and prosody features using HMM-based classifiers on the spontaneous FAU Aibo emotional speech corpus. The results show that unimodal classifiers with the WMFCC features perform significantly better than the classifiers with standard spectral features. Late decision fusion of classifiers provides further significant performance improvements.
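To make the weighting idea concrete, the following is a minimal sketch of a spectrally weighted MFCC computation, assuming the frame's LSF values are already available (e.g., from an LPC analysis step) and assuming a simple inverse-distance emphasis around the LSFs as a stand-in for the paper's exact normalized inverse harmonic mean function:

```python
import numpy as np
from scipy.fft import dct
import librosa

def weighted_mfcc(frame, lsf, sr=16000, n_fft=512, n_mels=26, n_ceps=13):
    """Illustrative spectrally weighted MFCCs for one speech frame.

    `lsf` holds the frame's line spectral frequencies in radians (0..pi);
    the weighting below is an assumed stand-in for the paper's formula.
    """
    spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)), n_fft)) ** 2
    omega = np.linspace(0.0, np.pi, len(spec))            # bin frequencies in radians

    # Emphasize bins close to LSFs (which cluster around formants).
    dists = np.abs(omega[:, None] - lsf[None, :]) + 1e-3  # (n_bins, n_lsf)
    weights = np.sum(1.0 / dists, axis=1)
    weights /= weights.mean()                             # normalize to unit mean

    mel_fb = librosa.filters.mel(sr=sr, n_fft=n_fft, n_mels=n_mels)
    mel_energies = mel_fb @ (weights * spec)
    return dct(np.log(mel_energies + 1e-10), norm='ortho')[:n_ceps]
```

Setting all weights to one recovers standard MFCCs, so the weighting can be switched off for comparison experiments.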
IEEE Transactions on Circuits and Systems for Video Technology | 2000
Candemir Toklu; A. Murat Tekalp; A. Tanju Erdem
We describe a semi-automatic approach for segmenting a video sequence into spatio-temporal video objects in the presence of occlusion. The motion and shape of each video object are represented by a 2-D mesh. Assuming that the boundary of an object of interest is interactively marked on some keyframes, the proposed method finds the boundary of the object in all other frames automatically by tracking the 2-D mesh representation of the object in both forward and backward directions. A key contribution of the proposed method is the automatic detection of covered and uncovered regions at each frame, and the assignment of pixels in the uncovered regions to the object or background based on color and motion similarity. Experimental results are presented on two MPEG-4 test sequences and the resulting segmentations are evaluated both visually and quantitatively.
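As a toy illustration of the assignment step, the sketch below labels uncovered pixels by nearest mean color, using the previous frame's object mask as the source of the color statistics; the actual method also exploits motion similarity and the mesh geometry, which are omitted here.

```python
import numpy as np

def assign_uncovered_pixels(frame, uncovered_mask, prev_object_mask):
    """Assign pixels in newly uncovered regions to object or background
    by nearest mean color (a simplification of the color/motion test)."""
    obj_mean = frame[prev_object_mask].mean(axis=0)   # mean RGB over the object
    bg_mean = frame[~prev_object_mask].mean(axis=0)   # mean RGB over the background

    pixels = frame[uncovered_mask].astype(float)
    d_obj = np.linalg.norm(pixels - obj_mean, axis=1)
    d_bg = np.linalg.norm(pixels - bg_mean, axis=1)

    assignment = np.zeros(frame.shape[:2], dtype=bool)
    assignment[uncovered_mask] = d_obj < d_bg          # True -> assigned to the object
    return assignment
```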
Proceedings of the 3rd international workshop on Affective interaction in natural environments | 2010
Cigdem Eroglu Erdem; Elif Bozkurt; Engin Erzin; A. Tanju Erdem
Training datasets containing spontaneous emotional expressions are often imperfect due to the ambiguities and difficulties of labeling such data by human observers. In this paper, we present a Random Sampling Consensus (RANSAC) based training approach for the problem of emotion recognition from spontaneous speech recordings. Our motivation is to insert a data cleaning process into the training phase of the Hidden Markov Models (HMMs) for the purpose of removing some suspicious instances of labels that may exist in the training dataset. Our experiments using HMMs with various numbers of states and Gaussian mixtures per state indicate that utilization of RANSAC in the training phase provides an improvement of up to 2.84% in the unweighted recall rates on the test set. This improvement in the accuracy of the classifier is shown to be statistically significant using McNemar's test.
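The data-cleaning idea can be sketched against a generic train/score interface, independent of the paper's HMM implementation; the subset size, inlier threshold, and iteration count below are illustrative placeholders.

```python
import numpy as np

def ransac_clean(features, labels, train_fn, score_fn,
                 n_iter=50, subset_frac=0.5, inlier_thresh=0.0, seed=None):
    """Select a consensus subset of training instances before the final fit.

    train_fn(features, labels) -> model
    score_fn(model, x, y)      -> how well instance (x, y) agrees with its label
    """
    rng = np.random.default_rng(seed)
    n, best_inliers, best_count = len(labels), None, -1

    for _ in range(n_iter):
        subset = rng.choice(n, size=int(subset_frac * n), replace=False)
        model = train_fn([features[i] for i in subset], [labels[i] for i in subset])
        scores = np.array([score_fn(model, x, y) for x, y in zip(features, labels)])
        inliers = scores > inlier_thresh        # instances consistent with this model
        if inliers.sum() > best_count:
            best_count, best_inliers = int(inliers.sum()), inliers

    # Retrain on the consensus set only: suspiciously labeled instances are dropped.
    keep = np.flatnonzero(best_inliers)
    return train_fn([features[i] for i in keep], [labels[i] for i in keep]), keep
```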
Signal Processing: Image Communication | 1995
A. Tanju Erdem; M. Ibrahim Sezan
We address the problem of compressing 10-bit-per-pixel video using the tools of the emerging MPEG-2 standard, which is primarily targeted at 8-bit-per-pixel video. We show that an amplitude scalable compression scheme for 10-bit video can be developed using the MPEG-2 syntax and tools. We experimentally evaluate the performance of the scalable approach and compare it with the straightforward non-scalable approach, where the 10-bit input is rounded to 8 bits and usual 8-bit MPEG-2 compression is applied. In addition to a general performance evaluation of the scalable and non-scalable approaches, we also evaluate their multi-generation characteristics, where the input video undergoes successive compression-decompression cycles. We show that it is possible to quantitatively analyze the multi-generation characteristics of the non-scalable approach using the theory of generalized projections.
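The amplitude-scalable layering can be illustrated numerically, assuming the base layer is the 10-bit signal rounded to 8 bits and the enhancement layer carries the small signed remainder (in the actual scheme both layers are encoded with MPEG-2 tools):

```python
import numpy as np

def split_10bit(video10):
    """Split 10-bit samples into an 8-bit base layer and a small residual layer."""
    video10 = video10.astype(np.uint16)
    base8 = np.clip((video10 + 2) >> 2, 0, 255).astype(np.uint8)      # round to 8 bits
    residual = video10.astype(np.int16) - (base8.astype(np.int16) << 2)  # small signed remainder
    return base8, residual

def reconstruct_10bit(base8, residual):
    """Recombine decoded base and enhancement layers into a 10-bit signal."""
    return (base8.astype(np.int16) << 2) + residual
```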
international conference on pattern recognition | 2010
Elif Bozkurt; Engin Erzin; Cigdem Eroglu Erdem; A. Tanju Erdem
We propose the use of the line spectral frequency (LSF) features for emotion recognition from speech, which, to the best of our knowledge, have not been previously employed for emotion recognition. Spectral features such as mel-scaled cepstral coefficients have already been successfully used for the parameterization of speech signals for emotion recognition. The LSF features also offer a spectral representation for speech; moreover, they carry intrinsic information on the formant structure, which is related to the emotional state of the speaker [4]. We use the Gaussian mixture model (GMM) classifier architecture, which captures the static color of the spectral features. Experimental studies performed over the Berlin Emotional Speech Database and the FAU Aibo Emotion Corpus demonstrate that decision fusion configurations with LSF features bring a consistent improvement over the MFCC-based emotion classification rates.
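A condensed sketch of such a classifier bank with decision fusion, using scikit-learn Gaussian mixtures as stand-ins for the paper's GMMs; the feature extraction and the fusion weight alpha are assumptions made for illustration.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def train_gmms(features_per_class, n_components=8):
    """Fit one GMM per emotion class on frame-level features (n_frames x dim)."""
    return {label: GaussianMixture(n_components=n_components).fit(feats)
            for label, feats in features_per_class.items()}

def fused_decision(lsf_gmms, mfcc_gmms, lsf_frames, mfcc_frames, alpha=0.5):
    """Decision fusion: combine per-class log-likelihoods of the two feature streams."""
    labels = list(lsf_gmms)
    scores = [alpha * lsf_gmms[l].score(lsf_frames) +
              (1 - alpha) * mfcc_gmms[l].score(mfcc_frames)
              for l in labels]
    return labels[int(np.argmax(scores))]
```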
international conference on acoustics, speech, and signal processing | 1997
Candemir Toklu; A. Tanju Erdem; A. Murat Tekalp
This paper addresses 2-D mesh-based object tracking and mesh-based object mosaic construction for synthetic transfiguration of deformable video objects with deformable boundaries in the presence of another occluding object and/or self-occlusion. In particular, we update the 2-D triangular mesh model of a video object incrementally to account for the newly uncovered parts of the object as they are detected during the tracking process. Then, the minimum number of reference views (still images of a replacement object) needed to perform the synthetic transfiguration (object replacement and animation) is determined (depending on the complexity of the motion of the object-to-be-replaced), and the transfiguration of the replacement object is accomplished by 2-D mesh-based texture mapping in between these reference views. The proposed method is demonstrated by replacing an orange juice bottle by a cranberry juice bottle in a real video clip.
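The texture-mapping step can be approximated with a piecewise-affine warp, assuming the tracked mesh vertices of the object to be replaced and the corresponding vertices on a reference view of the replacement object are available; this is only an illustrative stand-in for the paper's 2-D mesh-based mapping.

```python
import numpy as np
from skimage.transform import PiecewiseAffineTransform, warp

def map_replacement(reference_image, ref_vertices, tracked_vertices, out_shape):
    """Warp a reference view of the replacement object onto the tracked mesh.

    ref_vertices, tracked_vertices: (n, 2) arrays of (x, y) mesh vertex positions.
    """
    tform = PiecewiseAffineTransform()
    # warp() treats the transform as a map from output coordinates to
    # input-image coordinates, so estimate it from the tracked (output)
    # positions to the reference (input) positions.
    tform.estimate(tracked_vertices, ref_vertices)
    return warp(reference_image, tform, output_shape=out_shape)
```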
IEEE Transactions on Image Processing | 2000
Candemir Toklu; A. Tanju Erdem; A. Murat Tekalp
We present a two-dimensional (2-D) mesh-based mosaic representation, consisting of an object mesh and a mosaic mesh for each frame and a final mosaic image, for video objects with mildly deformable motion in the presence of self and/or object-to-object (external) occlusion. Unlike classical mosaic representations, where successive frames are registered using global motion models, we map the uncovered regions in the successive frames onto the mosaic reference frame using local affine models, i.e., those of the neighboring mesh patches. The proposed method to compute this mosaic representation is tightly coupled with an occlusion-adaptive 2-D mesh tracking procedure, which consists of propagating the object mesh from frame to frame and updating both the object and mosaic meshes to optimize texture mapping from the mosaic to each instance of the object. The proposed representation has been applied to video object rendering and editing, including self transfiguration, synthetic transfiguration, and 2-D augmented reality in the presence of self and/or external occlusion. We also provide an algorithm to determine the minimum number of still views needed to reconstruct a replacement mosaic, which is needed for synthetic transfiguration. Experimental results are provided to demonstrate both the 2-D mesh-based mosaic synthesis and two different video object editing applications on real video sequences.
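Each local affine model mentioned above is determined by the three vertex correspondences of its mesh patch; a minimal sketch of that per-triangle estimation (the correspondences are assumed to come from the mesh tracker):

```python
import numpy as np

def triangle_affine(src_tri, dst_tri):
    """Solve for the 2x3 affine transform mapping one mesh triangle onto another.

    src_tri, dst_tri: (3, 2) arrays with the (x, y) positions of the three vertices.
    Each vertex pair gives two linear equations in the six affine parameters.
    """
    A = np.hstack([src_tri, np.ones((3, 1))])   # (3, 3): one [x, y, 1] row per vertex
    params = np.linalg.solve(A, dst_tri)        # (3, 2): affine parameters by column
    return params.T                             # 2x3 matrix; apply as M @ [x, y, 1]
```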
Journal on Multimodal User Interfaces | 2008
F. Ofli; Y. Demir; Yücel Yemez; Engin Erzin; A. Murat Tekalp; Koray Balci; İdil Kızoğlu; Lale Akarun; Cristian Canton-Ferrer; Joëlle Tilmanne; Elif Bozkurt; A. Tanju Erdem
We present a framework for training and synthesis of an audio-driven dancing avatar. The avatar is trained for a given musical genre using the multicamera video recordings of a dance performance. The video is analyzed to capture the time-varying posture of the dancer’s body whereas the musical audio signal is processed to extract the beat information. We consider two different marker-based schemes for the motion capture problem. The first scheme uses 3D joint positions to represent the body motion whereas the second uses joint angles. Body movements of the dancer are characterized by a set of recurring semantic motion patterns, i.e., dance figures. Each dance figure is modeled in a supervised manner with a set of HMM (Hidden Markov Model) structures and the associated beat frequency. In the synthesis phase, an audio signal of unknown musical type is first classified, within a time interval, into one of the genres that have been learnt in the analysis phase, based on mel frequency cepstral coefficients (MFCC). The motion parameters of the corresponding dance figures are then synthesized via the trained HMM structures in synchrony with the audio signal based on the estimated tempo information. Finally, the generated motion parameters, either the joint angles or the 3D joint positions of the body, are animated along with the musical audio using two different animation tools that we have developed. Experimental results demonstrate the effectiveness of the proposed framework.
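A small sketch of the audio-analysis side (tempo/beat estimation and MFCC extraction) using librosa; the genre classifier, the HMM-based figure synthesis, and the animation tools are beyond this snippet.

```python
import librosa

def analyse_audio(path, sr=16000):
    """Extract the tempo and frame-level MFCCs used to drive genre
    classification and beat-synchronous figure synthesis."""
    y, sr = librosa.load(path, sr=sr)
    tempo, beat_frames = librosa.beat.beat_track(y=y, sr=sr)   # estimated BPM and beat positions
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)          # (13, n_frames)
    return tempo, librosa.frames_to_time(beat_frames, sr=sr), mfcc
```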
visual communications and image processing | 1998
A. Tanju Erdem; Cigdem Eroglu
The effect of image stabilization on the performance of the MPEG-2 video coding algorithm is investigated. It is shown that image stabilization prior to MPEG-2 coding of an unsteady image sequence increases the quality of the compressed video considerably. The quality improvement is explained by the fact that an actual zero motion vector is favored over a zero differential motion vector in the MPEG-2 video coding scheme. The bits saved in coding the motion information in P frames are then utilized in the coding of the DCT data in I frames. The temporal prediction of the macroblocks in P and B frames is also improved because of the increased quality of the compressed I frames and because an image stabilization algorithm can compensate for displacement with better than 1/2 pixel accuracy.
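A rough sketch of translational stabilization applied before encoding, using phase correlation for the global-motion estimate; the stabilizer in the paper compensates displacement to better than half-pixel accuracy, which this sketch only approximates via upsampled correlation.

```python
import numpy as np
from scipy.ndimage import shift as nd_shift
from skimage.registration import phase_cross_correlation

def stabilize(frames):
    """Remove global translational jitter from a grayscale frame sequence
    before handing it to the video encoder."""
    stabilized = [frames[0]]
    cumulative = np.zeros(2)
    for prev, cur in zip(frames, frames[1:]):
        # Subpixel (row, col) displacement of `cur` relative to `prev`.
        disp, _, _ = phase_cross_correlation(prev, cur, upsample_factor=10)
        cumulative += disp
        stabilized.append(nd_shift(cur, cumulative))   # counteract accumulated motion
    return stabilized
```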
visual communications and image processing | 1991
Mehmet K. Ozkan; M. Ibrahim Sezan; A. Tanju Erdem; A. Murat Tekalp
In this paper we propose a computationally efficient multiframe Wiener filtering algorithm, called cross-correlated multiframe (CCMF) Wiener filtering, for restoring image sequences that are degraded by both blur and noise. The CCMF approach accounts for both intraframe (spatial) and interframe (temporal) correlations by directly utilizing the power and cross-power spectra of the frames. We propose an efficient implementation of the CCMF filter which requires the inversion of only N × N matrices, where N is the number of frames used in the restoration. Furthermore, it is shown that if the auto- and cross-power spectra are estimated based on a three-dimensional (3-D) multiframe autoregressive (AR) model, no matrix inversion is required. We present restoration results using the proposed approach, and compare them with those obtained by restoring each frame independently using a single-frame Wiener filter. In addition, we provide the results of an extensive study on the performance and robustness of the proposed algorithm in the case of varying blur, noise, and power and cross-power spectra estimation methods using different image sequences.
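The per-frequency structure of a multiframe Wiener filter can be sketched as follows, assuming (for simplicity) a common blur transfer function and equal white-noise variance in all N frames; the actual CCMF estimator and its AR-model-based variant differ in detail.

```python
import numpy as np

def multiframe_wiener(degraded, H, S_ff, noise_var):
    """Simplified multiframe Wiener restoration in the DFT domain.

    degraded  : (N, rows, cols) blurred, noisy frames
    H         : (rows, cols) common blur transfer function (DFT of the PSF)
    S_ff      : (rows, cols, N, N) cross-power spectral matrix of the clean frames
    noise_var : white-noise variance, assumed equal in all frames
    """
    N = degraded.shape[0]
    G = np.fft.fft2(degraded)                              # DFT of each frame
    G = np.moveaxis(G, 0, -1)[..., None]                   # (rows, cols, N, 1)

    Hc = H[..., None, None]                                # broadcast blur over N x N blocks
    A = (np.abs(Hc) ** 2) * S_ff + noise_var * np.eye(N)   # (rows, cols, N, N)
    F_hat = S_ff @ (np.conj(Hc) * np.linalg.solve(A, G))   # Wiener estimate per frequency
    return np.fft.ifft2(np.moveaxis(F_hat[..., 0], -1, 0)).real
```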