Publication


Featured research published by Engin Erzin.


IEEE Transactions on Multimedia | 2007

Audiovisual Synchronization and Fusion Using Canonical Correlation Analysis

M. E. Sargin; Yücel Yemez; Engin Erzin; A.M. Tekalp

It is well known that early integration (also called data fusion) is effective when the modalities are correlated, and late integration (also called decision or opinion fusion) is optimal when the modalities are uncorrelated. In this paper, we propose a new multimodal fusion strategy for open-set speaker identification using a combination of early and late integration following canonical correlation analysis (CCA) of speech and lip texture features. We also propose a method for high-precision synchronization of the speech and lip features using CCA prior to the proposed fusion. Experimental results show that i) the proposed fusion strategy yields the best equal error rates (EER), which are used to quantify the performance of the fusion strategy for open-set speaker identification, and ii) precise synchronization prior to fusion improves the EER; hence, the best EER is obtained when the proposed synchronization scheme is employed together with the proposed fusion strategy. We note that the proposed fusion strategy outperforms the others because the features used in the late integration are truly uncorrelated, since they are outputs of the CCA.
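As a rough illustration of the core idea, the sketch below projects two feature streams onto their maximally correlated subspaces with scikit-learn's CCA; the feature names, dimensions, and random data are placeholders, not the paper's actual pipeline.

# Minimal sketch of CCA-based audio-visual feature fusion (illustrative,
# not the paper's exact method). Assumes time-aligned frame-level
# features: `speech` (n_frames x d_a) and `lip` (n_frames x d_v).
import numpy as np
from sklearn.cross_decomposition import CCA

rng = np.random.default_rng(0)
speech = rng.standard_normal((500, 39))   # e.g., MFCCs with deltas (placeholder)
lip = rng.standard_normal((500, 20))      # e.g., lip texture features (placeholder)

cca = CCA(n_components=10)
u, v = cca.fit_transform(speech, lip)     # maximally correlated projections

# Early integration: concatenate the correlated CCA components.
early = np.hstack([u, v])

# Late integration would score the CCA output streams with separate
# classifiers and fuse their decisions; per the abstract, those streams
# are (by construction) uncorrelated.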


IEEE Signal Processing Letters | 1999

Teager energy based feature parameters for speech recognition in car noise

Firas Jabloun; A.E. Cetin; Engin Erzin

In this letter, a new set of speech feature parameters based on multirate signal processing and the Teager energy operator is introduced. The speech signal is first divided into nonuniform subbands on the mel scale using a multirate filterbank, then the Teager energies of the subsignals are estimated. Finally, the feature vector is constructed by log-compression and inverse discrete cosine transform (DCT) computation. The new feature parameters yield robust speech recognition performance in the presence of car engine noise.
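A minimal sketch of this feature pipeline, assuming a simple mel-spaced Butterworth filterbank in place of the letter's multirate structure; the band edges, filter order, and energy averaging are illustrative.

# Sketch of Teager-energy-based cepstral features. The discrete Teager
# energy operator is psi[n] = x[n]^2 - x[n-1]*x[n+1]; filterbank design
# details here are assumptions, not the letter's exact multirate scheme.
import numpy as np
from scipy.signal import butter, sosfiltfilt
from scipy.fftpack import dct

def teager_energy(x):
    # Discrete Teager energy operator applied sample-wise.
    return x[1:-1] ** 2 - x[:-2] * x[2:]

def teager_cepstrum(signal, fs, n_bands=16, n_ceps=12):
    # Mel-spaced band edges between 100 Hz and 0.45*fs (assumption).
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    imel = lambda m: 700 * (10 ** (m / 2595) - 1)
    edges = imel(np.linspace(mel(100), mel(0.45 * fs), n_bands + 1))
    feats = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        sos = butter(4, [lo, hi], btype="band", fs=fs, output="sos")
        sub = sosfiltfilt(sos, signal)
        feats.append(np.mean(np.abs(teager_energy(sub))))
    # Log-compression followed by a DCT, as described in the abstract.
    return dct(np.log(np.asarray(feats) + 1e-12), norm="ortho")[:n_ceps]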


IEEE Transactions on Image Processing | 2006

Discriminative Analysis of Lip Motion Features for Speaker Identification and Speech-Reading

H.E. Cetingul; Yücel Yemez; Engin Erzin; A.M. Tekalp

There have been several studies that jointly use audio, lip intensity, and lip geometry information for speaker identification and speech-reading applications. This paper proposes using explicit lip motion information, instead of or in addition to lip intensity and/or geometry information, for speaker identification and speech-reading within a unified feature selection and discrimination analysis framework, and addresses two important issues: 1) Is using explicit lip motion information useful, and, 2) if so, what are the best lip motion features for these two applications? The best lip motion features for speaker identification are considered to be those that result in the highest discrimination of individual speakers in a population, whereas for speech-reading, the best features are those providing the highest phoneme/word/phrase recognition rate. Several lip motion feature candidates have been considered, including dense motion features within a bounding box about the lip, lip contour motion features, and combinations of these with lip shape features. Furthermore, a novel two-stage spatial and temporal discrimination analysis is introduced to select the best lip motion features for speaker identification and speech-reading applications. Experimental results using a hidden-Markov-model-based recognition system indicate that using explicit lip motion information provides additional performance gains in both applications, and that lip motion features prove more valuable in the case of the speech-reading application.
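The paper's two-stage spatial/temporal analysis is more involved, but the underlying selection criterion can be sketched as ranking candidate feature sets by cross-validated discriminability; the helper below is a hypothetical simplification.

# Minimal sketch of discriminative feature-set selection: rank each
# candidate lip motion representation by how well an LDA classifier
# separates the classes. This stands in for, and simplifies, the
# paper's two-stage spatial and temporal discrimination analysis.
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import cross_val_score

def rank_feature_sets(candidates, y):
    # `candidates` maps a feature-set name to an (n_samples, d) matrix;
    # a higher cross-validated LDA accuracy marks a more discriminative set.
    scores = {name: cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5).mean()
              for name, X in candidates.items()}
    return sorted(scores.items(), key=lambda kv: -kv[1])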


IEEE Signal Processing Letters | 1994

Adaptive filtering for non-Gaussian stable processes

Orhan Arikan; A. Enis Cetin; Engin Erzin

A large class of physical phenomena observed in practice exhibit non-Gaussian behavior. In this letter, α-stable distributions, which have heavier tails than Gaussian distributions, are considered to model non-Gaussian signals. Adaptive signal processing in the presence of such noise is a requirement of many practical problems. Since direct application of commonly used adaptation techniques fails in these applications, new algorithms for adaptive filtering of α-stable random processes are introduced.
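One widely used family of algorithms for this setting is the least-mean-p-norm (LMP) update, which replaces the squared-error cost of LMS with E|e|^p for p < α so that heavy-tailed outliers do not destabilize the weights. The sketch below illustrates that idea and is not necessarily the letter's exact algorithm.

# Sketch of a least-mean-p-norm (LMP) adaptive FIR filter, a standard
# robust alternative to LMS under alpha-stable noise (illustrative).
import numpy as np

def lmp_filter(x, d, order=8, mu=0.01, p=1.2):
    # x: input signal, d: desired signal, p < alpha (assumption).
    w = np.zeros(order)
    y = np.zeros(len(x))
    for n in range(order, len(x)):
        u = x[n - order:n][::-1]          # regressor, most recent sample first
        y[n] = w @ u
        e = d[n] - y[n]
        # Gradient of |e|^p yields the |e|^(p-1) * sign(e) update term,
        # which downweights large (heavy-tailed) errors relative to LMS.
        w += mu * np.abs(e) ** (p - 1) * np.sign(e) * u
    return w, y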


IEEE Transactions on Multimedia | 2005

Multimodal speaker identification using an adaptive classifier cascade based on modality reliability

Engin Erzin; Yücel Yemez; A.M. Tekalp

We present a multimodal open-set speaker identification system that integrates information coming from audio, face, and lip motion modalities. For the fusion of multiple modalities, we propose a new adaptive cascade rule that favors reliable modality combinations through a cascade of classifiers. The order of the classifiers in the cascade is adaptively determined based on the reliability of each modality combination. A novel reliability measure that genuinely fits the open-set speaker identification problem is also proposed to assess the accept or reject decisions of a classifier. A formal framework based on the probability of correct decision is developed for analytical comparison of the proposed adaptive rule with other classifier combination rules. The proposed adaptive rule is more robust in the presence of unreliable modalities, and outperforms the hard-level max rule and the soft-level weighted summation rule, provided that the employed reliability measure is effective in the assessment of classifier decisions. Experimental results that support this assertion are provided.
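A toy sketch of a reliability-ordered cascade, assuming each classifier returns a decision with a confidence score; the interfaces and the acceptance threshold are illustrative, not the paper's formal rule.

# Reliability-ordered classifier cascade for open-set identification:
# the most reliable modality combination decides first, and an unsure
# outcome falls through to the next classifier (hypothetical interface).
def cascade_identify(classifiers, sample, threshold=0.9):
    # `classifiers`: list of (reliability, classify_fn) pairs, where
    # classify_fn(sample) returns (speaker_id_or_None, confidence).
    for _, clf in sorted(classifiers, key=lambda c: -c[0]):
        speaker, confidence = clf(sample)
        if confidence >= threshold:       # confident accept or reject
            return speaker                # None encodes "reject" (impostor)
    return None                           # no classifier was confident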


Signal Processing | 2006

Multimodal speaker/speech recognition using lip motion, lip texture and audio

H.E. Cetingul; Engin Erzin; Yücel Yemez; A.M. Tekalp

We present a new multimodal speaker/speech recognition system that integrates audio, lip texture and lip motion modalities. Fusion of audio and face texture modalities has been investigated in the literature before. The emphasis of this work is to investigate the benefits of inclusion of the lip motion modality for two distinct cases: speaker and speech recognition. The audio modality is represented by the well-known mel-frequency cepstral coefficients (MFCC) along with the first and second derivatives, whereas the lip texture modality is represented by the 2D-DCT coefficients of the luminance component within a bounding box about the lip region. In this paper, we employ a new lip motion modality representation based on discriminative analysis of the dense motion vectors within the same bounding box for speaker/speech recognition. The fusion of audio, lip texture and lip motion modalities is performed by the so-called reliability weighted summation (RWS) decision rule. Experimental results show that inclusion of the lip motion modality provides further performance gains over those obtained by fusion of audio and lip texture alone, in both speaker identification and isolated word recognition scenarios.
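The RWS rule itself reduces to a reliability-weighted combination of per-modality class scores, sketched below; how the reliability weights are estimated is the papers' contribution and is assumed given here.

# Sketch of the reliability weighted summation (RWS) decision rule:
# fuse per-modality class scores as a convex combination weighted by
# each modality's reliability estimate (weights assumed precomputed).
import numpy as np

def rws_fuse(scores, reliabilities):
    # `scores`: (n_modalities, n_classes) normalized class likelihoods;
    # `reliabilities`: (n_modalities,) nonnegative reliability weights.
    w = np.asarray(reliabilities, dtype=float)
    w /= w.sum()                          # normalize to a convex combination
    fused = w @ np.asarray(scores)        # weighted sum over modalities
    return int(np.argmax(fused)), fused   # decision and fused scores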


International Conference on Acoustics, Speech, and Signal Processing | 2003

Joint audio-video processing for biometric speaker identification

A. Kanak; Engin Erzin; Yücel Yemez; A.M. Tekalp

We present a bimodal audio-visual speaker identification system. The objective is to improve the recognition performance over conventional unimodal schemes. The proposed system exploits not only the temporal and spatial correlations existing in the speech and video signals of a speaker, but also the cross-correlation between these two modalities. Lip images extracted from each video frame are transformed onto an eigenspace. The obtained eigenlip coefficients are interpolated to match the rate of the speech signal and fused with Mel frequency cepstral coefficients (MFCC) of the corresponding speech signal. The resulting joint feature vectors are used to train and test a hidden Markov model (HMM) based identification system. Experimental results are included to demonstrate the system performance.
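The rate-matching step described above can be sketched as per-coefficient linear interpolation of the eigenlip stream up to the audio frame rate, followed by concatenation; the dimensions and frame rates below are illustrative.

# Sketch of rate-matching and concatenating eigenlip and MFCC streams
# into joint feature vectors for HMM training (illustrative shapes).
import numpy as np

def fuse_eigenlip_mfcc(eigenlip, mfcc):
    # `eigenlip`: (n_video_frames, d_v), e.g. ~25-30 fps;
    # `mfcc`: (n_audio_frames, d_a), e.g. ~100 fps.
    t_video = np.linspace(0.0, 1.0, len(eigenlip))
    t_audio = np.linspace(0.0, 1.0, len(mfcc))
    # Linearly interpolate each eigenlip coefficient to the audio rate.
    upsampled = np.column_stack(
        [np.interp(t_audio, t_video, eigenlip[:, k])
         for k in range(eigenlip.shape[1])])
    return np.hstack([mfcc, upsampled])   # joint audio-visual feature vectors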


IEEE MultiMedia | 2006

Multimodal person recognition for human-vehicle interaction

Engin Erzin; Yücel Yemez; A.M. Tekalp; Aytül Erçil; Hakan Erdogan; Hüseyin Abut

Next-generation vehicles will undoubtedly feature biometric person recognition as part of an effort to improve the driving experience, but today's technology prevents such systems from operating satisfactorily under adverse conditions. The authors propose a framework that achieves person recognition by combining different biometric modalities, borne out in two case studies.


International Conference on Acoustics, Speech, and Signal Processing | 1995

Subband analysis for robust speech recognition in the presence of car noise

Engin Erzin; A.E. Cetin; Y. Yardimci

A new set of speech feature representations for robust speech recognition in the presence of car noise is proposed. These parameters are based on subband analysis of the speech signal. Line spectral frequency (LSF) representation of the linear prediction (LP) analysis in subbands and cepstral coefficients derived from subband analysis (SUBCEP) are introduced, and the performances of the new feature representations are compared to mel scale cepstral coefficients (MELCEP) in the presence of car noise. Subband analysis based parameters are observed to be more robust than the commonly employed MELCEP representations.
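As background for the LSF-based subband features, the sketch below computes LP coefficients by the autocorrelation method and derives LSFs from the symmetric and antisymmetric polynomials; the paper applies such analysis per subband, while this sketch handles a single frame for brevity.

# Sketch of line spectral frequencies (LSFs) from LP analysis, the
# building block of the proposed subband representation (single frame).
import numpy as np

def lp_coefficients(frame, order=10):
    # Autocorrelation method: solve the Yule-Walker normal equations.
    r = np.correlate(frame, frame, "full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate([[1.0], -a])    # A(z) = 1 - sum_k a_k z^-k

def lsf(a):
    # Form P(z) = A(z) + z^-(p+1) A(1/z) and Q(z) = A(z) - z^-(p+1) A(1/z);
    # their roots lie on the unit circle and their angles are the LSFs.
    p = np.concatenate([a, [0.0]]) + np.concatenate([[0.0], a[::-1]])
    q = np.concatenate([a, [0.0]]) - np.concatenate([[0.0], a[::-1]])
    angles = np.concatenate([np.angle(np.roots(p)), np.angle(np.roots(q))])
    # Keep the angles strictly inside (0, pi); trivial roots fall at 0, pi.
    return np.sort(angles[(angles > 0) & (angles < np.pi)])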


Speech Communication | 2011

Formant position based weighted spectral features for emotion recognition

Elif Bozkurt; Engin Erzin; Çiğdem Eroğlu Erdem; A. Tanju Erdem

In this paper, we propose novel spectrally weighted mel-frequency cepstral coefficient (WMFCC) features for emotion recognition from speech. The idea is based on the fact that formant locations carry emotion-related information, and therefore critical spectral bands around formant locations can be emphasized during the calculation of MFCC features. The spectral weighting is derived from the normalized inverse harmonic mean function of the line spectral frequency (LSF) features, which are known to be localized around formant frequencies. The above approach can be considered as an early data fusion of spectral content and formant location information. We also investigate methods for late decision fusion of unimodal classifiers. We evaluate the proposed WMFCC features together with the standard spectral and prosody features using HMM-based classifiers on the spontaneous FAU Aibo emotional speech corpus. The results show that unimodal classifiers with the WMFCC features perform significantly better than the classifiers with standard spectral features. Late decision fusion of classifiers provides further significant performance improvements.
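A hedged sketch of the spectral weighting step: the paper derives weights from the normalized inverse harmonic mean of the LSFs, which is large where LSFs cluster (i.e., near formants); the inverse-nearest-distance form below is an illustrative stand-in for that function, not the paper's exact formula.

# Sketch of LSF-derived spectral weighting applied before the mel
# filterbank in a WMFCC-style pipeline (weighting form is an assumption).
import numpy as np

def lsf_weighting(freq_bins, lsfs, eps=1e-3):
    # `freq_bins` and `lsfs` must share units (e.g., radians/sample).
    # Closely spaced LSF pairs mark formants, so weight each frequency
    # bin by the inverse distance to its nearest LSF (stand-in function).
    d = np.abs(freq_bins[:, None] - lsfs[None, :]).min(axis=1)
    w = 1.0 / (d + eps)
    return w / w.sum()                    # normalized spectral weights

def weighted_power_spectrum(power_spec, freq_bins, lsfs):
    # Emphasize formant-adjacent bins, then proceed with the usual
    # mel filterbank, log-compression, and DCT steps of MFCC extraction.
    return power_spec * lsf_weighting(freq_bins, lsfs)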
