Mihai Gurban
École Polytechnique Fédérale de Lausanne
Publications
Featured research published by Mihai Gurban.
IEEE Transactions on Signal Processing | 2009
Mihai Gurban; Jean-Philippe Thiran
The problem of feature selection has been thoroughly analyzed in the context of pattern classification, with the purpose of avoiding the curse of dimensionality. In the context of multimodal signal processing, however, the problem has received far less attention. Our approach to feature extraction is based on information theory, with an application to multimodal classification, in particular audio-visual speech recognition. Contrary to previous work on information-theoretic feature selection applied to multimodal signals, our proposed methods penalize features for their redundancy, achieving more compact feature sets and better performance. We propose two greedy algorithms for selecting visual features for audio-visual speech recognition: one penalizes a proportion of the feature redundancy, while the other uses conditional mutual information as its evaluation measure. Our features outperform linear discriminant analysis, the most common dimensionality-reduction transform in the field, across a wide range of dimensionality values and when combined with audio at different quality levels.
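A minimal sketch of a greedy, redundancy-penalized selection loop in this spirit follows. The score `I(f; C) - beta * max_s I(f; s)`, the penalty weight `beta`, and the histogram discretization are illustrative assumptions, not the paper's exact criteria:

```python
# Hedged sketch of greedy feature selection with a redundancy penalty.
# Not the authors' code: the scoring rule and binning are assumptions.
import numpy as np
from sklearn.metrics import mutual_info_score

def greedy_select(X, y, k, beta=0.5, bins=8):
    """X: (n_samples, n_features) continuous features; y: class labels."""
    # Discretize each feature so plug-in MI estimates are well defined.
    Xd = np.stack([np.digitize(col, np.histogram_bin_edges(col, bins))
                   for col in X.T], axis=1)
    selected, remaining = [], list(range(X.shape[1]))
    while len(selected) < k and remaining:
        def score(f):
            relevance = mutual_info_score(Xd[:, f], y)
            redundancy = max((mutual_info_score(Xd[:, f], Xd[:, s])
                              for s in selected), default=0.0)
            return relevance - beta * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected
```

The paper's two algorithms differ precisely in how the redundancy penalty is formulated; the maximum pairwise MI used here is purely for illustration.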
IEEE Transactions on Audio, Speech, and Language Processing | 2012
Virginia Estellers; Mihai Gurban; Jean-Philippe Thiran
The integration of audio and visual information improves speech recognition performance, especially in the presence of noise. In these circumstances it is necessary to introduce audio and visual weights to control the contribution of each modality to the recognition task. We present a method to set the weights associated with each stream according to their reliability for speech recognition, allowing them to change over time and adapt to different noise and working conditions. Our dynamic weights are derived from several measures of stream reliability, some specific to speech processing and others inherent to any classification task, and take into account the special role of silence detection in the definition of audio and visual weights. In this paper, we propose a new confidence measure, compare it to existing ones, and point out the importance of correctly detecting silence utterances in the definition of the weighting system. Experimental results support our main contribution: including a voice activity detector in the weighting scheme improves speech recognition across different system architectures and confidence measures, yielding a performance gain larger than any difference among the proposed confidence measures.
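For context, stream weights in this line of work typically enter the standard multi-stream HMM observation likelihood, where per-frame exponents modulate each modality's contribution (a textbook formulation, not a claim about this paper's exact model):

```latex
b_q(o_t) \;=\; p(o_{a,t} \mid q)^{\lambda_{a,t}} \; p(o_{v,t} \mid q)^{\lambda_{v,t}},
\qquad \lambda_{a,t} + \lambda_{v,t} = 1, \quad \lambda_{a,t} \in [0,1]
```

A reliability measure then amounts to a per-frame mapping from the streams' confidence estimates to the audio weight $\lambda_{a,t}$.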
international conference on multimodal interfaces | 2008
Mihai Gurban; Jean-Philippe Thiran; Thomas Drugman; Thierry Dutoit
Merging decisions from different modalities is a crucial problem in audio-visual speech recognition. To address it, state-synchronous multi-stream HMMs have been proposed, with the important advantage of incorporating stream reliability into the fusion scheme. This paper focuses on stream weight adaptation based on modality confidence estimators. We assume different and time-varying environmental noise, as encountered in realistic applications, for which adaptive methods are best suited. Stream reliability is assessed directly from classifier outputs, since these are not specific to any noise type or level. The influence of constraining the weights to sum to one is also discussed.
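One common classifier-output confidence measure is the entropy of the per-frame class posteriors. A hedged sketch of mapping such confidences to normalized stream weights (the entropy-based confidence and the normalization are illustrative choices, not necessarily the estimators evaluated in the paper):

```python
# Hedged sketch: deriving per-frame stream weights from classifier outputs.
import numpy as np

def stream_weights_from_posteriors(p_audio, p_video, eps=1e-12):
    """p_audio, p_video: per-frame class posteriors, each shape (n_classes,)."""
    def confidence(p):
        # Low entropy -> peaked posterior -> confident stream.
        h = -np.sum(p * np.log(p + eps))
        return 1.0 / (1.0 + h)
    c_a, c_v = confidence(p_audio), confidence(p_video)
    lam_a = c_a / (c_a + c_v)      # enforce lambda_a + lambda_v = 1
    return lam_a, 1.0 - lam_a
```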
multimedia signal processing | 2007
Thomas Drugman; Mihai Gurban; Jean-Philippe Thiran
We present a feature selection method based on information-theoretic measures, targeted at multimodal signal processing, showing how the relevance of features from different modalities can be assessed quantitatively. We are able to find the features carrying the most information relevant to the recognition task while at the same time having minimal redundancy. Our application is audio-visual speech recognition, in particular the selection of relevant visual features. Experimental results show that our method outperforms other feature selection algorithms from the literature, improving recognition accuracy even with a significantly reduced number of features.
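A well-known criterion in this max-relevance, min-redundancy family is the mRMR selection rule, shown here for context (the paper's exact objective may differ):

```latex
f^{*} \;=\; \arg\max_{f \notin S} \Big[\, I(f;C) \;-\; \frac{1}{|S|} \sum_{s \in S} I(f;s) \,\Big]
```

where $C$ is the class variable and $S$ the set of already-selected features.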
Multimodal Signal Processing: Theory and Applications for Human–Computer Interaction | 2010
Mihai Gurban; Jean-Philippe Thiran
Keywords: Multimodal; Signal Processing; Human–Computer Interaction; LTS5. EPFL reference: EPFL-CHAPTER-144046.
international conference on image processing | 2009
Virginia Estellers; Mihai Gurban; Jean-Philippe Thiran
A quantitative measure of relevance is proposed for the task of constructing visual feature sets which are at the same time relevant and compact. A feature's relevance is given by the amount of information it contains about the problem, while compactness is achieved by preventing the replication of information between features. To achieve these goals, we use mutual information both for assessing relevance and for measuring the redundancy between features. Our application is speechreading, that is, speech recognition performed on video of the speaker. This is justified by the fact that the performance of audio speech recognition can be improved by augmenting the audio features with visual ones, especially when there is noise in the audio channel. We report significant improvements compared to the most common method of dimensionality reduction for speechreading, linear discriminant analysis (LDA).
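Both quantities rest on the standard definition of mutual information between a (discretized) feature $X$ and the class $C$:

```latex
I(X;C) \;=\; \sum_{x} \sum_{c} p(x,c)\, \log \frac{p(x,c)}{p(x)\, p(c)}
```

Relevance is the MI between a feature and the class, while redundancy between two features $f_i$, $f_j$ is the MI $I(f_i; f_j)$.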
Archive | 2008
Mihai Gurban; Verónica Vilaplana; Jean-Philippe Thiran; Ferran Marqués
Two of the communication channels conveying the most information in human-to-human interaction are the face and speech. A robust interpretation of the information people express can be obtained by the combined analysis of both sources: short-term facial feature evolution (face) and speech information. Face and speech combined analysis is thus the basis of a large number of human-computer interfaces and services. Regardless of the final application of such interfaces, two aspects are commonly required: detection of human faces and the combination of both sources of information. In the first section of the chapter, we review the state of the art of face and facial feature detection. The various methods are analyzed from the perspective of the different models they use to represent images and patterns: pixel-based, block-based, transform-coefficient-based and region-based techniques. In the second section of the chapter, we present two examples of multimodal signal processing applications. The first one localizes the speaker's mouth in a video sequence, using both the audio signal and the motion extracted from the video. The second recognizes the spoken words in a video sequence using both the audio and the images of the moving lips.
Multimodal Signal Processing: Theory and Applications for Human–Computer Interaction | 2010
Mihai Gurban; Jean-Philippe Thiran
Keywords: Multimodal; LTS5. EPFL reference: EPFL-CHAPTER-144074.
european signal processing conference | 2006
Mihai Gurban; Jean-Philippe Thiran
european signal processing conference | 2005
Mihai Gurban; Jean-Philippe Thiran