Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Hervé Bourlard is active.

Publication


Featured research published by Hervé Bourlard.


Image and Vision Computing | 2009

Social signal processing

Alessandro Vinciarelli; Maja Pantic; Hervé Bourlard

The ability to understand and manage the social signals of a person we are communicating with is the core of social intelligence. Social intelligence is a facet of human intelligence that has been argued to be indispensable, and perhaps the most important, for success in life. This paper argues that next-generation computing needs to include the essence of social intelligence - the ability to recognize human social signals and social behaviours like turn-taking, politeness, and disagreement - in order to become more effective and more efficient. Although each one of us understands the importance of social signals in everyday life, and in spite of recent advances in machine analysis of relevant behavioural cues such as blinks, smiles, crossed arms, and laughter, the design and development of automated systems for social signal processing (SSP) remain rather difficult. This paper surveys past efforts to solve these problems by computer, summarizes the relevant findings in social psychology, and proposes a set of recommendations for enabling the development of the next generation of socially aware computing.


ACM Multimedia | 2008

Social signal processing: state-of-the-art and future perspectives of an emerging domain

Alessandro Vinciarelli; Maja Pantic; Hervé Bourlard; Alex Pentland

The ability to understand and manage the social signals of a person we are communicating with is the core of social intelligence. Social intelligence is a facet of human intelligence that has been argued to be indispensable, and perhaps the most important, for success in life. This paper argues that next-generation computing needs to include the essence of social intelligence - the ability to recognize human social signals and social behaviours like politeness and disagreement - in order to become more effective and more efficient. Although each one of us understands the importance of social signals in everyday life, and in spite of recent advances in machine analysis of relevant behavioural cues such as blinks, smiles, crossed arms, and laughter, the design and development of automated systems for Social Signal Processing (SSP) remain rather difficult. This paper surveys past efforts to solve these problems by computer, summarizes the relevant findings in social psychology, and proposes a set of recommendations for enabling the development of the next generation of socially aware computing.


International Conference on Multimodal Interfaces | 2008

Social signals, their function, and automatic analysis: a survey

Alessandro Vinciarelli; Maja Pantic; Hervé Bourlard; Alex Pentland

Social Signal Processing (SSP) aims at the analysis of social behaviour in both human-human and human-computer interactions. SSP revolves around the automatic sensing and interpretation of social signals: complex aggregates of nonverbal behaviours through which individuals express their attitudes towards other human (and virtual) participants in the current social context. As such, SSP integrates both engineering (speech analysis, computer vision, etc.) and human sciences (social psychology, anthropology, etc.), as it requires multimodal and multidisciplinary approaches. As of today, SSP is still in its infancy, but the domain is developing quickly, and a growing body of work is appearing in the literature. This paper provides an introduction to the nonverbal behaviour involved in social signals and a survey of the main results obtained so far in SSP. It also outlines the possibilities and challenges that SSP is expected to face in the coming years if it is to reach full maturity.


Computer Speech & Language | 2003

Robust speech recognition and feature extraction using HMM2

Katrin Weber; Shajith Ikbal; Samy Bengio; Hervé Bourlard

This paper presents the theoretical basis and preliminary experimental results of a new HMM-based model, referred to as HMM2, which can be considered as a mixture of HMMs. In this new model, the emission probabilities of the temporal (primary) HMM are estimated through secondary, state-specific HMMs working in the acoustic feature space. Thus, while the primary HMM performs the usual time warping and integration, the secondary HMMs are responsible for extracting and modeling possible feature dependencies, while performing frequency warping and integration. Such a model has several potential advantages, such as more flexible modeling of the time/frequency structure of the speech signal. When working with spectral features, such a system can also perform nonlinear spectral warping, effectively implementing a form of nonlinear vocal tract normalization. Furthermore, it is shown that HMM2 can be used to extract noise-robust features, assumed to be related to formant regions, which can be used as extra features for traditional HMM recognizers to improve their performance. These issues are evaluated in the present paper, and different experimental results are reported on the Numbers95 database.
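As a rough illustration of the layered estimation (notation mine, not the paper's): in a conventional HMM the emission likelihood p(x_t | q_t = i) would be, for example, a Gaussian mixture; in HMM2 it is the likelihood of a secondary, state-specific HMM run along the K frequency components of the frame, computed with the usual forward recursion:

p(\mathbf{x}_t \mid q_t = i) \;=\; \sum_{r_1,\dots,r_K} \prod_{k=1}^{K} p(x_{t,k} \mid r_k, i)\, P(r_k \mid r_{k-1}, i)

where \mathbf{x}_t = (x_{t,1}, \dots, x_{t,K}) is the acoustic frame and r_k ranges over the states of the secondary HMM attached to primary state i. The alignment of frequency components to secondary states is what performs the frequency warping described above, mirroring the time-warping role of the state alignment in a conventional HMM.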


Signal Processing | 2014

Enhanced diffuse field model for ad hoc microphone array calibration

Mohammad Javad Taghizadeh; Philip N. Garner; Hervé Bourlard

In this paper, we investigate the diffuse field coherence model for microphone array pairwise distance estimation. We study the fundamental constraints and assumptions underlying this approach and propose evaluation methodologies to measure the adequacy of diffuseness for microphone array calibration. In addition, an enhanced scheme based on coherence averaging and histogramming is presented to improve the robustness and performance of the pairwise distance estimation approach. The proposed theories and algorithms are evaluated on simulated and real data recordings for calibration of microphone array geometry in an ad hoc set-up.

Highlights:
- Averaging and histogramming improve the diffuse field coherence model for calibration.
- A novel approach for assessing the adequacy of diffuseness is formulated.
- The relation between distance, enclosure dimension and diffuseness is characterized.
- A methodology for augmenting the diffuse sound field is proposed.
- The fundamental limitation of calibration based on the coherence model is analyzed.
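For concreteness, a minimal sketch of the underlying coherence model (before the paper's averaging and histogramming enhancements; the function name, frequency band and distance grid are my own assumptions): in an ideal diffuse field, the real part of the coherence between two omnidirectional microphones at distance d is sinc(2*pi*f*d/c), so d can be estimated by fitting this curve to the measured coherence.

import numpy as np
from scipy.signal import csd, welch

def pairwise_distance(x1, x2, fs, c=343.0, d_grid=np.arange(0.05, 3.0, 0.005)):
    # Estimate the coherence of the two microphone signals via Welch spectra.
    f, Pxy = csd(x1, x2, fs=fs, nperseg=1024)
    _, Pxx = welch(x1, fs=fs, nperseg=1024)
    _, Pyy = welch(x2, fs=fs, nperseg=1024)
    gamma = np.real(Pxy / np.sqrt(Pxx * Pyy))   # measured coherence
    band = (f > 200) & (f < 4000)               # restrict to a usable band (assumption)
    # Grid search: np.sinc(2*f*d/c) equals sin(2*pi*f*d/c) / (2*pi*f*d/c).
    errs = [np.sum((gamma[band] - np.sinc(2 * f[band] * d / c)) ** 2) for d in d_grid]
    return d_grid[int(np.argmin(errs))]

In a real room the measured coherence deviates from this ideal curve, which is precisely what the paper's diffuseness-adequacy tests and the averaging/histogramming scheme address.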


Speech Communication | 2016

Sparse modeling of neural network posterior probabilities for exemplar-based speech recognition

Pranay Dighe; Afsaneh Asaei; Hervé Bourlard

Highlights:
- Automatic speech recognition can be cast as a realization of compressive sensing.
- Posterior probabilities are suitable features for exemplar-based sparse modeling.
- Posterior-based sparse representation meets the statistical speech recognition formalism.
- Dictionary learning reduces the required collection of exemplars and improves performance.
- Collaborative hierarchical sparsity exploits temporal information in continuous speech.

In this paper, a compressive sensing (CS) perspective on exemplar-based speech processing is proposed. Relying on an analytical relationship between the CS formulation and statistical speech recognition (hidden Markov models, HMMs), the automatic speech recognition (ASR) problem is cast as the recovery of a high-dimensional sparse word representation from the observed low-dimensional acoustic features. The acoustic features are exemplars obtained from (deep) neural network sub-word conditional posterior probabilities. Low-dimensional word manifolds are learned using these sub-word posterior exemplars and exploited to construct a linguistic dictionary for sparse representation of word posteriors. Dictionary learning has been found to be a principled way to alleviate the need for the huge collection of exemplars required in conventional exemplar-based approaches, while still improving the performance. Context appending and collaborative hierarchical sparsity are used to exploit the sequential and group structure underlying the word sparse representation. This formulation leads to a posterior-based sparse modeling approach to speech recognition. The potential of the proposed approach is demonstrated on isolated word (Phonebook corpus) and continuous speech (Numbers corpus) recognition tasks.
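A minimal sketch of the sparse-recovery step, with a plain lasso standing in for the paper's dictionary-learned, collaborative hierarchical formulation (the function, alpha value and variable names are assumptions):

import numpy as np
from sklearn.linear_model import Lasso

def decode_word(z, D, word_of_atom, alpha=0.01):
    # z: observed sub-word posterior vector, shape (dim,); D: dictionary whose
    # columns are posterior exemplars, shape (dim, n_atoms); word_of_atom maps
    # each column of D to its word label.
    lasso = Lasso(alpha=alpha, positive=True, max_iter=5000)
    lasso.fit(D, z)                     # solve min ||z - D w||^2 + alpha * ||w||_1
    w = lasso.coef_
    words = np.unique(word_of_atom)
    scores = np.array([w[word_of_atom == v].sum() for v in words])
    return words[int(np.argmax(scores))]   # word with the largest pooled weight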


Speech Communication | 2016

On structured sparsity of phonological posteriors for linguistic parsing

Milos Cernak; Afsaneh Asaei; Hervé Bourlard

Highlights:
- A phonological posterior is a sparse vector consisting of phonological class probabilities.
- The phonological posterior is estimated from a short speech segment using a deep neural network.
- Segmental phonological posteriors convey supra-segmental information on linguistic events.
- A linguistic class is characterized by a codebook of binary phonological structures.
- Linguistic parsing is achieved with high accuracy using binary pattern matching.

The speech signal conveys information on different time scales, from the short (20-40 ms) time scale, or segmental level, associated with phonological and phonetic information, to the long (150-250 ms) time scale, or supra-segmental level, associated with syllabic and prosodic information. Linguistic and neurocognitive studies recognize the phonological classes at the segmental level as the essential and invariant representations used in speech temporal organization. In the context of speech processing, a deep neural network (DNN) is an effective computational method to infer the probability of individual phonological classes from a short segment of speech signal. A vector of all phonological class probabilities is referred to as a phonological posterior. Only very few classes are present in a short-term speech signal; hence, the phonological posterior is a sparse vector. Although the phonological posteriors are estimated at the segmental level, we claim that they convey supra-segmental information. Specifically, we demonstrate that phonological posteriors are indicative of syllabic and prosodic events. Building on findings from converging linguistic evidence on the gestural model of Articulatory Phonology as well as the neural basis of speech perception, we hypothesize that phonological posteriors convey properties of linguistic classes at multiple time scales, and that this information is embedded in their support (the indices of active coefficients). To verify this hypothesis, we obtain a binary representation of phonological posteriors at the segmental level, referred to as the first-order sparsity structure; high-order structures are obtained by the concatenation of first-order binary vectors. It is then confirmed that the classification of supra-segmental linguistic events, a problem known as linguistic parsing, can be achieved with high accuracy using simple binary pattern matching of first-order or high-order structures.
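A minimal sketch of the binary pattern matching step (the threshold, shapes and names are my assumptions, not the paper's exact procedure):

import numpy as np

def parse_linguistic_event(posteriors, codebook, labels, threshold=0.5):
    # posteriors: (T, C) block of segmental phonological posteriors; binarizing
    # each row gives first-order structures, and concatenating them gives a
    # high-order structure, matched against a codebook of binary templates
    # of shape (n_templates, T*C) by Hamming distance.
    pattern = (posteriors > threshold).astype(np.uint8).ravel()
    dists = np.count_nonzero(codebook != pattern, axis=1)   # Hamming distances
    return labels[int(np.argmin(dists))]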


Signal Processing | 2015

Ad hoc microphone array calibration

Mohammad Javad Taghizadeh; Reza Parhizkar; Philip N. Garner; Hervé Bourlard; Afsaneh Asaei

This paper addresses the problem of ad hoc microphone array calibration where only partial information about the distances between microphones is available. We construct a matrix consisting of the pairwise distances and propose to estimate the missing entries with a novel Euclidean distance matrix (EDM) completion algorithm, alternating between low-rank matrix completion and projection onto the Euclidean distance space. This approach confines the recovered matrix to the EDM cone at each iteration of the matrix completion algorithm. Theoretical guarantees on the calibration performance are obtained, considering random and locally structured missing entries as well as measurement noise on the known distances. This study elucidates the links between the calibration error and the number of microphones, the noise level, and the ratio of missing distances. Thorough experiments on real data recordings and simulated setups are conducted to demonstrate these theoretical insights. A significant improvement is achieved by the proposed Euclidean distance matrix completion algorithm over state-of-the-art techniques for ad hoc microphone array calibration.

Highlights:
- Euclidean matrix completion enables calibration from partial distance measurements.
- A novel Euclidean matrix completion algorithm is proposed.
- The relation between error and the number of microphones, noise, and missing distances is derived.
- Theoretical insights are demonstrated by thorough experiments on real and simulated data.
- The performance is compared with SDP, S-Stress, MDS-MAP and matrix completion.
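A rough sketch of the alternating idea (a simplification of the paper's algorithm, which comes with theoretical guarantees; the initialization, iteration count and names here are mine):

import numpy as np

def complete_edm(D_obs, mask, dim=3, n_iter=500):
    # D_obs: squared pairwise distances, zeros where unknown;
    # mask: True where a distance was actually measured.
    D = np.where(mask, D_obs, D_obs[mask].mean())    # crude initialization
    for _ in range(n_iter):
        # Low-rank step: an EDM of points in R^dim has rank at most dim + 2.
        U, s, Vt = np.linalg.svd(D)
        D = (U[:, :dim + 2] * s[:dim + 2]) @ Vt[:dim + 2]
        # Projection toward the EDM cone: symmetry, nonnegativity, zero diagonal.
        D = np.maximum((D + D.T) / 2, 0.0)
        np.fill_diagonal(D, 0.0)
        D[mask] = D_obs[mask]                        # re-impose measured entries
    return D

The microphone positions can then be recovered from the completed matrix by classical multidimensional scaling, up to a rigid transformation.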


International Conference on Machine Learning | 2006

Audio-Visual processing in meetings: seven questions and current AMI answers

Marc Al-Hames; Thomas Hain; Jan Cernocky; Sascha Schreiber; Mannes Poel; Ronald Müller; Sébastien Marcel; David A. van Leeuwen; Jean-Marc Odobez; Sileye Ba; Hervé Bourlard; Fabien Cardinaux; Daniel Gatica-Perez; Adam Janin; Petr Motlicek; Stephan Reiter; Steve Renals; Jeroen van Rest; Rutger Rienks; Gerhard Rigoll; Kevin Smith; Andrew Thean; Pavel Zemcik

The project Augmented Multi-party Interaction (AMI) is concerned with the development of meeting browsers and remote meeting assistants for instrumented meeting rooms, and with the required component technologies. Its R&D themes include group dynamics; audio, visual, and multimodal processing; content abstraction; and human-computer interaction. The audio-visual processing workpackage within AMI addresses automatic recognition from audio, video, and combined audio-video streams that have been recorded during meetings. In this article we describe the progress that has been made in the first two years of the project. We show how the large problem of audio-visual processing in meetings can be split into seven questions, such as "Who is acting during the meeting?". We then show which algorithms and methods have been developed and evaluated for automatically answering these questions.


Speech Communication | 2016

Computational methods for underdetermined convolutive speech localization and separation via model-based sparse component analysis

Afsaneh Asaei; Hervé Bourlard; Mohammad Javad Taghizadeh; Volkan Cevher

Highlights:
- Model-based sparse component analysis exploits structured sparsity for source separation.
- Spectral sparsity structures are formulated upon the principles of auditory scene analysis.
- Spatial sparsity structures are formulated upon the image model of multipath propagation.
- The performance of greedy, convex and Bayesian sparse recovery is evaluated.
- Ad hoc microphone arrays may lead to significant improvement in SCA performance.

In this paper, the problem of speech source localization and separation from recordings of convolutive underdetermined mixtures is addressed. This problem is cast as recovering the spatio-spectral speech information embedded in the microphone array's compressed measurements of the acoustic field. A model-based sparse component analysis framework is formulated for sparse reconstruction of the speech spectra in a reverberant acoustic environment, resulting in joint localization and separation of the individual sources. We compare and contrast the algorithmic approaches to model-based sparse recovery exploiting spatial sparsity as well as the spectral structures underlying the spectrographic representation of speech signals. In this context, we explore identification of the sparsity structures in the auditory and acoustic representation spaces. The auditory structures are formulated upon the principles of structural grouping based on proximity, autoregressive correlation and harmonicity of the spectral coefficients, and they are incorporated for sparse reconstruction. The acoustic structures are formulated upon the image model of multipath propagation, and they are exploited to characterize the compressive measurement matrix associated with microphone array recordings. Three approaches to sparse recovery, relying on combinatorial optimization, convex relaxation and sparse Bayesian learning, are studied and evaluated in thorough experiments. The sparse Bayesian learning method is shown to yield better perceptual quality, while interference suppression is also achieved by the combinatorial approach, which offers the most efficient computational cost. Furthermore, it is demonstrated that an average autoregressive model can be learned for speech localization, while exploiting the proximity structure in the form of block-sparse coefficients enables accurate localization and high-quality speech separation. Throughout the extensive empirical evaluation, we confirm that a large and random placement of the microphones enables significant improvement in source localization and separation performance.
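A minimal sketch of a greedy, block-structured recovery, standing in for the combinatorial approach the paper evaluates (Phi, the block grouping and all names are illustrative assumptions; in the paper the measurement matrix is derived from the image model of multipath propagation):

import numpy as np

def block_omp(y, Phi, blocks, n_sources):
    # y: compressed acoustic measurements; Phi: measurement matrix; blocks:
    # list of column-index arrays, one per candidate source location.
    residual, support = y.astype(float).copy(), []
    for _ in range(n_sources):
        # Score each unselected block by its correlation with the residual.
        scores = [np.linalg.norm(Phi[:, b].T @ residual) if j not in support
                  else -np.inf for j, b in enumerate(blocks)]
        support.append(int(np.argmax(scores)))
        cols = np.concatenate([blocks[j] for j in support])
        x, *_ = np.linalg.lstsq(Phi[:, cols], y, rcond=None)  # refit on the support
        residual = y - Phi[:, cols] @ x
    return support  # indices of the estimated source locations

Selecting a block of coefficients per candidate location is a simple form of the proximity-based block sparsity the paper reports as enabling accurate localization.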

Collaboration


Dive into Hervé Bourlard's collaborations.

Top Co-Authors

Steve Renals, University of Edinburgh
Samy Bengio, International Computer Science Institute
Mathew Magimai.-Doss, École Polytechnique Fédérale de Lausanne
Afsaneh Asaei, Idiap Research Institute
Katrin Weber, Idiap Research Institute
John Dines, Idiap Research Institute