Mirco Ravanelli
fondazione bruno kessler
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Mirco Ravanelli.
ieee automatic speech recognition and understanding workshop | 2015
Mirco Ravanelli; Luca Cristoforetti; Roberto Gretter; Marco Pellin; Alessandro Sosi; Maurizio Omologo
This paper introduces the contents and the possible usage of the DIRHA-ENGLISH multi-microphone corpus, recently realized under the EC DIRHA project. The reference scenario is a domestic environment equipped with a large number of microphones and microphone arrays distributed in space. The corpus is composed of both real and simulated material, and it includes 12 US and 12 UK English native speakers. Each speaker uttered different sets of phonetically-rich sentences, newspaper articles, conversational speech, keywords, and commands. From this material, a large set of 1-minute sequences was generated, which also includes typical domestic background noise as well as inter/intra-room reverberation effects. Dev and test sets were derived, which represent a very precious material for different studies on multi-microphone speech processing and distant-speech recognition. Various tasks and corresponding Kaldi recipes have already been developed. The paper reports a first set of baseline results obtained using different techniques, including Deep Neural Networks (DNN), aligned with the state-of-the-art at international level.
Hands-free Speech Communication and Microphone Arrays (HSCMA), 2014 4th Joint Workshop on | 2014
Alessio Brutti; Mirco Ravanelli; Piergiorgio Svaizer; Maurizio Omologo
Domestic environments are particularly challenging for distant speech recognition and audio processing in general. Reverberation, background noise and interfering sources, as well as the propagation of acoustic events across adjacent rooms, critically degrade the performance of standard speech processing algorithms. The DIRHA EU project addresses the development of distant-speech interaction with devices and services within the multiple rooms of typical apartments. A corpus of multichannel acoustic data has been created to represent realistic acoustic scenes, of different degrees of complexity, occurring in such an environment. It includes multichannel simulations based on measured impulse responses and real data collected in the same apartment. A basic but fundamental task of the front-end processing enabling effective ASR is the detection and localization of speech events generated by users, without constraints on their position or orientation within the various rooms. In this paper we describe the acoustic corpus and present a baseline approach to the joint task of speech detection and source localization, using speech related features such as pitch, combined with features derived from spatial coherence.
international conference on acoustics, speech, and signal processing | 2017
Mirco Ravanelli; Philemon Brakel; Maurizio Omologo; Yoshua Bengio
Despite the remarkable progress recently made in distant speech recognition, state-of-the-art technology still suffers from a lack of robustness, especially when adverse acoustic conditions characterized by non-stationary noises and reverberation are met.
conference of the international speech communication association | 2016
Mirco Ravanelli; Piergiorgio Svaizer; Maurizio Omologo
The availability of realistic simulated corpora is of key importance for the future progress of distant speech recognition technology. The reliability, flexibility and low computational cost of a data simulation process may ultimately allow researchers to train, tune and test different techniques in a variety of acoustic scenarios, avoiding the laborious effort of directly recording real data from the targeted environment. In the last decade, several simulated corpora have been released to the research community, including the data-sets distributed in the context of projects and international challenges, such as CHiME and REVERB. These efforts were extremely useful to derive baselines and common evaluation frameworks for comparison purposes. At the same time, in many cases they highlighted the need of a better coherence between real and simulated conditions. In this paper, we examine this issue and we describe our approach to the generation of realistic corpora in a domestic context. Experimental validation, conducted in a multi-microphone scenario, shows that a comparable performance trend can be observed with both real and simulated data across different recognition frameworks, acoustic models, as well as multi-microphone processing techniques.
acm multimedia | 2015
Mirco Ravanelli; Benjamin Elizalde; Julia Bernd; Gerald Friedland
Multimedia Event Detection (MED) aims to identify events-also called scenes-in videos, such as a flash mob or a wedding ceremony. Audio content information complements cues such as visual content and text. In this paper, we explore the optimization of neural networks (NNs) for audio-based multimedia event classification, and discuss some insights towards more effectively using this paradigm for MED. We explore different architectures, in terms of number of layers and number of neurons. We also assess the performance impact of pre-training with Restricted Boltzmann Machines (RBMs) in contrast with random initialization, and explore the effect of varying the context window for the input to the NNs. Lastly, we compare the performance of Hidden Markov Models (HMMs) with a discriminative classifier for the event classification. We used the publicly available event-annotated YLI-MED dataset. Our results showed a performance improvement of more than 6% absolute accuracy compared to the latest results reported in the literature. Interestingly, these results were obtained with a single-layer neural network with random initialization, suggesting that standard approaches with deep learning and RBM pre-training are not fully adequate to address the high-level video event-classification task.
spoken language technology workshop | 2016
Mirco Ravanelli; Philemon Brakel; Maurizio Omologo; Yoshua Bengio
Improving distant speech recognition is a crucial step towards flexible human-machine interfaces. Current technology, however, still exhibits a lack of robustness, especially when adverse acoustic conditions are met. Despite the significant progress made in the last years on both speech enhancement and speech recognition, one potential limitation of state-of-the-art technology lies in composing modules that are not well matched because they are not trained jointly.
international conference on acoustics, speech, and signal processing | 2015
Erich Zwyssig; Mirco Ravanelli; Piergiorgio Svaizer; Maurizio Omologo
This paper describes a new corpus of multi-channel audio data designed to study and develop distant-speech recognition systems able to cope with known interfering sounds propagating in an environment. The corpus consists of both real and simulated signals and of a corresponding detailed annotation. An extensive set of speech recognition experiments was conducted using three different Acoustic Echo Cancellation (AEC) techniques to establish baseline results for future reference. The AEC techniques were applied both to single distant microphone input signals and beamformed signals generated using two state-of-the-art beamforming techniques. We show that the speech recognition performance using the different techniques is comparable for both the simulated and real data, demonstrating the usefulness of this corpus for speech research. We also show that a significant improvement in speech recognition performance can be obtained by combining state-of-the-art AEC and beamforming techniques, compared to using a single distant microphone input.
international symposium on chinese spoken language processing | 2014
Mirco Ravanelli; Van Hai Do; Adam Janin
To improve speech recognition performance, a combination between TANDEM and bottleneck Deep Neural Networks (DNN) is investigated. In particular, exploiting a feature combination performed by means of a multi-stream hierarchical processing, we show a performance improvement by combining the same input features processed by different neural networks. The experiments are based on the spontaneous telephone recordings of the Cantonese IARPA Babel corpus using both standard MFCCs and Gabor as input features.
Speech Communication | 2018
Mirco Ravanelli; Maurizio Omologo
Distant speech recognition is being revolutionized by deep learning, that has contributed to significantly outperform previous HMM-GMM systems. A key aspect behind the rapid rise and success of DNNs is their ability to better manage large time contexts. With this regard, asymmetric context windows that embed more past than future frames have been recently used with feed-forward neural networks. This context configuration turns out to be useful not only to address low-latency speech recognition, but also to boost the recognition performance under reverberant conditions. This paper investigates on the mechanisms occurring inside DNNs, which lead to an effective application of asymmetric this http URL particular, we propose a novel method for automatic context window composition based on a gradient analysis. The experiments, performed with different acoustic environments, features, DNN architectures, microphone settings, and recognition tasks show that our simple and efficient strategy leads to a less redundant frame configuration, which makes DNN training more effective in reverberant scenarios.
language resources and evaluation | 2014
Luca Cristoforetti; Mirco Ravanelli; Maurizio Omologo; Alessandro Sosi; Alberto Abad; Martin Hagmueller; Petros Maragos