Dusan Macho
Polytechnic University of Catalonia
Publications
Featured research published by Dusan Macho.
CLEaR | 2006
Andrey Temko; Robert G. Malkin; Christian Zieger; Dusan Macho; Climent Nadeu; Maurizio Omologo
In this paper, we present the results of the Acoustic Event Detection (AED) and Classification (AEC) evaluations carried out in February 2006 by the three participating partners from the CHIL project. The primary evaluation task was AED on the testing portions of the isolated sound databases and seminar recordings produced in CHIL. Additionally, a secondary AEC evaluation task was designed using only the isolated sound databases. The set of meeting-room acoustic event classes and the metrics were agreed upon by the three partners, and ELDA was in charge of the scoring task. The various systems developed for the AED and AEC tasks and their results are presented.
Pattern Recognition | 2008
Andrey Temko; Dusan Macho; Climent Nadeu
Acoustic event classification may help to describe acoustic scenes and contribute to improving the robustness of speech technologies. In this work, fusion of different information sources with the fuzzy integral (FI), and the associated fuzzy measure (FM), is applied to the problem of classifying a small set of highly confusable human non-speech sounds. As the FI is a meaningful formalism for combining classifier outputs that can capture interactions among the various sources of information, it shows significantly better performance in our experiments than any single classifier entering the FI fusion module. In fact, the FI decision-level fusion approach shows results comparable to the high-performing SVM feature-level fusion, and thus seems to be a good choice when feature-level fusion is not an option. We have also observed that the importance and the degree of interaction among the various feature types given by the FM can be used for feature selection and provide valuable insight into the problem.
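For readers unfamiliar with the fuzzy integral, the following is a minimal sketch of decision-level fusion with the Choquet integral, one standard form of FI. The three classifier names and all fuzzy-measure values are illustrative assumptions, not taken from the paper.

```python
def choquet_integral(scores, measure):
    """Fuse per-source confidence scores with the Choquet integral.

    scores:  dict mapping source name -> confidence in [0, 1]
    measure: dict mapping frozenset of source names -> fuzzy measure g(A),
             with g(empty set) = 0 and g(all sources) = 1
    """
    # Sort sources by confidence, highest first
    ordered = sorted(scores, key=scores.get, reverse=True)
    fused, prev_g = 0.0, 0.0
    subset = set()
    for src in ordered:
        subset.add(src)
        g = measure[frozenset(subset)]
        fused += scores[src] * (g - prev_g)   # weight = marginal measure gain
        prev_g = g
    return fused

# Illustrative (monotone) fuzzy measure over three hypothetical classifiers
g = {
    frozenset(): 0.0,
    frozenset({"svm"}): 0.4, frozenset({"gmm"}): 0.3, frozenset({"knn"}): 0.2,
    frozenset({"svm", "gmm"}): 0.8,      # > 0.4 + 0.3: positive interaction
    frozenset({"svm", "knn"}): 0.5, frozenset({"gmm", "knn"}): 0.4,
    frozenset({"svm", "gmm", "knn"}): 1.0,
}
print(choquet_integral({"svm": 0.7, "gmm": 0.6, "knn": 0.2}, g))
```

Note how the measure value 0.8 for the SVM+GMM pair exceeds the sum of the two singletons: exactly the kind of interaction between sources that a simple weighted average of classifier outputs cannot express.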
international conference on acoustics, speech, and signal processing | 2006
Pere Pujol; Dusan Macho; Climent Nadeu
This work aims at gaining insight into the mean and variance normalization (MVN) technique, which is commonly used to increase the robustness of speech recognition features. Several versions of MVN are empirically investigated, and the factors affecting their performance are considered. The reported experimental work with real-world speech data (Speecon) focuses in particular on the recursive updating of the MVN parameters, paying attention to the algorithmic delay involved. First, we propose decoupling the look-ahead factor (which determines the delay) from the initial estimation of the mean and variance, and show that the latter is a key factor for recognition performance. Then, several kinds of initial estimation that make sense in different application environments are tested, and their performance is compared.
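A minimal sketch of the recursive MVN idea described above, assuming an exponential-forgetting update and an initial estimate taken from a look-ahead window; the forgetting factor, window length, and feature dimensionality are illustrative choices, not the paper's settings.

```python
import numpy as np

def recursive_mvn(frames, lookahead=50, alpha=0.995):
    """Mean/variance-normalize feature frames (T x D) with recursive updates.

    The initial mean/variance are estimated from the first `lookahead`
    frames (the algorithmic delay); afterwards both statistics are
    updated recursively with forgetting factor `alpha`.
    """
    init = frames[:lookahead]
    mean = init.mean(axis=0)
    var = init.var(axis=0) + 1e-8
    out = np.empty_like(frames, dtype=float)
    for t, x in enumerate(frames):
        mean = alpha * mean + (1.0 - alpha) * x
        var = alpha * var + (1.0 - alpha) * (x - mean) ** 2
        out[t] = (x - mean) / np.sqrt(var)
    return out

# Example: 13-dimensional MFCC-like features, 1000 frames
feats = np.random.randn(1000, 13) * 3.0 + 5.0
print(recursive_mvn(feats).mean(axis=0))   # roughly zero after normalization
```

The decoupling the paper argues for is visible here: `lookahead` fixes the delay, while the quality of the statistics computed from that window is a separate design choice.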
international conference on acoustics, speech, and signal processing | 2007
Andrey Temko; Dusan Macho; Climent Nadeu
Speech activity detection (SAD) is a key objective in speech-related technologies. In this work, an enhanced version of the training stage of a SAD system based on a support vector machine (SVM) classifier is presented, and its performance is tested on the RT05 and RT06 evaluation tasks. A fast data-reduction algorithm based on the proximal SVM has been developed and, furthermore, the specific characteristics of the metric used in the NIST SAD evaluation have been taken into account during training. Tested on the RT06 data, the resulting SVM SAD system has shown better scores than the best GMM-based system developed by the authors and submitted to the past RT06 evaluation.
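The proximal-SVM data reduction is not reproduced here; the sketch below only illustrates how an asymmetric evaluation metric can be reflected in SVM training via class weights, using scikit-learn on synthetic frame-level features. The class weights and data are made up for illustration.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)

# Synthetic frame-level features: class 1 = speech, class 0 = non-speech
X = np.vstack([rng.normal(0.0, 1.0, (500, 13)),    # non-speech
               rng.normal(1.5, 1.0, (500, 13))])   # speech
y = np.repeat([0, 1], 500)

# If the evaluation metric penalizes missed speech more than false alarms,
# that asymmetry can be pushed into training via per-class weights.
svm = SVC(kernel="rbf", class_weight={0: 1.0, 1: 2.0})
svm.fit(X, y)
print("training accuracy:", svm.score(X, y))
```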
international conference on acoustics, speech, and signal processing | 2005
Jaume Padrell; Dusan Macho; Climent Nadeu
Speech detection becomes more complicated when performed in noisy and reverberant environments such as smart rooms. In this work, we design a robust speech activity detection (SAD) algorithm and evaluate it on distant-microphone signals acquired in a smart-room-like environment. The algorithm is based on a measure obtained by applying linear discriminant analysis (LDA) to frequency filtering (FF) features. A decision-tree-based speech/non-speech classifier is trained on a time sequence of this measure. The proposed SAD system is evaluated together with other SAD systems (GSM SAD and the ETSI advanced front-end standard SAD) using a set of general SAD metrics as well as ASR accuracy as a metric. The proposed SAD algorithm shows better average results than the other tested SAD systems for both the set of general SAD metrics and ASR performance.
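A hedged sketch of the pipeline shape described above (LDA projection to a one-dimensional measure, a time window of that measure, and a decision-tree classifier), using scikit-learn and synthetic stand-ins for the FF features; the context-window length and tree depth are assumptions.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(1)

# Stand-in for frequency-filtered (FF) features: T frames x D dims
X = np.vstack([rng.normal(0.0, 1.0, (400, 16)),    # non-speech frames
               rng.normal(1.0, 1.0, (400, 16))])   # speech frames
y = np.repeat([0, 1], 400)

# 1) LDA reduces each frame to a single discriminative measure
measure = LinearDiscriminantAnalysis(n_components=1).fit_transform(X, y).ravel()

# 2) Stack a short time window of the measure around each frame
ctx = 5  # frames of context on each side (illustrative)
pad = np.pad(measure, ctx, mode="edge")
windows = np.stack([pad[t:t + 2 * ctx + 1] for t in range(len(measure))])

# 3) Train the speech/non-speech decision tree on the windowed measure
tree = DecisionTreeClassifier(max_depth=5).fit(windows, y)
print("training accuracy:", tree.score(windows, y))
```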
ubiquitous computing | 2009
Joachim Neumann; Josep R. Casas; Dusan Macho; Javier Ruiz Hidalgo
At the Technical University of Catalonia (UPC), a smart room has been equipped with 85 microphones and 8 cameras. This paper describes the setup of the sensors, gives an overview of the underlying hardware and software infrastructure, and indicates possibilities for high- and low-level multi-modal interaction. An example of the use of the information collected from the distributed sensor network is explained in detail: the system supports a group of students who have to solve a problem related to a lab assignment.
CLEaR | 2006
Jordi Luque; Ramon Morros; Ainara Garde; Jan Anguita; Mireia Farrús; Dusan Macho; Ferran Marqués; Claudi Martinez; Verónica Vilaplana; Javier Hernando
In this paper, we address the modality integration issue using the example of a smart-room environment, aiming to enable person identification by combining acoustic features and 2D face images. First we introduce the monomodal audio and video identification techniques, and then we present the use of combined input speech and face images for person identification. The various sensory modalities, speech and faces, are processed both individually and jointly. It is shown that the multimodal approach results in improved performance in the identification of the participants.
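As a rough illustration of how speech and face scores might be combined at the decision level, here is a minimal score-level fusion sketch: per-modality min-max normalization followed by a weighted sum. The fusion weight and all scores are invented for the example; the paper's actual combination scheme may differ.

```python
import numpy as np

def fuse_scores(audio_scores, face_scores, w_audio=0.6):
    """Combine per-identity scores from two modalities by weighted sum.

    Each score vector is min-max normalized per modality first, so that
    the two modalities are on a comparable scale before fusion.
    """
    def norm(s):
        s = np.asarray(s, dtype=float)
        return (s - s.min()) / (s.max() - s.min() + 1e-12)
    return w_audio * norm(audio_scores) + (1.0 - w_audio) * norm(face_scores)

# Illustrative scores for 4 enrolled identities
audio = [2.1, 5.4, 3.3, 1.0]   # e.g. speaker log-likelihoods
face  = [0.2, 0.5, 0.9, 0.1]   # e.g. face-matcher similarities
fused = fuse_scores(audio, face)
print("identified person:", int(np.argmax(fused)))
```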
international conference on acoustics, speech, and signal processing | 2001
Dusan Macho; Yan Ming Cheng
We introduce a new concept for advancing the noise robustness of a speech recognition front-end. The presented method, called SNR-dependent waveform processing (SWP), exploits the SNR variability within a speech period by enhancing the high-SNR portion and attenuating the low-SNR portion of the period in the waveform time domain. In this way, the overall SNR of noisy speech is increased and, at the same time, the periodicity of voiced speech is enhanced. This approach differs significantly from the well-known speech enhancement techniques, which are mostly frequency-domain based, and we use it in this work as a technique complementary to them. In tests with SWP, we obtain significant recognition performance gains on both clean and noisy speech using the AURORA 2 database and the recognition system defined by ETSI for the robust front-end standardization process. Moreover, the presented algorithm is very simple and also attractive in terms of computational load.
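The exact SWP algorithm is not reproduced here, but the following sketch conveys the core idea under stated assumptions: split a waveform frame into sub-segments, estimate each segment's SNR against a crude noise-floor estimate, and boost high-SNR segments while attenuating low-SNR ones. Segment count, gains, and the thresholding rule are all illustrative.

```python
import numpy as np

def swp_frame(frame, n_seg=8, g_hi=1.3, g_lo=0.7):
    """Reweight sub-segments of one waveform frame by their local SNR.

    The noise floor is taken as the minimum sub-segment energy (a crude
    estimate); segments above the median local SNR are boosted, the rest
    attenuated. Gains and segment count are illustrative choices.
    """
    segs = np.array_split(frame.astype(float), n_seg)
    energies = np.array([np.mean(s ** 2) for s in segs])
    floor = energies.min() + 1e-12
    snr = 10.0 * np.log10(energies / floor)          # local SNR in dB
    gains = np.where(snr >= np.median(snr), g_hi, g_lo)
    return np.concatenate([g * s for g, s in zip(gains, segs)])

# Example: one 200-sample frame of a noisy periodic signal
t = np.arange(200)
frame = np.sin(2 * np.pi * t / 50) + 0.3 * np.random.randn(200)
print(swp_frame(frame).shape)
```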
CLEaR | 2006
Alberto Abad; Cristian Canton-Ferrer; Carlos Segura; José Luis Landabaso; Dusan Macho; Josep R. Casas; Javier Hernando; Montse Pardàs; Climent Nadeu
Reliable measures of person positions are needed for computational perception of human activities taking place in a smart-room environment. In this work, we present the person tracking systems developed at UPC for the audio, video, and audio-video modalities in the context of the research activities of the EU-funded CHIL project. The aim of the designed systems, and particularly of the new contributions proposed, is to perform robustly in both single- and multi-person localization tasks, independently of the environmental conditions. Besides the technology description, experimental results obtained for the CLEAR evaluation workshop are also reported.
international conference on acoustics, speech, and signal processing | 2011
Woojay Jeon; Changxue Ma; Dusan Macho
We propose a novel utterance comparison model based on probability theory and factor analysis that computes the likelihood of two speech utterances originating from the same speaker. The model depends only on a set of statistics extracted from each utterance and can efficiently compare utterances using these statistics, without requiring the indefinite storage of speech features. We apply the model as a distance metric for speaker clustering on the CALLHOME telephone conversation corpus, achieving results competitive with three other known similarity measures: the Generalized Likelihood Ratio (GLR), the Cross-Likelihood Ratio, and the eigenvoice distance.
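The paper's factor-analysis model itself is not reproduced here; as context, the sketch below implements the Generalized Likelihood Ratio baseline it is compared against, in its standard single full-covariance-Gaussian form (a textbook formulation, not code from the paper).

```python
import numpy as np

def glr_distance(x, y):
    """GLR distance between two utterances (frames x dims), modeling each
    side and their union with one full-covariance Gaussian.

    For ML-fitted Gaussians the data log-likelihood reduces, up to terms
    that cancel, to -N/2 * log|Sigma|, giving the closed form below
    (np.cov's N-1 normalization is close enough for a sketch).
    """
    def logdet_cov(z):
        sign, logdet = np.linalg.slogdet(np.cov(z, rowvar=False))
        return logdet
    nx, ny = len(x), len(y)
    z = np.vstack([x, y])
    return 0.5 * ((nx + ny) * logdet_cov(z)
                  - nx * logdet_cov(x) - ny * logdet_cov(y))

# Same-speaker-like vs. different-speaker-like toy utterances
rng = np.random.default_rng(2)
a = rng.normal(0.0, 1.0, (300, 12))
b = rng.normal(0.0, 1.0, (300, 12))   # similar distribution to a
c = rng.normal(2.0, 1.0, (300, 12))   # shifted distribution
print(glr_distance(a, b), "<", glr_distance(a, c))
```

Note that, unlike the proposed model's fixed-size per-utterance statistics, the GLR requires fitting a joint model to each candidate pair, which is part of what the paper's approach avoids.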