Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Xavier Anguera is active.

Publication


Featured research published by Xavier Anguera.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2012

The Spoken Web Search Task at MediaEval 2011

Florian Metze; Nitendra Rajput; Xavier Anguera; Marelie H. Davel; Guillaume Gravier; Charl Johannes van Heerden; Gautam Varma Mantena; Armando Muscariello; Kishore Prahallad; Igor Szöke; Javier Tejedor

In this paper, we describe the “Spoken Web Search” Task, which was held as part of the 2011 MediaEval benchmark campaign. The purpose of this task was to perform audio search with audio input in four languages, with very few resources being available in each language. The data was taken from “spoken web” material collected over mobile phone connections by IBM India. We present results from several independent systems, developed by five teams and using different approaches, compare them, and provide analysis and directions for future research.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2013

The Spoken Web Search Task at MediaEval 2012

Florian Metze; Xavier Anguera; Etienne Barnard; Marelie H. Davel; Guillaume Gravier

In this paper we describe the systems presented by Telefonica Research to the Spoken Web Search task of the MediaEval 2012 evaluation. This year we proposed two systems. The first one consists of a segmental DTW system, similar to the one presented in 2011, with a few improvements. The second system also uses a DTW-like approach but allows all reference files to be searched at once using an information retrieval approach.


International Conference on Multimedia and Expo (ICME) | 2012

MASK: Robust Local Features for Audio Fingerprinting

Xavier Anguera; Antonio Garzon; Tomasz Adamek

This paper presents a novel local audio fingerprint called MASK (Masked Audio Spectral Keypoints) that can effectively encode the acoustic information present in audio documents and discriminate between transformed versions of the same acoustic documents and other unrelated documents. The fingerprint has been designed to be resilient to strong transformations of the original signal and to be usable for generic audio, including music and speech. Its main characteristics are its locality, binary encoding, robustness and compactness. The proposed audio fingerprint encodes the local spectral energies around salient points selected among the main spectral peaks in a given signal. Such encoding is done by centering on each point a carefully designed mask defining regions of the spectrogram whose average energies are compared with each other. From each comparison we obtain a single bit, depending on which region has more energy, and group all bits into a final binary fingerprint. In addition, the fingerprint also stores the frequency of each peak, quantized using a Mel filterbank. The length of the fingerprint is solely defined by the number of compared regions being used, and can be adapted to the requirements of any particular application. In addition, the number of salient points encoded per second can also be easily modified. In the experimental section we show the suitability of such a fingerprint for finding matching segments on the NIST-TRECVID benchmarking evaluation datasets, comparing it with a well-known fingerprint and obtaining up to 26% relative improvement in NDCR score.
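
To make the mask-and-compare encoding concrete, the following is a minimal sketch of the idea, assuming a simplified peak-picking step and a small four-region layout around each peak; the actual mask geometry, peak selection and Mel quantization are defined in the paper, not here.

```python
# Minimal sketch of MASK-style local fingerprinting (illustrative only).
# The region layout, peak picking and sizes are simplified assumptions,
# not the exact mask described in the paper.
import numpy as np

def mask_fingerprint(spectrogram, num_peaks=10, half=2):
    """Encode each salient spectral peak as a small binary descriptor
    by comparing average energies of regions placed around the peak."""
    n_freq, n_frames = spectrogram.shape
    # Pick the strongest bins as "salient points" (simplified peak picking).
    flat = np.argsort(spectrogram, axis=None)[::-1][:num_peaks]
    peaks = [np.unravel_index(i, spectrogram.shape) for i in flat]

    fingerprints = []
    for f, t in peaks:
        # Skip peaks too close to the border of the spectrogram.
        if f < half or f >= n_freq - half or t < half or t >= n_frames - half:
            continue
        # Four regions around the peak: above, below, past and future frames.
        regions = [
            spectrogram[f + 1:f + 1 + half, t],
            spectrogram[f - half:f, t],
            spectrogram[f, t - half:t],
            spectrogram[f, t + 1:t + 1 + half],
        ]
        energies = [r.mean() for r in regions]
        # One bit per pairwise comparison of region energies.
        bits = [int(energies[i] > energies[j])
                for i in range(len(energies))
                for j in range(i + 1, len(energies))]
        # The paper additionally quantizes the peak frequency with a Mel
        # filterbank; here we simply keep the raw frequency bin.
        fingerprints.append((f, t, bits))
    return fingerprints

# Toy usage with a random "spectrogram".
rng = np.random.default_rng(0)
spec = rng.random((64, 200))
print(mask_fingerprint(spec)[:2])
```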


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2010

Partial sequence matching using an Unbounded Dynamic Time Warping algorithm

Xavier Anguera; Robert Macrae; Nuria Oliver

Before the advent of Hidden Markov Model (HMM)-based speech recognition, many speech applications were built using pattern matching algorithms like the Dynamic Time Warping (DTW) algorithm, which are generally robust to noise and easy to implement. The standard DTW algorithm usually suffers from a lack of flexibility on start-end matching points and has high computational costs. Although some DTW-based algorithms have been proposed over the years to solve either one of these problems, none is able to discover multiple alignment paths with low computational costs. In this paper, we present an “unbounded” version of DTW (U-DTW in short) that is computationally lightweight and allows for total flexibility on where the matching segment occurs. Results on a word matching database show very competitive performance, both in accuracy and processing time, compared to existing alternatives.
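
As an illustration of relaxing the start-end constraint, the sketch below uses a generic subsequence-style DTW recursion in which the match may start and end at any reference frame; it is not the exact U-DTW algorithm of the paper, only a reference point for the kind of flexibility being discussed.

```python
# Sketch of DTW with relaxed start/end points in the reference.
# Generic recursion for illustration; NOT the exact U-DTW algorithm.
import numpy as np

def relaxed_dtw(query, reference):
    """Return the best alignment cost of `query` against any contiguous
    region of `reference` (rows: query frames, cols: reference frames)."""
    Q, R = len(query), len(reference)
    # Frame-to-frame local distances (Euclidean here, for illustration).
    dist = np.linalg.norm(query[:, None, :] - reference[None, :, :], axis=-1)

    D = np.full((Q, R), np.inf)
    D[0, :] = dist[0, :]              # the match may start at any reference frame
    for i in range(1, Q):
        for j in range(R):
            best_prev = D[i - 1, j]
            if j > 0:
                best_prev = min(best_prev, D[i - 1, j - 1], D[i, j - 1])
            D[i, j] = dist[i, j] + best_prev
    # ... and it may end at any reference frame as well.
    end = int(np.argmin(D[-1, :]))
    return D[-1, end], end

# Toy usage with random MFCC-like feature sequences.
rng = np.random.default_rng(1)
q, r = rng.random((20, 13)), rng.random((200, 13))
print(relaxed_dtw(q, r))
```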


International Conference on Multimedia and Expo (ICME) | 2013

Memory efficient subsequence DTW for Query-by-Example Spoken Term Detection

Xavier Anguera; Miquel Ferrarons

In this paper we propose a fast and memory efficient Dynamic Time Warping (MES-DTW) algorithm for the task of Query-by-Example Spoken Term Detection (QbE-STD). The proposed algorithm is based on the subsequence-DTW (S-DTW) algorithm, which allows the search for small spoken queries within a much bigger search collection of spoken documents by considering fixed start-end points in the query and discovering optimal matching subsequences along the search collection. The proposed algorithm applies some modifications to S-DTW that make it better suited for the QbE-STD task, including a way to perform the matching with virtually no system memory, which is optimal when querying large-scale databases. We also describe the system used to perform QbE-STD, including an energy-based quantification for speech/non-speech detection and an overlap detector for putative matches. We test the proposed system using the MediaEval 2012 Spoken Web Search dataset and show that, in addition to the memory savings, the proposed algorithm brings an advantage in terms of matching accuracy (up to 0.235 absolute MTWV increase) and speed (around 25% faster) in comparison to the original S-DTW.
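
The memory saving rests on the observation that the cost matrix can be filled one reference frame at a time, keeping only the previous column. The sketch below illustrates that principle under simplified path constraints and without the normalization used in MES-DTW.

```python
# Sketch of the low-memory idea behind subsequence DTW: the reference is
# streamed frame by frame and only the previous column of the cost matrix
# is kept, so memory grows with the query length, not the search
# collection. Minimal illustration of the principle, not the exact
# MES-DTW normalization and path constraints from the paper.
import numpy as np

def streaming_subsequence_dtw(query, reference_frames):
    """Yield, per reference frame, the best cost of matching the whole
    query ending at that frame, using O(len(query)) memory."""
    Q = len(query)
    prev = np.full(Q, np.inf)         # previous column of the cost matrix
    for j, frame in enumerate(reference_frames):
        dist = np.linalg.norm(query - frame, axis=1)   # local distances
        cur = np.empty(Q)
        cur[0] = dist[0]              # match may start at any reference frame
        for i in range(1, Q):
            cur[i] = dist[i] + min(prev[i], prev[i - 1], cur[i - 1])
        yield j, cur[-1]              # cost of a match ending at frame j
        prev = cur

# Toy usage: find the reference frame where the query matches best.
rng = np.random.default_rng(2)
query = rng.random((15, 13))
reference = rng.random((300, 13))
costs = list(streaming_subsequence_dtw(query, reference))
print(min(costs, key=lambda jc: jc[1]))
```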


Multimedia Information Retrieval | 2008

Multimodal photo annotation and retrieval on a mobile phone

Xavier Anguera; JieJun Xu; Nuria Oliver

Mobile phones are becoming multimedia devices. It is common to observe users capturing photos and videos on their mobile phones on a regular basis. As the amount of digital multimedia content expands, it becomes increasingly difficult to find specific images in the device. In this paper, we present a multimodal and mobile image retrieval prototype named MAMI (Multimodal Automatic Mobile Indexing). It allows users to annotate, index and search for digital photos on their phones via speech or image input. Speech annotations can be added at the time of capturing photos or at a later time. Additional metadata such as location, user identification, date and time of capture is stored in the phone automatically. A key advantage of MAMI is that it is implemented as a stand-alone application which runs in real-time on the phone. Therefore, users can search for photos in their personal archives without the need for connectivity to a server. In this paper, we compare multimodal and monomodal approaches for image retrieval and we propose a novel algorithm named the Multimodal Redundancy Reduction (MR2) Algorithm. In addition to describing the proposed approaches in detail, we present our experimental results and compare the retrieval accuracy of monomodal versus multimodal algorithms.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2012

Speaker independent discriminant feature extraction for acoustic pattern-matching

Xavier Anguera

Acoustic pattern-matching algorithms have recently become prominent again for automatically processing speech utterances where no prior knowledge of the spoken language is required. Applications of such technology include, but are not limited to, query-by-example search, spoken term detection and automatic word discovery. Obtaining content-aware acoustic features as independent as possible from speaker and acoustic environment variations is a key step in these algorithms. Currently, GMM posteriorgrams are found to outperform the standard MFCC features even though they were not designed to optimize the discrimination between acoustic classes. In this paper we combine the K-means clustering algorithm with the GMM posteriorgram front-end to obtain more discriminant features. Results on a query-by-example task show that the proposed approaches outperform standard MFCC features by 7.8% absolute P@N and GMM-based posteriorgram features by 3.7% absolute P@N when using a 64-dimensional feature vector.
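
A GMM posteriorgram simply replaces each spectral frame with its posterior distribution over the components of an unsupervised GMM. The sketch below shows that baseline computation (using scikit-learn purely for illustration); the K-means-based refinement proposed in the paper is not reproduced here.

```python
# Sketch of a GMM posteriorgram front-end: each frame of spectral features
# is mapped to the vector of posterior probabilities over the GMM
# components. The K-means refinement discussed in the paper is not shown.
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_posteriorgram(train_features, test_features, n_components=64):
    """Fit an unsupervised GMM on training frames and return, for every
    test frame, its posterior distribution over the GMM components."""
    gmm = GaussianMixture(n_components=n_components,
                          covariance_type="diag",
                          random_state=0).fit(train_features)
    # Shape: (num_test_frames, n_components); each row sums to 1.
    return gmm.predict_proba(test_features)

# Toy usage with random MFCC-like frames.
rng = np.random.default_rng(3)
train = rng.random((2000, 13))
test = rng.random((50, 13))
post = gmm_posteriorgram(train, test, n_components=8)
print(post.shape, post[0].sum())
```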


Computer Speech & Language | 2014

Language independent search in MediaEval's Spoken Web Search task

Florian Metze; Xavier Anguera; Etienne Barnard; Marelie H. Davel; Guillaume Gravier

In this paper, we describe several approaches to language-independent spoken term detection and compare their performance on a common task, namely “Spoken Web Search”. The goal of this part of the MediaEval initiative is to perform low-resource language-independent audio search using audio as input. The data was taken from “spoken web” material collected over mobile phone connections by IBM India as well as from the LWAZI corpus of African languages. As part of the 2011 and 2012 MediaEval benchmark campaigns, a number of diverse systems were implemented by independent teams, and submitted to the “Spoken Web Search” task. This paper presents the 2011 and 2012 results, and compares the relative merits and weaknesses of approaches developed by participants, providing analysis and directions for future research, in order to improve voice access to spoken information in low resource settings.


Multimodal Technologies for Perception of Humans | 2008

Speaker Diarization for Conference Room: The UPC RT07s Evaluation System

Jordi Luque; Xavier Anguera; Andrey Temko; Javier Hernando

In this paper the authors present the UPC speaker diarization system for the NIST Rich Transcription Evaluation (RT07s) [1] conducted on the conference environment. The presented system is based on the ICSI RT06s system, which employs agglomerative clustering with a modified Bayesian Information Criterion (BIC) measure to decide which pairs of clusters to merge and to determine when to stop merging clusters [2]. This is the first participation of the UPC in the RT Speaker Diarization Evaluation, and the purpose of this work has been the consolidation of a baseline system which can be used in the future for further research in the field of diarization. We have introduced, as prior modules before the diarization system, a Speech/Non-Speech detection module based on a Support Vector Machine from UPC and a Wiener filtering from an implementation of the QIO front-end. In the speech parameterization, a Frequency Filtering (FF) of the filter-bank energies is applied instead of the classical Discrete Cosine Transform in the Mel-Cepstrum analysis. In addition, we introduce small changes in the complexity selection algorithm and a new post-processing technique which processes the shortest clusters at the end of each Viterbi segmentation.
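
The merge decision in such agglomerative systems is typically a delta-BIC test between candidate cluster pairs. The sketch below shows the classical single-Gaussian formulation for illustration; the system described above uses an ICSI-style modified criterion over GMMs, so the exact terms differ.

```python
# Sketch of the delta-BIC test used in agglomerative speaker diarization:
# two clusters are merged when modeling their pooled frames with a single
# Gaussian is not worse (delta BIC <= 0) than modeling them separately.
# Classical single-Gaussian formulation, shown only for illustration.
import numpy as np

def delta_bic(x1, x2, lam=1.0):
    """x1, x2: (frames, dims) feature matrices of the two clusters."""
    x = np.vstack([x1, x2])
    n, n1, n2, d = len(x), len(x1), len(x2), x.shape[1]

    def logdet_cov(data):
        # Log-determinant of the sample covariance of the cluster.
        return np.linalg.slogdet(np.cov(data, rowvar=False))[1]

    # Penalty for the extra parameters of the two-model hypothesis.
    penalty = 0.5 * (d + 0.5 * d * (d + 1)) * np.log(n)
    return (0.5 * n * logdet_cov(x)
            - 0.5 * n1 * logdet_cov(x1)
            - 0.5 * n2 * logdet_cov(x2)
            - lam * penalty)

# Toy usage: frames drawn from the same distribution should tend to merge.
rng = np.random.default_rng(4)
a, b = rng.normal(size=(500, 12)), rng.normal(size=(500, 12))
print("merge" if delta_bic(a, b) <= 0 else "keep separate")
```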


Proceedings of the First SIGMM Workshop on Social Media | 2009

The role of tags and image aesthetics in social image search

Pere Obrador; Xavier Anguera; Rodrigo de Oliveira; Nuria Oliver

In recent years, there has been a proliferation of consumer digital photographs taken and stored in both personal and online repositories. As the amount of user-generated digital photos increases, there is a growing need for efficient ways to search for relevant images to be shared with friends and family. Text-query based search approaches rely heavily on the similarity between the input textual query and the tags added by users to the digital content. Unfortunately, text-query based search results might include a large number of relevant photos, all of them containing very similar tags, but with varying levels of image quality and aesthetic appeal. In this paper we introduce an image re-ranking algorithm that takes into account the aesthetic appeal of the images retrieved by a consumer image sharing site search engine (Google's Picasa Web Albums). In order to do so, we extend a state-of-the-art image aesthetic appeal algorithm by incorporating a set of features aimed at consumer photographs. The results of a controlled user study with 37 participants reveal that image aesthetics play a varying role in the selected images depending on the query type and on the user preferences.
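
As a rough illustration of aesthetics-aware re-ranking, the sketch below re-orders tag-matched photos by a precomputed aesthetic score; the aesthetics model and the exact way the paper combines relevance and aesthetics are not reproduced here, so the helper names and the combination rule are assumptions.

```python
# Minimal sketch of aesthetics-aware re-ranking of text-query results:
# photos matching the query tags are re-ordered by an aesthetic score.
# The score is assumed to be precomputed by an external aesthetics model.
from dataclasses import dataclass

@dataclass
class Photo:
    name: str
    tags: set
    aesthetic_score: float   # assumed precomputed, range is illustrative

def rerank(photos, query_tags):
    """Keep tag-relevant photos, then sort them by aesthetic appeal."""
    relevant = [p for p in photos if p.tags & query_tags]
    return sorted(relevant, key=lambda p: p.aesthetic_score, reverse=True)

photos = [
    Photo("beach1.jpg", {"beach", "sunset"}, 0.42),
    Photo("beach2.jpg", {"beach", "family"}, 0.87),
    Photo("city1.jpg", {"city", "night"}, 0.95),
]
print([p.name for p in rerank(photos, {"beach"})])
```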

Collaboration


Dive into Xavier Anguera's collaborations.

Top Co-Authors

Florian Metze, Carnegie Mellon University
Jordi Luque, Polytechnic University of Catalonia
Igor Szöke, Brno University of Technology
Ciro Gracia, Pompeu Fabra University
Andi Buzo, Politehnica University of Bucharest