Publication


Featured research published by Annamaria Mesaros.


EURASIP Journal on Audio, Speech, and Music Processing | 2013

Context-dependent sound event detection

Toni Heittola; Annamaria Mesaros; Antti Eronen; Tuomas Virtanen

The work presented in this article studies how context information can be used in the automatic sound event detection process, and how the detection system can benefit from such information. Humans use context information to make more accurate predictions about sound events and to rule out unlikely events given the context. We propose a similar use of context information in the automatic sound event detection process. The proposed approach is composed of two stages: an automatic context recognition stage and a sound event detection stage. Contexts are modeled using Gaussian mixture models, and sound events are modeled using three-state left-to-right hidden Markov models. In the first stage, the audio context of the tested signal is recognized. Based on the recognized context, a context-specific set of sound event classes is selected for the sound event detection stage. The event detection stage also uses context-dependent acoustic models and count-based event priors. Two alternative event detection approaches are studied. In the first, a monophonic event sequence is output by detecting the most prominent sound event at each time instance using Viterbi decoding. The second approach introduces a new method for producing a polyphonic event sequence by detecting multiple overlapping sound events using multiple restricted Viterbi passes. A new metric is introduced to evaluate sound event detection performance at various levels of polyphony. It combines detection accuracy and coarse time-resolution error into a single measure, making the comparison of detection algorithms simpler. The two-step approach was found to improve the results substantially compared to the context-independent baseline system. At the block level, detection accuracy can be almost doubled by using the proposed context-dependent event detection.
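
As an illustration of the two-stage idea, the sketch below recognizes the context with per-context Gaussian mixture models and then restricts detection to that context's event classes. All data, context names, and event sets are invented stand-ins, and simple GMM scoring replaces the paper's full HMM-based detection stage.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# Training: one GMM per context over frame-level features (the paper uses
# MFCC-type features; random vectors stand in for them here).
context_features = {
    "street": rng.normal(0.0, 1.0, (500, 13)),
    "home":   rng.normal(3.0, 1.0, (500, 13)),
}
context_gmms = {
    name: GaussianMixture(n_components=4, random_state=0).fit(feats)
    for name, feats in context_features.items()
}

# Context-specific sets of sound event classes (illustrative only).
event_sets = {
    "street": ["car", "footsteps", "horn"],
    "home":   ["dishes", "speech", "door"],
}

# Stage 1: recognize the context of a test clip by average log-likelihood.
test_clip = rng.normal(2.8, 1.0, (200, 13))
context = max(context_gmms, key=lambda c: context_gmms[c].score(test_clip))

# Stage 2: only the recognized context's event classes (plus, in the paper,
# context-dependent acoustic models and count-based priors) enter detection.
print("recognized context:", context)
print("candidate events:  ", event_sets[context])
```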


European Signal Processing Conference | 2016

TUT database for acoustic scene classification and sound event detection

Annamaria Mesaros; Toni Heittola; Tuomas Virtanen

We introduce the TUT Acoustic Scenes 2016 database for environmental sound research, consisting of binaural recordings from 15 different acoustic environments. A subset of this database, called TUT Sound Events 2016, contains annotations for individual sound events, specifically created for sound event detection. TUT Sound Events 2016 consists of residential area and home environments, and is manually annotated to mark the onset, offset, and label of sound events. In this paper we present the recording and annotation procedure, the database content, a recommended cross-validation setup, and the performance of a supervised acoustic scene classification system and an event detection baseline system using mel-frequency cepstral coefficients and Gaussian mixture models. The database is publicly released to provide support for algorithm development and common ground for comparison of different techniques.
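
A rough sketch of the kind of MFCC + GMM baseline described above, assuming librosa and scikit-learn are available: one GMM per scene class is trained on frame-level MFCCs, and a test clip is assigned to the class with the highest log-likelihood. Synthetic tones stand in for the binaural recordings, and the feature and model settings are illustrative, not the paper's.

```python
import numpy as np
import librosa
from sklearn.mixture import GaussianMixture

sr = 22050
rng = np.random.default_rng(0)

def mfcc_frames(y, sr):
    # (n_frames, n_mfcc) frame-level features
    return librosa.feature.mfcc(y=y, sr=sr, n_mfcc=20).T

# Stand-in "recordings" for two scene classes: noisy tones.
t = np.linspace(0, 2.0, int(2.0 * sr), endpoint=False)
train = {
    "residential_area": np.sin(2 * np.pi * 220 * t) + 0.05 * rng.normal(size=t.size),
    "home":             np.sin(2 * np.pi * 440 * t) + 0.05 * rng.normal(size=t.size),
}

# One GMM per scene class over its frame-level MFCCs.
models = {
    scene: GaussianMixture(n_components=4, covariance_type="diag",
                           random_state=0).fit(mfcc_frames(y, sr))
    for scene, y in train.items()
}

# Classify a held-out clip by maximum average log-likelihood.
test = np.sin(2 * np.pi * 225 * t) + 0.05 * rng.normal(size=t.size)
X = mfcc_frames(test, sr)
print("predicted scene:", max(models, key=lambda s: models[s].score(X)))
```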


EURASIP Journal on Audio, Speech, and Music Processing | 2010

Automatic recognition of lyrics in singing

Annamaria Mesaros; Tuomas Virtanen

The paper considers the task of recognizing phonemes and words from a singing input using a phonetic hidden Markov model recognizer. The system is targeted at both monophonic singing and singing in polyphonic music. A vocal separation algorithm is applied to separate the singing from polyphonic music. Due to the lack of annotated singing databases, the recognizer is trained on speech and linearly adapted to singing. Global adaptation to singing is found to improve singing recognition performance. Further improvement is obtained by gender-specific adaptation. We also study adaptation with multiple base classes defined by either phonetic or acoustic similarity. We test phoneme-level and word-level n-gram language models. The phoneme language models are trained on the speech database text. The large-vocabulary word-level language model is trained on a database of textual lyrics. Two applications are presented. The recognizer is used to align textual lyrics to vocals in polyphonic music, obtaining an average error of 0.94 seconds for line-level alignment. A query-by-singing retrieval application based on the recognized words is also constructed; in 57% of the cases, the first retrieved song is the correct one.
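
The query-by-singing step can be illustrated with a toy retrieval function that ranks songs by the overlap between (possibly misrecognized) words and a lyrics database; both the database and the "recognized" words below are invented.

```python
# Toy lyrics "database"; real systems index full lyrics collections.
lyrics_db = {
    "song_a": "twinkle twinkle little star how i wonder what you are",
    "song_b": "row row row your boat gently down the stream",
}

def retrieve(recognized_words, db):
    """Rank songs by the fraction of recognized words found in each lyric."""
    query = set(recognized_words)
    scores = {
        song: len(query & set(text.split())) / max(len(query), 1)
        for song, text in db.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

# Even with recognition errors ("start" instead of "star"), the overlap on
# the correctly recognized words can still rank the right song first.
print(retrieve(["twinkle", "start", "how", "wonder"], lyrics_db))
```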


International Conference on Acoustics, Speech, and Signal Processing | 2015

Sound event detection in real life recordings using coupled matrix factorization of spectral representations and class activity annotations

Annamaria Mesaros; Toni Heittola; Onur Dikmen; Tuomas Virtanen

Methods for the detection of overlapping sound events in audio often involve matrix factorization approaches, assigning separated components to event classes. We present a method that bypasses the supervised construction of class models. The method learns the components as a non-negative dictionary in a coupled matrix factorization problem, where the spectral representation and the class activity annotation of the audio signal share the activation matrix. In testing, the dictionaries are used to estimate the class activations directly. For dealing with large amounts of training data, two methods are proposed for reducing the size of the dictionary. The methods were tested on a database of real-life recordings, and outperformed previous approaches by over 10%.
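
A hedged sketch of the coupled factorization: the spectrogram S and the class-activity matrix A share one activation matrix H, with S ≈ WsH and A ≈ WaH. Multiplicative updates for a squared-error cost are used here for simplicity; the paper's exact cost function and updates may differ, and all matrices are random stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)
F, C, K, T = 40, 3, 10, 200                   # freq bins, classes, components, frames

S = rng.random((F, T))                        # stand-in magnitude spectrogram
A = (rng.random((C, T)) > 0.7).astype(float)  # stand-in class-activity annotation

Ws, Wa, H = rng.random((F, K)), rng.random((C, K)), rng.random((K, T))
eps = 1e-9
for _ in range(200):                          # coupled training updates: H is shared
    H  *= (Ws.T @ S + Wa.T @ A) / (Ws.T @ Ws @ H + Wa.T @ Wa @ H + eps)
    Ws *= (S @ H.T) / (Ws @ H @ H.T + eps)
    Wa *= (A @ H.T) / (Wa @ H @ H.T + eps)

# Testing: with the dictionaries fixed, estimate activations from audio alone,
# then read the class activations directly through the annotation dictionary.
S_test = rng.random((F, T))
H_test = rng.random((K, T))
for _ in range(200):
    H_test *= (Ws.T @ S_test) / (Ws.T @ Ws @ H_test + eps)

class_activations = Wa @ H_test               # (C, T); threshold for detection
print(class_activations.shape)
```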


International Conference on Acoustics, Speech, and Signal Processing | 2013

Supervised model training for overlapping sound events based on unsupervised source separation

Toni Heittola; Annamaria Mesaros; Tuomas Virtanen; Moncef Gabbouj

Sound event detection is addressed in the presence of overlapping sounds. Unsupervised sound source separation into streams is used as a preprocessing step to minimize the interference of overlapping events. This poses a problem in supervised model training, since there is no knowledge of which separated stream contains the targeted sound source. We propose two iterative approaches based on the EM algorithm to select the stream most likely to contain the target sound: one always selects the most likely stream, and the other gradually eliminates the most unlikely streams from the training. The approaches were evaluated on a database containing recordings from various contexts, against a baseline system trained without stream selection. Both proposed approaches gave a reasonable increase of 8 percentage points in detection accuracy.
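
The first variant (always selecting the most likely stream) can be sketched as alternating between scoring the candidate separated streams under the current class model and retraining on the selections. GMMs stand in for the paper's HMMs, and the "separated streams" are synthetic.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

# For each training example: a list of separated streams (feature matrices),
# exactly one of which actually contains the target event.
examples = []
for _ in range(20):
    target = rng.normal(2.0, 1.0, (50, 13))        # "true" target stream
    interference = rng.normal(-2.0, 1.0, (50, 13))
    examples.append([target, interference])

# Initialize on all streams pooled, then iterate selection / retraining.
model = GaussianMixture(n_components=2, random_state=0)
model.fit(np.vstack([s for ex in examples for s in ex]))
for _ in range(5):
    chosen = [max(ex, key=model.score) for ex in examples]   # most likely stream
    model = GaussianMixture(n_components=2, random_state=0).fit(np.vstack(chosen))

# The selections drift toward the target streams (mean near 2.0 here).
print("mean of selected streams:", np.vstack(chosen).mean().round(2))
```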


Workshop on Applications of Signal Processing to Audio and Acoustics | 2013

Sound event detection using non-negative dictionaries learned from annotated overlapping events

Onur Dikmen; Annamaria Mesaros

Detection of overlapping sound events generally requires training class models either from separate data for each class or by making assumptions about the dominant events in the mixed signals. Methods based on sound source separation are currently used in this task, but involve the problem of assigning separated components to sources. In this paper, we propose a method which bypasses the need to build separate sound models. Instead, non-negative dictionaries for the sound content and the annotations are learned in a coupled manner. In the testing stage, time activations of the sound dictionary columns are estimated and used to reconstruct the annotations using the annotation dictionary. The method requires no separate training data per class, and very promising results are obtained using only a small amount of data.


International Conference on Acoustics, Speech, and Signal Processing | 2010

Recognition of phonemes and words in singing

Annamaria Mesaros; Tuomas Virtanen

This paper studies the influence of n-gram language models on the recognition of sung phonemes and words. We train uni-, bi-, and trigram language models for phonemes, and bi- and trigram models for words. The word-level language model is estimated from a textual lyrics database. For recognition we use a hidden Markov model based phonetic recognizer adapted to the singing voice. The models were tested on monophonic singing and on vocal lines separated from polyphonic music. On clean singing, phoneme recognition accuracy varied from 20% (no language model) to 39% (bigram), and on polyphonic music from 6% (no language model) to 20% (bigram). In word recognition, one fifth of the words were recognized in clean singing, with lower performance on polyphonic music. We study the use of the recognition results in a query-by-singing application: using the recognized words, we retrieve songs by searching for the text in a lyrics database. For the word recognition system with only a 24% correct recognition rate, the first retrieved song is correct in 57% of the test cases.
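
A minimal sketch of a phoneme bigram language model of the kind tested above: add-one-smoothed bigram probabilities estimated from a toy phonetized corpus (the paper trains on the speech database's text), then used to score candidate phoneme sequences.

```python
from collections import Counter
import math

# Toy phonetized corpus; "sil" marks utterance boundaries.
corpus = [["sil", "t", "w", "ih", "ng", "k", "ah", "l", "sil"],
          ["sil", "l", "ih", "t", "ah", "l", "sil"]]

unigrams, bigrams = Counter(), Counter()
for seq in corpus:
    unigrams.update(seq[:-1])                 # context counts
    bigrams.update(zip(seq[:-1], seq[1:]))    # bigram counts
vocab = {p for seq in corpus for p in seq}

def bigram_logprob(seq, alpha=1.0):
    """Add-one smoothed log-probability of a phoneme sequence."""
    lp = 0.0
    for a, b in zip(seq[:-1], seq[1:]):
        lp += math.log((bigrams[(a, b)] + alpha)
                       / (unigrams[a] + alpha * len(vocab)))
    return lp

# A sequence seen in training scores higher than a shuffled version of it.
print(bigram_logprob(["sil", "l", "ih", "t", "ah", "l", "sil"]))
print(bigram_logprob(["sil", "ah", "l", "t", "ih", "l", "sil"]))
```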


International Conference on Acoustics, Speech, and Signal Processing | 2014

Unsupervised feature extraction for multimedia event detection and ranking using audio content

Ehsan Amid; Annamaria Mesaros; Kalle J. Palomäki; Jorma Laaksonen; Mikko Kurimo

In this paper, we propose a new approach to classify and rank multimedia events based purely on audio content, using video data from the TRECVID-2013 multimedia event detection (MED) challenge. We apply several layers of nonlinear mappings to extract a set of unsupervised features from an initial set of temporal and spectral features, obtaining a superior representation of the atomic audio units. Additionally, we propose a novel weighted divergence measure for kernel-based classifiers. An extensive set of experiments confirms that the proposed steps improve accuracy for most of the event classes.
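
The paper's weighted divergence measure is not reproduced here; as a stand-in, the sketch below builds a kernel from a symmetrized Kullback-Leibler divergence between histogram-like features and feeds it to an SVM as a precomputed kernel, which is one common way to use divergence measures with kernel classifiers.

```python
import numpy as np
from sklearn.svm import SVC

def sym_kl(p, q, eps=1e-10):
    """Symmetrized KL divergence between two (normalized) histograms."""
    p, q = p + eps, q + eps
    p, q = p / p.sum(), q / q.sum()
    return 0.5 * (np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))

def divergence_kernel(X, Y, gamma=1.0):
    """K(x, y) = exp(-gamma * D(x, y)); not guaranteed positive definite."""
    return np.exp(-gamma * np.array([[sym_kl(x, y) for y in Y] for x in X]))

rng = np.random.default_rng(0)
X = np.vstack([rng.dirichlet(np.ones(16), 30),        # class 0 histograms
               rng.dirichlet(np.arange(1, 17), 30)])  # class 1 histograms
y = np.array([0] * 30 + [1] * 30)

clf = SVC(kernel="precomputed").fit(divergence_kernel(X, X), y)
print("train accuracy:", clf.score(divergence_kernel(X, X), y))
```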


EURASIP Journal on Audio, Speech, and Music Processing | 2014

Method for creating location-specific audio textures

Toni Heittola; Annamaria Mesaros; Dani Korpi; Antti Eronen; Tuomas Virtanen

An approach is proposed for creating location-specific audio textures for virtual location-exploration services. The presented approach creates audio textures by processing a small amount of audio recorded at a given location, providing a cost-effective way to produce a versatile audio signal that characterizes the location. The resulting texture is non-repetitive and conserves the location-specific characteristics of the audio scene, without the need to collect large amounts of audio from each location. The method consists of two stages: analysis and synthesis. In the analysis stage, the source audio recording is segmented into homogeneous segments. In the synthesis stage, the audio texture is created by randomly drawing segments from the source audio so that consecutive segments have timbral similarity near the segment boundaries. Listening experiments show no statistically significant difference in audio quality or location-specificity between the created audio textures and excerpts of the original recordings. Therefore, the proposed audio textures could be utilized in virtual location-exploration services. Examples of source signals and the audio textures created from them are available at http://www.cs.tut.fi/~heittolt/audiotexture.
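
The synthesis stage can be sketched as follows: segments are drawn at random, but each draw favors the candidate whose start is timbrally closest (here, Euclidean distance between boundary feature vectors) to the end of the previous segment. The segmentation output and the features are invented stand-ins.

```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend analysis-stage output: each segment carries its audio samples and
# feature vectors describing its start and end (e.g. boundary-frame timbre).
segments = [{"audio": rng.normal(size=1000),
             "start_feat": rng.normal(size=13),
             "end_feat": rng.normal(size=13)} for _ in range(30)]

def synthesize(segments, n_out=10, n_candidates=8):
    """Concatenate randomly drawn segments, preferring timbral continuity."""
    out = [segments[rng.integers(len(segments))]]
    for _ in range(n_out - 1):
        prev_end = out[-1]["end_feat"]
        idx = rng.choice(len(segments), size=n_candidates, replace=False)
        best = min((segments[i] for i in idx),
                   key=lambda s: np.linalg.norm(s["start_feat"] - prev_end))
        out.append(best)
    return np.concatenate([s["audio"] for s in out])

texture = synthesize(segments)
print("texture length (samples):", texture.size)
```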


International Conference on Acoustics, Speech, and Signal Processing | 2013

Analysis of acoustic-semantic relationship for diversely annotated real-world audio data

Annamaria Mesaros; Toni Heittola; Kalle J. Palomäki

A common problem of freely annotated or user-contributed audio databases is the high variability of the labels, related to homonyms, synonyms, plurals, etc. Automatically re-labeling audio data based on audio similarity could offer a solution to this problem. This paper studies the relationship between audio and labels in a sound event database by evaluating the semantic similarity of labels of acoustically similar sound event instances. The assumption behind the study is that acoustically similar events are annotated with semantically similar labels. Indeed, for 43% of the tested data, at least one of the ten acoustically nearest neighbors had a synonym as its label, while the closest related term was on average one level higher or lower in the semantic hierarchy.
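
The evaluation idea lends itself to a small sketch: for each instance, find the acoustically nearest neighbors in feature space and check whether any neighbor carries a synonymous label. Features, labels, and the flat synonym map below are toy stand-ins; the paper uses a semantic hierarchy rather than a fixed synonym list.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(0)

# Toy feature space: two acoustic clusters, each annotated with two
# interchangeable labels by different annotators.
feats = np.vstack([rng.normal(0, 1, (10, 8)), rng.normal(4, 1, (10, 8))])
labels = ["car"] * 5 + ["automobile"] * 5 + ["dog"] * 5 + ["hound"] * 5

synonyms = {"car": {"automobile"}, "automobile": {"car"},
            "dog": {"hound"}, "hound": {"dog"}}

nn = NearestNeighbors(n_neighbors=11).fit(feats)   # self + 10 neighbors
_, idx = nn.kneighbors(feats)

hits = sum(
    any(labels[j] in synonyms.get(labels[i], set()) for j in neigh[1:])
    for i, neigh in enumerate(idx)
)
print(f"instances with a synonym among 10 nearest neighbors: {hits}/{len(labels)}")
```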

Collaboration


Annamaria Mesaros's most frequent co-authors and their affiliations.

Top Co-Authors

Tuomas Virtanen, Tampere University of Technology
Toni Heittola, Tampere University of Technology
Dani Korpi, Tampere University of Technology
Jaakko Astola, Tampere University of Technology