Publications


Featured research published by Gerard Roma.


International Conference on Latent Variable Analysis and Signal Separation | 2015

Deep Karaoke: Extracting Vocals from Musical Mixtures Using a Convolutional Deep Neural Network

Andrew J. R. Simpson; Gerard Roma; Mark D. Plumbley

Identification and extraction of singing voice from within musical mixtures is a key challenge in source separation and machine audition. Recently, deep neural networks (DNNs) have been used to estimate ideal binary masks for carefully controlled cocktail party speech separation problems. However, it is not yet known whether these methods are capable of generalizing to the discrimination of voice and non-voice in the context of musical mixtures. Here, we trained a convolutional DNN of around a billion parameters to provide probabilistic estimates of the ideal binary mask for separation of vocal sounds from real-world musical mixtures. We contrast our DNN results with more traditional linear methods. Our approach may be useful for automatic removal of vocal sounds from musical mixtures for karaoke-type applications.
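
As an illustration of the ideal-binary-mask formulation that such a network is trained to estimate, the minimal Python sketch below computes an oracle binary mask from synthetic stems and applies it to the mixture. The paper's convolutional DNN is not reproduced here; the synthetic signals and STFT parameters are illustrative assumptions.

import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 2.0, int(sr * 2.0), endpoint=False)
vocal = 0.5 * np.sin(2 * np.pi * 440 * t)       # stand-in "vocal" stem
accomp = 0.5 * np.sin(2 * np.pi * 110 * t)      # stand-in accompaniment stem
mixture = vocal + accomp

V = librosa.stft(vocal)
A = librosa.stft(accomp)
X = librosa.stft(mixture)

# Ideal binary mask: 1 where the vocal dominates a time-frequency bin.
ibm = (np.abs(V) > np.abs(A)).astype(float)

# A trained DNN would output probabilistic estimates of this mask from the
# mixture alone; here the oracle mask is applied and the vocal resynthesised.
vocal_est = librosa.istft(ibm * X, length=len(mixture))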


Conference of the International Speech Communication Association | 2016

Combining Mask Estimates for Single Channel Audio Source Separation using Deep Neural Networks

Emad M. Grais; Gerard Roma; Andrew J. R. Simpson; Mark D. Plumbley

Deep neural networks (DNNs) are usually used for single channel source separation to predict either soft or binary time-frequency masks. The masks are used to separate the sources from the mixed signal. Binary masks produce separated sources with more distortion and less interference than soft masks. In this paper, we propose to use another DNN to combine the estimates of binary and soft masks to achieve the advantages and avoid the disadvantages of using each mask individually. We aim to achieve separated sources with low distortion and low interference between each other. Our experimental results show that combining the estimates of binary and soft masks using a DNN achieves lower distortion than using each estimate individually, and achieves interference as low as the binary mask.
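
A minimal sketch of the combination idea follows, using toy magnitude spectra and a generic scikit-learn MLP as a stand-in for the combining DNN. The layer size, training target, and data are illustrative assumptions, not the paper's setup (which trains against reference sources).

import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(0)

# Toy magnitude spectra for two sources (e.g. vocal and accompaniment).
n_frames, n_bins = 2000, 129
s1 = rng.random((n_frames, n_bins))
s2 = rng.random((n_frames, n_bins))
mix = s1 + s2

soft_mask = s1 / (mix + 1e-8)             # ideal ratio ("soft") mask
binary_mask = (s1 > s2).astype(float)     # ideal binary mask

# Stand-in for the combining DNN: an MLP that maps the two per-frame mask
# estimates (concatenated) to a combined mask.  The soft mask is used as the
# regression target purely as a placeholder.
combiner = MLPRegressor(hidden_layer_sizes=(256,), max_iter=50)
combiner.fit(np.hstack([soft_mask, binary_mask]), soft_mask)
combined_mask = combiner.predict(np.hstack([soft_mask, binary_mask]))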


European Signal Processing Conference | 2016

Evaluation of audio source separation models using hypothesis-driven non-parametric statistical methods

Andrew J. R. Simpson; Gerard Roma; Emad M. Grais; Russell Mason; Christopher Hummersone; Antoine Liutkus; Mark D. Plumbley

Audio source separation models are typically evaluated using objective separation quality measures, but rigorous statistical methods have yet to be applied to the problem of model comparison. As a result, it can be difficult to establish whether or not reliable progress is being made during the development of new models. In this paper, we provide a hypothesis-driven statistical analysis of the results of the recent Signal Separation Evaluation Campaign (SiSEC), involving twelve competing models tested on separation of voice and accompaniment from fifty pieces of “professionally produced” contemporary music. Using non-parametric statistics, we establish reliable evidence for meaningful conclusions about the performance of the various models.
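
One example of a hypothesis-driven non-parametric comparison of this kind is a paired Wilcoxon signed-rank test on per-track scores of two models. The sketch below uses simulated SDR values rather than the SiSEC data, and is only meant to show the shape of such an analysis.

import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(1)

# Hypothetical per-track SDR scores (dB) for two competing models on the same
# 50 test songs; real scores would come from an evaluation toolkit such as BSS Eval.
sdr_model_a = rng.normal(5.0, 2.0, size=50)
sdr_model_b = sdr_model_a + rng.normal(0.5, 1.0, size=50)

# Paired, non-parametric test: does model B reliably outperform model A across
# tracks, without assuming normally distributed scores?
stat, p_value = wilcoxon(sdr_model_a, sdr_model_b)
print(f"Wilcoxon statistic={stat:.1f}, p={p_value:.4f}")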


International Conference on Latent Variable Analysis and Signal Separation | 2017

Discriminative Enhancement for Single Channel Audio Source Separation using Deep Neural Networks

Emad M. Grais; Gerard Roma; Andrew J. R. Simpson; Mark D. Plumbley

The sources separated by most single channel audio source separation techniques are usually distorted and each separated source contains residual signals from the other sources. To tackle this problem, we propose to enhance the separated sources to decrease the distortion and interference between the separated sources using deep neural networks (DNNs). Two different DNNs are used in this work. The first DNN is used to separate the sources from the mixed signal. The second DNN is used to enhance the separated signals. To consider the interactions between the separated sources, we propose to use a single DNN to enhance all the separated sources together. To reduce the residual signals of one source from the other separated sources (interference), we train the DNN for enhancement discriminatively to maximize the dissimilarity between the predicted sources. The experimental results show that using discriminative enhancement decreases the distortion and interference between the separated sources.
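
A minimal sketch of what a discriminative objective of this kind can look like: each enhanced source is penalised for error against its own reference and rewarded for dissimilarity from the other reference. The functional form and the weight `lam` are illustrative assumptions, not the loss used in the paper.

import numpy as np

def discriminative_loss(est1, est2, ref1, ref2, lam=0.05):
    """Illustrative discriminative objective: mean squared error of each
    enhanced source against its own reference, minus a weighted term that
    grows when an estimate still resembles the *other* reference."""
    mse = np.mean((est1 - ref1) ** 2) + np.mean((est2 - ref2) ** 2)
    cross = np.mean((est1 - ref2) ** 2) + np.mean((est2 - ref1) ** 2)
    return mse - lam * cross

rng = np.random.default_rng(1)
ref1, ref2 = rng.normal(size=(2, 1000, 129))
est1 = ref1 + 0.1 * ref2        # estimates still containing some interference
est2 = ref2 + 0.1 * ref1
print(discriminative_loss(est1, est2, ref1, ref2))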


IEEE Transactions on Audio, Speech, and Language Processing | 2017

Two-Stage Single-Channel Audio Source Separation Using Deep Neural Networks

Emad M. Grais; Gerard Roma; Andrew J. R. Simpson; Mark D. Plumbley

Most single channel audio source separation approaches produce separated sources accompanied by interference from other sources and other distortions. To tackle this problem, we propose to separate the sources in two stages. In the first stage, the sources are separated from the mixed signal. In the second stage, the interference between the separated sources and the distortions are reduced using deep neural networks (DNNs). We propose two methods that use DNNs to improve the quality of the separated sources in the second stage. In the first method, each separated source is improved individually using its own trained DNN, while in the second method all the separated sources are improved together using a single DNN. To further improve the quality of the separated sources, the DNNs in the second stage are trained discriminatively to further decrease the interference and the distortions of the separated sources. Our experimental results show that using two stages of separation improves the quality of the separated signals, decreasing the interference between the separated sources and the distortions compared with separating the sources in a single stage.
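
The two-stage structure can be sketched as a simple pipeline: a first network produces initial estimates from the mixture, and a second network refines all estimates together. The placeholder functions below only illustrate the data flow; neither reproduces the paper's trained DNNs.

import numpy as np

# Stand-ins for trained networks: stage 1 maps a mixture spectrogram to initial
# source estimates; stage 2 refines all estimates jointly with a single model.
def separate_stage1(mix_mag):
    mask = np.clip(mix_mag / (mix_mag.max() + 1e-8), 0.0, 1.0)   # placeholder mask
    return mask * mix_mag, (1.0 - mask) * mix_mag

def enhance_stage2_joint(src1, src2):
    stacked = np.concatenate([src1, src2], axis=-1)   # joint input to one DNN
    # A trained enhancement DNN would map `stacked` to refined sources;
    # splitting it back is only a placeholder for that mapping.
    return np.split(stacked, 2, axis=-1)

mix_mag = np.abs(np.random.default_rng(2).normal(size=(100, 513)))
s1, s2 = separate_stage1(mix_mag)
s1_refined, s2_refined = enhance_stage2_joint(s1, s2)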


International Conference on Latent Variable Analysis and Signal Separation | 2017

Psychophysical Evaluation of Audio Source Separation Methods

Andrew J. R. Simpson; Gerard Roma; Emad M. Grais; Russell Mason; Christopher Hummersone; Mark D. Plumbley

Source separation evaluation is typically a top-down process, starting with perceptual measures which capture fitness-for-purpose and followed by attempts to find physical (objective) measures that are predictive of the perceptual measures. In this paper, we take a contrasting bottom-up approach. We begin with the physical measures provided by the Blind Source Separation Evaluation Toolkit (BSS Eval) and we then look for corresponding perceptual correlates. This approach is known as psychophysics and has the distinct advantage of leading to interpretable, psychophysical models. We obtained perceptual similarity judgments from listeners in two experiments featuring vocal sources within musical mixtures. In the first experiment, listeners compared the overall quality of vocal signals estimated from musical mixtures using a range of competing source separation methods. In the second experiment, listeners compared the loudness balance of the competing musical accompaniment and vocal. Our preliminary results provide provisional validation of the psychophysical approach.
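
A minimal sketch of the bottom-up step, assuming per-excerpt BSS Eval scores and mean listener ratings are available: a rank correlation asks whether the physical measure orders the excerpts the same way listeners do. The data below are simulated purely for illustration.

import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(3)

# Hypothetical per-excerpt objective scores (e.g. BSS Eval SDR, in dB) and the
# corresponding mean listener quality ratings.
sdr = rng.normal(4.0, 2.0, size=40)
ratings = 0.6 * sdr + rng.normal(0.0, 1.0, size=40)

# Rank correlation between the physical measure and the perceptual judgments,
# without assuming a linear relationship between the two.
rho, p = spearmanr(sdr, ratings)
print(f"Spearman rho={rho:.2f}, p={p:.3g}")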


Audio Mostly Conference | 2017

Handwaving: Gesture Recognition for Participatory Mobile Music

Gerard Roma; Anna Xambó; Jason Freeman

This paper describes handwaving, a system for participatory mobile music based on accelerometer gesture recognition. The core of the system is a library that can be used to recognize and map arbitrary gestures to sound synthesizers. Such gestures can be quickly learnt by mobile phone users in order to produce sounds in a musical context. The system is implemented using web standards, so it can be used with most current smartphones without the need to install specific software.
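
The recognition step can be illustrated with a small Python sketch: windowed accelerometer statistics fed to a nearest-neighbour classifier whose output labels would be mapped to synthesizer events. This is a generic stand-in for illustration only; the handwaving library itself is implemented with web standards, and its features and models are not reproduced here.

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def window_features(accel, win=32):
    """Mean and standard deviation of each accelerometer axis per window:
    a deliberately simple feature set, not the one used by handwaving."""
    frames = accel[: len(accel) // win * win].reshape(-1, win, 3)
    return np.concatenate([frames.mean(axis=1), frames.std(axis=1)], axis=1)

rng = np.random.default_rng(4)
# Synthetic (x, y, z) accelerometer traces for two example gestures.
shake = rng.normal(0, 2.0, size=(640, 3))
tilt = np.cumsum(rng.normal(0, 0.05, size=(640, 3)), axis=0)

X = np.vstack([window_features(shake), window_features(tilt)])
y = np.array([0] * 20 + [1] * 20)   # 0 = "shake" gesture, 1 = "tilt" gesture

clf = KNeighborsClassifier(n_neighbors=3).fit(X, y)
# A recognized gesture label would then trigger or control a sound synthesizer.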


Computational Analysis of Sound Scenes and Events | 2018

Sound Sharing and Retrieval

Frederic Font; Gerard Roma; Xavier Serra

Multimedia sharing has experienced an enormous growth in recent years, and sound sharing has not been an exception. Nowadays one can find online sound sharing sites in which users can search, browse, and contribute large amounts of audio content such as sound effects, field and urban recordings, music tracks, and music samples. This poses many challenges to enable search, discovery, and ultimately reuse of this content. In this chapter we give an overview of different ways to approach such challenges. We describe how to build an audio database by outlining different aspects to be taken into account. We discuss metadata-based descriptions of audio content and different searching and browsing techniques that can be used to navigate the database. In addition to metadata, we show sound retrieval techniques based on the extraction of audio features from (possibly) unannotated audio. We end the chapter by discussing advanced approaches to sound retrieval and by drawing some conclusions about the present and future of sound sharing and retrieval. In addition to our explanations, we provide code examples that illustrate some of the concepts discussed.
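
As a small illustration of feature-based retrieval of the kind the chapter discusses (not the chapter's own code examples), the sketch below summarises each sound as a mean MFCC vector and retrieves nearest neighbours; the descriptor choice and the synthetic sounds are assumptions.

import numpy as np
import librosa
from sklearn.neighbors import NearestNeighbors

def sound_descriptor(y, sr):
    """Summarise a sound as the mean of its MFCC frames: one simple
    content-based descriptor among the many that could be used."""
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)
    return mfcc.mean(axis=1)

sr = 22050
rng = np.random.default_rng(5)
# Synthetic stand-ins for sounds stored in a shared database.
database = [rng.normal(0, 0.1, sr) * np.sin(2 * np.pi * f * np.arange(sr) / sr)
            for f in (220, 440, 880, 1760)]
descriptors = np.array([sound_descriptor(y, sr) for y in database])

# Index the descriptors and retrieve the sounds most similar to a query.
index = NearestNeighbors(n_neighbors=2).fit(descriptors)
query = sound_descriptor(database[1] + rng.normal(0, 0.01, sr), sr)
distances, neighbours = index.kneighbors(query[None, :])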


Audio Mostly Conference | 2017

Turn-Taking and Chatting in Collaborative Music Live Coding

Anna Xambó; Pratik Shah; Gerard Roma; Jason Freeman; Brian Magerko

Co-located collaborative live coding is a potential approach to network music and to the music improvisation practice known as live coding. A common strategy to support communication between live coders and the audience is the use of a chat window. However, paying attention to simultaneous multi-user actions, such as chat messages and code, can be too demanding. In this paper, we explore collaborative music live coding (CMLC) using the live coding environment and pedagogical tool EarSketch. In particular, we examine the use of turn-taking and a customized chat window inspired by the practice of pair programming, a team-based strategy for efficiently solving computational problems. Our approach to CMLC also aims at facilitating the audience's understanding of this practice. We conclude by discussing the benefits of this approach in both performance and educational settings.


Journal of Intelligent Information Systems | 2017

Environmental sound recognition using short-time feature aggregation

Gerard Roma; Perfecto Herrera; Waldo Nogueira

Recognition of environmental sound is usually based on two main architectures, depending on whether the model is trained with frame-level features or with aggregated descriptions of acoustic scenes or events. The former architecture is appropriate for applications where target categories are known in advance, while the latter affords a less supervised approach. In this paper, we propose a framework for environmental sound recognition based on blind segmentation and feature aggregation. We describe a new set of descriptors, based on Recurrence Quantification Analysis (RQA), which can be extracted from the similarity matrix of a time series of audio descriptors. We analyze their usefulness for recognition of acoustic scenes and events in addition to standard feature aggregation. Our results show the potential of non-linear time series analysis techniques for dealing with environmental sounds.
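
A minimal sketch of the descriptor idea, assuming MFCCs as the frame-level features: a thresholded self-similarity matrix gives a recurrence matrix, from which the simplest RQA descriptor (recurrence rate) is computed. The threshold, features, and synthetic audio are illustrative, not the paper's exact configuration.

import numpy as np
import librosa
from scipy.spatial.distance import cdist

# Synthetic audio stand-in; a real scene or event recording would be loaded instead.
sr = 22050
rng = np.random.default_rng(6)
y = rng.normal(0, 0.1, sr * 2)

# Frame-level descriptors (MFCCs here) and their pairwise distance matrix.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13).T        # (frames, coeffs)
dist = cdist(mfcc, mfcc)

# Recurrence matrix: frames closer than a threshold count as recurrent.
threshold = np.percentile(dist, 10)
recurrence = (dist < threshold).astype(int)

# Recurrence rate, the simplest RQA descriptor; determinism, laminarity, etc.
# are derived from diagonal and vertical line structures in the same matrix.
recurrence_rate = recurrence.sum() / recurrence.size
print(f"recurrence rate = {recurrence_rate:.3f}")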

Collaboration


Dive into Gerard Roma's collaborations.

Top Co-Authors

Anna Xambó, Georgia Institute of Technology
Jason Freeman, Georgia Institute of Technology
Xavier Serra, Pompeu Fabra University
Brian Magerko, Georgia Institute of Technology
Pratik Shah, Georgia Institute of Technology