Network


Latest external collaborations at the country level. Dive into details by clicking on the dots.

Hotspot


Dive into the research topics where Ron J. Weiss is active.

Publication


Featured research published by Ron J. Weiss.


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2015

Speech acoustic modeling from raw multichannel waveforms

Yedid Hoshen; Ron J. Weiss; Kevin W. Wilson

Standard deep neural network-based acoustic models for automatic speech recognition (ASR) rely on hand-engineered input features, typically log-mel filterbank magnitudes. In this paper, we describe a convolutional neural network-deep neural network (CNN-DNN) acoustic model which takes raw multichannel waveforms as input, i.e. without any preceding feature extraction, and learns a similar feature representation through supervised training. By operating directly in the time domain, the network is able to take advantage of the signal's fine time structure that is discarded when computing filterbank magnitude features. This structure is especially useful when analyzing multichannel inputs, where timing differences between input channels can be used to localize a signal in space. The first convolutional layer of the proposed model naturally learns a filterbank that is selective in both frequency and direction of arrival, i.e. a bank of bandpass beamformers with an auditory-like frequency scale. When trained on data corrupted with noise coming from different spatial locations, the network learns to filter out the noise by steering nulls in the directions corresponding to the noise sources. Experiments on a simulated multichannel dataset show that the proposed acoustic model outperforms a DNN that uses log-mel filterbank magnitude features under noisy and reverberant conditions.
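
A minimal sketch of a first layer that operates directly on raw multichannel waveforms, in the spirit of the CNN-DNN front end described above (this is not the paper's exact architecture; the microphone count, number of filters, filter length, and pooling window are illustrative assumptions):

```python
import torch
import torch.nn as nn

class RawWaveformFrontEnd(nn.Module):
    def __init__(self, num_mics=2, num_filters=40, filter_len=400, hop=160):
        super().__init__()
        # Each learned filter spans all microphones, so it can exploit the
        # timing differences between channels (beamformer-like behaviour).
        self.conv = nn.Conv1d(num_mics, num_filters, kernel_size=filter_len)
        self.pool = nn.AvgPool1d(kernel_size=hop, stride=hop)

    def forward(self, waveform):
        # waveform: (batch, num_mics, num_samples) raw audio
        activations = torch.relu(self.conv(waveform))
        # Average the rectified activations over short windows and log-compress,
        # yielding a learned filterbank-like representation per frame.
        return torch.log1p(self.pool(activations))  # (batch, num_filters, num_frames)

frontend = RawWaveformFrontEnd()
features = frontend(torch.randn(8, 2, 16000))  # one second of 16 kHz two-channel audio
```

The resulting frames would then feed a conventional DNN classifier, so the filterbank is learned jointly with the acoustic model through supervised training.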


Computer Speech & Language | 2010

Speech separation using speaker-adapted eigenvoice speech models

Ron J. Weiss; Daniel P. W. Ellis

We present a system for model-based source separation for use on single channel speech mixtures where the precise source characteristics are not known a priori. The sources are modeled using hidden Markov models (HMMs) and separated using factorial HMM methods. Without prior speaker models for the sources in the mixture it is difficult to exactly resolve the individual sources because there is no way to determine which state corresponds to which source at any point in time. This is solved to a small extent by the temporal constraints provided by the Markov models, but permutations between sources remain a significant problem. We overcome this by adapting the models to match the sources in the mixture. We do this by representing the space of speaker variation with a parametric signal model based on the eigenvoice technique for rapid speaker adaptation. We present an algorithm to infer the characteristics of the sources present in a mixture, allowing for significantly improved separation performance over that obtained using unadapted source models. The algorithm is evaluated on the task defined in the 2006 Speech Separation Challenge [Cooke, M.P., Lee, T.-W., 2008. The 2006 Speech Separation Challenge. Computer Speech and Language] and compared with separation using source-dependent models. Although performance is not as good as with speaker-dependent models, we show that the system based on model adaptation is able to generalize better to held-out speakers.
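
A toy sketch of the eigenvoice parameterization that underlies the adaptation step: a speaker-adapted model mean is the mean voice plus a weighted combination of eigenvoice bases. The dimensions and the simple least-squares weight estimate are illustrative assumptions; in the paper the weights are inferred jointly with the factorial-HMM state sequence from the mixed signal.

```python
import numpy as np

def adapt_eigenvoice(mean_voice, eigenvoices, frames):
    """mean_voice: (D,), eigenvoices: (K, D) basis vectors, frames: (T, D) observations."""
    # Estimate adaptation weights w so that mean_voice + eigenvoices^T w
    # best explains the observed frames (here by ordinary least squares).
    target = frames.mean(axis=0) - mean_voice
    w, *_ = np.linalg.lstsq(eigenvoices.T, target, rcond=None)
    return mean_voice + eigenvoices.T @ w  # adapted speaker model mean, shape (D,)

rng = np.random.default_rng(0)
adapted = adapt_eigenvoice(rng.normal(size=64),        # mean voice
                           rng.normal(size=(10, 64)),  # 10 eigenvoice bases
                           rng.normal(size=(100, 64))) # 100 observed feature frames
```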


International Conference on Acoustics, Speech, and Signal Processing (ICASSP) | 2017

CNN architectures for large-scale audio classification

Shawn Hershey; Sourish Chaudhuri; Daniel P. W. Ellis; Jort F. Gemmeke; Aren Jansen; R. Channing Moore; Manoj Plakal; Devin Platt; Rif A. Saurous; Bryan Seybold; Malcolm Slaney; Ron J. Weiss; Kevin W. Wilson

Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both the training set and the label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and that larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
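
The abstract does not spell out the input representation, but a common way to apply image-style CNNs to soundtracks is to cut a log-mel spectrogram into fixed-size 2-D patches. The sketch below shows that preprocessing step only; the mel and patch parameters are illustrative, and log_mel_patches is a hypothetical helper rather than code from the paper.

```python
import numpy as np
import librosa

def log_mel_patches(waveform, sr=16000, n_mels=64, frames_per_patch=96):
    """Turn a mono waveform into a list of (n_mels, frames_per_patch) 'images'."""
    mel = librosa.feature.melspectrogram(y=waveform, sr=sr, n_mels=n_mels)
    log_mel = np.log(mel + 1e-6)
    n_patches = log_mel.shape[1] // frames_per_patch
    # Each patch can be fed to an image-classification CNN (AlexNet, VGG, ...)
    # whose output layer predicts the video-level audio labels.
    return [log_mel[:, i * frames_per_patch:(i + 1) * frames_per_patch]
            for i in range(n_patches)]
```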


IEEE Journal of Selected Topics in Signal Processing | 2011

Unsupervised Discovery of Temporal Structure in Music

Ron J. Weiss; Juan Pablo Bello

We describe a data-driven algorithm for automatically identifying repeated patterns in music which analyzes a feature matrix using shift-invariant probabilistic latent component analysis. We utilize sparsity constraints to automatically identify the number of patterns and their lengths, parameters that would normally need to be fixed in advance, as well as to control the structure of the decomposition. The proposed analysis is applied to beat-synchronous chromagrams in order to concurrently extract recurrent harmonic motifs and their locations within a song. We demonstrate how the analysis can be used to accurately identify riffs in popular music and explore the relationship between the derived parameters and a song's underlying metrical structure. Finally, we show how this analysis can be used for long-term music structure segmentation, resulting in an algorithm that is competitive with other state-of-the-art segmentation algorithms based on hidden Markov models and self-similarity matrices.
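
A toy sketch of the reconstruction underlying shift-invariant probabilistic latent component analysis: the beat-synchronous chromagram is approximated as a sum of fixed-length chroma patterns convolved in time with their activations. The shapes are illustrative assumptions, and the sparsity priors the paper uses to select the number and lengths of patterns are omitted.

```python
import numpy as np

def reconstruct(patterns, activations):
    """patterns: (K, 12, L) chroma motifs; activations: (K, T) pattern onsets."""
    K, C, L = patterns.shape
    T = activations.shape[1]
    recon = np.zeros((C, T + L - 1))
    for k in range(K):
        # Place a copy of pattern k, scaled by its activation, at every onset.
        for t in np.nonzero(activations[k])[0]:
            recon[:, t:t + L] += activations[k, t] * patterns[k]
    return recon[:, :T]  # same length as the analyzed chromagram
```

Fitting alternately updates the patterns and activations so that this reconstruction matches the observed chromagram, with sparsity on the activations controlling the structure of the decomposition.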


Conference on Recommender Systems (RecSys) | 2013

Learning to rank recommendations with the k-order statistic loss

Jason Weston; Hector Yee; Ron J. Weiss

Making recommendations by learning to rank is becoming an increasingly studied area. Approaches that use stochastic gradient descent scale well to large collaborative filtering datasets, and it has been shown how to approximately optimize the mean rank, or more recently the top of the ranked list. In this work we present a family of loss functions, the k-order statistic loss, that includes these previous approaches as special cases, and also derives new ones that we show to be useful. In particular, we present (i) a new variant that more accurately optimizes precision at k, and (ii) a novel procedure of optimizing the mean maximum rank, which we hypothesize is useful to more accurately cover all of a user's tastes. The general approach works by sampling N positive items, ordering them by the score assigned by the model, and then weighting the example as a function of this ordered set. Our approach is studied in two real-world systems, Google Music and YouTube video recommendations, where we obtain improvements for computable metrics, and in the YouTube case, increased user click-through and watch duration when deployed live on www.youtube.com.
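
A hedged sketch of the sampling-and-weighting idea described above: draw N positive items for a user, sort them by model score, and weight each one's contribution to the ranking loss by its position in that ordered set. The particular position-weight vectors shown are illustrative choices, meant only to mimic uniform (mean-rank-like) and top-heavy behaviour.

```python
import numpy as np

def k_os_weights(scores, position_weights):
    """scores: (N,) model scores of N sampled positive items;
    position_weights: (N,) weight assigned to each position after sorting (best first)."""
    order = np.argsort(-scores)         # indices of the positives, best-scored first
    weights = np.empty_like(scores, dtype=float)
    weights[order] = position_weights   # weight each positive by its ordered position
    return weights

scores = np.array([0.2, 1.5, -0.3, 0.9])
uniform = k_os_weights(scores, np.full(4, 0.25))                  # treat all positives equally
top_heavy = k_os_weights(scores, np.array([1.0, 0.0, 0.0, 0.0]))  # emphasize the best-scored one
```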


Conference on Recommender Systems (RecSys) | 2013

Nonlinear latent factorization by embedding multiple user interests

Jason Weston; Ron J. Weiss; Hector Yee

Classical matrix factorization approaches to collaborative filtering learn a latent vector for each user and each item, and recommendations are scored via the similarity between two such vectors, which are of the same dimension. In this work, we are motivated by the intuition that a user is a much more complicated entity than any single item, and cannot be well described by the same representation. Hence, the variety of a user's interests could be better captured by a more complex representation. We propose to model the user with a richer set of functions, specifically via a set of latent vectors, where each vector captures one of the user's latent interests or tastes. The overall recommendation model is then nonlinear, where the matching score between a user and a given item is the maximum matching score over each of the user's latent interests with respect to the item's latent representation. We describe a simple, general and efficient algorithm for learning such a model, and apply it to large scale, real-world datasets from YouTube and Google Music, where our approach outperforms existing techniques.
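
A minimal sketch of the max-based scoring rule described above, in which each user is represented by several latent interest vectors and an item by a single latent vector; the number of interests and the dimensionality are illustrative assumptions.

```python
import numpy as np

def score(user_interests, item_vector):
    """user_interests: (T, D) latent interest vectors; item_vector: (D,) item embedding."""
    # The user-item score is the best match over the user's T interests,
    # which makes the overall model nonlinear in the user representation.
    return np.max(user_interests @ item_vector)

rng = np.random.default_rng(0)
user = rng.normal(size=(4, 32))   # e.g. T = 4 interests of dimension D = 32
item = rng.normal(size=32)
print(score(user, item))
```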


International Society for Music Information Retrieval Conference (ISMIR) | 2010

Clustering beat-chroma patterns in a large music database

Thierry Bertin-Mahieux; Ron J. Weiss; Daniel P. W. Ellis

A musical style or genre implies a set of common conventions and patterns combined and deployed in different ways to make individual musical pieces; for instance, most would agree that contemporary pop music is assembled from a relatively small palette of harmonic and melodic patterns. The purpose of this paper is to use a database of tens of thousands of songs in combination with a compact representation of melodic-harmonic content (the beat-synchronous chromagram) and data-mining tools (clustering) to attempt to explicitly catalog this palette, at least within the limitations of the beat-chroma representation. We use online k-means clustering to summarize 3.7 million 4-beat bars in a codebook of a few hundred prototypes. By measuring how accurately such a quantized codebook can reconstruct the original data, we can quantify the degree of diversity (distortion as a function of codebook size) and temporal structure (i.e. the advantage gained by jointly quantizing multiple frames) in this music. The most popular codewords themselves reveal the common chords used in the music. Finally, the quantized representation of music can be used for music retrieval tasks such as artist and genre classification, and identifying songs that are similar in terms of their melodic-harmonic content.
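
A small sketch of the online k-means step used to build such a codebook, assuming each 4-beat bar is a flattened 12 x 4 beat-chroma patch; the codebook size, patch shape, and running-mean update schedule are illustrative assumptions.

```python
import numpy as np

def online_kmeans(patches, num_codewords=200, seed=0):
    """patches: (N, 48) flattened 4-beat beat-chroma patterns, N >= num_codewords."""
    rng = np.random.default_rng(seed)
    codebook = patches[rng.choice(len(patches), num_codewords, replace=False)].copy()
    counts = np.ones(num_codewords)
    for x in patches:                  # single pass; patterns can be streamed from disk
        j = np.argmin(((codebook - x) ** 2).sum(axis=1))   # nearest codeword
        counts[j] += 1
        codebook[j] += (x - codebook[j]) / counts[j]        # move it toward the sample
    return codebook
```

Quantization error of bars against the resulting codebook then serves as the distortion measure used to quantify diversity and temporal structure.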


IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA) | 2007

Monaural Speech Separation using Source-Adapted Models

Ron J. Weiss; Daniel P. W. Ellis

We propose a model-based source separation system for use on single channel speech mixtures where the precise source characteristics are not known a priori. We do this by representing the space of source variation with a parametric signal model based on the eigenvoice technique for rapid speaker adaptation. We present an algorithm to infer the characteristics of the sources present in a mixture, allowing for significantly improved separation performance over that obtained using unadapted source models. The algorithm is evaluated on the task defined in the 2006 Speech Separation Challenge [1] and compared with separation using source-dependent models.


Conference of the International Speech Communication Association (Interspeech) | 2016

Neural Network Adaptive Beamforming for Robust Multichannel Speech Recognition

Bo Li; Tara N. Sainath; Ron J. Weiss; Kevin W. Wilson; Michiel Bacchiani

Joint multichannel enhancement and acoustic modeling using neural networks has shown promise over the past few years. However, one shortcoming of previous work [1, 2, 3] is that the filters learned during training are fixed for decoding, potentially limiting the ability of these models to adapt to previously unseen or changing conditions. In this paper we explore a neural network adaptive beamforming (NAB) technique to address this issue. Specifically, we use LSTM layers to predict time domain beamforming filter coefficients at each input frame. These filters are convolved with the framed time domain input signal and summed across channels, essentially performing FIR filter-and-sum beamforming using the dynamically adapted filter. The beamformer output is passed into a waveform CLDNN acoustic model [4] which is trained jointly with the filter prediction LSTM layers. We find that the proposed NAB model achieves a 12.7% relative improvement in WER over a single channel model [4] and reaches similar performance to a “factored” model architecture which utilizes several fixed spatial filters [3] on a 2,000-hour Voice Search task, with a 17.9% decrease in computational cost.
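
A hedged sketch of the filter-and-sum step: an LSTM predicts a short FIR filter per channel for every input frame, each channel is convolved with its filter, and the channels are summed. Frame length, filter length, and hidden size are illustrative assumptions, and the waveform CLDNN acoustic model that consumes the output is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdaptiveFilterAndSum(nn.Module):
    def __init__(self, num_mics=2, frame_len=400, filter_len=25, hidden=128):
        super().__init__()
        self.filter_len = filter_len
        self.lstm = nn.LSTM(num_mics * frame_len, hidden, batch_first=True)
        self.to_filters = nn.Linear(hidden, num_mics * filter_len)

    def forward(self, frames):
        # frames: (batch, num_frames, num_mics, frame_len) raw audio frames
        b, t, m, n = frames.shape
        h, _ = self.lstm(frames.reshape(b, t, m * n))            # one filter set per frame
        filters = self.to_filters(h).reshape(b, t, m, self.filter_len)
        outputs = []
        for i in range(t):
            # Convolve every (example, mic) pair with its own predicted filter
            # via grouped convolution, then sum across microphones.
            x = frames[:, i].reshape(1, b * m, n)
            w = filters[:, i].reshape(b * m, 1, self.filter_len)
            y = F.conv1d(x, w, groups=b * m).reshape(b, m, -1).sum(dim=1)
            outputs.append(y)
        return torch.stack(outputs, dim=1)  # (batch, num_frames, frame_len - filter_len + 1)

beamformer = AdaptiveFilterAndSum()
enhanced = beamformer(torch.randn(4, 10, 2, 400))  # 4 utterances, 10 frames, 2 mics
```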


Speech Communication | 2011

Combining localization cues and source model constraints for binaural source separation

Ron J. Weiss; Michael I. Mandel; Daniel P. W. Ellis

We describe a system for separating multiple sources from a two-channel recording based on interaural cues and prior knowledge of the statistics of the underlying source signals. The proposed algorithm effectively combines information derived from low level perceptual cues, similar to those used by the human auditory system, with higher level information related to speaker identity. We combine a probabilistic model of the observed interaural level and phase differences with a prior model of the source statistics and derive an EM algorithm for finding the maximum likelihood parameters of the joint model. The system is able to separate more sound sources than there are observed channels in the presence of reverberation. In simulated mixtures of speech from two and three speakers the proposed algorithm gives a signal-to-noise ratio improvement of 1.7 dB over a baseline algorithm which uses only interaural cues. Further improvement is obtained by incorporating eigenvoice speaker adaptation to enable the source model to better match the sources present in the signal. This improves performance over the baseline by 2.7 dB when the speakers used for training and testing are matched. However, the improvement is minimal when the test data is very different from that used in training.
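
A minimal sketch of the low-level cues the model observes, namely per time-frequency interaural level and phase differences computed from the two-channel STFT. The STFT parameters are illustrative, and the EM fitting of the probabilistic model (and the eigenvoice source prior) is not shown.

```python
import numpy as np
from scipy.signal import stft

def interaural_cues(left, right, sr=16000, nperseg=1024):
    """left, right: the two channels of a binaural recording (1-D arrays)."""
    _, _, L = stft(left, fs=sr, nperseg=nperseg)
    _, _, R = stft(right, fs=sr, nperseg=nperseg)
    eps = 1e-10
    ild = 20 * np.log10((np.abs(L) + eps) / (np.abs(R) + eps))  # level difference in dB
    ipd = np.angle(L * np.conj(R))                              # phase difference in radians
    return ild, ipd  # each is (num_freqs, num_frames)
```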

Collaboration


Dive into Ron J. Weiss's collaborations.

Top Co-Authors

Vasily Titov

Pacific Marine Environmental Laboratory