Daniel P. W. Ellis
Publication
Featured research published by Daniel P. W. Ellis.
International Conference on Acoustics, Speech, and Signal Processing | 2017
Jort F. Gemmeke; Daniel P. W. Ellis; Dylan Freedman; Aren Jansen; Wade Lawrence; R. Channing Moore; Manoj Plakal; Marvin Ritter
Audio event recognition, the human-like ability to identify and relate sounds from audio, is a nascent problem in machine perception. Comparable problems such as object detection in images have reaped enormous benefits from comprehensive datasets - principally ImageNet. This paper describes the creation of Audio Set, a large-scale dataset of manually-annotated audio events that endeavors to bridge the gap in data availability between image and audio research. Using a carefully structured hierarchical ontology of 632 audio classes guided by the literature and manual curation, we collect data from human labelers to probe the presence of specific audio classes in 10 second segments of YouTube videos. Segments are proposed for labeling using searches based on metadata, context (e.g., links), and content analysis. The result is a dataset of unprecedented breadth and size that will, we hope, substantially stimulate the development of high-performance audio event recognizers.
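For readers who want to work with segment-level annotations of this kind, the sketch below shows one way to parse a segment list in Python. The file name and the column layout (YTID, start_seconds, end_seconds, positive_labels) reflect the publicly released CSV format and are assumptions rather than anything specified in this abstract.

```python
import csv
from collections import Counter

def load_segments(csv_path):
    """Parse an Audio Set-style segment list (assumed columns:
    YTID, start_seconds, end_seconds, positive_labels)."""
    segments = []
    with open(csv_path, newline="") as f:
        for row in csv.reader(f, skipinitialspace=True):
            if not row or row[0].startswith("#"):  # skip comment/header lines
                continue
            ytid, start, end, labels = row[0], float(row[1]), float(row[2]), row[3]
            segments.append((ytid, start, end, labels.split(",")))
    return segments

if __name__ == "__main__":
    segs = load_segments("balanced_train_segments.csv")  # assumed file name
    label_counts = Counter(lbl for _, _, _, labels in segs for lbl in labels)
    print(f"{len(segs)} ten-second segments, {len(label_counts)} distinct labels")
```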
International Conference on Acoustics, Speech, and Signal Processing | 2017
Shawn Hershey; Sourish Chaudhuri; Daniel P. W. Ellis; Jort F. Gemmeke; Aren Jansen; R. Channing Moore; Manoj Plakal; Devin Platt; Rif A. Saurous; Bryan Seybold; Malcolm Slaney; Ron J. Weiss; Kevin W. Wilson
Convolutional Neural Networks (CNNs) have proven very effective in image classification and show promise for audio. We use various CNN architectures to classify the soundtracks of a dataset of 70M training videos (5.24 million hours) with 30,871 video-level labels. We examine fully connected Deep Neural Networks (DNNs), AlexNet [1], VGG [2], Inception [3], and ResNet [4]. We investigate varying the size of both training set and label vocabulary, finding that analogs of the CNNs used in image classification do well on our audio classification task, and larger training and label sets help up to a point. A model using embeddings from these classifiers does much better than raw features on the Audio Set [5] Acoustic Event Detection (AED) classification task.
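The overall pipeline can be pictured as log-mel spectrogram patches fed to an image-style CNN. The sketch below is a minimal stand-in under assumed parameters (64 mel bands, a roughly one-second 96-frame patch, an illustrative class count, and librosa/PyTorch as tool choices); it is not the authors' architecture or front-end, only an illustration of the idea.

```python
import numpy as np
import librosa
import torch
import torch.nn as nn

def log_mel_patch(wav_path, sr=16000, n_mels=64, frames=96):
    """Compute a log-mel spectrogram patch (~1 s at a 10 ms hop);
    the 64-band / 96-frame layout is an assumption, not the paper's exact setup."""
    y, _ = librosa.load(wav_path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=n_mels)
    logmel = np.log(mel + 1e-6)[:, :frames]                       # (n_mels, <=frames)
    return torch.tensor(logmel, dtype=torch.float32)[None, None]  # (1, 1, n_mels, T)

class TinyAudioCNN(nn.Module):
    """A deliberately small stand-in for the image-style CNNs examined in the paper."""
    def __init__(self, num_classes=527):  # class count is illustrative
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(64, num_classes)

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

# scores = torch.sigmoid(TinyAudioCNN()(log_mel_patch("clip.wav")))  # multi-label scores
```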
International Conference on Acoustics, Speech, and Signal Processing | 2017
Aren Jansen; Jort F. Gemmeke; Daniel P. W. Ellis; Xiaofeng Liu; Wade Lawrence; Dylan Freedman
Internet videos provide a virtually boundless source of audio with a conspicuous lack of localized annotations, presenting an ideal setting for unsupervised methods. With this motivation, we perform an unprecedented exploration into the large-scale discovery of recurring audio events in a diverse corpus of one million YouTube videos (45K hours of audio). Our approach is to apply a streaming, nonparametric clustering algorithm to both spectral features and out-of-domain neural audio embeddings. We use a small portion of manually annotated audio events to quantitatively estimate the intrinsic clustering performance. In addition to providing a useful mechanism for unsupervised active learning, we demonstrate the effectiveness of the discovered audio event clusters in two downstream applications. The first is weakly-supervised learning, where we exploit the association of video-level metadata and cluster occurrences to temporally localize audio events. The second is informative activity detection, an unsupervised method for semantic saliency based on the corpus statistics of the discovered event clusters.
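The streaming, nonparametric clustering idea can be illustrated with a toy online clusterer that spawns a new cluster whenever an incoming embedding is farther than a distance threshold from every existing centroid. This DP-means-style sketch is an assumption, not the paper's algorithm; the threshold value and embedding source are left to the reader.

```python
import numpy as np

class StreamingClusterer:
    """Toy online clustering: each incoming embedding either joins its nearest
    existing cluster (running-mean centroid update) or, if it is farther than
    `threshold` from all centroids, starts a new cluster. Only a sketch of the
    streaming nonparametric idea, not the published method."""
    def __init__(self, threshold=1.0):
        self.threshold = threshold
        self.centroids = []   # list of np.ndarray centroids
        self.counts = []      # number of points assigned to each cluster

    def assign(self, x):
        x = np.asarray(x, dtype=float)
        if self.centroids:
            dists = [np.linalg.norm(x - c) for c in self.centroids]
            k = int(np.argmin(dists))
            if dists[k] < self.threshold:
                self.counts[k] += 1
                self.centroids[k] += (x - self.centroids[k]) / self.counts[k]
                return k
        self.centroids.append(x.copy())
        self.counts.append(1)
        return len(self.centroids) - 1

# clusterer = StreamingClusterer(threshold=0.8)
# labels = [clusterer.assign(e) for e in embeddings]  # embeddings: iterable of vectors
```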
Archive | 2018
Tuomas Virtanen; Mark D. Plumbley; Daniel P. W. Ellis
Sounds carry a great deal of information about our environments, from individual physical events to sound scenes as a whole. In recent years several novel methods have been proposed to analyze this information automatically, and several new applications have emerged. This chapter introduces the basic concepts, research problems, and engineering challenges in computational environmental sound analysis. We motivate the field by briefly describing various applications where the methods can be used. We discuss the commonalities and differences between environmental sound analysis and other major audio content analysis fields such as automatic speech recognition and music information retrieval. We discuss the main challenges in the field and give a short historical perspective of its development. We also briefly summarize the role of each chapter in the book.
Nature Communications | 2018
Sara Popham; Dana Boebinger; Daniel P. W. Ellis; Hideki Kawahara; Josh H. McDermott
The “cocktail party problem” requires us to discern individual sound sources from mixtures of sources. The brain must use knowledge of natural sound regularities for this purpose. One much-discussed regularity is the tendency for frequencies to be harmonically related (integer multiples of a fundamental frequency). To test the role of harmonicity in real-world sound segregation, we developed speech analysis/synthesis tools to perturb the carrier frequencies of speech, disrupting harmonic frequency relations while maintaining the spectrotemporal envelope that determines phonemic content. We find that violations of harmonicity cause individual frequencies of speech to segregate from each other, impair the intelligibility of concurrent utterances despite leaving intelligibility of single utterances intact, and cause listeners to lose track of target talkers. However, additional segregation deficits result from replacing harmonic frequencies with noise (simulating whispering), suggesting additional grouping cues enabled by voiced speech excitation. Our results demonstrate acoustic grouping cues in real-world sound segregation.
Harmonicity is associated with a single sound source and may be a useful cue with which to segregate the speech of multiple talkers. Here the authors introduce a method for perturbing the constituent frequencies of speech and show that violating harmonicity degrades intelligibility of speech mixtures.
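To make the harmonicity manipulation concrete, the toy example below synthesizes a complex tone whose partials are either exact integer multiples of a fundamental or jittered away from harmonic relations. The study itself perturbed real speech with dedicated analysis/synthesis tools, so this numpy sketch (with an assumed fundamental, harmonic count, and jitter amount) is purely illustrative.

```python
import numpy as np

def complex_tone(f0=200.0, n_harmonics=10, jitter=0.0, sr=16000, dur=1.0, seed=0):
    """Synthesize a complex tone whose components are either exact harmonics
    (jitter=0) or frequency-perturbed, inharmonic partials (jitter>0)."""
    rng = np.random.default_rng(seed)
    t = np.arange(int(sr * dur)) / sr
    signal = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        f = k * f0 * (1.0 + jitter * rng.uniform(-1, 1))  # perturb the harmonic relation
        signal += np.sin(2 * np.pi * f * t) / k            # gently rolled-off partials
    return signal / np.max(np.abs(signal))

harmonic = complex_tone(jitter=0.0)     # components at exact integer multiples of f0
inharmonic = complex_tone(jitter=0.08)  # ~8% frequency jitter breaks harmonicity
```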
Science Advances | 2018
Ruth Y. Oliver; Daniel P. W. Ellis; Helen E. Chmura; Jesse S. Krause; Jonathan H. Pérez; Shannan K. Sweet; Laura Gough; John C. Wingfield; Natalie T. Boelman
Soundscape-level acoustic recordings revealed a delay in the arrival of the songbird community at arctic breeding grounds.
Bioacoustic networks could vastly expand the coverage of wildlife monitoring to complement satellite observations of climate and vegetation. This approach would enable global-scale understanding of how climate change influences phenomena such as migratory timing of avian species. The enormous data sets that autonomous recorders typically generate demand automated analyses that remain largely undeveloped. We devised automated signal processing and machine learning approaches to estimate dates on which songbird communities arrived at arctic breeding grounds. Acoustically estimated dates agreed well with those determined via traditional surveys and were strongly related to the landscape’s snow-free dates. We found that environmental conditions heavily influenced daily variation in songbird vocal activity, especially before egg laying. Our novel approaches demonstrate that variation in avian migratory arrival can be detected autonomously. Large-scale deployment of this innovation in wildlife monitoring would enable the coverage necessary to assess and forecast changes in bird migration in the face of climate change.
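As a sketch of how an arrival date might be read off a daily acoustic activity index, the toy rule below flags the first day whose activity exceeds a multiple of the early-season background level. The thresholding rule and its parameters are illustrative assumptions, not the published signal processing and machine learning pipeline.

```python
import numpy as np

def estimate_arrival(dates, daily_activity, background_days=5, factor=3.0):
    """Toy change-point rule: return the first date whose acoustic activity
    index exceeds `factor` times the mean of the first `background_days`."""
    activity = np.asarray(daily_activity, dtype=float)
    background = activity[:background_days].mean() + 1e-9
    above = np.flatnonzero(activity > factor * background)
    return dates[above[0]] if above.size else None

# dates = ["2016-05-%02d" % d for d in range(10, 31)]
# arrival = estimate_arrival(dates, daily_song_index)  # daily_song_index from recordings
```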
Archive | 2018
Tuomas Virtanen; Mark D. Plumbley; Daniel P. W. Ellis
This book presents computational methods for extracting useful information from audio signals, collecting the state of the art in the field of sound event and scene analysis. The authors cover the entire procedure for developing such methods, ranging from data acquisition and labeling, through the design of taxonomies used in the systems, to signal processing methods for feature extraction and machine learning methods for sound recognition. The book also covers advanced techniques for dealing with environmental variation and multiple overlapping sound sources, and for taking advantage of multiple microphones or other modalities. The book gives examples of usage scenarios in large media databases, acoustic monitoring, bioacoustics, and context-aware devices. Graphical illustrations of sound signals and their spectrographic representations are presented, as well as block diagrams and pseudocode of algorithms. The book gives an overview of methods for computational analysis of sound scenes and events, allowing those new to the field to become fully informed; covers all aspects of the machine learning approach, from the data capture and labeling process to the development of algorithms; and includes descriptions of algorithms accompanied by a website from which software implementations can be downloaded, facilitating practical interaction with the techniques.
Archive | 2018
Annamaria Mesaros; Toni Heittola; Daniel P. W. Ellis
Developing computational systems requires methods for evaluating their performance to guide development and compare alternate approaches. A reliable evaluation procedure for a classification or recognition system will involve a standard dataset of example input data along with the intended target output, and well-defined metrics to compare the systems’ outputs with this ground truth. This chapter examines the important factors in the design and construction of evaluation datasets and goes through the metrics commonly used in system evaluation, comparing their properties. We include a survey of currently available datasets for environmental sound scene and event recognition and conclude with advice for designing evaluation protocols.
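As an example of the kind of metric such evaluations rely on, the sketch below computes a simplified single-class, segment-based F-score from reference and estimated (onset, offset) event lists. Established toolkits implement the full multi-class versions of these metrics, so treat this as an illustration of the idea only; the one-second segment length is an assumed default.

```python
import numpy as np

def segment_f1(reference, estimated, duration, seg=1.0):
    """Segment-based F-score for one event class: time is divided into
    fixed-length segments and a segment counts as active if any
    (onset, offset) interval overlaps it."""
    n_seg = int(np.ceil(duration / seg))

    def activity(events):
        act = np.zeros(n_seg, dtype=bool)
        for onset, offset in events:
            act[int(onset // seg): int(np.ceil(offset / seg))] = True
        return act

    ref, est = activity(reference), activity(estimated)
    tp = np.sum(ref & est)
    fp = np.sum(~ref & est)
    fn = np.sum(ref & ~est)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# f1 = segment_f1([(0.5, 2.3), (7.0, 8.2)], [(0.4, 2.0), (6.5, 9.0)], duration=10.0)
```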
arXiv: Sound | 2018
Eduardo Fonseca; Manoj Plakal; Frederic Font; Daniel P. W. Ellis; Xavier Favory; Jordi Pons; Xavier Serra
Archive | 2006
Daniel P. W. Ellis; Ron J. Weiss