Publication


Featured research published by Benjamin Elizalde.


Communications of The ACM | 2016

YFCC100M: the new data in multimedia research

Bart Thomee; David A. Shamma; Gerald Friedland; Benjamin Elizalde; Karl Ni; Douglas N. Poland; Damian Borth; Li-Jia Li

This publicly available curated dataset of almost 100 million photos and videos is free and legal for all.


acm multimedia | 2014

The Placing Task: A Large-Scale Geo-Estimation Challenge for Social-Media Videos and Images

Jaeyoung Choi; Bart Thomee; Gerald Friedland; Liangliang Cao; Karl Ni; Damian Borth; Benjamin Elizalde; Luke R. Gottlieb; Carmen J. Carrano; Roger A. Pearce; Douglas N. Poland

The Placing Task is a yearly challenge offered by the MediaEval Multimedia Benchmarking Initiative that requires participants to develop algorithms that automatically predict the geo-location of social media videos and images. We introduce the newly developed, standardized web-scale geo-tagged dataset for the 2014 Placing Task, which contains 5.5 million photos and 35,000 videos. This standardized benchmark with a large persistent dataset allows the research community to easily evaluate new algorithms and to analyze their performance with respect to state-of-the-art approaches. We discuss the characteristics of this year's Placing Task along with a description of the new dataset components and how they were collected.
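
Since the task is scored by how far each predicted location falls from the ground truth, a minimal sketch of that style of evaluation is shown below; the great-circle (haversine) distance is standard for this kind of scoring, but the radius thresholds and the helper names (haversine_km, accuracy_within) are illustrative assumptions rather than the official evaluation code.

```python
# Sketch of a Placing-Task-style evaluation: compare predicted (lat, lon)
# coordinates against ground truth and report the fraction of items that
# fall within a set of radius thresholds (thresholds are illustrative).
import math

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    r = 6371.0  # mean Earth radius in km
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))

def accuracy_within(predictions, ground_truth, radii_km=(1, 10, 100, 1000)):
    """Fraction of items whose predicted location falls within each radius."""
    errors = [haversine_km(*p, *g) for p, g in zip(predictions, ground_truth)]
    return {r: sum(e <= r for e in errors) / len(errors) for r in radii_km}

# Two hypothetical test items: predicted vs. true coordinates.
preds = [(37.77, -122.42), (48.85, 2.35)]
truth = [(37.80, -122.27), (40.71, -74.01)]
print(accuracy_within(preds, truth))  # e.g. {1: 0.0, 10: 0.0, 100: 0.5, 1000: 0.5}
```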


international conference on multimedia retrieval | 2015

Audio-Based Multimedia Event Detection with DNNs and Sparse Sampling

Khalid Ashraf; Benjamin Elizalde; Forrest N. Iandola; Matthew W. Moskewicz; Julia Bernd; Gerald Friedland; Kurt Keutzer

This paper presents advances in analyzing audio content information to detect events in videos, such as a parade or a birthday party. We developed a set of tools for audio processing within the predominantly vision-focused deep neural network (DNN) framework Caffe. Using these tools, we show, for the first time, the potential of using only a DNN for audio-based multimedia event detection. Training DNNs for event detection using the entire audio track from each video causes a computational bottleneck. Here, we address this problem by developing a sparse audio frame-sampling method that improves event-detection speed and accuracy. We achieved a 10 percentage-point improvement in event classification accuracy, with a 200x reduction in the number of training input examples as compared to using the entire track. This reduction in input feature volume led to a 16x reduction in the size of the DNN architecture and a 300x reduction in training time. We applied our method using the recently released YLI-MED dataset and compared our results with a state-of-the-art system and with results reported in the literature for TRECVID MED. Our results show much higher MAP scores compared to a baseline i-vector system, at a significantly reduced computational cost. The speed improvement is relevant for processing videos on a large scale, and could enable more effective deployment in mobile systems.
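
A minimal sketch of the frame-sampling idea follows, assuming MFCC features computed with librosa; the paper's actual selection criterion and feature pipeline are not reproduced here, and sparse_frames is a hypothetical helper that simply keeps an evenly spaced subset of frames per track.

```python
# Sparse frame-sampling sketch: instead of feeding every audio frame of a
# video to the DNN, keep only a small, evenly spaced subset of frames.
import numpy as np
import librosa

def sparse_frames(audio_path, n_keep=64, n_mfcc=20):
    """Load a track, compute MFCCs, and keep n_keep evenly spaced frames."""
    y, sr = librosa.load(audio_path, sr=16000, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # (n_mfcc, n_frames)
    n_frames = mfcc.shape[1]
    if n_frames <= n_keep:
        return mfcc.T                      # short clip: keep everything
    idx = np.linspace(0, n_frames - 1, n_keep).astype(int)
    return mfcc.T[idx]                     # (n_keep, n_mfcc)
```

Each video then contributes only n_keep training examples instead of thousands, which is where a reduction in input volume of the kind reported above would come from.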


international symposium on multimedia | 2013

An i-Vector Representation of Acoustic Environments for Audio-Based Video Event Detection on User Generated Content

Benjamin Elizalde; Howard Lei; Gerald Friedland

Audio-based video event detection (VED) on user-generated content (UGC) aims to find videos that show an observable event such as a wedding ceremony or birthday party, rather than a sound such as music, clapping, or singing. The difficulty of video content analysis on UGC lies in the acoustic variability and lack of structure of the data. The UGC task has been explored mainly with computer vision, but can benefit from the use of audio. The i-vector system is state-of-the-art in speaker verification and outperforms a conventional Gaussian Mixture Model (GMM)-based approach. The system compensates for undesired acoustic variability and extracts information from the acoustic environment, making it a meaningful choice for detection on UGC. This paper employs the i-vector-based system for audio-based VED on UGC and expands the understanding of the system on this task. It also includes a performance comparison with the conventional GMM-based and state-of-the-art Random Forest (RF)-based systems. The i-vector system aids audio-based event detection by addressing UGC audio characteristics. It outperforms the GMM-based system, is competitive with the RF-based system in terms of the Missed Detection (MD) rate at 4% and 2.8% False Alarm (FA) rates, and complements the RF-based system, with the combination showing a slight improvement over the standalone systems.
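
The sketch below is a rough, simplified stand-in for the detection pipeline described above, not a true i-vector extractor: it fits a GMM background model on pooled frame features, builds a crude per-video supervector from posterior-weighted statistics, compresses it with PCA in place of a trained total-variability matrix, and scores with a linear classifier. All function names and parameter choices are illustrative.

```python
# Simplified stand-in for an i-vector-style audio event detector.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.decomposition import PCA
from sklearn.svm import LinearSVC

def supervector(gmm, frames):
    """Posterior-weighted mean offsets, flattened (a crude supervector)."""
    post = gmm.predict_proba(frames)                 # (T, K)
    counts = post.sum(axis=0) + 1e-8                 # (K,)
    means = post.T @ frames / counts[:, None]        # (K, D)
    return (means - gmm.means_).ravel()

# train_feats: list of (T_i, D) frame-feature matrices, one per video;
# train_labels: 1 if the video shows the target event, 0 otherwise.
def fit_event_detector(train_feats, train_labels, n_components=64, dim=100):
    ubm = GaussianMixture(n_components=n_components, covariance_type="diag",
                          max_iter=50).fit(np.vstack(train_feats))
    sv = np.array([supervector(ubm, f) for f in train_feats])
    pca = PCA(n_components=min(dim, len(sv) - 1)).fit(sv)  # proxy for the T-matrix
    clf = LinearSVC().fit(pca.transform(sv), train_labels)
    return ubm, pca, clf

def score(ubm, pca, clf, feats):
    """Detection score for one video's frame features."""
    return clf.decision_function(pca.transform([supervector(ubm, feats)]))[0]
```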


acm multimedia | 2015

Kickstarting the Commons: The YFCC100M and the YLI Corpora

Julia Bernd; Damian Borth; Carmen J. Carrano; Jaeyoung Choi; Benjamin Elizalde; Gerald Friedland; Luke R. Gottlieb; Karl Ni; Roger A. Pearce; Douglas N. Poland; Khalid Ashraf; David A. Shamma; Bart Thomee

The publication of the Yahoo Flickr Creative Commons 100 Million dataset (YFCC100M)--to date the largest open-access collection of photos and videos--has provided a unique opportunity to stimulate new research in multimedia analysis and retrieval. To make the YFCC100M even more valuable, we have started working towards supplementing it with a comprehensive set of precomputed features and high-quality ground truth annotations. As part of our efforts, we are releasing the YLI feature corpus, as well as the YLI-GEO and YLI-MED annotation subsets. Under the Multimedia Commons Project (MMCP), we are currently laying the groundwork for a common platform and framework around the YFCC100M that (i) facilitates researchers in contributing additional features and annotations, (ii) supports experimentation on the dataset, and (iii) enables sharing of obtained results. This paper describes the YLI features and annotations released thus far, and sketches our vision for the MMCP.


acm multimedia | 2015

Insights into Audio-Based Multimedia Event Classification with Neural Networks

Mirco Ravanelli; Benjamin Elizalde; Julia Bernd; Gerald Friedland

Multimedia Event Detection (MED) aims to identify events, also called scenes, in videos, such as a flash mob or a wedding ceremony. Audio content information complements cues such as visual content and text. In this paper, we explore the optimization of neural networks (NNs) for audio-based multimedia event classification, and discuss some insights towards more effectively using this paradigm for MED. We explore different architectures, in terms of number of layers and number of neurons. We also assess the performance impact of pre-training with Restricted Boltzmann Machines (RBMs) in contrast with random initialization, and explore the effect of varying the context window for the input to the NNs. Lastly, we compare the performance of Hidden Markov Models (HMMs) with a discriminative classifier for the event classification. We used the publicly available event-annotated YLI-MED dataset. Our results showed a performance improvement of more than 6% absolute accuracy compared to the latest results reported in the literature. Interestingly, these results were obtained with a single-layer neural network with random initialization, suggesting that standard approaches with deep learning and RBM pre-training are not fully adequate to address the high-level video event-classification task.
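
As a concrete illustration of the single-hidden-layer, randomly initialized network and the context-window input the abstract refers to, here is a minimal sketch using scikit-learn; the feature type, hidden-layer size, and context length are assumptions, not the paper's exact configuration.

```python
# Single-hidden-layer classifier over context-windowed acoustic frames.
import numpy as np
from sklearn.neural_network import MLPClassifier

def add_context(frames, context=5):
    """Stack +/- context neighbouring frames onto each frame (the 'context
    window'), padding at the clip edges by repeating the boundary frame."""
    T, D = frames.shape
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    return np.hstack([padded[i:i + T] for i in range(2 * context + 1)])

# X: (n_frames, feat_dim) features pooled over the training videos;
# y: per-frame event label inherited from the video-level annotation.
def train_event_classifier(X, y, context=5, hidden=512):
    clf = MLPClassifier(hidden_layer_sizes=(hidden,), activation="relu",
                        max_iter=200)   # random initialization, one hidden layer
    clf.fit(add_context(X, context), y)
    return clf
```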


Eurasip Journal on Audio, Speech, and Music Processing | 2018

AudioPairBank: towards a large-scale tag-pair-based audio content analysis

Sebastian Sager; Benjamin Elizalde; Damian Borth; Christian Schulze; Bhiksha Raj; Ian R. Lane

Recently, sound recognition has been used to identify sounds such as the sound of a car or a river. However, sounds have nuances that may be better described by adjective-noun pairs such as “slow car” and verb-noun pairs such as “flying insects,” which are underexplored. Therefore, this work investigates the relationship between audio content and both adjective-noun pairs and verb-noun pairs. Due to the lack of datasets with these kinds of annotations, we collected and processed the AudioPairBank corpus, consisting of a combined total of 1,123 pairs and over 33,000 audio files. In this paper, we include previously unavailable documentation of the challenges and implications of collecting audio recordings with these types of labels. We also show the degree of correlation between the audio content and the labels through classification experiments, which yielded 70% accuracy. The results and study in this paper encourage further exploration of the nuances in sounds and are meant to complement similar research performed on images and text in multimedia analysis.
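
A minimal sketch of the kind of tag-pair classification experiment described above, assuming simple clip-level MFCC statistics and an off-the-shelf classifier; the actual features, classifier, and evaluation protocol behind the reported 70% accuracy may differ.

```python
# Clip-level tag-pair classification sketch for AudioPairBank-style labels.
import numpy as np
import librosa
from sklearn.ensemble import RandomForestClassifier

def clip_embedding(path, n_mfcc=20):
    """Summarize a clip by the mean and std of its MFCC frames."""
    y, sr = librosa.load(path, sr=None, mono=True)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return np.concatenate([mfcc.mean(axis=1), mfcc.std(axis=1)])

# paths: audio file paths; pairs: labels such as "slow_car" or "flying_insects".
def train_pair_classifier(paths, pairs):
    X = np.array([clip_embedding(p) for p in paths])
    return RandomForestClassifier(n_estimators=300).fit(X, pairs)
```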


ieee international conference on multimedia big data | 2016

City-Identification of Flickr Videos Using Semantic Acoustic Features

Benjamin Elizalde; Guan-Lin Chao; Ming Zeng; Ian R. Lane

City-identification of videos aims to determine the likelihood of a video belonging to a set of cities. In this paper, we present an approach using only audio; we do not use any additional modality such as images, user-tags, or geo-tags. In this manner, we show to what extent the city-location of videos correlates with their acoustic information. Success in this task suggests improvements can be made to complement the other modalities. In particular, we present a method to compute and use semantic acoustic features for city-identification, where the features themselves provide semantic evidence for the identification. The semantic evidence is given by a taxonomy of urban sounds and expresses the potential presence of these sounds in the city soundtracks. We used the MediaEval Placing Task set, which contains Flickr videos labeled by city. In addition, we used the UrbanSound8K set, containing audio clips labeled by sound type. Our method improves on the state-of-the-art performance and provides a novel semantic approach to this task.
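
The semantic-acoustic-feature idea can be sketched as a two-stage pipeline: a sound-type model trained on UrbanSound8K produces, for each video soundtrack, a vector of class posteriors expressing how strongly each urban sound type appears to be present, and that vector is the feature for city identification. The sketch below assumes generic clip-level features and logistic-regression models; both are illustrative choices, not the paper's implementation.

```python
# Two-stage semantic acoustic features for city identification.
from sklearn.linear_model import LogisticRegression

# Stage 1: sound-type model on UrbanSound8K clip features
# (X_us8k: (n_clips, d) features, y_us8k: sound-type labels, e.g. "siren").
def fit_sound_model(X_us8k, y_us8k):
    return LogisticRegression(max_iter=1000).fit(X_us8k, y_us8k)

# Stage 2: each video soundtrack is mapped to its per-class posterior vector,
# which is both interpretable (semantic evidence) and the input to a city model.
def semantic_features(sound_model, X_videos):
    return sound_model.predict_proba(X_videos)      # (n_videos, n_sound_types)

def fit_city_model(sound_model, X_videos, city_labels):
    return LogisticRegression(max_iter=1000).fit(
        semantic_features(sound_model, X_videos), city_labels)
```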


arXiv: Multimedia | 2015

The New Data and New Challenges in Multimedia Research.

Bart Thomee; David A. Shamma; Gerald Friedland; Benjamin Elizalde; Karl Ni; Douglas N. Poland; Damian Borth; Li-Jia Li


Proc. of NIST TRECVID and Workshop, Gaithersburg, USA | 2012

SRI-Sarnoff AURORA System at TRECVID 2012: Multimedia Event Detection and Recounting

Hui Cheng; Jingen Liu; Saad Ali; Omar Javed; Qian Yu; Amir Tamrakar; Ajay Divakaran; Harpreet S. Sawhney; R. Manmatha; James Allan; Alexander G. Hauptmann; Mubarak Shah; Subhabrata Bhattacharya; Afshin Dehghan; Gerald Friedland; Benjamin Elizalde; Trevor Darrell; Michael J. Witbrock; Jon Curtis

Collaboration


Top co-authors of Benjamin Elizalde and their affiliations.

Gerald Friedland | International Computer Science Institute
Bhiksha Raj | Carnegie Mellon University
Anurag Kumar | Carnegie Mellon University
Karl Ni | Lawrence Livermore National Laboratory
Rohan Badlani | Birla Institute of Technology and Science
Ankit Shah | Carnegie Mellon University
Ian R. Lane | Carnegie Mellon University
Douglas N. Poland | Lawrence Livermore National Laboratory
Julia Bernd | International Computer Science Institute