Publication


Featured research published by George Toderici.


computer vision and pattern recognition | 2015

Beyond short snippets: Deep networks for video classification

Joe Yue-Hei Ng; Matthew J. Hausknecht; Sudheendra Vijayanarasimhan; Oriol Vinyals; Rajat Monga; George Toderici

Convolutional neural networks (CNNs) have been extensively applied for image recognition problems giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full-length videos. The first method explores various convolutional temporal feature pooling architectures, examining the various design choices which need to be made when adapting a CNN for this task. The second proposed method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells which are connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and the UCF-101 dataset both with (88.6% vs. 88.0%) and without (82.6% vs. 73.0%) additional optical flow information.
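The second (LSTM) approach lends itself to a short sketch: per-frame CNN features are fed through an LSTM and the final state is classified at the video level. The snippet below is an illustration only; the feature dimension, hidden size, and class count (2048, 512, 487) are placeholder values, not taken from the paper.

```python
# Illustrative sketch (not the paper's code): an LSTM over per-frame CNN
# features, with the final hidden state used for video-level classification.
import torch
import torch.nn as nn

class VideoLSTMClassifier(nn.Module):
    def __init__(self, feat_dim=2048, hidden_dim=512, num_classes=487):
        super().__init__()
        # feat_dim stands in for the output of the underlying frame-level CNN.
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, frame_features):
        # frame_features: (batch, num_frames, feat_dim)
        outputs, _ = self.lstm(frame_features)
        # Use the representation at the last frame as the video descriptor.
        return self.classifier(outputs[:, -1, :])

# Toy usage on dummy features for two 30-frame clips.
model = VideoLSTMClassifier()
logits = model(torch.randn(2, 30, 2048))
print(logits.shape)  # torch.Size([2, 487])
```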


computer vision and pattern recognition | 2010

Finding meaning on YouTube: Tag recommendation and category discovery

George Toderici; Hrishikesh Aradhye; Marius Pasca; Luciano Sbaiz; Jay Yagnik

We present a system that automatically recommends tags for YouTube videos solely based on their audiovisual content. We also propose a novel framework for unsupervised discovery of video categories that exploits knowledge mined from the World-Wide Web text documents/searches. First, video content to tag association is learned by training classifiers that map audiovisual content-based features from millions of videos on YouTube.com to existing uploader-supplied tags for these videos. When a new video is uploaded, the labels provided by these classifiers are used to automatically suggest tags deemed relevant to the video. Our system has learned a vocabulary of over 20,000 tags. Secondly, we mined large volumes of Web pages and search queries to discover a set of possible text entity categories and a set of associated is-A relationships that map individual text entities to categories. Finally, we apply these is-A relationships mined from web text on the tags learned from audiovisual content of videos to automatically synthesize a reliable set of categories most relevant to videos – along with a mechanism to predict these categories for new uploads. We then present rigorous rating studies that establish that: (a) the average relevance of tags automatically recommended by our system matches the average relevance of the uploader-supplied tags at the same or better coverage and (b) the average precision@K of video categories discovered by our system is 70% with K=5.
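The pipeline can be pictured in two stages: one binary classifier per tag trained on content features, then mined is-A relations that map predicted tags to categories. The sketch below is a toy illustration under assumed names and dimensions; the is-A table, feature size, and logistic-regression classifiers are placeholders, not the production system.

```python
# Toy sketch of tag recommendation plus is-A category discovery
# (placeholders throughout; not the system described in the paper).
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical mined is-A relations: tag -> category.
IS_A = {"golden retriever": "dog", "labrador": "dog", "stratocaster": "guitar"}

def train_tag_classifiers(features, tag_matrix, tags):
    """features: (n_videos, d) content features; tag_matrix: (n_videos, n_tags) 0/1."""
    classifiers = {}
    for j, tag in enumerate(tags):
        classifiers[tag] = LogisticRegression(max_iter=1000).fit(features, tag_matrix[:, j])
    return classifiers

def suggest(classifiers, x, threshold=0.5):
    """Recommend tags for a new video's feature vector x, then derive categories."""
    tags = [t for t, clf in classifiers.items()
            if clf.predict_proba(x.reshape(1, -1))[0, 1] > threshold]
    categories = sorted({IS_A[t] for t in tags if t in IS_A})
    return tags, categories

# Toy data: 200 videos, 16-dim features, 3 uploader-supplied tags.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 16))
Y = (rng.random((200, 3)) > 0.5).astype(int)
clfs = train_tag_classifiers(X, Y, list(IS_A))
print(suggest(clfs, X[0]))
```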


international conference on data mining | 2009

Video2Text: Learning to Annotate Video Content

Hrishikesh Aradhye; George Toderici; Jay Yagnik

This paper discusses a new method for automatic discovery and organization of descriptive concepts (labels) within large real-world corpora of user-uploaded multimedia, such as YouTube.com. Conversely, it also provides validation of existing labels, if any. While training, our method does not assume any explicit manual annotation other than the weak labels already available in the form of video title, description, and tags. Prior work related to such auto-annotation assumed that a vocabulary of labels of interest (e.g., indoor, outdoor, city, landscape) is specified a priori. In contrast, the proposed method begins with an empty vocabulary. It analyzes audiovisual features of 25 million YouTube.com videos -- nearly 150 years of video data -- effectively searching for consistent correlation between these features and text metadata. It autonomously extends the label vocabulary as and when it discovers concepts it can reliably identify, eventually leading to a vocabulary with thousands of labels and growing. We believe that this work significantly extends the state of the art in multimedia data mining, discovery, and organization based on the technical merit of the proposed ideas as well as the enormous scale of the mining exercise in a very challenging, unconstrained, noisy domain.
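The vocabulary-growing loop can be sketched as: for every candidate label mined from titles, descriptions, and tags, train a classifier on audiovisual features and add the label to the vocabulary only if it can be identified reliably on held-out videos. The sketch below fills in details the abstract leaves open (logistic regression as the learner, held-out accuracy as the reliability test), so treat it as an assumption-laden illustration.

```python
# Assumption-laden sketch of growing a label vocabulary from weak metadata.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def grow_vocabulary(features, weak_labels, candidates, min_accuracy=0.7):
    """Keep a candidate label only if audiovisual features predict it reliably.
    features: (n, d); weak_labels: list of label sets from titles/descriptions/tags."""
    vocabulary = {}
    for label in candidates:
        y = np.array([label in labels for labels in weak_labels], dtype=int)
        if y.sum() < 10 or y.sum() > len(y) - 10:
            continue  # too few positives or negatives to validate the concept
        X_tr, X_te, y_tr, y_te = train_test_split(features, y, test_size=0.3, random_state=0)
        clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
        if clf.score(X_te, y_te) >= min_accuracy:
            vocabulary[label] = clf  # the features consistently correlate with this label
    return vocabulary

# Toy data where feature 0 genuinely correlates with the metadata.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 8))
weak = [{"music"} if x[0] > 0 else {"cats"} for x in X]
print(sorted(grow_vocabulary(X, weak, ["music", "cats", "unseen"])))  # ['cats', 'music']
```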


computer vision and pattern recognition | 2011

Discriminative tag learning on YouTube videos with latent sub-tags

Weilong Yang; George Toderici

We consider the problem of content-based automated tag learning. In particular, we address semantic variations (sub-tags) of the tag. Each video in the training set is assumed to be associated with a sub-tag label, and we treat this sub-tag label as latent information. A latent learning framework based on LogitBoost is proposed, which jointly considers both the tag label and the latent sub-tag label. The latent sub-tag information is exploited in our framework to assist the learning of our end goal, i.e., tag prediction. We use the cowatch information to initialize the learning process. In experiments, we show that the proposed method achieves significantly better results over baselines on a large-scale testing video set which contains about 50 million YouTube videos.
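One way to picture the latent-sub-tag training loop: assign each positive video to a latent sub-tag, train one classifier per sub-tag, re-assign positives to the sub-tag that scores them highest, and repeat; the tag score of a new video is the maximum over sub-tag scores. In the sketch below, logistic regression stands in for the paper's LogitBoost-based learner and k-means on features stands in for the co-watch-based initialization, so it should be read as an approximation of the idea rather than the method itself.

```python
# Approximate sketch of latent sub-tag learning (stand-in learners throughout).
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def train_latent_subtags(X, y, num_subtags=3, iters=5):
    """X: (n, d) features; y: 0/1 tag labels. Returns one classifier per sub-tag."""
    pos_idx = np.where(y == 1)[0]
    assign = KMeans(n_clusters=num_subtags, n_init=10, random_state=0).fit_predict(X[pos_idx])
    clfs = []
    for _ in range(iters):
        clfs = []
        for k in range(num_subtags):
            y_k = np.zeros(len(X), dtype=int)
            y_k[pos_idx[assign == k]] = 1
            if y_k.sum() == 0:
                continue  # a sub-tag lost all its videos; drop it
            clfs.append(LogisticRegression(max_iter=1000).fit(X, y_k))
        # Re-assign each positive video to its best-scoring latent sub-tag.
        scores = np.stack([c.predict_proba(X[pos_idx])[:, 1] for c in clfs], axis=1)
        assign = scores.argmax(axis=1)
    return clfs

def predict_tag(clfs, x):
    """Tag score = max over latent sub-tag scores."""
    return max(c.predict_proba(x.reshape(1, -1))[0, 1] for c in clfs)

# Toy usage: two latent "styles" hidden inside one tag.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = (X[:, 0] > 0).astype(int)
print(round(predict_tag(train_latent_subtags(X, y), X[0]), 3))
```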


acm multimedia | 2009

Automatic, efficient, temporally-coherent video enhancement for large scale applications

George Toderici; Jay Yagnik

A fast and robust method for video contrast enhancement is presented. The method uses the histogram of each frame, along with upper and lower bounds computed per shot in order to enhance the current frame. This ensures that the artifacts introduced during the enhancement are reduced to a minimum. Traditional methods that do not compute per-shot estimates tend to over-enhance parts of the video such as fades and transitions. Our method does not suffer from this problem, which is essential for a fully automatic algorithm. We present the parameters for our method that yielded the best human feedback: out of 208 videos, 203 were enhanced, while the remaining 5 were of too poor quality to be enhanced. Additionally, we present a visual comparison of our work with the recently proposed Weighted Thresholded Histogram Equalization (WTHE) algorithm.
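The per-shot idea is easy to illustrate: estimate lower and upper intensity bounds once for the whole shot, then apply the same linear stretch to every frame, so fades and transitions are not over-enhanced. The snippet below is a minimal percentile-based stand-in, not the paper's exact enhancement function or parameters.

```python
# Minimal per-shot contrast stretch (a stand-in for the paper's method).
import numpy as np

def enhance_shot(frames, low_pct=1.0, high_pct=99.0):
    """frames: list of (H, W) grayscale arrays in [0, 255] belonging to one shot."""
    all_pixels = np.concatenate([f.ravel() for f in frames])
    lo, hi = np.percentile(all_pixels, [low_pct, high_pct])  # shared per-shot bounds
    if hi <= lo:
        return [f.copy() for f in frames]  # flat shot: nothing to stretch
    out = []
    for f in frames:
        stretched = (f.astype(np.float32) - lo) * 255.0 / (hi - lo)
        out.append(np.clip(stretched, 0, 255).astype(np.uint8))
    return out

# Toy usage: a dark three-frame "shot".
rng = np.random.default_rng(0)
shot = [rng.integers(20, 90, size=(4, 4)).astype(np.uint8) for _ in range(3)]
enhanced = enhance_shot(shot)
print(shot[0].min(), shot[0].max(), "->", enhanced[0].min(), enhanced[0].max())
```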


acm multimedia | 2015

Introduction to: Special Issue on Extended Best Papers from ACM Multimedia 2014

Hayley Hung; George Toderici

This special issue continues the tradition of inviting the best papers from ACM Multimedia to extend their work into a journal article. In 2014, the conference was held in Orlando, FL, USA, and a number of new areas were introduced that year. The two articles presented in this special issue came from the Deep Learning for Multimedia area and the Emotional and Social Signals in Multimedia area. As usual, a rigorous review process was carried out, followed by an intense two-day co-located technical program committee meeting. Selecting the final set of best-paper candidates was an intense process for all concerned, sparking much debate about how important it is to have best-paper candidates that are multimodal and take a fresh perspective on new topics. The following best-paper extensions underwent a rigorous review procedure to ensure that the work was sufficiently extended compared to the respective conference versions. We thank the anonymous reviewers who helped to ensure the quality of these two extended papers.

The first article, “Emotion Recognition During Speech Using Dynamics of Multiple Regions of Face” by Yelin Kim and Emily Mower-Provost, addresses the challenging task of performing automated facial emotion recognition while someone is speaking. The authors exploit the context of the speech to disambiguate facial behavior caused by speech production from true expressions of facial emotion. They investigate an unsupervised method of segmenting the facial movements due to speech, demonstrating an improvement in facial-emotion recognition performance on the IEMOCAP and SAVEE datasets. Importantly, they relate their experimental findings to existing emotion perception studies. This work is particularly valuable for the development of more naturalistic, human-centered, and emotionally aware multimedia interfaces.

The second article, “Correspondence Autoencoders for Cross-Modal Retrieval” by Fangxiang Feng, Xiaojie Wang, and Ruifan Li, tackles cross-modal retrieval with a correspondence autoencoder that connects the text and image modalities. This enables users to issue text queries and have images retrieved using their shared representations. The authors present three distinct architectures for achieving this. A correspondence cross-modal autoencoder reconstructs its input, which may consist of text phrases or images, while using a shared bottleneck layer (with the text and image belonging to the same entity). In the second variant, the full-modal architecture, both inputs must be reconstructed given a single modality. The final deep architecture employs restricted Boltzmann machines. Experimental results show that the described architectures improve upon previously published work in this domain on the Wikipedia, Pascal, and NUS-WIDE-10k datasets. Moreover, the authors
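As a rough illustration of the correspondence-autoencoder idea summarized above: an image branch and a text branch each reconstruct their own input while a correspondence term pulls their bottleneck codes together, which is what makes cross-modal retrieval over the shared representation possible. The dimensions, single-layer encoders, and loss weighting below are assumptions made for the sketch, not the article's architecture.

```python
# Rough sketch of a correspondence autoencoder (assumed dimensions and losses).
import torch
import torch.nn as nn
import torch.nn.functional as F

class CorrespondenceAE(nn.Module):
    def __init__(self, img_dim=512, txt_dim=128, code_dim=64):
        super().__init__()
        self.img_enc = nn.Sequential(nn.Linear(img_dim, code_dim), nn.ReLU())
        self.txt_enc = nn.Sequential(nn.Linear(txt_dim, code_dim), nn.ReLU())
        self.img_dec = nn.Linear(code_dim, img_dim)
        self.txt_dec = nn.Linear(code_dim, txt_dim)

    def forward(self, img, txt):
        z_img, z_txt = self.img_enc(img), self.txt_enc(txt)
        return self.img_dec(z_img), self.txt_dec(z_txt), z_img, z_txt

def loss_fn(model, img, txt, alpha=0.5):
    img_hat, txt_hat, z_img, z_txt = model(img, txt)
    # Each branch reconstructs its own modality; the correspondence term ties
    # together the codes of an image and a text describing the same entity.
    recon = F.mse_loss(img_hat, img) + F.mse_loss(txt_hat, txt)
    return recon + alpha * F.mse_loss(z_img, z_txt)

model = CorrespondenceAE()
print(loss_fn(model, torch.randn(8, 512), torch.randn(8, 128)).item())
```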


acm multimedia | 2009

Adaptive, selective, automatic tonal enhancement of faces

Hrishikesh Aradhye; George Toderici; Jay Yagnik

This paper presents an efficient, personalizable and yet completely automatic algorithm for enhancing the brightness, tonal balance, and contrast of faces in thumbnails of online videos where multiple colored illumination sources are the norm and artifacts such as poor illumination and backlight are common. These artifacts significantly lower the perceptual quality of faces and skin, and cannot be easily corrected by common global image transforms. The same identifiable user, however, often uploads or participates in multiple photos, videos, or video chat sessions with varying illumination conditions. The proposed algorithm adaptively transforms the skin pixels in a poor illumination environment to match the skin color model of a prototypical face of the same user in a better illumination environment. It leaves the remaining non-skin portions of the image virtually unchanged while ascertaining a smooth, natural appearance. A component of our system automatically selects such a prototypical face for each user given a collection of uploaded videos/photo albums or prior video chat sessions by that user. We present several human rating studies on YouTube data that quantitatively demonstrate significant improvement in facial quality using the proposed algorithm.
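The adaptation step can be pictured as a statistics transfer restricted to skin pixels: the skin distribution of the poorly lit face is mapped, channel by channel, onto that of the user's prototypical well-lit face, leaving non-skin pixels untouched. The snippet below is a simplified mean/std transfer, a stand-in for the paper's skin color model rather than its actual transform.

```python
# Simplified skin-statistics transfer (a stand-in for the paper's model).
import numpy as np

def enhance_face(image, skin_mask, prototype_pixels):
    """image: (H, W, 3) float RGB; skin_mask: (H, W) bool;
    prototype_pixels: (N, 3) skin pixels from a well-lit face of the same user."""
    out = image.astype(np.float32).copy()
    skin = out[skin_mask]
    src_mu, src_sd = skin.mean(axis=0), skin.std(axis=0) + 1e-6
    dst_mu, dst_sd = prototype_pixels.mean(axis=0), prototype_pixels.std(axis=0)
    # Map the current skin distribution onto the prototype's, per channel;
    # everything outside the mask is left unchanged.
    out[skin_mask] = (skin - src_mu) / src_sd * dst_sd + dst_mu
    return np.clip(out, 0, 255)

# Toy usage: an under-exposed face region corrected toward a brighter prototype.
rng = np.random.default_rng(0)
img = rng.uniform(0, 80, size=(6, 6, 3))
mask = np.zeros((6, 6), dtype=bool)
mask[1:5, 1:5] = True
proto = rng.uniform(120, 200, size=(50, 3))
print(enhance_face(img, mask, proto)[mask].mean(axis=0).round(1))
```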


international conference on learning representations | 2016

Variable Rate Image Compression with Recurrent Neural Networks

George Toderici; Sean O'Malley; Sung Jin Hwang; Damien Vincent; David Minnen; Shumeet Baluja; Michele Covell; Rahul Sukthankar


computer vision and pattern recognition | 2018

AVA: A Video Dataset of Spatio-Temporally Localized Atomic Visual Actions

Chunhui Gu; Chen Sun; David A. Ross; Carl Vondrick; Caroline Pantofaru; Yeqing Li; Sudheendra Vijayanarasimhan; George Toderici; Susanna Ricco; Rahul Sukthankar; Cordelia Schmid; Jitendra Malik


Archive | 2010

Learning concepts for video annotation

Hrishikesh Aradhye; George Toderici; Jay Yagnik
