Publications


Featured research published by Subhashini Venugopalan.


International Conference on Computer Vision | 2015

Sequence to Sequence -- Video to Text

Subhashini Venugopalan; Marcus Rohrbach; Jeffrey Donahue; Raymond J. Mooney; Trevor Darrell; Kate Saenko

Real-world videos often have complex dynamics; methods for generating open-domain video descriptions should be sensitive to temporal structure and allow both input (sequence of frames) and output (sequence of words) of variable length. To approach this problem we propose a novel end-to-end sequence-to-sequence model to generate captions for videos. For this we exploit recurrent neural networks, specifically LSTMs, which have demonstrated state-of-the-art performance in image caption generation. Our LSTM model is trained on video-sentence pairs and learns to associate a sequence of video frames with a sequence of words in order to generate a description of the event in the video clip. Our model is naturally able to learn the temporal structure of the sequence of frames as well as the sequence model of the generated sentences, i.e., a language model. We evaluate several variants of our model that exploit different visual features on a standard set of YouTube videos and two movie description datasets (M-VAD and MPII-MD).
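
For readers who want a concrete picture of the encoder-decoder idea described above, the following is a minimal sketch in PyTorch, assuming pre-extracted per-frame CNN features and a single shared LSTM; the dimensions, vocabulary size, and single-layer structure are illustrative simplifications, not the paper's exact configuration.

```python
# Minimal sketch of a sequence-to-sequence video captioner: encode the frame
# sequence with an LSTM, then decode a word sequence from the final state.
# Dimensions and vocabulary size are illustrative assumptions.
import torch
import torch.nn as nn

class Seq2SeqCaptioner(nn.Module):
    def __init__(self, feat_dim=4096, hidden=512, vocab_size=10000, embed=512):
        super().__init__()
        self.frame_proj = nn.Linear(feat_dim, embed)          # project CNN features
        self.word_embed = nn.Embedding(vocab_size, embed)
        self.lstm = nn.LSTM(embed, hidden, batch_first=True)  # shared LSTM
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, captions):
        # Encoding stage: read the frame sequence, keep the final state.
        enc_in = self.frame_proj(frame_feats)                 # (B, T_frames, embed)
        _, state = self.lstm(enc_in)
        # Decoding stage: predict words conditioned on the encoded video.
        dec_in = self.word_embed(captions[:, :-1])            # teacher forcing
        dec_out, _ = self.lstm(dec_in, state)
        return self.out(dec_out)                              # (B, T_words-1, vocab)
```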


North American Chapter of the Association for Computational Linguistics | 2015

Translating Videos to Natural Language Using Deep Recurrent Neural Networks

Subhashini Venugopalan; Huijuan Xu; Jeff Donahue; Marcus Rohrbach; Raymond J. Mooney; Kate Saenko

Solving the visual symbol grounding problem has long been a goal of artificial intelligence. The field appears to be advancing closer to this goal with recent breakthroughs in deep learning for natural language grounding in static images. In this paper, we propose to translate videos directly to sentences using a unified deep neural network with both convolutional and recurrent structure. Described video datasets are scarce, and most existing methods have been applied to toy domains with a small vocabulary of possible words. By transferring knowledge from 1.2M+ images with category labels and 100,000+ images with captions, our method is able to create sentence descriptions of open-domain videos with large vocabularies. We compare our approach with recent work using language generation metrics, subject, verb, and object prediction accuracy, and a human evaluation.
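
As an illustration of conditioning a recurrent decoder on a whole video, the sketch below mean-pools per-frame CNN features into a single vector and uses it to initialize an LSTM decoder; the pooling choice and dimensions are assumptions for illustration, not necessarily the paper's exact setup.

```python
# Sketch of frame pooling for video description: per-frame CNN features are
# mean-pooled into one video vector that initializes an LSTM word decoder.
import torch
import torch.nn as nn

class PooledVideoCaptioner(nn.Module):
    def __init__(self, feat_dim=4096, hidden=512, vocab_size=10000):
        super().__init__()
        self.video_proj = nn.Linear(feat_dim, hidden)
        self.word_embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, frame_feats, captions):
        video = frame_feats.mean(dim=1)                        # (B, feat_dim)
        h0 = torch.tanh(self.video_proj(video)).unsqueeze(0)   # (1, B, hidden)
        c0 = torch.zeros_like(h0)
        dec_in = self.word_embed(captions[:, :-1])             # teacher forcing
        dec_out, _ = self.lstm(dec_in, (h0, c0))
        return self.out(dec_out)                               # (B, T-1, vocab)
```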


Computer Vision and Pattern Recognition | 2016

Deep Compositional Captioning: Describing Novel Object Categories without Paired Training Data

Lisa Anne Hendricks; Subhashini Venugopalan; Marcus Rohrbach; Raymond J. Mooney; Kate Saenko; Trevor Darrell

While recent deep neural network models have achieved promising results on the image captioning task, they rely largely on the availability of corpora with paired image and sentence captions to describe objects in context. In this work, we propose the Deep Compositional Captioner (DCC) to address the task of generating descriptions of novel objects which are not present in paired image-sentence datasets. Our method achieves this by leveraging large object recognition datasets and external text corpora and by transferring knowledge between semantically similar concepts. Current deep caption models can only describe objects contained in paired image-sentence corpora, despite the fact that they are pre-trained with large object recognition datasets, namely ImageNet. In contrast, our model can compose sentences that describe novel objects and their interactions with other objects. We demonstrate our model's ability to describe novel concepts by empirically evaluating its performance on MSCOCO and show qualitative results on ImageNet images of objects for which no paired image-sentence data exist. Further, we extend our approach to generate descriptions of objects in video clips. Our results show that DCC has distinct advantages over existing image and video captioning approaches for generating descriptions of new objects in context.
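
One way to picture the knowledge-transfer step is sketched below: the decoder's output weights for a novel object word are initialized from a semantically similar word that does appear in paired data, using word-embedding similarity. This is a simplified sketch under that assumption, not the paper's exact transfer procedure.

```python
# Sketch of transfer between semantically similar concepts: copy the caption
# decoder's output-layer row for the nearest known word to a novel word.
# The embedding source and nearest-neighbor rule are illustrative assumptions.
import torch
import torch.nn.functional as F

def transfer_output_weights(out_layer, word_vecs, vocab, novel_word, known_words):
    """Initialize the decoder row for `novel_word` from its nearest known word."""
    novel_vec = word_vecs[vocab[novel_word]]                      # (D,)
    known_ids = torch.tensor([vocab[w] for w in known_words])
    sims = F.cosine_similarity(novel_vec.unsqueeze(0), word_vecs[known_ids])
    nearest = known_ids[sims.argmax()]
    with torch.no_grad():
        out_layer.weight[vocab[novel_word]] = out_layer.weight[nearest]
        out_layer.bias[vocab[novel_word]] = out_layer.bias[nearest]
```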


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2017

Long-Term Recurrent Convolutional Networks for Visual Recognition and Description

Jeff Donahue; Lisa Anne Hendricks; Marcus Rohrbach; Subhashini Venugopalan; Sergio Guadarrama; Kate Saenko; Trevor Darrell

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or “temporally deep”, are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are “doubly deep” in that they can be compositional in spatial and temporal “layers”. Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they can directly map variable-length inputs (e.g., video frames) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
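
As a minimal illustration of the "CNN per frame, then recurrence over time" pattern, the sketch below feeds pre-extracted frame features through an LSTM and averages per-step class predictions for sequence recognition; the feature extractor, dimensions, and class count are assumptions, and the full model would train the convnet jointly.

```python
# Sketch of a recurrent-convolutional recognizer: per-frame CNN features
# pass through an LSTM; per-timestep class scores are averaged over time.
import torch
import torch.nn as nn

class RecurrentConvClassifier(nn.Module):
    def __init__(self, feat_dim=4096, hidden=256, num_classes=101):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.classifier = nn.Linear(hidden, num_classes)

    def forward(self, frame_feats):                  # (B, T, feat_dim)
        seq_out, _ = self.lstm(frame_feats)          # (B, T, hidden)
        logits = self.classifier(seq_out)            # (B, T, num_classes)
        return logits.mean(dim=1)                    # average predictions over time
```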


Empirical Methods in Natural Language Processing | 2016

Improving LSTM-based Video Description with Linguistic Knowledge Mined from Text

Subhashini Venugopalan; Lisa Anne Hendricks; Raymond J. Mooney; Kate Saenko

This paper investigates how linguistic knowledge mined from large text corpora can aid the generation of natural language descriptions of videos. Specifically, we integrate both a neural language model and distributional semantics trained on large text corpora into a recent LSTM-based architecture for video description. We evaluate our approach on a collection of YouTube videos as well as two large movie description datasets, showing significant improvements in grammaticality while modestly improving descriptive quality.
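
One simple way to integrate an external language model of the kind described here is late fusion at decoding time, sketched below: the caption model's next-word distribution is interpolated with the text-trained language model's distribution. The mixing weight is an illustrative hyperparameter, and this shows only one possible integration strategy.

```python
# Sketch of late fusion: interpolate the video captioner's next-word
# distribution with an external language model's distribution in log space.
import torch

def fused_next_word_logprobs(caption_logits, lm_logits, alpha=0.5):
    """Combine two next-word distributions; both inputs are (B, vocab) logits."""
    cap_logp = torch.log_softmax(caption_logits, dim=-1)
    lm_logp = torch.log_softmax(lm_logits, dim=-1)
    return alpha * cap_logp + (1.0 - alpha) * lm_logp
```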


Computer Vision and Pattern Recognition | 2017

Captioning Images with Diverse Objects

Subhashini Venugopalan; Lisa Anne Hendricks; Marcus Rohrbach; Raymond J. Mooney; Trevor Darrell; Kate Saenko

Recent captioning models are limited in their ability to scale and describe concepts unseen in paired image-text corpora. We propose the Novel Object Captioner (NOC), a deep visual semantic captioning model that can describe a large number of object categories not present in existing image-caption datasets. Our model takes advantage of external sources – labeled images from object recognition datasets, and semantic knowledge extracted from unannotated text. We propose minimizing a joint objective which can learn from these diverse data sources and leverage distributional semantic embeddings, enabling the model to generalize and describe novel objects outside of image-caption datasets. We demonstrate that our model exploits semantic information to generate captions for hundreds of object categories in the ImageNet object recognition dataset that are not observed in MSCOCO image-caption training data, as well as many categories that are observed very rarely. Both automatic evaluations and human judgements show that our model considerably outperforms prior work in being able to describe many more categories of objects.
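
The joint objective can be pictured as a weighted sum of per-source losses, sketched below for paired image-caption data, labeled images, and unannotated text; the specific loss forms and weights are illustrative assumptions, not the paper's exact formulation.

```python
# Sketch of a joint objective over heterogeneous sources: captioning loss on
# paired data, classification loss on labeled images, and a language-modeling
# loss on unannotated text. Loss forms and weights are assumptions.
import torch.nn.functional as F

def joint_loss(caption_logits, caption_targets,
               image_logits, image_labels,
               lm_logits, lm_targets,
               w_cap=1.0, w_img=1.0, w_lm=1.0):
    cap = F.cross_entropy(caption_logits, caption_targets)   # paired image-caption data
    img = F.cross_entropy(image_logits, image_labels)        # object recognition data
    lm = F.cross_entropy(lm_logits, lm_targets)              # external text corpus
    return w_cap * cap + w_img * img + w_lm * lm
```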


Workshop on Applications of Computer Vision | 2017

Semantic Text Summarization of Long Videos

Shagan Sah; Sourabh Kulhare; Allison Gray; Subhashini Venugopalan; Emily Prud'hommeaux; Raymond W. Ptucha

Long videos captured by consumers are typically tied to some of the most important moments of their lives, yet ironically are often the least frequently watched. The time required to initially retrieve and watch sections can be daunting. In this work we propose novel techniques for summarizing and annotating long videos. Existing video summarization techniques focus exclusively on identifying keyframes and subshots; however, evaluating these summarized videos is a challenging task. Our work proposes methods to generate visual summaries of long videos, and in addition proposes techniques to annotate and generate textual summaries of the videos using recurrent networks. Interesting segments of long video are extracted based on image quality as well as cinematographic and consumer preferences. Keyframes from the most impactful segments are converted to textual annotations using sequential encoding and decoding deep learning models. Our summarization technique is benchmarked on the VideoSet dataset and evaluated by humans for informative and linguistic content. We believe this to be the first fully automatic method capable of simultaneous visual and textual summarization of long consumer videos.
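
At a high level the pipeline can be pictured as: score candidate segments, keep the top-ranked ones, and caption a keyframe from each. The snippet below shows that control flow with placeholder scoring, keyframe-selection, and captioning components standing in for the paper's models.

```python
# Pipeline sketch of summarize-then-describe: rank segments, keep the top-k,
# and caption one keyframe per kept segment. All components are placeholders.
def summarize_and_describe(segments, score_fn, pick_keyframe, captioner, k=5):
    """segments: list of frame sequences; returns (segment, caption) pairs."""
    ranked = sorted(segments, key=score_fn, reverse=True)[:k]
    return [(seg, captioner(pick_keyframe(seg))) for seg in ranked]
```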


Archive | 2015

Natural Language Video Description using Deep Recurrent Neural Networks

Subhashini Venugopalan

For most people, watching a brief video and describing what happened (in words) is an easy task. For machines, extracting the meaning from video pixels and generating a sentence description is a very complex problem. The goal of my research is to develop models that can automatically generate natural language (NL) descriptions for events in videos. As a first step, this proposal presents deep recurrent neural network models for video-to-text generation. I build on recent deep machine learning approaches to develop video description models using a unified deep neural network with both convolutional and recurrent structure. This technique treats the video domain as another language and takes a machine translation approach, using the deep network to translate videos to text. In my initial approach, I adapt a model that can learn on images and captions to transfer knowledge from this auxiliary task to generate descriptions for short video clips. Next, I present an end-to-end deep network that can jointly model a sequence of video frames and a sequence of words. The second part of the proposal outlines a set of models to significantly extend work in this area. Specifically, I propose techniques to integrate linguistic knowledge from plain text corpora, and attention methods to focus on objects and track their interactions to generate more diverse and accurate descriptions. To move beyond short video clips, I also outline models to process multi-activity movie videos, learning to jointly segment and describe coherent event sequences. I propose further extensions to take advantage of movie scripts and subtitle information to generate richer descriptions.


IEEE 5th International Conference on Internet Multimedia Systems Architecture and Application | 2011

People and entity retrieval in implicit social networks

Suman K. Pathapati; Subhashini Venugopalan; Ashok Pon Kumar; Anuradha Bhamidipaty

Online social networks can be viewed as implicit real-world networks that manage to capture a wealth of information about heterogeneous nodes and edges, which are highly interconnected. Such abundant data can be beneficial in finding and retrieving relevant people and entities within these networks. Effective methods of achieving this can be useful in systems ranging from recommender systems to people and entity discovery systems. Our main contribution in this paper is the proposal of a novel localized algorithm that operates on a subgraph of the social graph and retrieves relevant people or entities. We also demonstrate how such an algorithm can be used in large real-world social networks and graphs to efficiently retrieve relevant people/entities.
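
For a rough sense of what localized retrieval on a subgraph can look like, the sketch below ranks nodes around a query node using personalized PageRank over its ego network; this is a generic illustration of the idea, not the paper's specific algorithm.

```python
# Generic local-ranking sketch: restrict attention to the subgraph around a
# query node and rank its members with personalized PageRank. Illustrative
# only; not the localized algorithm proposed in the paper.
import networkx as nx

def retrieve_relevant(graph, query_node, radius=2, top_k=10):
    """Rank nodes near `query_node` by personalized PageRank on its ego network."""
    subgraph = nx.ego_graph(graph, query_node, radius=radius)
    scores = nx.pagerank(subgraph, personalization={query_node: 1.0})
    scores.pop(query_node, None)                      # exclude the query itself
    return sorted(scores, key=scores.get, reverse=True)[:top_k]
```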


Computer Vision and Pattern Recognition | 2015

Long-Term Recurrent Convolutional Networks for Visual Recognition and Description

Jeff Donahue; Lisa Anne Hendricks; Sergio Guadarrama; Marcus Rohrbach; Subhashini Venugopalan; Trevor Darrell; Kate Saenko

Collaboration


Dive into Subhashini Venugopalan's collaborations. Top co-authors:

Raymond J. Mooney (University of Texas at Austin)
Trevor Darrell (University of California)
Jeff Donahue (University of California)