Network


Latest external collaborations at the country level. Dive into the details by clicking on the dots.

Hotspot


Dive into the research topics where Rakshith Shetty is active.

Publication


Featured research published by Rakshith Shetty.


International Conference on Computer Vision | 2017

Speaking the Same Language: Matching Machine to Human Captions by Adversarial Training

Rakshith Shetty; Marcus Rohrbach; Lisa Anne Hendricks; Mario Fritz; Bernt Schiele

While strong progress has been made in image captioning recently, machine and human captions are still quite distinct. This is primarily due to deficiencies in the generated word distribution and vocabulary size, and to a strong bias in the generators towards frequent captions. Furthermore, humans, rightfully so, generate multiple diverse captions, owing to the inherent ambiguity of the captioning task, which is not explicitly considered in today's systems. To address these challenges, we change the training objective of the caption generator from reproducing ground-truth captions to generating a set of captions that is indistinguishable from human-written captions. Instead of handcrafting such a learning target, we employ adversarial training in combination with an approximate Gumbel sampler to implicitly match the generated distribution to the human one. While our method achieves performance comparable to the state of the art in terms of the correctness of the captions, we generate a set of diverse captions that are significantly less biased and better match the global uni-, bi-, and tri-gram distributions of the human captions.
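For illustration, a minimal sketch of the straight-through Gumbel-Softmax sampling step that this kind of adversarial caption training relies on to keep the generator differentiable; the function name, shapes, and temperature are assumptions, not the paper's code.

    # Minimal sketch (not the authors' implementation): straight-through
    # Gumbel-Softmax sampling, an approximate discrete sampler that lets
    # gradients from a caption discriminator flow back into the generator.
    import torch
    import torch.nn.functional as F

    def gumbel_softmax_sample(logits: torch.Tensor, temperature: float = 0.5) -> torch.Tensor:
        """Draw a differentiable one-hot-like sample over the vocabulary.

        logits: (batch, vocab_size) unnormalized scores from the caption generator.
        Forward pass returns a hard one-hot sample; backward pass uses the
        soft relaxation (straight-through estimator).
        """
        gumbel_noise = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
        y_soft = F.softmax((logits + gumbel_noise) / temperature, dim=-1)
        index = y_soft.argmax(dim=-1, keepdim=True)
        y_hard = torch.zeros_like(y_soft).scatter_(-1, index, 1.0)
        # Hard sample in the forward pass, soft gradients in the backward pass.
        return y_hard - y_soft.detach() + y_soft

    # Usage idea: word_embedding = sampled_one_hot @ embedding_matrix keeps the
    # generator-discriminator pipeline end-to-end differentiable.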


International Conference on Computer Vision | 2017

Paying Attention to Descriptions Generated by Image Captioning Models

Hamed R. Tavakoli; Rakshith Shetty; Ali Borji; Jorma Laaksonen

To bridge the gap between humans and machines in image understanding and description, we need further insight into how people describe a perceived scene. In this paper, we study the agreement between bottom-up saliency-based visual attention and object referrals in scene-description constructs. We investigate the properties of human-written descriptions and machine-generated ones. We then propose a saliency-boosted image captioning model in order to investigate the benefits of low-level cues in language models. We learn that (1) humans mention more salient objects earlier than less salient ones in their descriptions, (2) the better a captioning model performs, the better its attention agreement with human descriptions, (3) the proposed saliency-boosted model, compared to its baseline, does not improve significantly on the MS COCO database, indicating that explicit bottom-up boosting does not help when the task is well learned and tuned on a dataset, and (4) better generalization is, however, observed for the saliency-boosted model on unseen data.
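As an aside, a small illustrative check of finding (1) above, assuming hypothetical object saliency scores and an example caption: it measures whether more salient objects are mentioned earlier.

    # Toy sketch only: rank correlation between object saliency and mention order.
    # Object names, saliency scores, and the caption are illustrative, not data
    # from the paper.
    from scipy.stats import spearmanr

    saliency = {"dog": 0.82, "frisbee": 0.64, "grass": 0.21}   # assumed saliency scores
    caption = "a dog catches a frisbee on the grass".split()

    mention_rank = {w: i for i, w in enumerate(caption) if w in saliency}
    objects = list(mention_rank)
    rho, _ = spearmanr([-saliency[o] for o in objects],        # higher saliency -> earlier
                       [mention_rank[o] for o in objects])
    print(f"saliency/mention-order agreement (Spearman rho): {rho:.2f}")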


IEEE MultiMedia | 2018

Image and Video Captioning with Augmented Neural Architectures

Rakshith Shetty; Hamed Rezazadegan Tavakoli; Jorma Laaksonen

Neural-network-based image and video captioning can be substantially improved by utilizing architectures that make use of special features from the scene context, objects, and locations. A novel discriminatively trained evaluator network for choosing the best caption among those generated by an ensemble of caption generator networks further improves accuracy.
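A hedged sketch of how a discriminatively trained evaluator might re-rank captions produced by an ensemble of generators; the evaluator architecture, feature dimensions, and function names below are assumptions, not the published design.

    # Illustrative sketch: score each candidate caption against the image with a
    # small evaluator network and keep the highest-scoring one.
    import torch
    import torch.nn as nn

    class CaptionEvaluator(nn.Module):
        """Scores how well an encoded caption matches an image feature vector."""
        def __init__(self, image_dim: int, caption_dim: int, hidden: int = 256):
            super().__init__()
            self.scorer = nn.Sequential(
                nn.Linear(image_dim + caption_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, 1))

        def forward(self, image_feat: torch.Tensor, caption_feat: torch.Tensor) -> torch.Tensor:
            return self.scorer(torch.cat([image_feat, caption_feat], dim=-1)).squeeze(-1)

    def pick_best(evaluator: CaptionEvaluator,
                  image_feat: torch.Tensor,
                  caption_feats: torch.Tensor) -> int:
        # caption_feats: (num_candidates, caption_dim) encodings of ensemble outputs.
        scores = evaluator(image_feat.expand(caption_feats.size(0), -1), caption_feats)
        return int(scores.argmax())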


ACM Multimedia | 2016

Exploiting Scene Context for Image Captioning

Rakshith Shetty; Hamed Rezazadegan Tavakoli; Jorma Laaksonen

This paper presents a framework for image captioning that exploits the scene context. To date, most captioning models have relied on a combination of a Convolutional Neural Network (CNN) and a Long Short-Term Memory (LSTM) model, trained in an end-to-end fashion. Recently, there has been extensive research on improving the language model and the CNN architecture, utilizing attention mechanisms, and improving the learning techniques in such systems. A less studied area is the contribution of the scene context to captioning. In this work, we study the role of the scene context, consisting of the scene type and objects. To this end, we augment the CNN features with scene-context features, including scene detectors, objects and their localization, and their combinations. We use the scene-context features as an initialization feature at the zeroth time step of an LSTM model with deep residual connections. In subsequent time steps, however, the model uses the original CNN features. The proposed language model, contrary to more conventional ones, thus has access to visual features throughout the whole process of sentence generation. We demonstrate that the scene-context features affect language formation and improve the captioning results in the proposed framework. We also report results on the Microsoft COCO benchmark, where our model achieves state-of-the-art performance on the test set.
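A minimal sketch of the initialization scheme described above, assuming a plain decoder without the deep residual connections; module names, dimensions, and the zero word embedding at step 0 are illustrative choices, not the published implementation.

    # Sketch: scene-context features drive the zeroth LSTM step, ordinary CNN
    # features are fed at every later step alongside the previous word embedding.
    import torch
    import torch.nn as nn

    class ContextInitCaptioner(nn.Module):
        def __init__(self, cnn_dim: int, ctx_dim: int, embed_dim: int, hidden: int, vocab: int):
            super().__init__()
            self.ctx_proj = nn.Linear(ctx_dim, embed_dim)   # scene context -> step-0 input
            self.cnn_proj = nn.Linear(cnn_dim, embed_dim)   # CNN features for later steps
            self.embed = nn.Embedding(vocab, embed_dim)
            self.lstm = nn.LSTM(2 * embed_dim, hidden, batch_first=True)
            self.out = nn.Linear(hidden, vocab)

        def forward(self, cnn_feat: torch.Tensor, ctx_feat: torch.Tensor,
                    tokens: torch.Tensor) -> torch.Tensor:
            # Step 0: scene-context features paired with a zero word embedding.
            step0 = torch.cat([self.ctx_proj(ctx_feat),
                               torch.zeros_like(self.embed(tokens[:, :1]).squeeze(1))], dim=-1)
            # Later steps: CNN features concatenated with the previous word embedding,
            # so visual information stays available throughout generation.
            later = torch.cat([self.cnn_proj(cnn_feat).unsqueeze(1).expand(-1, tokens.size(1), -1),
                               self.embed(tokens)], dim=-1)
            inputs = torch.cat([step0.unsqueeze(1), later], dim=1)
            hidden_states, _ = self.lstm(inputs)
            return self.out(hidden_states)   # (batch, steps, vocab) word scores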


arXiv: Computer Vision and Pattern Recognition | 2015

Video captioning with recurrent networks based on frame- and video-level features and visual content classification

Rakshith Shetty; Jorma Laaksonen


USENIX Security Symposium | 2017

A4NT: Author Attribute Anonymity by Adversarial Training of Neural Machine Translation

Rakshith Shetty; Bernt Schiele; Mario Fritz


arXiv: Computer Vision and Pattern Recognition | 2017

Can Saliency Information Benefit Image Captioning Models?

Hamed Rezazadegan Tavakoli; Rakshith Shetty; Ali Borji; Jorma Laaksonen


Neural Information Processing Systems | 2018

Adversarial Scene Editing: Automatic Object Removal from Weak Supervision

Rakshith Shetty; Mario Fritz; Bernt Schiele


arXiv: Computer Vision and Pattern Recognition | 2018

Answering Visual What-If Questions: From Actions to Predicted Scene Descriptions

M. Wagner; H. Basevi; Rakshith Shetty; Wenbin Li; M. Malinowski; Mario Fritz; Aleš Leonardis

Collaboration


Dive into Rakshith Shetty's collaborations.

Top Co-Authors

Ali Borji

University of Central Florida
