Publications


Featured research published by Jay Yagnik.


International World Wide Web Conference | 2008

Video suggestion and discovery for youtube: taking random walks through the view graph

Shumeet Baluja; Rohan Seth; Dandapani Sivakumar; Yushi Jing; Jay Yagnik; Shankar Kumar; Deepak Ravichandran; Mohamed Aly

The rapid growth of the number of videos on YouTube provides enormous potential for users to find content of interest to them. Unfortunately, given the difficulty of searching videos, the size of the video repository also makes the discovery of new content a daunting task. In this paper, we present a novel method based upon the analysis of the entire user-video graph to provide personalized video suggestions for users. The resulting algorithm, termed Adsorption, provides a simple method to efficiently propagate preference information through a variety of graphs. We extensively test the results of the recommendations on a three-month snapshot of live data from YouTube.
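
The core of the approach is graph-based label propagation, so a small illustration may help. The sketch below is a minimal, simplified rendering of Adsorption-style propagation on a toy user-video graph: seed nodes repeatedly inject their known preferences, every node blends in the label distributions of its neighbors, and after a few rounds users accumulate weight on videos they have not watched. The graph, weights, injection rule, and iteration count are illustrative assumptions, not the paper's exact formulation.

```python
from collections import defaultdict

def adsorption(edges, seed_labels, num_iters=10, inject=1.0):
    """Propagate label distributions through a weighted graph.

    edges: node -> {neighbor: weight}; seed_labels: node -> {label: weight}.
    """
    labels = {}
    for _ in range(num_iters):
        new_labels = {}
        for node, nbrs in edges.items():
            dist = defaultdict(float)
            # Seed nodes continuously inject their known preferences.
            for label, p in seed_labels.get(node, {}).items():
                dist[label] += inject * p
            # Every node blends in the label distributions of its neighbors.
            for nbr, w in nbrs.items():
                for label, p in labels.get(nbr, {}).items():
                    dist[label] += w * p
            total = sum(dist.values())
            if total > 0:
                new_labels[node] = {l: p / total for l, p in dist.items()}
        labels = new_labels
    return labels

# Toy graph: edges encode "user watched video" (and the reverse direction).
graph = {
    "user_a": {"video_1": 1.0, "video_2": 1.0},
    "user_b": {"video_2": 1.0, "video_3": 1.0},
    "video_1": {"user_a": 1.0},
    "video_2": {"user_a": 1.0, "user_b": 1.0},
    "video_3": {"user_b": 1.0},
}
seeds = {"user_a": {"video_1": 1.0}, "user_b": {"video_3": 1.0}}
# user_a acquires some weight on video_3, a candidate suggestion reached via video_2.
print(adsorption(graph, seeds)["user_a"])
```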


Computer Vision and Pattern Recognition | 2013

Fast, Accurate Detection of 100,000 Object Classes on a Single Machine

Thomas Dean; Mark A. Ruzon; Mark Segal; Jonathon Shlens; Sudheendra Vijayanarasimhan; Jay Yagnik

Many object detection systems are constrained by the time required to convolve a target image with a bank of filters that code for different aspects of an object's appearance, such as the presence of component parts. We exploit locality-sensitive hashing to replace the dot-product kernel operator in the convolution with a fixed number of hash-table probes that effectively sample all of the filter responses in time independent of the size of the filter bank. To show the effectiveness of the technique, we apply it to evaluate 100,000 deformable-part models requiring over a million (part) filters on multiple scales of a target image in less than 20 seconds using a single multi-core processor with 20 GB of RAM. This represents a speed-up of approximately 20,000 times (four orders of magnitude) when compared with performing the convolutions explicitly on the same hardware. While mean average precision over the full set of 100,000 object classes is around 0.16, due in large part to the challenges in gathering training data and collecting ground truth for so many classes, we achieve a mAP of at least 0.20 on a third of the classes and 0.30 or better on about 20% of the classes.
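
To make the central trick concrete, here is a hedged sketch of replacing exhaustive dot products against a large filter bank with a fixed number of hash-table probes. The hash used below is a simple winner-take-all style code, and the number of tables, band size, and k are illustrative assumptions rather than the paper's configuration; only the filters retrieved by the probes receive exact dot-product scoring.

```python
import numpy as np

def wta_code(x, perms, k):
    """One hash code: for each permutation, the argmax among the first k permuted entries."""
    return tuple(int(np.argmax(x[p[:k]])) for p in perms)

rng = np.random.default_rng(0)
dim, num_filters, num_tables, band, k = 64, 1000, 8, 4, 4

# Offline: build hash tables, each with its own small band of permutations,
# and index every filter in the bank once.
table_perms = [[rng.permutation(dim) for _ in range(band)] for _ in range(num_tables)]
filters = rng.standard_normal((num_filters, dim))
tables = [dict() for _ in range(num_tables)]
for fid, f in enumerate(filters):
    for t in range(num_tables):
        tables[t].setdefault(wta_code(f, table_perms[t], k), []).append(fid)

# Detection time: a fixed number of probes retrieves candidate filters,
# and only those candidates get exact dot-product scoring.
window = rng.standard_normal(dim)          # descriptor of one image window
candidates = set()
for t in range(num_tables):
    candidates.update(tables[t].get(wta_code(window, table_perms[t], k), []))
scores = {fid: float(filters[fid] @ window) for fid in candidates}
print(f"scored {len(candidates)} of {num_filters} filters exactly")
```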


ACM Multimedia | 2009

Tour the world: a technical demonstration of a web-scale landmark recognition engine

Yan-Tao Zheng; Ming Zhao; Yang Song; Hartwig Adam; Ulrich Buddemeier; Alessandro Bissacco; Fernando Brucher; Tat-Seng Chua; Hartmut Neven; Jay Yagnik

We present a technical demonstration of a world-scale touristic landmark recognition engine. To build such an engine, we leverage ~21.4 million images, from photo sharing websites and Google Image Search, and around two thousand web articles to mine the landmark names and learn the visual models. The landmark recognition engine incorporates 5312 landmarks from 1259 cities in 144 countries. This demonstration gives three exhibits: (1) a live landmark recognition engine that can visually recognize landmarks in a given image; (2) an interactive navigation tool showing landmarks on Google Earth; and (3) sample visual clusters (landmark model images) and a list of 1000 randomly selected landmarks from our recognition engine with their iconic images.


Computer Vision and Pattern Recognition | 2013

Discriminative Segment Annotation in Weakly Labeled Video

Kevin Tang; Rahul Sukthankar; Jay Yagnik; Li Fei-Fei

The ubiquitous availability of Internet video offers the vision community the exciting opportunity to directly learn localized visual concepts from real-world imagery. Unfortunately, most such attempts are doomed because traditional approaches are ill-suited, both in terms of their computational characteristics and their inability to robustly contend with the label noise that plagues uncurated Internet content. We present CRANE, a weakly supervised algorithm that is specifically designed to learn under such conditions. First, we exploit the asymmetric availability of real-world training data, where small numbers of positive videos tagged with the concept are supplemented with large quantities of unreliable negative data. Second, we ensure that CRANE is robust to label noise, both in terms of tagged videos that fail to contain the concept and occasional negative videos that do. Finally, CRANE is highly parallelizable, making it practical to deploy at large scale without sacrificing the quality of the learned solution. Although CRANE is general, this paper focuses on segment annotation, where we show state-of-the-art pixel-level segmentation results on two datasets, one of which includes a training set of spatiotemporal segments from more than 20,000 videos.
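
As a rough illustration of the weak-supervision setup described above, the sketch below ranks segments from positive (concept-tagged) videos by penalizing similarity to segments drawn from negative videos, so background segments shared with the negatives fall to the bottom. The features, the Gaussian penalty, and the parameter sigma are stand-in assumptions; this is the general negative-mining idea, not CRANE's exact scoring function.

```python
import numpy as np

def rank_positive_segments(pos_feats, neg_feats, sigma=1.0):
    """Score each segment from positive videos by penalizing proximity to negative segments."""
    scores = []
    for p in pos_feats:
        d2 = np.sum((neg_feats - p) ** 2, axis=1)
        # Nearby negatives contribute large penalties; distant ones almost none.
        scores.append(-np.sum(np.exp(-d2 / (2 * sigma ** 2))))
    order = np.argsort(scores)[::-1]        # highest score = most concept-like
    return order, np.asarray(scores)

rng = np.random.default_rng(1)
pos = rng.standard_normal((20, 8))          # segments from weakly tagged (positive) videos
neg = rng.standard_normal((200, 8)) + 2.0   # background-like segments from negative videos
order, scores = rank_positive_segments(pos, neg)
print("most concept-like segment:", int(order[0]))
```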


Computer Vision and Pattern Recognition | 2010

Finding meaning on YouTube: Tag recommendation and category discovery

George Toderici; Hrishikesh Aradhye; Marius Pasca; Luciano Sbaiz; Jay Yagnik

We present a system that automatically recommends tags for YouTube videos solely based on their audiovisual content. We also propose a novel framework for unsupervised discovery of video categories that exploits knowledge mined from World Wide Web text documents and searches. First, the association between video content and tags is learned by training classifiers that map audiovisual content-based features from millions of videos on YouTube.com to the existing uploader-supplied tags for these videos. When a new video is uploaded, the labels provided by these classifiers are used to automatically suggest tags deemed relevant to the video. Our system has learned a vocabulary of over 20,000 tags. Second, we mined large volumes of Web pages and search queries to discover a set of possible text entity categories and a set of associated is-A relationships that map individual text entities to categories. Finally, we apply these is-A relationships mined from web text to the tags learned from the audiovisual content of videos to automatically synthesize a reliable set of categories most relevant to videos, along with a mechanism to predict these categories for new uploads. We then present rigorous rating studies that establish that: (a) the average relevance of tags automatically recommended by our system matches the average relevance of the uploader-supplied tags at the same or better coverage and (b) the average precision@K of video categories discovered by our system is 70% with K=5.
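
The final step, mapping content-based tag predictions onto categories through mined is-A relations, can be sketched in a few lines. The tag scores and the is-A table below are made-up placeholders; the aggregation rule (summing tag scores into their categories) is a simplifying assumption for illustration.

```python
from collections import defaultdict

def categorize(tag_scores, is_a, top_k=5):
    """Aggregate per-tag classifier scores into category scores via is-A links."""
    cat_scores = defaultdict(float)
    for tag, score in tag_scores.items():
        for category in is_a.get(tag, []):
            cat_scores[category] += score
    return sorted(cat_scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]

# Hypothetical classifier outputs for a newly uploaded video.
tag_scores = {"golden retriever": 0.9, "puppy": 0.7, "guitar": 0.1}
# Hypothetical is-A relations mined from Web text.
is_a = {"golden retriever": ["dog", "animal"], "puppy": ["dog", "animal"], "guitar": ["instrument"]}
print(categorize(tag_scores, is_a))   # dog/animal dominate instrument for this video
```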


International Conference on Computer Vision | 2011

The power of comparative reasoning

Jay Yagnik; Dennis Strelow; David A. Ross; Ruei-Sung Lin

Rank correlation measures are known for their resilience to perturbations in numeric values and are widely used in many evaluation metrics. Such ordinal measures have rarely been applied to the treatment of numeric features as a representational transformation. We emphasize the benefits of ordinal representations of input features both theoretically and empirically. We present a family of algorithms for computing ordinal embeddings based on partial order statistics. Apart from having the stability benefits of ordinal measures, these embeddings are highly nonlinear, giving rise to sparse feature spaces highly favored by several machine learning methods. These embeddings are deterministic and data-independent and, by virtue of being based on partial order statistics, add another degree of resilience to noise. These machine-learning-free methods, when applied to the task of fast similarity search, outperform state-of-the-art machine learning methods with complex optimization setups. For solving classification problems, the embeddings provide a nonlinear transformation resulting in sparse binary codes that are well-suited for a large class of machine learning algorithms. These methods show significant improvement on VOC 2010 using simple linear classifiers that can be trained quickly. Our method can be extended to the case of polynomial kernels, while permitting very efficient computation. Further, since the popular Min Hash algorithm is a special case of our method, we demonstrate an efficient scheme for computing Min Hash on conjunctions of binary features. The actual method can be implemented in about 10 lines of code in most languages (2 lines in MATLAB), and does not require any data-driven optimization.
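
A minimal sketch of a winner-take-all style ordinal embedding in the spirit of this description is shown below: deterministic given a fixed set of permutations, data-independent, and based only on partial order statistics. The dimensionality, number of codes, and k are illustrative choices; the small experiment simply shows that the codes barely change under small numeric perturbations of the input.

```python
import numpy as np

def ordinal_codes(x, perms, k):
    """For each permutation, record the index of the max among the first k permuted entries."""
    return np.array([int(np.argmax(x[p[:k]])) for p in perms])

rng = np.random.default_rng(0)
dim, num_codes, k = 128, 64, 4
perms = [rng.permutation(dim) for _ in range(num_codes)]   # fixed once, data independent

x = rng.standard_normal(dim)
x_noisy = x + 0.01 * rng.standard_normal(dim)              # small numeric perturbation
match = np.mean(ordinal_codes(x, perms, k) == ordinal_codes(x_noisy, perms, k))
print(f"fraction of codes unchanged under perturbation: {match:.2f}")
```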


Computer Vision and Pattern Recognition | 2010

SPEC hashing: Similarity preserving algorithm for entropy-based coding

Ruei-Sung Lin; David A. Ross; Jay Yagnik

Searching for approximate nearest neighbors in large-scale, high-dimensional data sets is a challenging problem. This paper presents a novel and fast algorithm for learning binary hash functions for fast nearest neighbor retrieval. The nearest neighbors are defined according to the semantic similarity between the objects. Our method uses the information in these semantic similarities and learns a hash function with binary codes such that only objects with high similarity have a small Hamming distance. The hash function is trained incrementally, one bit at a time, and as bits are added to the hash code the Hamming distances between dissimilar objects increase. We further link our method to the idea of maximizing conditional entropy among pairs of bits and derive an extremely efficient linear-time hash learning algorithm. Experiments on similar-image retrieval and celebrity face recognition show that our method produces a clear improvement in performance over several state-of-the-art methods.
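
The incremental, one-bit-at-a-time flavor of the learning procedure can be illustrated with a greedy sketch. The version below selects each new bit from a pool of random hyperplanes using a simple pairwise agreement criterion (reward similar pairs that agree, dissimilar pairs that disagree); this stands in for, and is not, the paper's conditional-entropy objective, and the data, pool size, and scoring rule are assumptions for illustration.

```python
import numpy as np

def greedy_hash_bits(X, S, num_bits, num_candidates=200, seed=0):
    """X: (n, d) data; S: (n, n) pairwise labels, +1 similar / -1 dissimilar."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    chosen = []
    for _ in range(num_bits):
        hyperplanes = rng.standard_normal((num_candidates, d))
        bits = (X @ hyperplanes.T > 0).astype(int)           # candidate bit values, (n, num_candidates)
        best, best_score = None, -np.inf
        for c in range(num_candidates):
            agree = bits[:, [c]] == bits[:, [c]].T            # do items share this bit value?
            # Reward agreement on similar pairs and disagreement on dissimilar pairs.
            score = np.sum(S * np.where(agree, 1, -1))
            if score > best_score:
                best, best_score = hyperplanes[c], score
        chosen.append(best)                                   # commit one more bit
    W = np.stack(chosen)
    return W, (X @ W.T > 0).astype(int)                       # hash codes for the training data

rng = np.random.default_rng(1)
X = np.vstack([rng.standard_normal((10, 16)) - 2, rng.standard_normal((10, 16)) + 2])
S = -np.ones((20, 20)); S[:10, :10] = 1; S[10:, 10:] = 1      # two semantic groups
W, codes = greedy_hash_bits(X, S, num_bits=8)
within = int(np.abs(codes[0] - codes[1]).sum())               # Hamming distance, similar pair
across = int(np.abs(codes[0] - codes[-1]).sum())              # Hamming distance, dissimilar pair
print(f"Hamming distance: within-group {within}, cross-group {across}")
```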


International Conference on Data Mining | 2009

Video2Text: Learning to Annotate Video Content

Hrishikesh Aradhye; George Toderici; Jay Yagnik

This paper discusses a new method for automatic discovery and organization of descriptive concepts (labels) within large real-world corpora of user-uploaded multimedia, such as YouTube.com. Conversely, it also provides validation of existing labels, if any. While training, our method does not assume any explicit manual annotation other than the weak labels already available in the form of video title, description, and tags. Prior work related to such auto-annotation assumed that a vocabulary of labels of interest (e.g., indoor, outdoor, city, landscape) is specified a priori. In contrast, the proposed method begins with an empty vocabulary. It analyzes audiovisual features of 25 million YouTube.com videos (nearly 150 years of video data), effectively searching for consistent correlation between these features and text metadata. It autonomously extends the label vocabulary as and when it discovers concepts it can reliably identify, eventually leading to a vocabulary with thousands of labels and growing. We believe that this work significantly extends the state of the art in multimedia data mining, discovery, and organization based on the technical merit of the proposed ideas as well as the enormous scale of the mining exercise in a very challenging, unconstrained, noisy domain.
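
A hedged sketch of the vocabulary-growth loop described above: candidate labels are harvested from weak metadata, a classifier is trained per candidate from audiovisual features, and a label enters the vocabulary only if it can be predicted reliably. The logistic-regression classifier, the AUC threshold, and the synthetic data are stand-in assumptions, not the system's actual components.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def grow_vocabulary(features, metadata_labels, candidates, min_auc=0.75):
    """features: (n, d) audiovisual features; metadata_labels: per-video sets of weak labels."""
    vocabulary = []
    for label in candidates:
        y = np.array([label in labels for labels in metadata_labels], dtype=int)
        if y.sum() < 5 or y.sum() > len(y) - 5:
            continue                                    # too few weak positives (or negatives) to judge
        auc = cross_val_score(LogisticRegression(max_iter=1000),
                              features, y, cv=3, scoring="roc_auc").mean()
        if auc >= min_auc:                              # keep only concepts the features can predict
            vocabulary.append(label)
    return vocabulary

# Synthetic stand-in: one audiovisual feature weakly encodes "music".
rng = np.random.default_rng(0)
X = rng.standard_normal((60, 8))
X[:30, 0] += 3.0
meta = [{"music"} if i < 30 else {"vlog"} for i in range(60)]
print(grow_vocabulary(X, meta, ["music", "vlog", "unicorn"]))
# "unicorn" never co-occurs with any video, so it cannot enter the vocabulary.
```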


Computer Vision and Pattern Recognition | 2010

Taxonomic classification for web-based videos

Yang Song; Ming Zhao; Jay Yagnik; Xiaoyun Wu

Categorizing web-based videos is an important yet challenging task. The difficulties arise from large data diversity within a category, lack of labeled data, and degradation of video quality. This paper presents a large-scale video taxonomic classification scheme (with more than 1000 categories) that tackles these issues. The taxonomic structure of the categories is exploited in classifier training. To compensate for the lack of labeled video data, a novel method is proposed to adapt classifiers trained on web text documents to the video domain, so that the availability of a large corpus of labeled text documents can be leveraged. Video-content-based features are integrated with text-based features to retain power when one type of feature degrades. Evaluation on videos from hundreds of categories shows that the proposed algorithms yield significant performance improvements over text classifiers or classifiers trained using only video-content-based features.
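
The fusion idea, combining text-based and video-content-based signals so that either can compensate when the other degrades, can be sketched as a simple late fusion of per-category scores. The category names, scores, and fusion weight below are illustrative assumptions; the paper additionally exploits the taxonomic structure during classifier training, which this sketch omits.

```python
def fuse_scores(text_scores, content_scores, alpha=0.5):
    """Late fusion: weighted sum of per-category scores from the two classifiers."""
    categories = set(text_scores) | set(content_scores)
    return {c: alpha * text_scores.get(c, 0.0) + (1 - alpha) * content_scores.get(c, 0.0)
            for c in categories}

text_scores = {"/Sports/Soccer": 0.2, "/Music/Rock": 0.6}      # from metadata text
content_scores = {"/Sports/Soccer": 0.9, "/Music/Rock": 0.3}   # from audiovisual features
fused = fuse_scores(text_scores, content_scores)
print(max(fused, key=fused.get), fused)                        # content rescues the weak text signal
```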


Multimedia Information Retrieval | 2007

Learning people annotation from the web via consistency learning

Jay Yagnik; Atiq Islam

The phenomenal growth of images and video on the web, together with the increasing sparseness of the metadata that accompanies them, forces us to look for signals from the image/video content itself for search, information retrieval, and browsing-based corpus exploration. One of the prominent types of information users look for when searching or browsing such corpora is information about the people present in the image/video. While face recognition has matured to some extent over the past few years, this problem remains hard due to (a) the absence of labelled data for the large set of celebrities that users look for, and (b) the variability of age, makeup, expression, and pose in the target corpus. We propose a learning paradigm, which we refer to as consistency learning, that addresses both issues by posing the problem as one of learning from a weakly labelled training set. We use text-image co-occurrence on the web as a weak signal of relevance and learn the set of consistent face models from this very large and noisy training set. The resulting system learns face models for a large set of celebrities directly from the web and uses them to tag images and video for better retrieval. While the proposed method has been applied to faces, we see it as broadly applicable to any learning problem with a suitably defined similarity metric. We present results on learning from a very large dataset of 37 million images, achieving a validation accuracy of 92.68%.
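
A rough sketch of the consistency idea: for a given name, face descriptors gathered from weakly relevant web images are grouped, and only the dominant, mutually consistent group is kept as that person's model while the noisy remainder is discarded. The descriptors, cosine-similarity threshold, and single-anchor grouping below are illustrative assumptions, not the paper's consistency-learning procedure.

```python
import numpy as np

def consistent_model(descriptors, sim_threshold=0.8):
    """Keep the largest group of face descriptors that are mutually similar (cosine)."""
    X = descriptors / np.linalg.norm(descriptors, axis=1, keepdims=True)
    sims = X @ X.T
    support = (sims > sim_threshold).sum(axis=1)       # how many others agree with each descriptor
    anchor = int(np.argmax(support))                   # the most supported descriptor
    members = np.where(sims[anchor] > sim_threshold)[0]
    return X[members].mean(axis=0), members            # mean face model + indices of the consistent set

rng = np.random.default_rng(0)
true_face = rng.standard_normal(64)
inliers = true_face + 0.1 * rng.standard_normal((30, 64))   # consistent faces of the person
outliers = rng.standard_normal((10, 64))                    # unrelated faces from noisy pages
model, members = consistent_model(np.vstack([inliers, outliers]))
print(f"kept {len(members)} of 40 descriptors as the mutually consistent set")
```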

