Publications


Featured research published by Kevin Tang.


Computer Vision and Pattern Recognition | 2012

Learning latent temporal structure for complex event detection

Kevin Tang; Li Fei-Fei; Daphne Koller

In this paper, we tackle the problem of understanding the temporal structure of complex events in highly varying videos obtained from the Internet. Towards this goal, we utilize a conditional model trained in a max-margin framework that is able to automatically discover discriminative and interesting segments of video, while simultaneously achieving competitive accuracies on difficult detection and recognition tasks. We introduce latent variables over the frames of a video, and allow our algorithm to discover and assign sequences of states that are most discriminative for the event. Our model is based on the variable-duration hidden Markov model, and models durations of states in addition to the transitions between states. The simplicity of our model allows us to perform fast, exact inference using dynamic programming, which is extremely important when we set our sights on being able to process a very large number of videos quickly and efficiently. We show promising results on the Olympic Sports dataset [16] and the 2011 TRECVID Multimedia Event Detection task [18]. We also illustrate and visualize the semantic understanding capabilities of our model.
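The max-margin training described above is beyond the scope of a profile page, but the exact duration-aware dynamic program the abstract relies on can be sketched. The toy below (function name, random scores, and the uniform transition/duration tables are all illustrative, not the authors' code) segments a video into latent-state runs while explicitly scoring how long each state lasts, in the spirit of a variable-duration HMM:

```python
import numpy as np

def segment_video(frame_scores, trans, dur_logp, max_dur):
    """Exact DP over latent-state segmentations with explicit durations.

    frame_scores: (T, K) per-frame log-scores for each latent state
    trans:        (K, K) log transition scores between states
    dur_logp:     (K, max_dur) log-probability of a state lasting d+1 frames
    Returns (best score, list of (state, start, end) runs covering T frames).
    """
    T, K = frame_scores.shape
    cum = np.vstack([np.zeros((1, K)), np.cumsum(frame_scores, axis=0)])
    best = np.full((T + 1, K), -np.inf)  # best[t, k]: best segmentation of
    back = {}                            # frames [0, t) ending in state k
    best[0, :] = 0.0
    for t in range(1, T + 1):
        for k in range(K):
            for d in range(1, min(max_dur, t) + 1):
                emit = cum[t, k] - cum[t - d, k]   # frames t-d .. t-1 in state k
                base = best[t - d] + (trans[:, k] if t - d > 0 else 0.0)
                j = int(np.argmax(base))
                score = base[j] + emit + dur_logp[k, d - 1]
                if score > best[t, k]:
                    best[t, k] = score
                    back[(t, k)] = (t - d, j)
    k, t, segs = int(np.argmax(best[T])), T, []
    while t > 0:                         # trace the winning runs backwards
        s, j = back[(t, k)]
        segs.append((k, s, t))
        t, k = s, j
    return float(best[T].max()), segs[::-1]

# Toy usage: 3 latent states, 30 frames, uniform transitions and durations.
rng = np.random.default_rng(0)
scores = rng.normal(size=(30, 3))
trans = np.log(np.full((3, 3), 1 / 3))
dur = np.log(np.full((3, 10), 1 / 10))
print(segment_video(scores, trans, dur, max_dur=10))
```

The inner loop runs in O(T * K^2 * max_dur), which is why exact inference stays fast enough for large video collections.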


International Conference on Image Processing | 2010

Towards computational models of kinship verification

Ruogu Fang; Kevin Tang; Noah Snavely; Tsuhan Chen

We tackle the challenge of kinship verification using novel feature extraction and selection methods, automatically classifying pairs of face images as “related” or “unrelated” (in terms of kinship). First, we conduct a controlled online search to collect frontal face images of 150 pairs of public figures and celebrities, along with images of their parents or children. Next, we propose and evaluate a set of low-level image features for this classification problem. After selecting the most discriminative inherited facial features, we demonstrate a classification accuracy of 70.67% on a test set of image pairs using K-Nearest-Neighbors. Finally, we present an evaluation of human performance on this problem.
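As a rough illustration of the select-then-classify pipeline the abstract describes, the sketch below builds a hypothetical pair descriptor from color histograms, keeps the most discriminative dimensions, and classifies with K-Nearest-Neighbors via scikit-learn. The descriptor, the random stand-in data, and all hyperparameters are assumptions for the toy, not the paper's features:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline

def pair_features(img_a, img_b, bins=16):
    """Hypothetical low-level pair descriptor: per-channel color histograms
    of both faces plus their absolute difference."""
    def hist(img):
        return np.concatenate([
            np.histogram(img[..., c], bins=bins, range=(0, 1))[0]
            for c in range(3)]).astype(float)
    ha, hb = hist(img_a), hist(img_b)
    return np.concatenate([ha, hb, np.abs(ha - hb)])

# Toy stand-in data: 200 random "face pairs", label 1 = related.
rng = np.random.default_rng(0)
X = np.stack([pair_features(rng.random((64, 64, 3)), rng.random((64, 64, 3)))
              for _ in range(200)])
y = rng.integers(0, 2, size=200)

# Select the most discriminative features, then classify with KNN,
# mirroring the select-then-KNN pipeline described in the abstract.
clf = make_pipeline(SelectKBest(f_classif, k=32),
                    KNeighborsClassifier(n_neighbors=5))
clf.fit(X[:150], y[:150])
print("held-out accuracy:", clf.score(X[150:], y[150:]))
```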


Computer Vision and Pattern Recognition | 2013

Discriminative Segment Annotation in Weakly Labeled Video

Kevin Tang; Rahul Sukthankar; Jay Yagnik; Li Fei-Fei

The ubiquitous availability of Internet video offers the vision community the exciting opportunity to directly learn localized visual concepts from real-world imagery. Unfortunately, most such attempts are doomed because traditional approaches are ill-suited, both in terms of their computational characteristics and their inability to robustly contend with the label noise that plagues uncurated Internet content. We present CRANE, a weakly supervised algorithm that is specifically designed to learn under such conditions. First, we exploit the asymmetric availability of real-world training data, where small numbers of positive videos tagged with the concept are supplemented with large quantities of unreliable negative data. Second, we ensure that CRANE is robust to label noise, both in terms of tagged videos that fail to contain the concept as well as occasional negative videos that do. Finally, CRANE is highly parallelizable, making it practical to deploy at large scale without sacrificing the quality of the learned solution. Although CRANE is general, this paper focuses on segment annotation, where we show state-of-the-art pixel-level segmentation results on two datasets, one of which includes a training set of spatiotemporal segments from more than 20,000 videos.
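CRANE's central ranking idea, that a segment from a positively tagged video is more trustworthy the farther it sits from the segments of negative videos, fits in a few lines and is trivially parallel across segments. The distance cut-off used here (mean of the k nearest negatives) is one simple instantiation chosen for the toy; the paper studies several penalty functions:

```python
import numpy as np

def crane_scores(pos_segments, neg_segments, k=5):
    """Hedged sketch of CRANE-style ranking: a segment from a positively
    tagged video scores high when it lies far from negative-video segments.

    pos_segments: (P, D) descriptors from positive videos
    neg_segments: (N, D) descriptors from negative videos
    """
    # Pairwise squared Euclidean distances, computed in one shot.
    d2 = ((pos_segments[:, None, :] - neg_segments[None, :, :]) ** 2).sum(-1)
    nearest = np.sort(d2, axis=1)[:, :k]
    return nearest.mean(axis=1)   # larger = more likely to show the concept

rng = np.random.default_rng(0)
pos = rng.normal(size=(100, 8))
pos[:10] += 3.0                   # plant a few segments far from the negatives
neg = rng.normal(size=(500, 8))
ranked = np.argsort(-crane_scores(pos, neg))
print("top-scoring positive segments:", ranked[:10])
```

Because each positive segment is scored independently against the negatives, the computation shards cleanly across machines, which is the parallelism the abstract refers to.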


Computer Vision and Pattern Recognition | 2014

Co-localization in Real-World Images

Kevin Tang; Armand Joulin; Li-Jia Li; Li Fei-Fei

In this paper, we tackle the problem of co-localization in real-world images. Co-localization is the problem of simultaneously localizing (with bounding boxes) objects of the same class across a set of distinct images. Although similar problems such as co-segmentation and weakly supervised localization have been previously studied, we focus on being able to perform co-localization in real-world settings, which are typically characterized by large amounts of intra-class variation, inter-class diversity, and annotation noise. To address these issues, we present a joint image-box formulation for solving the co-localization problem, and show how it can be relaxed to a convex quadratic program which can be efficiently solved. We perform an extensive evaluation of our method compared to previous state-of-the-art approaches on the challenging PASCAL VOC 2007 and Object Discovery datasets. In addition, we present a large-scale study of co-localization on ImageNet, involving ground-truth annotations for 3,624 classes and approximately 1 million images.
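To make the box-selection side of the formulation concrete, here is a deliberately small projected-gradient toy on a relaxed problem: each image holds a distribution over its candidate boxes, and a quadratic term pulls the selected boxes toward a shared appearance. This is an illustrative stand-in under invented data, not the paper's convex QP or its solver:

```python
import numpy as np

def colocalize(box_feats, prior, n_iter=200, lr=0.1):
    """Toy relaxed image-box selection.

    box_feats: list of (B_i, D) descriptors, one array per image
    prior:     list of (B_i,) per-box saliency scores
    Each image's selection variable z_i lives on the simplex; gradient
    steps reward boxes that agree with the cross-image consensus.
    """
    feats = [f / np.linalg.norm(f, axis=1, keepdims=True) for f in box_feats]
    zs = [np.full(len(f), 1.0 / len(f)) for f in feats]

    def project_simplex(v):
        u = np.sort(v)[::-1]
        css = np.cumsum(u) - 1
        rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
        return np.maximum(v - css[rho] / (rho + 1), 0)

    for _ in range(n_iter):
        mean = np.mean([z @ f for z, f in zip(zs, feats)], axis=0)
        for i, (z, f, p) in enumerate(zip(zs, feats, prior)):
            grad = -2 * f @ mean - p      # pull toward consensus + saliency
            zs[i] = project_simplex(z - lr * grad)
    return [int(np.argmax(z)) for z in zs]  # chosen box per image

rng = np.random.default_rng(0)
feats = [rng.normal(size=(20, 16)) for _ in range(5)]
shared = rng.normal(size=16)
for f in feats:
    f[3] = shared + 0.1 * rng.normal(size=16)  # plant a common object in box 3
priors = [np.zeros(20) for _ in range(5)]
print(colocalize(feats, priors))  # tends to select the planted box 3 everywhere
```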


European Conference on Computer Vision | 2014

Efficient Image and Video Co-localization with Frank-Wolfe Algorithm

Armand Joulin; Kevin Tang; Li Fei-Fei

In this paper, we tackle the problem of performing efficient co-localization in images and videos. Co-localization is the problem of simultaneously localizing (with bounding boxes) objects of the same class across a set of distinct images or videos. Building upon recent state-of-the-art methods, we show how we are able to naturally incorporate temporal terms and constraints for video co-localization into a quadratic programming framework. Furthermore, by leveraging the Frank-Wolfe algorithm (or conditional gradient), we show how our optimization formulations for both images and videos can be reduced to solving a succession of simple integer programs, leading to increased efficiency in both memory and speed. To validate our method, we present experimental results on the PASCAL VOC 2007 dataset for images and the YouTube-Objects dataset for videos, as well as a joint combination of the two.
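The reduction the abstract mentions is easy to see in code: Frank-Wolfe only ever asks for a linear minimization over the feasible set, and with one-box-per-image constraints that oracle is a trivial per-image integer program. The quadratic objective below is random toy data under assumed shapes, not the paper's formulation:

```python
import numpy as np

def frank_wolfe(grad_fn, lmo, z0, n_iter=100):
    """Generic Frank-Wolfe / conditional gradient loop. Each step calls a
    linear minimization oracle (lmo) that returns a vertex of the feasible
    set, so the iterate is always a convex combination of integer
    solutions and no projection is ever needed."""
    z = z0.copy()
    for t in range(n_iter):
        s = lmo(grad_fn(z))           # solve the simple integer program
        gamma = 2.0 / (t + 2.0)       # standard step-size schedule
        z = (1 - gamma) * z + gamma * s
    return z

# Toy instance: minimize 0.5*z'Qz + c'z where z picks one of B candidate
# boxes in each of N images (block-wise one-hot constraint).
N, B = 4, 6
rng = np.random.default_rng(0)
Q = rng.normal(size=(N * B, N * B)); Q = Q @ Q.T / (N * B)   # convex quadratic
c = rng.normal(size=N * B)

def grad(z):
    return Q @ z + c

def lmo(g):
    """Per-image argmin: one-hot per block, i.e. an integer solution
    found independently for each image."""
    s = np.zeros_like(g)
    for i in range(N):
        blk = slice(i * B, (i + 1) * B)
        s[blk][np.argmin(g[blk])] = 1.0
    return s

z0 = np.full(N * B, 1.0 / B)
z = frank_wolfe(grad, lmo, z0)
print("selected boxes:", [int(np.argmax(z[i*B:(i+1)*B])) for i in range(N)])
```

The memory and speed gains come from this structure: the solver never materializes the full constraint polytope, only a stream of cheap integer subproblems.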


International Conference on Computer Vision | 2013

Combining the Right Features for Complex Event Recognition

Kevin Tang; Bangpeng Yao; Li Fei-Fei; Daphne Koller

In this paper, we tackle the problem of combining features extracted from video for complex event recognition. Feature combination is an especially relevant task in video data, as there are many features we can extract, ranging from image features computed from individual frames to video features that take temporal information into account. To combine features effectively, we propose a method that is able to be selective of different subsets of features, as some features or feature combinations may be uninformative for certain classes. We introduce a hierarchical method for combining features based on the AND/OR graph structure, where nodes in the graph represent combinations of different sets of features. Our method automatically learns the structure of the AND/OR graph using score-based structure learning, and we introduce an inference procedure that is able to efficiently compute structure scores. We present promising results and analysis on the difficult and large-scale 2011 TRECVID Multimedia Event Detection dataset.
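As a much-simplified stand-in for the paper's score-based structure learning, the sketch below scores candidate feature subsets (AND nodes) by held-out accuracy of a linear classifier and keeps the best few as children of an OR node. The scoring proxy, the brute-force candidate enumeration, and the greedy selection are all assumptions for the toy; the paper's structure-score inference is far more efficient:

```python
from itertools import combinations
import numpy as np

def subset_score(X_blocks, y, subset, rng):
    """Hypothetical structure score: held-out accuracy of a least-squares
    linear classifier trained on one feature-subset concatenation."""
    X = np.hstack([X_blocks[i] for i in subset])
    n = len(y); idx = rng.permutation(n); tr, te = idx[: n // 2], idx[n // 2:]
    w = np.linalg.lstsq(X[tr], y[tr] * 2.0 - 1.0, rcond=None)[0]
    return ((X[te] @ w > 0) == y[te]).mean()

def learn_or_of_ands(X_blocks, y, max_size=2, seed=0):
    """Score every feature subset up to max_size (each an AND node) and
    keep the top few as children of the OR root."""
    rng = np.random.default_rng(seed)
    candidates = [s for k in range(1, max_size + 1)
                  for s in combinations(range(len(X_blocks)), k)]
    scored = sorted(((subset_score(X_blocks, y, s, rng), s) for s in candidates),
                    reverse=True)
    return scored[:3]   # top AND nodes form the OR's children

# Toy data: 3 feature blocks; blocks 0 and 2 carry the class signal.
rng = np.random.default_rng(0)
y = rng.integers(0, 2, 300)
X_blocks = [rng.normal(size=(300, 5)) for _ in range(3)]
X_blocks[0][:, 0] += y; X_blocks[2][:, 0] += y
print(learn_or_of_ands(X_blocks, y))
```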


International Conference on Computer Vision | 2015

Learning Temporal Embeddings for Complex Video Analysis

Vignesh Ramanathan; Kevin Tang; Greg Mori; Li Fei-Fei

In this paper, we propose to learn temporal embeddings of video frames for complex video analysis. Large quantities of unlabeled video data can be easily obtained from the Internet. These videos possess the implicit weak label that they are sequences of temporally and semantically coherent images. We leverage this information to learn temporal embeddings for video frames by associating frames with the temporal context that they appear in. To do this, we propose a scheme for incorporating temporal context based on past and future frames in videos, and compare this to other contextual representations. In addition, we show how data augmentation using multi-resolution samples and hard negatives helps to significantly improve the quality of the learned embeddings. We evaluate various design decisions for learning temporal embeddings, and show that our embeddings can improve performance for multiple video tasks such as retrieval, classification, and temporal order recovery in unconstrained Internet video.
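One way to see the context idea is a word2vec-style toy: require each frame's embedding to sit closer to the mean of its past/future context than a randomly sampled negative frame does, via a hinge loss. Unlike the paper, which embeds visual features through a learned network, this sketch learns a free lookup-table embedding per frame index, and for brevity it skips the gradient through the context mean; all hyperparameters are invented:

```python
import numpy as np

def train_temporal_embeddings(n_frames, dim=32, window=2, margin=0.5,
                              lr=0.05, epochs=5, seed=0):
    """Context-based frame embedding with a margin ranking loss: frame t
    should be nearer the mean embedding of its temporal context than a
    random ("negative") frame is."""
    rng = np.random.default_rng(seed)
    W = rng.normal(scale=0.1, size=(n_frames, dim))  # one embedding per frame
    for _ in range(epochs):
        for t in range(window, n_frames - window):
            ctx_idx = list(range(t - window, t)) + list(range(t + 1, t + 1 + window))
            ctx = W[ctx_idx].mean(axis=0)
            neg = int(rng.integers(0, n_frames))
            d_pos = ((W[t] - ctx) ** 2).sum()
            d_neg = ((W[neg] - ctx) ** 2).sum()
            if d_pos + margin > d_neg:           # hinge active: update both
                W[t]   -= lr * 2 * (W[t] - ctx)   # pull frame toward context
                W[neg] += lr * 2 * (W[neg] - ctx) # push negative away
    return W

# Toy "video" of 100 frames; temporally nearby frames should embed nearby.
W = train_temporal_embeddings(n_frames=100)
d = ((W - W[50]) ** 2).sum(axis=1)
print("nearest frames to frame 50:", np.argsort(d)[:5])
```

The paper's multi-resolution sampling and hard negatives would slot in where the random negative is drawn here.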


Computer Music Journal | 2010

Machine learning of jazz grammars

Jon Gillick; Kevin Tang; Robert M. Keller

In the context of an educational software tool that can generate novel jazz solos using a probabilistic grammar (Keller 2007), this article describes the automated learning of such grammars. Learning takes place from a corpus of transcriptions, typically from a single performer, and our methods attempt to improvise solos representative of such a style. In order to capture idiomatic gestures of a specific soloist, we extend an earlier grammar representation (Keller and Morrison 2007) with a technique for representing melodic contour. Representative contours are extracted from a corpus using clustering, and sequencing among contours is done using Markov chains that are encoded into the grammar. This article first defines the basic building blocks for contours of typical jazz solos, which we call slopes, then shows how these slopes may be incorporated into a grammar wherein the notes are chosen according to tonal categories relevant to jazz playing. We show that melodic contours can be accurately portrayed using slopes learned from a corpus. Experimental results, including blind comparisons of solos generated from grammars based on several corpora, are reported.
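The contour-sequencing component reduces to a familiar construction: count transitions between slope-cluster labels across a corpus, normalize into a first-order Markov chain, and sample new contour sequences from it. The four cluster labels and the tiny corpus below are invented for illustration:

```python
import numpy as np

def fit_contour_markov(solos, n_states):
    """Count slope-cluster transitions across a corpus and normalize them
    into a first-order Markov chain (with add-one smoothing)."""
    T = np.ones((n_states, n_states))
    for seq in solos:
        for a, b in zip(seq, seq[1:]):
            T[a, b] += 1
    return T / T.sum(axis=1, keepdims=True)

def sample_contours(T, start, length, rng):
    """Generate a new sequence of contour states from the chain."""
    seq = [start]
    for _ in range(length - 1):
        seq.append(int(rng.choice(len(T), p=T[seq[-1]])))
    return seq

# Toy corpus: solos as sequences of 4 slope clusters
# (e.g. 0=ascending, 1=descending, 2=arch, 3=flat; labels are illustrative).
rng = np.random.default_rng(0)
corpus = [[0, 2, 1, 3, 0, 2, 1], [0, 2, 2, 1, 0, 3, 1], [2, 1, 0, 2, 1, 3]]
T = fit_contour_markov(corpus, n_states=4)
print(sample_contours(T, start=0, length=8, rng=rng))
```

In the full system these sampled contours would then be filled with notes drawn from the grammar's tonal categories.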


Computer Vision and Pattern Recognition | 2010

Optimizing one-shot recognition with micro-set learning

Kevin Tang; Marshall F. Tappen; Rahul Sukthankar; Christoph H. Lampert

For object category recognition to scale beyond a small number of classes, it is important that algorithms be able to learn from a small amount of labeled data per additional class. One-shot recognition aims to apply the knowledge gained from a set of categories with plentiful data to categories for which only a single exemplar is available for each. As with earlier efforts motivated by transfer learning, we seek an internal representation for the domain that generalizes across classes. However, in contrast to existing work, we formulate the problem in a fundamentally new manner by optimizing the internal representation for the one-shot task using the notion of micro-sets. A micro-set is a sample of data that contains only a single instance of each category, sampled from the pool of available data, which serves as a mechanism to force the learned representation to explicitly address the variability and noise inherent in the one-shot recognition task. We optimize our learned domain features so that they minimize an expected loss over micro-sets drawn from the training set and show that these features generalize effectively to previously unseen categories. We detail a discriminative approach for optimizing one-shot recognition using micro-sets and present experiments on the Animals with Attributes and Caltech-101 datasets that demonstrate the benefits of our formulation.
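The micro-set objective itself is compact: repeatedly sample one exemplar per class, classify the held-back points by nearest exemplar, and average the error. The sketch below optimizes only a diagonal feature weighting against that objective with a crude coordinate search; the paper learns a richer discriminative representation, so treat this purely as a demonstration of the loss:

```python
import numpy as np

def one_shot_loss(Xw, y, rng, n_episodes=50):
    """Expected one-shot error over sampled micro-sets: each episode keeps
    one exemplar per class and classifies the rest by nearest exemplar."""
    errs = []
    classes = np.unique(y)
    for _ in range(n_episodes):
        ex = {c: rng.choice(np.where(y == c)[0]) for c in classes}
        E = np.stack([Xw[ex[c]] for c in classes])
        mask = np.ones(len(y), bool); mask[list(ex.values())] = False
        d = ((Xw[mask][:, None, :] - E[None]) ** 2).sum(-1)
        errs.append((classes[d.argmin(1)] != y[mask]).mean())
    return float(np.mean(errs))

def learn_weights(X, y, steps=30, seed=0):
    """Toy micro-set optimization: greedy coordinate search over a diagonal
    feature weighting, scored by the expected micro-set loss above."""
    rng = np.random.default_rng(seed)
    w = np.ones(X.shape[1])
    best = one_shot_loss(X * w, y, rng)
    for _ in range(steps):
        j = int(rng.integers(len(w)))
        for scale in (0.5, 2.0):
            cand = w.copy(); cand[j] *= scale
            loss = one_shot_loss(X * cand, y, rng)
            if loss < best:
                w, best = cand, loss
    return w, best

rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 40)
X = rng.normal(size=(120, 6)); X[:, 0] += y      # only feature 0 is informative
w, err = learn_weights(X, y)
print("learned weights:", w.round(2), "micro-set error:", err)
```

Sampling fresh micro-sets every evaluation is exactly what forces the representation to cope with single-exemplar variability.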


International Conference on Computer Vision | 2015

Improving Image Classification with Location Context

Kevin Tang; Manohar Paluri; Li Fei-Fei; Rob Fergus; Lubomir D. Bourdev

With the widespread availability of cellphones and cameras that have GPS capabilities, it is common for images being uploaded to the Internet today to have GPS coordinates associated with them. In addition to research that tries to predict GPS coordinates from visual features, this also opens up the door to problems that are conditioned on the availability of GPS coordinates. In this work, we tackle the problem of performing image classification with location context, in which we are given the GPS coordinates for images in both the train and test phases. We explore different ways of encoding and extracting features from the GPS coordinates, and show how to naturally incorporate these features into a Convolutional Neural Network (CNN), the current state-of-the-art for most image classification and recognition problems. We also show how it is possible to simultaneously learn the optimal pooling radii for a subset of our features within the CNN framework. To evaluate our model and to help promote research in this area, we identify a set of location-sensitive concepts and annotate a subset of the Yahoo Flickr Creative Commons 100M dataset that has GPS coordinates with these concepts, which we make publicly available. By leveraging location context, we are able to achieve almost a 7% gain in mean average precision.
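A common way to realize the fusion described here, sketched below under assumed details, is to encode GPS coordinates as soft assignments to a set of anchor locations with per-anchor radii, then concatenate that vector with the CNN's image descriptor before the classifier layers. Since the encoding is smooth in the radii, they can be trained by backprop, which parallels the learned pooling radii; anchors, radii, and the stubbed CNN features are all illustrative:

```python
import numpy as np

def location_features(gps, anchors, radii):
    """Hypothetical GPS encoding: soft assignment of each coordinate to K
    anchor locations, with one learnable radius per anchor."""
    d2 = ((gps[:, None, :] - anchors[None]) ** 2).sum(-1)   # (N, K)
    return np.exp(-d2 / (2 * radii[None] ** 2))

def fuse(img_feats, loc_feats):
    """Late fusion: concatenate the CNN image descriptor with the location
    encoding so downstream layers can condition on both."""
    return np.concatenate([img_feats, loc_feats], axis=1)

# Toy usage: 4 images with (lat, lon), 3 anchor cities, CNN features stubbed.
gps = np.array([[40.71, -74.00], [34.05, -118.24],
                [37.77, -122.42], [40.73, -73.99]])
anchors = np.array([[40.71, -74.00], [34.05, -118.24], [37.77, -122.42]])
radii = np.array([1.0, 1.0, 1.0])
img_feats = np.random.default_rng(0).normal(size=(4, 128))  # stand-in CNN output
X = fuse(img_feats, location_features(gps, anchors, radii))
print(X.shape)   # (4, 131): 128 image dims + 3 location dims
```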

Collaboration


Kevin Tang's most frequent co-authors, with their affiliations.

Top Co-Authors

Greg Mori (Simon Fraser University)
Chin Hui Lee (Georgia Institute of Technology)
Ilseo Kim (Georgia Institute of Technology)