Publication


Featured research published by Benjamin Sapp.


Computer Vision and Pattern Recognition | 2013

MODEC: Multimodal Decomposable Models for Human Pose Estimation

Benjamin Sapp; Ben Taskar

We propose a multimodal, decomposable model for articulated human pose estimation in monocular images. A typical approach to this problem is to use a linear structured model, which struggles to capture the wide range of appearance present in realistic, unconstrained images. In this paper, we instead propose a model of human pose that explicitly captures a variety of pose modes. Unlike other multimodal models, our approach includes both global and local pose cues and uses a convex objective and joint training for mode selection and pose estimation. We also employ a cascaded mode selection step which controls the trade-off between speed and accuracy, yielding a 5x speedup in inference and learning. Our model outperforms state-of-the-art approaches across the accuracy-speed trade-off curve for several pose datasets. This includes our newly-collected dataset of people in movies, FLIC, which contains an order of magnitude more labeled data for training and testing than existing datasets.
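
As a rough illustration of the cascaded mode selection described above, here is a minimal Python sketch; the mode interface (cheap_score, full_score, pose_argmax) is hypothetical and not the authors' implementation.

    import numpy as np

    def modec_predict(x, modes, k=3):
        # Hedged sketch of MODEC-style cascaded inference. Each entry of
        # `modes` is a hypothetical dict with a fast global scorer
        # ('cheap_score'), a full scorer ('full_score'), and a per-mode pose
        # estimator ('pose_argmax'); none of these names come from the paper.
        cheap = np.array([m["cheap_score"](x) for m in modes])
        survivors = np.argsort(cheap)[-k:]          # cascade: keep only the top-k modes
        candidates = [(modes[i]["full_score"](x), modes[i]["pose_argmax"](x))
                      for i in survivors]
        best_score, best_pose = max(candidates, key=lambda c: c[0])
        return best_pose                            # pose from the best surviving mode

Increasing k trades speed for accuracy, which is the knob the cascade controls.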


European Conference on Computer Vision | 2010

Cascaded models for articulated pose estimation

Benjamin Sapp; Alexander Toshev; Ben Taskar

We address the problem of articulated human pose estimation by learning a coarse-to-fine cascade of pictorial structure models. While the fine-level state-space of poses of individual parts is too large to permit the use of rich appearance models, most possibilities can be ruled out by efficient structured models at a coarser scale. We propose to learn a sequence of structured models at different pose resolutions, where coarse models filter the pose space for the next level via their max-marginals. The cascade is trained to prune as much as possible while preserving true poses for the final level pictorial structure model. The final level uses much more expensive segmentation, contour and shape features in the model for the remaining filtered set of candidates. We evaluate our framework on the challenging Buffy and PASCAL human pose datasets, improving the state-of-the-art.
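
The filtering step can be sketched with the standard structured-prediction-cascade threshold, a convex combination of the maximum score and the mean max-marginal; this is an illustrative sketch, not the paper's code.

    import numpy as np

    def prune_states(max_marginals, alpha=0.5):
        # Keep a candidate pose state if its max-marginal clears a threshold
        # interpolating between the best score and the mean max-marginal;
        # alpha trades pruning aggressiveness against recall of true poses.
        m = np.asarray(max_marginals, dtype=float)
        threshold = alpha * m.max() + (1.0 - alpha) * m.mean()
        return m >= threshold                       # mask of states passed to the next level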


Computer Vision and Pattern Recognition | 2011

Parsing human motion with stretchable models

Benjamin Sapp; David Weiss; Ben Taskar

We address the problem of articulated human pose estimation in videos using an ensemble of tractable models with rich appearance, shape, contour and motion cues. In previous articulated pose estimation work on unconstrained videos, using temporal coupling of limb positions has made little to no difference in performance over parsing frames individually [8, 28]. One crucial reason for this is that joint parsing of multiple articulated parts over time involves intractable inference and learning problems, and previous work has resorted to approximate inference and simplified models. We overcome these computational and modeling limitations using an ensemble of tractable submodels which couple locations of body joints within and across frames using expressive cues. Each submodel is responsible for tracking a single joint through time (e.g., left elbow) and also models the spatial arrangement of all joints in a single frame. Because of the tree structure of each submodel, we can perform efficient exact inference and use rich temporal features that depend on image appearance, e.g., color tracking and optical flow contours. We propose and experimentally investigate a hierarchy of submodel combination methods, and we find that a highly efficient max-marginal combination method outperforms much slower (by orders of magnitude) approximate inference using dual decomposition. We apply our pose model on a new video dataset of highly varied and articulated poses from TV shows. We show significant quantitative and qualitative improvements over state-of-the-art single-frame pose estimation approaches.
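
A minimal sketch of the max-marginal combination idea, assuming each tractable submodel exposes a per-joint array of max-marginal scores (this representation is an assumption for illustration):

    import numpy as np

    def combine_submodels(submodel_mm):
        # `submodel_mm[s][j]` is assumed to be a 1-D array of max-marginal
        # scores over candidate locations for joint j under tree submodel s.
        # Summing across submodels and taking the per-joint argmax yields one
        # combined pose without any intractable joint inference.
        n_joints = len(submodel_mm[0])
        pose = []
        for j in range(n_joints):
            combined = sum(mm[j] for mm in submodel_mm)
            pose.append(int(np.argmax(combined)))
        return pose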


Computer Vision and Pattern Recognition | 2009

Learning from ambiguously labeled images

Timothee Cour; Benjamin Sapp; Chris Jordan; Benjamin Taskar

In many image and video collections, we have access only to partially labeled data. For example, personal photo collections often contain several faces per image and a caption that only specifies who is in the picture, but not which name matches which face. Similarly, movie screenplays can tell us who is in the scene, but not when and where they are on the screen. We formulate the learning problem in this setting as partially-supervised multiclass classification where each instance is labeled ambiguously with more than one label. We show theoretically that effective learning is possible under reasonable assumptions even when all the data is weakly labeled. Motivated by the analysis, we propose a general convex learning formulation based on minimization of a surrogate loss appropriate for the ambiguous label setting. We apply our framework to identifying faces culled from Web news sources and to naming characters in TV series and movies. We experiment on a very large dataset consisting of 100 hours of video, and in particular achieve 6% error for character naming on 16 episodes of LOST.
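
A minimal sketch of a convex surrogate loss for ambiguous labels in the spirit of this formulation (the exact form and the hinge choice are assumptions):

    import numpy as np

    def ambiguous_label_loss(scores, candidate_labels):
        # Reward the average score of the candidate (ambiguous) labels and
        # penalize every label outside the candidate set, using a hinge;
        # `scores` holds one classifier score per class.
        hinge = lambda z: np.maximum(0.0, 1.0 - z)
        scores = np.asarray(scores, dtype=float)
        cand = np.zeros(scores.shape[0], dtype=bool)
        cand[list(candidate_labels)] = True
        return float(hinge(scores[cand].mean()) + hinge(-scores[~cand]).sum())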


Computer Vision and Pattern Recognition | 2010

Adaptive pose priors for pictorial structures

Benjamin Sapp; Chris Jordan; Ben Taskar

Pictorial structure (PS) models are extensively used for part-based recognition of scenes, people, animals and multi-part objects. To achieve tractability, the structure and parameterization of the model is often restricted, for example, by assuming tree dependency structure and unimodal, data-independent pairwise interactions. These expressivity restrictions fail to capture important patterns in the data. On the other hand, local methods such as nearest-neighbor classification and kernel density estimation provide non-parametric flexibility but require large amounts of data to generalize well. We propose a simple semi-parametric approach that combines the tractability of pictorial structure inference with the flexibility of non-parametric methods by expressing a subset of model parameters as kernel regression estimates from a learned sparse set of exemplars. This yields query-specific, image-dependent pose priors. We develop an effective shape-based kernel for upper-body pose similarity and propose a leave-one-out loss function for learning a sparse subset of exemplars for kernel regression. We apply our techniques to two challenging datasets of human figure parsing and advance the state-of-the-art (from 80% to 86% on the Buffy dataset [8]), while using only 15% of the training data as exemplars.
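
A minimal sketch of the kernel-regression idea, assuming vector-valued exemplar features and prior parameters, and a Gaussian kernel standing in for the paper's shape-based kernel:

    import numpy as np

    def adaptive_prior(query_feat, exemplar_feats, exemplar_params, bandwidth=1.0):
        # Express a subset of pictorial-structure parameters as a kernel
        # regression over a sparse exemplar set; the Gaussian kernel and
        # these representations are assumptions for illustration.
        d2 = ((exemplar_feats - query_feat) ** 2).sum(axis=1)
        w = np.exp(-d2 / (2.0 * bandwidth ** 2))    # similarity of each exemplar
        w /= w.sum()
        return w @ exemplar_params                  # query-specific, image-dependent prior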


Annual Computer Security Applications Conference | 2012

Practicality of accelerometer side channels on smartphones

Adam J. Aviv; Benjamin Sapp; Matt Blaze; Jonathan M. Smith

Modern smartphones are equipped with a plethora of sensors that enable a wide range of interactions, but some of these sensors can be employed as a side channel to surreptitiously learn about user input. In this paper, we show that the accelerometer sensor can also be employed as a high-bandwidth side channel; in particular, we demonstrate how to use the accelerometer sensor to learn user tap- and gesture-based input as required to unlock smartphones using a PIN/password or Android's graphical password pattern. Using data collected from a diverse group of 24 users in controlled (while sitting) and uncontrolled (while walking) settings, we develop sample-rate-independent features for accelerometer readings based on signal processing and polynomial fitting techniques. In controlled settings, our prediction model can, on average, classify the PIN entered 43% of the time and the pattern 73% of the time within 5 attempts when selecting from a test set of 50 PINs and 50 patterns. In uncontrolled settings, while users are walking, our model can still classify 20% of the PINs and 40% of the patterns within 5 attempts. We additionally explore the possibility of constructing an accelerometer-reading-to-input dictionary and find that such dictionaries would be greatly challenged by movement noise and cross-user training.
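
A minimal sketch of sample-rate-independent features via polynomial fitting; the normalization, window handling, and degree are assumptions, not the paper's exact feature design:

    import numpy as np

    def poly_features(t, accel, degree=3):
        # Fit a low-degree polynomial to each accelerometer axis over the
        # gesture window and use the coefficients as the feature vector, so
        # the representation does not depend on how many samples the window
        # happens to contain (i.e., on the device's sampling rate).
        t = (t - t[0]) / (t[-1] - t[0])             # normalize time to [0, 1]
        coeffs = [np.polyfit(t, accel[:, axis], degree)
                  for axis in range(accel.shape[1])]
        return np.concatenate(coeffs)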


European Conference on Computer Vision | 2016

The Unreasonable Effectiveness of Noisy Data for Fine-Grained Recognition

Jonathan Krause; Benjamin Sapp; Andrew Howard; Howard Zhou; Alexander Toshev; Tom Duerig; James Philbin; Li Fei-Fei

Current approaches for fine-grained recognition do the following: First, recruit experts to annotate a dataset of images, optionally also collecting more structured data in the form of part annotations and bounding boxes. Second, train a model utilizing this data. Toward the goal of solving fine-grained recognition, we introduce an alternative approach, leveraging free, noisy data from the web and simple, generic methods of recognition. This approach has benefits in both performance and scalability. We demonstrate its efficacy on four fine-grained datasets, greatly exceeding existing state of the art without the manual collection of even a single label, and furthermore show first results at scaling to more than 10,000 fine-grained categories. Quantitatively, we achieve top-1 accuracies of 92.3% on CUB-200-2011, 85.4% on Birdsnap, 93.4% on FGVC-Aircraft, and 80.8% on Stanford Dogs without using their annotated training sets. We compare our approach to an active learning approach for expanding fine-grained datasets.
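
A minimal sketch of the data-collection recipe, where `search_images` and `near_duplicate_of_test` are hypothetical callables standing in for a web image search and a test-set de-duplication filter:

    def build_noisy_dataset(categories, search_images, near_duplicate_of_test):
        # Assemble a training set from free, noisy web results: label each
        # search result with the category it was queried for, dropping
        # near-duplicates of evaluation images to avoid contamination.
        data = []
        for label, name in enumerate(categories):
            for image in search_images(name):       # free but noisy web results
                if near_duplicate_of_test(image):
                    continue
                data.append((image, label))
        return data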


International Conference on Computer Vision | 2011

Recognizing manipulation actions in arts and crafts shows using domain-specific visual and textual cues

Benjamin Sapp; Rizwan Chaudhry; Xiaodong Yu; Gautam Singh; Ian Perera; Francis Ferraro; Evelyne Tzoukermann; Jana Kosecka; Jan Neumann

We present an approach for automatic annotation of commercial videos from an arts-and-crafts domain with the aid of textual descriptions. The main focus is on recognizing both manipulation actions (e.g. cut, draw, glue) and the tools that are used to perform these actions (e.g. markers, brushes, glue bottle). We demonstrate how multiple visual cues such as motion descriptors, object presence, and hand poses can be combined with the help of contextual priors that are automatically extracted from associated transcripts or online instructions. Using these diverse features and linguistic information, we propose several increasingly complex computational models for recognizing elementary manipulation actions and composite activities, as well as their temporal order. The approach is evaluated on a novel dataset comprising 27 episodes of PBS Sprout TV, each containing on average 8 manipulation actions.
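
A minimal sketch of combining visual evidence with transcript-derived priors; the simple product rule here is an assumption, not the paper's exact model:

    import numpy as np

    def recognize_action(visual_scores, transcript_prior):
        # Rescale per-action visual scores by a contextual prior extracted
        # from the transcript or online instructions, then normalize over
        # actions and pick the most likely one.
        posterior = np.asarray(visual_scores, dtype=float) * np.asarray(transcript_prior, dtype=float)
        posterior /= posterior.sum()
        return int(np.argmax(posterior)), posterior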


Journal of Machine Learning Research | 2011

Learning from Partial Labels

Timothee Cour; Benjamin Sapp; Ben Taskar


Neural Information Processing Systems | 2010

Sidestepping Intractable Inference with Structured Ensemble Cascades

David Weiss; Benjamin Sapp; Ben Taskar

Collaboration


Dive into Benjamin Sapp's collaborations.

Top Co-Authors

Ben Taskar

University of Washington

David Weiss

University of Pennsylvania

Timothee Cour

University of Pennsylvania

Chris Jordan

University of Pennsylvania

Gautam Singh

George Mason University

Ian Perera

University of Rochester

Jana Kosecka

George Mason University
