Stephen Gould | Researchain

Publication

Featured research published by Stephen Gould.

International Conference on Computer Vision | 2009

Decomposing a scene into geometric and semantically consistent regions

Stephen Gould; Richard Fulton; Daphne Koller

High-level, or holistic, scene understanding involves reasoning about objects, regions, and the 3D relationships between them. This requires a representation above the level of pixels that can be endowed with high-level attributes such as class of object/region, its orientation, and (rough 3D) location within the scene. Towards this goal, we propose a region-based model which combines appearance and scene geometry to automatically decompose a scene into semantically meaningful regions. Our model is defined in terms of a unified energy function over scene appearance and structure. We show how this energy function can be learned from data and present an efficient inference technique that makes use of multiple over-segmentations of the image to propose moves in the energy-space. We show, experimentally, that our method achieves state-of-the-art performance on the tasks of both multi-class image segmentation and geometric reasoning. Finally, by understanding region classes and geometry, we show how our model can be used as the basis for 3D reconstruction of the scene.

International Journal of Computer Vision | 2008

Multi-Class Segmentation with Relative Location Prior

Stephen Gould; Jim Rodgers; David Cohen; Daphne Koller

Multi-class image segmentation has made significant advances in recent years through the combination of local and global features. One important type of global feature is that of inter-class spatial relationships. For example, identifying “tree” pixels indicates that pixels above and to the sides are more likely to be “sky” whereas pixels below are more likely to be “grass.” Incorporating such global information across the entire image and between all classes is a computational challenge as it is image-dependent, and hence, cannot be precomputed. In this work we propose a method for capturing global information from inter-class spatial relationships and encoding it as a local feature. We employ a two-stage classification process to label all image pixels. First, we generate predictions which are used to compute a local relative location feature from learned relative location maps. In the second stage, we combine this with appearance-based features to provide a final segmentation. We compare our results to recent published results on several multi-class image segmentation databases and show that the incorporation of relative location information allows us to significantly outperform the current state-of-the-art.

Computer Vision and Pattern Recognition | 2010

Single image depth estimation from predicted semantic labels

Beyang Liu; Stephen Gould; Daphne Koller

We consider the problem of estimating the depth of each pixel in a scene from a single monocular image. Unlike traditional approaches [18, 19], which attempt to map from appearance features to depth directly, we first perform a semantic segmentation of the scene and use the semantic labels to guide the 3D reconstruction. This approach provides several advantages: By knowing the semantic class of a pixel or region, depth and geometry constraints can be easily enforced (e.g., “sky” is far away and “ground” is horizontal). In addition, depth can be more readily predicted by measuring the difference in appearance with respect to a given semantic class. For example, a tree will have more uniform appearance in the distance than it does close up. Finally, the incorporation of semantic features allows us to achieve state-of-the-art results with a significantly simpler model than previous works.

Computer Vision and Pattern Recognition | 2016

Dynamic Image Networks for Action Recognition

Hakan Bilen; Basura Fernando; Efstratios Gavves; Andrea Vedaldi; Stephen Gould

We introduce the concept of dynamic image, a novel compact representation of videos useful for video analysis especially when convolutional neural networks (CNNs) are used. The dynamic image is based on the rank pooling concept and is obtained through the parameters of a ranking machine that encodes the temporal evolution of the frames of the video. Dynamic images are obtained by directly applying rank pooling on the raw image pixels of a video producing a single RGB image per video. This idea is simple but powerful as it enables the use of existing CNN models directly on video data with fine-tuning. We present an efficient and effective approximate rank pooling operator, speeding it up orders of magnitude compared to rank pooling. Our new approximate rank pooling CNN layer allows us to generalize dynamic images to dynamic feature maps and we demonstrate the power of our new representations on standard benchmarks in action recognition achieving state-of-the-art performance.
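The following is a minimal sketch of how approximate rank pooling can collapse a video into a single image. The abstract does not state the operator's coefficients, so the linear weights `alpha_t = 2t - T - 1` used here are an assumption (one simple form associated with approximate rank pooling), and the function name and flat-list frame layout are hypothetical choices for illustration, not the authors' implementation.

```python
def approximate_rank_pooling(frames):
    """Collapse T equally sized frames into one 'dynamic image'.

    frames: list of T frames, each a flat list of pixel values.
    Assumption: frame t (1-indexed) is weighted by alpha_t = 2t - T - 1,
    so later frames get positive weight and earlier frames negative weight.
    """
    T = len(frames)
    if T == 0:
        raise ValueError("need at least one frame")
    n = len(frames[0])
    dynamic = [0.0] * n
    for t, frame in enumerate(frames, start=1):
        if len(frame) != n:
            raise ValueError("all frames must have the same size")
        alpha = 2 * t - T - 1  # assumed linear coefficient
        for i, value in enumerate(frame):
            dynamic[i] += alpha * value
    return dynamic


# Three two-pixel frames: pixel 0 brightens over time, pixel 1 is static.
frames = [[0.0, 1.0], [1.0, 1.0], [2.0, 1.0]]
dynamic_image = approximate_rank_pooling(frames)
```

Because the assumed coefficients sum to zero, a pixel that never changes contributes nothing, while a pixel that changes over time leaves a signed trace. The output is a single weighted sum of frames, which is why it can be passed to an existing image CNN.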

International Conference on Robotics and Automation | 2009

High-accuracy 3D sensing for mobile manipulation: Improving object detection and door opening

Morgan Quigley; Siddharth Batra; Stephen Gould; Ellen Klingbeil; Quoc V. Le; Ashley Wellman; Andrew Y. Ng

High-resolution 3D scanning can improve the performance of object detection and door opening, two tasks critical to the operation of mobile manipulators in cluttered homes and workplaces. We discuss how high-resolution depth information can be combined with visual imagery to improve the performance of object detection beyond what is (currently) achievable with 2D images alone, and we present door-opening and inventory-taking experiments.

European Conference on Computer Vision | 2016

SPICE: Semantic Propositional Image Caption Evaluation

Peter Anderson; Basura Fernando; Mark Johnson; Stephen Gould

There is considerable interest in the task of automatically generating image captions. However, evaluation is challenging. Existing automatic evaluation metrics are primarily sensitive to n-gram overlap, which is neither necessary nor sufficient for the task of simulating human judgment. We hypothesize that semantic propositional content is an important component of human caption evaluation, and propose a new automated caption evaluation metric defined over scene graphs coined SPICE. Extensive evaluations across a range of models and datasets indicate that SPICE captures human judgments over model-generated captions better than other automatic metrics (e.g., system-level correlation of 0.88 with human judgments on the MS COCO dataset, versus 0.43 for CIDEr and 0.53 for METEOR). Furthermore, SPICE can answer questions such as “which caption-generator best understands colors?” and “can caption-generators count?”
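As a rough illustration of scoring captions by propositional content rather than n-gram overlap, the sketch below computes an F1 score over sets of semantic tuples (objects, attributes, relations). It assumes the tuples have already been extracted from scene graphs and that they match only when identical. The tuple format, function name, and exact-match rule are simplifying assumptions, not the published SPICE implementation.

```python
def tuple_f1(candidate_tuples, reference_tuples):
    """F1 score between two collections of semantic tuples.

    Each tuple is a hypothetical stand-in for one scene-graph proposition,
    e.g. ("girl",), ("girl", "young"), or ("girl", "stand-on", "court").
    Assumption: tuples match only when they are exactly equal.
    """
    candidate = set(candidate_tuples)
    reference = set(reference_tuples)
    matched = len(candidate & reference)
    if matched == 0:
        return 0.0
    precision = matched / len(candidate)
    recall = matched / len(reference)
    return 2 * precision * recall / (precision + recall)


candidate = [("girl",), ("girl", "young"), ("girl", "stand-on", "court")]
reference = [("girl",), ("court",), ("racket",), ("girl", "stand-on", "court")]
score = tuple_f1(candidate, reference)
```

Here two of the three candidate tuples appear among the four reference tuples, giving precision 2/3 and recall 1/2. Restricting the tuples to one type, such as color attributes, is one way such a score could be broken down to ask what a captioner describes well.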

European Conference on Computer Vision | 2010

Discriminative learning with latent variables for cluttered indoor scene understanding

Huayan Wang; Stephen Gould; Daphne Koller

We address the problem of understanding an indoor scene from a single image in terms of recovering the layouts of the faces (floor, ceiling, walls) and furniture. A major challenge of this task arises from the fact that most indoor scenes are cluttered by furniture and decorations, whose appearances vary drastically across scenes, and can hardly be modeled (or even hand-labeled) consistently. In this paper we tackle this problem by introducing latent variables to account for clutters, so that the observed image is jointly explained by the face and clutter layouts. Model parameters are learned in the maximum margin formulation, which is constrained by extra prior energy terms that define the role of the latent variables. Our approach enables taking into account and inferring indoor clutter layouts without hand-labeling of the clutters in the training set. Yet it outperforms the state-of-the-art method of Hedau et al. [4] that requires clutter labels.

European Conference on Computer Vision | 2012

PatchMatchGraph: building a graph of dense patch correspondences for label transfer

Stephen Gould; Yuhang Zhang

We address the problem of semantic segmentation, or multi-class pixel labeling, by constructing a graph of dense overlapping patch correspondences across large image sets. We then transfer annotations from labeled images to unlabeled images using the established patch correspondences. Unlike previous approaches to non-parametric label transfer our approach does not require an initial image retrieval step. Moreover, we operate on a graph for computing mappings between images, which avoids the need for exhaustive pairwise comparisons. Consequently, we can leverage offline computation to enhance performance at test time. We conduct extensive experiments to analyze different variants of our graph construction algorithm and evaluate multi-class pixel labeling performance on several challenging datasets.

Computer Vision and Pattern Recognition | 2016

Discriminative Hierarchical Rank Pooling for Activity Recognition

Basura Fernando; Peter Anderson; Marcus Hutter; Stephen Gould

We present hierarchical rank pooling, a video sequence encoding method for activity recognition. It consists of a network of rank pooling functions which captures the dynamics of rich convolutional neural network features within a video sequence. By stacking non-linear feature functions and rank pooling over one another, we obtain a high capacity dynamic encoding mechanism, which is used for action recognition. We present a method for jointly learning the video representation and activity classifier parameters. Our method obtains state-of-the-art results on three important activity recognition benchmarks: 76.7% on Hollywood2, 66.9% on HMDB51, and 91.4% on UCF101.

European Conference on Computer Vision | 2014

Superpixel Graph Label Transfer with Learned Distance Metric

Stephen Gould; Jiecheng Zhao; Xuming He; Yuhang Zhang

We present a fast approximate nearest neighbor algorithm for semantic segmentation. Our algorithm builds a graph over superpixels from an annotated set of training images. Edges in the graph represent approximate nearest neighbors in feature space. At test time we match superpixels from a novel image to the training images by adding the novel image to the graph. A move-making search algorithm allows us to leverage the graph and image structure for finding matches. We then transfer labels from the training images to the image under test. To promote good matches between superpixels we propose to learn a distance metric that weights the edges in our graph. Our approach is evaluated on four standard semantic segmentation datasets and achieves results comparable with the state-of-the-art.
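The label-transfer idea can be sketched with a brute-force nearest-neighbor search under a weighted distance. This is a deliberately simplified stand-in: the graph construction and move-making search described above are replaced by exhaustive comparison, the learned metric is reduced to hypothetical per-dimension weights, and all names and feature values are invented for illustration.

```python
def transfer_labels(train_features, train_labels, test_features, weights):
    """Give each test superpixel the label of its nearest training superpixel.

    Assumption: the distance is a weighted squared Euclidean distance with
    one non-negative weight per feature dimension (a diagonal metric).
    """
    def distance(a, b):
        return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))

    transferred = []
    for feature in test_features:
        nearest = min(
            range(len(train_features)),
            key=lambda i: distance(feature, train_features[i]),
        )
        transferred.append(train_labels[nearest])
    return transferred


train_features = [(0.0, 0.0), (1.0, 5.0)]
train_labels = ["sky", "grass"]
test_features = [(0.9, 0.5)]

# The chosen match depends on which feature dimension the metric emphasizes.
by_first_dim = transfer_labels(train_features, train_labels, test_features, (1.0, 0.0))
by_second_dim = transfer_labels(train_features, train_labels, test_features, (0.0, 1.0))
```

The two calls transfer different labels to the same test superpixel, which shows why the choice of distance metric matters for match quality.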