
Publication


Featured research published by Sergio Guadarrama.


ACM Multimedia | 2014

Caffe: Convolutional Architecture for Fast Feature Embedding

Yangqing Jia; Evan Shelhamer; Jeff Donahue; Sergey Karayev; Jonathan Long; Ross B. Girshick; Sergio Guadarrama; Trevor Darrell

Caffe provides multimedia scientists and practitioners with a clean and modifiable framework for state-of-the-art deep learning algorithms and a collection of reference models. The framework is a BSD-licensed C++ library with Python and MATLAB bindings for training and deploying general-purpose convolutional neural networks and other deep models efficiently on commodity architectures. Caffe fits industry and internet-scale media needs with CUDA GPU computation, processing over 40 million images a day on a single K40 or Titan GPU (approximately 2 ms per image). By separating model representation from actual implementation, Caffe allows experimentation and seamless switching among platforms for ease of development and deployment, from prototyping machines to cloud environments. Caffe is maintained and developed by the Berkeley Vision and Learning Center (BVLC) with the help of an active community of contributors on GitHub. It powers ongoing research projects, large-scale industrial applications, and startup prototypes in vision, speech, and multimedia.
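
The Python bindings mentioned above reflect this separation of model representation (a prototxt file) from learned parameters (a caffemodel file). Below is a minimal pycaffe sketch of loading a network and running one forward pass; the file names and the input blob name 'data' are placeholders that assume a standard deploy-style definition.

```python
# Minimal pycaffe sketch: load a network defined in a prototxt file together
# with pre-trained weights, and run a single forward pass on the GPU.
# "deploy.prototxt" and "model.caffemodel" are placeholder file names.
import numpy as np
import caffe

caffe.set_mode_gpu()                      # fall back to caffe.set_mode_cpu() without CUDA
net = caffe.Net('deploy.prototxt',        # model representation (architecture)
                'model.caffemodel',       # learned parameters
                caffe.TEST)

# Fill the input blob with a dummy batch matching the network's input shape.
image = np.random.rand(*net.blobs['data'].data.shape).astype(np.float32)
net.blobs['data'].data[...] = image

output = net.forward()                    # dict mapping output blob names to arrays
for name, blob in output.items():
    print(name, blob.shape)
```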


Computer Vision and Pattern Recognition | 2017

Speed/Accuracy Trade-Offs for Modern Convolutional Object Detectors

Jonathan Huang; Vivek Rathod; Chen Sun; Menglong Zhu; Anoop Korattikara; Alireza Fathi; Ian Fischer; Zbigniew Wojna; Yang Song; Sergio Guadarrama; Kevin P. Murphy

The goal of this paper is to serve as a guide for selecting a detection architecture that achieves the right speed/memory/accuracy balance for a given application and platform. To this end, we investigate various ways to trade accuracy for speed and memory usage in modern convolutional object detection systems. A number of successful systems have been proposed in recent years, but apples-to-apples comparisons are difficult due to different base feature extractors (e.g., VGG, Residual Networks), different default image resolutions, as well as different hardware and software platforms. We present a unified implementation of the Faster R-CNN [30], R-FCN [6] and SSD [25] systems, which we view as meta-architectures, and trace out the speed/accuracy trade-off curve created by using alternative feature extractors and varying other critical parameters such as image size within each of these meta-architectures. On one extreme end of this spectrum where speed and memory are critical, we present a detector that achieves real-time speeds and can be deployed on a mobile device. On the opposite end in which accuracy is critical, we present a detector that achieves state-of-the-art performance measured on the COCO detection task.
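
The kind of sweep described here can be sketched as: benchmark each (meta-architecture, feature extractor, input resolution) combination on fixed hardware and keep the points that trace the speed/accuracy frontier. In the sketch below, fake_benchmark is a hypothetical stand-in for a real train/evaluate/time pipeline, not part of the authors' released code.

```python
# Hypothetical grid sweep over detector configurations, tracing a
# speed/accuracy trade-off curve. fake_benchmark is a placeholder.
import itertools
import random

META_ARCHITECTURES = ['faster_rcnn', 'r_fcn', 'ssd']
FEATURE_EXTRACTORS = ['vgg16', 'resnet101', 'inception_v2', 'mobilenet']
RESOLUTIONS = [300, 600]

def fake_benchmark(meta, extractor, resolution):
    """Placeholder returning (inference_ms, coco_map); a real sweep would
    train, evaluate, and time the configuration on fixed hardware."""
    rng = random.Random(f"{meta}-{extractor}-{resolution}")
    return rng.uniform(20, 300), rng.uniform(0.15, 0.40)

results = []
for meta, extractor, res in itertools.product(META_ARCHITECTURES, FEATURE_EXTRACTORS, RESOLUTIONS):
    ms, mAP = fake_benchmark(meta, extractor, res)
    results.append({'meta': meta, 'extractor': extractor, 'res': res, 'ms': ms, 'mAP': mAP})

# Keep only Pareto-optimal points: no other config is both faster and more accurate.
frontier = [r for r in results
            if not any(o['ms'] < r['ms'] and o['mAP'] > r['mAP'] for o in results)]
for r in sorted(frontier, key=lambda r: r['ms']):
    print(f"{r['meta']:12s} {r['extractor']:12s} {r['res']:4d}px  {r['ms']:6.1f} ms  mAP={r['mAP']:.3f}")
```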


International Conference on Computer Vision | 2015

Im2Calories: Towards an Automated Mobile Vision Food Diary

Austin Myers; Nick Johnston; Vivek Rathod; Anoop Korattikara; Alexander N. Gorban; Nathan Silberman; Sergio Guadarrama; George Papandreou; Jonathan Huang; Kevin P. Murphy

We present a system which can recognize the contents of your meal from a single image, and then predict its nutritional contents, such as calories. The simplest version assumes that the user is eating at a restaurant for which we know the menu. In this case, we can collect images offline to train a multi-label classifier. At run time, we apply the classifier (running on your phone) to predict which foods are present in your meal, and we look up the corresponding nutritional facts. We apply this method to a new dataset of images from 23 different restaurants, using a CNN-based classifier, significantly outperforming previous work. The more challenging setting works outside of restaurants. In this case, we need to estimate the size of the foods, as well as their labels. This requires solving segmentation and depth/volume estimation from a single image. We present CNN-based approaches to these problems, with promising preliminary results.
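
In the restaurant setting described above, the pipeline reduces to multi-label classification over a known menu followed by a nutrition lookup. The sketch below uses made-up menu items, calorie values, and classifier scores; none of these come from the paper's dataset.

```python
# Toy version of the restaurant pipeline: threshold multi-label classifier
# scores over a known menu, then look up and sum nutritional facts.
# Menu items, calorie values, and scores are invented for illustration.
import numpy as np

MENU = ['cheeseburger', 'fries', 'side_salad', 'cola']
CALORIES = {'cheeseburger': 550, 'fries': 365, 'side_salad': 120, 'cola': 140}

def predict_meal(classifier_scores, threshold=0.5):
    """classifier_scores: per-menu-item probabilities from a multi-label CNN."""
    present = [item for item, p in zip(MENU, classifier_scores) if p >= threshold]
    total = sum(CALORIES[item] for item in present)
    return present, total

# Pretend the on-phone classifier produced these sigmoid outputs for one photo.
scores = np.array([0.92, 0.81, 0.10, 0.66])
items, kcal = predict_meal(scores)
print(items, kcal)   # ['cheeseburger', 'fries', 'cola'] 1055
```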


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2017

Long-Term Recurrent Convolutional Networks for Visual Recognition and Description

Jeff Donahue; Lisa Anne Hendricks; Marcus Rohrbach; Subhashini Venugopalan; Sergio Guadarrama; Kate Saenko; Trevor Darrell

Models based on deep convolutional networks have dominated recent image interpretation tasks; we investigate whether models which are also recurrent, or “temporally deep”, are effective for tasks involving sequences, visual and otherwise. We develop a novel recurrent convolutional architecture suitable for large-scale visual learning which is end-to-end trainable, and demonstrate the value of these models on benchmark video recognition tasks, image description and retrieval problems, and video narration challenges. In contrast to current models which assume a fixed spatio-temporal receptive field or simple temporal averaging for sequential processing, recurrent convolutional models are “doubly deep” in that they can be compositional in spatial and temporal “layers”. Such models may have advantages when target concepts are complex and/or training data are limited. Learning long-term dependencies is possible when nonlinearities are incorporated into the network state updates. Long-term RNN models are appealing in that they can directly map variable-length inputs (e.g., video frames) to variable-length outputs (e.g., natural language text) and can model complex temporal dynamics; yet they can be optimized with backpropagation. Our recurrent long-term models are directly connected to modern visual convnet models and can be jointly trained to simultaneously learn temporal dynamics and convolutional perceptual representations. Our results show such models have distinct advantages over state-of-the-art models for recognition or generation which are separately defined and/or optimized.
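
The paper's models were implemented in Caffe; purely to illustrate the "CNN per frame, then recurrence over time" structure, here is a minimal PyTorch-style sketch with arbitrary layer sizes. It is not the authors' architecture.

```python
# Minimal sketch of a recurrent convolutional model: a small CNN encodes each
# frame, an LSTM integrates the per-frame features over time, and a linear
# layer predicts a label per sequence. Layer sizes are arbitrary.
import torch
import torch.nn as nn

class TinyLRCN(nn.Module):
    def __init__(self, feat_dim=64, hidden_dim=128, num_classes=10):
        super().__init__()
        self.cnn = nn.Sequential(                 # per-frame "spatial" layers
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, feat_dim),
        )
        self.lstm = nn.LSTM(feat_dim, hidden_dim, batch_first=True)  # "temporal" layers
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, video):                     # video: (batch, time, 3, H, W)
        b, t = video.shape[:2]
        feats = self.cnn(video.flatten(0, 1))     # (batch*time, feat_dim)
        feats = feats.view(b, t, -1)
        out, _ = self.lstm(feats)                 # (batch, time, hidden_dim)
        return self.head(out[:, -1])              # classify from the last time step

logits = TinyLRCN()(torch.randn(2, 8, 3, 64, 64))
print(logits.shape)                               # torch.Size([2, 10])
```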


Intelligent Robots and Systems | 2013

Grounding spatial relations for human-robot interaction

Sergio Guadarrama; Lorenzo Riano; Dave Golland; Daniel Göhring; Yangqing Jia; Daniel Klein; Pieter Abbeel; Trevor Darrell

We propose a system for human-robot interaction that learns both models for spatial prepositions and for object recognition. Our system grounds the meaning of an input sentence in terms of visual percepts coming from the robot's sensors in order to send an appropriate command to the PR2 or respond to spatial queries. To perform this grounding, the system recognizes the objects in the scene, determines which spatial relations hold between those objects, and semantically parses the input sentence. The proposed system uses the visual and spatial information in conjunction with the semantic parse to interpret statements that refer to objects (nouns) and their spatial relationships (prepositions), and to execute commands (actions). The semantic parse is inherently compositional, allowing the robot to understand complex commands that refer to multiple objects and relations such as: “Move the cup close to the robot to the area in front of the plate and behind the tea box”. Our system correctly parses 94% of the 210 online test sentences, correctly interprets 91% of the correctly parsed sentences, and correctly executes 89% of the correctly interpreted sentences.
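
As a toy illustration of grounding spatial prepositions in perceived object poses, the sketch below hard-codes geometric rules for "close to" and "in front of" over object centroids. The paper instead learns these models from data; all positions and thresholds here are made up.

```python
# Toy grounding of two spatial prepositions over object centroids expressed in
# the robot's coordinate frame (x forward, y left). The learned models in the
# paper are replaced by hand-written geometric rules; positions are invented.
import math

objects = {                      # name -> (x, y) in metres, robot at the origin
    'cup':     (0.60, 0.10),
    'plate':   (0.80, 0.05),
    'tea_box': (0.95, 0.00),
}

def close_to(a, b, radius=0.25):
    ax, ay = objects[a]; bx, by = objects[b]
    return math.hypot(ax - bx, ay - by) <= radius

def in_front_of(a, b):
    """'a in front of b' from the robot's viewpoint: a lies between robot and b."""
    return objects[a][0] < objects[b][0]

# Which objects satisfy "close to the plate and in front of the tea box"?
matches = [name for name in objects
           if name not in ('plate', 'tea_box')
           and close_to(name, 'plate') and in_front_of(name, 'tea_box')]
print(matches)                   # ['cup']
```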


Robotics: Science and Systems | 2014

Open-vocabulary Object Retrieval

Sergio Guadarrama; Erik Rodner; Kate Saenko; Ning Zhang; Ryan Farrell; Jeff Donahue; Trevor Darrell

In this paper, we address the problem of retrieving objects based on open-vocabulary natural language queries: Given a phrase describing a specific object, e.g., “the corn flakes box”, the task is to find the best match in a set of images containing candidate objects. When naming objects, humans tend to use natural language with rich semantics, including basic-level categories, fine-grained categories, and instance-level concepts such as brand names. Existing approaches to large-scale object recognition fail in this scenario, as they expect queries that map directly to a fixed set of pre-trained visual categories, e.g. ImageNet synset tags. We address this limitation by introducing a novel object retrieval method. Given a candidate object image, we first map it to a set of words that are likely to describe it, using several learned image-to-text projections. We also propose a method for handling open vocabularies, i.e., words not contained in the training data. We then compare the natural language query to the sets of words predicted for each candidate and select the best match. Our method can combine category- and instance-level semantics in a common representation. We present extensive experimental results on several datasets using both instance-level and category-level matching and show that our approach can accurately retrieve objects based on extremely varied open-vocabulary queries. The source code of our approach will be publicly available together with pre-trained models at http://openvoc.berkeleyvision.org and could be directly used for robotics applications.
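
A toy version of the retrieval step described above: each candidate image is assumed to have already been mapped to a bag of descriptive words (hard-coded here, standing in for the learned image-to-text projections), and the query is matched against those word sets. The scoring rule is a simple word overlap, not the paper's.

```python
# Toy open-vocabulary retrieval: match a natural-language query against word
# sets predicted for each candidate image. The predicted words are hard-coded
# stand-ins for learned image-to-text projections; scoring is plain overlap.
def tokenize(phrase):
    return set(phrase.lower().replace('-', ' ').split())

candidates = {
    'img_001': {'box', 'cereal', 'corn', 'flakes', 'yellow'},
    'img_002': {'mug', 'cup', 'white', 'ceramic'},
    'img_003': {'bottle', 'water', 'plastic', 'clear'},
}

def retrieve(query, candidates):
    q = tokenize(query)
    # Score = fraction of query words covered by the candidate's predicted words.
    scores = {img: len(q & words) / len(q) for img, words in candidates.items()}
    return max(scores, key=scores.get), scores

best, scores = retrieve('the corn flakes box', candidates)
print(best, scores[best])   # img_001 0.75  ("the" is not predicted for any image)
```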


International Conference on Robotics and Automation | 2016

Cross-modal adaptation for RGB-D detection

Judy Hoffman; Saurabh Gupta; Jian Leong; Sergio Guadarrama; Trevor Darrell

In this paper we propose a technique to adapt convolutional neural network (CNN) based object detectors trained on RGB images to effectively leverage depth images at test time to boost detection performance. Given labeled depth images for a handful of categories, we adapt an RGB object detector for a new category such that it can now use depth images in addition to RGB images at test time to produce more accurate detections. Our approach is built upon the observation that lower layers of a CNN are largely task- and category-agnostic but domain-specific, while higher layers are largely task- and category-specific but domain-agnostic. We operationalize this observation by proposing a mid-level fusion of RGB and depth CNNs. Experimental evaluation on the challenging NYUD2 dataset shows that our proposed adaptation technique results in an average 21% relative improvement in detection performance over an RGB-only baseline, even when no depth training data is available for the particular category evaluated. We believe our proposed technique will extend advances made in computer vision to RGB-D data, leading to improvements in performance at little additional annotation effort.
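
A minimal sketch of the mid-level fusion idea, under arbitrary layer sizes: each modality keeps its own lower (domain-specific) layers, their mid-level features are concatenated, and shared higher (domain-agnostic) layers sit on top. This is an illustration, not the paper's network or adaptation procedure.

```python
# Sketch of mid-level fusion: separate lower-layer stacks for RGB and depth
# (domain specific), concatenated mid-level features, then shared higher
# layers (domain agnostic). Layer sizes are arbitrary.
import torch
import torch.nn as nn

def lower_stack(in_channels):
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, 3, stride=2, padding=1), nn.ReLU(),
        nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
    )

class MidFusionHead(nn.Module):
    def __init__(self, num_classes=20):
        super().__init__()
        self.rgb_lower = lower_stack(3)      # modality-specific lower layers
        self.depth_lower = lower_stack(1)
        self.higher = nn.Sequential(         # shared, largely domain-agnostic layers
            nn.Conv2d(128, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(128, num_classes),
        )

    def forward(self, rgb, depth):
        fused = torch.cat([self.rgb_lower(rgb), self.depth_lower(depth)], dim=1)
        return self.higher(fused)

scores = MidFusionHead()(torch.randn(2, 3, 64, 64), torch.randn(2, 1, 64, 64))
print(scores.shape)   # torch.Size([2, 20])
```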


2011 IEEE Symposium on Advances in Type-2 Fuzzy Logic Systems (T2FUZZ) | 2011

Constrained type-2 fuzzy sets

Jonathan M. Garibaldi; Sergio Guadarrama

Type-2 fuzzy sets extend the expressive capabilities of type-1 fuzzy sets in that, in addition to representing imprecise concepts, they are able to represent the imprecision in the membership function of fuzzy sets. Their use is particularly appropriate when modelling linguistic concepts such as words that mean slightly different things to different people. However, type-2 fuzzy sets do not place any constraints upon the continuity and other properties of their embedded sets. We argue that, for some concepts, additional properties constraining both the allowable footprint of uncertainty and the embedded sets of type-2 sets are desirable. We present a novel formulation of a constrained type-2 fuzzy set that achieves this, and then detail some of the properties of this novel form of type-2 fuzzy sets.
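
The paper's specific constraints are not reproduced here; purely as background for the terms used above, the sketch below builds an interval type-2 fuzzy set from lower and upper membership functions (whose gap is the footprint of uncertainty) and samples one unconstrained embedded type-1 set. All shapes and parameters are illustrative only.

```python
# Background sketch only: an interval type-2 fuzzy set for a word like "warm",
# represented by lower/upper membership functions whose gap is the footprint
# of uncertainty (FOU). One embedded type-1 set is sampled by picking a
# membership value inside the FOU at each point. The constraints proposed in
# the paper (e.g. on the shape of embedded sets) are NOT implemented here.
import numpy as np

x = np.linspace(0, 40, 81)                       # temperature axis, degrees C

def gaussian(x, mean, sigma):
    return np.exp(-0.5 * ((x - mean) / sigma) ** 2)

upper = gaussian(x, mean=22.0, sigma=6.0)        # upper membership function
lower = 0.8 * gaussian(x, mean=22.0, sigma=4.0)  # lower membership function
assert np.all(lower <= upper)                    # a valid FOU requires lower <= upper

# Sample one (unconstrained) embedded type-1 set inside the FOU.
rng = np.random.default_rng(0)
embedded = lower + rng.uniform(size=x.shape) * (upper - lower)

print(float(upper[x == 22.0][0]), float(lower[x == 22.0][0]))  # 1.0 0.8
```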


The International Journal of Robotics Research | 2016

Understanding object descriptions in robotics by open-vocabulary object retrieval and detection

Sergio Guadarrama; Erik Rodner; Kate Saenko; Trevor Darrell

We address the problem of retrieving and detecting objects based on open-vocabulary natural language queries: given a phrase describing a specific object, for example “the corn flakes box”, the task is to find the best match in a set of images containing candidate objects. When naming objects, humans tend to use natural language with rich semantics, including basic-level categories, fine-grained categories, and instance-level concepts such as brand names. Existing approaches to large-scale object recognition fail in this scenario, as they expect queries that map directly to a fixed set of pre-trained visual categories, for example ImageNet synset tags. We address this limitation by introducing a novel object retrieval method. Given a candidate object image, we first map it to a set of words that are likely to describe it, using several learned image-to-text projections. We also propose a method for handling open vocabularies, that is, words not contained in the training data. We then compare the natural language query to the sets of words predicted for each candidate and select the best match. Our method can combine category- and instance-level semantics in a common representation. We present extensive experimental results on several datasets using both instance-level and category-level matching and show that our approach can accurately retrieve objects based on extremely varied open-vocabulary queries. Furthermore, we show how to process queries referring to objects within scenes, using state-of-the-art adapted detectors. The source code of our approach will be publicly available together with pre-trained models at http://openvoc.berkeleyvision.org and could be directly used for robotics applications.
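
The scene-level extension mentioned here (queries that refer to objects within scenes) can be caricatured as: obtain candidate regions from a detector, predict a word set per region, and score each region against the query. Everything below, including the boxes, word sets, and overlap score, is a made-up stand-in for the adapted detectors and learned projections.

```python
# Illustrative only: score detector proposals from a scene against an
# open-vocabulary query using a word-set overlap score. Boxes and the words
# "predicted" for each region are invented stand-ins for real detector and
# image-to-text outputs.
def tokenize(phrase):
    return set(phrase.lower().split())

def overlap_score(query_words, region_words):
    return len(query_words & region_words) / len(query_words)

# Hypothetical detector proposals: (box in pixels, predicted word set).
proposals = [
    ((40, 60, 180, 300),  {'box', 'cereal', 'corn', 'flakes'}),
    ((200, 80, 330, 280), {'bottle', 'ketchup', 'red'}),
    ((350, 90, 470, 260), {'mug', 'coffee', 'white'}),
]

query = tokenize('the corn flakes box')
best_box, _ = max(proposals, key=lambda p: overlap_score(query, p[1]))
print(best_box)   # (40, 60, 180, 300)
```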


European Conference on Computer Vision | 2018

Tracking Emerges by Colorizing Videos

Carl Vondrick; Abhinav Shrivastava; Alireza Fathi; Sergio Guadarrama; Kevin P. Murphy

We use large amounts of unlabeled video to learn models for visual tracking without manual human supervision. We leverage the natural temporal coherency of color to create a model that learns to colorize gray-scale videos by copying colors from a reference frame. Quantitative and qualitative experiments suggest that this task causes the model to automatically learn to track visual regions. Although the model is trained without any ground-truth labels, our method learns to track well enough to outperform the latest methods based on optical flow. Moreover, our results suggest that failures to track are correlated with failures to colorize, indicating that advancing video colorization may further improve self-supervised visual tracking.
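
The copy mechanism described above can be written down compactly: a softmax over embedding similarities between target-frame and reference-frame locations acts as a pointer that copies reference colors (and, at test time, labels such as segmentation masks). The sketch below uses random features purely to show the shapes involved; it is not the trained model.

```python
# Minimal sketch of the colorization-by-copying mechanism: a softmax over
# embedding similarities points each target-frame location back at
# reference-frame locations, and colors (or, at test time, per-location
# labels) are copied through that pointer. Features are random placeholders.
import numpy as np

rng = np.random.default_rng(0)
n_ref, n_tgt, dim = 200, 200, 64            # spatial locations per frame, embedding size

f_ref = rng.standard_normal((n_ref, dim))   # reference-frame embeddings
f_tgt = rng.standard_normal((n_tgt, dim))   # target-frame embeddings
colors_ref = rng.uniform(size=(n_ref, 3))   # reference-frame colors

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

attention = softmax(f_tgt @ f_ref.T / np.sqrt(dim))   # (n_tgt, n_ref) pointer weights
colors_tgt = attention @ colors_ref                   # predicted target colors

# At test time the same pointer copies any per-location annotation, e.g. a mask.
mask_ref = (rng.uniform(size=(n_ref, 1)) > 0.5).astype(float)
mask_tgt = attention @ mask_ref
print(colors_tgt.shape, mask_tgt.shape)               # (200, 3) (200, 1)
```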

Collaboration


Dive into Sergio Guadarrama's collaborations.

Top Co-Authors

Trevor Darrell, University of California
Jeff Donahue, University of California
Judy Hoffman, University of California
Raymond J. Mooney, University of Texas at Austin