Tomas Simon
Carnegie Mellon University
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Tomas Simon.
computer vision and pattern recognition | 2017
Zhe Cao; Tomas Simon; Shih-En Wei; Yaser Sheikh
We present an approach to efficiently detect the 2D pose of multiple people in an image. The approach uses a nonparametric representation, which we refer to as Part Affinity Fields (PAFs), to learn to associate body parts with individuals in the image. The architecture encodes global context, allowing a greedy bottom-up parsing step that maintains high accuracy while achieving realtime performance, irrespective of the number of people in the image. The architecture is designed to jointly learn part locations and their association via two branches of the same sequential prediction process. Our method placed first in the inaugural COCO 2016 keypoints challenge, and significantly exceeds the previous state-of-the-art result on the MPII Multi-Person benchmark, both in performance and efficiency.
international conference on computer graphics and interactive techniques | 2014
Natasha Kholgade; Tomas Simon; Alexei A. Efros; Yaser Sheikh
Photo-editing software restricts the control of objects in a photograph to the 2D image plane. We present a method that enables users to perform the full range of 3D manipulations, including scaling, rotation, translation, and nonrigid deformations, to an object in a photograph. As 3D manipulations often reveal parts of the object that are hidden in the original photograph, our approach uses publicly available 3D models to guide the completion of the geometry and appearance of the revealed areas of the object. The completion process leverages the structure and symmetry in the stock 3D model to factor out the effects of illumination, and to complete the appearance of the object. We demonstrate our system by producing object manipulations that would be impossible in traditional 2D photo-editing programs, such as turning a car over, making a paper-crane flap its wings, or manipulating airplanes in a historical photograph to change its story.
computer vision and pattern recognition | 2010
Tomas Simon; Minh Hoai Nguyen; Fernando De la Torre; Jeffrey F. Cohn
Automatic facial action unit (AU) detection from video is a long-standing problem in computer vision. Two main approaches have been pursued: (1) static modeling — typically posed as a discriminative classification problem in which each video frame is evaluated independently; (2) temporal modeling — frames are segmented into sequences and typically modeled with a variant of dynamic Bayesian networks. We propose a segment-based approach, kSeg-SVM, that incorporates benefits of both approaches and avoids their limitations. kSeg-SVM is a temporal extension of the spatial bag-of-words. kSeg-SVM is trained within a structured output SVM framework that formulates AU detection as a problem of detecting temporal events in a time series of visual features. Each segment is modeled by a variant of the BoW representation with soft assignment of the words based on similarity. Our framework has several benefits for AU detection: (1) both dependencies between features and the length of action units are modeled; (2) all possible segments of the video may be used for training; and (3) no assumptions are required about the underlying structure of the action unit events (e.g., i.i.d.). Our algorithm finds the best k-or-fewer segments that maximize the SVM score. Experimental results suggest that the proposed method outperforms state-of-the-art static methods for AU detection.
ACM Transactions on Graphics | 2012
Ijaz Akhter; Tomas Simon; Sohaib Khan; Iain A. Matthews; Yaser Sheikh
A variety of dynamic objects, such as faces, bodies, and cloth, are represented in computer graphics as a collection of moving spatial landmarks. Spatiotemporal data is inherent in a number of graphics applications including animation, simulation, and object and camera tracking. The principal modes of variation in the spatial geometry of objects are typically modeled using dimensionality reduction techniques, while concurrently, trajectory representations like splines and autoregressive models are widely used to exploit the temporal regularity of deformation. In this article, we present the bilinear spatiotemporal basis as a model that simultaneously exploits spatial and temporal regularity while maintaining the ability to generalize well to new sequences. This factorization allows the use of analytical, predefined functions to represent temporal variation (e.g., B-Splines or the Discrete Cosine Transform) resulting in efficient model representation and estimation. The model can be interpreted as representing the data as a linear combination of spatiotemporal sequences consisting of shape modes oscillating over time at key frequencies. We apply the bilinear model to natural spatiotemporal phenomena, including face, body, and cloth motion data, and compare it in terms of compaction, generalization ability, predictive precision, and efficiency to existing models. We demonstrate the application of the model to a number of graphics tasks including labeling, gap-filling, denoising, and motion touch-up.
computer vision and pattern recognition | 2017
Tomas Simon; Hanbyul Joo; Iain A. Matthews; Yaser Sheikh
We present an approach that uses a multi-camera system to train fine-grained detectors for keypoints that are prone to occlusion, such as the joints of a hand. We call this procedure multiview bootstrapping: first, an initial keypoint detector is used to produce noisy labels in multiple views of the hand. The noisy detections are then triangulated in 3D using multiview geometry or marked as outliers. Finally, the reprojected triangulations are used as new labeled training data to improve the detector. We repeat this process, generating more labeled data in each iteration. We derive a result analytically relating the minimum number of views to achieve target true and false positive rates for a given detector. The method is used to train a hand keypoint detector for single images. The resulting keypoint detector runs in realtime on RGB images and has accuracy comparable to methods that use depth sensors. The single view detector, triangulated over multiple views, enables 3D markerless hand motion capture with complex object interactions.
european conference on computer vision | 2014
Tomas Simon; Jack Valmadre; Iain A. Matthews; Yaser Sheikh
Reconstructing 3D motion data is highly under-constrained due to several common sources of data loss during measurement, such as projection, occlusion, or miscorrespondence. We present a statistical model of 3D motion data, based on the Kronecker structure of the spatiotemporal covariance of natural motion, as a prior on 3D motion. This prior is expressed as a matrix normal distribution, composed of separable and compact row and column covariances. We relate the marginals of the distribution to the shape, trajectory, and shape-trajectory models of prior art. When the marginal shape distribution is not available from training data, we show how placing a hierarchical prior over shapes results in a convex MAP solution in terms of the trace-norm. The matrix normal distribution, fit to a single sequence, outperforms state-of-the-art methods at reconstructing 3D motion data in the presence of significant data loss, while providing covariance estimates of the imputed points.
affective computing and intelligent interaction | 2011
Fernando De la Torre; Tomas Simon; Zara Ambadar; Jeffrey F. Cohn
FACS (Facial Action Coding System) coding is the state of the art in manual measurement of facial actions. FACS coding, however, is labor intensive and difficult to standardize. A goal of automated FACS coding is to eliminate the need for manual coding and realize automatic recognition and analysis of facial actions. Success of this effort depends in part on access to reliably coded corpora; however, manual FACS coding remains expensive and slow. This paper proposes Fast-FACS, a computer vision aided system that improves speed and reliability of FACS coding. Three are the main novelties of the system: (1) to the best of our knowledge, this is the first paper to predict onsets and offsets from peaks, (2) use Active Appearance Models for computer assisted FACS coding, (3) learn an optimal metric to predict onsets and offsets from peaks. The system was tested in the RU-FACS database, which consists of natural facial behavior during a twoperson interview. Fast-FACS reduced manual coding time by nearly 50% and demonstrated strong concurrent validity with manual FACS coding.
international conference on computer vision | 2015
Paulo F. U. Gotardo; Tomas Simon; Yaser Sheikh; Iain A. Matthews
Photometric stereo (PS) is an established technique for high-detail reconstruction of 3D geometry and appearance. To correct for surface integration errors, PS is often combined with multiview stereo (MVS). With dynamic objects, PS reconstruction also faces the problem of computing optical flow (OF) for image alignment under rapid changes in illumination. Current PS methods typically compute optical flow and MVS as independent stages, each one with its own limitations and errors introduced by early regularization. In contrast, scene flow methods estimate geometry and motion, but lack the fine detail from PS. This paper proposes photogeometric scene flow (PGSF) for high-quality dynamic 3D reconstruction. PGSF performs PS, OF, and MVS simultaneously. It is based on two key observations: (i) while image alignment improves PS, PS allows for surfaces to be relit to improve alignment, (ii) PS provides surface gradients that render the smoothness term in MVS unnecessary, leading to truly data-driven, continuous depth estimates. This synergy is demonstrated in the quality of the resulting RGB appearance, 3D geometry, and 3D motion.
Computer Graphics Forum | 2018
Dan Andrei Calian; Jean-François Lalonde; Paulo F.U. Gotardo; Tomas Simon; Iain A. Matthews; Kenny Mitchell
Image‐based lighting has allowed the creation of photo‐realistic computer‐generated content. However, it requires the accurate capture of the illumination conditions, a task neither easy nor intuitive, especially to the average digital photography enthusiast. This paper presents an approach to directly estimate an HDR light probe from a single LDR photograph, shot outdoors with a consumer camera, without specialized calibration targets or equipment. Our insight is to use a persons face as an outdoor light probe. To estimate HDR light probes from LDR faces we use an inverse rendering approach which employs data‐driven priors to guide the estimation of realistic, HDR lighting. We build compact, realistic representations of outdoor lighting both parametrically and in a data‐driven way, by training a deep convolutional autoencoder on a large dataset of HDR sky environment maps. Our approach can recover high‐frequency, extremely high dynamic range lighting environments. For quantitative evaluation of lighting estimation accuracy and relighting accuracy, we also contribute a new database of face photographs with corresponding HDR light probes. We show that relighting objects with HDR light probes estimated by our method yields realistic results in a wide variety of settings.
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2017
Tomas Simon; Jack Valmadre; Iain A. Matthews; Yaser Sheikh
Recovering dynamic 3D structures from 2D image observations is highly under-constrained because of projection and missing data, motivating the use of strong priors to constrain shape deformation. In this paper, we empirically show that the spatiotemporal covariance of natural deformations is dominated by a Kronecker pattern. We demonstrate that this pattern arises as the limit of a spatiotemporal autoregressive process, and derive a Kronecker Markov Random Field as a prior distribution over dynamic structures. This distribution unifies shape and trajectory models of prior art and has the individual models as its marginals. The key assumption of the Kronecker MRF is that the spatiotemporal covariance is separable into the product of a temporal and a shape covariance, and can therefore be modeled using the matrix normal distribution. Analysis on motion capture data validates that this distribution is an accurate approximation with significantly fewer free parameters. Using the trace-norm, we present a convex method to estimate missing data from a single sequence when the marginal shape distribution is unknown. The Kronecker-Markov distribution, fit to a single sequence, outperforms state-of-the-art methods at inferring missing 3D data, and additionally provides covariance estimates of the uncertainty.