Chen Kong
Carnegie Mellon University
Publication
Featured research published by Chen Kong.
computer vision and pattern recognition | 2014
Chen Kong; Dahua Lin; Mohit Bansal; Raquel Urtasun; Sanja Fidler
In this paper we exploit natural sentential descriptions of RGB-D scenes in order to improve 3D semantic parsing. Importantly, in doing so, we reason about which particular object each noun/pronoun refers to in the image. This allows us to utilize visual information to disambiguate the so-called coreference resolution problem that arises in text. Towards this goal, we propose a structured prediction model that exploits potentials computed from text and RGB-D imagery to reason about the class of the 3D objects, the scene type, as well as to align the nouns/pronouns with the referred visual objects. We demonstrate the effectiveness of our approach on the challenging NYU-RGBD v2 dataset, which we enrich with natural language descriptions. We show that our approach significantly improves 3D detection and scene classification accuracy, and is able to reliably estimate the text-to-image alignment. Furthermore, by using textual and visual information, we are also able to successfully deal with coreference in text, improving upon the state-of-the-art Stanford coreference system [15].
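The core idea of combining textual and visual potentials to resolve coreference can be illustrated with a toy sketch. This is not the paper's joint structured prediction model; it is an independent per-noun argmax over a weighted sum of two hypothetical score matrices, just to show how a visual potential can break ties the text potential cannot:

```python
import numpy as np

def align_nouns_to_objects(text_pot, vis_pot, w=0.5):
    """Assign each noun the 3D object maximizing a weighted sum of a
    text potential and a visual potential.
    text_pot, vis_pot: (num_nouns, num_objects) score matrices."""
    score = w * text_pot + (1.0 - w) * vis_pot
    return score.argmax(axis=1)

# Toy example: 2 nouns ("chair", "it") vs. 2 detected objects.
text_pot = np.array([[0.9, 0.1],    # text alone resolves "chair"
                     [0.5, 0.5]])   # but is ambiguous for "it"
vis_pot = np.array([[0.8, 0.2],
                    [0.2, 0.8]])    # vision disambiguates "it"
alignment = align_nouns_to_objects(text_pot, vis_pot)
```

The actual model in the paper reasons jointly over object classes, scene type, and alignments rather than picking each noun's object independently.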
computer vision and pattern recognition | 2014
Dahua Lin; Sanja Fidler; Chen Kong; Raquel Urtasun
In this paper, we tackle the problem of retrieving videos using complex natural language queries. Towards this goal, we first parse the sentential descriptions into a semantic graph, which is then matched to visual concepts using a generalized bipartite matching algorithm. Our approach exploits object appearance, motion and spatial relations, and learns the importance of each term using structure prediction. We demonstrate the effectiveness of our approach on a new dataset designed for semantic search in the context of autonomous driving, which exhibits complex and highly dynamic scenes with many objects. We show that our approach is able to locate a major portion of the objects described in the query with high accuracy, and improve the relevance in video retrieval.
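The matching step above can be sketched with a standard (non-generalized) bipartite assignment. Assuming a precomputed score matrix between query terms and detected visual concepts (the learned, structured weights from the paper are not modeled here), the Hungarian algorithm picks the highest-scoring one-to-one matching:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def match_query_to_concepts(score):
    """Match each query term (row) to a visual concept (column)
    by maximizing the total matching score."""
    # linear_sum_assignment minimizes cost, so negate the scores
    rows, cols = linear_sum_assignment(-score)
    return list(zip(rows, cols)), score[rows, cols].sum()

# Toy example: 2 query terms vs. 3 detected concepts
scores = np.array([[0.9, 0.1, 0.2],
                   [0.3, 0.2, 0.8]])
pairs, total = match_query_to_concepts(scores)
```

The paper's generalized matching additionally accounts for spatial relations between matched concepts, which a plain assignment problem cannot express.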
computer vision and pattern recognition | 2017
Chen Kong; Chen-Hsuan Lin; Simon Lucey
We investigate the problem of estimating the dense 3D shape of an object, given a set of 2D landmarks and silhouette in a single image. An obvious prior to employ in such a problem is a dictionary of dense CAD models. Employing a sufficiently large dictionary of CAD models, however, is in general computationally infeasible. A common strategy in dictionary learning to encourage generalization is to allow for linear combinations of dictionary elements. This too, however, is problematic, as most CAD models cannot be readily placed in global dense correspondence. In this paper, we propose a two-step strategy. First, we employ orthogonal matching pursuit to rapidly choose the closest single CAD model in our dictionary to the projected image. Second, we employ a novel graph embedding based on local dense correspondence to allow for sparse linear combinations of CAD models. We validate our framework experimentally in both synthetic and real-world scenarios and demonstrate the superiority of our approach to both 3D mesh reconstruction and volumetric representation.
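The first step (selecting the single closest dictionary element) amounts to one iteration of orthogonal matching pursuit: pick the atom most correlated with the signal. A minimal sketch, with a toy dictionary standing in for flattened CAD-model projections:

```python
import numpy as np

def closest_atom(D, y):
    """One OMP iteration: return the index of the dictionary column
    most correlated with signal y, and its least-squares coefficient.
    D: (d, K) dictionary, y: (d,) signal."""
    Dn = D / np.linalg.norm(D, axis=0)   # unit-normalize columns
    idx = int(np.argmax(np.abs(Dn.T @ y)))
    coef = Dn[:, idx] @ y                # 1-atom least-squares fit
    return idx, coef

# Toy dictionary: 3 "flattened shapes" as columns of D
D = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0]])
y = np.array([0.1, 2.0])
idx, c = closest_atom(D, y)   # picks the column best aligned with y
```

Full OMP would subtract the selected atom's contribution and iterate; here only the single-best-atom step matters, matching the paper's use of it to pick one CAD model before the graph-embedding refinement.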
british machine vision conference | 2015
Dahua Lin; Sanja Fidler; Chen Kong; Raquel Urtasun
This paper proposes a novel framework for generating lingual descriptions of indoor scenes. This is an important problem, as an effective solution can enable many exciting real-world applications, such as human-robot interaction, image/video synopsis, and automatic caption generation. While substantial efforts have been made to tackle this problem, previous approaches focus primarily on generating a single sentence for each image, which is not sufficient for describing complex scenes. We attempt to go beyond this by generating coherent descriptions with multiple sentences. In particular, we are interested in generating multi-sentence descriptions of cluttered indoor scenes. Complex, multi-sentence output requires us to deal with challenging problems such as consistent co-referrals to visual entities across sentences. Furthermore, the sequence of sentences needs to be as natural as possible, mimicking how humans describe the scene. This is especially important, for example, in the context of social robotics to enable realistic communication. Towards this goal, we develop a framework with three major components: (1) a holistic visual parser based on [3] that couples the inference of objects, attributes, and relations to produce a semantic representation of a 3D scene (Fig. 1); (2) a generative grammar automatically learned from training text; and (3) a text generation algorithm that takes into account subtle dependencies across sentences, such as logical order, diversity, saliency of objects, and co-reference resolution.
international conference on 3d vision | 2016
Chen Kong; Rui Zhu; Hamed Kiani; Simon Lucey
Inferring the motion and shape of non-rigid objects from images has been widely explored by Non-Rigid Structure from Motion (NRSfM) algorithms. Despite their promising results, they often utilize additional constraints on the camera motion (e.g. temporal order) and the deformation of the object of interest, which are not always available in real-world scenarios. This limits the application of NRSfM to very few deformable objects (e.g. the human face and body). In this paper, we propose the concept of Structure from Category (SfC) to reconstruct the 3D structure of generic objects solely from images, with no shape or motion constraints (i.e. prior-less). Similar to NRSfM approaches, SfC involves two steps: (i) correspondence and (ii) inversion. Correspondence determines the location of key points across images of the same object category. Once established, the inverse problem of recovering the 3D structure from the 2D points is solved over an augmented sparse shape-space model. We validate our approach experimentally by reconstructing 3D structures from both synthetic and natural images, and demonstrate the superiority of our approach to state-of-the-art low-rank NRSfM approaches.
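The inversion step over a shape-space model can be sketched as a linear least-squares problem: given a known shape basis and a known orthographic camera, solve for the coefficients whose combined shape projects onto the observed 2D points. This is a simplification (the paper's augmented sparse model also recovers the camera and enforces sparsity), with all names and the toy data hypothetical:

```python
import numpy as np

def fit_shape_coefficients(B, W, R):
    """Solve for coefficients c such that the orthographic projection
    R @ (sum_k c_k * B[k]) best reproduces the observed 2D points W.
    B: (K, 3, P) shape basis, W: (2, P) 2D points, R: (2, 3) camera."""
    K = B.shape[0]
    # Each projected basis shape contributes one column of the system
    A = np.stack([(R @ B[k]).ravel() for k in range(K)], axis=1)
    c, *_ = np.linalg.lstsq(A, W.ravel(), rcond=None)
    return c

# Toy example: 2 basis shapes with 2 points each, axis-aligned camera
B = np.array([[[1., 0.], [0., 1.], [0., 0.]],
              [[0., 1.], [1., 0.], [0., 0.]]])
R = np.array([[1., 0., 0.],
              [0., 1., 0.]])
W = R @ (2.0 * B[0] + 0.5 * B[1])   # synthesize observations
c = fit_shape_coefficients(B, W, R)
```

With noise-free observations generated from the basis itself, the least-squares solve recovers the generating coefficients exactly; real NRSfM must additionally estimate R and regularize c.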
computer vision and pattern recognition | 2016
Chen Kong; Simon Lucey
national conference on artificial intelligence | 2018
Chen-Hsuan Lin; Chen Kong; Simon Lucey
international conference on 3d vision | 2017
Jhony K. Pontes; Chen Kong; Anders Eriksson; Clinton Fookes; Sridha Sridharan; Simon Lucey
arXiv: Computer Vision and Pattern Recognition | 2017
Jhony K. Pontes; Chen Kong; Sridha Sridharan; Simon Lucey; Anders Eriksson; Clinton Fookes
arXiv: Computer Vision and Pattern Recognition | 2015
Dahua Lin; Chen Kong; Sanja Fidler; Raquel Urtasun