
Publication


Featured research published by Jiajun Wu.


Computer Vision and Pattern Recognition | 2015

Deep multiple instance learning for image classification and auto-annotation

Jiajun Wu; Yinan Yu; Chang Huang; Kai Yu

The recent development in learning deep representations has demonstrated wide applications in traditional vision tasks like classification and detection. However, there has been little investigation into how to build a deep learning framework in a weakly supervised setting. In this paper, we model deep learning in a weakly supervised (multiple-instance learning, MIL) framework. In our setting, each image follows a dual multi-instance assumption, where its object proposals and possible text annotations can be regarded as two instance sets. We design effective systems that exploit the MIL property with deep learning strategies from the two ends, and we jointly learn the relationship between object and annotation proposals. Extensive experiments show that our weakly supervised deep learning framework not only achieves convincing performance on vision tasks including classification and image annotation, but also extracts reasonable region-keyword pairs with little supervision, both on widely used benchmarks like PASCAL VOC and MIT Indoor Scene 67 and on a dataset with image- and patch-level annotations.
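To make the multiple-instance setup concrete, here is a minimal PyTorch sketch of max-pooling MIL over precomputed proposal features; the feature dimension, scorer, and pooling choice are illustrative assumptions rather than the paper's exact architecture.

```python
import torch
import torch.nn as nn

class MILClassifier(nn.Module):
    """Scores every proposal in a bag and max-pools to an image-level score."""
    def __init__(self, feat_dim=4096, num_classes=20):
        super().__init__()
        self.scorer = nn.Linear(feat_dim, num_classes)  # shared across instances

    def forward(self, proposals):                  # (num_proposals, feat_dim)
        instance_scores = self.scorer(proposals)   # per-proposal class scores
        # MIL assumption: a bag (image) is positive for a class if at least
        # one of its instances (proposals) is, hence max over instances.
        bag_scores, _ = instance_scores.max(dim=0)
        return bag_scores                          # (num_classes,)

model = MILClassifier()
feats = torch.randn(50, 4096)                 # 50 proposal features, one image
labels = torch.zeros(20); labels[3] = 1.0     # image-level labels only
loss = nn.BCEWithLogitsLoss()(model(feats), labels)
loss.backward()                               # supervision never touches regions
```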


European Conference on Computer Vision | 2016

Single Image 3D Interpreter Network

Jiajun Wu; Tianfan Xue; Joseph J. Lim; Yuandong Tian; Joshua B. Tenenbaum; Antonio Torralba; William T. Freeman

Understanding 3D object structure from a single image is an important but difficult task in computer vision, mostly due to the lack of 3D object annotations in real images. Previous work tackles this problem by either solving an optimization task given 2D keypoint positions, or training on synthetic data with ground truth 3D information. In this work, we propose 3D INterpreter Network (3D-INN), an end-to-end framework which sequentially estimates 2D keypoint heatmaps and 3D object structure, trained on both real 2D-annotated images and synthetic 3D data. This is made possible mainly by two technical innovations. First, we propose a Projection Layer, which projects estimated 3D structure to 2D space, so that 3D-INN can be trained to predict 3D structural parameters supervised by 2D annotations on real images. Second, heatmaps of keypoints serve as an intermediate representation connecting real and synthetic data, enabling 3D-INN to benefit from the variation and abundance of synthetic 3D objects, without suffering from the difference between the statistics of real and synthesized images due to imperfect rendering. The network achieves state-of-the-art performance on both 2D keypoint estimation and 3D structure recovery. We also show that the recovered 3D information can be used in other vision applications, such as 3D rendering and image retrieval.
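The Projection Layer is what lets 2D annotations supervise 3D predictions. Here is a minimal sketch of the idea, assuming a weak-perspective camera; the paper's actual camera model and parametrization may differ.

```python
import torch

def project(points_3d, scale, rotation, translation):
    """Weak-perspective projection: rotate, drop depth, scale, and shift."""
    cam = points_3d @ rotation.T               # (N, 3) in camera frame
    return scale * cam[:, :2] + translation    # (N, 2) image coordinates

points_3d = torch.randn(15, 3, requires_grad=True)  # predicted 3D keypoints
pred_2d = project(points_3d, scale=torch.tensor(100.0),
                  rotation=torch.eye(3), translation=torch.zeros(2))
target_2d = torch.randn(15, 2)                      # 2D keypoint annotations
loss = ((pred_2d - target_2d) ** 2).mean()
loss.backward()   # the 2D loss back-propagates into the 3D structure
```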


European Conference on Computer Vision | 2016

Ambient Sound Provides Supervision for Visual Learning

Andrew Hale Owens; Jiajun Wu; Joshua H. McDermott; William T. Freeman; Antonio Torralba

The sound of crashing waves, the roar of fast-moving cars – sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds.
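As a sketch of this training signal, the following trains a small CNN to regress an audio summary from a frame. The network, the 128-dimensional sound statistics, and the regression loss are placeholders; the paper in fact clusters sound features and predicts cluster membership.

```python
import torch
import torch.nn as nn

cnn = nn.Sequential(
    nn.Conv2d(3, 32, 5, stride=2), nn.ReLU(),
    nn.Conv2d(32, 64, 5, stride=2), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(64, 128),                # predicted sound-statistics vector
)

frames = torch.randn(8, 3, 224, 224)   # batch of video frames
sound_stats = torch.randn(8, 128)      # audio summaries computed offline
loss = nn.MSELoss()(cnn(frames), sound_stats)
loss.backward()                        # no human labels anywhere
```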


Computer Vision and Pattern Recognition | 2017

Synthesizing 3D Shapes via Modeling Multi-view Depth Maps and Silhouettes with Deep Generative Networks

Amir Arsalan Soltani; Haibin Huang; Jiajun Wu; Tejas D. Kulkarni; Joshua B. Tenenbaum

We study the problem of learning generative models of 3D shapes. Voxels and 3D parts have been widely used as the underlying representations for building complex 3D shapes; however, voxel-based representations suffer from high memory requirements, and part-based models require a large collection of cached or richly parametrized parts. We take an alternative approach: learning a generative model over multi-view depth maps or their corresponding silhouettes, and using a deterministic rendering function to produce 3D shapes from these images. A multi-view representation of shapes enables generation of 3D models with fine details, as 2D depth maps and silhouettes can be modeled at a much higher resolution than 3D voxels. Moreover, our approach naturally brings the ability to recover the underlying 3D representation from depth maps of one or a few viewpoints. Experiments show that our framework can generate 3D shapes with variations and details. We also demonstrate that our model has out-of-sample generalization power for real-world tasks with occluded objects.
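The deterministic rendering function amounts to back-projecting each depth map with its camera and taking the union of the resulting points. A minimal NumPy sketch under assumed pinhole intrinsics and per-view poses (both placeholders):

```python
import numpy as np

def unproject(depth, fx, fy, cx, cy, cam_to_world):
    """Lift a depth map to a world-space point cloud via a pinhole camera."""
    h, w = depth.shape
    v, u = np.mgrid[0:h, 0:w]
    x = (u - cx) / fx * depth
    y = (v - cy) / fy * depth
    pts = np.stack([x, y, depth, np.ones_like(depth)], -1).reshape(-1, 4)
    pts = pts[depth.reshape(-1) > 0]           # keep valid depths only
    return (pts @ cam_to_world.T)[:, :3]

# Fuse four sampled depth maps (random stand-ins) into one 3D shape.
views = [np.random.rand(64, 64) for _ in range(4)]
poses = [np.eye(4) for _ in range(4)]          # per-view camera poses
shape = np.concatenate([unproject(d, 60.0, 60.0, 32.0, 32.0, T)
                        for d, T in zip(views, poses)])
```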


Computer Vision and Pattern Recognition | 2017

Neural Scene De-rendering

Jiajun Wu; Joshua B. Tenenbaum; Pushmeet Kohli

We study the problem of holistic scene understanding. We would like to obtain a compact, expressive, and interpretable representation of scenes that encodes information such as the number of objects and their categories, poses, positions, etc. Such a representation would allow us to reason about and even reconstruct or manipulate elements of the scene. Previous works have used encoder-decoder neural architectures to learn image representations; however, representations obtained in this way are typically uninterpretable, or only explain a single object in the scene. In this work, we propose a new approach to learning an interpretable distributed representation of scenes. Our approach employs a deterministic rendering function as the decoder, mapping a naturally structured and disentangled scene description, which we name scene XML, to an image. The encoder is thus forced to perform the inverse of the rendering operation (de-rendering), transforming an input image into the structured scene XML that the decoder used to produce the image. We use an object-proposal-based encoder that is trained by minimizing both the supervised prediction error and the unsupervised reconstruction error. Experiments demonstrate that our approach works well on scene de-rendering with two different graphics engines, and that our learned representation can be easily adapted to a wide range of applications, including image editing, inpainting, visual analogy-making, and image captioning.
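A minimal sketch of the two-term objective described above, with placeholder encoder, renderer, and scene representation; the actual system uses a full graphics engine and a structured scene XML.

```python
import torch
import torch.nn as nn

encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 16))

def render(scene):
    # Stand-in for the deterministic graphics engine: a fixed,
    # non-learned mapping from scene description back to an image.
    return scene.tanh().repeat(1, 768).view(-1, 3, 64, 64)

images = torch.randn(4, 3, 64, 64)
gt_scene = torch.randn(4, 16)          # structured scene annotations

pred_scene = encoder(images)
loss = (nn.MSELoss()(pred_scene, gt_scene)           # supervised prediction
        + nn.MSELoss()(render(pred_scene), images))  # unsupervised reconstruction
loss.backward()
```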


British Machine Vision Conference | 2016

Physics 101: Learning Physical Object Properties from Unlabeled Videos

Jiajun Wu; Joseph J. Lim; Hongyi Zhang; Joshua B. Tenenbaum; William T. Freeman

We study the problem of learning physical properties of objects from unlabeled videos. Humans can learn basic physical laws when they are very young, which suggests that such tasks may be important goals for computational vision systems. We consider various scenarios: objects sliding down an inclined surface and colliding; objects attached to a spring; objects falling onto various surfaces; etc. Many physical properties, such as mass, density, and coefficient of restitution, influence the outcomes of these scenarios, and our goal is to recover them automatically. We have collected 17,408 video clips containing 101 objects of various materials and appearances (shapes, colors, and sizes). Together, they form a dataset, named Physics 101, for studying object-centered physical properties. We propose an unsupervised representation learning model that explicitly encodes basic physical laws into its structure and uses them, together with automatically discovered observations from videos, as supervision. Experiments demonstrate that our model can learn physical properties of objects from video. We also illustrate how its generative nature enables solving other tasks, such as outcome prediction.
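As a worked example of the kind of physical law such a model can exploit: for a ball dropped from height h_drop that rebounds to h_rebound, energy arguments give the coefficient of restitution e = sqrt(h_rebound / h_drop), independent of gravity. The heights below are illustrative.

```python
import math

def restitution_from_heights(h_drop, h_rebound):
    # Impact speed ~ sqrt(2*g*h_drop), rebound speed ~ sqrt(2*g*h_rebound);
    # their ratio cancels g and yields the coefficient of restitution.
    return math.sqrt(h_rebound / h_drop)

print(restitution_from_heights(1.00, 0.64))   # -> 0.8
```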


International Journal of Computer Vision | 2018

Learning Sight from Sound: Ambient Sound Provides Supervision for Visual Learning

Andrew Owens; Jiajun Wu; Josh H. McDermott; William T. Freeman; Antonio Torralba

The sound of crashing waves, the roar of fast-moving cars—sound conveys important information about the objects in our surroundings. In this work, we show that ambient sounds can be used as a supervisory signal for learning visual models. To demonstrate this, we train a convolutional neural network to predict a statistical summary of the sound associated with a video frame. We show that, through this process, the network learns a representation that conveys information about objects and scenes. We evaluate this representation on several recognition tasks, finding that its performance is comparable to that of other state-of-the-art unsupervised learning methods. Finally, we show through visualizations that the network learns units that are selective to objects that are often associated with characteristic sounds. This paper extends an earlier conference paper, Owens et al. (in: European conference on computer vision, 2016b), with additional experiments and discussion.


User Interface Software and Technology | 2018

MoSculp: Interactive Visualization of Shape and Time

Xiuming Zhang; Tali Dekel; Tianfan Xue; Andrew Owens; Qiurui He; Jiajun Wu; Stefanie Mueller; William T. Freeman

We present a system that visualizes complex human motion via 3D motion sculptures, a representation that conveys the 3D structure swept by a human body as it moves through space. Our system computes a motion sculpture from an input video and then embeds it back into the scene in a 3D-aware fashion. The user may also explore the sculpture directly in 3D or physically print it. Our interactive interface allows users to customize the sculpture design, for example by selecting materials and lighting conditions. To provide this end-to-end workflow, we introduce an algorithm that estimates a human's 3D geometry over time from a set of 2D images, and we develop a 3D-aware image-based rendering approach that inserts the sculpture back into the original video. By automating the process, our system takes motion sculpture creation out of the realm of professional artists and makes it applicable to a wide range of existing video material. By conveying 3D information to users, motion sculptures reveal space-time motion information that is difficult to perceive with the naked eye, and they allow viewers to interpret how different parts of the object interact over time. We validate the effectiveness of motion sculptures with user studies, finding that our visualizations are more informative about motion than existing stroboscopic and space-time visualization methods.
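The core sculpture computation is a space-time sweep: the union of the subject's posed 3D surface across frames. A toy sketch with synthetic stand-in geometry (the actual system estimates per-frame human geometry from video):

```python
import numpy as np

frames = []
for t in range(30):
    surface = np.random.rand(500, 3) * 0.2    # stand-in body surface points
    surface[:, 0] += 0.05 * t                 # subject translating over time
    frames.append(surface)

# The motion sculpture: the volume swept by the body through space-time,
# represented here as the union of per-frame point sets.
sculpture = np.concatenate(frames)
```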


European Conference on Computer Vision | 2018

Learning Shape Priors for Single-View 3D Completion And Reconstruction

Jiajun Wu; Chengkai Zhang; Xiuming Zhang; Zhoutong Zhang; William T. Freeman; Joshua B. Tenenbaum

The problem of single-view 3D shape completion or reconstruction is challenging, because among the many possible shapes that explain an observation, most are implausible and do not correspond to natural objects. Recent research in the field has tackled this problem by exploiting the expressiveness of deep convolutional networks. In fact, there is another level of ambiguity that is often overlooked: among plausible shapes, there are still multiple shapes that fit the 2D image equally well; i.e., the ground-truth shape is non-deterministic given a single-view input. Existing fully supervised approaches fail to address this issue, and often produce blurry mean shapes with smooth surfaces but no fine details. In this paper, we propose ShapeHD, pushing the limit of single-view shape completion and reconstruction by integrating deep generative models with adversarially learned shape priors. The learned priors serve as a regularizer, penalizing the model only if its output is unrealistic, not if it deviates from the ground truth. Our design thus overcomes both levels of ambiguity mentioned above. Experiments demonstrate that ShapeHD outperforms the state of the art by a large margin in both shape completion and shape reconstruction on multiple real datasets.
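A minimal sketch of ShapeHD's two-term loss, with tiny placeholder networks: a reconstruction term against the (possibly ambiguous) ground truth plus a naturalness term from an adversarially trained shape critic that penalizes only unrealistic outputs. The architectures, resolution, and weighting are illustrative.

```python
import torch
import torch.nn as nn

V = 16                               # voxel resolution (kept small for the sketch)
completion_net = nn.Sequential(nn.Flatten(), nn.Linear(V**3, V**3))
shape_critic = nn.Sequential(nn.Flatten(), nn.Linear(V**3, 1))  # stands in for the pretrained prior

partial = torch.rand(2, 1, V, V, V)                # incomplete observations
gt = (torch.rand(2, 1, V, V, V) > 0.5).float()     # ground-truth occupancy

pred = completion_net(partial).view_as(gt).sigmoid()
recon_loss = nn.functional.binary_cross_entropy(pred, gt)
# Penalize implausibility, not deviation from the ground truth:
# a low critic score marks the output as an unrealistic shape.
naturalness_loss = -shape_critic(pred).mean()
loss = recon_loss + 0.1 * naturalness_loss
loss.backward()
```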


European Conference on Computer Vision | 2018

Physical Primitive Decomposition

Zhijian Liu; William T. Freeman; Joshua B. Tenenbaum; Jiajun Wu

Objects are made of parts, each with distinct geometry, physics, functionality, and affordances. Developing such a distributed, physical, interpretable representation of objects will help intelligent agents better explore and interact with the world. In this paper, we study physical primitive decomposition: understanding an object through its components, each with physical and geometric attributes. As annotated data for object parts and physics are rare, we propose a novel formulation that learns physical primitives by explaining both an object's appearance and its behavior in physical events. Our model performs well on block towers and tools in both synthetic and real scenarios; we also demonstrate that visual and physical observations often provide complementary signals. We further present ablation and behavioral studies to better understand our model and contrast it with human performance.
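A toy sketch of how the two cues constrain one decomposition: primitive sizes must add up to the observed geometry (appearance), while sizes times densities must add up to the observed mass (physics). All quantities are illustrative placeholders.

```python
import torch

# Hypothetical decomposition: two primitives with learnable volume and density.
sizes = torch.tensor([1.0, 0.5], requires_grad=True)
densities = torch.tensor([2.0, 2.0], requires_grad=True)

observed_volume = torch.tensor(1.4)   # from the object's appearance
observed_mass = torch.tensor(3.1)     # inferred from a physical event

appearance_loss = (sizes.sum() - observed_volume) ** 2
physics_loss = ((sizes * densities).sum() - observed_mass) ** 2
(appearance_loss + physics_loss).backward()   # complementary signals, one model
```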

Collaboration


Dive into Jiajun Wu's collaborations.

Top Co-Authors

Joshua B. Tenenbaum
Massachusetts Institute of Technology

William T. Freeman
Massachusetts Institute of Technology

Tianfan Xue
Massachusetts Institute of Technology

Antonio Torralba
Massachusetts Institute of Technology

Chengkai Zhang
Massachusetts Institute of Technology

Zhoutong Zhang
Massachusetts Institute of Technology

Joseph J. Lim
Massachusetts Institute of Technology