Bolei Zhou
Massachusetts Institute of Technology
Publications
Featured research published by Bolei Zhou.
Computer Vision and Pattern Recognition | 2016
Bolei Zhou; Aditya Khosla; Àgata Lapedriza; Aude Oliva; Antonio Torralba
In this work, we revisit the global average pooling layer proposed in [13], and shed light on how it explicitly enables the convolutional neural network (CNN) to have remarkable localization ability despite being trained on image-level labels. While this technique was previously proposed as a means for regularizing training, we find that it actually builds a generic localizable deep representation that exposes the implicit attention of CNNs on an image. Despite the apparent simplicity of global average pooling, we are able to achieve 37.1% top-5 error for object localization on ILSVRC 2014 without training on any bounding box annotation. We demonstrate in a variety of experiments that our network is able to localize the discriminative image regions despite being trained only to solve a classification task.
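The weighted projection behind this localization ability is compact enough to sketch. Below is a minimal NumPy illustration of class activation mapping under global average pooling, using random arrays with hypothetical shapes; a real model would supply the final conv-layer activations and the classifier weights of the target class:

```python
import numpy as np

def class_activation_map(feature_maps, class_weights):
    """Project one class's classifier weights back onto the
    convolutional feature maps to localize discriminative regions.

    feature_maps : (C, H, W) activations of the last conv layer
    class_weights: (C,) weights connecting the global-average-pooled
                   features to the target class logit
    """
    # Weighted sum over channels yields an (H, W) heatmap.
    cam = np.tensordot(class_weights, feature_maps, axes=([0], [0]))
    cam -= cam.min()
    if cam.max() > 0:
        cam /= cam.max()  # normalize to [0, 1] for visualization
    return cam

# Toy example: 512 channels over a 14x14 spatial grid.
rng = np.random.default_rng(0)
fmaps = rng.random((512, 14, 14))
w_class = rng.random(512)
heatmap = class_activation_map(fmaps, w_class)
print(heatmap.shape)  # (14, 14); upsample to image size in practice
```

The key identity is that, under global average pooling, the class score is a channel-weighted average of the feature maps, so the same weights re-applied spatially expose where the evidence for that class lies.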
Computer Vision and Pattern Recognition | 2012
Bolei Zhou; Xiaogang Wang; Xiaoou Tang
In this paper, a new Mixture model of Dynamic pedestrian-Agents (MDA) is proposed to learn the collective behavior patterns of pedestrians in crowded scenes. Collective behaviors characterize the intrinsic dynamics of the crowd. Following agent-based modeling, each pedestrian in the crowd is driven by a dynamic pedestrian-agent, a linear dynamic system whose initial and termination states reflect the pedestrian's belief about the starting point and the destination. The whole crowd is then modeled as a mixture of dynamic pedestrian-agents. Once the model is learned from real data in an unsupervised manner, MDA can simulate crowd behaviors. Furthermore, MDA can infer the past behaviors and predict the future behaviors of pedestrians given only partially observed trajectories, and classify different pedestrian behaviors in the scene. The effectiveness of MDA and its applications are demonstrated by qualitative and quantitative experiments on a video surveillance dataset collected from the New York Grand Central Station.
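As a rough sketch of the dynamic pedestrian-agent abstraction, the toy rollout below treats one agent as a linear dynamic system drawn toward its believed destination. The transition matrix, goal-pull term, and noise level are illustrative assumptions; the actual model learns a mixture of such agents from data.

```python
import numpy as np

def simulate_agent(A, x0, x_goal, steps=50, noise_std=0.05, seed=0):
    """Roll out one dynamic pedestrian-agent as a linear dynamic system.

    A      : (2, 2) state-transition matrix of the agent's dynamics
    x0     : (2,) believed starting point
    x_goal : (2,) believed destination; the rollout stops when reached
    """
    rng = np.random.default_rng(seed)
    x = np.array(x0, dtype=float)
    trajectory = [x.copy()]
    for _ in range(steps):
        # Linear dynamics plus a pull toward the termination state.
        x = A @ x + 0.1 * (x_goal - x) + rng.normal(0, noise_std, 2)
        trajectory.append(x.copy())
        if np.linalg.norm(x - x_goal) < 0.1:  # destination reached
            break
    return np.array(trajectory)

path = simulate_agent(np.eye(2), x0=[0.0, 0.0], x_goal=[5.0, 3.0])
print(path.shape)  # (T, 2) simulated positions
```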
Computer Vision and Pattern Recognition | 2011
Bolei Zhou; Xiaogang Wang; Xiaoou Tang
In this paper, a Random Field Topic (RFT) model is proposed for semantic region analysis from motions of objects in crowded scenes. Different from existing approaches that learn semantic regions either from optical flows or from complete trajectories, our model assumes that fragments of trajectories (called tracklets) are observed in crowded scenes. It advances the existing Latent Dirichlet Allocation topic model by integrating Markov random fields (MRFs) as priors to enforce spatial and temporal coherence between tracklets during the learning process. Two kinds of MRF, the pairwise MRF and the forest of randomly spanning trees, are defined. Another contribution of this model is to include sources and sinks as high-level semantic priors, which effectively improves the learning of semantic regions and the clustering of tracklets. Experiments on a large-scale dataset of 40,000+ tracklets collected from the crowded New York Grand Central Station show that our model outperforms state-of-the-art methods both on qualitative results of learning semantic regions and on quantitative results of clustering tracklets.
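The coherence prior is the part most amenable to a small sketch. The following is a simplified stand-in (not the paper's exact formulation) that builds pairwise edge weights between tracklets from spatial and temporal proximity, the kind of cue the pairwise MRF encodes; the field names and parameters are hypothetical.

```python
import numpy as np

def tracklet_affinity(tracklets, sigma_s=10.0, max_gap=20):
    """Pairwise edge weights between tracklets: tracklet j is linked to
    tracklet i when j starts shortly after i ends and nearby in space.

    tracklets: list of dicts with 'start_xy', 'end_xy', 'start_t', 'end_t'
    """
    n = len(tracklets)
    W = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            gap = tracklets[j]['start_t'] - tracklets[i]['end_t']
            if 0 < gap <= max_gap:  # j may continue i in time
                d = np.linalg.norm(np.subtract(tracklets[j]['start_xy'],
                                               tracklets[i]['end_xy']))
                W[i, j] = np.exp(-d ** 2 / sigma_s ** 2)
    return W

a = {'start_xy': (0, 0), 'end_xy': (5, 5), 'start_t': 0, 'end_t': 10}
b = {'start_xy': (6, 5), 'end_xy': (9, 9), 'start_t': 14, 'end_t': 25}
print(tracklet_affinity([a, b]).round(3))  # strong edge from a to b
```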
IEEE Transactions on Pattern Analysis and Machine Intelligence | 2018
Bolei Zhou; Àgata Lapedriza; Aditya Khosla; Aude Oliva; Antonio Torralba
The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines that significantly outperform previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high coverage and high diversity of exemplars, the Places Database along with the Places-CNNs offers a novel resource to guide future progress on scene recognition problems.
Computer Vision and Pattern Recognition | 2017
David Bau; Bolei Zhou; Aditya Khosla; Aude Oliva; Antonio Torralba
We propose a general framework called Network Dissection for quantifying the interpretability of latent representations of CNNs by evaluating the alignment between individual hidden units and a set of semantic concepts. Given any CNN model, the proposed method draws on a data set of concepts to score the semantics of hidden units at each intermediate convolutional layer. The units with semantics are labeled across a broad range of visual concepts including objects, parts, scenes, textures, materials, and colors. We use the proposed method to test the hypothesis that interpretability is an axis-independent property of the representation space, then we apply the method to compare the latent representations of various networks when trained to solve different classification problems. We further analyze the effect of training iterations, compare networks trained with different initializations, and measure the effect of dropout and batch normalization on the interpretability of deep visual representations. We demonstrate that the proposed method can shed light on characteristics of CNN models and training methods that go beyond measurements of their discriminative power.
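The per-unit scoring step can be sketched directly: threshold a unit's upsampled activation maps at a fixed top quantile and compute IoU against a concept's segmentation masks. The sketch below uses random arrays in place of the concept dataset; the 0.995 quantile mirrors the paper's top-activation threshold, and the full method labels a unit with the concept whose IoU exceeds a cutoff.

```python
import numpy as np

def unit_concept_iou(activations, concept_mask, quantile=0.995):
    """Score one hidden unit against one concept, Network Dissection
    style: binarize the unit's activations at a dataset-wide top
    quantile, then measure IoU with the concept's ground-truth masks.

    activations  : (N, H, W) upsampled activation maps over N images
    concept_mask : (N, H, W) boolean segmentation masks for the concept
    """
    threshold = np.quantile(activations, quantile)  # dataset-wide cutoff
    unit_mask = activations > threshold
    intersection = np.logical_and(unit_mask, concept_mask).sum()
    union = np.logical_or(unit_mask, concept_mask).sum()
    return intersection / union if union else 0.0

rng = np.random.default_rng(0)
acts = rng.random((8, 56, 56))           # stand-in for real activations
masks = rng.random((8, 56, 56)) > 0.9    # stand-in for concept masks
print(round(unit_concept_iou(acts, masks), 4))
```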
European Conference on Computer Vision | 2012
Bolei Zhou; Xiaoou Tang; Xiaogang Wang
Coherent motions, which describe the collective movements of individuals in a crowd, widely exist in physical and biological systems. Understanding their underlying priors and detecting various coherent motion patterns from background clutter have both scientific value and a wide range of practical applications, especially for crowd motion analysis. In this paper, we propose and study a prior of coherent motion called Coherent Neighbor Invariance, which characterizes the local spatiotemporal relationships of individuals in coherent motion. Based on coherent neighbor invariance, we propose Coherent Filtering, a general technique for detecting coherent motion patterns from noisy time-series data. It can be effectively applied to data with different distributions at different scales in various real-world problems, where the environments may be sparse or extremely crowded with heavy noise. Experimental evaluation and comparison on synthetic and real data show the existence of Coherent Neighbor Invariance and the effectiveness of our Coherent Filtering.
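A minimal sketch of the invariance cue, assuming 2-D point positions at two consecutive frames: a point's coherent neighbors are those that remain among its K nearest neighbors across frames. The full algorithm additionally checks velocity correlation among the invariant neighbors and links the surviving connected components over time.

```python
import numpy as np

def invariant_neighbors(pos_t, pos_t1, K=5):
    """For each point, return the neighbors that stay among its K
    nearest neighbors from one frame to the next (the neighborhood
    invariance behind Coherent Filtering).

    pos_t, pos_t1 : (N, 2) point positions at consecutive frames
    """
    def knn(pos):
        d = np.linalg.norm(pos[:, None] - pos[None, :], axis=-1)
        np.fill_diagonal(d, np.inf)  # a point is not its own neighbor
        return np.argsort(d, axis=1)[:, :K]

    nbr_t, nbr_t1 = knn(pos_t), knn(pos_t1)
    return [set(a) & set(b) for a, b in zip(nbr_t, nbr_t1)]

rng = np.random.default_rng(0)
p0 = rng.random((30, 2))
p1 = p0 + 0.01 * rng.standard_normal((30, 2))  # small coherent drift
print(invariant_neighbors(p0, p1)[0])  # indices surviving both frames
```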
Journal of Vision | 2017
Bolei Zhou; Àgata Lapedriza; Antonio Torralba; Aude Oliva
The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification at tasks such as object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories and attributes, comprising a quasi-exhaustive list of the types of environments encountered in the world. Using state-of-the-art Convolutional Neural Networks, we provide impressive baseline performances at scene classification. With its high coverage and high diversity of exemplars, the Places Database offers an ecosystem to guide future progress on currently intractable visual recognition problems.
Computer Vision and Pattern Recognition | 2017
Bolei Zhou; Hang Zhao; Xavier Puig; Sanja Fidler; Adela Barriuso; Antonio Torralba
Scene parsing, or recognizing and segmenting objects and stuff in an image, is one of the key problems in computer vision. Despite the community's efforts in data collection, there are still few image datasets covering a wide range of scenes and object categories with dense and detailed annotations for scene parsing. In this paper, we introduce and analyze the ADE20K dataset, spanning diverse annotations of scenes, objects, parts of objects, and in some cases even parts of parts. A scene parsing benchmark is built upon ADE20K with 150 object and stuff classes included. Several segmentation baseline models are evaluated on the benchmark. A novel network design called the Cascade Segmentation Module is proposed to parse a scene into stuff, objects, and object parts in a cascade, improving over the baselines. We further show that the trained scene parsing networks enable applications such as image content removal and scene synthesis.
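On the benchmark side, the standard scene parsing metric is mean IoU over the 150 classes. A self-contained sketch, assuming the common ADE20K label convention of 0 for unlabeled pixels and 1-150 for the object and stuff classes:

```python
import numpy as np

def mean_iou(pred, gt, num_classes=150, ignore_label=0):
    """Mean IoU over the 150 object and stuff classes of the ADE20K
    scene parsing benchmark (labels assumed 1..150, 0 = unlabeled).

    pred, gt : integer label maps of the same shape
    """
    ious = []
    for c in range(1, num_classes + 1):
        p, g = pred == c, gt == c
        union = np.logical_or(p, g).sum()
        if union == 0:
            continue  # class absent from both maps; skip it
        ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

rng = np.random.default_rng(0)
gt = rng.integers(0, 151, (64, 64))
pred = gt.copy()
pred[::4] = rng.integers(0, 151, (16, 64))  # corrupt every 4th row
print(round(mean_iou(pred, gt), 3))
```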
Computer Vision and Pattern Recognition | 2013
Bolei Zhou; Xiaoou Tang; Xiaogang Wang
Collective motions are common in crowd systems and have attracted a great deal of attention in a variety of multidisciplinary fields. Collectiveness, which indicates the degree to which individuals act as a union in collective motion, is a fundamental and universal measurement for various crowd systems. By integrating path similarities among crowds on the collective manifold, this paper proposes a descriptor of collectiveness and an efficient computation of it for the crowd and its constituent individuals. The Collective Merging algorithm is then proposed to detect collective motions among random motions. We validate the effectiveness and robustness of the proposed collectiveness descriptor on the system of self-driven particles. We then compare the collectiveness descriptor to human perception of collective motion and show high consistency. Our experiments on detecting collective motions and measuring collectiveness in videos of pedestrian crowds and bacterial colonies demonstrate the wide range of applications of the collectiveness descriptor.
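The path-integration idea admits a compact sketch: build a weighted adjacency matrix of velocity correlations between K-nearest neighbors, then sum similarities over paths of all lengths via the regularized Neumann series Z = (I - zA)^-1 - I. The neighborhood size, the rescaling of z against the spectral radius, and the omission of the paper's final normalization are illustrative choices here.

```python
import numpy as np

def collectiveness(velocities, positions, K=5, z=0.5):
    """Crowd collectiveness from integrated path similarities.

    velocities, positions : (N, 2) arrays for one frame
    Returns the mean collectiveness and per-individual scores.
    """
    N = len(positions)
    d = np.linalg.norm(positions[:, None] - positions[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)
    nbrs = np.argsort(d, axis=1)[:, :K]

    v = velocities / (np.linalg.norm(velocities, axis=1, keepdims=True) + 1e-9)
    A = np.zeros((N, N))
    for i in range(N):
        for j in nbrs[i]:
            A[i, j] = max(v[i] @ v[j], 0.0)  # heading correlation

    # Keep the Neumann series convergent: scale z below 1 / spectral radius.
    scale = z / max(np.abs(np.linalg.eigvals(A)).max(), 1e-9)
    Z = np.linalg.inv(np.eye(N) - scale * A) - np.eye(N)
    phi = Z.sum(axis=1)  # per-individual collectiveness
    return phi.mean(), phi

rng = np.random.default_rng(0)
pos = rng.random((40, 2))
vel = np.tile([1.0, 0.0], (40, 1)) + 0.05 * rng.standard_normal((40, 2))
print(round(collectiveness(vel, pos)[0], 3))  # high for aligned motion
```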
European Conference on Computer Vision | 2014
Bolei Zhou; Liu Liu; Aude Oliva; Antonio Torralba
After hundreds of years of human settlement, each city has formed a distinct identity that distinguishes it from other cities. In this work, we propose to characterize the identity of a city via an attribute analysis of 2 million geo-tagged images from 21 cities across 3 continents. First, we estimate the scene attributes of these images and use this representation to build a higher-level set of 7 city attributes, tailored to the form and function of cities. Then, we conduct city identity recognition experiments on the geo-tagged images and identify images with salient city identity for each city attribute. Based on the misclassification rate of city identity recognition, we analyze the visual similarity among different cities. Finally, we discuss potential applications of computer vision to urban planning.