Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Bingbing Ni is active.

Publication


Featured research published by Bingbing Ni.


International Conference on Computer Vision | 2011

RGBD-HuDaAct: A color-depth video database for human daily activity recognition

Bingbing Ni; Gang Wang; Pierre Moulin

In this paper, we present a home-monitoring-oriented human activity recognition benchmark database based on the combination of a color video camera and a depth sensor. Our contributions are two-fold: 1) We have created a publicly available human activity video database, named RGBD-HuDaAct, which contains synchronized color-depth video streams for the task of human daily activity recognition. This database aims at encouraging more research efforts on human activity recognition based on multi-modality sensor combination (e.g., color plus depth). 2) Two multi-modality fusion schemes, which naturally combine color and depth information, have been developed from two state-of-the-art feature representation methods for action recognition: spatio-temporal interest points (STIPs) and motion history images (MHIs). These depth-extended feature representations are evaluated comprehensively and demonstrate superior recognition performance over their uni-modality (color-only) counterparts.
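
The abstract does not include implementation details, but the MHI half of the fusion scheme is easy to picture: compute a motion history image independently on the grayscale and depth streams and concatenate the two into one clip descriptor. The sketch below is a minimal illustration of that idea; the decay constant, threshold, and random frames are assumptions, not the authors' code.

```python
import numpy as np

def motion_history_image(frames, tau=30, threshold=25):
    """Minimal MHI: recently moving pixels get high values and older
    motion decays linearly. `frames` is a (T, H, W) uint8 array."""
    T, H, W = frames.shape
    mhi = np.zeros((H, W), dtype=np.float32)
    for t in range(1, T):
        moving = np.abs(frames[t].astype(np.int16)
                        - frames[t - 1].astype(np.int16)) > threshold
        mhi[moving] = tau                               # refresh moving pixels
        mhi[~moving] = np.maximum(mhi[~moving] - 1, 0)  # decay elsewhere
    return mhi / tau                                    # normalize to [0, 1]

# Illustrative fusion: one MHI per modality, concatenated per clip.
rng = np.random.default_rng(0)
gray_frames  = rng.integers(0, 256, (60, 120, 160), dtype=np.uint8)
depth_frames = rng.integers(0, 256, (60, 120, 160), dtype=np.uint8)

fused = np.concatenate([motion_history_image(gray_frames).ravel(),
                        motion_history_image(depth_frames).ravel()])
print(fused.shape)  # one joint color+depth descriptor per clip
```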


ACM Computing Surveys | 2012

Assistive tagging: A survey of multimedia tagging with human-computer joint exploration

Meng Wang; Bingbing Ni; Xian-Sheng Hua; Tat-Seng Chua

Along with the explosive growth of multimedia data, automatic multimedia tagging has attracted great interest from various research communities, such as computer vision, multimedia, and information retrieval. However, despite the great progress achieved in the past two decades, automatic tagging technologies can still hardly achieve satisfactory performance on real-world multimedia data that vary widely in genre, quality, and content. Meanwhile, the power of human intelligence has been fully demonstrated in the Web 2.0 era: if well motivated, Internet users are able to tag a large amount of multimedia data. Therefore, a set of new techniques has been developed that combines humans and computers for more accurate and efficient multimedia tagging, such as batch tagging, active tagging, tag recommendation, and tag refinement. These techniques accomplish multimedia tagging by jointly exploiting humans and computers in different ways. This article refers to them collectively as assistive tagging and conducts a comprehensive survey of existing research efforts on this theme. We first introduce the status of automatic tagging and manual tagging and then explain why assistive tagging can be a good solution. We categorize existing assistive tagging techniques into three paradigms: (1) tagging with data selection and organization; (2) tag recommendation; and (3) tag processing. We introduce the research efforts on each paradigm and summarize their methodologies. We also discuss several future trends in this research direction.
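
As one concrete instance of the tag-recommendation paradigm surveyed here, a minimal co-occurrence recommender can be sketched: given the tags already assigned to an item, suggest the tags that most often co-occur with them in a corpus. The toy corpus and scoring rule below are illustrative assumptions, not a method from the survey.

```python
from collections import Counter
from itertools import combinations

# Toy corpus of tagged items (illustrative data).
corpus = [
    {"beach", "sunset", "sea"},
    {"beach", "sea", "surf"},
    {"sunset", "sky", "clouds"},
    {"beach", "sunset", "sky"},
]

# Count how often each ordered pair of tags appears together.
cooc = Counter()
for tags in corpus:
    for a, b in combinations(sorted(tags), 2):
        cooc[(a, b)] += 1
        cooc[(b, a)] += 1

def recommend(seed_tags, k=3):
    """Score candidate tags by total co-occurrence with the seed tags."""
    scores = Counter()
    for s in seed_tags:
        for (a, b), n in cooc.items():
            if a == s and b not in seed_tags:
                scores[b] += n
    return [t for t, _ in scores.most_common(k)]

print(recommend({"beach"}))  # e.g. ['sea', 'sunset', 'surf']
```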


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2016

HCP: A Flexible CNN Framework for Multi-Label Image Classification

Yunchao Wei; Wei Xia; Min Lin; Junshi Huang; Bingbing Ni; Jian Dong; Yao Zhao; Shuicheng Yan

Convolutional Neural Networks (CNNs) have demonstrated promising performance in single-label image classification tasks. However, how a CNN best copes with multi-label images remains an open problem, mainly due to complex underlying object layouts and insufficient multi-label training images. In this work, we propose a flexible deep CNN infrastructure, called Hypotheses-CNN-Pooling (HCP), where an arbitrary number of object segment hypotheses are taken as the inputs, a shared CNN is connected with each hypothesis, and finally the CNN outputs from the different hypotheses are aggregated with max pooling to produce the ultimate multi-label predictions. Some unique characteristics of this flexible deep CNN infrastructure include: 1) no ground-truth bounding box information is required for training; 2) the whole HCP infrastructure is robust to possibly noisy and/or redundant hypotheses; 3) the shared CNN is flexible and can be well pre-trained on a large-scale single-label image dataset, e.g., ImageNet; and 4) it naturally outputs multi-label prediction results. Experimental results on the Pascal VOC 2007 and VOC 2012 multi-label image datasets demonstrate the superiority of the proposed HCP infrastructure over other state-of-the-art methods. In particular, the mAP reaches 90.5% with HCP alone and 93.2% after fusion with our complementary result in [12] based on hand-crafted features on the VOC 2012 dataset.
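
The HCP aggregation step can be summarized in a few lines: every hypothesis is scored by the same shared network, and per-class scores are max-pooled across hypotheses, so any number of hypotheses yields one multi-label score vector. In the sketch below the shared CNN is mocked by a fixed linear map; the feature dimension and class count are illustrative assumptions.

```python
import numpy as np

NUM_CLASSES = 20          # e.g. Pascal VOC
FEAT_DIM = 4096           # illustrative feature dimension

rng = np.random.default_rng(0)
W = rng.normal(size=(FEAT_DIM, NUM_CLASSES))   # stand-in for the shared CNN

def shared_cnn(hypothesis_feat):
    """Stub for the shared CNN applied to one hypothesis (linear scores)."""
    return hypothesis_feat @ W

def hcp_predict(hypothesis_feats):
    """hypothesis_feats: (n_hypotheses, FEAT_DIM). Any number of hypotheses
    is allowed; cross-hypothesis max pooling yields one score per class."""
    scores = np.stack([shared_cnn(h) for h in hypothesis_feats])  # (n, C)
    return scores.max(axis=0)                                     # (C,)

feats = rng.normal(size=(7, FEAT_DIM))   # 7 object-segment hypotheses
print(hcp_predict(feats).shape)          # (20,) multi-label scores
```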


IEEE Transactions on Circuits and Systems for Video Technology | 2015

Crowded Scene Analysis: A Survey

Teng Li; Huan Chang; Meng Wang; Bingbing Ni; Shuicheng Yan

Automated scene analysis has been a topic of great interest in computer vision and cognitive science. Recently, with the growth of crowd phenomena in the real world, crowded scene analysis has attracted much attention. However, the visual occlusions and ambiguities in crowded scenes, as well as the complex behaviors and scene semantics, make the analysis a challenging task. In the past few years, an increasing number of works on crowded scene analysis have been reported, covering different aspects including crowd motion pattern learning, crowd behavior and activity analysis, and anomaly detection in crowds. This paper surveys the state-of-the-art techniques on this topic. We first provide the background knowledge and the available features related to crowded scenes. Then, existing models, popular algorithms, evaluation protocols, and system performance are reviewed for the different aspects of crowded scene analysis. We also outline the available datasets for performance evaluation. Finally, some open research problems and promising future directions are discussed.


Computer Vision and Pattern Recognition | 2011

Geometric ℓp-norm feature pooling for image classification

Jiashi Feng; Bingbing Ni; Qi Tian; Shuicheng Yan

Modern visual classification models generally include a feature pooling step, which aggregates local features over a region of interest into a statistic through a spatial pooling operation. Two commonly used operations are average and max pooling. However, recent theoretical analysis has indicated that neither of these two pooling techniques is guaranteed to be optimal. We further reveal in this work that more severe limitations of these two pooling methods stem from the unrecoverable loss of spatial information during the statistical summarization and from the underlying over-simplified assumption about the feature distribution. We aim to address these inherent issues and generalize previous pooling methods as follows. We define a weighted ℓp-norm spatial pooling function tailored to the class-specific spatial distribution of features, and we incorporate a sensible prior for the feature spatial correlation. Optimizing this pooling function for optimal class separability yields the geometric ℓp-norm pooling (GLP) method. GLP preserves the class-specific spatial/geometric information in the pooled features and significantly boosts the discriminating capability of the resulting features for image classification. Comprehensive evaluations on several image benchmarks demonstrate that the proposed GLP method can boost image classification performance with a single type of feature to outperform or be comparable with the state of the art.
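
The pooling function itself is compact: a weighted ℓp-norm of the local feature responses, f(x) = (Σᵢ wᵢ|xᵢ|^p)^(1/p), which recovers average pooling with uniform weights at p = 1 and approaches max pooling as p grows. A minimal sketch, with illustrative uniform weights in place of the class-specific weights the paper learns:

```python
import numpy as np

def lp_pool(features, weights, p):
    """Weighted l_p-norm pooling: (sum_i w_i * |x_i|**p) ** (1/p).
    `features`: (n,) local feature responses in a region,
    `weights`:  (n,) nonnegative spatial weights summing to 1."""
    return (np.sum(weights * np.abs(features) ** p)) ** (1.0 / p)

x = np.array([0.1, 0.9, 0.3, 0.5])
w = np.full(4, 0.25)                      # uniform weights, for illustration

print(lp_pool(x, w, 1))                   # equals average pooling
print(lp_pool(x, w, 100))                 # approaches max pooling as p grows
print(lp_pool(x, w, 2))                   # an intermediate statistic
```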


Computer Vision and Pattern Recognition | 2009

Recognizing human group activities with localized causalities

Bingbing Ni; Shuicheng Yan; Ashraf A. Kassim

The aim of this paper is to address the problem of recognizing human group activities in surveillance videos. This task has great practical potential; however, it has rarely been studied, due to the lack of a benchmark database and the difficulties caused by large intra-class variations. Our contributions are two-fold. First, we propose to encode group activities with three types of localized causalities, namely self-causality, pair-causality, and group-causality, which characterize the local interaction/reasoning relations within, between, and among the motion trajectories of different humans, respectively. Each type of causality is expressed as a specific digital filter, whose frequency responses then constitute the feature representation space; each video clip of a certain group activity is thus encoded as a bag of localized causalities/filters. Second, we collect a human group-activity video database, which involves six popular group activity categories with about 80 video clips each on average, captured in five different sessions with varying numbers of participants. Extensive experiments on this database, based on our proposed features and different classifiers, show promising results on this challenging task.
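
The representation can be sketched for the self-causality case: fit an autoregressive (digital) filter to one trajectory coordinate by least squares and take the magnitude of its frequency response as the feature. The filter order, number of frequency samples, and toy trajectory below are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def self_causality_feature(traj, order=4, n_freqs=16):
    """Fit an AR filter x[t] ~ sum_k a_k * x[t-k] by least squares and
    return the magnitude of its frequency response as the feature."""
    X = np.column_stack([traj[order - k - 1: len(traj) - k - 1]
                         for k in range(order)])   # lagged samples
    y = traj[order:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)      # AR coefficients
    # Frequency response of the estimated filter at n_freqs points.
    w = np.linspace(0, np.pi, n_freqs)
    z = np.exp(-1j * np.outer(w, np.arange(1, order + 1)))
    return np.abs(z @ a)

rng = np.random.default_rng(0)
traj_x = np.cumsum(rng.normal(size=200))     # toy x-coordinate over time
print(self_causality_feature(traj_x).shape)  # (16,) per-trajectory feature
```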


ACM Multimedia | 2010

Learning to photograph

Bin Cheng; Bingbing Ni; Shuicheng Yan; Qi Tian

In this paper, we propose an intelligent photography system which automatically and professionally generates/recommends user-favorite photo(s) from a wide view or a continuous view sequence. This task is quite challenging, given that the evaluation of photo quality is under-determined and usually subjective. Motivated by the recent prevalence of online media, we present a solution by mining the underlying knowledge and experience of photographers from massively crawled professional photos (about 100,000 images, highly ranked by users) from popular photo-sharing websites, e.g., Flickr.com. Generally, far contexts are critical in characterizing the composition rules of professional photos, and thus we present a method called omni-range context modeling to learn the spatial correlation distribution for concurrent patch/object pairs at arbitrary distances. The learned omni-range context priors then serve as rules to guide the composition of professional photos. When a wide view is fed into the system, these priors are utilized together with other cues (e.g., placements of faces at different poses, patch number, etc.) to form a posterior probability formulation for professional sub-view finding. Moreover, the system can function as an intelligent professional-view guide based on real-time view quality assessment and the embedded compass (for recording the capture direction). Beyond the salient areas targeted by most existing view recommendation algorithms, the proposed system targets professional photo composition. Qualitative experiments as well as comprehensive user studies demonstrate the validity and efficiency of the proposed omni-range context learning method and the automatic view finding framework.
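
A minimal sketch of how such omni-range priors could drive sub-view finding: for each pair of object categories, a learned distribution over their relative displacement scores candidate crops, and the crop with the highest total log-prior wins. The Gaussian prior, object positions, and crops below are illustrative stand-ins for the learned model.

```python
import numpy as np

def iso_gauss_logpdf(d, mean, var):
    """Log-density of a 2-D isotropic Gaussian (illustrative prior form)."""
    diff = np.asarray(d) - np.asarray(mean)
    return -np.dot(diff, diff) / (2 * var) - np.log(2 * np.pi * var)

# Illustrative pairwise prior: preferred relative displacement (dx, dy)
# between two object categories, as if learned from professional photos.
prior = {("face", "horizon"): ((0.0, -0.3), 0.02)}   # (mean, variance)

def score_crop(objects, crop):
    """Sum pairwise log-priors with positions normalized to the crop.
    `objects`: dict name -> (x, y); `crop`: (x0, y0, w, h)."""
    x0, y0, w, h = crop
    norm = {n: ((x - x0) / w, (y - y0) / h) for n, (x, y) in objects.items()}
    total = 0.0
    for (a, b), (mean, var) in prior.items():
        if a in norm and b in norm:
            total += iso_gauss_logpdf(np.subtract(norm[a], norm[b]),
                                      mean, var)
    return total

objs = {"face": (420.0, 180.0), "horizon": (400.0, 330.0)}
crops = [(300, 100, 400, 300), (350, 150, 300, 200)]
print(max(crops, key=lambda c: score_crop(objs, c)))  # best-composed crop
```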


IEEE Transactions on Systems, Man, and Cybernetics | 2013

Multilevel Depth and Image Fusion for Human Activity Detection

Bingbing Ni; Yong Pei; Pierre Moulin; Shuicheng Yan

Recognizing complex human activities usually requires detecting and modeling individual visual features and the interactions between them. Current methods rely only on visual features extracted from 2-D images and therefore often suffer from unreliable salient-feature detection and inaccurate modeling of the interaction context between individual features. In this paper, we show that these problems can be addressed by combining data from a conventional camera and a depth sensor (e.g., Microsoft Kinect). We propose a novel complex activity recognition and localization framework that effectively fuses information from both grayscale and depth image channels at multiple levels of the video processing pipeline. At the individual feature detection level, depth-based filters are applied to the detected human/object rectangles to remove false detections. At the next level, interaction modeling, 3-D spatial and temporal contexts among human subjects or objects are extracted by integrating information from both grayscale and depth images. Depth information is also utilized to distinguish different types of indoor scenes. Finally, a latent structural model is developed to integrate the information from multiple levels of video processing for activity detection. Extensive experiments on two activity recognition benchmarks (one with depth information) and a challenging grayscale + depth human activity database containing complex human-human, human-object, and human-surroundings interactions demonstrate the effectiveness of the proposed multilevel grayscale + depth fusion scheme. Higher recognition and localization accuracies are obtained relative to previous methods.
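
The first fusion level, depth-based filtering of 2-D detections, can be sketched simply: keep a detected rectangle only if the depth statistics inside it look plausible, e.g. few missing-depth pixels and a median depth within a sensible range. The thresholds and toy depth map below are illustrative assumptions:

```python
import numpy as np

def depth_filter(detections, depth_map,
                 min_depth=0.5, max_depth=6.0, max_missing=0.4):
    """Keep a detection (x, y, w, h) only if the depth inside the box looks
    plausible: enough valid pixels and a median depth in a sensible range."""
    kept = []
    for (x, y, w, h) in detections:
        patch = depth_map[y:y + h, x:x + w]
        valid = patch[patch > 0]                    # 0 = missing depth
        if valid.size < (1 - max_missing) * patch.size:
            continue                                # too many holes: reject
        if not (min_depth <= np.median(valid) <= max_depth):
            continue                                # implausible distance
        kept.append((x, y, w, h))
    return kept

rng = np.random.default_rng(0)
depth = rng.uniform(0.5, 5.0, size=(480, 640))      # toy depth map (meters)
dets = [(100, 100, 80, 160), (500, 300, 60, 120)]
print(depth_filter(dets, depth))   # false detections would be dropped here
```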


Computer Vision and Pattern Recognition | 2015

Motion Part Regularization: Improving action recognition via trajectory group selection

Bingbing Ni; Pierre Moulin; Xiaokang Yang; Shuicheng Yan

Dense local trajectories have been used successfully in action recognition. However, for most actions only a few local motion features (e.g., the critical movement of a hand, arm, or leg) are responsible for the action label. Therefore, highlighting the local features associated with important motion parts yields a more discriminative action representation. Inspired by recent advances in sentence regularization for text classification, we introduce a Motion Part Regularization framework to mine discriminative groups of dense trajectories that form important motion parts. First, motion part candidates are generated by spatio-temporal grouping of densely extracted trajectories. Second, an objective function that encourages sparse selection of these trajectory groups is formulated together with an action-class discriminative term. We then propose an alternating optimization algorithm to efficiently solve this objective by introducing a set of auxiliary variables corresponding to the discriminativeness weight of each motion part (trajectory group). The learned motion part weights are further utilized to form a discriminativeness-weighted Fisher vector representation of each action sample for final classification. The proposed framework achieves state-of-the-art performance on several action recognition benchmarks.
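
One standard way to realize the sparse group-selection term is a group-lasso penalty λ Σ_g ||w_g||₂ over per-group weight blocks, whose proximal operator zeroes out entire trajectory groups; this mirrors how unimportant motion parts get switched off. The grouping and λ below are illustrative, not the paper's exact objective:

```python
import numpy as np

def group_lasso_prox(w, groups, lam):
    """Proximal operator of lam * sum_g ||w_g||_2: block soft-thresholding.
    Groups whose norm falls below lam are zeroed out entirely."""
    out = w.copy()
    for g in groups:
        norm = np.linalg.norm(w[g])
        out[g] = 0.0 if norm <= lam else (1 - lam / norm) * w[g]
    return out

rng = np.random.default_rng(0)
w = rng.normal(size=12)                             # weights of 12 trajectories
groups = [slice(0, 4), slice(4, 8), slice(8, 12)]   # 3 motion-part groups
w_sparse = group_lasso_prox(w, groups, lam=2.0)
print([np.linalg.norm(w_sparse[g]) for g in groups])  # weak groups collapse to 0
```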


IEEE Transactions on Multimedia | 2011

Web Image and Video Mining Towards Universal and Robust Age Estimator

Bingbing Ni; Zheng Song; Shuicheng Yan

In this paper, we present an automatic web image and video mining framework with the ultimate goal of building a universal human age estimator based on facial information, applicable to all ethnic groups and various image qualities. On one hand, a large (391k) yet noisy human aging image database is collected from Flickr and Google Image using a set of human age-related text queries. Multiple human face detectors based on distinct techniques are adopted for noise-pruned face detection. For each image, the detected faces with high detection confidence constitute a bag of face instances. We further remove outliers via principal component analysis (PCA), which results in a condensed image database with about 175k face instances. A robust multi-instance regressor learning algorithm is then developed to learn a kernel regression-based human age estimator in the presence of bag label noise. On the other hand, about 10k video clips are downloaded from YouTube, from which we extract tracked face sequences. Although their age labels are unknown, the tracked faces within a sequence naturally share the same age. This age-consistency constraint on face pairs is used as an extra regularizer to enhance the robustness of the age estimator. The derived human age estimator is extensively evaluated on three benchmark human aging databases, and without taking any images from these benchmarks as training samples, it achieves age estimation accuracies comparable to state-of-the-art results.
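
The age-consistency idea admits a compact linear sketch: ridge regression on the labeled web faces plus a quadratic penalty μ(wᵀpᵢ - wᵀpⱼ)² for each unlabeled face pair (pᵢ, pⱼ) from one video track, solved in closed form. The paper's estimator is kernel-based and multi-instance; the linear version and the random data below are simplifying assumptions.

```python
import numpy as np

def fit_age_estimator(X, y, pairs, lam=1.0, mu=1.0):
    """Ridge regression with an age-consistency term: for each unlabeled
    face pair (p_i, p_j) from one track, penalize mu * (w.p_i - w.p_j)**2.
    Solves the closed-form normal equations
    (X'X + lam*I + mu * sum_d d d') w = X'y, with d = p_i - p_j."""
    d = X.shape[1]
    A = X.T @ X + lam * np.eye(d)
    for pi, pj in pairs:
        diff = pi - pj
        A += mu * np.outer(diff, diff)
    return np.linalg.solve(A, X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 16))            # labeled web faces (features)
y = rng.uniform(5, 70, size=100)          # noisy age labels
tracks = [(rng.normal(size=16), rng.normal(size=16)) for _ in range(30)]

w = fit_age_estimator(X, y, tracks)
print(X[0] @ w)                           # predicted age for one face
```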

Collaboration


Dive into Bingbing Ni's collaborations.

Top Co-Authors

Shuicheng Yan (National University of Singapore)
Xiaokang Yang (Shanghai Jiao Tong University)
Ashraf A. Kassim (National University of Singapore)
Meng Wang (Hefei University of Technology)
Yi Xu (Shanghai Jiao Tong University)
Yichao Yan (Shanghai Jiao Tong University)
Qi Tian (University of Texas at San Antonio)
Stefan Winkler (National University of Singapore)
Mengdi Xu (National University of Singapore)
Changzhi Luo (Hefei University of Technology)