Hao Ye
Chinese Academy of Sciences
Publication
Featured research published by Hao Ye.
international conference on multimedia and expo | 2017
Li Wang; Hong Wang; Yingbin Zheng; Hao Ye; Xiangyang Xue
We perform fast vehicle detection from traffic surveillance cameras. We develop a novel deep learning framework, namely Evolving Boxes, that proposes and refines object boxes under different feature representations. Specifically, our framework is embedded with a light-weight proposal network to generate initial anchor boxes and to discard unlikely regions early; a fine-tuning network then produces detailed features for the remaining candidate boxes. Intriguingly, we show that by applying different feature fusion techniques, the initial boxes can be refined for both localization and recognition. We evaluate our network on the recent DETRAC benchmark and obtain a significant improvement over the state-of-the-art Faster R-CNN, by 9.5% mAP. Further, our network achieves a detection speed of 9–13 FPS on a moderate commercial GPU.
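As a rough illustration of the propose-then-refine idea described above, the following PyTorch sketch pairs a light-weight proposal head (objectness score plus rough box) with a deeper refinement head that fuses features. Layer widths, the fusion step, and the 0.5 discard threshold are assumptions for illustration, not the published Evolving Boxes configuration.

```python
# Hedged sketch of a two-stage "propose then refine" detector in PyTorch.
# Layer sizes, the fusion step, and the 0.5 threshold are illustrative
# assumptions, not the published Evolving Boxes configuration.
import torch
import torch.nn as nn

class ProposalNet(nn.Module):
    """Light-weight head: objectness score and a rough box per location."""
    def __init__(self, in_ch=3, feat_ch=32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.objectness = nn.Conv2d(feat_ch, 1, 1)  # keep/discard score
        self.box_delta = nn.Conv2d(feat_ch, 4, 1)   # rough box offsets

    def forward(self, x):
        f = self.backbone(x)
        return torch.sigmoid(self.objectness(f)), self.box_delta(f), f

class RefineNet(nn.Module):
    """Deeper head: fuses the proposal features with its own deeper features
    to refine localization and classify the surviving boxes."""
    def __init__(self, feat_ch=32, num_classes=2):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(feat_ch, feat_ch * 2, 3, padding=1), nn.ReLU())
        self.cls = nn.Conv2d(feat_ch * 3, num_classes, 1)  # on fused features
        self.reg = nn.Conv2d(feat_ch * 3, 4, 1)

    def forward(self, proposal_feat):
        deep = self.conv(proposal_feat)
        fused = torch.cat([proposal_feat, deep], dim=1)  # feature fusion
        return self.cls(fused), self.reg(fused)

if __name__ == "__main__":
    image = torch.randn(1, 3, 256, 256)
    pnet, rnet = ProposalNet(), RefineNet()
    score, rough_box, feat = pnet(image)
    keep = score > 0.5                      # early-discard unlikely regions
    cls_logits, refined_box = rnet(feat)
    print(int(keep.sum()), cls_logits.shape, refined_box.shape)
```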
acm multimedia | 2016
Hao Ye; Weiyuan Shao; Hong Wang; Jianqi Ma; Li Wang; Yingbin Zheng; Xiangyang Xue
In this paper, we introduce an active annotation and learning framework for the face recognition task. Starting with an initial label-deficient face image training set, we iteratively train a deep neural network and use this model to choose the examples for further manual annotation. We follow the active learning strategy and derive a Value of Information criterion to actively select candidate images for annotation. During these iterations, the deep neural network is incrementally updated. Experiments on the LFW benchmark and the MS-Celeb-1M challenge demonstrate the effectiveness of the proposed framework.
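The annotate-retrain-select loop can be sketched as below; the entropy-based acquisition score stands in for the paper's Value of Information criterion, whose exact form is not reproduced here, and the trainer, predictor, and oracle are placeholder callables.

```python
# Sketch of the iterative annotate/retrain/select loop. The entropy-based
# acquisition score below is a stand-in for the paper's Value of Information
# criterion; trainer, predictor and oracle are placeholder callables.
import numpy as np

def acquisition_score(probs):
    """Higher where the current model is least certain (entropy)."""
    return -np.sum(probs * np.log(probs + 1e-12), axis=1)

def active_learning_loop(train_model, predict_proba, labeled, unlabeled,
                         oracle, rounds=5, batch=64):
    model = None
    for _ in range(rounds):
        model = train_model(labeled)                 # incremental update
        probs = predict_proba(model, unlabeled)      # N x C class posteriors
        chosen = set(np.argsort(-acquisition_score(probs))[:batch].tolist())
        labeled += [(unlabeled[i], oracle(unlabeled[i])) for i in chosen]
        unlabeled = [x for i, x in enumerate(unlabeled) if i not in chosen]
    return model

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    pool = list(rng.normal(size=(500, 16)))          # unlabeled feature pool
    train = lambda labeled: None                     # placeholder trainer
    proba = lambda m, xs: rng.dirichlet(np.ones(5), size=len(xs))
    active_learning_loop(train, proba, [], pool, oracle=lambda x: 0, rounds=2)
```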
IEEE Signal Processing Letters | 2018
Yingbin Zheng; Hao Ye; Li Wang; Jian Pu
Effective visual representation plays an important role in scene classification systems. While many existing methods focus on generic descriptors extracted from the RGB color channels, we argue for the importance of the depth context, since scenes exhibit spatial variability and depth is an essential component in understanding their geometry. In this letter, we present a novel depth representation for RGB-D scene classification based on a specifically designed convolutional neural network (CNN). In contrast to previous deep models transferred from pretrained RGB CNN models, we train the network with multiviewpoint depth image augmentation to overcome the data scarcity problem. The proposed CNN framework contains dilated convolutions to expand the receptive field and a subsequent spatial pooling to aggregate multiscale contextual information. The combination of the contextual design and multiviewpoint depth images is important for obtaining a more compact representation, compared to directly using original depth images or off-the-shelf networks. Through extensive experiments on the SUN RGB-D dataset, we demonstrate that the representation outperforms recent state-of-the-art methods, and combining it with standard CNN-based RGB features leads to further improvements.
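A minimal sketch of a depth-channel CNN combining dilated convolutions with a multiscale spatial pooling head is given below; channel widths, dilation rates, and pooling grid sizes are illustrative assumptions rather than the configuration reported in the letter.

```python
# Sketch of a depth-channel CNN with dilated convolutions and multiscale
# spatial pooling. Channel widths, dilation rates and pooling grids are
# assumptions for illustration, not the configuration in the letter.
import torch
import torch.nn as nn

class DepthContextNet(nn.Module):
    def __init__(self, num_classes=19):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 3, padding=1), nn.ReLU(),
            # dilated convolutions expand the receptive field
            nn.Conv2d(32, 64, 3, padding=2, dilation=2), nn.ReLU(),
            nn.Conv2d(64, 64, 3, padding=4, dilation=4), nn.ReLU(),
        )
        # pooling at several grid sizes aggregates multiscale context
        self.pools = nn.ModuleList([nn.AdaptiveAvgPool2d(s) for s in (1, 2, 4)])
        self.classifier = nn.Linear(64 * (1 + 4 + 16), num_classes)

    def forward(self, depth):                        # depth: B x 1 x H x W
        f = self.features(depth)
        pooled = [p(f).flatten(1) for p in self.pools]
        return self.classifier(torch.cat(pooled, dim=1))

if __name__ == "__main__":
    print(DepthContextNet()(torch.randn(2, 1, 224, 224)).shape)  # (2, 19)
```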
pacific rim conference on multimedia | 2018
Zhao Zhou; Yingbin Zheng; Hao Ye; Jian Pu; Gufei Sun
Scene classification is a fundamental problem in understanding high-resolution remote sensing imagery. Recently, convolutional neural networks (ConvNets) have achieved remarkable performance in different tasks, and significant efforts have been made to develop various representations for satellite image scene classification. In this paper, we present a novel representation based on a ConvNet with context aggregation. The proposed two-pathway ResNet (ResNet-TP) architecture adopts ResNet [1] as the backbone, and the two pathways allow the network to model both local details and regional context. The ResNet-TP based representation is generated by global average pooling on the last convolutional layers from both pathways. Experiments on two scene classification datasets, UCM Land Use and NWPU-RESISC45, show that the proposed mechanism achieves promising improvements over state-of-the-art methods.
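The two-pathway idea can be sketched as applying a ResNet backbone to both the full-resolution image (local details) and a downsampled copy (regional context), then concatenating the globally average-pooled features; using a shared torchvision resnet18 here is an illustrative simplification, not the exact ResNet-TP design.

```python
# Sketch of a two-pathway representation: a shared ResNet backbone applied to
# the full-resolution image (local details) and a downsampled copy (regional
# context), with global average pooling on each pathway. torchvision's
# resnet18 is an illustrative stand-in for the backbone used in the paper.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torchvision.models import resnet18

class TwoPathwayRepresentation(nn.Module):
    def __init__(self):
        super().__init__()
        base = resnet18(weights=None)
        # keep everything up to and including the last convolutional stage
        self.backbone = nn.Sequential(*list(base.children())[:-2])

    def forward(self, x):
        local_feat = self.backbone(x)                         # detail pathway
        context_in = F.interpolate(x, scale_factor=0.5,
                                   mode="bilinear", align_corners=False)
        context_feat = self.backbone(context_in)              # context pathway
        local_vec = F.adaptive_avg_pool2d(local_feat, 1).flatten(1)
        context_vec = F.adaptive_avg_pool2d(context_feat, 1).flatten(1)
        return torch.cat([local_vec, context_vec], dim=1)

if __name__ == "__main__":
    rep = TwoPathwayRepresentation()(torch.randn(1, 3, 224, 224))
    print(rep.shape)  # (1, 1024): 512-d per pathway with resnet18
```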
international conference on multimedia retrieval | 2018
Baohan Xu; Hao Ye; Yingbin Zheng; Heng Wang; Tianyu Luwang; Yu-Gang Jiang
Video action recognition has been widely studied in recent years. Training deep neural networks requires a large amount of well-labeled videos; on the other hand, videos in the same class share high-level semantic similarity. In this paper, we introduce a novel neural network architecture to simultaneously capture local and long-term spatial-temporal information. We propose a dilated dense network whose blocks are composed of densely connected dilated convolution layers. The proposed framework fuses each layer's outputs to learn high-level representations, and the representations remain robust even with only a few training snippets. Aggregations of dilated dense blocks are also explored. We conduct extensive experiments on UCF101 and demonstrate the effectiveness of the proposed method, especially with few training examples.
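A minimal sketch of a densely connected block of dilated convolutions is shown below; the growth rate, number of layers, and use of 2D (rather than spatio-temporal) convolutions are simplifying assumptions for illustration.

```python
# Sketch of a densely connected block of dilated convolutions: every layer
# sees the concatenation of all earlier outputs, and increasing dilation
# rates widen the receptive field. Growth rate, layer count, and 2D
# convolutions are assumptions for illustration.
import torch
import torch.nn as nn

class DilatedDenseBlock(nn.Module):
    def __init__(self, in_ch=64, growth=32, dilations=(1, 2, 4)):
        super().__init__()
        self.layers = nn.ModuleList()
        ch = in_ch
        for d in dilations:
            self.layers.append(nn.Sequential(
                nn.Conv2d(ch, growth, 3, padding=d, dilation=d),
                nn.ReLU(inplace=True)))
            ch += growth                      # dense connectivity grows width

    def forward(self, x):
        feats = [x]
        for layer in self.layers:
            feats.append(layer(torch.cat(feats, dim=1)))
        return torch.cat(feats, dim=1)        # fuse every layer's output

if __name__ == "__main__":
    out = DilatedDenseBlock()(torch.randn(2, 64, 28, 28))
    print(out.shape)  # (2, 64 + 3*32, 28, 28)
```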
international conference on multimedia retrieval | 2018
Haonan Qiu; Yingbin Zheng; Hao Ye; Feng Wang; Liang He
Locating actions in long untrimmed videos has been a challenging problem in video content analysis. The performance of existing action localization approaches remains unsatisfactory in precisely determining the beginning and the end of an action. Imitating the human perception procedure of observation followed by refinement, we propose a novel three-phase action localization framework. Our framework is embedded with an Actionness Network that generates initial proposals through frame-wise similarity grouping, and a Refinement Network that conducts boundary adjustment on these proposals. Finally, the refined proposals are sent to a Localization Network for further fine-grained location regression. The whole process can be viewed as multi-stage refinement using a novel non-local pyramid feature at various temporal granularities. We evaluate our framework on the THUMOS14 benchmark and obtain a significant improvement over state-of-the-art approaches. In particular, the performance gain is remarkable for precise localization at high IoU thresholds: our framework achieves an mAP of 34.2% at IoU=0.5.
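The three-phase pipeline can be sketched as frame-wise actionness scoring, threshold-based grouping into initial proposals, and two successive refinement stages; the grouping rule is a simplified assumption, and the three networks are replaced by placeholder callables.

```python
# Sketch of the three-phase pipeline: frame-wise actionness grouping forms
# initial proposals, followed by two refinement stages. The grouping rule is
# a simplified assumption; the paper's networks are placeholder callables.
import numpy as np

def group_proposals(actionness, threshold=0.5, min_len=8):
    """Merge consecutive frames whose actionness exceeds a threshold."""
    proposals, start = [], None
    for t, a in enumerate(actionness):
        if a >= threshold and start is None:
            start = t
        elif a < threshold and start is not None:
            if t - start >= min_len:
                proposals.append((start, t))
            start = None
    if start is not None and len(actionness) - start >= min_len:
        proposals.append((start, len(actionness)))
    return proposals

def localize(frames, actionness_net, refine_net, localization_net):
    scores = actionness_net(frames)               # phase 1: per-frame scores
    proposals = group_proposals(scores)
    refined = [refine_net(frames, p) for p in proposals]       # phase 2
    return [localization_net(frames, p) for p in refined]      # phase 3

if __name__ == "__main__":
    frames = np.zeros((300, 2048))                # e.g. 300 frame features
    fake_scores = lambda f: (np.arange(len(f)) % 50 < 25).astype(float)
    identity = lambda f, p: p
    print(localize(frames, fake_scores, identity, identity))
```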
pacific rim conference on multimedia | 2017
Yingbin Zheng; Jian Pu; Hong Wang; Hao Ye
The depth cue is crucial for perceiving spatial layout and understanding cluttered indoor scenes. However, there has been little study of leveraging depth information within image scene classification systems, mainly because of the lack of depth labeling in existing monocular image datasets. In this paper, we introduce a framework that overcomes this limitation by incorporating a predicted depth descriptor of monocular images for indoor scene classification. The depth prediction model is first learned from an existing RGB-D dataset using a multiscale convolutional network. Given a monocular RGB image, a representation encoding the predicted depth cue is generated. The predicted depth descriptor can be further fused with features from the color channels. Experiments are performed on two indoor scene classification benchmarks, and the quantitative comparisons demonstrate the effectiveness of the proposed scheme.
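A minimal sketch of the fusion scheme, assuming a placeholder depth predictor and simple late concatenation of pooled RGB and predicted-depth features, is given below; it is not the multiscale network or descriptor used in the paper.

```python
# Sketch of fusing a descriptor computed from predicted depth with RGB CNN
# features by late concatenation. The depth predictor and the two encoders
# are placeholder modules, not the multiscale network used in the paper.
import torch
import torch.nn as nn

class PredictedDepthFusion(nn.Module):
    def __init__(self, rgb_dim=512, depth_dim=128, num_classes=10):
        super().__init__()
        self.depth_predictor = nn.Conv2d(3, 1, 3, padding=1)   # stand-in
        self.depth_encoder = nn.Sequential(
            nn.Conv2d(1, depth_dim, 3, stride=4, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.rgb_encoder = nn.Sequential(
            nn.Conv2d(3, rgb_dim, 3, stride=4, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten())
        self.classifier = nn.Linear(rgb_dim + depth_dim, num_classes)

    def forward(self, rgb):
        depth = self.depth_predictor(rgb)             # predicted depth map
        fused = torch.cat([self.rgb_encoder(rgb),
                           self.depth_encoder(depth)], dim=1)  # late fusion
        return self.classifier(fused)

if __name__ == "__main__":
    print(PredictedDepthFusion()(torch.randn(2, 3, 224, 224)).shape)
```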
advanced video and signal based surveillance | 2017
Siwei Lyu; Ming-Ching Chang; Dawei Du; Longyin Wen; Honggang Qi; Yuezun Li; Yi Wei; Lipeng Ke; Tao Hu; Marco Del Coco; Pierluigi Carcagnì; Dmitriy Anisimov; Erik Bochinski; Fabio Galasso; Filiz Bunyak; Guang Han; Hao Ye; Hong Wang; Kannappan Palaniappan; Koray Ozcan; Li Wang; Liang Wang; Martin Lauer; Nattachai Watcharapinchai; Nenghui Song; Noor M. Al-Shakarji; Shuo Wang; Sikandar Amin; Sitapa Rujikietgumjorn; Tatiana Khanova
The rapid advances of transportation infrastructure have led to a dramatic increase in the demand for smart systems capable of monitoring traffic and street safety. Fundamental to these applications are a community-based evaluation platform and benchmark for object detection and multi-object tracking. To this end, we organize the AVSS2017 Challenge on Advanced Traffic Monitoring, in conjunction with the International Workshop on Traffic and Street Surveillance for Safety and Security (IWT4S), to evaluate state-of-the-art object detection and multi-object tracking algorithms in the context of traffic surveillance. Submitted algorithms are evaluated using the large-scale UA-DETRAC benchmark and evaluation protocol. The benchmark, the evaluation toolkit and the algorithm performance are publicly available from the website http://detrac-db.rit.albany.edu.
Proceedings of the Workshop on Large-Scale Video Classification Challenge | 2017
Yao Peng; Hao Ye; Yining Lin; Yixin Bao; Zhijian Zhao; Haonan Qiu; Li Wang; Yingbin Zheng
Videos are dominant on the Internet. Current systems for processing large-scale video are suboptimal for two reasons: (1) machine learning modules such as feature extractors and classifiers generate huge amounts of intermediate data and place a heavy burden on storage and the network, and (2) task scheduling is explicit, so manually configuring the machine learning modules on the cluster is tedious and inefficient. In this work, we propose the Elastic Streaming Sequential data Processing system (ESSP), which supports automatic task scheduling so that multiple machine learning components are automatically parallelized. Further, our system avoids extensive disk I/O by applying an in-memory dataflow scheme. Evaluation on real-world video classification datasets shows many-fold improvements.
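The in-memory dataflow idea can be illustrated with a toy pipeline in which stages exchange items through bounded queues instead of writing intermediate results to disk, with each stage scheduled onto its own worker; this is a simplified illustration, not the ESSP implementation.

```python
# Toy sketch of an in-memory dataflow: chained processing stages exchange
# items through bounded queues (no intermediate files), and each stage is
# scheduled onto its own worker thread. Not the ESSP implementation.
import queue
import threading

def run_pipeline(source, stages, capacity=8):
    """Connect stages with bounded in-memory queues; one thread per stage."""
    qs = [queue.Queue(maxsize=capacity) for _ in range(len(stages) + 1)]
    sink, SENTINEL = [], object()

    def feed():
        for item in source:
            qs[0].put(item)
        qs[0].put(SENTINEL)

    def worker(i, fn):
        while True:
            item = qs[i].get()
            if item is SENTINEL:
                qs[i + 1].put(SENTINEL)
                return
            qs[i + 1].put(fn(item))

    def drain():
        while True:
            item = qs[-1].get()
            if item is SENTINEL:
                return
            sink.append(item)

    threads = [threading.Thread(target=feed)]
    threads += [threading.Thread(target=worker, args=(i, fn))
                for i, fn in enumerate(stages)]
    threads += [threading.Thread(target=drain)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sink

if __name__ == "__main__":
    extract = lambda clip: clip * 2         # placeholder "feature extractor"
    classify = lambda feat: feat % 3        # placeholder "classifier"
    print(run_pipeline(range(10), [extract, classify]))
```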
chinese conference on biometric recognition | 2016
Weiyuan Shao; Hong Wang; Yingbin Zheng; Hao Ye
This paper proposes a compact face representation for face recognition. The face and its landmark points are detected in the image and then used to generate transformed face regions. Different types of regions form the transformed face region datasets, on which face networks are trained. A novel forward model selection algorithm is designed to simultaneously select complementary face models and generate the compact representation. Employing a public dataset as the training set and fusing only six selected face networks, the recognition system with this compact face representation achieves 99.05% accuracy on the LFW benchmark.
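Greedy forward model selection can be sketched as below, assuming per-model validation features and a placeholder scoring function in place of the verification metric used in the paper.

```python
# Sketch of greedy forward model selection: starting from an empty set,
# repeatedly add the face network whose fused representation gives the
# biggest validation gain, up to a fixed budget. The scoring function is a
# placeholder for whatever verification metric is actually used.
import numpy as np

def forward_select(model_feats, evaluate, budget=6):
    """model_feats: dict name -> (N, d) validation features per model.
    evaluate: callable taking a concatenated (N, D) matrix -> score."""
    selected, best_score = [], -np.inf
    while len(selected) < budget:
        best_candidate = None
        for name in model_feats:
            if name in selected:
                continue
            feats = np.hstack([model_feats[m] for m in selected + [name]])
            score = evaluate(feats)
            if score > best_score:
                best_score, best_candidate = score, name
        if best_candidate is None:       # no remaining model improves the score
            break
        selected.append(best_candidate)
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = {f"net{i}": rng.normal(size=(100, 64)) for i in range(10)}
    print(forward_select(feats, evaluate=lambda f: f.var()))  # toy scorer
```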