Weilin Huang
Chinese Academy of Sciences
Publication
Featured research published by Weilin Huang.
European Conference on Computer Vision | 2014
Weilin Huang; Yu Qiao; Xiaoou Tang
Maximally Stable Extremal Regions (MSERs) have achieved great success in scene text detection. However, this low-level pixel operation inherently limits its capability for handling complex text information efficiently (e.g., connections between text or background components), leading to difficulty in distinguishing texts from background components. In this paper, we propose a novel framework to tackle this problem by leveraging the high capability of convolutional neural networks (CNNs). In contrast to recent methods using a set of low-level heuristic features, the CNN is capable of learning high-level features to robustly identify text components from text-like outliers (e.g., bikes, windows, or leaves). Our approach takes advantage of both MSERs and sliding-window based methods: the MSER operator dramatically reduces the number of windows scanned and enhances detection of low-quality texts, while the sliding window with the CNN is applied to correctly separate the connections of multiple characters in components. The proposed system achieves strong robustness against a number of extreme text variations and serious real-world problems. It was evaluated on the ICDAR 2011 benchmark dataset and achieved an F-measure of over 78%, significantly higher than previous methods.
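The two-stage idea (MSER candidate extraction followed by CNN-based text/non-text filtering) can be sketched as below. This is a minimal illustration, not the authors' implementation: the tiny CNN architecture, the 32x32 patch size, and the 0.5 score threshold are assumptions, and OpenCV's MSER is used as a stand-in for the paper's MSER operator.

```python
# Sketch only: MSER proposals + CNN filtering; architecture and thresholds are placeholders.
import cv2
import torch
import torch.nn as nn

class TextComponentCNN(nn.Module):
    """Tiny binary text/non-text classifier (placeholder architecture)."""
    def __init__(self):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 32, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 5, padding=2), nn.ReLU(), nn.MaxPool2d(2),
        )
        self.classifier = nn.Linear(64 * 8 * 8, 2)  # assumes 32x32 input patches

    def forward(self, x):
        return self.classifier(self.features(x).flatten(1))

def detect_text_components(gray, cnn, patch_size=32, score_thresh=0.5):
    """Extract MSER candidate boxes, then keep those the CNN scores as text."""
    mser = cv2.MSER_create()
    _, boxes = mser.detectRegions(gray)              # candidate component boxes (x, y, w, h)
    kept = []
    for (x, y, w, h) in boxes:
        patch = cv2.resize(gray[y:y + h, x:x + w], (patch_size, patch_size))
        inp = torch.from_numpy(patch).float().div(255).view(1, 1, patch_size, patch_size)
        with torch.no_grad():
            prob_text = torch.softmax(cnn(inp), dim=1)[0, 1].item()
        if prob_text > score_thresh:
            kept.append((x, y, w, h, prob_text))
    return kept
```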
European Conference on Computer Vision | 2016
Zhi Tian; Weilin Huang; Tong He; Pan He; Yu Qiao
We propose a novel Connectionist Text Proposal Network (CTPN) that accurately localizes text lines in natural images. The CTPN detects a text line as a sequence of fine-scale text proposals directly in convolutional feature maps. We develop a vertical anchor mechanism that jointly predicts the location and text/non-text score of each fixed-width proposal, considerably improving localization accuracy. The sequential proposals are naturally connected by a recurrent neural network, which is seamlessly incorporated into the convolutional network, resulting in an end-to-end trainable model. This allows the CTPN to explore rich context information of the image, making it powerful for detecting extremely ambiguous text. The CTPN works reliably on multi-scale and multi-language text without further post-processing, departing from previous bottom-up methods that require multi-step post-filtering. It achieves 0.88 and 0.61 F-measure on the ICDAR 2013 and 2015 benchmarks, surpassing recent results [8, 35] by a large margin. The CTPN is computationally efficient at 0.14 s/image using the very deep VGG16 model [27]. An online demo is available at http://textdet.com/.
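A minimal sketch of a CTPN-style detection head is given below: a 3x3 convolution slides over the VGG16 feature maps, a recurrent layer connects proposals along each feature row, and 1x1 convolutions predict a text/non-text score and vertical coordinates for each fixed-width anchor. The GRU (standing in for the paper's bidirectional LSTM), the hidden sizes, and the number of anchors are assumptions for illustration, and the side-refinement output is omitted.

```python
# Sketch of a CTPN-style head; not the exact published model.
import torch
import torch.nn as nn

class CTPNHead(nn.Module):
    def __init__(self, in_channels=512, hidden=128, num_anchors=10):
        super().__init__()
        self.rpn_conv = nn.Conv2d(in_channels, 512, 3, padding=1)   # 3x3 sliding window
        self.rnn = nn.GRU(512, hidden, bidirectional=True, batch_first=True)
        self.fc = nn.Linear(2 * hidden, 512)
        self.score = nn.Conv2d(512, 2 * num_anchors, 1)    # text/non-text per anchor
        self.vcoord = nn.Conv2d(512, 2 * num_anchors, 1)   # vertical centre + height offsets

    def forward(self, feat):                      # feat: (B, C, H, W) conv feature maps
        x = torch.relu(self.rpn_conv(feat))
        b, c, h, w = x.shape
        seq = x.permute(0, 2, 3, 1).reshape(b * h, w, c)   # one sequence per feature row
        seq, _ = self.rnn(seq)                    # connect sequential proposals along the row
        x = self.fc(seq).reshape(b, h, w, 512).permute(0, 3, 1, 2)
        return self.score(x), self.vcoord(x)      # each: (B, 2*num_anchors, H, W)
```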
IEEE Transactions on Image Processing | 2016
Tong He; Weilin Huang; Yu Qiao; Jian Yao
Recent deep learning models have demonstrated strong capabilities for classifying text and non-text components in natural images. They extract a high-level feature computed globally from a whole image component (patch), where cluttered background information may dominate true text features in the deep representation. This leads to less discriminative power and poorer robustness. In this paper, we present a new system for scene text detection by proposing a novel text-attentional convolutional neural network (Text-CNN) that particularly focuses on extracting text-related regions and features from the image components. We develop a new learning mechanism to train the Text-CNN with multi-level, rich supervised information, including a text region mask, character labels, and binary text/non-text information. The rich supervision gives the Text-CNN a strong capability for discriminating ambiguous texts and increases its robustness against complicated background components. The training process is formulated as a multi-task learning problem, where low-level supervised information greatly facilitates the main task of text/non-text classification. In addition, a powerful low-level detector called contrast-enhancement maximally stable extremal regions (MSERs) is developed, which extends the widely used MSERs by enhancing the intensity contrast between text patterns and background. This allows it to detect highly challenging text patterns, resulting in a higher recall. Our approach achieved promising results on the ICDAR 2013 dataset, with an F-measure of 0.82, substantially improving the state-of-the-art results.
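The multi-task supervision can be sketched as a shared trunk with three heads, where auxiliary text-region-mask and character-label losses are added to the main text/non-text loss. The backbone, head sizes, and loss weights below are placeholders, not the paper's exact Text-CNN.

```python
# Sketch of the multi-level supervision idea; architecture and weights are illustrative.
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskTextCNN(nn.Module):
    def __init__(self, num_chars=36):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Conv2d(3, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
        )
        self.mask_head = nn.Conv2d(128, 1, 1)           # text region mask (auxiliary task)
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.char_head = nn.Linear(128, num_chars)      # character label (auxiliary task)
        self.text_head = nn.Linear(128, 2)              # text/non-text (main task)

    def forward(self, x):
        f = self.trunk(x)
        g = self.pool(f).flatten(1)
        return self.text_head(g), self.char_head(g), self.mask_head(f)

def multitask_loss(outputs, targets, w_char=0.25, w_mask=0.25):
    text_logit, char_logit, mask_logit = outputs
    text_y, char_y, mask_y = targets                    # mask_y: (B, 1, H/2, W/2) in [0, 1]
    loss = F.cross_entropy(text_logit, text_y)
    loss = loss + w_char * F.cross_entropy(char_logit, char_y)
    loss = loss + w_mask * F.binary_cross_entropy_with_logits(mask_logit, mask_y)
    return loss
```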
IEEE Transactions on Image Processing | 2017
Sheng Guo; Weilin Huang; Limin Wang; Yu Qiao
Convolutional neural networks (CNNs) have recently achieved remarkable successes in various image classification and understanding tasks. The deep features obtained at the top fully connected layer of the CNN (FC-features) exhibit rich global semantic information and are extremely effective in image classification. On the other hand, the convolutional features in the middle layers of the CNN also contain meaningful local information, but are not fully explored for image representation. In this paper, we propose a novel locally supervised deep hybrid model (LS-DHM) that effectively enhances and explores the convolutional features for scene recognition. First, we notice that the convolutional features capture local objects and fine structures of scene images, which yield important cues for discriminating ambiguous scenes, whereas these features are significantly eliminated in the highly compressed FC representation. Second, we propose a new local convolutional supervision layer to enhance the local structure of the image by directly propagating the label information to the convolutional layers. Third, we propose an efficient Fisher convolutional vector (FCV) that successfully rescues the orderless mid-level semantic information (e.g., objects and textures) of the scene image. The FCV encodes the large-sized convolutional maps into a fixed-length mid-level representation, and is demonstrated to be strongly complementary to the high-level FC-features. Finally, both the FCV and FC-features are collaboratively employed in the LS-DHM representation, which achieves outstanding performance in our experiments. It obtains 83.75% and 67.56% accuracy, respectively, on the heavily benchmarked MIT Indoor67 and SUN397 datasets, advancing the state of the art substantially.
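The FCV idea, treating each spatial position of a convolutional map as a local descriptor and encoding the set with a Fisher vector under a diagonal GMM before concatenation with the FC-features, can be sketched as follows. The GMM size and the power/L2 normalization choices are assumptions for illustration, not the paper's exact settings.

```python
# Sketch of a Fisher-vector encoding of convolutional features (not the paper's exact FCV).
import numpy as np
from sklearn.mixture import GaussianMixture

def fisher_vector(local_desc, gmm):
    """local_desc: (N, D) conv activations at N spatial positions; gmm: fitted diagonal GMM."""
    N, _ = local_desc.shape
    q = gmm.predict_proba(local_desc)                        # (N, K) soft assignments
    w, mu, var = gmm.weights_, gmm.means_, gmm.covariances_  # diagonal covariances: (K, D)
    sigma = np.sqrt(var)
    d_mu, d_var = [], []
    for k in range(gmm.n_components):
        diff = (local_desc - mu[k]) / sigma[k]               # (N, D)
        d_mu.append((q[:, k:k + 1] * diff).sum(0) / (N * np.sqrt(w[k])))
        d_var.append((q[:, k:k + 1] * (diff ** 2 - 1)).sum(0) / (N * np.sqrt(2 * w[k])))
    fv = np.concatenate(d_mu + d_var)
    fv = np.sign(fv) * np.sqrt(np.abs(fv))                   # power normalization
    return fv / (np.linalg.norm(fv) + 1e-12)                 # L2 normalization

# Usage sketch: conv_map is (C, H, W) from a middle layer, fc_feat is the FC-feature vector.
# local_desc = conv_map.reshape(C, -1).T
# gmm = GaussianMixture(n_components=64, covariance_type='diag').fit(training_descriptors)
# hybrid = np.concatenate([fisher_vector(local_desc, gmm), fc_feat])
```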
IEEE Transactions on Image Processing | 2017
Limin Wang; Sheng Guo; Weilin Huang; Yuanjun Xiong; Yu Qiao
Convolutional neural networks (CNNs) have made remarkable progress on scene recognition, partially due to recent large-scale scene datasets such as Places and Places2. Scene categories are often defined by multi-level information, including local objects, global layout, and background environment, leading to large intra-class variations. In addition, with the increasing number of scene categories, label ambiguity has become another crucial issue in large-scale classification. This paper focuses on large-scale scene recognition and makes two major contributions to tackle these issues. First, we propose a multi-resolution CNN architecture that captures visual content and structure at multiple levels. The multi-resolution CNNs are composed of coarse-resolution CNNs and fine-resolution CNNs, which are complementary to each other. Second, we design two knowledge-guided disambiguation techniques to deal with the problem of label ambiguity: 1) we exploit the knowledge from the confusion matrix computed on validation data to merge ambiguous classes into a super category, and 2) we utilize the knowledge of extra networks to produce a soft label for each image. The super categories or soft labels are then employed to guide CNN training on Places2. We conduct extensive experiments on three large-scale image datasets (ImageNet, Places, and Places2), demonstrating the effectiveness of our approach. Furthermore, our method took part in two major scene recognition challenges, achieving second place at the Places2 challenge in ILSVRC 2015 and first place at the LSUN challenge in CVPR 2016. Finally, we directly test the learned representations on other scene benchmarks and obtain new state-of-the-art results on MIT Indoor67 (86.7%) and SUN397 (72.0%). We release the code and models at https://github.com/wanglimin/MRCNN-Scene-Recognition.
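The soft-label disambiguation can be sketched as a loss that mixes the hard ground-truth label with the soft label produced by an extra network. The mixing weight alpha and the temperature T below are illustrative assumptions, not the paper's settings.

```python
# Sketch of knowledge-guided training with soft labels; alpha and T are placeholders.
import torch
import torch.nn.functional as F

def disambiguation_loss(logits, hard_label, soft_label, alpha=0.5, T=2.0):
    """logits: (B, C); hard_label: (B,) class ids; soft_label: (B, C) probabilities
    produced by an extra (knowledge) network."""
    ce = F.cross_entropy(logits, hard_label)                 # standard hard-label term
    log_p = F.log_softmax(logits / T, dim=1)
    kl = F.kl_div(log_p, soft_label, reduction='batchmean')  # match the soft labels
    return (1 - alpha) * ce + alpha * kl
```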
Medical Image Computing and Computer Assisted Intervention | 2017
Weilin Huang; Christopher P. Bridge; J. Alison Noble; Andrew Zisserman
We present an automatic method to describe clinically useful information about scanning and to guide image interpretation in ultrasound (US) videos of the fetal heart. Our method jointly predicts the visibility, viewing plane, location, and orientation of the fetal heart at the frame level. The contributions of the paper are three-fold: (i) a convolutional neural network architecture is developed for multi-task prediction, computed by sliding a 3x3 window spatially through the convolutional maps; (ii) an anchor mechanism and an Intersection over Union (IoU) loss are applied to improve localization accuracy; (iii) a recurrent architecture is designed to recursively compute regional convolutional features temporally over sequential frames, allowing each prediction to be conditioned on the whole video. This results in a spatial-temporal model that precisely describes detailed heart parameters in challenging US videos. We report results on a real-world clinical dataset, where our method achieves performance on par with expert annotations.
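A rough sketch of the spatio-temporal idea: a 3x3 convolution slides over per-frame convolutional maps, a recurrent unit carries the regional features across frames, and small heads predict visibility, viewing plane, and location/orientation for every frame. All layer sizes, the global pooling, and the (x, y, angle) parameterization are assumptions for illustration, not the paper's model.

```python
# Sketch of a convolutional-recurrent per-frame predictor; sizes and heads are placeholders.
import torch
import torch.nn as nn

class FetalHeartRNN(nn.Module):
    def __init__(self, in_channels=512, hidden=256, num_planes=3):
        super().__init__()
        self.window = nn.Conv2d(in_channels, 256, 3, padding=1)   # 3x3 sliding window
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.rnn = nn.GRU(256, hidden, batch_first=True)          # temporal recurrence
        self.visibility = nn.Linear(hidden, 2)
        self.plane = nn.Linear(hidden, num_planes)
        self.loc_orient = nn.Linear(hidden, 3)                    # (x, y, angle), illustrative

    def forward(self, frame_feats):               # frame_feats: (B, T, C, H, W) conv maps
        b, t, c, h, w = frame_feats.shape
        x = torch.relu(self.window(frame_feats.reshape(b * t, c, h, w)))
        x = self.pool(x).flatten(1).reshape(b, t, -1)
        x, _ = self.rnn(x)                        # condition each frame on earlier frames
        return self.visibility(x), self.plane(x), self.loc_orient(x)
```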
IEEE Transactions on Image Processing | 2017
Dihong Gong; Zhifeng Li; Weilin Huang; Xuelong Li; Dacheng Tao
Heterogeneous face recognition is an important yet challenging problem in the face recognition community. It refers to matching a probe face image to a gallery of face images taken from an alternate imaging modality. The major challenge of heterogeneous face recognition lies in the great discrepancies between different image modalities. Conventional face feature descriptors, e.g., local binary patterns, histogram of oriented gradients, and the scale-invariant feature transform, are mostly designed in a handcrafted way and thus generally fail to extract the common discriminant information from heterogeneous face images. In this paper, we propose a new feature descriptor called the common encoding model for heterogeneous face recognition, which is able to capture common discriminant information such that the large modality gap can be significantly reduced at the feature extraction stage. Specifically, we turn a face image into an encoded one with an encoding model learned from the training data, where the difference between the encoded heterogeneous face images of the same person can be minimized. Based on the encoded face images, we further develop a discriminant matching method to infer the hidden identity information of the cross-modality face images for enhanced recognition performance. The effectiveness of the proposed approach is demonstrated on several public-domain face datasets in two typical heterogeneous face recognition scenarios: matching NIR faces to VIS faces and matching sketches to photographs.
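As a loose illustration of the objective (encoding paired cross-modality faces of the same person so that their codes become close while different identities stay apart), the sketch below trains a single shared linear encoder with a simple pairwise loss. This is only a stand-in for the idea of minimizing the cross-modality difference at the feature level; it is not the paper's common encoding model or its discriminant matching stage.

```python
# Sketch only: a shared linear encoder trained to align same-identity NIR/VIS features.
import torch
import torch.nn as nn
import torch.nn.functional as F

def train_common_encoder(nir_feats, vis_feats, code_dim=128, epochs=100, lr=1e-3):
    """nir_feats, vis_feats: (N, D) float tensors; row i of each describes the same person."""
    encoder = nn.Linear(nir_feats.shape[1], code_dim)
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(epochs):
        a = F.normalize(encoder(nir_feats), dim=1)
        b = F.normalize(encoder(vis_feats), dim=1)
        sim = a @ b.t()                              # (N, N) cross-modality similarities
        target = torch.arange(len(a))                # matching pair sits on the diagonal
        loss = F.cross_entropy(sim / 0.1, target)    # pull pairs together, push others apart
        opt.zero_grad(); loss.backward(); opt.step()
    return encoder
```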
IEEE Transactions on Image Processing | 2015
Yongqiang Gao; Weilin Huang; Yu Qiao
Local binary descriptors are attracting increasing attention due to their great advantages in computational speed, enabling real-time performance in numerous image/vision applications. Various methods have been proposed to learn data-dependent binary descriptors. However, most existing binary descriptors aim overly at computational simplicity at the expense of significant information loss, which causes ambiguity in similarity measurement using the Hamming distance. In this paper, considering that multiple features might share complementary information, we present a novel local binary descriptor, referred to as the ring-based multi-grouped descriptor (RMGD), to successfully bridge the performance gap between current binary and floating-point descriptors. Our contributions are twofold. First, we introduce a new pooling configuration based on spatial ring-region sampling, allowing binary tests on the full set of pairwise regions with different shapes, scales, and distances. This leads to a more meaningful description than existing methods, which normally apply a limited set of pooling configurations. An extended AdaBoost is then proposed for efficient bit selection by emphasizing high variance and low correlation, achieving a highly compact representation. Second, the RMGD is computed from multiple image properties from which binary strings are extracted. We cast multi-grouped feature integration as a rankSVM or sparse support vector machine learning problem, so that different features can compensate strongly for each other, which is the key to discriminativeness and robustness. The performance of the RMGD was evaluated on a number of publicly available benchmarks, where the RMGD significantly outperforms state-of-the-art binary descriptors.
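The bit-selection criterion (prefer high-variance, weakly correlated binary tests) can be illustrated with the greedy procedure below. The paper realizes this with an extended AdaBoost, so the correlation threshold and the simple greedy loop here are simplifying assumptions, not the paper's algorithm.

```python
# Sketch of greedy high-variance / low-correlation bit selection; thresholds are placeholders.
import numpy as np

def select_bits(candidate_bits, num_select, corr_thresh=0.3):
    """candidate_bits: (N_patches, N_candidates) array of {0, 1} binary test responses."""
    variance = candidate_bits.var(axis=0)
    order = np.argsort(-variance)                    # consider high-variance candidates first
    selected = []
    for j in order:
        if len(selected) == num_select:
            break
        uncorrelated = all(
            abs(np.corrcoef(candidate_bits[:, j], candidate_bits[:, k])[0, 1]) < corr_thresh
            for k in selected)
        if uncorrelated:
            selected.append(j)
    return selected                                  # column indices of the kept bits
```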
Neurocomputing | 2017
Yongqiang Gao; Weilin Huang; Yu Qiao
Binary descriptors have received extensive research interest due to their low memory storage and computational efficiency. However, the discriminative ability of binary descriptors is often limited in comparison with general floating-point ones. In this paper, we present a learning framework to effectively integrate multiple binary descriptors, referred to as learning-based multiple binary descriptors (LMBD). We observe that previous successful binary descriptors, such as the Receptive Fields Descriptor with rectangular pooling areas (RFD_R) and with Gaussian pooling areas (RFD_G), BinBoost, and Boosted Gradient Maps (BGM), are highly complementary to each other. We show that the proposed LMBD can significantly improve the discriminative ability of individual binary descriptors. We formulate the fusion of multiple groups of binary descriptors as a pairwise ranking problem, which can be solved effectively in a rankSVM framework. Extensive experiments were conducted to evaluate the efficiency of LMBD. The proposed LMBD obtains an error rate of 12.44% on the challenging local patch datasets, which is about 2% lower than the state-of-the-art result (obtained by a learning-based floating-point descriptor). Furthermore, the proposed binary descriptor also outperforms other binary descriptors on the image matching task.
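The pairwise-ranking fusion can be sketched by reducing the rankSVM to a linear SVM trained on difference vectors of per-descriptor distances. The random pairing of matching and non-matching samples and the non-negativity projection below are assumptions for illustration, not the paper's exact formulation.

```python
# Sketch of rankSVM-style fusion of per-descriptor distances; details are placeholders.
import numpy as np
from sklearn.svm import LinearSVC

def learn_fusion_weights(dist_match, dist_nonmatch):
    """dist_match, dist_nonmatch: (N, M) Hamming distances from M binary descriptors,
    for matching and non-matching patch pairs respectively."""
    rng = np.random.default_rng(0)
    idx = rng.integers(0, len(dist_nonmatch), size=len(dist_match))
    diffs = dist_nonmatch[idx] - dist_match        # should be positive under good weights
    X = np.vstack([diffs, -diffs])                 # symmetric ranking constraints
    y = np.hstack([np.ones(len(diffs)), -np.ones(len(diffs))])
    svm = LinearSVC(fit_intercept=False).fit(X, y)
    w = np.maximum(svm.coef_.ravel(), 0)           # keep a non-negative combination
    return w / (w.sum() + 1e-12)                   # fused distance = distances @ w
```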
Chinese Conference on Biometric Recognition | 2017
Huijuan Huang; Zhi Tian; Tong He; Weilin Huang; Yu Qiao
In this paper, we present a novel Orientation-Aware Text Proposals Network (OA-TPN) for detecting text in the wild. The OA-TPN is able to accurately localize arbitrarily oriented text lines in a natural image. Instead of detecting a whole text line at one time, the OA-TPN detects sequences of small-scale orientation-aware text proposals. To handle text lines with different orientations, we utilize deep networks to jointly estimate text proposals with associated directions on the convolutional maps. Final text bounding boxes are generated from the predicted text proposals by the proposed text-line construction approach. The proposed text detector works reliably on multi-scale and multi-orientation text with single-scale images. Experimental results on MSRA-TD500 and SWT demonstrate the effectiveness of our method.
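The text-line construction step can be sketched as grouping neighbouring proposals whose predicted orientations agree and then fitting an oriented box to each group. The left-to-right sweep, the distance gap, and the angle threshold below are placeholders rather than the paper's procedure.

```python
# Sketch of grouping oriented proposals into text lines; thresholds are illustrative.
import numpy as np
import cv2

def build_text_lines(centers, angles, max_gap=24.0, max_angle_diff=np.pi / 12):
    """centers: (N, 2) proposal centres; angles: (N,) predicted orientations in radians."""
    order = np.argsort(centers[:, 0])                 # sweep proposals left to right
    lines, current = [], [order[0]]
    for prev, cur in zip(order[:-1], order[1:]):
        close = np.linalg.norm(centers[cur] - centers[prev]) < max_gap
        aligned = abs(angles[cur] - angles[prev]) < max_angle_diff
        if close and aligned:
            current.append(cur)
        else:
            lines.append(current)
            current = [cur]
    lines.append(current)
    # one oriented bounding box ((cx, cy), (w, h), angle) per constructed line
    return [cv2.minAreaRect(centers[idx].astype(np.float32)) for idx in lines]
```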
Collaboration
Dive into Weilin Huang's collaborations.
Commonwealth Scientific and Industrial Research Organisation