Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Yanming Guo is active.

Publication


Featured research published by Yanming Guo.


Neurocomputing | 2016

Deep learning for visual understanding

Yanming Guo; Yu Liu; Ard Oerlemans; Songyang Lao; Song Wu; Michael S. Lew

Deep learning algorithms are a subset of machine learning algorithms that aim to discover multiple levels of distributed representations. Recently, numerous deep learning algorithms have been proposed to solve traditional artificial intelligence problems. This work reviews the state of the art in deep learning algorithms for computer vision, highlighting the contributions and challenges from over 210 recent research papers. It first gives an overview of various deep learning approaches and their recent developments, then briefly describes their applications in diverse vision tasks, such as image classification, object detection, image retrieval, semantic segmentation and human pose estimation. Finally, the paper summarizes the future trends and challenges in designing and training deep neural networks.


international conference on multimedia retrieval | 2015

DeepIndex for Accurate and Efficient Image Retrieval

Yu Liu; Yanming Guo; Song Wu; Michael S. Lew

In the well-known Bag-of-Words model, local features, such as the SIFT descriptor, are extracted and quantized into visual words. An index is then created to reduce the computational burden. However, local clues serve as low-level representations that cannot represent high-level semantic concepts. Recently, the success of deep features extracted from convolutional neural networks (CNNs) has shown promising results toward bridging the semantic gap. Inspired by this, we attempt to introduce deep features into inverted-index-based image retrieval and thus propose the DeepIndex framework. Moreover, considering that different deep features complement one another, we incorporate multiple deep features from different fully connected layers, resulting in the multiple DeepIndex. We find the optimal integration of one mid-level deep feature and one high-level deep feature, taken from two different CNN architectures. This can be treated as an attempt to further reduce the semantic gap. Extensive experiments on three benchmark datasets demonstrate that the proposed DeepIndex method is competitive with the state of the art on Holidays (85.65% mAP), Paris (81.24% mAP), and UKB (3.76 score). In addition, our method is efficient in terms of both memory and time cost.
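
The abstract only outlines the approach, but the core idea of an inverted index over quantized deep features can be sketched as follows. This is a minimal illustration, not the authors' implementation: the codebook size, the use of scikit-learn k-means, and the assumed `image_features` mapping from image ids to CNN feature sets are placeholder choices.

```python
from collections import defaultdict
from sklearn.cluster import MiniBatchKMeans

def build_codebook(all_features, num_words=1024):
    """Quantize deep features (rows of all_features) into visual words."""
    return MiniBatchKMeans(n_clusters=num_words, random_state=0).fit(all_features)

def build_inverted_index(image_features, codebook):
    """Map each visual word to the images whose deep features fall in it.

    image_features: dict of image_id -> array of CNN feature vectors.
    """
    index = defaultdict(list)
    for image_id, feats in image_features.items():
        for word in set(codebook.predict(feats)):
            index[word].append(image_id)
    return index

def query(index, codebook, query_feats):
    """Rank database images by the number of visual words shared with the query."""
    votes = defaultdict(int)
    for word in set(codebook.predict(query_feats)):
        for image_id in index[word]:
            votes[image_id] += 1
    return sorted(votes.items(), key=lambda kv: -kv[1])
```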


conference on multimedia modeling | 2017

On the Exploration of Convolutional Fusion Networks for Visual Recognition

Yu Liu; Yanming Guo; Michael S. Lew

Despite recent advances in multi-scale deep representations, their usefulness is limited by expensive parameters and weak fusion modules. Hence, we propose an efficient approach to fuse multi-scale deep representations, called convolutional fusion networks (CFN). By using 1×1 convolutions and global average pooling, CFN can efficiently generate side branches while adding few parameters. In addition, we present a locally-connected fusion module, which can learn adaptive weights for the side branches and form a discriminatively fused feature. CFN models trained on the CIFAR and ImageNet datasets demonstrate remarkable improvements over plain CNNs. Furthermore, we generalize CFN to three new tasks: scene recognition, fine-grained recognition and image retrieval. Our experiments show that it obtains consistent improvements on these transfer tasks.
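
As a rough illustration of the idea of cheap side branches (1×1 convolution plus global average pooling) fused with learned weights, here is a minimal PyTorch sketch. The backbone, the number of branches and the per-branch scalar weighting are assumptions; the paper's locally-connected fusion module is more elaborate than this.

```python
import torch
import torch.nn as nn

class FusionBranches(nn.Module):
    """Illustrative sketch of CFN-style side branches (not the authors' code)."""

    def __init__(self, branch_channels=(64, 128, 256), embed_dim=256, num_classes=10):
        super().__init__()
        # Each intermediate feature map gets a cheap side branch:
        # a 1x1 convolution followed by global average pooling.
        self.reduce = nn.ModuleList(
            [nn.Conv2d(c, embed_dim, kernel_size=1) for c in branch_channels]
        )
        # One learnable fusion weight per branch (a simplification of the
        # paper's locally-connected fusion module).
        self.fusion_weights = nn.Parameter(torch.ones(len(branch_channels)))
        self.classifier = nn.Linear(embed_dim, num_classes)

    def forward(self, feature_maps):
        branches = []
        for fmap, conv in zip(feature_maps, self.reduce):
            x = conv(fmap)              # 1x1 convolution
            x = x.mean(dim=(2, 3))      # global average pooling
            branches.append(x)
        stacked = torch.stack(branches, dim=1)          # (batch, branches, embed_dim)
        weights = torch.softmax(self.fusion_weights, dim=0)
        fused = (stacked * weights.view(1, -1, 1)).sum(dim=1)
        return self.classifier(fused)
```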


International Journal of Multimedia Information Retrieval | 2018

A review of semantic segmentation using deep neural networks

Yanming Guo; Yu Liu; Theodoros Georgiou; Michael S. Lew

During the long history of computer vision, one of the grand challenges has been semantic segmentation: the ability to segment an unknown image into different parts and objects (e.g., beach, ocean, sun, dog, swimmer). Furthermore, segmentation is even deeper than object recognition because recognition is not necessary for segmentation. Specifically, humans can perform image segmentation without even knowing what the objects are (for example, in satellite imagery or medical X-ray scans there may be several objects that are unknown, but they can still be segmented within the image, typically for further investigation). Performing segmentation without knowing the exact identity of all objects in the scene is an important part of our visual understanding process; it can give us a powerful model to understand the world and also be used to improve or augment existing computer vision techniques. In this work, we review the field of semantic segmentation as it pertains to deep convolutional neural networks. We provide comprehensive coverage of the top approaches and summarize the strengths, weaknesses and major challenges.


IEEE Transactions on Multimedia | 2018

Bag of Surrogate Parts Feature for Visual Recognition

Yanming Guo; Yu Liu; Songyang Lao; E. Bakker; Liang Bai; Michael S. Lew

Convolutional neural networks (CNNs) have attracted significant attention in visual recognition. Several recent studies have shown that, in addition to the fully connected layers, the features derived from the convolutional layers of CNNs can also achieve promising performance in image classification tasks. In this paper, we propose a new feature from the convolutional layers, called Bag of Surrogate Parts (BoSP), and its spatial variant, Spatial-BoSP (S-BoSP). The main idea is to treat the feature maps in the convolutional layers as surrogate parts, and to densely sample and assign image regions to these surrogate parts by observing the activation values. Together with BoSP/S-BoSP, we propose two further schemes to enhance performance: scale pooling and global-part prediction. Scale pooling aims to handle objects with different scales and deformations, and global-part prediction combines the predictions of global and part features. By conducting extensive experiments on generic object, fine-grained object and scene datasets, we find that the proposed scheme not only achieves superior performance to the fully connected features, but also produces competitive or, in some cases, remarkably better performance than the state of the art.
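
A simplified reading of the BoSP idea (treating each convolutional feature map as a surrogate part and letting every spatial location vote for the part with the strongest activation) can be sketched in a few lines of PyTorch. This only illustrates the abstract; scale pooling, S-BoSP and global-part prediction are not reproduced, and the hard argmax assignment is an assumption.

```python
import torch

def bag_of_surrogate_parts(conv_features):
    """Simplified BoSP-style encoding (an illustrative reading of the abstract).

    conv_features: tensor of shape (channels, height, width) from a
    convolutional layer. Each channel is treated as a 'surrogate part';
    every spatial location votes for the channel with its largest
    activation, and the votes form the image-level descriptor.
    """
    channels = conv_features.shape[0]
    flat = conv_features.reshape(channels, -1)   # (C, H*W)
    assignment = flat.argmax(dim=0)              # winning part per location
    histogram = torch.bincount(assignment, minlength=channels).float()
    return histogram / histogram.sum()           # L1-normalized descriptor

# Usage with a random feature map standing in for real CNN activations:
features = torch.rand(512, 14, 14)
descriptor = bag_of_surrogate_parts(features)
print(descriptor.shape)  # torch.Size([512])
```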


conference on multimedia modeling | 2017

What Convnets Make for Image Captioning

Yu Liu; Yanming Guo; Michael S. Lew

Nowadays, a general pipeline for the image captioning task takes advantage of image representations based on convolutional neural networks (CNNs) and sequence modeling based on recurrent neural networks (RNNs). As captioning performance closely depends on the discriminative capacity of CNNs, our work aims to investigate the effects of different Convnets (CNN models) on image captioning. We train three Convnets based on different classification tasks: single-label, multi-label and multi-attribute, and then feed visual representations from these Convnets into a Long Short-Term Memory (LSTM) to model the sequence of words. Since the three Convnets focus on different visual contents in one image, we propose aggregating them together to generate a richer visual representation. Furthermore, during testing, we use an efficient multi-scale augmentation approach based on fully convolutional networks (FCNs). Extensive experiments on the MS COCO dataset provide significant insights into the effects of Convnets. Finally, we achieve comparable results to the state-of-the-art for both caption generation and image-sentence retrieval tasks.
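
A minimal sketch of the general CNN-plus-LSTM captioning pipeline described here is given below, assuming the three Convnet representations are simply concatenated before being fed to the LSTM. The dimensions, vocabulary size and aggregation choice are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class SimpleCaptioner(nn.Module):
    """Minimal CNN-feature-to-LSTM captioning decoder (illustration only).

    The paper feeds representations from three Convnets (single-label,
    multi-label, multi-attribute) into an LSTM; here the three feature
    vectors are assumed to be concatenated into one visual vector.
    """

    def __init__(self, feature_dim=3 * 2048, embed_dim=512, vocab_size=10000):
        super().__init__()
        self.visual_proj = nn.Linear(feature_dim, embed_dim)
        self.word_embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, embed_dim, batch_first=True)
        self.output = nn.Linear(embed_dim, vocab_size)

    def forward(self, cnn_features, captions):
        # Prepend the projected visual vector to the word embeddings.
        visual = self.visual_proj(cnn_features).unsqueeze(1)   # (B, 1, E)
        words = self.word_embed(captions)                      # (B, T, E)
        inputs = torch.cat([visual, words], dim=1)
        hidden, _ = self.lstm(inputs)
        return self.output(hidden)                             # next-word logits
```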


british machine vision conference | 2016

Bag of Surrogate Parts: one inherent feature of deep CNNs.

Yanming Guo; Michael S. Lew

Convolutional Neural Networks (CNNs) have achieved promising performance in image classification tasks. In this paper, we develop a new feature from convolutional layers, called Bag of Surrogate Parts (BoSP), and its spatial variant, Spatial BoSP (SBoSP). Specifically, we take the feature maps in convolutional layers as surrogate parts, and densely sample and assign the regions in input images to these surrogate parts by observing the activation values. To better handle objects with different scales and deformations, and to make more comprehensive predictions, we further propose a scale pooling technique for assigning the features, and global constrained augmentation for the final prediction. Compared with most existing methods that also utilize the activations from convolutional layers, the proposed method is efficient, has no tuning parameters, and generates low-dimensional, highly discriminative features. Experiments on generic object, fine-grained object and scene datasets indicate that the proposed feature not only produces superior results to fully-connected-layer-based features, but also achieves comparable or, in some cases, considerably better performance than the state of the art.


pacific rim conference on multimedia | 2015

Convolutional Neural Networks Features: Principal Pyramidal Convolution

Yanming Guo; Songyang Lao; Yu Liu; Liang Bai; Shi Liu; Michael S. Lew

The features extracted from convolutional neural networks (CNNs) are able to capture the discriminative part of an image and have shown superior performance in visual recognition. Furthermore, it has been verified that CNN activations trained on large and diverse datasets can act as generic features and be transferred to other visual recognition tasks. In this paper, we aim to learn more from an image and present an effective method called Principal Pyramidal Convolution (PPC). The scheme first partitions the image into two levels, extracts CNN activations for each sub-region along with the whole image, and then aggregates them together. The concatenated feature is subsequently reduced to the standard dimension using the Principal Component Analysis (PCA) algorithm, generating the refined CNN feature. When applied to image classification and retrieval tasks, the PPC feature consistently outperforms the conventional CNN feature, regardless of the network from which it is derived. Specifically, PPC achieves a state-of-the-art result on the MIT Indoor67 dataset, utilizing the activations from Places-CNN.
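
The PPC recipe (extract CNN activations for the whole image and its sub-regions, concatenate, then reduce with PCA) can be sketched as follows. The `cnn_activation` extractor is hypothetical, the 2×2 partition for the second level is an assumption, and `pca` is expected to be a scikit-learn PCA model fitted beforehand on training features.

```python
import numpy as np

def ppc_feature(image, cnn_activation, pca):
    """Sketch of Principal Pyramidal Convolution feature extraction.

    image: numpy array of shape (H, W, C).
    cnn_activation: callable returning a 1-D CNN feature for an image or crop.
    pca: an already-fitted sklearn.decomposition.PCA instance that maps the
         concatenated vector back to the standard CNN feature dimension.
    """
    h, w = image.shape[:2]
    crops = [image,                                              # level 0: whole image
             image[: h // 2, : w // 2], image[: h // 2, w // 2:],
             image[h // 2:, : w // 2], image[h // 2:, w // 2:]]  # level 1: quadrants
    concatenated = np.concatenate([cnn_activation(c) for c in crops])
    return pca.transform(concatenated[None, :])[0]               # refined PPC feature
```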


Archive | 2013

Player Detection Algorithm Based on Color Segmentation and Improved CamShift Algorithm

Yanming Guo; Songyang Lao; Liang Bai

Most existing methods for detecting and tracking moving objects cannot achieve a good segmentation of the player in dynamic scenes. This chapter therefore puts forward an algorithm, based on color segmentation and the CamShift algorithm, to detect the tennis player. First, a supervised clustering binary tree is applied to automatically detect the target's area. Second, the detected area is taken as the initial tracking window, and the target is tracked using the improved CamShift algorithm. On the basis of accurate tracking, the chapter proposes a new method for extracting the player that exploits the fact that the court's color is usually simple. The method extracts the player by color features, and it combines the CamShift algorithm with frame differencing to overcome the tracking-loss problem. Experimental results show that our algorithm is effective in detecting the player in dynamic scenes.
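
For readers unfamiliar with CamShift, a standard OpenCV tracking loop driven by a hue histogram looks roughly like the sketch below. The video path and initial window are placeholders, and the chapter's supervised clustering binary tree and frame-difference fallback are not reproduced here.

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("match.mp4")            # placeholder input video
ok, frame = cap.read()
x, y, w, h = 300, 200, 60, 120                 # assumed initial player window
track_window = (x, y, w, h)

# Hue histogram of the initial region, ignoring dark/unsaturated pixels.
hsv_roi = cv2.cvtColor(frame[y:y + h, x:x + w], cv2.COLOR_BGR2HSV)
mask = cv2.inRange(hsv_roi, np.array((0., 60., 32.)), np.array((180., 255., 255.)))
roi_hist = cv2.calcHist([hsv_roi], [0], mask, [180], [0, 180])
cv2.normalize(roi_hist, roi_hist, 0, 255, cv2.NORM_MINMAX)

criteria = (cv2.TERM_CRITERIA_EPS | cv2.TERM_CRITERIA_COUNT, 10, 1)
while True:
    ok, frame = cap.read()
    if not ok:
        break
    hsv = cv2.cvtColor(frame, cv2.COLOR_BGR2HSV)
    back_proj = cv2.calcBackProject([hsv], [0], roi_hist, [0, 180], 1)
    # CamShift adapts the window size and orientation to the back-projection.
    rotated_box, track_window = cv2.CamShift(back_proj, track_window, criteria)
    pts = np.int32(cv2.boxPoints(rotated_box))
    cv2.polylines(frame, [pts], True, (0, 255, 0), 2)
```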


Pattern Recognition | 2018

Learning visual and textual representations for multimodal matching and classification

Yu Liu; Li Liu; Yanming Guo; Michael S. Lew

Multimodal learning has been an important and challenging problem for decades; it aims to bridge the modality gap between heterogeneous representations, such as vision and language. Unlike many current approaches which only focus on either multimodal matching or classification, we propose a unified network to jointly learn multimodal matching and classification (MMC-Net) between images and texts. The proposed MMC-Net model can seamlessly integrate the matching and classification components. It first learns visual and textual embedding features in the matching component, and then generates discriminative multimodal representations in the classification component. Combining the two components in a unified model helps to improve their performance. Moreover, we present a multi-stage training algorithm that minimizes both the matching and classification loss functions. Experimental results on four well-known multimodal benchmarks demonstrate the effectiveness and efficiency of the proposed approach, which achieves competitive performance for multimodal matching and classification compared to state-of-the-art approaches.
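
To make the matching-plus-classification idea concrete, here is a toy PyTorch sketch that combines a hinge-based matching loss on normalized image and text embeddings with a cross-entropy classification loss on their concatenation. The network sizes, the loss form and the single-stage optimization are assumptions for illustration; MMC-Net itself and its multi-stage schedule are more involved.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MatchClassifyNet(nn.Module):
    """Toy joint matching + classification model (not the authors' MMC-Net)."""

    def __init__(self, img_dim=2048, txt_dim=300, embed_dim=512, num_classes=20):
        super().__init__()
        self.img_embed = nn.Linear(img_dim, embed_dim)   # matching component
        self.txt_embed = nn.Linear(txt_dim, embed_dim)
        self.classifier = nn.Linear(2 * embed_dim, num_classes)  # classification component

    def forward(self, img_feat, txt_feat):
        v = F.normalize(self.img_embed(img_feat), dim=-1)
        t = F.normalize(self.txt_embed(txt_feat), dim=-1)
        logits = self.classifier(torch.cat([v, t], dim=-1))
        return v, t, logits

def joint_loss(v, t, logits, labels, margin=0.2, alpha=1.0):
    """Hinge matching loss over in-batch hardest negatives plus classification loss."""
    sim = v @ t.t()                                  # cosine similarities
    pos = sim.diag().unsqueeze(1)                    # matched image-text pairs
    cost = (margin + sim - pos).clamp(min=0)
    cost.fill_diagonal_(0)                           # ignore the positive pair itself
    matching = cost.max(dim=1).values.mean()
    classification = F.cross_entropy(logits, labels)
    return matching + alpha * classification
```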

Collaboration


Dive into Yanming Guo's collaborations.

Top Co-Authors

Songyang Lao

National University of Defense Technology


Liang Bai

National University of Defense Technology


Li Liu

National University of Defense Technology


Shi Liu

Beijing Normal University
