Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Xinhang Song is active.

Publication


Featured research published by Xinhang Song.


IEEE Transactions on Multimedia | 2015

Geolocalized Modeling for Dish Recognition

Ruihan Xu; Luis Herranz; Shuqiang Jiang; Shuang Wang; Xinhang Song; Ramesh Jain

Food-related photos have become increasingly popular due to social networks, food recommendations, and dietary assessment systems. Reliable annotation is essential in those systems, but unconstrained automatic food recognition is still not accurate enough. Most works focus on exploiting only the visual content while ignoring the context. To address this limitation, in this paper we explore leveraging geolocation and external information about restaurants to simplify the classification problem. We propose a framework incorporating discriminative classification in geolocalized settings and introduce the concept of geolocalized models, which, in our scenario, are trained locally at each restaurant location. In particular, we propose two strategies to implement this framework: geolocalized voting and combinations of bundled classifiers. Both models show promising performance, and the latter is particularly efficient and scalable. We collected a restaurant-oriented food dataset with food images, dish tags, and restaurant-level information, such as the menu and geolocation. Experiments on this dataset show that exploiting geolocation improves recognition performance by around 30%, and geolocalized models contribute an additional 3-8% absolute gain while training up to five times faster.
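
The geolocalized-voting strategy can be illustrated with a small sketch. The snippet below is only a minimal approximation of the framework, assuming a hypothetical list of restaurants with geolocations and menus and a generic per-class score vector from any dish classifier; it restricts predictions to dishes served near the query location and falls back to unconstrained recognition otherwise.

```python
import numpy as np

# Hypothetical data: each restaurant has a geolocation and a menu (set of dish class ids).
restaurants = [
    {"latlon": np.array([39.90, 116.40]), "menu": {0, 3, 7}},
    {"latlon": np.array([39.91, 116.41]), "menu": {2, 3, 5}},
]

def nearby_restaurants(query_latlon, restaurants, radius_km=0.5):
    """Return restaurants within radius_km of the query (equirectangular approximation)."""
    out = []
    for r in restaurants:
        d = np.radians(r["latlon"] - query_latlon)
        x = d[1] * np.cos(np.radians(query_latlon[0]))
        dist_km = 6371.0 * np.hypot(d[0], x)
        if dist_km <= radius_km:
            out.append(r)
    return out

def geolocalized_predict(class_scores, query_latlon, restaurants):
    """Keep only dish classes that appear on nearby menus, then pick the best-scoring one."""
    candidates = set().union(*(r["menu"] for r in nearby_restaurants(query_latlon, restaurants)))
    if not candidates:                      # no nearby restaurant: unconstrained recognition
        return int(np.argmax(class_scores))
    masked = np.full_like(class_scores, -np.inf)
    idx = np.array(sorted(candidates))
    masked[idx] = class_scores[idx]
    return int(np.argmax(masked))

# Usage: scores from any global dish classifier (softmax or SVM margins), one per class.
scores = np.random.rand(10)
print(geolocalized_predict(scores, np.array([39.905, 116.405]), restaurants))
```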


Computer Vision and Pattern Recognition | 2015

Joint multi-feature spatial context for scene recognition in the semantic manifold

Xinhang Song; Shuqiang Jiang; Luis Herranz

In the semantic multinomial framework, patches and images are modeled as points in a semantic probability simplex. Patch theme models are learned resorting to weak supervision via image labels, which leads to the problem of scene categories co-occurring in this semantic space. Fortunately, each category has its own co-occurrence patterns that are consistent across the images in that category. Thus, discovering and modeling these patterns is critical to improve the recognition performance in this representation. In this paper, we observe that not only are global co-occurrences at the image level important, but also that different regions have different category co-occurrence patterns. We exploit local contextual relations to address the problem of discovering consistent co-occurrence patterns and removing noisy ones. Our hypothesis is that a less noisy semantic representation would greatly help the classifier to model consistent co-occurrences and discriminate better between scene categories. An important advantage of modeling features in a semantic space is that this space is feature independent. Thus, we can combine multiple features and spatial neighbors in the same common space, and formulate the problem as minimizing a context-dependent energy. Experimental results show that exploiting different types of contextual relations consistently improves the recognition accuracy. In particular, larger datasets benefit more from the proposed method, leading to very competitive performance.
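
As a rough illustration of minimizing a context-dependent energy over the semantic simplex, the sketch below assumes patch-level semantic multinomials on a regular grid and a simple Potts smoothness term solved with iterated conditional modes; the paper's actual model combines multiple features and richer contextual relations.

```python
import numpy as np

def icm_denoise(smn, n_iters=5, beta=1.0):
    """
    smn: (H, W, K) patch-level semantic multinomials (points on a probability simplex).
    Assigns each patch a theme label by minimizing a unary + pairwise (Potts) energy
    with iterated conditional modes; a simplified stand-in for the paper's context model.
    """
    H, W, K = smn.shape
    unary = -np.log(smn + 1e-8)                  # data term per patch and theme
    labels = unary.argmin(axis=2)                # initialize with the most likely theme
    for _ in range(n_iters):
        for i in range(H):
            for j in range(W):
                cost = unary[i, j].copy()
                for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):   # 4-neighborhood
                    ni, nj = i + di, j + dj
                    if 0 <= ni < H and 0 <= nj < W:
                        cost += beta * (np.arange(K) != labels[ni, nj])   # Potts penalty
                labels[i, j] = cost.argmin()
    return labels

# Usage on a toy 8x8 grid of 10-theme multinomials.
toy = np.random.dirichlet(np.ones(10), size=(8, 8))
print(icm_denoise(toy).shape)   # (8, 8)
```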


IEEE Transactions on Image Processing | 2017

Multi-Scale Multi-Feature Context Modeling for Scene Recognition in the Semantic Manifold

Xinhang Song; Shuqiang Jiang; Luis Herranz

Before the big data era, scene recognition was often approached with two-step inference using localized intermediate representations (objects, topics, and so on). One such approach is the semantic manifold (SM), in which patches and images are modeled as points in a semantic probability simplex. Patch models are learned resorting to weak supervision via image labels, which leads to the problem of scene categories co-occurring in this semantic space. Fortunately, each category has its own co-occurrence patterns that are consistent across the images in that category. Thus, discovering and modeling these patterns is critical to improve the recognition performance in this representation. Since the emergence of large datasets, such as ImageNet and Places, these approaches have been relegated in favor of the much more powerful convolutional neural networks (CNNs), which can automatically learn multi-layered representations from the data. In this paper, we address many limitations of the original SM approach and related works. We propose discriminative patch representations using neural networks and further propose a hybrid architecture in which the semantic manifold is built on top of multiscale CNNs. Both representations can be computed significantly faster than the Gaussian mixture models of the original SM. To combine multiple scales, spatial relations, and multiple features, we formulate rich context models using Markov random fields. To solve the optimization problem, we analyze global and local approaches, and find that a top-down hierarchical algorithm has the best performance. Experimental results show that jointly exploiting different types of contextual relations consistently improves the recognition accuracy.
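
A minimal sketch of the hybrid idea, building image-level semantic multinomials from multiscale patch features, is given below; the linear scene classifier and the simple averaging across scales are assumptions made for illustration, not the paper's exact architecture.

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def multiscale_smn(patch_feats_per_scale, W, b):
    """
    patch_feats_per_scale: list of (N_s, D) arrays of patch CNN features, one per scale.
    W, b: an assumed shared linear scene classifier (D, K) and (K,) mapping features
          onto the K-dimensional semantic simplex.
    Returns one image-level SMN per scale plus their average, which can then be fed
    to the context model (a simplified stand-in for the hybrid architecture).
    """
    per_scale = [softmax(F @ W + b).mean(axis=0) for F in patch_feats_per_scale]
    return per_scale, np.mean(per_scale, axis=0)

# Usage with random stand-ins for CNN patch features at three scales (D=512, K=15 scenes).
rng = np.random.default_rng(0)
feats = [rng.normal(size=(n, 512)) for n in (4, 16, 64)]
W, b = rng.normal(size=(512, 15)) * 0.01, np.zeros(15)
per_scale, fused = multiscale_smn(feats, W, b)
print(fused.shape)   # (15,)
```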


ACM Multimedia | 2016

Image Captioning with both Object and Scene Information

Xiangyang Li; Xinhang Song; Luis Herranz; Yaohui Zhu; Shuqiang Jiang

Recently, automatic generation of image captions has attracted great interest, not only because of its extensive applications but also because it connects computer vision and natural language processing. By combining convolutional neural networks (CNNs), which learn visual representations from images, and recurrent neural networks (RNNs), which translate the learned features into text sequences, the content of an image can be transformed into linguistic sequences. Existing approaches typically focus on visual features extracted from an object-oriented CNN (trained on ImageNet) and then decode them into natural language. In this paper, we propose a novel model that uses not only object-related but also scene-related information extracted from the images. To make full use of both object and scene information, we first combine the object information with scene information extracted from a scene-oriented CNN, and then use the combination as input to the RNN. Both types of information provide complementary aspects that help in generating a more complete description of the image. Qualitative and quantitative evaluation results validate the effectiveness of our method.
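
A minimal sketch of this fusion scheme is given below, assuming the object and scene CNN features are simply concatenated, projected, and fed as the first input step of an LSTM decoder; the feature dimensions and concatenation-based fusion are illustrative assumptions, not the exact model.

```python
import torch
import torch.nn as nn

class ObjectSceneCaptioner(nn.Module):
    """Sketch of a captioner conditioned on concatenated object- and scene-CNN features."""
    def __init__(self, obj_dim=4096, scene_dim=4096, embed_dim=256, hidden_dim=512, vocab_size=10000):
        super().__init__()
        self.fuse = nn.Linear(obj_dim + scene_dim, embed_dim)    # project fused visual feature
        self.embed = nn.Embedding(vocab_size, embed_dim)         # word embeddings
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)             # next-word logits

    def forward(self, obj_feat, scene_feat, captions):
        # The fused visual feature becomes the first "word" of the sequence, as in CNN-RNN captioners.
        v = self.fuse(torch.cat([obj_feat, scene_feat], dim=1)).unsqueeze(1)  # (B, 1, E)
        w = self.embed(captions)                                              # (B, T, E)
        h, _ = self.lstm(torch.cat([v, w], dim=1))                            # (B, T+1, H)
        return self.out(h)                                                    # (B, T+1, V)

# Usage with random stand-ins for CNN features and token ids.
model = ObjectSceneCaptioner()
logits = model(torch.randn(2, 4096), torch.randn(2, 4096), torch.randint(0, 10000, (2, 12)))
print(logits.shape)  # torch.Size([2, 13, 10000])
```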


Pattern Recognition | 2016

Category co-occurrence modeling for large scale scene recognition

Xinhang Song; Shuqiang Jiang; Luis Herranz; Yan Kong; Kai Zheng

Scene recognition involves complex reasoning from low-level local features to high-level scene categories. The large semantic gap motivates most methods to model scenes resorting to mid-level representations (e.g. objects, topics). However, this implies an additional mid-level vocabulary and has implications for training and inference. In contrast, the semantic multinomial (SMN) represents patches directly in the scene-level semantic space, which leads to ambiguity when aggregated into a global image representation. Fortunately, this ambiguity appears in the form of scene category co-occurrences, which can be modeled a posteriori with a classifier. In this paper we observe that these patterns are essentially local rather than global, sparse, and consistent across SMNs obtained from multiple visual features. We propose a co-occurrence modeling framework where we exploit all these patterns jointly in a common semantic space, combining both supervised and unsupervised learning. Based on this framework we can integrate multiple features and design embeddings for large scale recognition directly in the scene-level space. Finally, we use the co-occurrence modeling framework to develop new scene representations, which experiments show outperform previous SMN-based representations. Highlights: a general framework for modeling scene category co-occurrences; multiple-feature integration and unsupervised filtering for co-occurrence modeling; semantic representations (filtered SMNs, co-codes, and the KCNF embedding).
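
The sketch below gives a rough sense of the pipeline under simplifying assumptions: patch SMNs are averaged into an image-level SMN, sparsified by keeping the top-k categories (a crude stand-in for the proposed filtering), and a second-stage linear classifier then learns category co-occurrence patterns. The toy data and labels are placeholders.

```python
import numpy as np
from sklearn.svm import LinearSVC

def image_smn(patch_smns, topk=5):
    """Average patch SMNs into an image-level SMN and keep only the top-k categories,
    a crude stand-in for the paper's sparse, locally consistent co-occurrence filtering."""
    smn = patch_smns.mean(axis=0)
    mask = np.zeros_like(smn)
    mask[np.argsort(smn)[-topk:]] = 1.0
    smn = smn * mask
    return smn / smn.sum()

# A second-stage classifier learns category co-occurrence patterns from SMNs (assumed setup).
rng = np.random.default_rng(0)
K, n_images = 20, 200
X = np.stack([image_smn(rng.dirichlet(np.ones(K), size=30)) for _ in range(n_images)])
y = rng.integers(0, K, size=n_images)           # toy labels; real labels come from the dataset
clf = LinearSVC(max_iter=5000).fit(X, y)
print(clf.predict(X[:3]))
```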


Multimedia Systems | 2014

Relative image similarity learning with contextual information for Internet cross-media retrieval

Shuqiang Jiang; Xinhang Song; Qingming Huang

With the explosive growth of image data on the Internet, how to efficiently utilize it in cross-media scenarios has become an urgent problem. Images are usually accompanied by contextual textual information. These two heterogeneous modalities are mutually reinforcing and make Internet content more informative. In most cases, visual information can be regarded as an enhanced content of the textual document. To make image-to-image similarity more consistent with document-to-document similarity, this paper proposes a method to learn image similarities according to the relations of the accompanying textual documents. More specifically, instead of using static quantitative relations, a rank-based learning procedure employing structural SVM is adopted, and the ranking structure is established by comparing the relative relations of the textual information. The learned similarities are in better accordance with human perception. The proposed method can be used not only for image-to-image retrieval but also for cross-modality multimedia retrieval, where a query expansion framework is proposed to obtain more satisfactory results. Extensive experimental evaluations on a large scale Internet dataset validate the performance of the proposed methods.
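
A heavily simplified, pairwise-ranking stand-in for the structural SVM idea is sketched below; it only shows how text-derived orderings can be turned into hinge-loss constraints on a linear similarity model, and the data, feature encoding, and optimizer are illustrative assumptions rather than the paper's formulation.

```python
import numpy as np

def train_ranking_weights(feat_diffs_pos, feat_diffs_neg, lr=0.01, reg=1e-3, epochs=20):
    """
    Toy pairwise-ranking learner. Each row encodes features of an image pair; the
    text-derived ordering says which pair should rank higher. We learn w so that
    w.x_pos > w.x_neg + 1 via a hinge loss with SGD.
    """
    dim = feat_diffs_pos.shape[1]
    w = np.zeros(dim)
    for _ in range(epochs):
        for xp, xn in zip(feat_diffs_pos, feat_diffs_neg):
            margin = w @ xp - w @ xn
            grad = reg * w
            if margin < 1.0:                 # violated ranking constraint
                grad -= (xp - xn)
            w -= lr * grad
    return w

# Usage: pair features where the textual context says the "pos" pairs are more related.
rng = np.random.default_rng(1)
pos = rng.normal(loc=0.5, size=(100, 32))
neg = rng.normal(loc=-0.5, size=(100, 32))
w = train_ranking_weights(pos, neg)
print((pos @ w > neg @ w).mean())            # fraction of correctly ordered pairs
```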


International Conference on Data Mining | 2014

Semantic Features for Food Image Recognition with Geo-Constraints

Xinhang Song; Shuqiang Jiang; Ruihan Xu; Luis Herranz

Food-related photos have become increasingly popular due to social networks, food recommendation, and dietary assessment systems. Reliable annotation is essential in those systems, but user-contributed tags are often non-informative and inconsistent, and unconstrained automatic food recognition still has relatively low accuracy. Most works focus on exploiting only the visual content while ignoring the context. In this paper, we improve food image recognition using two kinds of context. First, instead of representing images with conventional visual features, we represent them in a semantic space (also called a semantic simplex), aiming at modeling more contextual information between categories. Second, we leverage the geographic context of the user and information about restaurants to simplify the classification problem. We propose a food recognition framework based on these two kinds of context, including semantic feature learning and location-adaptive classification. We collected a restaurant-oriented food dataset with food images, dish tags, and restaurant-level information, such as the menu and geographic location. Experiments on this dataset show that exploiting geolocation improves recognition performance by around 30%, and the semantic feature gains 3%-10% over the other visual features.
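
A minimal sketch of the semantic-feature idea follows: an image is mapped onto the probability simplex via an assumed linear classifier and softmax, and the resulting posterior vector itself serves as the representation that a location-adaptive classifier would consume; the classifier weights here are random stand-ins.

```python
import numpy as np

def semantic_feature(visual_feat, W, b):
    """Map a visual feature to the semantic simplex (class posteriors) and use the
    probability vector itself as the image representation; W, b are an assumed
    linear classifier, not the paper's learned model."""
    z = visual_feat @ W + b
    z -= z.max()
    p = np.exp(z)
    return p / p.sum()

# Usage: the semantic feature can then be fed to a location-adaptive classifier.
rng = np.random.default_rng(2)
feat = rng.normal(size=512)
W, b = rng.normal(size=(512, 50)) * 0.01, np.zeros(50)
print(semantic_feature(feat, W, b).sum())   # ~1.0, a point on the simplex
```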


International Joint Conference on Artificial Intelligence | 2017

Combining Models from Multiple Sources for RGB-D Scene Recognition

Xinhang Song; Shuqiang Jiang; Luis Herranz

Depth can complement RGB with useful cues about object volumes and scene layout. However, RGB-D image datasets are still too small for directly training deep convolutional neural networks (CNNs), in contrast to the massive monomodal RGB datasets. Previous works in RGB-D recognition typically combine two separate networks for RGB and depth data, pretrained with a large RGB dataset and then fine-tuned to the respective target RGB and depth datasets. These approaches have several limitations: 1) they only use low-level filters learned from RGB data, and thus cannot properly exploit depth-specific patterns, and 2) RGB and depth features are only combined at high levels but rarely at lower levels. In this paper, we propose a framework that leverages knowledge acquired from large RGB datasets together with depth-specific cues learned from the limited depth data, obtaining more effective multi-source and multi-modal representations. We propose a multi-modal combination method that selects discriminative combinations of layers from the different source models and target modalities, capturing both high-level properties of the task and intrinsic low-level properties of both modalities.
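
The layer-selection idea can be approximated as below: given features from several layers of the RGB and depth source models (hypothetical names such as rgb_conv5), every small combination is scored by cross-validating a linear classifier on the concatenated features. This is only a rough stand-in for the paper's discriminative selection procedure.

```python
import itertools
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

def select_layer_combination(layer_feats, labels, max_layers=2):
    """
    layer_feats: dict mapping hypothetical layer names (e.g. 'rgb_conv5', 'depth_fc7')
    to (N, D) feature arrays from the different source models/modalities. Scores every
    combination of up to max_layers layers by concatenating features and cross-validating
    a linear classifier, then returns the best one.
    """
    best_combo, best_score = None, -1.0
    names = list(layer_feats)
    for k in range(1, max_layers + 1):
        for combo in itertools.combinations(names, k):
            X = np.concatenate([layer_feats[n] for n in combo], axis=1)
            score = cross_val_score(LogisticRegression(max_iter=1000), X, labels, cv=3).mean()
            if score > best_score:
                best_combo, best_score = combo, score
    return best_combo, best_score

# Usage with random stand-ins for layer activations of two modalities.
rng = np.random.default_rng(3)
feats = {n: rng.normal(size=(60, 64)) for n in ("rgb_conv5", "rgb_fc7", "depth_conv5", "depth_fc7")}
labels = rng.integers(0, 3, size=60)
print(select_layer_combination(feats, labels))
```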


ACM Multimedia | 2015

Rich Image Description Based on Regions

Xiaodan Zhang; Xinhang Song; Xiong Lv; Shuqiang Jiang; Qixiang Ye; Jianbin Jiao

Automatically describing the content of an image is a fundamental problem in artificial intelligence that connects computer vision and natural language processing. In contrast to previous image description methods that focus on describing the whole image, this paper presents a method for generating rich image descriptions from image regions. First, we detect regions with the R-CNN (regions with convolutional neural network features) framework. We then utilize an RNN (recurrent neural network) to generate sentences for the image regions. Finally, we propose an optimization method to select one suitable region. The proposed model generates sentence descriptions for several regions of an image, which have sufficient representative power for the whole image and contain more detailed information. Compared to general image-level description, generating more specific and accurate sentences for different regions can better satisfy the personalized requirements of different users. Experimental evaluations validate the effectiveness of the proposed method.
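
The final region-selection step can be sketched as a simple scoring rule, shown below under the assumption that representativeness is approximated by relative region area and fluency by the caption's average log-probability; the actual optimization used in the paper may differ.

```python
import numpy as np

def select_region(region_boxes, sentence_logprobs, image_size, alpha=0.5):
    """
    Pick one region whose description best represents the image: a weighted sum of
    relative region area (representativeness) and the caption's average log-probability
    (fluency/confidence). The scoring rule is an illustrative assumption.
    region_boxes: (N, 4) array of (x1, y1, x2, y2); sentence_logprobs: (N,) per-region scores.
    """
    W, H = image_size
    areas = (region_boxes[:, 2] - region_boxes[:, 0]) * (region_boxes[:, 3] - region_boxes[:, 1])
    coverage = areas / float(W * H)
    score = alpha * coverage + (1 - alpha) * sentence_logprobs
    return int(np.argmax(score))

# Usage with two toy regions and their caption confidences.
boxes = np.array([[0, 0, 320, 240], [100, 80, 200, 160]], dtype=float)
print(select_region(boxes, np.array([-1.2, -0.6]), image_size=(640, 480)))
```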


ACM Multimedia | 2017

RGB-D Scene Recognition with Object-to-Object Relation

Xinhang Song; Chengpeng Chen; Shuqiang Jiang

A scene is usually an abstract concept consisting of several less abstract entities such as objects or themes. It is very difficult to infer scenes from visual features due to the semantic gap between abstract scene categories and low-level visual features. Some alternative works recognize scenes with a two-step framework by representing images with intermediate representations of objects or themes. However, object co-occurrences across scenes may lead to ambiguity in scene recognition. In this paper, we propose a framework that represents images with an intermediate (object) representation incorporating spatial layout, i.e., an object-to-object relation (OOR) representation. In order to better capture the spatial information, the proposed OOR is adapted to RGB-D data. In the proposed framework, we first apply object detection on RGB and depth images separately. The detected results of both modalities are then combined through an RGB-D proposal fusion process. Based on the detected results, we extract the semantic OOR feature and regional convolutional neural network (CNN) features from the detected bounding boxes. Finally, the different features are concatenated and fed to the classifier for scene recognition. Experimental results on the SUN RGB-D and NYUD2 datasets demonstrate the effectiveness of the proposed method.
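
A toy version of an object-to-object relation descriptor is sketched below, assuming detections in normalized image coordinates: for each ordered pair of detected objects, the relative offset of the box centers is quantized into spatial bins and accumulated in a histogram indexed by category pair. The exact OOR encoding in the paper may differ.

```python
import numpy as np

def oor_feature(detections, n_classes, n_bins=4):
    """
    Build a toy object-to-object relation (OOR) descriptor from detections
    [(class_id, x1, y1, x2, y2), ...] in normalized [0, 1] coordinates: for every ordered
    object pair, quantize the relative offset of the box centers into n_bins x n_bins
    spatial bins and accumulate a histogram indexed by (class_i, class_j, bin).
    """
    hist = np.zeros((n_classes, n_classes, n_bins * n_bins))
    centers = [((x1 + x2) / 2.0, (y1 + y2) / 2.0, c) for c, x1, y1, x2, y2 in detections]
    for (xi, yi, ci) in centers:
        for (xj, yj, cj) in centers:
            if (xi, yi, ci) == (xj, yj, cj):
                continue
            dx = np.clip((xj - xi + 1) / 2.0, 0, 0.999)   # map offset from [-1, 1] to [0, 1)
            dy = np.clip((yj - yi + 1) / 2.0, 0, 0.999)
            b = int(dy * n_bins) * n_bins + int(dx * n_bins)
            hist[ci, cj, b] += 1.0
    return hist.ravel() / max(hist.sum(), 1.0)

# Usage with two detections (class, x1, y1, x2, y2) in normalized coordinates.
dets = [(0, 0.1, 0.5, 0.3, 0.9), (2, 0.6, 0.4, 0.9, 0.8)]
print(oor_feature(dets, n_classes=5).shape)   # (5 * 5 * 16,) = (400,)
```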

Collaboration


Dive into Xinhang Song's collaborations.

Top Co-Authors

Shuqiang Jiang, Chinese Academy of Sciences
Luis Herranz, Chinese Academy of Sciences
Ruihan Xu, Chinese Academy of Sciences
Xiangyang Li, Capital Normal University
Chengpeng Chen, Chinese Academy of Sciences
Jianbin Jiao, Chinese Academy of Sciences
Qingming Huang, Chinese Academy of Sciences
Qixiang Ye, Chinese Academy of Sciences
Shuang Wang, Chinese Academy of Sciences
Xiaodan Zhang, Chinese Academy of Sciences