Xian-Sheng Hua
Microsoft Research Asia (China)
Publications
Featured research published by Xian-Sheng Hua.
computer vision and pattern recognition | 2008
Zheng-Jun Zha; Xian-Sheng Hua; Tao Mei; Jingdong Wang; Guo-Jun Qi; Zengfu Wang
In the real world, an image is usually associated with multiple labels, each characterized by a different region of the image. Image classification is thus naturally posed as both a multi-label and a multi-instance learning problem. Unlike existing research, which has treated these two problems separately, we propose an integrated multi-label multi-instance learning (MLMIL) approach based on hidden conditional random fields (HCRFs), which simultaneously captures both the connections between semantic labels and regions and the correlations among the labels in a single formulation. We apply this MLMIL framework to image classification and report superior performance compared to key existing approaches on the MSR Cambridge (MSRC) and Corel data sets.
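The integrated formulation can be pictured with a small sketch: treat an image as a bag of region instances and score a candidate label vector with region-label potentials plus label-correlation potentials. The Python below is an illustrative simplification of that idea, not the paper's HCRF inference; all feature shapes and weights are assumed.

```python
# Minimal sketch (not the authors' code) of the multi-label multi-instance
# idea: a candidate label vector is scored by (a) how well each active label
# is supported by its best region and (b) pairwise label-correlation
# potentials, echoing the two couplings the HCRF formulation captures.
import numpy as np

def mlmil_score(regions, labels, W, C):
    """regions: (n_regions, d) features; labels: (n_labels,) 0/1 vector;
    W: (n_labels, d) per-label region weights; C: (n_labels, n_labels)
    label-correlation potentials."""
    # Unary term: each active label is explained by its most compatible
    # region (the hidden assignment, max-marginalized for simplicity).
    unary = sum(np.max(regions @ W[j]) for j in np.flatnonzero(labels))
    # Pairwise term: reward label pairs that tend to co-occur.
    active = np.flatnonzero(labels)
    pairwise = sum(C[i, j] for i in active for j in active if i < j)
    return unary + pairwise

# Toy usage: pick the better of two candidate labelings for one image.
rng = np.random.default_rng(0)
regions = rng.normal(size=(5, 8))   # 5 regions, 8-dim features
W = rng.normal(size=(3, 8))         # 3 semantic labels
C = np.array([[0, 1.0, -0.5], [1.0, 0, 0.2], [-0.5, 0.2, 0]])
for cand in ([1, 1, 0], [1, 0, 1]):
    print(cand, mlmil_score(regions, np.array(cand), W, C))
```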
computer vision and pattern recognition | 2008
Guo-Jun Qi; Xian-Sheng Hua; Yong Rui; Jinhui Tang; Hong-Jiang Zhang
In this paper, we propose a two-dimensional active learning scheme and demonstrate its application to image classification. Traditional active learning methods select samples only along the sample dimension. While this is the right strategy for binary classification, it is sub-optimal for multi-label classification. In multi-label classification, we argue that for each selected sample only the most informative labels need to be annotated, while the remaining labels can be inferred by exploiting the correlations among them; because of these inherent label correlations, different labels contribute differently to minimizing the classification error. To this end, we propose to select sample-label pairs, rather than whole samples, so as to minimize a multi-label Bayesian classification error bound. Since this strategy considers the label dimension in addition to the sample dimension, we call it Two-Dimensional Active Learning (2DAL). We also show that the traditional active learning formulation is a special case of 2DAL with a single label. Extensive experiments on two real-world applications show that 2DAL significantly outperforms the best existing approaches, which do not take label correlations into account.
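The pair-selection idea can be illustrated compactly. The sketch below scores every (sample, label) pair by classifier uncertainty weighted by the label's correlation with other labels; this scoring rule is a stand-in for the paper's Bayesian error-bound criterion, and all inputs are hypothetical.

```python
# Hedged sketch of two-dimensional active learning: rank (sample, label)
# pairs instead of whole samples. A pair's value combines the classifier's
# uncertainty on that label with how strongly the label correlates with the
# others (whose values could then be inferred).
import numpy as np

def select_pair(P, R, labeled_mask):
    """P: (n_samples, n_labels) predicted positive probabilities;
    R: (n_labels, n_labels) label correlations;
    labeled_mask: boolean (n_samples, n_labels), True if already annotated."""
    uncertainty = 1.0 - 2.0 * np.abs(P - 0.5)   # peaks at p = 0.5
    influence = np.abs(R).sum(axis=1)           # how much a label tells us
    score = uncertainty * influence[None, :]    # value of annotating the pair
    score[labeled_mask] = -np.inf               # never re-query a pair
    return np.unravel_index(np.argmax(score), score.shape)

P = np.array([[0.52, 0.95], [0.10, 0.49]])
R = np.array([[1.0, 0.8], [0.8, 1.0]])
mask = np.zeros_like(P, dtype=bool)
print(select_pair(P, R, mask))  # -> (sample_idx, label_idx) to annotate next
```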
Pattern Recognition | 2009
Yong Wang; Tao Mei; Shaogang Gong; Xian-Sheng Hua
This paper presents a novel approach to automatic image annotation that combines global, regional, and contextual features through an extended cross-media relevance model. Unlike typical annotation methods, which use either global or regional features exclusively and neglect the textual context among annotated words, the proposed approach incorporates all three kinds of information, each helpful for describing image semantics, and annotates images by estimating their joint probability. Specifically, we describe the global features as a distribution vector over visual topics and model the textual context as a multinomial distribution. The global features capture the distribution of visual topics over an image, while the textual context relaxes the assumption of mutual independence among annotated words that most existing methods adopt. Both the global features and the textual context are learned from the training data by probabilistic latent semantic analysis. Experiments on 5,000 Corel images show that combining these three kinds of information is beneficial for image annotation.
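A minimal sketch of how the three sources might be combined, assuming a relevance-model formulation where a word's score sums, over training images, the product of a global topic match, a regional match, and a textual-context probability. The component models below are toy stand-ins, not the paper's learned pLSA models.

```python
# Illustrative combination of global, regional, and contextual evidence.
import numpy as np

def region_match(q_regions, t_regions):
    # Crude regional likelihood: average best-match similarity between
    # query and training region features (kernel-density stand-in).
    sims = np.exp(-np.linalg.norm(q_regions[:, None] - t_regions[None], axis=2))
    return sims.max(axis=1).mean()

def next_word(q_topics, q_regions, train, context_prob, chosen):
    scores = {}
    for J in train:
        g = np.exp(-np.linalg.norm(q_topics - J["topics"]))  # global topics
        r = region_match(q_regions, J["regions"])            # regional match
        for w in J["words"]:
            scores[w] = scores.get(w, 0.0) + g * r * context_prob(w, chosen)
    return max(scores, key=scores.get)

# Toy usage: a context model favoring words that co-occur with chosen ones.
cooc = {("sky", "sun"): 0.9, ("sky", "cat"): 0.1}
ctx = lambda w, chosen: (np.prod([cooc.get((c, w), 0.5) for c in chosen])
                         if chosen else 1.0)
train = [
    {"topics": np.array([0.8, 0.2]), "regions": np.random.rand(3, 4),
     "words": ["sky", "sun"]},
    {"topics": np.array([0.1, 0.9]), "regions": np.random.rand(3, 4),
     "words": ["cat"]},
]
print(next_word(np.array([0.7, 0.3]), np.random.rand(2, 4), train, ctx, ["sky"]))
```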
computer vision and pattern recognition | 2008
Tao Mei; Yong Wang; Xian-Sheng Hua; Shaogang Gong; Shipeng Li
Conventional approaches to automatic image annotation usually suffer from two problems: (1) they cannot guarantee good semantic coherence of the annotated words for each image, as they treat each word independently without considering the inherent semantic coherence among words; (2) they rely heavily on visual similarity when judging semantic similarity. To address these issues, we propose a novel approach to image annotation that learns a semantic distance by capturing prior annotation knowledge and propagates the annotation of an image as a whole entity. Specifically, a semantic distance function (SDF) is learned for each semantic cluster to measure semantic similarity based on relative comparison relations among prior annotations. To annotate a new image, the training images in each cluster are ranked by their SDF values with respect to this image, and their annotations are then propagated to the image as a whole entity to ensure semantic coherence. We evaluate the SDF-based approach on Corel images against a Support Vector Machine (SVM)-based approach. The experiments show that the SDF-based approach achieves better semantic coherence, especially when each training image is associated with multiple words.
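The propagation step can be sketched as follows, assuming an SDF has already been learned for the query's cluster (a weighted Euclidean distance stands in for it here): rank the cluster's images by semantic distance and copy the closest image's full annotation.

```python
# Hedged sketch of whole-entity annotation propagation; the learned SDF is
# replaced by a stand-in weighted Euclidean distance.
import numpy as np

def propagate_annotation(query_feat, cluster_images, cluster_annotations, w):
    sdf = lambda a, b: np.sqrt(np.sum(w * (a - b) ** 2))  # stand-in SDF
    dists = [sdf(query_feat, img) for img in cluster_images]
    ranked = np.argsort(dists)
    # Propagate the full word set of the closest image, not word-by-word,
    # so the annotation stays semantically coherent.
    return cluster_annotations[ranked[0]]

w = np.array([1.0, 0.2, 0.5])  # assumed learned feature weights
imgs = [np.array([0.1, 0.5, 0.3]), np.array([0.9, 0.2, 0.8])]
anns = [["beach", "sea", "sky"], ["tiger", "grass"]]
print(propagate_annotation(np.array([0.2, 0.4, 0.3]), imgs, anns, w))
```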
computer vision and pattern recognition | 2006
Guo-Jun Qi; Yan Song; Xian-Sheng Hua; Hong-Jiang Zhang; Li-Rong Dai
Supervised and semi-supervised learning are frequently applied to annotate videos by mapping low-level features to high-level semantic concepts. Although they work well for certain concepts, performance is still far from satisfactory because of the large gap between features and semantics. The main constraint of these methods is that the information in a limited number of labeled training samples can hardly represent the distributions of the semantic concepts. In this paper, we propose a novel semi-automatic video annotation framework, active learning with clustering tuning, to tackle these disadvantages. In this framework, an initial training set is first constructed by clustering the entire video dataset. An SVM-based active learning scheme is then applied, which aims to maximize the margin of the SVM classifier by selectively labeling a small set of samples manually. Moreover, in each round of active learning, we tune and refine the clustering results based on the current predictions, which helps select the most informative samples and further improves the final annotation accuracy in the post-processing step. Experimental results show that the proposed scheme outperforms typical active learning algorithms in both annotation accuracy and stability.
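The margin-maximizing selection step can be sketched with a standard SVM (scikit-learn here as a stand-in for the paper's classifier): each round, query the unlabeled samples nearest the decision boundary. The cluster-tuning step is summarized in comments only.

```python
# Hedged sketch of margin-based sample selection for active learning.
import numpy as np
from sklearn.svm import SVC

def active_round(clf, X_unlabeled, k=5):
    # Samples with the smallest |decision value| lie nearest the SVM
    # hyperplane and are the most informative to label next.
    margins = np.abs(clf.decision_function(X_unlabeled))
    return np.argsort(margins)[:k]

rng = np.random.default_rng(1)
X_l = rng.normal(size=(20, 4)); y_l = (X_l[:, 0] > 0).astype(int)
X_u = rng.normal(size=(100, 4))
clf = SVC(kernel="rbf").fit(X_l, y_l)
query_idx = active_round(clf, X_u)
# After the oracle labels X_u[query_idx], retrain the SVM and (per the
# paper's framework) re-tune the clustering using the new predictions
# before the next selection round.
print(query_idx)
```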
international conference on multimedia and expo | 2008
Xinmei Tian; Linjun Yang; Jingdong Wang; Xiuqing Wu; Xian-Sheng Hua
One crucial problem in transductive video annotation is how to estimate a sample's label from its neighboring samples. Existing methods, such as the graph-based Gaussian random field, consider only pair-wise similarity and propagate labels based on it. In this paper, we propose a new method from the perspective of local learning, which formulates the prediction of a label from the neighbors as a learning problem. Our contributions are two-fold: (1) we propose a new transductive video annotation method based on a local kernel classifier; (2) we propose a local learnability measure that captures how well a sample can be learned from its neighbors and incorporate it into the optimization objective. Experiments on the TRECVID 2005 dataset show that the proposed method is effective and that the local learning perspective is promising for video annotation.
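The local-learning idea can be sketched by fitting a small kernel model on each sample's labeled neighbors and predicting the sample from it; kernel ridge regression below is an assumed stand-in for the paper's local kernel classifier.

```python
# Hedged sketch: predict a sample's label by learning from its neighbors
# rather than by pair-wise similarity propagation alone.
import numpy as np

def rbf(a, b, gamma=1.0):
    return np.exp(-gamma * np.linalg.norm(a - b) ** 2)

def local_predict(x, nbr_X, nbr_y, lam=0.1):
    # Fit kernel ridge regression on the neighborhood, then evaluate at x.
    K = np.array([[rbf(a, b) for b in nbr_X] for a in nbr_X])
    alpha = np.linalg.solve(K + lam * np.eye(len(nbr_X)), nbr_y)
    k_x = np.array([rbf(x, b) for b in nbr_X])
    return float(k_x @ alpha)

nbr_X = np.array([[0.0, 0.0], [0.1, 0.0], [1.0, 1.0]])
nbr_y = np.array([1.0, 1.0, 0.0])
print(local_predict(np.array([0.05, 0.02]), nbr_X, nbr_y))  # ~1: learned locally
```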
international conference on multimedia and expo | 2008
Jingdong Wang; Xinmei Tian; Linjun Yang; Zheng-Jun Zha; Xian-Sheng Hua
In this paper, we propose an optimized video scene segmentation approach that considers both content coherence and temporally contextual dissimilarity. First, a chain structure is constructed by connecting temporally adjacent shots to represent a video. The chain is then partitioned so that the content within each chain segment is sufficiently coherent and the contextual similarity between temporally adjacent chain segments is sufficiently small. This task is formulated as optimizing a ratio function of content coherence to contextual similarity. Finally, we present an effective and efficient hierarchical chain partitioning approach to find the optimal scene segmentation. Experimental results on a set of home videos and feature films demonstrate the superiority of the proposed approach over several key existing approaches.
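The ratio objective can be made concrete with a small sketch: within-segment coherence divided by the similarity of adjacent segments. The cosine-similarity choice and toy shot features are assumptions; the paper's exact potentials may differ.

```python
# Hedged sketch of the coherence-to-context ratio for a candidate partition.
import numpy as np

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def ratio_score(shots, boundaries):
    """shots: (n, d) shot features; boundaries: sorted cut indices, e.g. [4]."""
    segs = np.split(shots, boundaries)
    # Coherence: mean pairwise similarity of shots inside each segment.
    coh = np.mean([np.mean([cos(a, b) for i, a in enumerate(s)
                            for b in s[i + 1:]]) for s in segs if len(s) > 1])
    # Context: similarity of adjacent segments' means (lower = better cut).
    ctx = np.mean([cos(segs[i].mean(0), segs[i + 1].mean(0))
                   for i in range(len(segs) - 1)])
    return coh / (ctx + 1e-12)

# Two synthetic "scenes" concentrated in different feature dimensions.
a = np.hstack([np.random.rand(4, 4) + 1, np.random.rand(4, 4) * 0.01])
b = np.hstack([np.random.rand(4, 4) * 0.01, np.random.rand(4, 4) + 1])
shots = np.vstack([a, b])
print(ratio_score(shots, [4]), ratio_score(shots, [2]))  # true cut scores higher
```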
visual communications and image processing | 2005
Tao Mei; Xian-Sheng Hua; He-Qin Zhou; Shipeng Li
In this paper, we present a learning-based approach to mining the capture intention of camcorder users, offering a novel viewpoint on home video content analysis. In contrast to existing approaches designed from the viewer's standpoint, this approach models capture intention from the camcorder user's point of view by investigating a set of effective intention-oriented features. With this approach, not only is the capture intention effectively mined, but a set of intention probability curves is also produced for efficient browsing of home video content. Experimental evaluations indicate that the intention-based approach is an effective complement to existing home video content analysis schemes.
international conference on multimedia and expo | 2009
Hao Xu; Jingdong Wang; Xian-Sheng Hua; Shipeng Li
In this paper, we address the problem of simultaneously generating visual and textual summaries for tagged image collections. The summaries consist of representative images and tags of the collection, selected through a proposed cross-media voting scheme. In this scheme, the likelihood of an image being a representative is voted on not only by other images but also by the tags, according to intra-media and cross-media affinities; the likelihood of a tag being a representative is obtained in the same manner simultaneously. We demonstrate that the proposed scheme produces more informative textual and visual summaries than summarizing images and tags separately.
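The voting scheme resembles mutual reinforcement across media, which the following sketch illustrates: image and tag scores vote for each other through cross-media affinities and within themselves through intra-media affinities, iterated to a fixed point. This HITS-like reading is a simplification of the paper's scheme, and the matrices are toy assumptions.

```python
# Hedged sketch of cross-media voting as mutual reinforcement.
import numpy as np

def cross_media_vote(A_ii, A_tt, A_it, iters=50):
    """A_ii: image-image affinity; A_tt: tag-tag affinity;
    A_it: (n_images, n_tags) image-tag affinity (e.g., tagging relations)."""
    img = np.ones(A_ii.shape[0]); tag = np.ones(A_tt.shape[0])
    for _ in range(iters):
        img = A_ii @ img + A_it @ tag     # images voted by images and tags
        tag = A_tt @ tag + A_it.T @ img   # tags voted by tags and images
        img /= np.linalg.norm(img); tag /= np.linalg.norm(tag)
    return img, tag                       # top entries = joint summary

A_it = np.array([[1, 1, 0], [1, 0, 0], [0, 1, 1.0]])
img, tag = cross_media_vote(np.eye(3) * 0.1, np.eye(3) * 0.1, A_it)
print(np.argsort(-img), np.argsort(-tag))  # representative images and tags
```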
international conference on multimedia and expo | 2005
Xian-Sheng Hua; Shipeng Li; Hong-Jiang Zhang