Shikui Wei
Beijing Jiaotong University
Publications
Featured research published by Shikui Wei.
IEEE Transactions on Knowledge and Data Engineering | 2010
Shikui Wei; Yao Zhao; Zhenfeng Zhu; Nan Liu
Analysis of click-through data from a very large search engine log shows that users are usually interested in the top-ranked portion of returned search results. It is therefore crucial for search engines to achieve high accuracy on the top-ranked documents. While many methods exist for boosting video search performance, they either pay little attention to this factor or encounter difficulties in practical applications. In this paper, we present a flexible and effective reranking method, called CR-Reranking, to improve retrieval effectiveness. To offer high accuracy on the top-ranked results, CR-Reranking employs a cross-reference (CR) strategy to fuse multimodal cues. Specifically, multimodal features are first utilized separately to rerank the initial returned results at the cluster level, and then all the ranked clusters from different modalities are used cooperatively to infer the shots with high relevance. Experimental results show that the search quality, especially on the top-ranked results, is improved significantly.
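A minimal sketch of the cross-reference idea, not the authors' implementation: cluster the initially ranked shots per modality, rank clusters by the initial relevance of their members, and let shots accumulate the relevance of their clusters across modalities. The feature layout, cluster count, and reciprocal-rank prior below are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cr_rerank(features_by_modality, n_clusters=5):
    """features_by_modality: list of (n_shots, dim) arrays, one per modality,
    rows ordered by the initial ranking (row 0 = top result)."""
    n_shots = features_by_modality[0].shape[0]
    # Prior relevance from the initial ranking: top-ranked shots weigh more.
    prior = 1.0 / (np.arange(n_shots) + 1.0)
    scores = np.zeros(n_shots)
    for feats in features_by_modality:
        labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(feats)
        # Rank clusters by the mean prior relevance of their members.
        cluster_rel = np.array([prior[labels == c].mean() for c in range(n_clusters)])
        # Cross-reference: each shot inherits its cluster's relevance.
        scores += cluster_rel[labels]
    return np.argsort(-scores)  # shots supported by several modalities rise
```

Shots that land in top-ranked clusters under several modalities accumulate score, which is the cross-reference intuition.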
IEEE Transactions on Systems, Man, and Cybernetics | 2017
Yunchao Wei; Yao Zhao; Canyi Lu; Shikui Wei; Luoqi Liu; Zhenfeng Zhu; Shuicheng Yan
Recently, convolutional neural network (CNN) visual features have demonstrated their powerful ability as a universal representation for various recognition tasks. In this paper, cross-modal retrieval with CNN visual features is implemented with several classic methods. Specifically, off-the-shelf CNN visual features are extracted from the CNN model, which is pretrained on ImageNet with more than one million images from 1000 object categories, as a generic image representation to tackle cross-modal retrieval. To further enhance the representational ability of CNN visual features, based on the pretrained CNN model on ImageNet, a fine-tuning step is performed by using the open source Caffe CNN library for each target data set. Besides, we propose a deep semantic matching method to address the cross-modal retrieval problem with respect to samples which are annotated with one or multiple labels. Extensive experiments on five popular publicly available data sets well demonstrate the superiority of CNN visual features for cross-modal retrieval.
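For intuition, here is a sketch of extracting off-the-shelf CNN features from an ImageNet-pretrained model; the paper uses the Caffe library, and torchvision is used here purely as a stand-in.

```python
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

# ImageNet-pretrained backbone; drop the classifier to keep generic features.
model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V1)
model.fc = torch.nn.Identity()
model.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def cnn_feature(path):
    img = preprocess(Image.open(path).convert("RGB")).unsqueeze(0)
    with torch.no_grad():
        return model(img).squeeze(0)  # generic image representation vector
```

Fine-tuning on each target data set, as the paper describes, would simply continue training this backbone on the target labels before extracting features.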
IEEE Transactions on Circuits and Systems for Video Technology | 2011
Shikui Wei; Yao Zhao; Ce Zhu; Changsheng Xu; Zhenfeng Zhu
Content-based video copy detection is very important for copyright protection given the growing popularity of video-sharing websites. The task involves determining not only whether a copy occurs in a query video stream but also where the copy is located and which reference video it originates from. While much work has addressed the problem with good performance, less effort has been devoted to copy detection over a continuous query stream, which requires precise temporal localization and must handle complex video transformations such as frame insertion and video editing. We attack the problem with a frame-fusion-based copy detection approach, which converts video copy detection into frame similarity search followed by frame fusion under a temporal consistency assumption. Our work focuses mainly on the frame fusion stage due to its critical role in copy detection performance. The proposed frame fusion scheme is based on a Viterbi-like algorithm comprising an online backtracking strategy with three relaxed constraints. Experimental results show that the proposed approach achieves high localization accuracy in both the query stream and the reference database even when a query video stream undergoes complex transformations, while remaining comparable with state-of-the-art copy detection methods.
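A minimal Viterbi-style frame-fusion sketch under a temporal-consistency assumption; the candidate lists, gap tolerance, and scoring are illustrative simplifications, not the paper's exact relaxed constraints.

```python
def fuse_frames(candidates, max_gap=3):
    """candidates: list over query frames; each entry is a list of
    (ref_frame_id, similarity) pairs from frame similarity search.
    Returns the best (score, reference-frame path)."""
    best = {}  # ref_frame_id -> (accumulated score, path so far)
    for cands in candidates:
        new_best = {}
        for ref, sim in cands:
            # Extend any path ending within max_gap reference frames
            # (relaxed temporal consistency); otherwise start a new path.
            prev = [(s, p) for r, (s, p) in best.items() if 0 < ref - r <= max_gap]
            score, path = max(prev, default=(0.0, []))
            new_best[ref] = (score + sim, path + [ref])
        best = new_best or best  # tolerate query frames with no candidates
    return max(best.values(), default=(0.0, []))
```

The returned path localizes the copy in the reference database, while the indices of the query frames that contributed localize it in the query stream.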
IEEE Transactions on Systems, Man, and Cybernetics | 2013
Shikui Wei; Dong Xu; Xuelong Li; Yao Zhao
The bag-of-words (BoW) model is known to be effective for large-scale image search and indexing. Recent work shows that its performance can be further improved by embedding methods. While many variants of the BoW model and embedding methods have been developed, less effort has been made to uncover their underlying working mechanisms. In this paper, we systematically investigate how image search performance varies with several factors of the BoW model, and study how to employ embedding to further improve search performance. We then summarize several observations from experiments on descriptor matching. To validate these observations in real image search, we propose an effective and efficient image search scheme in which the BoW model and embedding method are jointly optimized for both effectiveness and efficiency by following these observations. Our comprehensive experiments demonstrate that it is beneficial to build an image search algorithm on these observations, and the proposed scheme outperforms state-of-the-art methods in both effectiveness and efficiency.
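A compact BoW baseline for intuition: train a visual vocabulary, quantize local descriptors to visual words, and index images in an inverted file. The vocabulary size and data layout are assumptions; an embedding method such as Hamming embedding would additionally attach a binary signature to each posting to refine matches within a word.

```python
import numpy as np
from collections import defaultdict
from sklearn.cluster import MiniBatchKMeans

def train_vocab(descriptors, k=1000):
    """descriptors: (n, dim) array of local descriptors pooled from training images."""
    return MiniBatchKMeans(n_clusters=k, n_init=3).fit(descriptors)

def index_images(image_descs, vocab):
    """image_descs: dict image_id -> (n_i, dim) descriptor array."""
    inverted = defaultdict(list)  # visual word -> [(image_id, term count)]
    for img_id, descs in image_descs.items():
        words, counts = np.unique(vocab.predict(descs), return_counts=True)
        for w, c in zip(words, counts):
            inverted[w].append((img_id, c))
    return inverted
```

Querying quantizes the query descriptors the same way and scores images by accumulating (typically TF-IDF-weighted) matches from the relevant posting lists.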
ACM Transactions on Intelligent Systems and Technology | 2016
Yunchao Wei; Yao Zhao; Zhenfeng Zhu; Shikui Wei; Yanhui Xiao; Jiashi Feng; Shuicheng Yan
In this article, we investigate cross-media retrieval between images and text, that is, using an image to search for text (I2T) and using text to search for images (T2I). Existing cross-media retrieval methods usually learn a single pair of projections, by which the original features of images and text can be projected into a common latent space to measure content similarity. However, using the same projections for the two different retrieval tasks (I2T and T2I) may lead to a tradeoff between their respective performances, rather than their best performances. Different from previous work, we propose a modality-dependent cross-media retrieval (MDCR) model, in which two pairs of projections are learned, one for each cross-media retrieval task. Specifically, by jointly optimizing the correlation between images and text and the linear regression from one modal space (image or text) to the semantic space, two pairs of mappings are learned to project images and text from their original feature spaces into two common latent subspaces (one for I2T and the other for T2I). Extensive experiments show the superiority of the proposed MDCR compared with other methods. In particular, based on the 4,096-dimensional convolutional neural network (CNN) visual feature and the 100-dimensional Latent Dirichlet Allocation (LDA) textual feature, the proposed method achieves an mAP score of 41.5%, a new state-of-the-art result on the Wikipedia dataset.
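A simplified sketch of the regression term only: project each modality into the semantic (label) space by ridge regression. MDCR jointly adds a correlation term and learns a separate pair of projections per task (I2T vs. T2I) by weighting the terms differently; the dimensions and the closed-form solver below are illustrative.

```python
import numpy as np

def learn_projection(X, S, lam=1.0):
    """X: (n, d) features of one modality; S: (n, c) semantic label matrix.
    Returns P such that X @ P approximates S (the linear-regression term)."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ S)

# One pair of mappings: images and text meet in the semantic space.
# P_img = learn_projection(X_img, S); P_txt = learn_projection(X_txt, S)
# I2T then ranks text by cosine similarity of X_img @ P_img against X_txt @ P_txt.
```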
IEEE Transactions on Systems, Man, and Cybernetics | 2014
Yanhui Xiao; Zhenfeng Zhu; Yao Zhao; Yunchao Wei; Shikui Wei; Xuelong Li
Nonnegative matrix factorization (NMF) is a useful technique for exploring a parts-based representation by decomposing the original data matrix into a few parts-based basis vectors and encodings under nonnegative constraints. It has been widely used in image processing and pattern recognition tasks due to its psychological and physiological interpretation of natural data, whose representation in the human brain may be parts-based. However, the nonnegative constraint alone is generally not sufficient to produce representations that are robust to local transformations. To overcome this problem, we propose a topographic NMF (TNMF), which imposes a topographic constraint on the encoding factor as a regularizer during matrix factorization. In essence, the topographic constraint is a two-layered network, with a square nonlinearity in the first layer and a square-root nonlinearity in the second. By pooling together structure-correlated features belonging to the same hidden topic, TNMF forces the encodings to be organized in a topographic map, thereby promoting feature invariance. Experiments on three standard datasets validate the effectiveness of our method in comparison with state-of-the-art approaches.
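A sketch of the topographic penalty on the encoding matrix H: square the encodings, pool the squares over fixed topic neighborhoods, then take the square root (the two-layer network described above). The pooling matrix and the returned gradient are illustrative; the paper derives its own update rules for the full factorization.

```python
import numpy as np

def topographic_penalty(H, N, eps=1e-8):
    """H: (k, n) nonnegative encodings; N: (t, k) 0/1 pooling matrix whose
    rows mark which encoding dimensions share a topographic neighborhood."""
    pooled = np.sqrt(N @ (H ** 2) + eps)   # (t, n): sqrt of pooled squares
    grad = H * (N.T @ (1.0 / pooled))      # gradient of the penalty w.r.t. H
    return pooled.sum(), grad
```

Adding lambda times this penalty (and its gradient) to a projected-gradient NMF objective on ||X - WH||^2 yields encodings whose correlated dimensions cluster in the topographic map.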
IEEE Transactions on Knowledge and Data Engineering | 2014
Lei Zhang; Yao Zhao; Zhenfeng Zhu; Shikui Wei; Xindong Wu
In some real-world applications, such as information retrieval and data classification, we are often confronted with the situation that the same semantic concept can be expressed using different views carrying similar information. Thus, obtaining Semantically Consistent Patterns (SCP) for cross-view data, which embed the complementary information from different views, is of great importance for those applications. However, the heterogeneity among cross-view representations poses a significant challenge for mining the SCP. In this paper, we propose a general framework to discover the SCP for cross-view data. Specifically, aiming at building a feature-isomorphic space among different views, a novel Isomorphic Relevant Redundant Transformation (IRRT) is first proposed. The IRRT linearly maps multiple heterogeneous low-level feature spaces to a high-dimensional redundant feature-isomorphic one, which we call the mid-level space, so that much more complementary information from different views can be captured. Furthermore, to mine the semantic consistency among the isomorphic representations in the mid-level space, we propose a new Correlation-based Joint Feature Learning (CJFL) model to extract a unique high-level semantic subspace shared across the feature-isomorphic data. Consequently, the SCP for cross-view data can be obtained. Comprehensive experiments on three data sets demonstrate the advantages of our framework in classification and retrieval.
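A stand-in sketch of the two-stage idea: lift each view's low-level features into a common higher-dimensional mid-level space with linear maps, then extract a shared subspace via CCA. The random lifting maps and CCA here replace the learned IRRT and CJFL models and are purely assumptions for illustration.

```python
import numpy as np
from sklearn.cross_decomposition import CCA

def shared_subspace(X1, X2, mid_dim=512, out_dim=10, seed=0):
    """X1: (n, d1), X2: (n, d2): paired features from two heterogeneous views."""
    rng = np.random.default_rng(seed)
    M1 = rng.standard_normal((X1.shape[1], mid_dim))  # view-1 lifting map
    M2 = rng.standard_normal((X2.shape[1], mid_dim))  # view-2 lifting map
    Z1, Z2 = X1 @ M1, X2 @ M2                         # feature-isomorphic space
    # Correlated projections of the isomorphic representations stand in
    # for the semantically consistent patterns.
    cca = CCA(n_components=out_dim).fit(Z1, Z2)
    return cca.transform(Z1, Z2)
```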
IEEE Transactions on Neural Networks | 2015
Yanhui Xiao; Zhenfeng Zhu; Yao Zhao; Yunchao Wei; Shikui Wei
Independent component analysis with soft reconstruction cost (RICA) has recently been proposed to linearly learn sparse representations with an overcomplete basis, and it exhibits promising performance even on unwhitened data. However, linear RICA may not be effective for much real-world data, because nonlinearly separable structure pervades the original data space. Moreover, RICA is essentially unsupervised and does not exploit class information. Motivated by the success of the kernel trick, which maps a nonlinearly separable data structure into a linearly separable one in a high-dimensional feature space, we propose a kernel RICA (kRICA) model to capture sparse representations nonlinearly in feature space. Furthermore, we extend the unsupervised kRICA to a supervised one by introducing a class-driven discrimination constraint, such that data samples from the same class are well represented by the corresponding subset of basis vectors. This constraint minimizes inhomogeneous representation energy and maximizes homogeneous representation energy simultaneously, which is essentially equivalent to implicitly maximizing between-class scatter while minimizing within-class scatter. Experimental results demonstrate that the proposed algorithm is more effective than other state-of-the-art methods on several datasets.
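A sketch of the RICA objective (soft reconstruction cost plus a smooth sparsity penalty). The nonlinear lifting is approximated here with an explicit RBF feature map (sklearn's RBFSampler) rather than the paper's kernelization, which is an assumption for illustration only.

```python
import numpy as np
from sklearn.kernel_approximation import RBFSampler

def rica_loss(W, X, lam=0.1, eps=1e-8):
    """W: (k, d) overcomplete basis; X: (d, n) data columns."""
    Z = W @ X
    recon = ((W.T @ Z - X) ** 2).sum()      # soft reconstruction cost
    sparsity = np.sqrt(Z ** 2 + eps).sum()  # smooth L1 surrogate on the codes
    return recon + lam * sparsity

# Approximate nonlinear lifting of the data before learning the basis:
# phi_X = RBFSampler(gamma=1.0, n_components=256).fit_transform(X.T).T
# then minimize rica_loss(W, phi_X) over W, e.g., with L-BFGS.
```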
international conference on multimedia and expo | 2014
Yunchao Wei; Yao Zhao; Zhenfeng Zhu; Yanhui Xiao; Shikui Wei
In this paper, we propose a cross-media regularization framework to enhance image understanding, which can benefit tasks such as image retrieval and classification. The goal of cross-media regularization is to find regularization projections by exploiting the correlations between visual and textual features, so that the original noisy distribution of visual features can be refined by leveraging the discriminative distribution of the corresponding textual features. Within the proposed framework, a mid-level representation is built by jointly projecting both visual and textual features into a shared feature subspace, which transfers the discriminative semantic characteristics embedded in the textual modality into the corresponding visual modality. Meanwhile, the discriminative power of the textual features is boosted simultaneously. Experimental results demonstrate that the proposed mid-level space learning process remarkably improves search quality and outperforms existing semantic regularization methods.
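A minimal sketch of one way to regularize visual features with paired text: learn a ridge mapping from the noisy visual features toward the more discriminative textual features and use the mapped output as the refined representation. This simplification stands in for the joint subspace learning described above; the ridge solver and feature shapes are assumptions.

```python
import numpy as np

def text_regularized_visual(X_vis, X_txt, lam=1.0):
    """X_vis: (n, dv) visual features; X_txt: (n, dt) paired textual features."""
    dv = X_vis.shape[1]
    # Regularization projection: pull visual features toward the
    # discriminative textual distribution.
    P = np.linalg.solve(X_vis.T @ X_vis + lam * np.eye(dv), X_vis.T @ X_txt)
    return X_vis @ P  # refined representation in the text-informed space
```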
international conference on multimedia and expo | 2013
Zhenfeng Zhu; Peilu Xin; Shikui Wei; Yao Zhao
As one of the most successful approaches to recommendation, the matrix factorization based Collaborative Filtering (CF) technique has received considerable attention over the past years. In this paper, we propose an orthogonal matrix factorization model with graph regularization to preserve the consistency of the local structure in the user and item spaces, respectively. Instead of the traditional alternating optimization method, a greedy sequential one is introduced to optimize a pair of coupled factor vector and its corresponding loading vector simultaneously each time; the original optimization problem is thus converted into the well-studied Multivariate Eigen Problem (MEP), and multiple pairs of coupled eigenvectors can be obtained in sequence. To prevent repeated solutions, a novel dual-deflation technique is incorporated into the sequential optimization. Experimental results on the MovieLens and EachMovie data sets demonstrate that the proposed method is highly competitive with state-of-the-art matrix factorization based collaborative filtering methods.
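An illustrative greedy sequential factorization: fit one coupled pair of factor vectors at a time by rank-1 alternating least squares with graph regularization, then deflate the residual before fitting the next pair. The Laplacians L_u and L_v and the ALS updates are simplified assumptions, not the paper's MEP formulation or its dual-deflation technique.

```python
import numpy as np

def sequential_mf(R, L_u, L_v, rank=5, alpha=0.1, iters=50):
    """R: (m, n) rating matrix; L_u, L_v: user/item graph Laplacians."""
    res, U, V = R.copy(), [], []
    for _ in range(rank):
        u = np.random.randn(R.shape[0])
        v = np.random.randn(R.shape[1])
        for _ in range(iters):
            # Rank-1 ALS with a graph-smoothness term on each factor.
            u = np.linalg.solve((v @ v) * np.eye(len(u)) + alpha * L_u, res @ v)
            v = np.linalg.solve((u @ u) * np.eye(len(v)) + alpha * L_v, res.T @ u)
        res = res - np.outer(u, v)  # deflation: remove the fitted pair
        U.append(u); V.append(v)
    return np.array(U).T, np.array(V).T  # (m, rank) and (n, rank) factors
```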