Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Shenghuo Zhu is active.

Publication


Featured research published by Shenghuo Zhu.


Computer Vision and Pattern Recognition | 2011

Large-scale image classification: Fast feature extraction and SVM training

Yuanqing Lin; Fengjun Lv; Shenghuo Zhu; Ming Yang; Timothee Cour; Kai Yu; Liangliang Cao; Thomas S. Huang

Most research efforts on image classification so far have focused on medium-scale datasets, often defined as datasets that can fit into the memory of a desktop (typically 4GB to 48GB). There are two main reasons for the limited effort on large-scale image classification. First, until the emergence of the ImageNet dataset, there was almost no publicly available large-scale benchmark data for image classification, mostly because class labels are expensive to obtain. Second, large-scale classification is hard because it poses more challenges than its medium-scale counterpart. A key challenge is how to achieve efficiency in both feature extraction and classifier training without compromising performance. This paper shows how we address this challenge, using the ImageNet dataset as an example. For feature extraction, we develop a Hadoop scheme that performs feature extraction in parallel using hundreds of mappers. This allows us to extract fairly sophisticated features (with dimensions in the hundreds of thousands) on 1.2 million images within one day. For SVM training, we develop a parallel averaging stochastic gradient descent (ASGD) algorithm for training one-against-all 1000-class SVM classifiers. The ASGD algorithm is capable of dealing with terabytes of training data and converges very fast; typically 5 epochs are sufficient. As a result, we achieve state-of-the-art performance on ImageNet 1000-class classification: 52.9% classification accuracy and a 71.8% top-5 hit rate.
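
The averaged SGD training described above can be illustrated with a minimal single-machine sketch, assuming a Pegasos-style decaying step size and plain NumPy; the function names and hyperparameters below are illustrative, not the paper's parallel implementation.

```python
import numpy as np

def asgd_linear_svm(X, y, epochs=5, lam=1e-4):
    """Averaged SGD for a binary linear SVM (hinge loss + L2 penalty).
    y must be in {-1, +1}; returns the averaged weight vector."""
    n, d = X.shape
    w = np.zeros(d)
    w_avg = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in np.random.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)                    # decaying step size
            margin = y[i] * X[i].dot(w)
            grad = lam * w - (y[i] * X[i] if margin < 1 else 0.0)
            w -= eta * grad
            w_avg += (w - w_avg) / t                 # running average of iterates
    return w_avg

def one_vs_all(X, labels, num_classes, **kw):
    """Train one binary classifier per class (one-against-all)."""
    return np.stack([asgd_linear_svm(X, np.where(labels == c, 1.0, -1.0), **kw)
                     for c in range(num_classes)])
```

The averaging of iterates is what makes ASGD robust to noisy per-example gradients while keeping each update linear in the feature dimension.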


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2008

Multi-document summarization via sentence-level semantic analysis and symmetric matrix factorization

Dingding Wang; Tao Li; Shenghuo Zhu; Chris H. Q. Ding

Multi-document summarization aims to create a compressed summary while retaining the main characteristics of the original set of documents. Many approaches use statistics and machine learning techniques to extract sentences from documents. In this paper, we propose a new multi-document summarization framework based on sentence-level semantic analysis and symmetric non-negative matrix factorization. We first calculate sentence-sentence similarities using semantic analysis and construct the similarity matrix. Then symmetric matrix factorization, which has been shown to be equivalent to normalized spectral clustering, is used to group sentences into clusters. Finally, the most informative sentences are selected from each cluster to form the summary. Experimental results on the DUC2005 and DUC2006 data sets demonstrate the improvement of our proposed framework over the existing summarization systems we implemented. A further study of the factors that contribute to the high performance is also conducted.
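
As a rough illustration of the clustering step, here is a minimal symmetric NMF sketch with multiplicative updates, assuming a precomputed sentence-similarity matrix (plain cosine similarity could stand in for the paper's sentence-level semantic analysis); it is not the authors' exact algorithm.

```python
import numpy as np

def symmetric_nmf(W, k, iters=200, eps=1e-9):
    """Factor a symmetric nonnegative similarity matrix W ~ H H^T with H >= 0.
    Rows of H give soft cluster memberships for the sentences."""
    n = W.shape[0]
    H = np.abs(np.random.rand(n, k))
    for _ in range(iters):
        numer = W @ H
        denom = H @ (H.T @ H) + eps
        H *= 0.5 * (1.0 + numer / denom)   # multiplicative update keeps H nonnegative
    return H

def summarize(similarity, k):
    """Cluster sentences, then pick the most central sentence of each cluster."""
    H = symmetric_nmf(similarity, k)
    clusters = H.argmax(axis=1)
    picks = []
    for c in range(k):
        idx = np.where(clusters == c)[0]
        if len(idx):
            # sentence most similar to the rest of its own cluster
            picks.append(idx[similarity[np.ix_(idx, idx)].sum(axis=1).argmax()])
    return sorted(picks)
```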


SIGKDD Explorations | 2002

A survey on wavelet applications in data mining

Tao Li; Qi Li; Shenghuo Zhu; Mitsunori Ogihara

Recently there has been significant development in the use of wavelet methods in various data mining processes. However, no comprehensive survey of the topic has been available; the goal of this paper is to fill that void. First, the paper presents a high-level data-mining framework that decomposes the overall process into smaller components. Then applications of wavelets for each component are reviewed. The paper concludes by discussing the impact of wavelets on data mining research and outlining potential future research directions and applications.
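
For readers unfamiliar with the transform that the surveyed applications build on, here is a minimal one-level Haar wavelet decomposition in NumPy with a simple hard-threshold denoising step; the signal and threshold are made up for illustration.

```python
import numpy as np

def haar_dwt(signal):
    """One level of the Haar discrete wavelet transform.
    Returns (approximation, detail); the signal length is assumed even."""
    x = np.asarray(signal, dtype=float)
    pairs = x.reshape(-1, 2)
    approx = pairs.sum(axis=1) / np.sqrt(2)            # low-pass: local averages
    detail = (pairs[:, 0] - pairs[:, 1]) / np.sqrt(2)  # high-pass: local differences
    return approx, detail

# e.g. denoising as a preprocessing step: keep the approximation,
# zero out small detail coefficients
a, d = haar_dwt([4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0])
d[np.abs(d) < 1.0] = 0.0
```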


Data Mining and Knowledge Discovery | 2011

Community discovery using nonnegative matrix factorization

Fei Wang; Tao Li; Xin Wang; Shenghuo Zhu; Chris H. Q. Ding

Complex networks exist in a wide range of real-world systems, such as social networks, technological networks, and biological networks. During the last decades, many researchers have concentrated on exploring properties common to those large networks, including the small-world property, power-law degree distributions, and network connectivity. In this paper, we investigate another important issue in network analysis: community discovery. We choose Nonnegative Matrix Factorization (NMF) as our tool to find the communities because of its powerful interpretability and its close relationship to clustering methods. Targeting different types of networks (undirected, directed and compound), we propose three NMF techniques (Symmetric NMF, Asymmetric NMF and Joint NMF). The correctness and convergence properties of those algorithms are also studied. Finally, experiments on real-world networks are presented to show the effectiveness of the proposed methods.
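
A minimal sketch of the asymmetric case, using standard multiplicative-update NMF on a directed adjacency matrix; the toy network and rank below are made up, and the paper's Symmetric and Joint variants are not shown.

```python
import numpy as np

def nmf(A, k, iters=300, eps=1e-9):
    """Multiplicative-update NMF: A ~ W H with W, H >= 0.
    For a directed adjacency matrix, rows of W give out-community
    memberships and columns of H give in-community memberships."""
    n, m = A.shape
    W = np.abs(np.random.rand(n, k))
    H = np.abs(np.random.rand(k, m))
    for _ in range(iters):
        H *= (W.T @ A) / (W.T @ W @ H + eps)
        W *= (A @ H.T) / (W @ H @ H.T + eps)
    return W, H

# usage on a toy directed network of five nodes
A = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 0, 0],
              [0, 0, 0, 0, 1],
              [0, 0, 1, 1, 0]], dtype=float)
W, H = nmf(A, k=2)
print(W.argmax(axis=1))   # community assignment per node
```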


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2007

Combining content and link for classification using matrix factorization

Shenghuo Zhu; Kai Yu; Yun Chi; Yihong Gong

The World Wide Web contains rich textual content that is interconnected via complex hyperlinks. This huge database violates the assumption, held by most conventional statistical methods, that each web page is an independent and identically distributed sample. It is thus difficult to apply traditional mining or learning methods to web mining problems, e.g., web page classification, by exploiting both the content and the link structure. Research in this direction has recently received considerable attention but is still at an early stage. Though a few methods exploit both the link structure and the content information, some of them combine only the authority information with the content information, and the others first decompose the link structure into hub and authority features, then apply them as additional document features. This paper aims to design an algorithm, practically attractive for its great simplicity, that exploits both the content and linkage information by carrying out a joint factorization of the linkage adjacency matrix and the document-term matrix, deriving a new representation for web pages in a low-dimensional factor space without explicitly separating content, hub, or authority factors. Further analysis can be performed based on this compact representation of web pages. In the experiments, the proposed method is compared with state-of-the-art methods and demonstrates excellent accuracy in hypertext classification on the WebKB and Cora benchmarks.
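
A minimal sketch of the joint-factorization idea, assuming a simple gradient-descent objective that reconstructs the adjacency matrix from a shared page embedding Z and the document-term matrix from Z and a term factor V; the objective, weighting, and learning rate are illustrative choices, not the paper's exact formulation.

```python
import numpy as np

def joint_factorize(A, C, k, alpha=0.5, lr=1e-3, iters=500):
    """Learn one shared page embedding Z (n x k) from the link adjacency
    A (n x n) and the document-term matrix C (n x m) by minimizing
        alpha * ||A - Z Z^T||^2 + (1 - alpha) * ||C - Z V^T||^2."""
    n, m = C.shape
    rng = np.random.default_rng(0)
    Z = 0.01 * rng.standard_normal((n, k))
    V = 0.01 * rng.standard_normal((m, k))
    for _ in range(iters):
        RA = Z @ Z.T - A          # residual of the link reconstruction
        RC = Z @ V.T - C          # residual of the content reconstruction
        gZ = alpha * 2 * (RA + RA.T) @ Z + (1 - alpha) * 2 * RC @ V
        gV = (1 - alpha) * 2 * RC.T @ Z
        Z -= lr * gZ
        V -= lr * gV
    return Z                      # low-dimensional page representation

# Z can then be fed to any off-the-shelf classifier.
```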


International Conference on Computer Vision | 2011

Contextual weighting for vocabulary tree based image retrieval

Xiaoyu Wang; Ming Yang; Timothee Cour; Shenghuo Zhu; Kai Yu; Tony X. Han

In this paper we address the problem of image retrieval from millions of database images. We improve the vocabulary tree based approach by introducing contextual weighting of local features in both descriptor and spatial domains. Specifically, we propose to incorporate efficient statistics of neighbor descriptors both on the vocabulary tree and in the image spatial domain into the retrieval. These contextual cues substantially enhance the discriminative power of individual local features with very small computational overhead. We have conducted extensive experiments on benchmark datasets, i.e., the UKbench, Holidays, and our new Mobile dataset, which show that our method reaches state-of-the-art performance with much less computation. Furthermore, the proposed method demonstrates excellent scalability in terms of both retrieval accuracy and efficiency on large-scale experiments using 1.26 million images from the ImageNet database as distractors.
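
For orientation, here is a minimal TF-IDF inverted-file retrieval sketch over quantized local descriptors; a flat codebook stands in for the hierarchical vocabulary tree, and the descriptor-domain and spatial contextual weighting that the paper actually contributes are omitted.

```python
import numpy as np

def build_index(db_descriptors, codebook):
    """TF-IDF bag-of-visual-words index.
    db_descriptors: list of (num_features x d) arrays, one per image.
    codebook: (k x d) array of visual-word centers."""
    k = codebook.shape[0]
    tf = np.zeros((len(db_descriptors), k))
    for i, desc in enumerate(db_descriptors):
        # quantize each local feature to its nearest visual word
        words = np.argmin(((desc[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
        for w in words:
            tf[i, w] += 1
    df = (tf > 0).sum(axis=0)
    idf = np.log((len(db_descriptors) + 1) / (df + 1))
    vecs = tf * idf
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True) + 1e-12
    return vecs, idf

def query(q_desc, codebook, vecs, idf):
    words = np.argmin(((q_desc[:, None, :] - codebook[None]) ** 2).sum(-1), axis=1)
    q = np.bincount(words, minlength=codebook.shape[0]) * idf
    q = q / (np.linalg.norm(q) + 1e-12)
    return np.argsort(-vecs @ q)   # database images ranked by cosine score
```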


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2005

Multi-labelled classification using maximum entropy method

Shenghuo Zhu; Xiang Ji; Wei Xu; Yihong Gong

Many classification problems require classifiers to assign each document to more than one category, which is called multi-labelled classification. The categories in such problems usually are neither conditionally independent of each other nor mutually exclusive; therefore, it is not trivial to directly employ state-of-the-art classification algorithms without losing information about the relations among categories. In this paper, we explore correlations among categories with the maximum entropy method and derive a classification algorithm for multi-labelled documents. Our experiments show that this method significantly outperforms combinations of single-label approaches.
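
A minimal sketch in the same spirit, using a label-powerset simplification: a multinomial logistic (maximum-entropy) model over observed label combinations, so label correlations are modelled jointly rather than with independent per-label classifiers. This is an illustration, not the paper's exact formulation.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_label_powerset(X, Y):
    """Multi-label classification via a maximum-entropy model over
    observed label combinations. Y is a binary indicator matrix
    of shape (n_samples, n_labels)."""
    combos, y = np.unique(Y, axis=0, return_inverse=True)
    clf = LogisticRegression(max_iter=1000).fit(X, y.ravel())
    return clf, combos

def predict_labels(clf, combos, X):
    # most probable label combination for each document
    return combos[clf.predict(X)]
```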


International ACM SIGIR Conference on Research and Development in Information Retrieval | 2002

Document clustering with cluster refinement and model selection capabilities

Xin Liu; Yihong Gong; Wei Xu; Shenghuo Zhu

In this paper, we propose a document clustering method that strives to achieve: (1) high accuracy of document clustering, and (2) the capability of estimating the number of clusters in the document corpus (i.e. the model selection capability). To accurately cluster the given document corpus, we employ a richer feature set to represent each document, and use the Gaussian Mixture Model (GMM) together with the Expectation-Maximization (EM) algorithm to conduct an initial document clustering. From this initial result, we identify a set of discriminative features for each cluster, and refine the initially obtained document clusters by voting on the cluster label of each document using this discriminative feature set. This self-refinement process of discriminative feature identification and cluster label voting is applied iteratively until the document clusters converge. The model selection capability is achieved by introducing randomness in the cluster initialization stage and then discovering the number of clusters for which running the document clustering process a fixed number of times yields sufficiently similar results. Performance evaluations exhibit clear superiority of the proposed method, with improved document clustering and model selection accuracy. The evaluations also demonstrate how each feature as well as the cluster refinement process contributes to the document clustering accuracy.
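
A minimal sketch of the cluster-then-refine loop, using scikit-learn's GaussianMixture for the initial EM clustering and a simple per-cluster discriminative-term vote for refinement; the feature scoring and the model-selection-by-stability step of the paper are simplified or omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def cluster_and_refine(X, k, top=50, rounds=5, seed=0):
    """Initial GMM/EM clustering, then iterative refinement that re-labels
    each document by voting with each cluster's most discriminative terms.
    X is a dense (n_docs x n_terms) feature matrix (reduce dimensionality
    first for large vocabularies)."""
    labels = GaussianMixture(n_components=k, random_state=seed).fit_predict(X)
    for _ in range(rounds):
        votes = np.zeros((X.shape[0], k))
        for c in range(k):
            mask = labels == c
            if not mask.any():
                continue
            in_c = X[mask].mean(axis=0)
            rest = X[~mask].mean(axis=0)
            feats = np.argsort(rest - in_c)[:top]   # terms over-represented in c
            votes[:, c] = X[:, feats].sum(axis=1)
        new_labels = votes.argmax(axis=1)
        if np.array_equal(new_labels, labels):
            break
        labels = new_labels
    return labels
```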


Knowledge and Information Systems | 2006

Using discriminant analysis for multi-class classification: an experimental investigation

Tao Li; Shenghuo Zhu; Mitsunori Ogihara

Many supervised machine learning tasks can be cast as multi-class classification problems. Support vector machines (SVMs) excel at binary classification, but the elegant theory behind large-margin hyperplanes cannot be easily extended to the multi-class case. On the other hand, it was shown that the decision hyperplanes for binary classification obtained by SVMs are equivalent to the solutions obtained by Fisher's linear discriminant on the set of support vectors. Discriminant analysis approaches are well known in the statistical pattern recognition literature for learning discriminative feature transformations and can be easily extended to multi-class cases. The use of discriminant analysis, however, has not been fully investigated experimentally in the data mining literature. In this paper, we explore the use of discriminant analysis for multi-class classification problems. We evaluate the performance of discriminant analysis on a large collection of benchmark datasets and investigate its usage in text categorization. Our experiments suggest that discriminant analysis provides a fast, efficient yet accurate alternative for general multi-class classification problems.
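
A minimal usage sketch with scikit-learn's multi-class linear discriminant analysis; the digits dataset here is only a stand-in for the benchmark collections studied in the paper.

```python
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_digits
from sklearn.model_selection import cross_val_score

# LDA handles multi-class problems directly, with no one-vs-one
# or one-vs-all decomposition needed.
X, y = load_digits(return_X_y=True)
lda = LinearDiscriminantAnalysis()
print(cross_val_score(lda, X, y, cv=5).mean())   # mean 5-fold accuracy
```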


International World Wide Web Conference | 2008

Learning multiple graphs for document recommendations

Ding Zhou; Shenghuo Zhu; Kai Yu; Xiaodan Song; Belle L. Tseng; Hongyuan Zha; C. Lee Giles

The Web offers rich relational data with different semantics. In this paper, we address the problem of document recommendation in a digital library, where the documents in question are networked by citations and are associated with other entities by various relations. Due to the sparsity of a single graph and noise in graph construction, we propose a new method for combining multiple graphs to measure document similarities, where different factorization strategies are used based on the nature of different graphs. In particular, the new method seeks a single low-dimensional embedding of documents that captures their relative similarities in a latent space. Based on the obtained embedding, a new recommendation framework is developed using semi-supervised learning on graphs. In addition, we address the scalability issue and propose an incremental algorithm. The new incremental method significantly improves the efficiency by calculating the embedding for new incoming documents only. The new batch and incremental methods are evaluated on two real world datasets prepared from CiteSeer. Experiments demonstrate significant quality improvement for our batch method and significant efficiency improvement with tolerable quality loss for our incremental method.
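
A minimal sketch of the combine-then-embed idea: several document graphs are merged with fixed weights, a low-dimensional spectral embedding is computed, and nearest neighbors in the latent space serve as recommendations. This is a spectral simplification; the paper's per-graph factorization strategies, semi-supervised layer, and incremental updates are not shown.

```python
import numpy as np

def combined_embedding(graphs, weights, k):
    """Low-dimensional document embedding from several graphs (e.g. a
    citation graph and another relation graph) over the same n documents.
    Each graph is a symmetric nonnegative (n x n) affinity matrix."""
    n = graphs[0].shape[0]
    S = sum(w * G for w, G in zip(weights, graphs))
    d = S.sum(axis=1)
    d[d == 0] = 1.0
    Dinv = np.diag(1.0 / np.sqrt(d))
    L = np.eye(n) - Dinv @ S @ Dinv          # normalized graph Laplacian
    vals, vecs = np.linalg.eigh(L)
    return vecs[:, 1:k + 1]                  # smoothest nontrivial eigenvectors

def recommend(embedding, doc, topn=5):
    dist = np.linalg.norm(embedding - embedding[doc], axis=1)
    order = np.argsort(dist)
    return order[order != doc][:topn]        # nearest documents in the latent space
```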

Collaboration


Dive into Shenghuo Zhu's collaborations.

Top Co-Authors

Tao Li, Florida International University
Yihong Gong, Xi'an Jiaotong University
Yun Chi, Princeton University
Dingding Wang, Florida Atlantic University
Qi Li, Western Kentucky University
Xiaoyu Wang, University of Missouri