Hongfu Liu | Researchain

Publication

Featured researches published by Hongfu Liu.

knowledge discovery and data mining | 2015

Spectral Ensemble Clustering

Hongfu Liu; Tongliang Liu; Junjie Wu; Dacheng Tao; Yun Fu

Ensemble clustering, also known as consensus clustering, is emerging as a promising solution for multi-source and/or heterogeneous data clustering. The co-association matrix based method, which redefines the ensemble clustering problem as a classical graph partition problem, is a landmark method in this area. Nevertheless, its relatively high time and space complexity precludes it from real-life large-scale data clustering. We therefore propose SEC, an efficient Spectral Ensemble Clustering method based on the co-association matrix. We show that SEC is theoretically equivalent to weighted K-means clustering, which vastly reduces the algorithmic complexity. We then derive the latent consensus function of SEC, which to the best of our knowledge is among the first to bridge co-association matrix based methods to methods with explicit objective functions. The robustness and generalizability of SEC are then investigated to prove the superiority of SEC in theory. We finally extend SEC to meet the challenge arising from incomplete basic partitions, based on which a scheme for big data clustering can be formed. Experimental results on various real-world data sets demonstrate that SEC is an effective and efficient competitor to some state-of-the-art ensemble clustering methods and is also suitable for big data clustering.
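To make the co-association matrix mentioned in this abstract concrete, here is a minimal sketch in plain Python. It is not the SEC algorithm itself; it only builds the n x n matrix whose entry (i, j) is the fraction of basic partitions that place objects i and j in the same cluster. The function name `co_association` and the toy partitions are illustrative assumptions, not taken from the paper.

```python
def co_association(partitions):
    """Build the n x n co-association matrix from a list of basic partitions.

    Each partition is a list of cluster labels, one per object. Entry (i, j)
    is the fraction of partitions in which objects i and j share a cluster.
    """
    n = len(partitions[0])
    r = len(partitions)
    matrix = [[0.0] * n for _ in range(n)]
    for labels in partitions:
        for i in range(n):
            for j in range(n):
                if labels[i] == labels[j]:
                    matrix[i][j] += 1.0 / r
    return matrix


# Four hypothetical basic partitions of four objects.
basic_partitions = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 2, 2],
]
S = co_association(basic_partitions)
```

The matrix needs memory quadratic in the number of objects, which is the kind of cost the abstract refers to when it says the co-association approach does not scale to large data.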

knowledge discovery and data mining | 2016

Infinite Ensemble for Image Clustering

Hongfu Liu; Ming Shao; Sheng Li; Yun Fu

Image clustering has been a critical preprocessing step for vision tasks, e.g., visual concept discovery and content-based image retrieval. Conventional image clustering methods use handcrafted visual descriptors as basic features via K-means, or build the graph within spectral clustering. Recently, representation learning with deep structure has shown appealing performance in unsupervised feature pre-treatment. However, few studies have discussed how to deploy deep representation learning in image clustering problems; in particular, a unified framework that integrates both representation learning and ensemble clustering for efficient image clustering is still missing. In addition, even though it is widely recognized that ensemble clustering achieves better performance and lower variance as the number of basic partitions increases, the best number of basic partitions for a given data set remains an open problem. In light of this, we propose Infinite Ensemble Clustering (IEC), which incorporates the power of deep representation and ensemble clustering in a one-step framework to fuse infinite basic partitions. Generally speaking, a set of basic partitions is first generated from the image data. Then, by converting the basic partitions to 1-of-K codings, we link the marginalized auto-encoder to infinite ensemble clustering with i.i.d. basic partitions, which can be approached by closed-form solutions. Finally, we follow the layer-wise training procedure and feed the concatenated deep features to K-means for final clustering. Extensive experiments on diverse vision data sets with different levels of visual descriptors demonstrate both the time efficiency and superior performance of IEC compared to state-of-the-art ensemble clustering and deep clustering methods.
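The step of converting basic partitions to 1-of-K codings can be sketched as follows. This covers only the encoding; the marginalized auto-encoder and the final K-means step are not shown. The helper names `one_of_k` and `encode_partitions` are hypothetical and chosen for illustration.

```python
def one_of_k(labels):
    """Encode one partition as binary rows with a single 1 per object."""
    clusters = sorted(set(labels))
    index = {c: k for k, c in enumerate(clusters)}
    return [
        [1 if index[label] == k else 0 for k in range(len(clusters))]
        for label in labels
    ]


def encode_partitions(partitions):
    """Concatenate the 1-of-K codings of all partitions, one row per object."""
    encoded = [one_of_k(p) for p in partitions]
    n = len(partitions[0])
    return [sum((e[i] for e in encoded), []) for i in range(n)]


# Two hypothetical basic partitions of four objects (2 and 3 clusters).
B = encode_partitions([[0, 0, 1, 1], [0, 1, 2, 2]])
# Each row contains exactly one 1 per basic partition.
```

With r basic partitions, the inner product of two rows divided by r gives the fraction of partitions in which the two objects share a cluster, so this binary matrix carries the same information as the co-association matrix in a more compact form.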

IEEE Transactions on Knowledge and Data Engineering | 2017

Spectral Ensemble Clustering via Weighted K-Means: Theoretical and Practical Evidence

Hongfu Liu; Junjie Wu; Tongliang Liu; Dacheng Tao; Yun Fu

As a promising way for heterogeneous data analytics, consensus clustering has attracted increasing attention in recent decades. Among various excellent solutions, the co-association matrix based methods form a landmark, which redefines consensus clustering as a graph partition problem. Nevertheless, the relatively high time and space complexities preclude it from wide real-life applications. We, therefore, propose Spectral Ensemble Clustering (SEC) to leverage the advantages of co-association matrix in information integration but run more efficiently. We disclose the theoretical equivalence between SEC and weighted K-means clustering, which dramatically reduces the algorithmic complexity. We also derive the latent consensus function of SEC, which to our best knowledge is the first to bridge co-association matrix based methods to the methods with explicit global objective functions. Further, we prove in theory that SEC holds the robustness, generalizability, and convergence properties. We finally extend SEC to meet the challenge arising from incomplete basic partitions, based on which a row-segmentation scheme for big data clustering is proposed. Experiments on various real-world data sets in both ensemble and multi-view clustering scenarios demonstrate the superiority of SEC to some state-of-the-art methods. In particular, SEC seems to be a promising candidate for big data clustering.

international conference on data mining | 2015

Clustering with Partition Level Side Information

Hongfu Liu; Yun Fu

Constrained clustering uses pre-given knowledge to improve the clustering performance. In the existing literature, researchers usually focus on Must-Link and Cannot-Link pairwise constraints. However, pairwise constraints not only run counter to the way we make decisions, but are also vulnerable to noisy constraints and to the order of constraints. In light of this, we use partition level side information instead of pairwise constraints to guide the process of clustering. Compared with pairwise constraints, partition level side information keeps the consistency within the partial structure and avoids self-contradiction and the impact of constraint order. Generally speaking, only a small part of the data instances are given labels by human workers, which are used to supervise the procedure of clustering. Inspired by the success of ensemble clustering, we aim to find a clustering solution which captures the intrinsic structure from the data itself, and agrees with the partition level side information as much as possible. We then derive the objective function and equivalently transform it into a K-means-like optimization problem. Extensive experiments on several real-world datasets demonstrate the effectiveness and efficiency of our method compared to pairwise constrained clustering and consensus clustering, which verifies the superiority of partition level side information to pairwise constraints. In addition, our method is highly robust to noisy side information.
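The relationship between the two kinds of side information can be illustrated with a small sketch. It is not the method of the paper; it only shows that labels on a small subset of instances already determine a self-consistent set of Must-Link and Cannot-Link pairs. The function name `pairwise_from_partial` and the toy labels are assumptions made for illustration.

```python
from itertools import combinations


def pairwise_from_partial(side_info):
    """Derive pairwise constraints from labels on a subset of instances.

    side_info maps an instance index to its label. Two labelled instances
    form a Must-Link pair if their labels agree, otherwise a Cannot-Link pair.
    """
    must_link, cannot_link = [], []
    for (i, a), (j, b) in combinations(sorted(side_info.items()), 2):
        if a == b:
            must_link.append((i, j))
        else:
            cannot_link.append((i, j))
    return must_link, cannot_link


# Hypothetical labels for three of the instances in a larger data set.
side_info = {0: "A", 3: "A", 5: "B"}
must_link, cannot_link = pairwise_from_partial(side_info)
```

Because the pairs all come from one partial partition, they cannot contradict each other and do not depend on any ordering, which matches the consistency argument made in the abstract.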

conference on information and knowledge management | 2016

Robust Spectral Ensemble Clustering

Zhiqiang Tao; Hongfu Liu; Sheng Li; Yun Fu

Ensemble Clustering (EC) aims to integrate multiple Basic Partitions (BPs) of the same dataset into a consensus one. It could be transformed as a graph partition problem on the co-association matrix derived from BPs. However, existing EC methods usually directly use the co-association matrix, yet without considering various noises (e.g., the disagreement between different BPs or outliers) that may exist in it. These noises can impair the cluster structure of a co-association matrix and thus degrade the final clustering performance. In this paper, we propose a novel Robust Spectral Ensemble Clustering (RSEC) approach to address this challenge. First, RSEC learns a robust representation for the co-association matrix through low-rank constraint, which reveals the cluster structure of a co-association matrix and captures various noises in it. Second, RSEC finds the consensus partition by conducting spectral clustering. These two steps are iteratively performed in a unified optimization framework. Most importantly, during our optimization process, we utilize consensus partition to iteratively enhance the block-diagonal structure of the learned representation to further assist the clustering process. Experiments on numerous real-world datasets demonstrate the effectiveness of our method compared with the state-of-the-art. Moreover, several impact factors that may affect the clustering performance of our approach are also explored extensively.

Data Mining and Knowledge Discovery | 2018

Infinite ensemble clustering

Hongfu Liu; Ming Shao; Sheng Li; Yun Fu

Ensemble clustering aims to fuse several diverse basic partitions into a consensus one, and has been widely recognized as a promising tool to discover novel clusters and deliver robust partitions, while representation learning with deep structure shows appealing performance in unsupervised feature pre-treatment. In the literature, it has been empirically found that ensemble clustering achieves better performance and lower variance as the number of basic partitions increases, yet the best number of basic partitions for a given data set remains an open problem. In light of this, we propose Infinite Ensemble Clustering (IEC), which incorporates a marginalized denoising auto-encoder with dropout noise to generate the expectation representation for infinite basic partitions. Generally speaking, a set of basic partitions is first generated from the data. Then, by converting the basic partitions to 1-of-K codings, we link the marginalized denoising auto-encoder to the infinite basic partition representation. Finally, we follow the layer-wise training procedure and feed the concatenated deep features to K-means for final clustering. According to the different types of marginalized auto-encoders, linear and non-linear versions of IEC are proposed. Extensive experiments on diverse vision data sets with different levels of visual descriptors demonstrate the superior performance of IEC compared to state-of-the-art ensemble clustering and deep clustering methods. Moreover, we evaluate the performance of IEC in pan-omics gene expression analysis via survival analysis.

IEEE Transactions on Pattern Analysis and Machine Intelligence | 2018

Partition Level Constrained Clustering

Hongfu Liu; Zhiqiang Tao; Yun Fu

Constrained clustering uses pre-given knowledge to improve the clustering performance. Here we use a new constraint called partition level side information and propose the Partition Level Constrained Clustering (PLCC) framework, where only a small proportion of the data is given labels to guide the procedure of clustering. Our goal is to find a partition which captures the intrinsic structure from the data itself, and also agrees with the partition level side information. Then we derive the algorithm of partition level side information based on K-means and give its corresponding solution. Further, we extend it to handle multiple side information and design the algorithm of partition level side information for spectral clustering. Extensive experiments demonstrate the effectiveness and efficiency of our method compared to pairwise constrained clustering and ensemble clustering methods, even in the inconsistent cluster number setting, which verifies the superiority of partition level side information to pairwise constraints. Besides, our method has high robustness to noisy side information, and we also validate the performance of our method with multiple side information. Finally, the image cosegmentation application based on saliency-guided side information demonstrates the effectiveness of PLCC as a flexible framework in different domains, even with the unsupervised side information.

Bioinformatics | 2017

Entropy-based consensus clustering for patient stratification

Hongfu Liu; Rui Zhao; Hongsheng Fang; Feixiong Cheng; Yun Fu; Yang-Yu Liu

Motivation: Patient stratification or disease subtyping is crucial for precision medicine and personalized treatment of complex diseases. The increasing availability of high‐throughput molecular data provides a great opportunity for patient stratification. Many clustering methods have been employed to tackle this problem in a purely data‐driven manner. Yet, existing methods leveraging high‐throughput molecular data often suffer from various limitations, e.g. noise, data heterogeneity, high dimensionality or poor interpretability. Results: Here we introduce an Entropy‐based Consensus Clustering (ECC) method that overcomes those limitations altogether. Our ECC method employs an entropy‐based utility function to fuse many basic partitions into a consensus one that agrees with the basic ones as much as possible. Maximizing the utility function in ECC has a much more meaningful interpretation than in any other consensus clustering method. Moreover, we exactly map the complex utility maximization problem to the classic K‐means clustering problem, which can then be efficiently solved with linear time and space complexity. Our ECC method can also naturally integrate multiple molecular data types measured from the same set of subjects, and easily handle missing values without any imputation. We applied ECC to 110 synthetic and 48 real datasets, including 35 cancer gene expression benchmark datasets and 13 cancer types with four molecular data types from The Cancer Genome Atlas. We found that ECC shows superior performance against existing clustering methods. Our results clearly demonstrate the power of ECC in clinically relevant patient stratification. Availability and implementation: The Matlab package is available at http://scholar.harvard.edu/yyl/ecc. Contact: [email protected] or [email protected] Supplementary information: Supplementary data are available at Bioinformatics online.

international conference on data mining | 2016

Robust Multi-View Feature Selection

Hongfu Liu; Haiyi Mao; Yun Fu

High-throughput technologies have enabled us to rapidly accumulate a wealth of diverse data types. These multi-view data contain much more information for uncovering the cluster structure than single-view data, which has drawn increasing attention in the data mining and machine learning areas. On one hand, many features are extracted to provide enough information for better representations; on the other hand, such abundant features might result in noisy, redundant and irrelevant information, which harms the performance of the learning algorithms. In this paper, we focus on a new topic, multi-view unsupervised feature selection, which aims to discover the discriminative features in each view for better explanation and representation. Although there are some exploratory studies along this direction, most of them employ traditional feature selection by putting the features in different views together, and fail to evaluate the performance in the multi-view setting. The features selected in this way are difficult to explain because the views carry different meanings, which defeats the goal of feature selection as well. In light of this, we intend to give a correct understanding of multi-view feature selection. Different from the existing work, which either incorrectly concatenates the features from different views or incurs huge time complexity to learn the pseudo labels, we propose a novel algorithm, Robust Multi-view Feature Selection (RMFS), which applies robust multi-view K-means to obtain robust and high quality pseudo labels for sparse feature selection in an efficient way. Nontrivially, we give the solution by taking the derivatives and further provide a K-means-like optimization to update several variables in a unified framework with a convergence guarantee. Extensive experiments on three real-world multi-view data sets illustrate the effectiveness and efficiency of RMFS, by a large margin, in terms of both single-view and multi-view evaluations.

international conference on big data | 2016

Outlier detection via sampling ensemble

Hongfu Liu; Yuchao Zhang; Bo Deng; Yun Fu

Outlier detection is a key technique in the data mining and machine learning fields. The deviating characteristics of outliers have hugely detrimental effects on learning tasks. Many algorithms have therefore been proposed to handle outliers from different perspectives, such as distance, density, angle and so on. Among these approaches, the density-based methods achieve better performance, but also suffer from huge time complexity. Recently, the subsampling ensemble method, which has a reasonable theoretical interpretation and high performance, has attracted much attention as a way to accelerate detection and improve performance. However, existing work only gives a partial picture of outlier detection via row-sampling; an effective portfolio of bi-sampling is still missing. In light of this, we propose a general outlier detection framework via bi-sampling, Bi-Sampling Outlier Detection (BSOD), and provide effective portfolios of the row- and column-sampling ratios in a theoretical way. In addition, the benefits of BSOD are fully illustrated in terms of ensemble diversity and divide-and-conquer. Further, we employ LOF within BSOD as BI-LOF to conduct extensive experiments. In general, on 30 synthetic and 17 real-world data sets we thoroughly explore the characteristics of BI-LOF with different numbers of instances, features and nearest neighbors, validate the theoretical analysis of the BSOD condition on synthetic data sets, and show obvious advantages over other state-of-the-art algorithms on low and high dimensional real-world data sets. Finally, we use BI-LOF to conduct image outlier detection and show the high quality and stability of BI-LOF.
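The bi-sampling idea, drawing a random subset of both rows (instances) and columns (features) for each ensemble member, can be sketched as below. This is a simplified illustration and not BI-LOF: the base detector here is the mean distance to the k nearest sampled neighbours, swapped in for LOF to keep the example dependency-free. The sampling ratios, round count and function names are arbitrary assumptions rather than the theoretically derived portfolios from the paper.

```python
import math
import random


def knn_score(point, reference, k):
    """Mean distance from point to its k nearest neighbours in reference."""
    dists = sorted(math.dist(point, ref) for ref in reference)
    return sum(dists[:k]) / k


def bi_sampling_scores(data, n_rounds=20, row_ratio=0.5, col_ratio=0.5,
                       k=2, seed=0):
    """Average outlier scores over rounds of joint row and column sampling."""
    rng = random.Random(seed)
    n, d = len(data), len(data[0])
    n_rows = max(k + 1, int(n * row_ratio))  # keep at least k other points
    n_cols = max(1, int(d * col_ratio))
    totals = [0.0] * n
    for _ in range(n_rounds):
        rows = rng.sample(range(n), n_rows)   # row (instance) sampling
        cols = rng.sample(range(d), n_cols)   # column (feature) sampling
        proj = [[x[c] for c in cols] for x in data]
        for i in range(n):
            # Score every instance against the sampled rows, excluding itself.
            reference = [proj[r] for r in rows if r != i]
            totals[i] += knn_score(proj[i], reference, k)
    return [t / n_rounds for t in totals]


# Eight hypothetical inliers near the origin and one distant outlier.
data = [[0.1 * (i % 2), 0.1 * ((i // 2) % 2), 0.1 * ((i // 4) % 2),
         0.1 * (i % 3)] for i in range(8)]
data.append([5.0, 5.0, 5.0, 5.0])
scores = bi_sampling_scores(data)
most_outlying = scores.index(max(scores))
```

Each round works on a smaller matrix than the full data set, which is where the speed-up of subsampling ensembles comes from, while averaging over rounds with different row and column subsets supplies the ensemble diversity the abstract mentions.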
