Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Hongjie Bai is active.

Publication


Featured research published by Hongjie Bai.


IEEE Transactions on Pattern Analysis and Machine Intelligence | 2011

Parallel Spectral Clustering in Distributed Systems

Wen-Yen Chen; Yangqiu Song; Hongjie Bai; Chih-Jen Lin; Edward Y. Chang

Spectral clustering algorithms have been shown to be more effective in finding clusters than some traditional algorithms, such as k-means. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the size of a data set is large. To perform clustering on large data sets, we investigate two representative ways of approximating the dense similarity matrix: sparsifying the matrix and the Nyström method. We then pick the strategy of sparsifying the matrix via retaining nearest neighbors and investigate its parallelization. We parallelize both memory use and computation on distributed computers. Through an empirical study on a document data set of 193,844 instances and a photo data set of 2,121,863 instances, we show that our parallel algorithm can effectively handle large problems.
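
To make the chosen strategy concrete, here is a minimal single-machine sketch of the nearest-neighbor sparsification the paper parallelizes: each row of the dense Gaussian similarity matrix keeps only its t largest entries, and the result is symmetrized and stored sparsely. The function name, kernel parameters, and toy data are illustrative, not taken from the paper's code.

```python
import numpy as np
from scipy.sparse import csr_matrix

def sparsify_similarity(X, t=10, sigma=1.0):
    """Keep only each point's t nearest neighbors in the similarity matrix.

    Single-machine sketch of the strategy the paper parallelizes: prune
    each row of the dense Gaussian similarity matrix to its t largest
    entries, then symmetrize so the matrix remains a valid similarity.
    (In the parallel setting, each machine computes its own block of rows.)
    """
    n = X.shape[0]
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    S = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(S, 0.0)

    rows, cols, vals = [], [], []
    for i in range(n):
        nn = np.argpartition(S[i], -t)[-t:]          # t largest similarities in row i
        rows.extend([i] * t)
        cols.extend(nn.tolist())
        vals.extend(S[i, nn].tolist())
    A = csr_matrix((vals, (rows, cols)), shape=(n, n))
    return A.maximum(A.T)   # keep an edge if either endpoint retained it

A = sparsify_similarity(np.random.rand(200, 2), t=10)
print(A.nnz, "nonzeros instead of", 200 * 200)
```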


neural information processing systems | 2007

Parallelizing Support Vector Machines on Distributed Computers

Kaihua Zhu; Hao Wang; Hongjie Bai; Jian Li; Zhihuan Qiu; Hang Cui; Edward Y. Chang

Support Vector Machines (SVMs) suffer from a widely recognized scalability problem in both memory use and computational time. To improve scalability, we have developed a parallel SVM algorithm (PSVM), which reduces memory use by performing a row-based, approximate matrix factorization, and which loads only essential data onto each machine to perform parallel computation. Let n denote the number of training instances, p the reduced matrix dimension after factorization (p is significantly smaller than n), and m the number of machines. PSVM reduces the memory requirement from O(n²) to O(np/m), and improves computation time to O(np²/m). Empirical study shows PSVM to be effective. PSVM Open Source is available for download at http://code.google.com/p/psvm/.
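
The factorization replaces the n × n kernel matrix with a tall, thin factor H of size n × p so that K ≈ HHᵀ. One hedged illustration of why that helps (the names below are mine, not PSVM's): linear systems of the form (diag(d) + HHᵀ)x = b, which arise inside interior-point SVM solvers, can be solved with the Sherman-Morrison-Woodbury identity using only a p × p factorization.

```python
import numpy as np

def smw_solve(d, H, b):
    """Solve (diag(d) + H @ H.T) x = b without forming the n x n matrix.

    Sherman-Morrison-Woodbury:
      (D + H H^T)^-1 = D^-1 - D^-1 H (I + H^T D^-1 H)^-1 H^T D^-1
    Cost is O(n p^2) rather than O(n^3); distributing the rows of H over
    m machines is what yields the O(np^2/m) time quoted in the abstract.
    """
    Dinv_b = b / d
    Dinv_H = H / d[:, None]
    p = H.shape[1]
    small = np.eye(p) + H.T @ Dinv_H              # only a p x p system
    return Dinv_b - Dinv_H @ np.linalg.solve(small, H.T @ Dinv_b)

# Sanity check against a direct dense solve on a small problem.
rng = np.random.default_rng(0)
n, p = 500, 20
H = rng.standard_normal((n, p))
d = rng.uniform(1.0, 2.0, size=n)
b = rng.standard_normal(n)
x = smw_solve(d, H, b)
assert np.allclose((np.diag(d) + H @ H.T) @ x, b)
```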


algorithmic applications in management | 2009

PLDA: Parallel Latent Dirichlet Allocation for Large-Scale Applications

Yi Wang; Hongjie Bai; Matt Stanton; Wen-Yen Chen; Edward Y. Chang

This paper presents PLDA, our parallel implementation of Latent Dirichlet Allocation on MPI and MapReduce. PLDA smooths out storage and computation bottlenecks and provides fault recovery for lengthy distributed computations. We show that PLDA can be applied to large, real-world applications and achieves good scalability. We have released MPI-PLDA to open source at http://code.google.com/p/plda under the Apache License.
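
The MPI variant follows a replicated-counts pattern: each worker runs collapsed Gibbs sampling over its own shard of documents, then the workers AllReduce their updates to the word-topic count matrix so every replica starts the next iteration from the same global model. Below is a minimal mpi4py sketch of that communication pattern, with the sampler itself stubbed out (run with mpirun; variable names are illustrative, not PLDA's):

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

V, K = 10_000, 50                  # vocabulary size, number of topics
word_topic = np.zeros((V, K))      # global word-topic counts, replicated per worker

def gibbs_sweep_local(word_topic):
    """Stub for one collapsed-Gibbs sweep over this worker's documents.

    A real sampler would resample the topic of every token in the local
    shard; here we only return the (zero) change to the shared counts,
    since the point of the sketch is the communication pattern.
    """
    return np.zeros_like(word_topic)

for it in range(10):
    local_delta = gibbs_sweep_local(word_topic)
    global_delta = np.empty_like(local_delta)
    # Sum every worker's count updates so all replicas stay identical.
    comm.Allreduce(local_delta, global_delta, op=MPI.SUM)
    word_topic += global_delta

if rank == 0:
    print("finished", it + 1, "synchronized iterations")
```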


international world wide web conferences | 2009

Collaborative filtering for orkut communities: discovery of user latent behavior

Wen-Yen Chen; Jon-Chyuan Chu; Junyi Luan; Hongjie Bai; Yi Wang; Edward Y. Chang

Users of social networking services can connect with each other by forming communities for online interaction. Yet as the number of communities hosted by such websites grows over time, users need increasingly effective community recommendations in order to meet more users. In this paper, we investigate two algorithms from very different domains and evaluate their effectiveness for personalized community recommendation. First is association rule mining (ARM), which discovers associations between sets of communities that are shared across many users. Second is latent Dirichlet allocation (LDA), which models user-community co-occurrences using latent aspects. In comparing LDA with ARM, we are interested in discovering whether modeling low-rank latent structure is more effective for recommendation than directly mining rules from the observed data. We experiment on an Orkut data set consisting of 492,104 users and 118,002 communities. Our empirical comparisons using the top-k recommendations metric show that LDA consistently outperforms ARM for the community recommendation task when recommending a list of 4 or more communities; for recommendation lists of up to 3 communities, however, ARM still performs slightly better. We analyze examples of the latent information learned by LDA to explain this finding. To efficiently handle the large-scale data set, we parallelize LDA on distributed computers and demonstrate our parallel implementation's scalability with varying numbers of machines.
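
To make the LDA side concrete: after fitting, each user and each community has a distribution over latent topics, and unseen communities can be ranked for a user by how well the two mixtures align. The scoring rule below (a plain dot product over random toy mixtures) is a hypothetical stand-in, not the paper's exact formulation:

```python
import numpy as np

def recommend_topk(user_topics, community_topics, joined, user, k=4):
    """Rank unseen communities for one user by topic-mixture affinity.

    user_topics:      (num_users, K), each row a topic distribution
    community_topics: (num_comms, K), each row a topic distribution
    joined:           community ids the user already belongs to
    """
    scores = community_topics @ user_topics[user]   # hypothetical dot-product score
    scores[list(joined)] = -np.inf                  # never recommend joined communities
    return np.argsort(scores)[::-1][:k]

rng = np.random.default_rng(1)
U = rng.dirichlet(np.ones(8), size=100)    # 100 users, 8 latent topics
C = rng.dirichlet(np.ones(8), size=500)    # 500 communities
print(recommend_topk(U, C, joined={3, 7}, user=0, k=4))
```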


european conference on machine learning | 2008

Parallel Spectral Clustering

Yangqiu Song; Wen-Yen Chen; Hongjie Bai; Chih-Jen Lin; Edward Y. Chang

Spectral clustering algorithms have been shown to be more effective at finding clusters than most traditional algorithms. However, spectral clustering suffers from a scalability problem in both memory use and computational time when the dataset is large. To perform clustering on large datasets, we propose to parallelize both memory use and computation on distributed computers. Through an empirical study on a large document dataset of 193,844 instances and a large photo dataset of 637,137 instances, we demonstrate that our parallel algorithm can effectively alleviate the scalability problem.
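
For reference, here is the single-machine pipeline being parallelized, in the usual normalized-cut form: embed the points with the bottom-k eigenvectors of the normalized Laplacian, then cluster the embedding with k-means. The dense toy implementation below is a sketch; the parallel algorithm distributes the similarity rows and the eigensolve across machines.

```python
import numpy as np

def spectral_clustering(A, k, kmeans_iters=50, seed=0):
    """Dense toy spectral clustering on a similarity matrix A.

    Embeds the points with the bottom-k eigenvectors of the normalized
    Laplacian, then clusters the embedding with plain k-means.
    """
    d = A.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    L = np.eye(len(A)) - inv_sqrt[:, None] * A * inv_sqrt[None, :]
    _, vecs = np.linalg.eigh(L)                    # eigenvalues ascending
    U = vecs[:, :k]
    U = U / (np.linalg.norm(U, axis=1, keepdims=True) + 1e-12)

    rng = np.random.default_rng(seed)
    centers = U[rng.choice(len(U), k, replace=False)]
    for _ in range(kmeans_iters):
        labels = np.argmin(((U[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = U[labels == j].mean(axis=0)
    return labels

# Two well-separated Gaussian blobs, Gaussian-kernel similarity.
X = np.vstack([np.random.randn(50, 2), np.random.randn(50, 2) + 6])
d2 = ((X[:, None] - X[None]) ** 2).sum(-1)
A = np.exp(-d2 / 2.0)
np.fill_diagonal(A, 0.0)
print(spectral_clustering(A, k=2))
```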


acm multimedia | 2009

Parallel algorithms for mining large-scale rich-media data

Edward Y. Chang; Hongjie Bai; Kaihua Zhu

The amount of online photos and videos is now at the scale of tens of billions. To organize, index, and retrieve these large-scale rich-media data, a system must employ scalable data management and mining algorithms. The research community needs to consider solving large-scale problems rather than problems with small datasets that do not reflect real-life scenarios. This tutorial introduces key challenges in large-scale rich-media data mining, and presents parallel algorithms for tackling such challenges. We present our parallel implementations of Spectral Clustering (PSC), FP-Growth (PFP), Latent Dirichlet Allocation (PLDA), and Support Vector Machines (PSVM).


international conference on multimedia and expo | 2007

Parallel Approximate Matrix Factorization for Kernel Methods

Kaihua Zhu; Hang Cui; Hongjie Bai; Jian Li; Zhihuan Qiu; Hao Wang; Hui Xu; Edward Y. Chang

Kernel methods play a pivotal role in machine learning algorithms. Unfortunately, they must work with an n × n kernel matrix, which is memory intensive. In this paper, we present a parallel, approximate matrix factorization algorithm, which loads only essential data onto individual processors to enable parallel processing of data. Our method reduces the space requirement for the kernel matrix from O(n²) to O(np/m), where n is the amount of data, p the reduced matrix dimension (p << n), and m the number of processors.
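
The companion PSVM chapter names the factorization as incomplete Cholesky, so a generic pivoted incomplete Cholesky sketch illustrates how the n × p factor is built while touching only one kernel column per step. This is a textbook single-machine version, not the paper's parallel code:

```python
import numpy as np

def icf(kernel_col, diag, p):
    """Pivoted incomplete Cholesky: returns G (n x p) with K ~= G @ G.T.

    kernel_col(i) -> column i of the kernel matrix (K is never formed in
                     full, which is the point: O(np) memory, not O(n^2))
    diag          -> the diagonal of K
    p             -> target rank, p << n
    """
    n = len(diag)
    G = np.zeros((n, p))
    residual = np.asarray(diag, dtype=float).copy()
    for j in range(p):
        i = int(np.argmax(residual))               # greedy pivot choice
        piv = np.sqrt(max(residual[i], 1e-12))
        G[:, j] = (kernel_col(i) - G[:, :j] @ G[i, :j]) / piv
        residual = np.maximum(diag - (G[:, :j + 1] ** 2).sum(axis=1), 0.0)
    return G

# Toy check with an RBF kernel (precomputed here only for the demo).
rng = np.random.default_rng(0)
X = rng.standard_normal((300, 5))
K = np.exp(-((X[:, None] - X[None]) ** 2).sum(-1) / 2.0)
G = icf(lambda i: K[:, i], np.ones(300), p=40)
print("relative error:", np.linalg.norm(K - G @ G.T) / np.linalg.norm(K))
```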


Archive | 2011

Scaling Up Machine Learning: PSVM: Parallel Support Vector Machines with Incomplete Cholesky Factorization

Edward Y. Chang; Hongjie Bai; Kaihua Zhu; Hao Wang; Jian Li; Zhihuan Qiu


Archive | 2008

Parallel Spectral Clustering Algorithm for Large-Scale Community Data Mining

Gengxin Miao; Yangqiu Song; Dong Zhang; Hongjie Bai


Scaling up Machine Learning: Parallel and Distributed Approaches | 2011

Scaling Up Machine Learning: Large-Scale Spectral Clustering with Map Reduce and MPI

Wen-Yen Chen; Yangqiu Song; Hongjie Bai; Chih-Jen Lin; Edward Y. Chang

Collaboration


Dive into Hongjie Bai's collaborations.

Top Co-Authors

Wen-Yen Chen
University of California

Yangqiu Song
Hong Kong University of Science and Technology

Chih-Jen Lin
National Taiwan University