Ruoming Jin | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Ruoming Jin is active.

Explore More

Publication

Featured researches published by Ruoming Jin.

international conference on management of data | 2009

3-HOP: a high-compression indexing scheme for reachability query

Ruoming Jin; Yang Xiang; Ning Ruan; David Fuhry

Reachability queries on large directed graphs have attracted much attention recently. The existing work either uses spanning structures, such as chains or trees, to compress the complete transitive closure, or utilizes the 2-hop strategy to describe the reachability. Almost all of these approaches work well for very sparse graphs. However, the challenging problem is that as the ratio of the number of edges to the number of vertices increases, the size of the compressed transitive closure grows very large. In this paper, we propose a new 3-hop indexing scheme for directed graphs with higher density. The basic idea of 3-hop indexing is to use chain structures in combination with hops to minimize the number of structures that must be indexed. Technically, our goal is to find a 3-hop scheme over dense DAGs (directed acyclic graphs) with minimum index size. We develop an efficient algorithm to discover a transitive closure contour, which yields near optimal index size. Empirical studies show that our 3-hop scheme has much smaller index size than state-of-the-art reachability query schemes such as 2-hop and path-tree when DAGs are not very sparse, while our query time is close to path-tree, which is considered to be one of the best reachability query schemes.

IEEE Transactions on Knowledge and Data Engineering | 2005

Shared memory parallelization of data mining algorithms: techniques, programming interface, and performance

Ruoming Jin; Ge Yang; Gagan Agrawal

With recent technological advances, shared memory parallel machines have become more scalable, and offer large main memories and high bus bandwidths. They are emerging as good platforms for data warehousing and data mining. In This work, we focus on shared memory parallelization of data mining algorithms. We have developed a series of techniques for parallelization of data mining algorithms, including full replication, full locking, fixed locking, optimized full locking, and cache-sensitive locking. Unlike previous work on shared memory parallelization of specific data mining algorithms, all of our techniques apply to a large number of popular data mining algorithms. In addition, we propose a reduction-object-based interface for specifying a data mining algorithm. We show how our runtime system can apply any of the techniques we have developed starting from a common specification of the algorithm. We have carried out a detailed evaluation of the parallelization techniques and the programming interface. We have experimented with apriori and fp-tree-based association mining, k-means clustering, k-nearest neighbor classifier, and decision tree construction. The main results from our experiments are as follows: 1) Among full replication, optimized full locking, and cache-sensitive locking, there is no clear winner. Each of these three techniques can outperform others depending upon machine and dataset parameters. These three techniques perform significantly better than the other two techniques. 2) Good parallel efficiency is achieved for each of the four algorithms we experimented with, using our techniques and runtime system. 3) The overhead of the interface is within 10 percent in almost all cases. 4) In the case of decision tree construction, combining different techniques turned out to be crucial for achieving high performance.

international conference on data mining | 2005

An algorithm for in-core frequent itemset mining on streaming data

Ruoming Jin; Gagan Agrawal

Frequent item set mining is a core data mining operation and has been extensively studied over the last decade. This paper takes a new approach for this problem and makes two major contributions. First, we present a one pass algorithm for frequent item set mining, which has deterministic bounds on the accuracy, and does not require any out-of-core summary structure. Second, because our one pass algorithm does not produce any false negatives, it can be easily extended to a two pass accurate algorithm. Our two pass algorithm is very memory efficient, and allows mining of datasets with large number of distinct items and/or very low support levels. Our detailed experimental evaluation on synthetic and real datasets shows the following. First, our one pass algorithm is very accurate in practice. Second, our algorithm requires significantly lower memory than Manku and Motwanis one pass algorithm and the multi-pass Apriori algorithm. Our two pass algorithm outperforms Apriori and FP-tree when the number of distinct items is large and/or support levels are very low. In other cases, it is quite competitive, with possible exception of cases where the average length of frequent item sets is quite high.

knowledge discovery and data mining | 2003

Efficient decision tree construction on streaming data

Ruoming Jin; Gagan Agrawal

Decision tree construction is a well studied problem in data mining. Recently, there has been much interest in mining streaming data. Domingos and Hulten have presented a one-pass algorithm for decision tree construction. Their work uses Hoeffding inequality to achieve a probabilistic bound on the accuracy of the tree constructed.In this paper, we revisit this problem. We make the following two contributions: 1) We present a numerical interval pruning (NIP) approach for efficiently processing numerical attributes. Our results show an average of 39% reduction in execution times. 2) We exploit the properties of the gain function entropy (and gini) to reduce the sample size required for obtaining a given bound on the accuracy. Our experimental results show a 37% reduction in the number of data instances required.

international conference on management of data | 2008

Efficiently answering reachability queries on very large directed graphs

Ruoming Jin; Yang Xiang; Ning Ruan; Haixun Wang

Efficiently processing queries against very large graphs is an important research topic largely driven by emerging real world applications, as diverse as XML databases, GIS, web mining, social network analysis, ontologies, and bioinformatics. In particular, graph reachability has attracted a lot of research attention as reachability queries are not only common on graph databases, but they also serve as fundamental operations for many other graph queries. The main idea behind answering reachability queries in graphs is to build indices based on reachability labels. Essentially, each vertex in the graph is assigned with certain labels such that the reachability between any two vertices can be determined by their labels. Several approaches have been proposed for building these reachability labels; among them are interval labeling (tree cover) and 2-hop labeling. However, due to the large number of vertices in many real world graphs (some graphs can easily contain millions of vertices), the computational cost and (index) size of the labels using existing methods would prove too expensive to be practical. In this paper, we introduce a novel graph structure, referred to as path-tree, to help labeling very large graphs. The path-tree cover is a spanning subgraph of G in a tree shape. We demonstrate both analytically and empirically the effectiveness of our new approaches.

very large data bases | 2011

Distance-constraint reachability computation in uncertain graphs

Ruoming Jin; Lin Liu; Bolin Ding; Haixun Wang

Driven by the emerging network applications, querying and mining uncertain graphs has become increasingly important. In this paper, we investigate a fundamental problem concerning uncertain graphs, which we call the distance-constraint reachability (DCR) problem: Given two vertices s and t, what is the probability that the distance from s to t is less than or equal to a user-defined threshold d in the uncertain graph? Since this problem is #P-Complete, we focus on efficiently and accurately approximating DCR online. Our main results include two new estimators for the probabilistic reachability. One is a Horvitz-Thomson type estimator based on the unequal probabilistic sampling scheme, and the other is a novel recursive sampling estimator, which effectively combines a deterministic recursive computational procedure with a sampling process to boost the estimation accuracy. Both estimators can produce much smaller variance than the direct sampling estimator, which considers each trial to be either 1 or 0. We also present methods to make these estimators more computationally efficient. The comprehensive experiment evaluation on both real and synthetic datasets demonstrates the efficiency and accuracy of our new estimators.

international conference on data mining | 2008

A Topic Modeling Approach and Its Integration into the Random Walk Framework for Academic Search

Jie Tang; Ruoming Jin; Jing Zhang

In this paper, we propose a unified topic modeling approach and its integration into the random walk framework for academic search. Specifically, we present a topic model for simultaneously modeling papers, authors, and publication venues. We combine the proposed topic model into the random walk framework. Experimental results show that our proposed approach for academic search significantly outperforms the baseline methods of using BM25 and language model, and those of using the existing topic models (including pLSI, LDA, and the AT model).

Managing and Mining Graph Data | 2010

A Survey of Algorithms for Dense Subgraph Discovery

Victor E. Lee; Ning Ruan; Ruoming Jin; Charu C. Aggarwal

In this chapter, we present a survey of algorithms for dense subgraph discovery.The problem of dense subgraph discovery is closely related to clustering though the two problems also have a number of differences. For example, the problem of clustering is largely concerned with that of finding a fixed partition in the data, whereas the problem of dense subgraph discovery defines these dense components in a much more flexible way. The problem of dense subgraph discovery may wither be defined over single or multiple graphs. We explore both cases. In the latter case, the problem is also closely related to the problem of the frequent subgraph discovery. This chapter will discuss and organize the literature on this topic effectively in order to make it much more accessible to the reader.

Knowledge and Information Systems | 2006

Fast and exact out-of-core and distributed k -means clustering

Ruoming Jin; Anjan Goswami; Gagan Agrawal

Clustering has been one of the most widely studied topics in data mining and k-means clustering has been one of the popular clustering algorithms. K-means requires several passes on the entire dataset, which can make it very expensive for large disk-resident datasets. In view of this, a lot of work has been done on various approximate versions of k-means, which require only one or a small number of passes on the entire dataset.In this paper, we present a new algorithm, called fast and exact k-means clustering (FEKM), which typically requires only one or a small number of passes on the entire dataset and provably produces the same cluster centres as reported by the original k-means algorithm. The algorithm uses sampling to create initial cluster centres and then takes one or more passes over the entire dataset to adjust these cluster centres. We provide theoretical analysis to show that the cluster centres thus reported are the same as the ones computed by the original k-means algorithm. Experimental results from a number of real and synthetic datasets show speedup between a factor of 2 and 4.5, as compared with k-means.This paper also describes and evaluates a distributed version of FEKM, which we refer to as DFEKM. This algorithm is suitable for analysing data that is distributed across loosely coupled machines. Unlike the previous work in this area, DFEKM provably produces the same results as the original k-means algorithm. Our experimental results show that DFEKM is clearly better than two other possible options for exact clustering on distributed data, which are down loading all data and running sequential k-means or running parallel k-means on a loosely coupled configuration. Moreover, even in a tightly coupled environment, DFEKM can outperform parallel k-means if there is a significant load imbalance.

knowledge discovery and data mining | 2012

Learning personal + social latent factor model for social recommendation

Yelong Shen; Ruoming Jin

Social recommendation, which aims to systematically leverage the social relationships between users as well as their past behaviors for automatic recommendation, attract much attention recently. The belief is that users linked with each other in social networks tend to share certain common interests or have similar tastes (homophily principle); such similarity is expected to help improve the recommendation accuracy and quality. There have been a few studies on social recommendations; however, they almost completely ignored the heterogeneity and diversity of the social relationship. In this paper, we develop a joint personal and social latent factor (PSLF) model for social recommendation. Specifically, it combines the state-of-the-art collaborative filtering and the social network modeling approaches for social recommendation. Especially, the PSLF extracts the social factor vectors for each user based on the state-of-the-art mixture membership stochastic blockmodel, which can explicitly express the varieties of the social relationship. To optimize the PSLF model, we develop a scalable expectation-maximization (EM) algorithm, which utilizes a novel approximate mean-field technique for fast expectation computation. We compare our approach with the latest social recommendation approaches on two real datasets, Flixter and Douban (both with large social networks). With similar training cost, our approach has shown a significant improvement in terms of prediction accuracy criteria over the existing approaches.

Explore More