David Fuhry | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where David Fuhry is active.

Explore More

Publication

Featured researches published by David Fuhry.

international conference on management of data | 2009

3-HOP: a high-compression indexing scheme for reachability query

Ruoming Jin; Yang Xiang; Ning Ruan; David Fuhry

Reachability queries on large directed graphs have attracted much attention recently. The existing work either uses spanning structures, such as chains or trees, to compress the complete transitive closure, or utilizes the 2-hop strategy to describe the reachability. Almost all of these approaches work well for very sparse graphs. However, the challenging problem is that as the ratio of the number of edges to the number of vertices increases, the size of the compressed transitive closure grows very large. In this paper, we propose a new 3-hop indexing scheme for directed graphs with higher density. The basic idea of 3-hop indexing is to use chain structures in combination with hops to minimize the number of structures that must be indexed. Technically, our goal is to find a 3-hop scheme over dense DAGs (directed acyclic graphs) with minimum index size. We develop an efficient algorithm to discover a transitive closure contour, which yields near optimal index size. Empirical studies show that our 3-hop scheme has much smaller index size than state-of-the-art reachability query schemes such as 2-hop and path-tree when DAGs are not very sparse, while our query time is close to path-tree, which is considered to be one of the best reachability query schemes.

international world wide web conferences | 2013

Efficient community detection in large networks using content and links

Yiye Ruan; David Fuhry; Srinivasan Parthasarathy

In this paper we discuss a very simple approach of combining content and link information in graph structures for the purpose of community discovery, a fundamental task in network analysis. Our approach hinges on the basic intuition that many networks contain noise in the link structure and that content information can help strengthen the community signal. This enables ones to eliminate the impact of noise (false positives and false negatives), which is particularly prevalent in online social networks and Web-scale information networks. Specifically we introduce a measure of signal strength between two nodes in the network by fusing their link strength with content similarity. Link strength is estimated based on whether the link is likely (with high probability) to reside within a community. Content similarity is estimated through cosine similarity or Jaccard coefficient. We discuss a simple mechanism for fusing content and link similarity. We then present a biased edge sampling procedure which retains edges that are locally relevant for each graph node. The resulting backbone graph can be clustered using standard community discovery algorithms such as Metis and Markov clustering. Through extensive experiments on multiple real-world datasets (Flickr, Wikipedia and CiteSeer) with varying sizes and characteristics, we demonstrate the effectiveness and efficiency of our methods over state-of-the-art learning and mining approaches several of which also attempt to combine link and content analysis for the purposes of community discovery. Specifically we always find a qualitative benefit when combining content with link analysis. Additionally our biased graph sampling approach realizes a quantitative benefit in that it is typically several orders of magnitude faster than competing approaches.

Data Mining and Knowledge Discovery | 2011

Summarizing transactional databases with overlapped hyperrectangles

Yang Xiang; Ruoming Jin; David Fuhry; Feodor F. Dragan

Transactional data are ubiquitous. Several methods, including frequent itemset mining and co-clustering, have been proposed to analyze transactional databases. In this work, we propose a new research problem to succinctly summarize transactional databases. Solving this problem requires linking the high level structure of the database to a potentially huge number of frequent itemsets. We formulate this problem as a set covering problem using overlapped hyperrectangles (a concept generally regarded as tile according to some existing papers); we then prove that this problem and its several variations are NP-hard, and we further reveal its relationship with the compact representation of a directed bipartite graph. We develop an approximation algorithm Hyper which can achieve a logarithmic approximation ratio in polynomial time. We propose a pruning strategy that can significantly speed up the processing of our algorithm, and we also propose an efficient algorithm Hyper+ to further summarize the set of hyperrectangles by allowing false positive conditions. Additionally, we show that hyperrectangles generated by our algorithms can be properly visualized. A detailed study using both real and synthetic datasets shows the effectiveness and efficiency of our approaches in summarizing transactional databases.

extending database technology | 2009

Efficient skyline computation in metric space

David Fuhry; Ruoming Jin; Donghui Zhang

Given a set of n query points in a general metric space, a metric-space skyline (MSS) query asks what are the closest points to all these query points in the database. Here, consider for any point p, if there are no other points in the database which have less or equal distance to all the query points, then p is denoted as one of the closest points to the query points. This problem is a direct generalization of the recently proposed spatial-skyline query problem, where all the points are located in two or three dimensional Euclidean space. It is also closely related with the nearest neighbor (NN) query, the range query and the common skyline query problem. In this paper, we have developed new algorithms to aggressively prune non-skyline points from the search space. We also contribute two new optimization techniques to reduce the number of distance computations and dominance tests. Our experimental evaluation has shown the effectiveness and efficiency of our approach.

knowledge discovery and data mining | 2008

Succinct summarization of transactional databases: an overlapped hyperrectangle scheme

Yang Xiang; Ruoming Jin; David Fuhry; Feodor F. Dragan

Transactional data are ubiquitous. Several methods, including frequent itemsets mining and co-clustering, have been proposed to analyze transactional databases. In this work, we propose a new research problem to succinctly summarize transactional databases. Solving this problem requires linking the high level structure of the database to a potentially huge number of frequent itemsets. We formulate this problem as a set covering problem using overlapped hyperrectangles; we then prove that this problem and its several variations are NP-hard. We develop an approximation algorithm HYPER which can achieve a ln(k) + 1 approximation ratio in polynomial time. We propose a pruning strategy that can significantly speed up the processing of our algorithm. Additionally, we propose an efficient algorithm to further summarize the set of hyperrectangles by allowing false positive conditions. A detailed study using both real and synthetic datasets shows the effectiveness and efficiency of our approaches in summarizing transactional databases.

Network Modeling Analysis in Health Informatics and BioInformatics | 2012

Merging network patterns: a general framework to summarize biomedical network data

Yang Xiang; David Fuhry; Kamer Kaya; Ruoming Jin; Kun Huang

The ability to summarize a large number of network patterns discovered from biomedical data provides valuable information for use in many applications. We show that several variants of the problem are all NP-hard, and merging network patterns is a practical solution for these applications. In this work, we propose an algorithmic framework for merging network patterns. We have developed fast algorithms under this general framework which supports several types of biomedical network data. In addition, our empirical study demonstrates that our algorithms are efficient in merging a large number of biomedical network patterns and can be configured for various knowledge discovery purposes.

international conference on data mining | 2008

Overlapping Matrix Pattern Visualization: A Hypergraph Approach

Ruoming Jin; Yang Xiang; David Fuhry; Feodor F. Dragan

In this work, we study a visual data mining problem: Given a set of discovered overlapping submatrices of interest, how can we order the rows and columns of the data matrix to best display these submatrices and their relationships? We find this problem can be converted to the hypergraph ordering problem, which generalizes the traditional minimal linear arrangement (or graph ordering) problem and then we are able to prove the NP-hardness of this problem. We propose a novel iterative algorithm which utilize the existing graph ordering algorithm to solve the optimal visualization problem. This algorithm can always converge to a local minimum. The detailed experimental evaluation using a set of publicly available transactional datasets demonstrates the effectiveness and efficiency of the proposed algorithm.

extending database technology | 2009

Estimating the number of frequent itemsets in a large database

Ruoming Jin; Scott McCallen; Yuri Breitbart; David Fuhry; Dong Wang

Estimating the number of frequent itemsets for minimal support α in a large dataset is of great interest from both theoretical and practical perspectives. However, finding not only the number of frequent itemsets, but even the number of maximal frequent itemsets, is #P-complete. In this study, we provide a theoretical investigation on the sampling estimator. We discover and prove several fundamental but also rather surprising properties of the sampling estimator. We also propose a novel algorithm to estimate the number of frequent itemsets without using sampling. Our detailed experimental results have shown the accuracy and efficiency of our proposed approach.

international conference on data engineering | 2015

Towards a parameter-free and parallel itemset mining algorithm in linearithmic time

Gregory Buehrer; Roberto L. de Oliveira; David Fuhry; Srinivasan Parthasarathy

Extracting interesting patterns from large data stores efficiently is a challenging problem in many domains. In the data mining literature, pattern frequency has often been touted as a proxy for interestingness and has been leveraged as a pruning criteria to realize scalable solutions. However, while there exist many frequent pattern algorithms in the literature, all scale exponentially in the worst case, restricting their utility on very large data sets. Furthermore, as we theoretically argue in this article, the problem is very hard to approximate within a reasonable factor, with a polynomial time algorithm. As a counter point to this theoretical result, we present a practical algorithm called Localized Approximate Miner (LAM) that scales linearithmically with the input data. Instead of fully exploring the top of the search lattice to a user-defined point, as traditional mining algorithms do, we instead explore different parts of the complete lattice, efficiently. The key to this efficient exploration is the reliance on min-wise independent permutations to collect the data into highly similar subsets of a partition. It is straightforward to implement and scales to very large data sets. We illustrate its utility on a range of data sets, and demonstrate that the algorithm finds more patterns of higher utility in much less time than several state-of-the-art algorithms. Moreover, we realize a natural multi-level parallelization of LAM that further reduces runtimes by up to 193-fold when leveraging 256 CMP cores spanning 32 machines.

Archive | 2015

Community Discovery: Simple and Scalable Approaches

Yiye Ruan; David Fuhry; Jiongqian Liang; Yu Wang; Srinivasan Parthasarathy

The increasing size and complexity of online social networks have brought distinct challenges to the task of community discovery. A community discovery algorithm needs to be efficient, not taking a prohibitive amount of time to finish. The algorithm should also be scalable, capable of handling large networks containing billions of edges or even more. Furthermore, a community discovery algorithm should be effective in that it produces community assignments of high quality. In this chapter, we present a selection of algorithms that follow simple design principles, and have proven highly effective and efficient according to extensive empirical evaluations. We start by discussing a generic approach of community discovery by combining multilevel graph contraction with core clustering algorithms. Next we describe the usage of network sampling in community discovery, where the goal is to reduce the number of nodes and/or edges while retaining the network’s underlying community structure. Finally, we review research efforts that leverage various parallel and distributed computing paradigms in community discovery, which can facilitate finding communities in tera- and peta-scale networks.

Explore More