Keith Henderson
Lawrence Livermore National Laboratory
Publication
Featured research published by Keith Henderson.
conference on high performance computing (supercomputing) | 2005
Andy Yoo; Edmond Chow; Keith Henderson; Will McLendon; Bruce Hendrickson
Many emerging large-scale data science applications require searching large graphs distributed across multiple memories and processors. This paper presents a distributed breadth-first search (BFS) scheme that scales for random graphs with up to three billion vertices and 30 billion edges. Scalability was tested on IBM BlueGene/L with 32,768 nodes at the Lawrence Livermore National Laboratory. Scalability was obtained through a series of optimizations, in particular, those that ensure scalable use of memory. We use 2D (edge) partitioning of the graph instead of conventional 1D (vertex) partitioning to reduce communication overhead. For Poisson random graphs, we show that the expected size of the messages is scalable for both 2D and 1D partitionings. Finally, we have developed efficient collective communication functions for the 3D torus architecture of BlueGene/L that also take advantage of the structure in the problem. The performance and characteristics of the algorithm are measured and reported.
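For readers unfamiliar with the level-synchronized pattern the paper scales up, here is a minimal serial sketch; the function name and adjacency dict are illustrative, and the distributed version 2D-partitions the edges across processors and replaces the inner loop with collective communication.

```python
def level_synchronized_bfs(adj, source):
    """Serial sketch of level-synchronized BFS: expand the whole frontier
    one level at a time. adj maps vertex -> iterable of neighbors."""
    level = {source: 0}
    frontier = [source]
    depth = 0
    while frontier:
        depth += 1
        next_frontier = []
        # In the distributed 2D-partitioned version, each processor expands
        # its share of the frontier here, and newly discovered vertices are
        # exchanged in one collective communication step per level.
        for u in frontier:
            for v in adj.get(u, ()):
                if v not in level:
                    level[v] = depth
                    next_frontier.append(v)
        frontier = next_frontier
    return level
```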
acm symposium on applied computing | 2009
Keith Henderson; Tina Eliassi-Rad
This paper introduces LDA-G, a scalable Bayesian approach to finding latent group structures in large real-world graph data. Existing Bayesian approaches for group discovery (such as Infinite Relational Models) have only been applied to small graphs with a couple of hundred nodes. LDA-G (short for Latent Dirichlet Allocation for Graphs) utilizes a well-known topic modeling algorithm to find latent group structure. Specifically, we modify Latent Dirichlet Allocation (LDA) to operate on graph data instead of text corpora. Our modifications reflect the differences between real-world graph data and text corpora (e.g., a node's neighbor count vs. a document's word count). In our empirical study, we apply LDA-G to several large graphs (with thousands of nodes) from PubMed (a scientific publication repository). We compare LDA-G's quantitative performance on link prediction with two existing approaches: one Bayesian (namely, Infinite Relational Model) and one non-Bayesian (namely, Cross-association). On average, LDA-G outperforms IRM by 15% and Cross-association by 25% (in terms of area under the ROC curve). Furthermore, we demonstrate that LDA-G can discover useful qualitative information.
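The core move is easy to picture with off-the-shelf tools: treat each node's neighbor list as a "document" whose "words" are neighbor ids, then run standard LDA on the resulting count matrix. A toy sketch under that reading, using scikit-learn's LDA rather than the authors' modified model:

```python
import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

def lda_g_toy(edges, n_nodes, n_groups):
    """Toy stand-in for LDA-G: node = document, neighbor id = word.
    The paper's actual modifications to LDA for graph data go further."""
    # Dense toy count matrix; real graphs would use a sparse matrix.
    counts = np.zeros((n_nodes, n_nodes))
    for u, v in edges:
        counts[u, v] += 1
        counts[v, u] += 1  # undirected: count both directions
    lda = LatentDirichletAllocation(n_components=n_groups, random_state=0)
    theta = lda.fit_transform(counts)  # per-node mixture over latent groups
    return theta.argmax(axis=1)        # hard group assignment per node
```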
international world wide web conferences | 2012
Ryan A. Rossi; Brian Gallagher; Jennifer Neville; Keith Henderson
To understand the structural dynamics of a large-scale social, biological or technological network, it may be useful to discover behavioral roles representing the main connectivity patterns present over time. In this paper, we propose a scalable non-parametric approach to automatically learn the structural dynamics of the network and individual nodes. Roles may represent structural or behavioral patterns such as the center of a star, peripheral nodes, or bridge nodes that connect different communities. Our novel approach learns the appropriate structural role dynamics for any arbitrary network and tracks the changes over time. In particular, we uncover the specific global network dynamics and the local node dynamics of a technological, communication, and social network. We identify interesting node and network patterns such as stationary and non-stationary roles, spikes/steps in role-memberships (perhaps indicating anomalies), increasing/decreasing role trends, among many others. Our results indicate that the nodes in each of these networks have distinct connectivity patterns that are non-stationary and evolve considerably over time. Overall, the experiments demonstrate the effectiveness of our approach for fast mining and tracking of the dynamics in large networks. Furthermore, the dynamic structural representation provides a basis for building more sophisticated models and tools that are fast for exploring large dynamic networks.
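One common way to realize this kind of analysis is to extract structural features per node per snapshot and factorize them with roles shared across time. The sketch below uses NMF as a stand-in; it is not the paper's exact model, which is non-parametric and selects the number of roles automatically.

```python
import numpy as np
from sklearn.decomposition import NMF

def role_memberships_over_time(feature_mats, n_roles):
    """Sketch of role discovery over time: stack per-snapshot node-feature
    matrices, factorize once so roles are shared across time, then split
    the membership matrix back into one block per snapshot.

    feature_mats: list of (n_nodes x n_features) nonnegative arrays, one
    per snapshot (e.g., degree, clustering coefficient, egonet sizes).
    """
    stacked = np.vstack(feature_mats)
    model = NMF(n_components=n_roles, init="nndsvd", random_state=0)
    memberships = model.fit_transform(stacked)  # (T*n_nodes) x n_roles
    n_nodes = feature_mats[0].shape[0]
    # Tracking how a node's row changes across these blocks exposes
    # stationary vs. non-stationary roles, spikes, and trends.
    return [memberships[t * n_nodes:(t + 1) * n_nodes]
            for t in range(len(feature_mats))]
```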
international conference on cluster computing | 2006
Timothy D. R. Hartley; Füsun Özgüner; Andy Yoo; Scott R. Kohn; Keith Henderson
This paper presents a middleware framework for storing, accessing and analyzing massive-scale semantic graphs. The framework, MSSG, targets scale-free semantic graphs with O(10^12) (trillion) vertices and edges. Here, we present the overall architectural design of the framework, as well as a prototype implementation for cluster architectures. The sheer size of these massive-scale semantic graphs prohibits storing the entire graph in memory even on medium- to large-scale parallel architectures. We therefore propose a new graph database, grDB, for the efficient storage and retrieval of large scale-free semantic graphs on secondary storage. This new database supports the efficient and scalable execution of parallel out-of-core graph algorithms which are essential for analyzing semantic graphs of massive size. We have also developed a parallel out-of-core breadth-first search algorithm for performance study. To the best of our knowledge, it is the first of such algorithms reported in the literature. Experimental evaluations on large real-world semantic graphs show that the MSSG framework scales well, and grDB outperforms widely used open-source out-of-core databases, such as BerkeleyDB and MySQL, in the storage and retrieval of scale-free graphs.
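The storage idea, keeping adjacency lists on disk behind an indexed key lookup so that only the current frontier's neighbor lists are pulled into memory, can be illustrated with a stdlib toy. sqlite3 here is purely illustrative; grDB is a custom database.

```python
import sqlite3

class TinyGraphDB:
    """Toy out-of-core adjacency store: adjacency lists live on secondary
    storage, keyed by vertex id. Illustrates the access pattern grDB
    serves, not its actual design."""

    def __init__(self, path):
        self.conn = sqlite3.connect(path)
        self.conn.execute("CREATE TABLE IF NOT EXISTS adj (u INTEGER, v INTEGER)")
        self.conn.execute("CREATE INDEX IF NOT EXISTS adj_u ON adj (u)")

    def add_edge(self, u, v):
        self.conn.execute("INSERT INTO adj VALUES (?, ?)", (u, v))

    def neighbors(self, u):
        # One indexed lookup per frontier vertex keeps memory use bounded.
        return [v for (v,) in self.conn.execute(
            "SELECT v FROM adj WHERE u = ?", (u,))]

    def close(self):
        self.conn.commit()
        self.conn.close()
```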
ieee international conference on high performance computing data and analytics | 2008
Bronis R. de Supinski; Martin Schulz; Vasily V. Bulatov; William H. Cabot; Bor Chan; Andrew W. Cook; Erik W. Draeger; James N. Glosli; Jeffrey Greenough; Keith Henderson; Alison Kubota; Steve Louis; Brian Miller; Mehul Patel; Thomas E. Spelce; Frederick H. Streitz; Peter L. Williams; Robert Kim Yates; Andy Yoo; George S. Almasi; Gyan Bhanot; Alan Gara; John A. Gunnels; Manish Gupta; José E. Moreira; James C. Sexton; Bob Walkup; Charles J. Archer; Francois Gygi; Timothy C. Germann
BlueGene/L (BG/L), developed through a partnership between IBM and Lawrence Livermore National Laboratory (LLNL), is currently the world's largest system both in terms of scale, with 131,072 processors, and absolute performance, with a peak rate of 367 Tflop/s. BG/L has led the last four Top500 lists with a Linpack rate of 280.6 Tflop/s for the full machine installed at LLNL and is expected to remain the fastest computer in the next few editions. However, the real value of a machine such as BG/L derives from the scientific breakthroughs that real applications can produce by successfully using its unprecedented scale and computational power. In this paper, we describe our experiences with eight large scale applications on BG/L from several application domains, ranging from molecular dynamics to dislocation dynamics and turbulence simulations to searches in semantic graphs. We also discuss the challenges we faced when scaling these codes and present several successful optimization techniques. All applications show excellent scaling behavior, even at very large processor counts, with one code even achieving a sustained performance of more than 100 Tflop/s, clearly demonstrating the real success of the BG/L design.
Archive | 2005
Edmond Chow; Keith Henderson; Andy Yoo
Many emerging large-scale data science applications require searching large graphs distributed across multiple memories and processors. This paper presents a scalable implementation of distributed breadth-first search (BFS) which has been applied to graphs with over one billion vertices. The main contribution of this paper is to compare a 2-D (edge) partitioning of the graph to the more common 1-D (vertex) partitioning. For Poisson random graphs which have low diameter like many realistic information network data, we determine when one type of partitioning is advantageous over the other. Also for Poisson random graphs, we show that memory use is scalable. The experimental tests use a level-synchronized BFS algorithm running on a large Linux cluster and BlueGene/L. On the latter machine, the timing is related to the number of synchronization steps in the algorithm.
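The distinction between the two partitionings is easy to state concretely. A sketch of the ownership functions under a simple block layout (illustrative, not the paper's exact mapping):

```python
def owner_1d(u, n_vertices, p):
    """Conventional 1D (vertex) partitioning: vertex u and all of its
    edges belong to exactly one of p processors."""
    return u * p // n_vertices

def owner_2d(u, v, n_vertices, r, c):
    """2D (edge) partitioning on an r x c processor grid: rows of the
    adjacency matrix are split among the r processor rows, columns among
    the c processor columns, so edge (u, v) lands on one grid cell."""
    return (u * r // n_vertices, v * c // n_vertices)
```

The payoff is in communication: under 1D partitioning a processor expanding its frontier may need to notify any of the p peers, while under 2D partitioning each exchange is confined to one processor row and one processor column, roughly O(sqrt(p)) partners.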
acm symposium on applied computing | 2015
Keith Henderson; Brian Gallagher; Tina Eliassi-Rad
Given a collection of m continuous-valued, one-dimensional empirical probability distributions {P1, ..., Pm}, how can we cluster these distributions efficiently with a nonparametric approach? Such problems arise in many real-world settings where keeping the moments of the distribution is not appropriate, because either some of the moments are not defined or the distributions are heavy-tailed or bi-modal. Examples include mining distributions of inter-arrival times and phone-call lengths. We present an efficient algorithm with a non-parametric model for clustering empirical, one-dimensional, continuous probability distributions. Our algorithm, called ep-means, is based on the Earth Mover's Distance and k-means clustering. We illustrate the utility of ep-means on various data sets and applications. In particular, we demonstrate that ep-means effectively and efficiently clusters probability distributions of mixed and arbitrary shapes, recovering ground-truth clusters exactly in cases where existing methods perform at baseline accuracy. We also demonstrate that ep-means outperforms moment-based classification techniques and discovers useful patterns in a variety of real-world applications.
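A minimal sketch of the idea, relying on the standard fact that for one-dimensional distributions the Earth Mover's Distance equals the L1 distance between quantile functions. The discretization and parameter names are illustrative, not taken from the paper.

```python
import numpy as np

def ep_means_sketch(samples, k, n_quantiles=100, n_iter=20, seed=0):
    """k-means-style clustering of 1-D empirical distributions, each
    represented by its quantile function sampled on a fixed grid, so
    distances between rows approximate the EMD between distributions."""
    grid = np.linspace(0.0, 1.0, n_quantiles)
    Q = np.array([np.quantile(s, grid) for s in samples])
    rng = np.random.default_rng(seed)
    centers = Q[rng.choice(len(Q), size=k, replace=False)]
    for _ in range(n_iter):
        # Assign each distribution to the nearest center under EMD (L1
        # between quantile vectors).
        dists = np.abs(Q[:, None, :] - centers[None, :, :]).mean(axis=2)
        labels = dists.argmin(axis=1)
        # Averaging quantile functions is a simple centroid choice (exact
        # for the Wasserstein-2 barycenter; a reasonable proxy under EMD).
        centers = np.array([Q[labels == j].mean(axis=0)
                            if np.any(labels == j) else centers[j]
                            for j in range(k)])
    return labels
```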
social computing behavioral modeling and prediction | 2010
Tina Eliassi-Rad; Keith Henderson
We introduce a new approach to literature search that is based on finding mixed-membership communities on an augmented co-authorship graph (ACA) with a scalable generative model. An ACA graph contains two types of edges: (1) coauthorship links and (2) links between researchers with substantial expertise overlap. Our solution eliminates the biases introduced by either looking at citations of a paper or doing a Web search. A case study on PubMed shows the benefits of our approach.
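A toy construction of the ACA graph itself; the Jaccard threshold on expertise terms is an illustrative stand-in for the paper's overlap criterion.

```python
import networkx as nx

def build_aca_graph(coauthor_pairs, expertise, overlap_threshold=0.5):
    """Augmented co-authorship graph with two edge types: co-authorship
    links, plus links between researchers whose expertise term sets
    overlap substantially (Jaccard similarity here, as a stand-in)."""
    g = nx.Graph()
    for a, b in coauthor_pairs:
        g.add_edge(a, b, kind="coauthor")
    authors = list(expertise)
    for i, a in enumerate(authors):
        for b in authors[i + 1:]:
            inter = len(expertise[a] & expertise[b])
            union = len(expertise[a] | expertise[b])
            if union and inter / union >= overlap_threshold \
                    and not g.has_edge(a, b):
                g.add_edge(a, b, kind="expertise")
    return g
```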
conference on high performance computing (supercomputing) | 2006
Andy Yoo; Keith Henderson
The lack of publicly available large scale-free graphs forces researchers studying massive scale-free graphs to rely on synthetically generated graphs in testing and evaluating their algorithms. This requires a graph generator that can scale to graphs with potentially tens or hundreds of billions of vertices and edges. We have developed two such scalable parallel graph generators in this research. The parallel Barabasi-Albert method iteratively builds scale-free graphs using a two-phase preferential attachment technique in a bottom-up fashion. The parallel Kronecker method, on the other hand, constructs a graph recursively in a top-down fashion from a given seed graph using the Kronecker matrix multiplication. We show that both graph generators generate massive graphs at a very high rate. It is also shown that graphs generated by these methods have all the common properties of real scale-free graphs, such as a power-law degree distribution and small-worldness.
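The Kronecker half of the story can be sketched in a few lines at small scale: take the k-th Kronecker power of a small seed probability matrix and sample each edge independently. The paper's parallel generator places edges recursively without ever materializing this matrix, so the sketch below only works for modest k.

```python
import numpy as np

def kronecker_graph(seed, k, rng=None):
    """Toy serial Kronecker generator: form the k-th Kronecker power of a
    seed probability matrix, then sample every edge independently.
    Returns the sampled edge list as an (n_edges, 2) index array."""
    rng = rng or np.random.default_rng(0)
    probs = seed
    for _ in range(k - 1):
        probs = np.kron(probs, seed)
    return np.argwhere(rng.random(probs.shape) < probs)

# Example: a classic 2x2 seed; k=10 gives a 1024-vertex graph.
edges = kronecker_graph(np.array([[0.9, 0.5], [0.5, 0.1]]), 10)
```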
international conference on social computing | 2011
Tina Eliassi-Rad; Keith Henderson
Given a network, we are interested in ranking sets of nodes that score highest on user-specified criteria. For instance in graphs from bibliographic data (e.g. PubMed), we would like to discover sets of authors with expertise in a wide range of disciplines. We present this ranking task as a Top-K problem; utilize fixed-memory heuristic search; and present performance of both the serial and distributed search algorithms on synthetic and real-world data sets.
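As a baseline for what the search is doing, here is the naive version of the Top-K task with a bounded min-heap; the paper's contribution is replacing this exhaustive enumeration with fixed-memory heuristic search. The score callable stands in for the user-specified criteria.

```python
import heapq
import itertools

def top_k_sets(nodes, score, k, set_size):
    """Naive Top-K baseline: score every node set of a given size and
    keep the k best in a min-heap of bounded size k."""
    heap = []  # min-heap of (score, node_set); heap[0] is the worst kept
    for subset in itertools.combinations(nodes, set_size):
        s = score(subset)
        if len(heap) < k:
            heapq.heappush(heap, (s, subset))
        elif s > heap[0][0]:
            heapq.heapreplace(heap, (s, subset))
    return sorted(heap, reverse=True)  # best-first (score, node_set) pairs
```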