Steve Harenberg | Researchain

Publication

Featured research published by Steve Harenberg.

siam international conference on data mining | 2016

A Scalable Approach for Outlier Detection in Edge Streams Using Sketch-based Approximations.

Stephen Ranshous; Steve Harenberg; Kshitij Sharma; Nagiza F. Samatova

Dynamic graphs are a powerful way to model an evolving set of objects and their ongoing interactions. A broad spectrum of systems, such as information, communication, and social, are naturally represented by dynamic graphs. Outlier (or anomaly) detection in dynamic graphs can provide unique insights into the relationships of objects and identify novel or emerging relationships. To date, outlier detection in dynamic graphs has been studied in the context of graph streams, focusing on the analysis and comparison of entire graph objects. However, the volume and velocity of data are necessitating a transition from outlier detection in the context of graph streams to outlier detection in the context of edge streams, where the stream consists of individual graph edges instead of entire graph objects. In this paper, we propose the first approach for outlier detection in edge streams. We first describe a high-level model for outlier detection based on global and local structural properties of a stream. We propose a novel application of the Count-Min sketch for approximating these properties, and prove probabilistic error bounds on our edge outlier scoring functions. Our sketch-based implementation provides a scalable solution, having constant time updates and constant space requirements. Experiments on synthetic and real-world datasets demonstrate our method’s scalability, effectiveness for discovering outliers, and the effects of approximation.
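
To illustrate the constant-time updates and constant space mentioned in the abstract above, here is a minimal sketch of a generic Count-Min sketch used to count edges in a stream. It is not the authors' outlier scoring function; the class name `CountMinSketch`, the `width` and `depth` defaults, and the choice of salted BLAKE2b hashes are all assumptions made for illustration.

```python
import hashlib


class CountMinSketch:
    """Fixed-size frequency table; estimates can overcount but never undercount."""

    def __init__(self, width=1024, depth=4):
        self.width = width
        self.depth = depth
        # depth rows of width counters: space is independent of stream length.
        self.table = [[0] * width for _ in range(depth)]

    def _cells(self, item):
        # One counter per row, chosen by a hash salted with the row index.
        data = repr(item).encode()
        for row in range(self.depth):
            salt = row.to_bytes(8, "little")
            digest = hashlib.blake2b(data, salt=salt).digest()
            yield row, int.from_bytes(digest[:8], "little") % self.width

    def add(self, item, count=1):
        # Each update touches exactly `depth` counters.
        for row, col in self._cells(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only inflate counters, so the minimum is the tightest bound.
        return min(self.table[row][col] for row, col in self._cells(item))


# Hypothetical edge stream: each element is a single (source, target) edge.
stream = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "b")]
sketch = CountMinSketch(width=256, depth=4)
for edge in stream:
    sketch.add(edge)
```

After processing the stream, `sketch.estimate(("a", "b"))` is at least 3, the true count; a larger `width` lowers the chance that hash collisions inflate it.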

advanced data mining and applications | 2016

Community Detection in Dynamic Attributed Graphs

Gonzalo Bello; Steve Harenberg; Abhishek Agrawal; Nagiza F. Samatova

Community detection is one of the most widely studied tasks in network analysis because community structures are ubiquitous across real-world networks. These real-world networks are often both attributed and dynamic in nature. In this paper, we propose a community detection algorithm for dynamic attributed graphs that, unlike existing community detection methods, incorporates both temporal and attribute information along with the structural properties of the graph. Our proposed algorithm handles graphs with heterogeneous attribute types, as well as changes to both the structure and the attribute information, which is essential for its applicability to real-world networks. We evaluated our proposed algorithm on a variety of synthetically generated benchmark dynamic attributed graphs, as well as on large-scale real-world networks. The results obtained show that our proposed algorithm is able to identify graph partitions of high modularity and high attribute similarity more efficiently than state-of-the-art methods for community detection.
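
The abstract above evaluates partitions by modularity. As a reference point, the following is a minimal sketch of the standard modularity score for an undirected, unweighted graph, Q = Σ_c [L_c/m − (d_c/2m)²], where L_c is the number of edges inside community c, d_c is the total degree of its nodes, and m is the number of edges. It ignores the temporal and attribute information the paper adds; the function name and input layout are assumptions.

```python
def modularity(edges, community):
    """Modularity of a partition of an undirected, unweighted graph.

    edges: list of (u, v) pairs; community: dict mapping node -> community label.
    """
    m = len(edges)
    if m == 0:
        raise ValueError("modularity is undefined for a graph with no edges")
    internal = {}  # edges with both endpoints in the same community
    degree = {}    # summed node degree per community
    for u, v in edges:
        cu, cv = community[u], community[v]
        degree[cu] = degree.get(cu, 0) + 1
        degree[cv] = degree.get(cv, 0) + 1
        if cu == cv:
            internal[cu] = internal.get(cu, 0) + 1
    return sum(internal.get(c, 0) / m - (d / (2 * m)) ** 2
               for c, d in degree.items())


# Two triangles joined by a single bridge edge (hypothetical example graph).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
by_triangle = {0: "A", 1: "A", 2: "A", 3: "B", 4: "B", 5: "B"}
```

Splitting this example graph into its two triangles gives Q = 5/14, while placing every node in one community gives Q = 0.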

cluster computing and the grid | 2016

AMRZone: A Runtime AMR Data Sharing Framework for Scientific Applications

Wenzhao Zhang; Houjun Tang; Steve Harenberg; Surendra Byna; Xiaocheng Zou; Dharshi Devendran; Daniel F. Martin; Kesheng Wu; Bin Dong; Scott Klasky; Nagiza F. Samatova

Frameworks that facilitate runtime data sharing across multiple applications are of great importance for scientific data analytics. Although existing frameworks work well over uniform mesh data, they cannot effectively handle adaptive mesh refinement (AMR) data. The challenges in constructing an AMR-capable framework include: (1) designing an architecture that facilitates online AMR data management, (2) achieving a load-balanced AMR data distribution for the data staging space at runtime, and (3) building an effective online index to support the unique spatial data retrieval requirements for AMR data. Towards addressing these challenges to support runtime AMR data sharing across scientific applications, we present the AMRZone framework. Experiments over real-world AMR datasets demonstrate AMRZone’s effectiveness at achieving a balanced workload distribution, reading/writing large-scale datasets with thousands of parallel processes, and satisfying queries with spatial constraints. Moreover, AMRZone’s performance and scalability are even comparable with existing state-of-the-art work when tested over uniform mesh data with up to 16384 cores; in the best case, our framework achieves a 46% performance improvement.

cluster computing and the grid | 2016

Usage Pattern-Driven Dynamic Data Layout Reorganization

Houjun Tang; Surendra Byna; Steve Harenberg; Xiaocheng Zou; Wenzhao Zhang; Kesheng Wu; Bin Dong; Oliver Rübel; Kristofer E. Bouchard; Scott Klasky; Nagiza F. Samatova

As scientific simulations and experiments move toward extremely large scales and generate massive amounts of data, the data access performance of analytic applications becomes crucial. A mismatch often happens between write and read patterns of data accesses, typically resulting in poor read performance. Data layout reorganization has been used to improve the locality of data accesses. However, current data reorganizations are static and focus on generating a single (or set of) optimized layouts that rely on prior knowledge of exact future access patterns. We propose a framework that dynamically recognizes the data usage patterns, replicates the data of interest in multiple reorganized layouts that would benefit common read patterns, and makes runtime decisions on selecting a favorable layout for a given read pattern. This framework supports reading individual elements and chunks of a multi-dimensional array of variables. Our pattern-driven layout selection strategy achieves multi-fold speedups compared to reading from the original dataset.

advanced data mining and applications | 2016

Causality-Guided Feature Selection

Mandar S. Chaudhary; Doel L. Gonzalez; Gonzalo Bello; Michael P. Angus; Dhara Desai; Steve Harenberg; P. Murali Doraiswamy; Fredrick H. M. Semazzi; Vipin Kumar; Nagiza F. Samatova

Identifying meaningful features that drive a phenomenon (response) of interest in complex systems of interconnected factors is a challenging problem. Causal discovery methods have been previously applied to estimate bounds on causal strengths of factors on a response or to identify meaningful interactions between factors in complex systems, but these approaches have been used only for inferential purposes. In contrast, we posit that interactions between factors with a potential causal association on a given response could be viable candidates not only for hypothesis generation but also for predictive modeling. In this work, we propose a causality-guided feature selection methodology that identifies factors having a potential cause-effect relationship in complex systems, and selects features by clustering them based on their causal strength with respect to the response. To this end, we estimate statistically significant causal effects on the response of factors taking part in potential causal relationships, while addressing associated technical challenges, such as multicollinearity in the data. We validate the proposed methodology for predicting response in five real-world datasets from the domains of climate science and biology. The selected features show predictive skill and consistent performance across different domains.

siam international conference on data mining | 2014

Memory-efficient query-driven community detection with application to complex disease associations

Steve Harenberg; Ramona G. Seay; Stephen Ranshous; Kanchana Padmanabhan; Jitendra K. Harlalka; Eric R. Schendel; Michael P. O'Brien; Rada Chirkova; William Hendrix; Alok N. Choudhary; Vipin Kumar; Murali Doraiswamy; Nagiza F. Samatova

Community detection in real-world graphs presents a number of challenges. First, even if the number of detected communities grows linearly with the graph size, it becomes impossible to manually inspect each community for value added to the application knowledge base. Mining for communities with query nodes as knowledge priors could allow for filtering out irrelevant information and for enriching end users’ knowledge associated with the problem of interest, such as discovery of genes functionally associated with the Alzheimer’s disease (AD) biomarker genes. Second, the data-intensive nature of community enumeration challenges current approaches that often assume that the input graph and the detected communities fit in memory. As computer systems scale, DRAM memory sizes are not expected to increase linearly, while technologies such as SSD memories have the potential to provide much higher capacities at a lower power-cost point, and have a much lower latency than disks. Out-of-core algorithms and/or database-inspired indexing could provide an opportunity for different design optimizations for query-driven community detection algorithms tuned for emerging architectures. Therefore, this work addresses the need for query-driven and memory-efficient community detection. Using maximal cliques as the community definition, due to their high signal-to-noise ratio, we propose and systematically compare two contrasting methods: index-based and out-of-core. Both methods improve peak memory efficiency as much as 1000X compared to the state-of-the-art. However, the index-based method, which also has a 10-to-100-fold run time reduction, outperforms the out-of-core algorithm in most cases. The achieved scalability enables the discovery of diseases that are known to be or likely associated with Alzheimer’s when the genome-scale network is mined with AD biomarker genes as knowledge priors.

ieee international conference semantic computing | 2017

A Lifelong Learning Topic Model Structured Using Latent Embeddings

Mingyang Xu; Ruixin Yang; Steve Harenberg; Nagiza F. Samatova

We propose a latent-embedding-structured lifelong learning topic model, called the LLT model, to discover coherent topics from a corpus. Specifically, we exploit latent word embeddings to structure our model and mine word correlation knowledge to assist in topic modeling. During each learning iteration, our model learns new word embeddings based on the topics generated in the previous learning iteration. Experimental results demonstrate that our LLT model is able to generate more coherent topics than state-of-the-art methods.

international conference on parallel processing | 2016

In Situ Storage Layout Optimization for AMR Spatio-temporal Read Accesses

Houjun Tang; Suren Byna; Steve Harenberg; Wenzhao Zhang; Xiaocheng Zou; Daniel F. Martin; Bin Dong; Dharshi Devendran; Kesheng Wu; David Trebotich; Scott Klasky; Nagiza F. Samatova

Analyses of large simulation data often concentrate on regions in space and in time that contain important information. As simulations adopt Adaptive Mesh Refinement (AMR), the data records from a region of interest could be widely scattered on storage devices and accessing interesting regions results in significantly reduced I/O performance. In this work, we study the organization of block-structured AMR data on storage to improve performance of spatio-temporal data accesses. AMR has a complex hierarchical multi-resolution data structure that does not fit easily with the existing approaches that focus on uniform mesh data. To enable efficient AMR read accesses, we develop an in situ data layout optimization framework. Our framework automatically selects from a set of candidate layouts based on a performance model, and reorganizes the data before writing to storage. We evaluate this framework with three AMR datasets and access patterns derived from scientific applications. Our performance model is able to identify the best layout scheme and yields up to a 3X read performance improvement compared to the original layout. Though it is not possible to turn all read accesses into contiguous reads, we are able to achieve 90% of contiguous read throughput with the optimized layouts on average.

advanced data mining and applications | 2016

Knowledge-Guided Maximal Clique Enumeration

Steve Harenberg; Ramona G. Seay; Gonzalo Bello; Rada Chirkova; P. Murali Doraiswamy; Nagiza F. Samatova

Maximal clique enumeration is a long-standing problem in graph mining and knowledge discovery. Numerous classic algorithms exist for solving this problem. However, these algorithms focus on enumerating all maximal cliques, which may be computationally impractical and much of the output may be irrelevant to the user. To address this issue, we introduce the problem of knowledge-biased clique enumeration, a query-driven formulation that reduces output space, computation time, and memory usage. Moreover, we introduce a dynamic state space indexing strategy for efficiently processing multiple queries over the same graph. This strategy reduces redundant computations by dynamically indexing the constituent state space generated with each query. Experimental results over real-world networks demonstrate this strategy’s effectiveness at reducing the cumulative query-response time. Although developed in the context of maximal cliques, our techniques could possibly be generalized to other constraint-based graph enumeration tasks.
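
To make the query-driven idea in the abstract above concrete, here is a minimal sketch that enumerates only the maximal cliques containing a given query node, by seeding the classic Bron-Kerbosch recursion with that node. It is a textbook technique shown for illustration, not the paper's knowledge-biased formulation or its state space indexing strategy; the function name and the adjacency-dictionary input are assumptions.

```python
def cliques_containing(graph, query):
    """Yield every maximal clique of `graph` that contains `query`.

    graph: dict mapping each node to the set of its neighbors (undirected).
    """
    def expand(clique, candidates, excluded):
        # No node can extend the clique, and none was already covered: maximal.
        if not candidates and not excluded:
            yield frozenset(clique)
            return
        for v in list(candidates):
            yield from expand(clique | {v},
                              candidates & graph[v],
                              excluded & graph[v])
            # v is fully explored; move it from candidates to excluded.
            candidates = candidates - {v}
            excluded = excluded | {v}

    # Seeding with the query restricts the search to its neighborhood.
    yield from expand({query}, set(graph[query]), set())


# Hypothetical graph: triangle a-b-c, plus a path c-d-e.
graph = {
    "a": {"b", "c"},
    "b": {"a", "c"},
    "c": {"a", "b", "d"},
    "d": {"c", "e"},
    "e": {"d"},
}
found = set(cliques_containing(graph, "c"))
```

For this example graph, `found` holds the two maximal cliques that include `"c"`, namely {a, b, c} and {c, d}; the clique {d, e} is never generated because the search only explores neighbors of the query.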

international conference on parallel processing | 2013

A generic high-performance method for deinterleaving scientific data

Eric R. Schendel; Steve Harenberg; Houjun Tang; Venkatram Vishwanath; Michael E. Papka; Nagiza F. Samatova

High-performance and energy-efficient data management applications are a necessity for HPC systems due to the extreme scale of data produced by high fidelity scientific simulations that these systems support. Data layout in memory hugely impacts the performance. For better performance, most simulations interleave variables in memory during their calculation phase, but deinterleave the data for subsequent storage and analysis. As a result, efficient data deinterleaving is critical; yet, common deinterleaving methods provide inefficient throughput and energy performance. To address this problem, we propose a deinterleaving method that is high performance, energy efficient, and generic to any data type. To the best of our knowledge, this is the first deinterleaving method that 1) exploits data cache prefetching, 2) reduces memory accesses, and 3) optimizes the use of complete cache line writes. When evaluated against conventional deinterleaving methods on 105 STREAM standard micro-benchmarks, our method always improved throughput and throughput/watt on multi-core systems. In the best case, our deinterleaving method improved throughput up to 26.2x and throughput/watt up to 7.8x.
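
For readers unfamiliar with the operation described in the abstract above, the following is a minimal sketch of what deinterleaving does: it splits a buffer laid out as x0, y0, z0, x1, y1, z1, ... into one contiguous sequence per variable. This is the conventional strided approach that such work is compared against, not the cache-optimized method the paper proposes; the function name and arguments are assumptions.

```python
def deinterleave(records, num_vars):
    """Split an interleaved sequence into one list per variable.

    records: flat sequence laid out as v0, v1, ..., v(n-1), v0, v1, ...
    num_vars: number of variables interleaved in each record.
    """
    if num_vars <= 0:
        raise ValueError("num_vars must be positive")
    if len(records) % num_vars != 0:
        raise ValueError("record length is not a multiple of num_vars")
    # A strided slice gathers every num_vars-th element starting at offset v.
    return [list(records[v::num_vars]) for v in range(num_vars)]


# Two records of three variables each (hypothetical values).
x, y, z = deinterleave([1, 10, 100, 2, 20, 200], 3)
```

Here `x`, `y`, and `z` are `[1, 2]`, `[10, 20]`, and `[100, 200]`. Each strided pass re-reads the whole buffer, which is the kind of memory traffic an optimized method aims to reduce.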
