Ann C. Gentile | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Ann C. Gentile is active.

Explore More

Publication

Featured researches published by Ann C. Gentile.

international parallel and distributed processing symposium | 2009

Resource monitoring and management with OVIS to enable HPC in cloud computing environments

James M. Brandt; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; Diana C. Roe; David C. Thompson; Matthew H. Wong

Using the cloud computing paradigm, a host of companies promise to make huge compute resources available to users on a pay-as-you-go basis. These resources can be configured on the fly to provide the hardware and operating system of choice to the customer on a large scale. While the current target market for these resources in the commercial space is web development/hosting, this model has the lure of savings of ownership, operation, and maintenance costs, and thus sounds like an attractive solution for people who currently invest millions to hundreds of millions of dollars annually on High Performance Computing (HPC) platforms in order to support large-scale scientific simulation codes. Given the current interconnect bandwidth and topologies utilized in these commercial offerings, however, the only current viable market in HPC would be small-memory-footprint embarrassingly parallel or loosely coupled applications, which inherently require little to no inter-processor communication. While providing the appropriate resources (bandwidth, latency, memory, etc.) for the HPC community would increase the potential to enable HPC in cloud environments, this would not address the need for scalability and reliability, crucial to HPC applications. Providing for these needs is particularly difficult in commercial cloud offerings where the number of virtual resources can far outstrip the number of physical resources, the resources are shared among many users, and the resources may be heterogeneous. Advanced resource monitoring, analysis, and configuration tools can help address these issues, since they bring the ability to dynamically provide and respond to information about the platform and application state and would enable more appropriate, efficient, and flexible use of the resources key to enabling HPC. Additionally such tools could be of benefit to non-HPC cloud providers, users, and applications by providing more efficient resource utilization in general.

international parallel and distributed processing symposium | 2008

Ovis-2: A robust distributed architecture for scalable RAS

Jim M. Brandt; Bert J. Debusschere; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; David C. Thompson; Matthew H. Wong

Resource utilization in High Performance Compute clusters can be improved by increased awareness of system state information. Sophisticated run-time characterization of system state in increasingly large clusters requires a scalable fault-tolerant RAS framework. In this paper we describe the architecture of OVIS-2 and how it meets these requirements. We describe some of the sophisticated statistical analysis, 3-D visualization, and use cases for these. Using this framework and associated tools allows the engineer to explore the behaviors and complex interactions of low level system elements while simultaneously giving the system administrator their desired level of detail with respect to ongoing system and component health.

Computer Science - Research and Development | 2011

Baler: deterministic, lossless log message clustering tool

Narate Taerat; Jim M. Brandt; Ann C. Gentile; Matthew H. Wong; Chokchai Leangsuksun

The rate of failures in HPC systems continues to increase as the number of components comprising the systems increases. System logs are one of the valuable information sources that can be used to analyze system failures and their root causes. However, system log files are usually too large and complex to analyze manually. There are some existing log clustering tools that seek to help analysts in exploring these logs, however they fail to satisfy our needs with respect to scalability, usability and quality of results. Thus, we have developed a log clustering tool to better address these needs. In this paper we present our novel approach and initial experimental results.

high performance distributed computing | 1997

Lilith: Scalable execution of user code for distributed computing

David A. Evensky; Ann C. Gentile; L. J. Camp; Rob Armstrong

Lilith is a general purpose tool to provide highly scalable, easy distribution of user code across a heterogeneous computing platform. Liliths principal task is to span a heterogeneous tree of machines executing user-defined code in a scalable and secure fashion. Lilith will be used for controlling user processes as well as general system administrative tasks. Lilith is written in Java, taking advantage of Javas platform independence and intent to move code across networks. The design of Lilith provides hooks for experimenting with tree-spanning algorithms and security schemes. We present the Lilith Object model, security scheme, and implementation, and present timing results demonstrating Liliths scalable behavior.

Proceedings of the 2009 workshop on Resiliency in high performance | 2009

Methodologies for advance warning of compute cluster problems via statistical analysis: a case study

Jim M. Brandt; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; Diana C. Roe; David C. Thompson; Matthew H. Wong

The ability to predict impending failures (hardware or software) on large scale high performance compute (HPC) platforms, augmented by checkpoint mechanisms could drastically increase the scalability of applications and efficiency of platforms. In this paper we present our findings and methodologies employed to date in our search for reliable, advance indicators of failures on a 288 node, 4608 core, Opteron based cluster in production use at Sandia National Laboratories. In support of this effort we have deployed OVIS, a Sandia-developed scalable HPC monitoring, analysis, and visualization tool designed for this purpose. We demonstrate that for a particular error case, statistical analysis using OVIS would enable advanced warning of cluster problems on timescales that would enable application and system administrator response in advance of errors, subsequent system error log reporting, and job failures. This is significant as the utility of detecting such indicators depends on how far in advance of failure they can be recognized and how reliable they are.

cluster computing and the grid | 2008

Using Probabilistic Characterization to Reduce Runtime Faults in HPC Systems

Jim M. Brandt; Bert J. Debusschere; Ann C. Gentile; Jackson R. Mayo; Philippe Pierre Pebay; David C. Thompson; Matthew H. Wong

The current trend in high performance computing is to aggregate ever larger numbers of processing and interconnection elements in order to achieve desired levels of computational power, This, however, also comes with a decrease in the Mean Time To Interrupt because the elements comprising these systems are not becoming significantly more robust. There is substantial evidence that the Mean Time To Interrupt vs. number of processor elements involved is quite similar over a large number of platforms. In this paper we present a system that uses hardware level monitoring coupled with statistical analysis and modeling to select processing system elements based on where they lie in the statistical distribution of similar elements. These characterizations can be used by the scheduler/resource manager to deliver a close to optimal set of processing elements given the available pool and the reliability requirements of the application.

international conference on cluster computing | 2014

Demonstrating improved application performance using dynamic monitoring and task mapping

James M. Brandt; Karen Dragon Devine; Ann C. Gentile; Kevin Pedretti

This work demonstrates the integration of monitoring, analysis, and feedback to perform application-to-resource mapping that adapts to both static architecture features and dynamic resource state. In particular, we present a framework for mapping MPI tasks to compute resources based on run-time analysis of system-wide network data, architecture-specific routing algorithms, and application communication patterns. We address several challenges. Within each node, we collect local utilization data. We consolidate that information to form a global view of system performance, accounting for system-wide factors including competing applications. We provide an interface for applications to query the global information. Then we exploit the system information to change the mapping of tasks to nodes so that system bottlenecks are avoided. We demonstrate the benefit of this monitoring and feedback by remapping MPI tasks based on route-length, bandwidth, and credit-stalls metrics for a parallel sparse matrix-vector multiplication kernel. In the best case, remapping based on dynamic network information in a congested environment recovered 48.9% of the time lost to congestion, reducing matrix-vector multiplication time by 7.8%. Our experiments focus on the Cray XE/XK platform, but the integration concepts are generally applicable to any platform for which applicable metrics and route knowledge can be obtained.

dependable systems and networks | 2012

Filtering log data: Finding the needles in the Haystack

Li Yu; Ziming Zheng; Zhiling Lan; Terry Jones; Jim M. Brandt; Ann C. Gentile

Log data is an incredible asset for troubleshooting in large-scale systems. Nevertheless, due to the ever-growing system scale, the volume of such data becomes overwhelming, bringing enormous burdens on both data storage and data analysis. To address this problem, we present a 2-dimensional online filtering mechanism to remove redundant and noisy data via feature selection and instance selection. The objective of this work is two-fold: (i) to significantly reduce data volume without losing important information, and (ii) to effectively promote data analysis. We evaluate this new filtering mechanism by means of real environmental data from the production supercomputers at Oak Ridge National Laboratory and Sandia National Laboratory. Our preliminary results demonstrate that our method can reduce more than 85% disk space, thereby significantly reducing analysis time. Moreover, it also facilitates better failure prediction and diagnosis by more than 20%, as compared to the conventional predictive approach relying on RAS (Reliability, Availability, and Serviceability) events alone.

Archive | 2012

Report of experiments and evidence for ASC L2 milestone 4467 : demonstration of a legacy application's path to exascale.

Matthew L. Curry; Kurt Brian Ferreira; Kevin Pedretti; Vitus J. Leung; Kenneth Moreland; Gerald Fredrick Lofstead; Ann C. Gentile; Ruth Klundt; H. Lee Ward; James H. Laros; Karl Scott Hemmert; Nathan D. Fabian; Michael J. Levenhagen; Ronald B. Brightwell; Richard Frederick Barrett; Kyle Bruce Wheeler; Suzanne M. Kelly; Arun F. Rodrigues; James M. Brandt; David C. Thompson; John P. VanDyke; Ron A. Oldfield; Thomas Tucker

This report documents thirteen of Sandias contributions to the Computational Systems and Software Environment (CSSE) within the Advanced Simulation and Computing (ASC) program between fiscal years 2009 and 2012. It describes their impact on ASC applications. Most contributions are implemented in lower software levels allowing for application improvement without source code changes. Improvements are identified in such areas as reduced run time, characterizing power usage, and Input/Output (I/O). Other experiments are more forward looking, demonstrating potential bottlenecks using mini-application versions of the legacy codes and simulating their network activity on Exascale-class hardware. The purpose of this report is to prove that the team has completed milestone 4467-Demonstration of a Legacy Applications Path to Exascale. Cielo is expected to be the last capability system on which existing ASC codes can run without significant modifications. This assertion will be tested to determine where the breaking point is for an existing highly scalable application. The goal is to stretch the performance boundaries of the application by applying recent CSSE RD in areas such as resilience, power, I/O, visualization services, SMARTMAP, lightweight LWKs, virtualization, simulation, and feedback loops. Dedicated system time reservations and/or CCC allocations will be used to quantify the impact of system-level changes to extend the life and performance of the ASC code base. Finally, a simulation of anticipated exascale-class hardware will be performed using SST to supplement the calculations. Determine where the breaking point is for an existing highly scalable application: Chapter 15 presented the CSSE work that sought to identify the breaking point in two ASC legacy applications-Charon and CTH. Their mini-app versions were also employed to complete the task. There is no single breaking point as more than one issue was found with the two codes. The results were that applications can expect to encounter performance issues related to the computing environment, system software, and algorithms. Careful profiling of runtime performance will be needed to identify the source of an issue, in strong combination with knowledge of system software and application source code.

international conference on cluster computing | 2005

Meaningful Automated Statistical Analysis of Large Computational Clusters

Jim M. Brandt; Ann C. Gentile; Youssef M. Marzouk; Philippe Pierre Pebay

As clusters utilizing commercial off-the-shelf technology have grown from tens to thousands of nodes and typical job sizes have likewise increased, much effort has been devoted to improving the scalability of message-passing fabrics, schedulers, and storage. Largely ignored, however, has been the issue of predicting node failure, which also has a large impact on scalability. In fact, more than ten years into cluster computing, we are still managing this issue on a node-by-node basis even though available diagnostic data has grown immensely. We have built a tool that uses the statistical similarity of the large number of nodes in a cluster to infer the health of each individual node. In the poster, we first present real data and statistical calculations as foundational material and justification for our claims of similarity. Next we present our methodology and its implications for early notification of deviation from normal behavior, problem diagnosis, automatic code restart via interaction with scheduler, and airflow distribution monitoring in the machine room. A framework addressing scalability is discussed briefly. Lastly, we present case studies showing how our methodology has been used to detect aberrant nodes whose deviations are still far below the detection level of traditional methods. A summary of the results of the case studies appears below

Explore More