Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Jeremiah J. Wilke is active.

Publications


Featured research published by Jeremiah J. Wilke.


International Conference on Parallel Processing | 2013

Validation and uncertainty assessment of extreme-scale HPC simulation through Bayesian inference

Jeremiah J. Wilke; Khachik Sargsyan; Joseph P. Kenny; Bert J. Debusschere; Habib N. Najm; Gilbert Hendry

Simulation of high-performance computing (HPC) systems plays a critical role in their development, especially as HPC moves toward the co-design model used for embedded systems, tying hardware and software into a unified design cycle. Exploring system-wide tradeoffs in hardware, middleware and applications using high-fidelity cycle-accurate simulation, however, is far too costly. Coarse-grained methods can provide efficient, accurate simulation but require rigorous uncertainty quantification (UQ) before using results to support design decisions. We present here SST/macro, a coarse-grained structural simulator providing flexible congestion models for low-cost simulation. We explore the accuracy limits of coarse-grained simulation by deriving error distributions of model parameters using Bayesian inference. Propagating these uncertainties through the model, we demonstrate SST/macro's utility in making conclusions about performance tradeoffs for a series of MPI collectives. Low-cost and high-accuracy simulations coupled with UQ methodology make SST/macro a powerful tool for rapidly prototyping systems to aid extreme-scale HPC co-design.
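
To make the calibration step concrete, the sketch below infers latency and bandwidth parameters of a toy linear message-cost model from synthetic measurements with a random-walk Metropolis sampler. The model, priors, and numbers are illustrative assumptions, not the paper's SST/macro calibration.

```python
# Minimal sketch of Bayesian calibration of a coarse-grained network model
# (hypothetical latency/bandwidth model and priors, not the SST/macro setup).
import numpy as np

rng = np.random.default_rng(0)

# Synthetic "measurements": time = latency + size / bandwidth + noise.
true_latency, true_bw, noise_sd = 2e-6, 5e9, 2e-7
sizes = np.logspace(3, 7, 20)                      # message sizes in bytes
obs = true_latency + sizes / true_bw + rng.normal(0, noise_sd, sizes.size)

def log_post(log_theta):
    """Gaussian log-likelihood with a flat prior in log-parameter space."""
    latency, bw = np.exp(log_theta)
    pred = latency + sizes / bw
    return -0.5 * np.sum(((obs - pred) / noise_sd) ** 2)

# Random-walk Metropolis over log(latency), log(bandwidth).
theta = np.log([1e-6, 1e9])                        # deliberately wrong start
lp = log_post(theta)
chain = []
for _ in range(20000):
    prop = theta + rng.normal(0, 0.05, 2)          # symmetric proposal
    lp_prop = log_post(prop)
    if np.log(rng.random()) < lp_prop - lp:
        theta, lp = prop, lp_prop
    chain.append(np.exp(theta))

chain = np.array(chain[5000:])                     # drop burn-in
print("posterior mean latency  :", chain[:, 0].mean())
print("posterior mean bandwidth:", chain[:, 1].mean())
```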


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Flexfly: enabling a reconfigurable dragonfly through silicon photonics

Ke Wen; Payman Samadi; Sébastien Rumley; Christine P. Chen; Yiwen Shen; Meisam Bahadori; Keren Bergman; Jeremiah J. Wilke

The Dragonfly topology provides low-diameter connectivity for high-performance computing with all-to-all global links at the inter-group level. Our traffic matrix characterization of various scientific applications shows consistent mismatch between the imbalanced group-to-group traffic and the uniform global bandwidth allocation of Dragonfly. Though adaptive routing has been proposed to utilize bandwidth of non-minimal paths, increased hops and cross-group interference lower efficiency. This work presents a photonic architecture, Flexfly, which “trades” global links among groups using low-radix Silicon photonic switches. With transparent optical switching, Flexfly reconfigures the inter-group topology based on traffic pattern, stealing additional direct bandwidth for communication-intensive group pairs. Simulations with applications such as GTC, Nekbone and LULESH show up to 1.8× speedup over Dragonfly paired with UGAL routing, along with halved hop count and latency for cross-group messages. We built a 32-node Flexfly prototype using a Silicon photonic switch connecting four groups and demonstrated 820 ns interconnect reconfiguration time.
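
As a rough illustration of the reconfiguration idea, the sketch below greedily rematches a fixed budget of global-link ports toward the heaviest group pairs in a hypothetical traffic matrix. The matrix, port counts, and heuristic are assumptions for illustration; Flexfly's actual algorithm also preserves inter-group connectivity.

```python
import numpy as np

# Hypothetical group-to-group traffic matrix (arbitrary units) for 4 groups.
traffic = np.array([
    [ 0., 90., 10.,  5.],
    [90.,  0.,  8.,  3.],
    [10.,  8.,  0., 60.],
    [ 5.,  3., 60.,  0.]])
n = traffic.shape[0]
ports = {g: 3 for g in range(n)}       # assumed global-link ports per group

# Greedy rematching of global-link ports through the photonic switch:
# repeatedly dedicate a link to the heaviest remaining group pair whose
# two groups both still have a free port (illustrative heuristic only).
links = {}
remaining = traffic.copy()
while True:
    candidates = [(remaining[i, j], (i, j))
                  for i in range(n) for j in range(i + 1, n)
                  if ports[i] > 0 and ports[j] > 0]
    if not candidates:
        break
    _, (i, j) = max(candidates)
    links[(i, j)] = links.get((i, j), 0) + 1
    ports[i] -= 1
    ports[j] -= 1
    remaining[i, j] *= 0.5             # diminishing return per extra link
    remaining[j, i] *= 0.5

print("links per group pair after reconfiguration:", links)
```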


Archive | 2015

ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms.

Gavin Matthew Baker; Matthew Tyler Bettencourt; Steven W. Bova; Ken Franko; Marc Gamell; Ryan E. Grant; Simon D. Hammond; David S. Hollman; Samuel Knight; Hemanth Kolla; Paul Lin; Stephen L. Olivier; Gregory D. Sjaardema; Nicole Lemaster Slattengren; Keita Teranishi; Jeremiah J. Wilke; Janine C. Bennett; Robert L. Clay; Laxmikant Kale; Nikhil Jain; Eric Mikida; Alex Aiken; Michael Bauer; Wonchan Lee; Elliott Slaughter; Sean Treichler; Martin Berzins; Todd Harman; Alan Humphreys; John A. Schmidt

This report provides in-depth information and analysis to help create a technical road map for developing next-generation programming models and runtime systems that support Advanced Simulation and Computing (ASC) workload requirements. The focus herein is on asynchronous many-task (AMT) models and runtime systems, which are of great interest in the context of “exascale” computing, as they hold the promise to address key issues associated with future extreme-scale computer architectures. This report includes a thorough qualitative and quantitative examination of three best-of-class AMT runtime systems (Charm++, Legion, and Uintah), all of which are in use as part of the ASC Predictive Science Academic Alliance Program II (PSAAP-II) Centers. The studies focus on each of the runtimes’ programmability, performance, and mutability. Through the experiments and analysis presented, several overarching findings emerge. From a performance perspective, AMT runtimes show tremendous potential for addressing extreme-scale challenges. Empirical studies show an AMT runtime can mitigate performance heterogeneity inherent to the machine itself and that Message Passing Interface (MPI) and AMT runtimes perform comparably under balanced conditions. From a programmability and mutability perspective, however, none of the runtimes in this study are currently ready for use in developing production-ready Sandia ASC applications. The report concludes by recommending a co-design path forward, wherein application, programming model, and runtime system developers work together to define requirements and solutions. Such a requirements-driven co-design approach benefits the high-performance computing (HPC) community as a whole, with widespread community engagement mitigating risk for both application developers and runtime system developers.
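
For readers unfamiliar with the AMT model under evaluation, the toy sketch below expresses work as tasks whose dependent step fires when its input futures resolve. It is a generic illustration in Python, not the API of Charm++, Legion, or Uintah.

```python
# Minimal sketch of the asynchronous many-task (AMT) idea: work is expressed
# as tasks plus data dependencies, and a scheduler decides when each runs.
from concurrent.futures import ThreadPoolExecutor

def assemble(region):                  # stand-in for a local compute task
    return sum(region)

with ThreadPoolExecutor(max_workers=4) as pool:
    # Independent tasks are launched eagerly; the scheduler overlaps them.
    partials = [pool.submit(assemble, range(i * 100, (i + 1) * 100))
                for i in range(4)]
    # A dependent "reduction" task runs once its input futures resolve.
    total = pool.submit(lambda fs: sum(f.result() for f in fs), partials)
    print("global sum:", total.result())
```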


IEEE International Conference on High Performance Computing, Data, and Analytics | 2016

Topology-aware performance optimization and modeling of adaptive mesh refinement codes for exascale

Cy P. Chan; John Bachan; Joseph P. Kenny; Jeremiah J. Wilke; Vincent E. Beckner; Ann S. Almgren; John B. Bell

We introduce a topology-aware performance optimization and modeling workflow for AMR simulation that includes two new modeling tools, ProgrAMR and Mota Mapper, which interface with the BoxLib AMR framework and the SSTmacro network simulator. ProgrAMR allows us to generate and model the execution of task dependency graphs from high-level specifications of AMR-based applications, which we demonstrate by analyzing two example AMR-based multigrid solvers with varying degrees of asynchrony. Mota Mapper generates multiobjective, network topology-aware box mappings, which we apply to optimize the data layout for the example multigrid solvers. While the sensitivity of these solvers to layout and execution strategy appears to be modest for balanced scenarios, the impact of better mapping algorithms can be significant when performance is highly constrained by network hop latency. Furthermore, we show that network latency in the multigrid bottom solve is the main contributing factor preventing good scaling on exascale-class machines.
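
The mapping objective can be illustrated with a toy example: the sketch below searches for a box-to-node assignment that minimizes bytes multiplied by network hops on a small ring. The communication graph, network, and exhaustive search are illustrative assumptions, not Mota Mapper's multiobjective algorithm.

```python
import itertools

# Toy AMR box communication graph: (box_a, box_b) -> bytes exchanged.
comm = {(0, 1): 100, (1, 2): 80, (2, 3): 60, (0, 3): 10}

# Toy 4-node ring network; hops(a, b) is the shortest-path distance.
def hops(a, b, n=4):
    d = abs(a - b)
    return min(d, n - d)

def cost(mapping):
    """Total bytes x hops for a box -> node mapping (index = box id)."""
    return sum(vol * hops(mapping[a], mapping[b]) for (a, b), vol in comm.items())

# Exhaustive search over placements (fine at toy scale; the real mapper
# uses scalable multiobjective heuristics instead).
best = min(itertools.permutations(range(4)), key=cost)
print("best box->node mapping:", best, "cost:", cost(best))
```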


High Performance Computing and Communications | 2015

Application Modeling for Scalable Simulation of Massively Parallel Systems

Eric Anger; Damian Dechev; Gilbert Hendry; Jeremiah J. Wilke; Sudhakar Yalamanchili

Macro-scale simulation has been advanced as one tool for application-architecture co-design to express the operation of exascale systems. These simulations approximate the behavior of system components, trading off accuracy for increased evaluation speed. Application skeletons serve as the vehicle for these simulations, but they require accurately capturing the execution behavior of computation. The complexity of application codes, the heterogeneity of the platforms, and the increasing importance of simulating multiple performance metrics (e.g., execution time, energy) require new modeling techniques. We propose flexible statistical models to increase the fidelity of application simulation at scale. We present performance model validation for several exascale mini-applications that leverage a variety of parallel programming frameworks targeting heterogeneous architectures for both time and energy performance metrics. When paired with these statistical models, application skeletons were simulated on average 12.5 times faster than the original application while incurring only 6.08% error, which is 12.5% faster and 33.7% more accurate than baseline models.
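
A minimal sketch of the statistical-modeling idea: fit a surrogate that predicts a kernel's runtime from a simple feature of its input (here an assumed cubic dependence on problem size) so a skeleton can evaluate the surrogate instead of running the kernel. The data and model form are illustrative, not the paper's models.

```python
import numpy as np

# Synthetic training data: kernel time vs. problem size n, generated from a
# hypothetical cubic cost plus measurement noise.
rng = np.random.default_rng(1)
n = np.array([16, 32, 48, 64, 96, 128], dtype=float)
time = 2e-9 * n**3 + 5e-7 * n**2 + rng.normal(0, 1e-5, n.size)

# Fit a polynomial surrogate that a skeleton app can evaluate instead of
# executing the real kernel.
coeffs = np.polyfit(n, time, deg=3)
model = np.poly1d(coeffs)

# The skeleton then "charges" the simulator the modeled time for any size.
print("predicted time for n=80:", model(80.0))
```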


Archive | 2015

Using Discrete Event Simulation for Programming Model Exploration at Extreme-Scale: Macroscale Components for the Structural Simulation Toolkit (SST)

Jeremiah J. Wilke; Joseph P. Kenny

Discrete event simulation provides a powerful mechanism for designing and testing new extreme-scale programming models for high-performance computing. Rather than debug, run, and wait for results on an actual system, designs can first iterate through a simulator. This is particularly useful when test beds cannot be used, i.e., to explore hardware or scales that do not yet exist or are inaccessible. Here we detail the macroscale components of the Structural Simulation Toolkit (SST). Instead of depending on trace replay or state machines, the simulator is architected to execute real code on real software stacks. Our particular user-space threading framework allows massive scales to be simulated even on small clusters. The link between the discrete event core and the threading framework allows interesting performance metrics like call graphs to be collected from a simulated run. Performance analysis via simulation can thus become an important phase in extreme-scale programming model and runtime system design via the SST macroscale components.
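
The discrete event core can be sketched in a few lines: an event queue ordered by virtual time, with handlers that schedule further events. This is a toy loop with assumed link parameters, not SST's actual component API.

```python
import heapq

# Minimal discrete-event core: events are (time, sequence, handler, payload)
# tuples popped in virtual-time order.
class Simulator:
    def __init__(self):
        self.now = 0.0
        self._queue = []
        self._seq = 0                      # tie-breaker for equal timestamps

    def schedule(self, delay, handler, payload=None):
        heapq.heappush(self._queue, (self.now + delay, self._seq, handler, payload))
        self._seq += 1

    def run(self):
        while self._queue:
            self.now, _, handler, payload = heapq.heappop(self._queue)
            handler(self, payload)

def send(sim, size):
    latency, bandwidth = 1e-6, 1e10        # assumed link parameters
    sim.schedule(latency + size / bandwidth, recv, size)

def recv(sim, size):
    print(f"t={sim.now:.2e}s: received {size} bytes")

sim = Simulator()
sim.schedule(0.0, send, 1 << 20)           # inject a 1 MiB message at t=0
sim.run()
```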


IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing | 2017

APHiD: Hierarchical Task Placement to Enable a Tapered Fat Tree Topology for Lower Power and Cost in HPC Networks

George Michelogiannakis; Khaled Z. Ibrahim; John Shalf; Jeremiah J. Wilke; Samuel Knight; Joseph P. Kenny

The power and procurement cost of bandwidth in system-wide networks has forced a steady drop in the byte/flop ratio. This trend of computation becoming faster relative to the network is expected to hold. In this paper, we explore how cost-oriented task placement enables reducing the cost of system-wide networks by enabling high performance even on tapered topologies where more bandwidth is provisioned at lower levels. We describe APHiD, an efficient hierarchical placement algorithm that uses new techniques to improve the quality of heuristic solutions and reduces the demand on high-level, expensive bandwidth in hierarchical topologies. We apply APHiD to a tapered fat-tree, demonstrating that APHiD maintains application scalability even for severely tapered network configurations. Using simulation, we show that for tapered networks APHiD improves performance by more than 50% over random placement and even 15% in some cases over costlier, state-of-the-art placement algorithms.
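
The placement idea can be sketched as recursive min-cut bisection: tasks that exchange the most data are kept in the same half at every level, so heavy traffic stays low in the tree. The communication graph and the exhaustive bisection below are illustrative assumptions, not APHiD itself.

```python
import itertools

# Toy task communication graph: (task_a, task_b) -> message volume.
comm = {(0, 1): 100, (2, 3): 100, (0, 2): 5, (1, 3): 5,
        (4, 5): 90, (6, 7): 90, (4, 6): 4, (5, 7): 4, (3, 4): 2}

def cut_weight(group_a, group_b):
    """Traffic forced across the partition (onto higher, tapered tree levels)."""
    return sum(v for (a, b), v in comm.items()
               if (a in group_a and b in group_b) or (a in group_b and b in group_a))

def bisect(tasks):
    """Exhaustive min-cut bisection; fine at toy scale, whereas APHiD relies
    on scalable heuristics."""
    tasks = set(tasks)
    half = len(tasks) // 2
    left = min((set(c) for c in itertools.combinations(sorted(tasks), half)),
               key=lambda g: cut_weight(g, tasks - g))
    return left, tasks - left

def place(tasks, nodes):
    """Recursively keep heavy communicators within the same fat-tree subtree."""
    if len(nodes) == 1:
        return {t: nodes[0] for t in tasks}
    left, right = bisect(tasks)
    mid = len(nodes) // 2
    mapping = place(left, nodes[:mid])
    mapping.update(place(right, nodes[mid:]))
    return mapping

print(place(range(8), nodes=[0, 1, 2, 3]))         # 8 tasks onto 4 leaf nodes
```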


Proceedings of the 5th Workshop on Fault Tolerance for HPC at eXtreme Scale | 2015

Evolving the Message Passing Programming Model via a Fault-Tolerant, Object-oriented Transport Layer

Jeremiah J. Wilke; Keita Teranishi; Janine C. Bennett; Hemanth Kolla; David S. Hollman; Nicole Lemaster Slattengren

In this position paper, we argue for improved fault-tolerance of an MPI code by introducing lightweight virtualization into the MPI interface. In particular, we outline key-value store semantics for MPI send/recv calls, thereby creating a far more expressive programming model. The general message passing semantics and imperative style of MPI application codes would remain essentially unchanged. However, the additional expressibility of the programming model 1) enables the underlying transport layer to handle fault-tolerance more transparently to the application developer, and 2) provides an evolutionary code path towards more declarative asynchronous programming models. The core contribution of this paper is an initial implementation of the DHARMA transport layer that provides the new, required functionality to support the MPI key-value store model.
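
The key-value semantics can be sketched as a thin wrapper over mpi4py in which messages are published and fetched by an application-level key instead of an explicit (source, tag) pair. The wrapper names and key-to-tag hashing are hypothetical, not the DHARMA transport layer's interface.

```python
# Minimal sketch of key-value style send/recv layered over MPI point-to-point
# (hypothetical wrapper, not the DHARMA transport layer's actual interface).
import zlib
from mpi4py import MPI

comm = MPI.COMM_WORLD

def _key_tag(key):
    # Deterministic key -> tag mapping, stable across ranks (unlike hash()).
    return zlib.crc32(repr(key).encode()) % 32767

def kv_publish(key, value, dest):
    """Send a value addressed by an application-level key rather than a raw tag."""
    comm.send(value, dest=dest, tag=_key_tag(key))

def kv_fetch(key, source=MPI.ANY_SOURCE):
    """Receive whatever value was published under the given key."""
    return comm.recv(source=source, tag=_key_tag(key))

# Toy usage: rank 0 publishes a named field, rank 1 fetches it by key.
if comm.Get_size() > 1:
    if comm.Get_rank() == 0:
        kv_publish(("pressure", 0), [1.0, 2.0, 3.0], dest=1)
    elif comm.Get_rank() == 1:
        print("fetched:", kv_fetch(("pressure", 0)))
```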


IEEE International Conference on High Performance Computing, Data, and Analytics | 2018

Compiler-Assisted Source-to-Source Skeletonization of Application Models for System Simulation

Jeremiah J. Wilke; Joseph P. Kenny; Samuel Knight; Sébastien Rumley

Performance modeling of networks through simulation requires application endpoint models that inject traffic into the simulation models. Endpoint models for system-scale studies today consist mainly of post-mortem trace replay, but these off-line simulations may lack flexibility and scalability. On-line simulations instead run so-called skeleton applications: reduced versions of an application that generate traffic that is the same as or similar to that of the full application. These skeleton apps have advantages for flexibility and scalability, but they often must be custom written for the simulator itself. Auto-skeletonization of existing application source code via compiler tools would provide endpoint models with minimal development effort. These source-to-source transformations have been only narrowly explored. We introduce a pragma language and corresponding Clang-driven source-to-source compiler that performs auto-skeletonization based on the provided pragma annotations. We describe the compiler toolchain, validate the generated skeletons, and show scalability of the generated simulation models beyond 100K endpoints for example MPI applications. Overall, we assert that our proposed auto-skeletonization approach and the flexible skeletons it produces can be an important tool in realizing balanced exascale interconnect designs.
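
The effect of skeletonization can be shown in miniature (in Python rather than the C/C++ the Clang tool targets, and without the paper's pragma syntax): a compute region is swapped for a modeled delay while the structure that drives communication is kept.

```python
# Toy before/after view of skeletonization; the cost model is an assumption.
import time

def compute_halo(cells):
    """Original compute region: real floating-point work the full app performs."""
    return [0.25 * (c + 1.0) for c in cells]

def compute_halo_model(n_cells, ns_per_cell=12.0):
    """Skeletonized region: the work is replaced by a modeled cost (an assumed
    nanoseconds-per-cell figure) that a simulator can charge as virtual time."""
    return n_cells * ns_per_cell * 1e-9            # modeled seconds of compute

def timestep(cells, skeleton=False):
    if skeleton:
        return compute_halo_model(len(cells))      # no arithmetic actually runs
    t0 = time.perf_counter()
    compute_halo(cells)
    return time.perf_counter() - t0

cells = [1.0] * 100000
print("measured compute:", timestep(cells), "s")
print("modeled compute :", timestep(cells, skeleton=True), "s")
```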


IEEE International Conference on High Performance Computing, Data, and Analytics | 2018

The Pitfalls of Provisioning Exascale Networks: A Trace Replay Analysis for Understanding Communication Performance

Joseph P. Kenny; Khachik Sargsyan; Samuel Knight; George Michelogiannakis; Jeremiah J. Wilke

Data movement is considered the main performance concern for exascale, including both on-node memory and off-node network communication. Indeed, many application traces show significant time spent in MPI calls, potentially indicating that faster networks must be provisioned for scalability. However, equating MPI times with network communication delays ignores synchronization delays and software overheads independent of network hardware. Using point-to-point protocol details, we explore the decomposition of MPI time into communication, synchronization and software stack components using architecture simulation. Detailed validation using Bayesian inference is used to identify the sensitivity of performance to specific latency/bandwidth parameters for different network protocols and to quantify associated uncertainties. The inference combined with trace replay shows that synchronization and MPI software stack overhead are at least as important as the network itself in determining time spent in communication routines.
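
The decomposition can be made concrete with a toy calculation: for one rendezvous-style receive, the time inside MPI_Recv splits into synchronization (waiting for the sender to arrive), network transfer, and a fixed software overhead. The parameters below are illustrative, not the paper's calibrated values.

```python
# Toy decomposition of time spent inside an MPI_Recv under a rendezvous-style
# protocol. All parameters are illustrative assumptions.
def decompose_recv(recv_posted, send_posted, msg_bytes,
                   latency=1.5e-6, bandwidth=12e9, sw_overhead=0.8e-6):
    sync = max(0.0, send_posted - recv_posted)     # waiting on the sender
    transfer = latency + msg_bytes / bandwidth     # wire + serialization time
    total = sync + transfer + sw_overhead
    return {"synchronization": sync, "network": transfer,
            "software": sw_overhead, "total": total}

# Receiver posts at t=0, sender shows up 40 microseconds later with 1 MiB.
parts = decompose_recv(recv_posted=0.0, send_posted=40e-6, msg_bytes=1 << 20)
for name, seconds in parts.items():
    print(f"{name:>15}: {seconds * 1e6:8.2f} us")
```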

Collaboration


Dive into Jeremiah J. Wilke's collaborations.

Top Co-Authors

Joseph P. Kenny, Sandia National Laboratories
Gilbert Hendry, Sandia National Laboratories
Khachik Sargsyan, Sandia National Laboratories
Hemanth Kolla, Sandia National Laboratories
Janine C. Bennett, Sandia National Laboratories
Samuel Knight, Sandia National Laboratories
Bert J. Debusschere, Sandia National Laboratories