Justin Luitjens
University of Utah
Publications
Featured research published by Justin Luitjens.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2014
Oreste Villa; Daniel R. Johnson; Mike O'Connor; Evgeny Bolotin; David W. Nellans; Justin Luitjens; Nikolai Sakharnykh; Peng Wang; Paulius Micikevicius; Anthony Scudiero; Stephen W. Keckler; William J. Dally
Modern scientific discovery is driven by an insatiable demand for computing performance. The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops by the year 2020 and power consumption is the primary obstacle to achieving this goal. A combination of architectural improvements, circuit design, and manufacturing technologies must provide over a 20× improvement in energy efficiency. In this paper, we present some of the progress NVIDIA Research is making toward the design of Exascale systems by tailoring features to address the scaling challenges of performance and energy efficiency. We evaluate several architectural concepts for a set of HPC applications demonstrating expected energy efficiency improvements resulting from circuit and packaging innovations such as low-voltage SRAM, low-energy signalling, and on-package memory. Finally, we discuss the scaling of these features with respect to future process technologies and provide power and performance projections for our Exascale research architecture.
Many-Task Computing on Grids and Supercomputers | 2010
Qingyu Meng; Justin Luitjens; Martin Berzins
Uintah is a computational framework for fluid-structure interaction problems using a combination of the ICE fluid flow algorithm, adaptive mesh refinement (AMR) and MPM particle methods. Uintah uses domain decomposition with a task-graph approach for asynchronous communication and automatic message generation. The Uintah software has been used for a decade with its original task scheduler, which ran computational tasks in a predefined static order. In order to improve the performance of Uintah on petascale architectures, a new dynamic task scheduler that allows better overlap of communication and computation is designed and evaluated in this study. The new scheduler supports asynchronous, out-of-order scheduling of computational tasks by putting them in a distributed directed acyclic graph (DAG) and by isolating task memory and keeping multiple copies of task variables in a data warehouse when necessary. A new runtime system has been implemented with a two-stage priority queuing architecture to improve scheduling efficiency. The effectiveness of this new approach is shown through an analysis of the performance of the software on large-scale fluid-structure examples.
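The two-stage design described in this abstract can be illustrated with a small scheduler skeleton. The sketch below is only a rough approximation of the idea, not Uintah's runtime: Task, internalReady, and externalReady are illustrative names, the priority field is a placeholder for whatever heuristic the scheduler uses, and the MPI layer is elided entirely.

```cpp
// Minimal sketch of an asynchronous, out-of-order DAG task scheduler with a
// two-stage ready-queue design, loosely following the description above.
// All names here are illustrative and are not Uintah's actual API.
#include <functional>
#include <queue>
#include <vector>

struct Task {
    std::function<void()> run;      // the computational kernel to execute
    int unmetDependencies = 0;      // unsatisfied task-graph dependencies
    int pendingMessages   = 0;      // outstanding MPI receives for this task
    int priority          = 0;      // scheduling priority (placeholder heuristic)
    std::vector<Task*> successors;  // tasks that consume this task's output
};

struct ByPriority {
    bool operator()(const Task* a, const Task* b) const {
        return a->priority < b->priority;       // higher priority pops first
    }
};

using ReadyQueue = std::priority_queue<Task*, std::vector<Task*>, ByPriority>;

class Scheduler {
public:
    ReadyQueue internalReady;  // graph dependencies met; data may be in flight
    ReadyQueue externalReady;  // all data has arrived; task may run now

    // Called when a predecessor of t finishes.
    void dependencySatisfied(Task* t) {
        if (--t->unmetDependencies == 0) internalReady.push(t);
    }

    // Called by the communication layer when an MPI receive for t completes.
    void messageArrived(Task* t) {
        if (--t->pendingMessages == 0) externalReady.push(t);
    }

    // Move tasks with no outstanding messages straight to the run queue;
    // in a real runtime this is also where their MPI receives are posted.
    void promoteReady() {
        while (!internalReady.empty()) {
            Task* t = internalReady.top();
            internalReady.pop();
            if (t->pendingMessages == 0) externalReady.push(t);
            // otherwise messageArrived() will promote it later
        }
    }

    // Run one task out of order and release its successors.
    bool runOne() {
        promoteReady();
        if (externalReady.empty()) return false;   // nothing runnable yet
        Task* t = externalReady.top();
        externalReady.pop();
        t->run();
        for (Task* s : t->successors) dependencySatisfied(s);
        return true;
    }
};
```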
International Parallel and Distributed Processing Symposium | 2010
Justin Luitjens; Martin Berzins
Uintah is a highly parallel and adaptive multi-physics framework created by the Center for Simulation of Accidental Fires and Explosions in Utah. Uintah, which is built upon the Common Component Architecture, has facilitated the simulation of a wide variety of fluid-structure interaction problems using both adaptive structured meshes for the fluid and particles to model solids. Uintah was originally designed for, and has performed well on, about a thousand processors. The evolution of Uintah to use tens of thousands of processors has required improvements in memory usage, data structure design, load balancing algorithms and cost estimation in order to improve strong and weak scalability up to 98,304 cores, both for situations in which the mesh varies adaptively and for cases in which particles that represent the solids move from mesh cell to mesh cell.
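The cost estimation mentioned in this abstract feeds a load balancer that splits an ordered list of patches (for example, ordered along a space-filling curve) into contiguous pieces of roughly equal estimated cost. The sketch below assumes a deliberately simple cost model of cells plus weighted particles; Uintah's actual calibrated cost model and data structures are not shown.

```cpp
// Sketch of cost-estimate-driven load balancing over a curve-ordered patch
// list: each patch gets an estimated cost and the ordered list is cut into
// contiguous chunks of roughly equal total cost, one chunk per rank.
// The cost model (cells + weighted particles) is an illustrative assumption.
#include <cstddef>
#include <vector>

struct PatchCost {
    int    id;         // patch identifier, already in curve order
    double cells;      // number of mesh cells in the patch
    double particles;  // number of particles currently in the patch
};

// Assign each patch to a rank so that consecutive ranks receive consecutive
// curve segments of approximately equal estimated cost.
std::vector<int> assignPatches(const std::vector<PatchCost>& patches,
                               int numRanks,
                               double particleWeight /* relative cost of one particle */) {
    double total = 0.0;
    for (const PatchCost& p : patches)
        total += p.cells + particleWeight * p.particles;

    const double target = total / numRanks;   // ideal cost per rank
    std::vector<int> owner(patches.size());
    double accumulated = 0.0;
    int rank = 0;
    for (std::size_t i = 0; i < patches.size(); ++i) {
        owner[i] = rank;
        accumulated += patches[i].cells + particleWeight * patches[i].particles;
        // Move to the next rank once this one has reached its share of the cost.
        if (accumulated >= target * (rank + 1) && rank + 1 < numRanks) ++rank;
    }
    return owner;
}
```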
Concurrency and Computation: Practice and Experience | 2007
Justin Luitjens; Martin Berzins; Thomas C. Henderson
In this paper we consider the scalability of parallel space-filling curve generation as implemented through parallel sorting algorithms. Multiple sorting algorithms are studied and results show that space-filling curves can be generated quickly in parallel on thousands of processors. In addition, performance models are presented that are consistent with measured performance and offer insight into performance on still larger numbers of processors. At large numbers of processors, the scalability of adaptively mesh-refined codes depends on the individual components of the adaptive solver. One such component is the dynamic load balancer. In adaptively refined codes the mesh is constantly changing, resulting in load imbalance among the processors and requiring a load-balancing phase. The load balancing may occur often, requiring the load balancer to perform quickly. One common method for dynamic load balancing is to use space-filling curves. Space-filling curves, in particular the Hilbert curve, generate good partitions quickly in serial. However, at tens and hundreds of thousands of processors, serial generation of space-filling curves will hinder scalability. In order to avoid this issue we have developed a method that generates space-filling curves quickly in parallel by reducing the generation to integer sorting.
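The reduction of curve generation to integer sorting can be sketched as follows. For brevity the example computes Morton (Z-order) keys rather than the Hilbert keys used in the paper, and it sorts with std::sort; a distributed version would replace that call with a parallel integer sort across MPI ranks.

```cpp
// Sketch of curve generation reduced to integer sorting: assign each mesh
// patch an integer key along a space-filling curve, then sort the keys.
// Morton (Z-order) keys are used here for brevity; Hilbert keys give better
// locality but need a more involved key computation.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Spread the low 21 bits of x so they occupy every third bit position.
static std::uint64_t spreadBits(std::uint64_t x) {
    x &= 0x1fffff;                                   // keep 21 bits (3*21 = 63)
    x = (x | (x << 32)) & 0x1f00000000ffffULL;
    x = (x | (x << 16)) & 0x1f0000ff0000ffULL;
    x = (x | (x << 8))  & 0x100f00f00f00f00fULL;
    x = (x | (x << 4))  & 0x10c30c30c30c30c3ULL;
    x = (x | (x << 2))  & 0x1249249249249249ULL;
    return x;
}

// 3-D Morton key: interleave the bits of the (i, j, k) index.
static std::uint64_t mortonKey(std::uint32_t i, std::uint32_t j, std::uint32_t k) {
    return spreadBits(i) | (spreadBits(j) << 1) | (spreadBits(k) << 2);
}

struct Patch {
    std::uint32_t i, j, k;   // integer coordinates of the patch (e.g. its origin)
    int id;                  // identifier used by the load balancer
};

// Return patch ids ordered along the space-filling curve.
std::vector<int> curveOrder(const std::vector<Patch>& patches) {
    std::vector<std::pair<std::uint64_t, int>> keyed;
    keyed.reserve(patches.size());
    for (const Patch& p : patches)
        keyed.emplace_back(mortonKey(p.i, p.j, p.k), p.id);
    std::sort(keyed.begin(), keyed.end());   // curve order == sorted key order
    std::vector<int> order;
    order.reserve(keyed.size());
    for (const auto& kv : keyed) order.push_back(kv.second);
    return order;
}
```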
Concurrency and Computation: Practice and Experience | 2011
Justin Luitjens; Martin Berzins
Block-structured adaptive mesh refinement (BSAMR) is widely used within simulation software because it improves the utilization of computing resources by refining the mesh only where necessary. For BSAMR to scale onto existing petascale and eventually exascale computers, all portions of the simulation need to weak scale ideally. Any portions of the simulation that do not will become a bottleneck at larger numbers of cores. The challenge is to design algorithms that will make it possible to avoid these bottlenecks on exascale computers. One step of existing BSAMR algorithms involves determining where to create new patches of refinement. The Berger–Rigoutsos algorithm is commonly used to perform this task. This paper provides a detailed analysis of the performance of two existing parallel implementations of the Berger–Rigoutsos algorithm, develops a new parallel implementation of the Berger–Rigoutsos algorithm, and presents a tiled algorithm that exhibits ideal scalability. The analysis and computational results up to 98,304 cores are used to design performance models, which are then used to predict how these algorithms will perform on 100M cores.
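The tiled alternative to Berger–Rigoutsos can be sketched in a few lines: cover the index space with fixed-size tiles and promote every tile containing a flagged cell to a patch of refinement. The version below is serial and purely illustrative; the names and types are not from the paper.

```cpp
// Sketch of tiled regridding: every tile that contains at least one flagged
// cell becomes a patch on the new refinement level. Each rank can run this
// on its locally flagged cells, with no recursive clustering step across
// ranks, which is what removes the scaling bottleneck.
// Assumes non-negative cell indices.
#include <array>
#include <set>
#include <vector>

using Cell = std::array<int, 3>;   // (i, j, k) index of a flagged cell
using Tile = std::array<int, 3>;   // integer tile coordinates

// Map each flagged cell to its tile; the set removes duplicates, so the
// result is the list of patches (tiles) the new refinement level needs.
std::vector<Tile> tiledRegrid(const std::vector<Cell>& flaggedCells,
                              const std::array<int, 3>& tileSize) {
    std::set<Tile> tiles;
    for (const Cell& c : flaggedCells) {
        tiles.insert({c[0] / tileSize[0],
                      c[1] / tileSize[1],
                      c[2] / tileSize[2]});
    }
    return {tiles.begin(), tiles.end()};
}
```

Compared with Berger–Rigoutsos clustering, this trades some patch efficiency (tiles may cover more unflagged cells) for a regridding step whose cost depends only on the local flags.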
TeraGrid Conference | 2010
Martin Berzins; Justin Luitjens; Qingyu Meng; Todd Harman; Charles A. Wight; Joseph R. Peterson
Archive | 2007
Justin Luitjens; Bryan Worthen
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Richard F. Barrett; Shekhar Y. Borkar; Sudip S. Dosanjh; Simon D. Hammond; Michael A. Heroux; Xiaobo Sharon Hu; Justin Luitjens; Steven G. Parker; John Shalf; Li Tang
Archive | 2011
Martin Berzins; Justin Luitjens