Justin Luitjens
University of Utah
Publications
Featured research published by Justin Luitjens.
IEEE International Conference on High Performance Computing, Data, and Analytics | 2014
Oreste Villa; Daniel R. Johnson; Mike O'Connor; Evgeny Bolotin; David W. Nellans; Justin Luitjens; Nikolai Sakharnykh; Peng Wang; Paulius Micikevicius; Anthony Scudiero; Stephen W. Keckler; William J. Dally
Modern scientific discovery is driven by an insatiable demand for computing performance. The HPC community is targeting development of supercomputers able to sustain 1 ExaFlops by the year 2020 and power consumption is the primary obstacle to achieving this goal. A combination of architectural improvements, circuit design, and manufacturing technologies must provide over a 20× improvement in energy efficiency. In this paper, we present some of the progress NVIDIA Research is making toward the design of Exascale systems by tailoring features to address the scaling challenges of performance and energy efficiency. We evaluate several architectural concepts for a set of HPC applications demonstrating expected energy efficiency improvements resulting from circuit and packaging innovations such as low-voltage SRAM, low-energy signalling, and on-package memory. Finally, we discuss the scaling of these features with respect to future process technologies and provide power and performance projections for our Exascale research architecture.
Many-Task Computing on Grids and Supercomputers | 2010
Qingyu Meng; Justin Luitjens; Martin Berzins
Uintah is a computational framework for fluid-structure interaction problems using a combination of the ICE fluid flow algorithm, adaptive mesh refinement (AMR) and MPM particle methods. Uintah uses domain decomposition with a task-graph approach for asynchronous communication and automatic message generation. The Uintah software has been used for a decade with its original task scheduler, which ran computational tasks in a predefined static order. In order to improve the performance of Uintah on petascale architectures, a new dynamic task scheduler that allows better overlap of communication and computation is designed and evaluated in this study. The new scheduler supports asynchronous, out-of-order scheduling of computational tasks by putting them in a distributed directed acyclic graph (DAG) and by isolating task memory and keeping multiple copies of task variables in a data warehouse when necessary. A new runtime system has been implemented with a two-stage priority queuing architecture to improve scheduling efficiency. The effectiveness of this new approach is shown through an analysis of the performance of the software on large-scale fluid-structure examples.
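The two-stage design described in this abstract can be illustrated with a small scheduler skeleton. The sketch below is only a rough approximation of the idea, not Uintah's runtime: Task, internalReady, and externalReady are illustrative names, the priority field is a placeholder for whatever heuristic the scheduler uses, and the MPI layer is elided entirely.

```cpp
// Minimal sketch of an asynchronous, out-of-order DAG task scheduler with a
// two-stage ready-queue design, loosely following the description above.
// All names here are illustrative and are not Uintah's actual API.
#include <functional>
#include <queue>
#include <vector>

struct Task {
    std::function<void()> run;      // the computational kernel to execute
    int unmetDependencies = 0;      // unsatisfied task-graph dependencies
    int pendingMessages   = 0;      // outstanding MPI receives for this task
    int priority          = 0;      // scheduling priority (placeholder heuristic)
    std::vector<Task*> successors;  // tasks that consume this task's output
};

struct ByPriority {
    bool operator()(const Task* a, const Task* b) const {
        return a->priority < b->priority;       // higher priority pops first
    }
};

using ReadyQueue = std::priority_queue<Task*, std::vector<Task*>, ByPriority>;

class Scheduler {
public:
    ReadyQueue internalReady;  // graph dependencies met; data may be in flight
    ReadyQueue externalReady;  // all data has arrived; task may run now

    // Called when a predecessor of t finishes.
    void dependencySatisfied(Task* t) {
        if (--t->unmetDependencies == 0) internalReady.push(t);
    }

    // Called by the communication layer when an MPI receive for t completes.
    void messageArrived(Task* t) {
        if (--t->pendingMessages == 0) externalReady.push(t);
    }

    // Move tasks with no outstanding messages straight to the run queue;
    // in a real runtime this is also where their MPI receives are posted.
    void promoteReady() {
        while (!internalReady.empty()) {
            Task* t = internalReady.top();
            internalReady.pop();
            if (t->pendingMessages == 0) externalReady.push(t);
            // otherwise messageArrived() will promote it later
        }
    }

    // Run one task out of order and release its successors.
    bool runOne() {
        promoteReady();
        if (externalReady.empty()) return false;   // nothing runnable yet
        Task* t = externalReady.top();
        externalReady.pop();
        t->run();
        for (Task* s : t->successors) dependencySatisfied(s);
        return true;
    }
};
```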
International Parallel and Distributed Processing Symposium | 2010
Justin Luitjens; Martin Berzins
Uintah is a highly parallel and adaptive multi-physics framework created by the Center for Simulation of Accidental Fires and Explosions in Utah. Uintah, which is built upon the Common Component Architecture, has facilitated the simulation of a wide variety of fluid-structure interaction problems using both adaptive structured meshes for the fluid and particles to model solids. Uintah was originally designed for, and has performed well on, about a thousand processors. The evolution of Uintah to use tens of thousands of processors has required improvements in memory usage, data structure design, load balancing algorithms and cost estimation in order to improve strong and weak scalability up to 98,304 cores, both for situations in which the mesh varies adaptively and for cases in which particles that represent the solids move from mesh cell to mesh cell.
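The cost estimation mentioned in this abstract feeds a load balancer that splits an ordered list of patches (for example, ordered along a space-filling curve) into contiguous pieces of roughly equal estimated cost. The sketch below assumes a deliberately simple cost model of cells plus weighted particles; Uintah's actual calibrated cost model and data structures are not shown.

```cpp
// Sketch of cost-estimate-driven load balancing over a curve-ordered patch
// list: each patch gets an estimated cost and the ordered list is cut into
// contiguous chunks of roughly equal total cost, one chunk per rank.
// The cost model (cells + weighted particles) is an illustrative assumption.
#include <cstddef>
#include <vector>

struct PatchCost {
    int    id;         // patch identifier, already in curve order
    double cells;      // number of mesh cells in the patch
    double particles;  // number of particles currently in the patch
};

// Assign each patch to a rank so that consecutive ranks receive consecutive
// curve segments of approximately equal estimated cost.
std::vector<int> assignPatches(const std::vector<PatchCost>& patches,
                               int numRanks,
                               double particleWeight /* relative cost of one particle */) {
    double total = 0.0;
    for (const PatchCost& p : patches)
        total += p.cells + particleWeight * p.particles;

    const double target = total / numRanks;   // ideal cost per rank
    std::vector<int> owner(patches.size());
    double accumulated = 0.0;
    int rank = 0;
    for (std::size_t i = 0; i < patches.size(); ++i) {
        owner[i] = rank;
        accumulated += patches[i].cells + particleWeight * patches[i].particles;
        // Move to the next rank once this one has reached its share of the cost.
        if (accumulated >= target * (rank + 1) && rank + 1 < numRanks) ++rank;
    }
    return owner;
}
```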
Concurrency and Computation: Practice and Experience | 2007
Justin Luitjens; Martin Berzins; Thomas C. Henderson
In this paper we consider the scalability of parallel space-filling curve generation as implemented through parallel sorting algorithms. Multiple sorting algorithms are studied and results show that space-filling curves can be generated quickly in parallel on thousands of processors. In addition, performance models are presented that are consistent with measured performance and offer insight into performance on still larger numbers of processors. At large numbers of processors, the scalability of adaptively mesh-refined codes depends on the individual components of the adaptive solver. One such component is the dynamic load balancer. In adaptively refined codes the mesh is constantly changing, resulting in load imbalance among the processors and requiring a load-balancing phase. The load balancing may occur often, requiring the load balancer to perform quickly. One common method for dynamic load balancing is to use space-filling curves. Space-filling curves, in particular the Hilbert curve, generate good partitions quickly in serial. However, at tens and hundreds of thousands of processors, serial generation of space-filling curves will hinder scalability. In order to avoid this issue we have developed a method that generates space-filling curves quickly in parallel by reducing the generation to integer sorting.
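The reduction of curve generation to integer sorting can be sketched as follows. For brevity the example computes Morton (Z-order) keys rather than the Hilbert keys used in the paper, and it sorts with std::sort; a distributed version would replace that call with a parallel integer sort across MPI ranks.

```cpp
// Sketch of curve generation reduced to integer sorting: assign each mesh
// patch an integer key along a space-filling curve, then sort the keys.
// Morton (Z-order) keys are used here for brevity; Hilbert keys give better
// locality but need a more involved key computation.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

// Spread the low 21 bits of x so they occupy every third bit position.
static std::uint64_t spreadBits(std::uint64_t x) {
    x &= 0x1fffff;                                   // keep 21 bits (3*21 = 63)
    x = (x | (x << 32)) & 0x1f00000000ffffULL;
    x = (x | (x << 16)) & 0x1f0000ff0000ffULL;
    x = (x | (x << 8))  & 0x100f00f00f00f00fULL;
    x = (x | (x << 4))  & 0x10c30c30c30c30c3ULL;
    x = (x | (x << 2))  & 0x1249249249249249ULL;
    return x;
}

// 3-D Morton key: interleave the bits of the (i, j, k) index.
static std::uint64_t mortonKey(std::uint32_t i, std::uint32_t j, std::uint32_t k) {
    return spreadBits(i) | (spreadBits(j) << 1) | (spreadBits(k) << 2);
}

struct Patch {
    std::uint32_t i, j, k;   // integer coordinates of the patch (e.g. its origin)
    int id;                  // identifier used by the load balancer
};

// Return patch ids ordered along the space-filling curve.
std::vector<int> curveOrder(const std::vector<Patch>& patches) {
    std::vector<std::pair<std::uint64_t, int>> keyed;
    keyed.reserve(patches.size());
    for (const Patch& p : patches)
        keyed.emplace_back(mortonKey(p.i, p.j, p.k), p.id);
    std::sort(keyed.begin(), keyed.end());   // curve order == sorted key order
    std::vector<int> order;
    order.reserve(keyed.size());
    for (const auto& kv : keyed) order.push_back(kv.second);
    return order;
}
```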
Concurrency and Computation: Practice and Experience | 2011
Justin Luitjens; Martin Berzins
Block-structured adaptive mesh refinement (BSAMR) is widely used within simulation software because it improves the utilization of computing resources by refining the mesh only where necessary. For BSAMR to scale onto existing petascale and eventually exascale computers, all portions of the simulation need to weak scale ideally. Any portions of the simulation that do not will become a bottleneck at larger numbers of cores. The challenge is to design algorithms that will make it possible to avoid these bottlenecks on exascale computers. One step of existing BSAMR algorithms involves determining where to create new patches of refinement. The Berger–Rigoutsos algorithm is commonly used to perform this task. This paper provides a detailed analysis of the performance of two existing parallel implementations of the Berger–Rigoutsos algorithm, develops a new parallel implementation of the Berger–Rigoutsos algorithm, and presents a tiled algorithm that exhibits ideal scalability. The analysis and computational results up to 98,304 cores are used to design performance models, which are then used to predict how these algorithms will perform on 100M cores.
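The tiled alternative to Berger–Rigoutsos can be sketched in a few lines: cover the index space with fixed-size tiles and promote every tile containing a flagged cell to a patch of refinement. The version below is serial and purely illustrative; the names and types are not from the paper.

```cpp
// Sketch of tiled regridding: every tile that contains at least one flagged
// cell becomes a patch on the new refinement level. Each rank can run this
// on its locally flagged cells, with no recursive clustering step across
// ranks, which is what removes the scaling bottleneck.
// Assumes non-negative cell indices.
#include <array>
#include <set>
#include <vector>

using Cell = std::array<int, 3>;   // (i, j, k) index of a flagged cell
using Tile = std::array<int, 3>;   // integer tile coordinates

// Map each flagged cell to its tile; the set removes duplicates, so the
// result is the list of patches (tiles) the new refinement level needs.
std::vector<Tile> tiledRegrid(const std::vector<Cell>& flaggedCells,
                              const std::array<int, 3>& tileSize) {
    std::set<Tile> tiles;
    for (const Cell& c : flaggedCells) {
        tiles.insert({c[0] / tileSize[0],
                      c[1] / tileSize[1],
                      c[2] / tileSize[2]});
    }
    return {tiles.begin(), tiles.end()};
}
```

Compared with Berger–Rigoutsos clustering, this trades some patch efficiency (tiles may cover more unflagged cells) for a regridding step whose cost depends only on the local flags.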
TeraGrid Conference | 2010
Martin Berzins; Justin Luitjens; Qingyu Meng; Todd Harman; Charles A. Wight; Joseph R. Peterson
Archive | 2007
Justin Luitjens; Bryan Worthen
IEEE International Conference on High Performance Computing, Data, and Analytics | 2012
Richard F. Barrett; Shekhar Y. Borkar; Sudip S. Dosanjh; Simon D. Hammond; Michael A. Heroux; Xiaobo Sharon Hu; Justin Luitjens; Steven G. Parker; John Shalf; Li Tang
Archive | 2011
Martin Berzins; Justin Luitjens