Sean Treichler | Researchain

Publication

Featured research published by Sean Treichler.

ieee international conference on high performance computing data and analytics | 2012

Legion: expressing locality and independence with logical regions

Michael Bauer; Sean Treichler; Elliott Slaughter; Alex Aiken

Modern parallel architectures have both heterogeneous processors and deep, complex memory hierarchies. We present Legion, a programming model and runtime system for achieving high performance on these machines. Legion is organized around logical regions, which express both locality and independence of program data, and tasks, functions that perform computations on regions. We describe a runtime system that dynamically extracts parallelism from Legion programs, using a distributed, parallel scheduling algorithm that identifies both independent tasks and nested parallelism. Legion also enables explicit, programmer-controlled movement of data through the memory hierarchy and placement of tasks based on locality information via a novel mapping interface. We evaluate our Legion implementation on three applications: fluid-flow on a regular grid, a three-level AMR code solving a heat diffusion equation, and a circuit simulation.

international conference on parallel architectures and compilation techniques | 2014

Realm: an event-based low-level runtime for distributed memory architectures

Alex Aiken; Michael Bauer; Sean Treichler

We present Realm, an event-based runtime system for heterogeneous, distributed memory machines. Realm is fully asynchronous: all runtime actions are non-blocking. Realm supports spawning computations, moving data, and reservations, a novel synchronization primitive. Asynchrony is exposed via a light-weight event system capable of operating without central management. We describe an implementation of Realm that relies on a novel generational event data structure for efficiently handling large numbers of events in a distributed address space. Microbenchmark experiments show our implementation of Realm approaches the underlying hardware performance limits. We measure the performance of three real-world applications on the Keeneland supercomputer. Our results demonstrate that Realm confers considerable latency hiding to clients, attaining significant speedups over traditional bulk-synchronous and independently optimized MPI codes.

conference on object oriented programming systems languages and applications | 2013

Language support for dynamic, hierarchical data partitioning

Sean Treichler; Michael Bauer; Alex Aiken

Applications written for distributed-memory parallel architectures must partition their data to enable parallel execution. As memory hierarchies become deeper, it is increasingly necessary that the data partitioning also be hierarchical to match. Current language proposals perform this hierarchical partitioning statically, which excludes many important applications where the appropriate partitioning is itself data dependent and so must be computed dynamically. We describe Legion, a region-based programming system, where each region may be partitioned into subregions. Partitions are computed dynamically and are fully programmable. The division of data need not be disjoint, and subregions of a region may overlap, or alias one another. Computations use regions with certain privileges (e.g., expressing that a computation uses a region read-only) and data coherence (e.g., expressing that the computation need only be atomic with respect to other operations on the region), which can be controlled on a per-region (or subregion) basis. We present the novel aspects of the Legion design, in particular the combination of static and dynamic checks used to enforce soundness. We give an extended example illustrating how Legion can express computations with dynamically determined relationships between computations and data partitions. We prove the soundness of Legion's type system, and show Legion type checking improves performance by up to 71% by eliding provably safe memory checks. In particular, we show that the dynamic checks to detect aliasing at runtime at the region granularity have negligible overhead. We report results for three real-world applications running on distributed memory machines, achieving up to 62.5X speedup on 96 GPUs on the Keeneland supercomputer.
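The interplay of possibly aliased subregions and privileges described in this abstract can be illustrated with a small sketch. This is not Legion's API: the `Access` class, the two-level `"read"`/`"write"` privilege model, and the `interferes` function are hypothetical names chosen for illustration, and coherence modes such as atomic are ignored entirely.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Access:
    """One task's use of a (sub)region: which elements, and with what privilege.

    Hypothetical model: a subregion is just a set of element indices, and the
    privilege is either "read" or "write".
    """
    elements: frozenset
    privilege: str


def interferes(a: Access, b: Access) -> bool:
    """Return True if the two accesses must be ordered with respect to each other."""
    # Disjoint subregions never interfere, whatever the privileges are.
    if a.elements.isdisjoint(b.elements):
        return False
    # Aliased subregions interfere only if at least one side writes.
    return a.privilege == "write" or b.privilege == "write"


# Two subregions that alias on element 3 (for example, a shared boundary cell).
left = frozenset({0, 1, 2, 3})
right = frozenset({3, 4, 5, 6})

both_read = interferes(Access(left, "read"), Access(right, "read"))
one_writes = interferes(Access(left, "write"), Access(right, "read"))
```

Under these assumptions, two read-only tasks on overlapping subregions may run in parallel, while a writer on either side forces an ordering. The set-disjointness test stands in for the region-granularity aliasing check the abstract mentions.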

acm sigplan symposium on principles and practice of parallel programming | 2014

Singe: leveraging warp specialization for high performance on GPUs

Michael Bauer; Sean Treichler; Alex Aiken

We present Singe, a Domain Specific Language (DSL) compiler for combustion chemistry that leverages warp specialization to produce high performance code for GPUs. Instead of relying on traditional GPU programming models that emphasize data-parallel computations, warp specialization allows compilers like Singe to partition computations into sub-computations which are then assigned to different warps within a thread block. Fine-grain synchronization between warps is performed efficiently in hardware using producer-consumer named barriers. Partitioning computations using warp specialization allows Singe to deal efficiently with the irregularity in both data access patterns and computation. Furthermore, warp-specialized partitioning of computations allows Singe to fit extremely large working sets into on-chip memories. Finally, we describe the architecture and general compilation techniques necessary for constructing a warp-specializing compiler. We show that the warp-specialized code emitted by Singe is up to 3.75X faster than previously optimized data-parallel GPU kernels.

ieee international conference on high performance computing data and analytics | 2015

Regent: a high-productivity programming language for HPC with logical regions

Elliott Slaughter; Wonchan Lee; Sean Treichler; Michael Bauer; Alex Aiken

We present Regent, a high-productivity programming language for high performance computing with logical regions. Regent users compose programs with tasks (functions eligible for parallel execution) and logical regions (hierarchical collections of structured objects). Regent programs appear to execute sequentially, require no explicit synchronization, and are trivially deadlock-free. Regent's type system catches many common classes of mistakes and guarantees that a program with correct serial execution produces identical results on parallel and distributed machines. We present an optimizing compiler for Regent that translates Regent programs into efficient implementations for Legion, an asynchronous task-based model. Regent employs several novel compiler optimizations to minimize the dynamic overhead of the runtime system and enable efficient operation. We evaluate Regent on three benchmark applications and demonstrate that Regent achieves performance comparable to hand-tuned Legion.

ieee international conference on high performance computing data and analytics | 2014

Structure slicing: extending logical regions with fields

Michael Bauer; Sean Treichler; Elliott Slaughter; Alex Aiken

Applications on modern supercomputers are increasingly limited by the cost of data movement, but mainstream programming systems have few abstractions for describing the structure of a program's data. Consequently, the burden of managing data movement, placement, and layout currently falls primarily upon the programmer. To address this problem we previously proposed a data model based on logical regions and described Legion, a programming system incorporating logical regions. In this paper, we present structure slicing, which incorporates fields into the logical region data model. We show that structure slicing enables Legion to automatically infer task parallelism from field non-interference, decouple the specification of data usage from layout, and reduce the overall amount of data moved. We demonstrate that structure slicing enables both strong and weak scaling of three Legion applications including S3D, a production combustion simulation that uses logical regions with thousands of fields, with speedups of up to 3.68X over a vectorized CPU-only Fortran implementation and 1.88X over an independently hand-tuned OpenACC code.
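Field non-interference, as used in this abstract, can be sketched as follows. The representation is an assumption made for illustration: each task declares a dictionary from field name to privilege on the same region, and `fields_interfere` is a hypothetical helper, not a Legion function. The field names are invented.

```python
def fields_interfere(a: dict, b: dict) -> bool:
    """Return True if two tasks on the same region conflict on some field.

    `a` and `b` map field names to "read" or "write" (a hypothetical model).
    Tasks touching different fields are independent; tasks sharing a field
    conflict only if at least one of them writes it.
    """
    shared = a.keys() & b.keys()
    return any(a[f] == "write" or b[f] == "write" for f in shared)


# Three tasks over the same region, with made-up field names.
update_voltage = {"voltage": "write", "charge": "read"}
log_charge = {"charge": "read"}
read_voltage = {"voltage": "read"}

independent = fields_interfere(update_voltage, log_charge)
dependent = fields_interfere(update_voltage, read_voltage)
```

In this sketch the first pair shares only a field that both tasks read, so they could run in parallel even though they name the same region. The second pair shares a written field and must be ordered.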

Archive | 2015

ASC ATDM Level 2 Milestone #5325: Asynchronous Many-Task Runtime System Analysis and Assessment for Next Generation Platforms.

Gavin Matthew Baker; Matthew Tyler Bettencourt; Steven W. Bova; Ken Franko; Marc Gamell; Ryan E. Grant; Simon D. Hammond; David S. Hollman; Samuel Knight; Hemanth Kolla; Paul Lin; Stephen L. Olivier; Gregory D. Sjaardema; Nicole Lemaster Slattengren; Keita Teranishi; Jeremiah J. Wilke; Janine C. Bennett; Robert L. Clay; Laxmikant Kale; Nikhil Jain; Eric Mikida; Alex Aiken; Michael Bauer; Wonchan Lee; Elliott Slaughter; Sean Treichler; Martin Berzins; Todd Harman; Alan Humphreys; John A. Schmidt

This report provides in-depth information and analysis to help create a technical road map for developing next-generation programming models and runtime systems that support Advanced Simulation and Computing (ASC) workload requirements. The focus herein is on asynchronous many-task (AMT) models and runtime systems, which are of great interest in the context of “exascale” computing, as they hold the promise to address key issues associated with future extreme-scale computer architectures. This report includes a thorough qualitative and quantitative examination of three best-of-class AMT runtime systems (Charm++, Legion, and Uintah), all of which are in use as part of the ASC Predictive Science Academic Alliance Program II (PSAAP-II) Centers. The studies focus on each of the runtimes’ programmability, performance, and mutability. Through the experiments and analysis presented, several overarching findings emerge. From a performance perspective, AMT runtimes show tremendous potential for addressing extreme-scale challenges. Empirical studies show an AMT runtime can mitigate performance heterogeneity inherent to the machine itself and that Message Passing Interface (MPI) and AMT runtimes perform comparably under balanced conditions. From a programmability and mutability perspective, however, none of the runtimes in this study are currently ready for use in developing production-ready Sandia ASC applications. The report concludes by recommending a co-design path forward, wherein application, programming model, and runtime system developers work together to define requirements and solutions. Such a requirements-driven co-design approach benefits the high-performance computing (HPC) community as a whole, with widespread community engagement mitigating risk for both application developers and runtime system developers.

Computing in Science and Engineering | 2011

EcoG: A Power-Efficient GPU Cluster Architecture for Scientific Computing

Michael T. Showerman; Jeremy Enos; Craig P. Steffen; Sean Treichler; William Gropp; Wen-mei W. Hwu

Researchers built the EcoG GPU-based cluster to show that a system can be designed around GPU computing and still be power efficient.

international parallel and distributed processing symposium | 2016

Towards Asynchronous Many-Task in Situ Data Analysis Using Legion

Philippe Pierre Pebay; Janine C. Bennett; David S. Hollman; Sean Treichler; Patrick S. McCormick; Christine Sweeney; Hemanth Kolla; Alex Aiken

We explore the use of asynchronous many-task (AMT) programming models for the implementation of in situ analysis towards the goal of maximizing programmer productivity and overall performance on next generation platforms. We describe how a broad class of statistics algorithms can be transformed from a traditional single-program multiple-data (SPMD) implementation to an AMT implementation, demonstrating with a concrete example: a measurement of descriptive statistics implemented in Legion. Our experiments to quantify the benefit and possible drawbacks of this approach are in progress, and we present some encouraging initial results on the (minimal) impact of the AMT-based approach on code complexity, task scheduling, and application scalability.
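One reason descriptive statistics suit a task-based decomposition is that partial results over separate chunks of data can be computed independently and merged afterwards. The sketch below uses the standard pairwise update formulas for count, mean, and sum of squared deviations. It is an assumption about the kind of algorithm involved, not the paper's Legion implementation, and `Moments`, `from_chunk`, and `merge` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class Moments:
    """Partial descriptive statistics for one chunk of data."""
    n: int = 0
    mean: float = 0.0
    m2: float = 0.0  # sum of squared deviations from the mean


def from_chunk(values) -> Moments:
    """Compute partial statistics for one chunk (one independent task)."""
    n = len(values)
    if n == 0:
        return Moments()
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return Moments(n, mean, m2)


def merge(a: Moments, b: Moments) -> Moments:
    """Combine two partial results without revisiting the underlying data."""
    n = a.n + b.n
    if n == 0:
        return Moments()
    delta = b.mean - a.mean
    mean = a.mean + delta * b.n / n
    m2 = a.m2 + b.m2 + delta * delta * a.n * b.n / n
    return Moments(n, mean, m2)


# Each chunk could be handled by a separate task; the merge is a reduction.
parts = [from_chunk([1.0, 2.0, 3.0, 4.0]), from_chunk([5.0, 6.0, 7.0, 8.0])]
total = merge(parts[0], parts[1])
variance = total.m2 / (total.n - 1)  # sample variance
```

Because `merge` needs only the partial results and not the chunks themselves, the per-chunk tasks carry no dependences on one another.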

conference on object oriented programming systems languages and applications | 2016

Dependent partitioning

Sean Treichler; Michael Bauer; Rahul Sharma; Elliott Slaughter; Alex Aiken

A key problem in parallel programming is how data is partitioned: divided into subsets that can be operated on in parallel and, in distributed memory machines, spread across multiple address spaces. We present a dependent partitioning framework that allows an application to concisely describe relationships between partitions. Applications first establish independent partitions, which may contain arbitrary subsets of application data, permitting the expression of arbitrary application-specific data distributions. Dependent partitions are then derived from these using the dependent partitioning operations provided by the framework. By directly capturing inter-partition relationships, our framework can soundly and precisely reason about programs to perform important program analyses crucial to ensuring correctness and achieving good performance. As an example of the reasoning made possible, we present a static analysis that discharges most consistency checks on partitioned data during compilation. We describe an implementation of our framework within Regent, a language designed for the Legion programming model. The use of dependent partitioning constructs results in an 86-96% decrease in the lines of code required to describe the partitioning, eliminates many of the expensive dynamic checks required for soundness by the current Regent partitioning implementation, and speeds up the computation of partitions by 2.6-12.7X even on a single thread. Additionally, we show that a distributed implementation incorporated into the Legion runtime system allows partitioning of data sets that are too large to fit on a single node and yields a further 29X speedup of partitioning operations on 64 nodes.
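The idea of deriving one partition from another can be sketched in a few lines. The abstract does not list the framework's operations, so `image` and `preimage` below are illustrative names for two plausible derivations through a pointer-like mapping. The dictionary-of-sets representation and the edge and node data are assumptions made for this example.

```python
def image(partition: dict, pointer: dict) -> dict:
    """Derive a partition of targets: each subset is mapped through `pointer`."""
    return {color: {pointer[src] for src in subset}
            for color, subset in partition.items()}


def preimage(partition: dict, pointer: dict) -> dict:
    """Derive a partition of sources: everything that points into each subset."""
    return {color: {src for src, dst in pointer.items() if dst in subset}
            for color, subset in partition.items()}


# Independent partition: the application chooses how to split its edges.
edge_parts = {0: {0, 1}, 1: {2, 3}}
# Hypothetical pointer field: the node each edge refers to.
edge_to_node = {0: 10, 1: 11, 2: 11, 3: 12}

# Dependent partition: the nodes each edge subset needs.
node_parts = image(edge_parts, edge_to_node)

# Going the other way: the edges that point into each chosen node subset.
edges_by_node = preimage({0: {10, 11}, 1: {12}}, edge_to_node)
```

In this example node 11 appears in both derived subsets, so the derived partition is aliased. Since the relationship between the two partitions is stated directly rather than computed by ad hoc code, it is the sort of property a compiler could reason about.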
