Network


Latest external collaborations at the country level.

Hotspot


Dive into the research topics where Didem Unat is active.

Publication


Featured research published by Didem Unat.


International Conference on Supercomputing | 2011

Mint: realizing CUDA performance in 3D stencil methods with annotated C

Didem Unat; Xing Cai; Scott B. Baden

We present Mint, a programming model that enables the non-expert to enjoy the performance benefits of hand-coded CUDA without becoming entangled in the details. Mint targets stencil methods, which are an important class of scientific applications. We have implemented the Mint programming model with a source-to-source translator that generates optimized CUDA C from traditional C source. The translator relies on annotations to guide translation at a high level. The set of pragmas is small, and the model is compact and simple. Yet, Mint is able to deliver performance competitive with painstakingly hand-optimized CUDA. We show that, for a set of widely used stencil kernels, Mint realized 80% of the performance obtained from aggressively optimized CUDA on NVIDIA's 200-series GPUs. Our optimizations target three-dimensional kernels, which present a daunting array of optimization choices.
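
The flavor of the model is easy to convey with a sketch. Below, a 7-point 3D stencil sweep in plain C carries Mint-style annotations; the pragma keywords follow the paper's description of a small pragma set, but the exact clause spellings are assumptions for illustration, not the authoritative Mint syntax.

```c
/* A sketch of Mint-style annotated C: one Jacobi sweep of a 7-point
 * 3D stencil. The pragma spellings below (copy, parallel, for, nest,
 * tile) follow the paper's description of a small pragma set, but the
 * exact clause syntax here is an illustrative assumption. */
void heat3d(int n, double c0, double c1,
            double u[n][n][n], double v[n][n][n])
{
    #pragma mint copy(u, toDevice)            /* hypothetical host-to-device hint */
    #pragma mint parallel                     /* hypothetical: offload this region */
    #pragma mint for nest(3) tile(16, 16, 1)  /* hypothetical: map nest to CUDA threads */
    for (int i = 1; i < n - 1; i++)
        for (int j = 1; j < n - 1; j++)
            for (int k = 1; k < n - 1; k++)
                v[i][j][k] = c0 * u[i][j][k]
                           + c1 * (u[i-1][j][k] + u[i+1][j][k]
                                 + u[i][j-1][k] + u[i][j+1][k]
                                 + u[i][j][k-1] + u[i][j][k+1]);
    #pragma mint copy(v, fromDevice)          /* hypothetical device-to-host hint */
}
```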


IEEE International Conference on High Performance Computing, Data, and Analytics | 2014

Abstract machine models and proxy architectures for exascale computing

James A. Ang; Richard F. Barrett; R.E. Benner; D. Burke; Cy P. Chan; Jeanine Cook; David Donofrio; Simon D. Hammond; Karl Scott Hemmert; Suzanne M. Kelly; H. Le; Vitus J. Leung; David Resnick; Arun Rodrigues; John Shalf; Dylan T. Stark; Didem Unat; Nicholas J. Wright

To achieve exascale computing, fundamental hardware architectures must change. This will significantly impact scientific applications that run on current high performance computing (HPC) systems, many of which codify years of scientific domain knowledge and refinements for contemporary computer systems. To adapt to exascale architectures, developers must be able to reason about new hardware and determine what programming models and algorithms will provide the best blend of performance and energy efficiency in the future. An abstract machine model is designed to expose to the application developers and system software only the aspects of the machine that are important or relevant to performance and code structure. These models are intended as communication aids between application developers and hardware architects during the co-design process. A proxy architecture is a parameterized version of an abstract machine model, with parameters added to elucidate potential speeds and capacities of key hardware components. These more detailed architectural models enable discussion between the developers of analytic models and simulators and computer hardware architects, and they open up opportunities for application performance analysis, system software development, and hardware optimization. In this paper, we present a set of abstract machine models and show how they might be used to help software developers prepare for exascale. We then apply parameters to one of these models to demonstrate how a proxy architecture can enable a more concrete exploration of how well application codes map onto future architectures.
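
To make the distinction concrete, the sketch below reduces an abstract machine model to a C struct of performance-relevant parameters and instantiates two hypothetical proxy architectures from it; all field names and numbers are invented placeholders, not values from the report.

```c
#include <stdio.h>

/* An abstract machine model reduced to the parameters an application
 * developer might reason about. Field names and values are invented
 * placeholders, not taken from the report. */
typedef struct {
    const char *name;
    int    cores;             /* compute cores per node                   */
    double peak_gflops_core;  /* peak GFLOP/s per core                    */
    double mem_bw_gbs;        /* node memory bandwidth, GB/s              */
    double nic_bw_gbs;        /* injection bandwidth to the network, GB/s */
} proxy_arch;

/* Roofline-style bound: time is limited by compute or by data movement,
 * whichever is slower. */
static double time_bound_s(const proxy_arch *p, double gflops, double gbytes)
{
    double t_compute = gflops / (p->cores * p->peak_gflops_core);
    double t_memory  = gbytes / p->mem_bw_gbs;
    return t_compute > t_memory ? t_compute : t_memory;
}

int main(void)
{
    /* Two hypothetical proxy instantiations of the same abstract model. */
    proxy_arch a = { "fat-core node",   64, 16.0,  200.0, 10.0 };
    proxy_arch b = { "many-core node", 512,  2.0, 1000.0, 10.0 };

    double gflops = 500.0, gbytes = 400.0;  /* made-up kernel demands */
    printf("%s: %.3f s\n", a.name, time_bound_s(&a, gflops, gbytes));
    printf("%s: %.3f s\n", b.name, time_bound_s(&b, gflops, gbytes));
    return 0;
}
```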


IEEE International Conference on High Performance Computing, Data, and Analytics | 2013

Modeling and predicting performance of high performance computing applications on hardware accelerators

Mitesh R. Meswani; Laura Carrington; Didem Unat; Allan Snavely; Scott B. Baden; Stephen W. Poole

Hybrid-core systems speed up applications by offloading certain compute operations that can run faster on hardware accelerators. However, such systems require significant programming and porting effort to gain a performance benefit from the accelerators. Therefore, prior to porting, it is prudent to investigate the predicted performance benefit of accelerators for a given workload. To address this problem, we present a performance-modeling framework that predicts application performance rapidly and accurately for hybrid-core systems. We present predictions for two full-scale HPC applications, HYCOM and MILC. Our results for two accelerators (GPU and FPGA) show that gather/scatter and stream operations can be sped up by as much as a factor of 15, and that the overall compute times of MILC and HYCOM improve by 3.4% and 20%, respectively. We also show that, in order to benefit from the accelerators, 70% of the data transfer latency between the CPU and the accelerators needs to be overcome.
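
The break-even reasoning behind that last figure can be sketched with a simple offload model. The code below is an illustrative reconstruction under invented numbers, not the paper's actual framework: it computes what fraction of the transfer latency must be hidden before offloading pays off.

```c
#include <stdio.h>

/* Minimal offload model: a kernel that takes t_cpu seconds on the host
 * runs t_cpu / speedup seconds on the accelerator, plus whatever fraction
 * of the CPU<->accelerator transfer time t_xfer is NOT hidden by overlap.
 * Illustrative reconstruction of the break-even argument only. */
static double min_hidden_fraction(double t_cpu, double speedup, double t_xfer)
{
    double t_acc = t_cpu / speedup;
    /* Offload wins when t_acc + (1 - h) * t_xfer < t_cpu; solve for h. */
    double h = 1.0 - (t_cpu - t_acc) / t_xfer;
    return h < 0.0 ? 0.0 : h;  /* 0 means any amount of overlap suffices */
}

int main(void)
{
    /* Made-up inputs: a 15x kernel speedup with transfers costing 3x the
     * host kernel time. */
    double t_cpu = 1.0, speedup = 15.0, t_xfer = 3.0;
    printf("must hide at least %.0f%% of the transfer latency\n",
           100.0 * min_hidden_fraction(t_cpu, speedup, t_xfer));
    return 0;
}
```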


Archive | 2014

Programming Abstractions for Data Locality

Adrian Tate; Amir Kamil; Anshu Dubey; Armin Groblinger; Brad Chamberlain; Brice Goglin; Harold C. Edwards; Chris J. Newburn; David Padua; Didem Unat; Emmanuel Jeannot; Frank Hannig; Tobias Gysi; Hatem Ltaief; James C. Sexton; Jesús Labarta; John Shalf; Karl Fuerlinger; Kathryn O'Brien; Leonidas Linardakis; Maciej Besta; Marie-Christine Sawley; Mark James Abraham; Mauro Bianco; Miquel Pericàs; Naoya Maruyama; Paul H. J. Kelly; Peter Messmer; Robert B. Ross; Romain Cledat

The goal of the workshop and this report is to identify common themes and standardize concepts for locality-preserving abstractions for exascale programming models. Current software tools are built on the premise that computing is the most expensive component, but we are rapidly moving to an era in which computing is cheap and massively parallel while data movement dominates energy and performance costs. To respond to exascale systems (the next generation of high performance computing systems), the scientific computing community needs to refactor its applications to align with the emerging data-centric paradigm. Our applications must evolve to express information about data locality. Unfortunately, current programming environments offer few ways to do so. They ignore the incurred cost of communication and simply rely on hardware cache coherency to virtualize data movement. With the increasing importance of task-level parallelism on future systems, task models have to support constructs that express data locality and affinity. At the system level, communication libraries implicitly assume that all the processing elements are equidistant from each other. In order to take advantage of emerging technologies, application developers need a set of programming abstractions to describe data locality for the new computing ecosystem. The new programming paradigm should be more data-centric and should allow developers to describe how to decompose and how to lay out data in memory. Fortunately, there are many emerging concepts, such as constructs for tiling, data layout, array views, task and thread affinity, and topology-aware communication libraries for managing data locality. There is an opportunity to identify commonalities in strategy that would enable us to combine the best of these concepts into a comprehensive approach to expressing and managing data locality on exascale programming systems. These programming model abstractions can expose crucial information about data locality to the compiler and runtime system to enable performance-portable code. The research question is to identify the right level of abstraction, which includes techniques that range from template libraries all the way to completely new languages, to achieve this goal.
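
Of the concepts listed above, tiling is the simplest to show in code. The sketch below is a generic illustration of the idea, assuming nothing beyond standard C; it is not an API from any of the systems the report surveys.

```c
/* Generic loop tiling for locality: process the matrix in T x T blocks so
 * that each block stays cache-resident while it is reused. Illustrates
 * the tiling concept only; not any specific system's API. */
#define T 64  /* tile edge, tuned to the cache size */

void transpose_tiled(int n, const double *a, double *b)
{
    for (int ii = 0; ii < n; ii += T)
        for (int jj = 0; jj < n; jj += T)
            /* work within one tile: strides stay short, cache lines
             * fetched for a[] and b[] are reused before eviction */
            for (int i = ii; i < ii + T && i < n; i++)
                for (int j = jj; j < jj + T && j < n; j++)
                    b[j * n + i] = a[i * n + j];
}
```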


International Supercomputing Conference | 2013

Software Design Space Exploration for Exascale Combustion Co-design

Cy P. Chan; Didem Unat; Michael J. Lijewski; Weiqun Zhang; John B. Bell; John Shalf

The design of hardware for next-generation exascale computing systems will require a deep understanding of how software optimizations impact hardware design trade-offs. In order to characterize how co-tuning hardware and software parameters affects the performance of combustion simulation codes, we created ExaSAT, a compiler-driven static analysis and performance modeling framework. Our framework can evaluate hundreds of hardware/software configurations in seconds, providing an essential speed advantage over simulators and dynamic analysis techniques during the co-design process. Our analytic performance model shows that advanced code transformations, such as cache blocking and loop fusion, can have a significant impact on choices for cache and memory architecture. Our modeling helped us identify tuned configurations that achieve a 90% reduction in memory traffic, which could significantly improve performance and reduce energy consumption. These techniques will also be useful for the development of advanced programming models and runtimes, which must reason about these optimizations to deliver better performance and energy efficiency.
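
As a concrete instance of the transformations such a model must reason about, the sketch below shows loop fusion eliminating a round trip to memory for an intermediate array; the kernels are invented stand-ins, not code from the combustion application.

```c
/* Loop fusion as a static model might account for it: the unfused version
 * streams tmp[] out to memory and reads it back; the fused version keeps
 * each intermediate value in a register, cutting memory traffic roughly
 * in half. Invented stand-in kernels, not combustion code. */

void unfused(int n, const double *x, double *tmp, double *y)
{
    for (int i = 0; i < n; i++) tmp[i] = 2.0 * x[i] + 1.0;  /* write tmp */
    for (int i = 0; i < n; i++) y[i]   = tmp[i] * tmp[i];   /* re-read tmp */
}

void fused(int n, const double *x, double *y)
{
    for (int i = 0; i < n; i++) {
        double t = 2.0 * x[i] + 1.0;  /* lives in a register, never in DRAM */
        y[i] = t * t;
    }
}
```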


IEEE International Conference on High Performance Computing, Data, and Analytics | 2015

ExaSAT: An exascale co-design tool for performance modeling

Didem Unat; Cy P. Chan; Weiqun Zhang; Samuel Williams; John Bachan; John B. Bell; John Shalf

One of the emerging challenges to designing HPC systems is understanding and projecting the requirements of exascale applications. In order to determine the performance consequences of different hardware designs, analytic models are essential because they can provide fast feedback to the co-design centers and chip designers without costly simulations. However, current attempts to analytically model program performance typically rely on the user manually specifying a performance model. We introduce the ExaSAT framework that automates the extraction of parameterized performance models directly from source code using compiler analysis. The parameterized analytic model enables quantitative evaluation of a broad range of hardware design trade-offs and software optimizations on a variety of different performance metrics, with a primary focus on data movement as a metric. We demonstrate the ExaSAT framework’s ability to perform deep code analysis of a proxy application from the Department of Energy Combustion Co-design Center to illustrate its value to the exascale co-design process. ExaSAT analysis provides insights into the hardware and software trade-offs and lays the groundwork for exploring a more targeted set of design points using cycle-accurate architectural simulators.
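
A hedged sketch of what such an extracted model might look like: a per-loop record of compute and data-movement characteristics, evaluated against a cache-size parameter. The field names, the fits-in-cache rule, and all numbers are simplifying assumptions, not ExaSAT's actual representation.

```c
#include <stdio.h>

/* The kind of parameterized record a static analysis might extract per
 * loop nest. Field names and the fits-in-cache rule are simplifying
 * assumptions for illustration, not ExaSAT's actual model. */
typedef struct {
    long iterations;     /* trip count of the loop nest                 */
    int  flops_per_iter;
    int  bytes_per_iter; /* bytes referenced per iteration, incl. reuse */
    int  reuse_factor;   /* times each byte is re-referenced            */
    long working_set;    /* bytes live across the nest                  */
} loop_model;

/* Predicted DRAM traffic: if the working set fits in cache, repeated
 * references hit in cache and each byte moves once; otherwise every
 * reference goes to memory. */
static long dram_bytes(const loop_model *m, long cache_bytes)
{
    long refs = m->iterations * (long)m->bytes_per_iter;
    return (m->working_set <= cache_bytes) ? refs / m->reuse_factor : refs;
}

int main(void)
{
    loop_model m = { 1000000, 8, 24, 4, 8L << 20 };  /* invented numbers */
    printf(" 1 MB cache: %ld bytes moved\n", dram_bytes(&m, 1L << 20));
    printf("16 MB cache: %ld bytes moved\n", dram_bytes(&m, 16L << 20));
    return 0;
}
```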


Computing in Science and Engineering | 2012

Accelerating a 3D Finite-Difference Earthquake Simulation with a C-to-CUDA Translator

Didem Unat; Jun Zhou; Yifeng Cui; Scott B. Baden; Xing Cai

GPUs provide impressive computing power, but GPU programming can be challenging. Here, an experience in porting a real-world earthquake code to NVIDIA GPUs is described. Specifically, an annotation-based programming model, called Mint, and its accompanying source-to-source translator are used to automatically generate CUDA source code and simplify the exploration of performance tradeoffs.


IEEE International Symposium on Workload Characterization | 2011

Modeling and predicting application performance on hardware accelerators

Mitesh R. Meswani; Laura Carrington; Didem Unat; Allan Snavely; Scott B. Baden; Stephen W. Poole

Systems with hardware accelerators speed up applications by offloading certain compute operations that can run faster on accelerators. Thus, it is not surprising that many of the Top500 supercomputers use accelerators. However, in addition to procurement cost, significant programming and porting effort is required to realize the potential benefit of such accelerators. Hence, before building such a system, it is prudent to answer the question "What is the projected performance benefit from accelerators for workloads of interest?" We address this question by way of a performance-modeling framework, which predicts realizable application performance on accelerators speedily and accurately, without going to the considerable effort of porting and tuning.


IEEE Transactions on Parallel and Distributed Systems | 2017

Trends in Data Locality Abstractions for HPC Systems

Didem Unat; Anshu Dubey; Torsten Hoefler; John Shalf; Mark James Abraham; Mauro Bianco; Bradford L. Chamberlain; Romain Cledat; H. Carter Edwards; Hal Finkel; Karl Fuerlinger; Frank Hannig; Emmanuel Jeannot; Amir Kamil; Jeff Keasler; Paul H. J. Kelly; Vitus J. Leung; Hatem Ltaief; Naoya Maruyama; Chris J. Newburn; Miquel Pericàs

The cost of data movement has always been an important concern in high performance computing (HPC) systems. It has now become the dominant factor in terms of both energy consumption and performance. Support for expression of data locality has been explored in the past, but those efforts have had only modest success in being adopted in HPC applications for various reasons. However, with the increasing complexity of the memory hierarchy and higher parallelism in emerging HPC systems, locality management has acquired a new urgency. Developers can no longer limit themselves to low-level solutions and ignore the potential for productivity and performance portability obtained by using locality abstractions. Fortunately, the trend emerging in recent literature on the topic alleviates many of the concerns that got in the way of their adoption by application developers. Data locality abstractions are available in the forms of libraries, data structures, languages, and runtime systems; a common theme is increasing productivity without sacrificing performance. This paper examines these trends and identifies commonalities that can combine various locality concepts to develop a comprehensive approach to expressing and managing data locality on future large-scale high-performance computing systems.
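
Among the abstraction families surveyed, data-layout control is easy to illustrate. The sketch below contrasts array-of-structs and struct-of-arrays layouts for the same data, the kind of choice these abstractions let developers express without rewriting loop bodies; it is a generic example, not code from any surveyed system.

```c
/* Two layouts for the same particle data. A data-layout abstraction lets
 * the layout change without rewriting the physics loop; here the choice
 * is spelled out by hand. Generic illustration only. */

#define N 100000

/* Array of structs: the x, y, z of one particle sit together (good for
 * per-particle access patterns). */
struct particle { double x, y, z; };
struct particle aos[N];

/* Struct of arrays: all x values are contiguous (good for vectorized
 * sweeps over a single field). */
struct particles { double x[N], y[N], z[N]; } soa;

void shift_x_soa(double dx)
{
    for (int i = 0; i < N; i++)
        soa.x[i] += dx;   /* unit stride: vectorizes cleanly */
}

void shift_x_aos(double dx)
{
    for (int i = 0; i < N; i++)
        aos[i].x += dx;   /* 24-byte stride: wastes cache bandwidth */
}
```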


SIAM Journal on Scientific Computing | 2016

BoxLib with Tiling: An Adaptive Mesh Refinement Software Framework

Weiqun Zhang; Ann S. Almgren; Marcus S. Day; Tan Nguyen; John Shalf; Didem Unat

In this paper we introduce a block-structured adaptive mesh refinement software framework that incorporates tiling, a well-known loop transformation. Because the multiscale, multiphysics codes built in BoxLib are designed to solve complex systems at high resolution, performance on current and next-generation architectures is essential. With the expectation of many more cores per node on next-generation architectures, the ability to effectively utilize threads within a node becomes critical, and the current model for parallelization will not be sufficient. We describe a new version of BoxLib in which the tiling constructs are embedded so that BoxLib-based applications can easily realize expected performance gains without extra effort on the part of the application developer. We also discuss a path forward to enable future versions of BoxLib to take advantage of NUMA-aware optimizations using the TiDA portable library.
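
The idea can be sketched generically: rather than giving each thread a whole box, the box's iteration space is split into tiles and threads pick up tiles, improving cache locality and exposing more parallelism per node. The OpenMP rendering below is an assumption-laden illustration, not BoxLib's actual interface; the tile sizes TX/TY/TZ are invented.

```c
#include <omp.h>

/* Generic rendering of tiled box traversal: split an nx x ny x nz box
 * into tiles and let OpenMP threads claim tiles, instead of one thread
 * per box. Not BoxLib's actual interface; tile sizes are illustrative. */
#define TX 64
#define TY 8
#define TZ 8

void sweep_box(int nx, int ny, int nz, double *q)
{
    #pragma omp parallel for collapse(3) schedule(dynamic)
    for (int kk = 0; kk < nz; kk += TZ)
        for (int jj = 0; jj < ny; jj += TY)
            for (int ii = 0; ii < nx; ii += TX)
                /* one thread owns this tile; its data stays cache-resident */
                for (int k = kk; k < kk + TZ && k < nz; k++)
                    for (int j = jj; j < jj + TY && j < ny; j++)
                        for (int i = ii; i < ii + TX && i < nx; i++)
                            q[(k * (long)ny + j) * nx + i] *= 0.5;
}
```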

Collaboration


Dive into Didem Unat's collaborations.

Top Co-Authors

John Shalf, Lawrence Berkeley National Laboratory
Weiqun Zhang, Lawrence Berkeley National Laboratory
Scott B. Baden, University of California
Ann S. Almgren, Lawrence Berkeley National Laboratory
Tan Nguyen, Lawrence Berkeley National Laboratory
Cy P. Chan, Lawrence Berkeley National Laboratory
Allan Snavely, University of California