Hansang Bae
Purdue University
Publications
Featured research published by Hansang Bae.
IEEE Transactions on Electron Devices | 2007
Gerhard Klimeck; Shaikh Ahmed; Hansang Bae; Neerav Kharche; Steve Clark; Benjamin P Haley; Sunhee Lee; Maxim Naumov; Hoon Ryu; Faisal Saied; Martha Prada; Marek Korkusinski; Timothy B. Boykin; Rajib Rahman
Device physics and material science meet at the atomic scale of novel nanostructured semiconductors, and the distinction between new device and new material is blurred. Not only the quantum-mechanical effects in the electronic states of the device but also the granular atomistic representation of the underlying material are important. Approaches based on a continuum representation of the underlying material typically used by device engineers and physicists become invalid. Ab initio methods used by material scientists typically do not represent the band gaps and masses precisely enough for device design, or they do not scale to realistically large device sizes. The plethora of geometry, material, and doping configurations in semiconductor devices at the nanoscale suggests that a general nanoelectronic modeling tool is needed. The 3-D NanoElectronic MOdeling (NEMO 3-D) tool has been developed to address these needs. Based on the atomistic valence force field and a variety of nearest neighbor tight-binding models (e.g., s, sp³s*, and sp³d⁵s*), NEMO 3-D enables the computation of strain and electronic structure for more than 64 and 52 million atoms, corresponding to volumes of (110 nm)³ and (101 nm)³, respectively. The physical problem may involve very large scale computations, and NEMO 3-D has been designed and optimized to be scalable from single central processing units to large numbers of processors on commodity clusters and supercomputers. NEMO 3-D was released under an open-source license in 2003 and is continually developed by the Network for Computational Nanotechnology (NCN). A web-based online interactive version for educational purposes is freely available on the NCN portal (http://www.nanoHUB.org). In this paper, theoretical models and essential algorithmic and computational components that have been used in the development and successful deployment of NEMO 3-D are discussed.
IEEE Computer | 2009
Chirag Dave; Hansang Bae; Seung-Jai Min; Seyong Lee; Rudolf Eigenmann; Samuel P. Midkiff
The Cetus tool provides an infrastructure for research on multicore compiler optimizations that emphasizes automatic parallelization. The compiler infrastructure, which targets C programs, supports source-to-source transformations, is user-oriented and easy to handle, and provides the most important parallelization passes as well as the underlying enabling techniques.
Journal of Physics: Conference Series | 2009
Benjamin P Haley; Sunhee Lee; Mathieu Luisier; Hoon Ryu; Faisal Saied; Steve Clark; Hansang Bae; Gerhard Klimeck
Recent improvements to existing HPC codes NEMO 3-D and OMEN, combined with access to peta-scale computing resources, have enabled realistic device engineering simulations that were previously infeasible. NEMO 3-D can now simulate 1 billion atom systems, and, using 3D spatial decomposition, scale to 32,768 cores. Simulation time for the band structure of an experimental P-doped Si quantum computing device fell from 40 minutes to 1 minute. OMEN can perform fully quantum mechanical transport calculations for real-world UTB FETs on 147,456 cores in roughly 5 minutes. Both of these tools power simulation engines on the nanoHUB, giving the community access to previously unavailable research capabilities.
International Journal of Parallel Programming | 2013
Hansang Bae; Dheya Mustafa; Jae-Woo Lee; Aurangzeb; Hao Lin; Chirag Dave; Rudolf Eigenmann; Samuel P. Midkiff
This paper provides an overview and an evaluation of the Cetus source-to-source compiler infrastructure. The original goal of the Cetus project was to create an easy-to-use compiler for research in automatic parallelization of C programs. In the meantime, Cetus has been used for many additional program transformation tasks. It serves as a compiler infrastructure for many projects in the US and internationally. Recently, Cetus has been supported by the National Science Foundation to build a community resource. The compiler has gone through several iterations of benchmark studies and implementations of those techniques that could improve the parallel performance of these programs. These efforts have resulted in a system that compares favorably with state-of-the-art parallelizers, such as Intel’s ICC. A key limitation of advanced optimizing compilers is their lack of runtime information, such as the program input data. We will discuss and evaluate several techniques that support dynamic optimization decisions. Finally, as there is an extensive body of proposed compiler analyses and transformations for parallelization, the question of the importance of the techniques arises. This paper evaluates the impact of the individual Cetus techniques on overall program performance.
Languages and Compilers for Parallel Computing | 2011
Okwan Kwon; Fahed Jubair; Seung-Jai Min; Hansang Bae; Rudolf Eigenmann; Samuel P. Midkiff
OpenMP is an explicit parallel programming model that offers reasonable productivity. Its memory model assumes a shared address space, and hence the direct translation, as done by common OpenMP compilers, requires an underlying shared-memory architecture. Many lab machines include tens of processors, are built from commodity components, and thus have distributed address spaces. Despite many efforts to provide higher productivity for these platforms, the most common programming model uses message passing, which is substantially more tedious to program than shared-address-space models. This paper presents a compiler/runtime system that translates OpenMP programs into message passing variants and executes them on clusters of up to 64 processors. We build on previous work that provided a proof of concept of such translation. The present paper describes compiler algorithms and runtime techniques that provide the automatic translation of a first class of OpenMP applications: those that exhibit regular write array subscripts and repetitive communication. We evaluate the translator on representative benchmarks of this class and compare their performance against hand-written MPI variants. In all but one case, our translated versions perform close to the hand-written variants.
International Workshop on OpenMP | 2005
Zhelong Pan; Brian Armstrong; Hansang Bae; Rudolf Eigenmann
Iteration space tiling is a well-explored programming and compiler technique to enhance program locality. Its performance benefit appears obvious, as the ratio of processor versus memory speed increases continuously. In an effort to include a tiling pass into an advanced parallelizing compiler, we have found that the interaction of tiling and parallelization raises unexplored issues. Applying existing, sequential tiling techniques, followed by parallelization, leads to performance degradation in many programs. Applying tiling after parallelization without considering parallel execution semantics may lead to incorrect programs. Doing so conservatively also introduces overhead in some of the measured programs. In this paper, we present an algorithm that applies tiling in concert with parallelization. The algorithm avoids the above negative effects. Our paper also presents the first comprehensive evaluation of tiling techniques on compiler-parallelized programs. Our tiling algorithm improves the SPEC CPU95 floating-point programs by up to 21% over nontiled versions (4.9% on average) and the SPEC CPU2000 Fortran 77 programs by up to 49% (11% on average). Notably, in about half of the benchmarks, tiling does not have a significant effect.
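For readers unfamiliar with the term, the following is a minimal sketch of plain sequential iteration space tiling applied to a matrix multiply. It is not the algorithm from the paper and does not address the interaction with parallelization that the abstract discusses; the function names and the `tile` parameter are illustrative assumptions.

```python
def matmul_naive(a, b, n):
    """Reference triple loop: c = a * b for n x n matrices (lists of lists)."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            for k in range(n):
                c[i][j] += a[i][k] * b[k][j]
    return c


def matmul_tiled(a, b, n, tile=4):
    """Same computation with the iteration space split into tile-sized blocks.

    The three outer loops walk over blocks; the three inner loops stay
    inside one block, so the data touched by a block can remain in cache.
    `tile` is a hypothetical tuning parameter, not a value from the paper.
    """
    c = [[0.0] * n for _ in range(n)]
    for ii in range(0, n, tile):
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):
                # min() handles the partial blocks when tile does not divide n.
                for i in range(ii, min(ii + tile, n)):
                    for j in range(jj, min(jj + tile, n)):
                        for k in range(kk, min(kk + tile, n)):
                            c[i][j] += a[i][k] * b[k][j]
    return c
```

Python will not show the cache benefit; the sketch only illustrates how the loop nest is restructured while computing the same result.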
Computing in Science and Engineering | 2008
Mohamed Sayeed; Hansang Bae; Yili Zheng; Brian Armstrong; Rudolf Eigenmann; Faisal Saied
A good benchmarking methodology can save a tremendous amount of resources in terms of human effort, machine cycles, and cost. Such a methodology must consider the relevance and openness of the chosen codes, well-defined rules for executing and reporting the benchmarks, a review process to enforce the rules, and a public repository for the obtained information. For the methodology to be feasible, it must also be supported by adequate tools that enable the user to consistently execute the benchmarks and gather the requisite metrics. At the very least, reliable benchmarking results can help people make decisions about HPC acquisitions and assist scientists and engineers in system advances. By saving resources and enabling balanced designs and configurations, realistic benchmarking ultimately leads to increased competitiveness in both industry and academia.
Languages and Compilers for Parallel Computing | 2005
Hansang Bae; Rudolf Eigenmann
We have designed and implemented an interprocedural algorithm to analyze symbolic value ranges that can be assumed by variables at any given point in a program. Our algorithm contrasts with related work on interprocedural value range analysis in that it extends the ability to handle symbolic range expressions. It builds on our previous work of intraprocedural symbolic range analysis. We have evaluated our algorithm using 11 Perfect Benchmarks and 10 SPEC floating-point benchmarks of the CPU 95 and CPU 2000 suites. We have measured the ability to perform test elision, dead code elimination, and detect data dependences. We have also evaluated the algorithm’s ability to help detect zero-trip loops for induction variable substitution and subscript ranges for array reductions.
Languages and Compilers for Parallel Computing | 2011
Hao Lin; Hansang Bae; Samuel P. Midkiff; Rudolf Eigenmann; Soohong P. Kim
In the early 1980s, shared memory mini-super-computers had buses and memory whose speeds were relatively fast compared to processor speeds. This led to the widespread use of various producer/consumer (post/wait) synchronization schemes for enforcing data dependences within parallel doacross loops. The rise of the “killer micro”, instruction sets optimized for serial programs, and rapidly increasing processor clock rates driven by Moore’s law led to special purpose synchronization instructions being replaced by software barriers combined with loop fission (to allow the barriers to enforce dependences). One cost of this approach is poorer cache behavior because variables on which a dependence exists are now accessed in separate loops. With the advent of the multicore era, producer/consumer synchronization again appears plausible. In this paper we compare the performance of hardware and software synchronization schemes to barrier synchronization, and show that either hardware or software based producer/consumer synchronization can provide applications with superior performance.
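As a rough illustration of the post/wait pattern the abstract refers to, the sketch below runs a doacross loop with a loop-carried dependence (`a[i]` depends on `a[i-1]`) across several threads. It is a software-only model built on Python's `threading.Event`, not the schemes evaluated in the paper; the function name, the cyclic iteration distribution, and `num_threads` are assumptions made for the example.

```python
import threading


def doacross_prefix_sum(b, num_threads=4):
    """Compute a[i] = a[i-1] + b[i] with iterations spread over threads.

    Each iteration i waits until iteration i-1 has posted, then computes
    its value and posts its own completion.
    """
    n = len(b)
    a = [0] * n
    # One event per iteration: set() is the "post", wait() is the "wait".
    posted = [threading.Event() for _ in range(n)]

    def worker(tid):
        # Cyclic distribution: thread tid runs iterations tid, tid + T, ...
        for i in range(tid, n, num_threads):
            if i > 0:
                posted[i - 1].wait()      # wait for the producer iteration
                a[i] = a[i - 1] + b[i]
            else:
                a[i] = b[i]
            posted[i].set()               # post: a[i] is now available

    threads = [threading.Thread(target=worker, args=(t,))
               for t in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return a
```

Because each thread executes its iterations in increasing order and every wait targets a lower-numbered iteration, the waits cannot form a cycle. The sketch shows only the synchronization structure; it makes no performance claim.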
arXiv: Computational Physics | 2009
Shaikh Ahmed; Neerav Kharche; Rajib Rahman; Muhammad Usman; Sunhee Lee; Hoon Ryu; Hansang Bae; Steve Clark; Benjamin P Haley; Maxim Naumov; Faisal Saied; Marek Korkusinski; Rick Kennel; Michael McLennan; Timothy B. Boykin; Gerhard Klimeck
