HyoukJoong Lee
Stanford University
Publications
Featured research published by HyoukJoong Lee.
ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming (PPoPP) | 2011
Hassan Chafi; Arvind K. Sujeeth; Kevin J. Brown; HyoukJoong Lee; Anand R. Atreya; Kunle Olukotun
Exploiting heterogeneous parallel hardware currently requires mapping application code to multiple disparate programming models. Unfortunately, general-purpose programming models available today can yield high performance but are too low-level to be accessible to the average programmer. We propose leveraging domain-specific languages (DSLs) to map high-level application code to heterogeneous devices. To demonstrate the potential of this approach we present OptiML, a DSL for machine learning. OptiML programs are implicitly parallel and can achieve high performance on heterogeneous hardware with no modification required to the source code. For such a DSL-based approach to be tractable at large scales, better tools are required for DSL authors to simplify language creation and parallelization. To address this concern, we introduce Delite, a system designed specifically for DSLs that is both a framework for creating an implicitly parallel DSL as well as a dynamic runtime providing automated targeting to heterogeneous parallel hardware. We show that OptiML running on Delite achieves single-threaded, parallel, and GPU performance superior to explicitly parallelized MATLAB code in nearly all cases.
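To give a concrete feel for the programming model, below is a minimal sketch in plain Scala of the style of implicitly parallel code OptiML enables; it is not actual OptiML syntax, and all names are illustrative. In the real DSL, bulk operations like the maps below carry parallel semantics and are compiled by Delite to CPU threads or GPU kernels without source changes.

    // Plain-Scala sketch of the OptiML style: bulk operations such as map
    // carry parallel semantics, so in the real DSL the same source can run
    // single-threaded, multi-threaded, or on a GPU without modification.
    object OptiMLFlavor {
      // One k-means step: assign each point to its nearest centroid.
      def sqDist(a: Vector[Double], b: Vector[Double]): Double =
        a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

      def assignClusters(points: Vector[Vector[Double]],
                         centroids: Vector[Vector[Double]]): Vector[Int] =
        points.map { p => centroids.indices.minBy(c => sqDist(p, centroids(c))) }

      def main(args: Array[String]): Unit = {
        val pts  = Vector(Vector(0.0, 0.0), Vector(5.0, 5.0), Vector(0.2, 0.1))
        val ctrs = Vector(Vector(0.0, 0.0), Vector(5.0, 5.0))
        println(assignClusters(pts, ctrs)) // Vector(0, 1, 0)
      }
    }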
ACM Transactions on Embedded Computing Systems | 2014
Arvind K. Sujeeth; Kevin J. Brown; HyoukJoong Lee; Tiark Rompf; Hassan Chafi; Kunle Olukotun
Developing high-performance software is a difficult task that requires the use of low-level, architecture-specific programming models (e.g., OpenMP for CMPs, CUDA for GPUs, MPI for clusters). It is typically not possible to write a single application that can run efficiently in different environments, leading to multiple versions and increased complexity. Domain-Specific Languages (DSLs) are a promising avenue to enable programmers to use high-level abstractions and still achieve good performance on a variety of hardware. This is possible because DSLs have higher-level semantics and restrictions than general-purpose languages, so DSL compilers can perform higher-level optimization and translation. However, the cost of developing performance-oriented DSLs is a substantial roadblock to their development and adoption. In this article, we present an overview of the Delite compiler framework and the DSLs that have been developed with it. Delite simplifies the process of DSL development by providing common components, like parallel patterns, optimizations, and code generators, that can be reused in DSL implementations. Delite DSLs are embedded in Scala, a general-purpose programming language, but use metaprogramming to construct an Intermediate Representation (IR) of user programs and compile to multiple languages (including C++, CUDA, and OpenCL). DSL programs are automatically parallelized and different parts of the application can run simultaneously on CPUs and GPUs. We present Delite DSLs for machine learning, data querying, graph analysis, and scientific computing and show that they all achieve performance competitive with or exceeding C++ code.
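The embedding technique described above can be illustrated with a self-contained toy, sketched here under the assumption that the reader only wants the shape of the idea: DSL operations construct IR nodes instead of executing, and separate generators walk the IR to emit target code. Delite's actual IR, optimizations, and code generators are far richer; every name below is invented.

    // Toy illustration of the embedding technique: ordinary-looking Scala
    // expressions build an IR graph, and per-target generators traverse it.
    object StagedEmbedding {
      sealed trait Exp {
        def +(that: Exp): Exp = Plus(this, that)
        def *(that: Exp): Exp = Times(this, that)
      }
      case class Const(v: Double) extends Exp
      case class Sym(name: String) extends Exp
      case class Plus(a: Exp, b: Exp) extends Exp
      case class Times(a: Exp, b: Exp) extends Exp

      // One "code generator"; a real framework ships several (C++/CUDA/...).
      def emitC(e: Exp): String = e match {
        case Const(v)    => v.toString
        case Sym(n)      => n
        case Plus(a, b)  => s"(${emitC(a)} + ${emitC(b)})"
        case Times(a, b) => s"(${emitC(a)} * ${emitC(b)})"
      }

      def main(args: Array[String]): Unit = {
        // Writing ordinary-looking code builds IR nodes...
        val expr = (Sym("x") + Const(1.0)) * Sym("y")
        // ...which can then be optimized and translated to multiple targets.
        println(emitC(expr)) // ((x + 1.0) * y)
      }
    }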
ACM SIGPLAN-SIGACT Symposium on Principles of Programming Languages (POPL) | 2013
Tiark Rompf; Arvind K. Sujeeth; Nada Amin; Kevin J. Brown; Vojin Jovanovic; HyoukJoong Lee; Manohar Jonnalagedda; Kunle Olukotun
High-level data structures are a cornerstone of modern programming and at the same time stand in the way of compiler optimizations. In order to reason about user- or library-defined data structures, compilers need to be extensible. Common mechanisms to extend compilers fall into two categories. Frontend macros, staging, or partial evaluation systems can be used to programmatically remove abstraction and specialize programs before they enter the compiler. Alternatively, some compilers allow extending the internal workings by adding new transformation passes at different points in the compile chain or adding new intermediate representation (IR) types. None of these mechanisms alone is sufficient to handle the challenges posed by high-level data structures. This paper shows a novel way to combine them to yield benefits that are greater than the sum of the parts. Instead of using staging merely as a front end, we implement internal compiler passes using staging as well. These internal passes delegate back to program execution to construct the transformed IR. Staging is known to simplify program generation, and in the same way it can simplify program transformation. Defining a transformation as a staged IR interpreter is simpler than implementing a low-level IR-to-IR transformer. With custom IR nodes, many optimizations that are expressed as rewritings from IR nodes to staged program fragments can be combined into a single pass, mitigating phase-ordering problems. Speculative rewriting can preserve optimistic assumptions around loops. We demonstrate several powerful program optimizations using this architecture that are particularly geared towards data structures: a novel loop fusion and deforestation algorithm, array-of-struct to struct-of-array conversion, object flattening, and code generation for heterogeneous parallel devices. We validate our approach using several non-trivial case studies that exhibit order-of-magnitude speedups in experiments.
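The core idea, a transformation pass written as a staged interpreter over the IR, can be sketched in a few lines of plain Scala. This toy folds constants and eliminates identities through "smart constructors" so that several rewrite rules combine into a single pass; it is only a schematic of the paper's approach, and all names are invented.

    // Sketch: the transformer is just an interpreter over the IR that
    // delegates back to smart constructors to rebuild the transformed IR,
    // and rewriting happens at (re)construction time.
    object StagedTransform {
      sealed trait Exp
      case class Const(v: Int) extends Exp
      case class Sym(n: String) extends Exp
      case class Plus(a: Exp, b: Exp) extends Exp

      // Smart constructor: multiple rewrite rules fire in one pass.
      def plus(a: Exp, b: Exp): Exp = (a, b) match {
        case (Const(x), Const(y)) => Const(x + y) // constant folding
        case (Const(0), e)        => e            // identity elimination
        case (e, Const(0))        => e
        case _                    => Plus(a, b)
      }

      // Interpreting the old IR constructs the new, rewritten IR.
      def transform(e: Exp): Exp = e match {
        case Plus(a, b) => plus(transform(a), transform(b))
        case other      => other
      }

      def main(args: Array[String]): Unit = {
        val e = Plus(Plus(Const(1), Const(2)), Plus(Sym("x"), Const(0)))
        println(transform(e)) // Plus(Const(3),Sym(x))
      }
    }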
International Symposium on Microarchitecture (MICRO) | 2011
HyoukJoong Lee; Kevin J. Brown; Arvind K. Sujeeth; Hassan Chafi; Kunle Olukotun; Tiark Rompf
Domain-specific languages offer a solution to the performance and the productivity issues in heterogeneous computing systems. The Delite compiler framework simplifies the process of building embedded parallel DSLs. DSL developers can implement domain-specific operations by extending the DSL framework, which provides static optimizations and code generation for heterogeneous hardware. The Delite runtime automatically schedules and executes DSL operations on heterogeneous hardware.
arXiv: Programming Languages | 2011
Tiark Rompf; Arvind K. Sujeeth; HyoukJoong Lee; Kevin J. Brown; Hassan Chafi; Kunle Olukotun
Domain-specific languages raise the level of abstraction in software development. While it is evident that programmers can more easily reason about very high-level programs, the same holds for compilers only if the compiler has an accurate model of the application domain and the underlying target platform. Since mapping high-level, general-purpose languages to modern, heterogeneous hardware is becoming increasingly difficult, DSLs are an attractive way to capitalize on improved hardware performance, precisely by making the compiler reason on a higher level. Implementing efficient DSL compilers is a daunting task, however, and support for building performance-oriented DSLs is urgently needed. To this end, we present the Delite Framework, an extensible toolkit that drastically simplifies building embedded DSLs and compiling DSL programs for execution on heterogeneous hardware. We discuss several building blocks in some detail and present experimental results for the OptiML machine-learning DSL implemented on top of Delite.
European Conference on Object-Oriented Programming (ECOOP) | 2013
Arvind K. Sujeeth; Tiark Rompf; Kevin J. Brown; HyoukJoong Lee; Hassan Chafi; Victoria Popic; Michael Wu; Aleksandar Prokopec; Vojin Jovanovic; Kunle Olukotun
Programmers who need high performance currently rely on low-level, architecture-specific programming models (e.g., OpenMP for CMPs, CUDA for GPUs, MPI for clusters). Performance optimization with these frameworks usually requires expertise in the specific programming model and a deep understanding of the target architecture. Domain-specific languages (DSLs) are a promising alternative, allowing compilers to map problem-specific abstractions directly to low-level architecture-specific programming models. However, developing DSLs is difficult, and using multiple DSLs together in a single application is even harder because existing compiled solutions do not compose together. In this paper, we present four new performance-oriented DSLs developed with Delite, an extensible DSL compilation framework. We demonstrate new techniques to compose compiled DSLs embedded in a common backend together in a single program and show that generic optimizations can be applied across the different DSL sections. Our new DSLs are implemented with a small number of reusable components (fewer than 9 parallel operators in total) and still achieve performance up to 125x better than library implementations and at worst within 30% of optimized stand-alone DSLs. The DSLs retain good performance when composed together, and applying cross-DSL optimizations results in up to an additional 1.82x improvement.
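A minimal sketch of the composition idea, with invented names and no relation to the actual Delite backend: two toy "DSLs" emit stages into one shared pattern IR, where a generic fusion rule applies across the DSL boundary.

    // Two toy DSLs target a common backend IR of element-wise map stages;
    // a generic cross-DSL optimization fuses adjacent stages into one pass.
    object ComposedDSLs {
      case class MapOp(f: Double => Double)
      type Pipeline = List[MapOp]

      // "DSL A" (say, data querying) and "DSL B" (say, ML) share the IR.
      object QueryDSL { def scale(k: Double): MapOp = MapOp(_ * k) }
      object MLDSL    { def sigmoid: MapOp = MapOp(x => 1.0 / (1.0 + math.exp(-x))) }

      // Fusion works regardless of which DSL produced each stage, so
      // composed sections avoid per-stage traversal costs.
      def fuse(p: Pipeline): MapOp =
        p.reduce((m1, m2) => MapOp(m1.f andThen m2.f))

      def main(args: Array[String]): Unit = {
        val fused = fuse(List(QueryDSL.scale(2.0), MLDSL.sigmoid))
        println(Seq(0.0, 1.0).map(fused.f)) // one traversal spanning two DSLs
      }
    }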
International Conference on Field Programmable Logic and Applications (FPL) | 2014
Nithin George; HyoukJoong Lee; David Novo; Tiark Rompf; Kevin J. Brown; Arvind K. Sujeeth; Kunle Olukotun; Paolo Ienne
Field Programmable Gate Arrays (FPGAs) are very versatile devices, but their complicated programming model has stymied their widespread usage. While modern High-Level Synthesis (HLS) tools provide better programming models, the interface they offer is still too low-level. In order to produce good-quality hardware designs with these tools, users are forced to manually perform optimizations that demand detailed knowledge of both the application and the implementation platform. Additionally, many HLS tools only generate isolated hardware modules that the user still needs to integrate into a system design before generating the FPGA bitstream. These problems make HLS tools difficult to use for application developers who have little hardware design knowledge. To address these problems, we propose an automated methodology to generate FPGA bitstreams from high-level programs written in Domain-Specific Languages (DSLs). We leverage the domain knowledge conveyed by the DSL and its domain-specific semantics to extract application parallelism, perform optimizations, and identify a suitable system architecture for the implementation, thereby relieving the user from most of the hardware-level details. We demonstrate the high productivity and high design quality this approach offers by automatically generating hardware systems from applications written in OptiML, a machine-learning DSL. To evaluate our methodology, we use four OptiML applications and show that we can easily generate different solutions that achieve different trade-offs between performance and area. More importantly, the results reveal that our generated hardware achieves much better performance compared to that obtained from using the HLS tool without platform-specific optimizations.
International Symposium on Microarchitecture (MICRO) | 2014
HyoukJoong Lee; Kevin J. Brown; Arvind K. Sujeeth; Tiark Rompf; Kunle Olukotun
Recent work has explored using higher level languages to improve programmer productivity on GPUs. These languages often utilize high level computation patterns (e.g., Map and Reduce) that encode parallel semantics to enable automatic compilation to GPU kernels. However, the problem of efficiently mapping patterns to GPU hardware becomes significantly more difficult when the patterns are nested, which is common in non-trivial applications. To address this issue, we present a general analysis framework for automatically and efficiently mapping nested patterns onto GPUs. The analysis maps nested patterns onto a logical multidimensional domain and parameterizes the block size and degree of parallelism in each dimension. We then add GPU-specific hard and soft constraints to prune the space of possible mappings and select the best mapping. We also perform multiple compiler optimizations that are guided by the mapping to avoid dynamic memory allocations and automatically utilize shared memory within GPU kernels. We compare the performance of our automatically selected mappings to hand-optimized implementations on multiple benchmarks and show that the average performance gap on 7 out of 8 benchmarks is 24%. Furthermore, our mapping strategy outperforms simple 1D mappings and existing 2D mappings by up to 28.6x and 9.6x respectively.
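The kind of nested pattern at issue can be shown with a tiny plain-Scala example (illustrative only): an outer Map over rows containing an inner Reduce over columns. The analysis described above would assign the outer and inner patterns to different dimensions of the logical domain, and hence to different dimensions of the GPU thread grid.

    // Nested pattern: outer Map over rows, inner Reduce over columns
    // (row sums of a matrix). In the paper's mapping analysis, the outer
    // Map and the inner Reduce would occupy different grid dimensions,
    // e.g., blocks for rows and threads (with shared memory) per row.
    object NestedPatterns {
      def rowSums(m: Array[Array[Double]]): Array[Double] =
        m.map { row =>  // outer Map: one logical dimension
          row.sum       // inner Reduce: a second logical dimension
        }

      def main(args: Array[String]): Unit = {
        val m = Array(Array(1.0, 2.0, 3.0), Array(4.0, 5.0, 6.0))
        println(rowSums(m).toList) // List(6.0, 15.0)
      }
    }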
International Symposium on Code Generation and Optimization (CGO) | 2016
Kevin J. Brown; HyoukJoong Lee; Tiark Rompf; Arvind K. Sujeeth; Christopher De Sa; Christopher R. Aberger; Kunle Olukotun
High performance in modern computing platforms requires programs to be parallel, distributed, and run on heterogeneous hardware. However, programming such architectures is extremely difficult due to the need to implement the application using multiple programming models and combine them together in ad-hoc ways. To optimize distributed applications both for modern hardware and for modern programmers, we need a programming model that is sufficiently expressive to support a variety of parallel applications, sufficiently performant to surpass hand-optimized sequential implementations, and sufficiently portable to support a variety of heterogeneous hardware. Unfortunately, existing systems tend to fall short of these requirements. In this paper we introduce the Distributed Multiloop Language (DMLL), a new intermediate language based on common parallel patterns that captures the necessary semantic knowledge to efficiently target distributed heterogeneous architectures. We show straightforward analyses that determine what data to distribute based on its usage as well as powerful transformations of nested patterns that restructure computation to enable distribution and optimize for heterogeneous devices. We present experimental results for a range of applications spanning multiple domains and demonstrate highly efficient execution compared to manually optimized counterparts in multiple distributed programming models.
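One of the nested-pattern restructurings described above can be sketched schematically: a single global reduction is split into per-partition local reductions plus a global combine, which is what makes the pattern distributable. This plain-Scala toy uses invented names and stands in for DMLL's actual transformation rules.

    // A global Reduce restructured into local reductions per partition
    // followed by one combine step, so each node or device can reduce
    // its own shard before the final merge.
    object DistributablePatterns {
      // Logical program: one reduction over all elements.
      def globalSum(xs: Seq[Double]): Double = xs.sum

      // Restructured form: the same reduction over partitions.
      def distributedSum(partitions: Seq[Seq[Double]]): Double =
        partitions.map(_.sum) // local reduction, runs per node
                  .sum        // global combine of partial results

      def main(args: Array[String]): Unit = {
        val data = Seq(1.0, 2.0, 3.0, 4.0)
        val sharded = data.grouped(2).toSeq
        assert(globalSum(data) == distributedSum(sharded))
        println(distributedSum(sharded)) // 10.0
      }
    }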
International Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS) | 2016
Raghu Prabhakar; David Koeplinger; Kevin J. Brown; HyoukJoong Lee; Christopher De Sa; Christos Kozyrakis; Kunle Olukotun
In recent years the computing landscape has seen an increasing shift towards specialized accelerators. Field programmable gate arrays (FPGAs) are particularly promising for the implementation of these accelerators, as they offer significant performance and energy improvements over CPUs for a wide class of applications and are far more flexible than fixed-function ASICs. However, FPGAs are difficult to program. Traditional programming models for reconfigurable logic use low-level hardware description languages like Verilog and VHDL, which have none of the productivity features of modern software languages but produce very efficient designs, and low-level software languages like C and OpenCL coupled with high-level synthesis (HLS) tools that typically produce designs that are far less efficient. Functional languages with parallel patterns are a better fit for hardware generation because they provide high-level abstractions to programmers with little experience in hardware design and avoid many of the problems faced when generating hardware from imperative languages. In this paper, we identify two important optimizations for using parallel patterns to generate efficient hardware: tiling and metapipelining. We present a general representation of tiled parallel patterns, and provide rules for automatically tiling patterns and generating metapipelines. We demonstrate experimentally that these optimizations result in speedups up to 39.4× on a set of benchmarks from the data analytics domain.
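The tiling transformation can be sketched on a simple pattern (a dot product), assuming an illustrative tile size B: the flat reduce over N elements becomes a reduce over N/B tiles, mirroring how fixed-size on-chip buffers would be filled and drained by a metapipeline. This is a schematic in plain Scala, not the paper's representation.

    // Tiling a parallel pattern: the flat form touches all N elements in
    // one pattern; the tiled form nests a size-B inner pattern inside an
    // outer pattern over tiles, matching on-chip buffer capacity. In a
    // metapipeline, loading tile i+1 can overlap computing on tile i.
    object TiledPatterns {
      val B = 4 // tile size, chosen to fit the target's on-chip memory

      def flatDot(a: Array[Double], b: Array[Double]): Double =
        a.indices.map(i => a(i) * b(i)).sum

      def tiledDot(a: Array[Double], b: Array[Double]): Double =
        a.indices.grouped(B).map { tile =>  // outer pattern over tiles
          tile.map(i => a(i) * b(i)).sum    // inner pattern over one tile
        }.sum                               // combine per-tile partials

      def main(args: Array[String]): Unit = {
        val a = Array.tabulate(10)(_.toDouble)
        val b = Array.fill(10)(2.0)
        assert(flatDot(a, b) == tiledDot(a, b))
        println(tiledDot(a, b)) // 90.0
      }
    }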