Karthik Gururaj
University of California, Los Angeles
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Karthik Gururaj.
symposium on application specific processors | 2009
Alexandros Papakonstantinou; Karthik Gururaj; John A. Stratton; Deming Chen; Jason Cong; Wen-mei W. Hwu
As growing power dissipation and thermal effects disrupted the rising clock frequency trend and threatened to annul Moores law, the computing industry has switched its route to higher performance through parallel processing. The rise of multi-core systems in all domains of computing has opened the door to heterogeneous multi-processors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs and FPGAs are becoming very popular in PC-based heterogeneous systems for speeding up compute intensive kernels of scientific, imaging and simulation applications. GPUs can execute hundreds of concurrent threads, while FPGAs provide customized concurrency for highly parallel kernels. However, exploiting the parallelism available in these applications is currently not a push-button task. Often the programmer has to expose the applications fine and coarse grained parallelism by using special APIs. CUDA is such a parallel-computing API that is driven by the GPU industry and is gaining significant popularity. In this work, we adapt the CUDA programming model into a new FPGA design flow called FCUDA, which efficiently maps the coarse and fine grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SPMD CUDA thread blocks into parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multi-core accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
design, automation, and test in europe | 2009
Jason Cong; Karthik Gururaj
In this paper, we propose a novel, energy aware scheduling algorithm for applications running on DVS-enabled multiprocessor systems, which exploits variation in execution times of individual tasks. In particular, our algorithm takes into account latency and resource constraints, precedence constraints among tasks and input-dependent variation in execution times of tasks to produce a scheduling solution and voltage assignment such that the average energy consumption is minimized. Our algorithm is based on a mathematical programming formulation of the scheduling and voltage assignment problem and runs in polynomial time. Experiments with randomly generated task graphs show that up to 30% savings in energy can be obtained by using our algorithm over existing techniques. We perform experiments on two real-world applications - MPEG-4 decoder and MJPEG encoder. Simulations show that the scheduling solution generated by our algorithm can provide up to 25% reduction in energy consumption over greedy dynamic slack reclamation algorithms.
international conference on computer aided design | 2011
Jason Cong; Karthik Gururaj
Traditionally, research in fault tolerance has required architectural state to be numerically perfect for program execution to be correct. However, in many programs, even if execution is not 100% numerically correct, the program can still appear to execute correctly from the users perspective. To quantify user satisfaction, application-level fidelity metrics (such as PSNR) can be used. The output for such applications is defined to be correct if the fidelity metrics satisfy a certain threshold. However, such applications still contain instructions whose outputs are critical — i.e. their correctness decides if the overall quality of the program output is acceptable. In this paper, we present an analysis technique for identifying such critical program segments. More importantly, our technique is capable of guaranteeing application-level correctness through a combination of static analysis and runtime monitoring. Our static analysis consists of data flow analysis followed by control flow analysis to find static critical instructions which affect several instructions. Critical instructions are further refined into likely non-critical and likely critical sets in a profiling phase. At runtime, we use a monitoring scheme to monitor likely non-critical instructions and take remedial actions if some likely non-critical instructions become critical. Based on this analysis, we minimize the number of instructions that are duplicated and checked at runtime using a software-based fault detection and recovery technique [20]. Put together, our approach can lead to 22% average energy savings for multimedia applications while guaranteeing application-level correctness, when compared to a recent work [9], which cannot guarantee application-level correctness. Comparing to the approach proposed in [20] which guarantees both application-level and numerical correctness, our method achieves 79% energy reduction.
international symposium on low power electronics and design | 2011
Jason Cong; Karthik Gururaj; Hui Huang; Chunyue Liu; Glenn Reinman; Yi Zou
By reconfiguring part of the cache as software-managed scratchpad memory (SPM), hybrid caches manage to handle both unknown and predictable memory access patterns. However, existing hybrid caches provide a flexible partitioning of cache and SPM without considering adaptation to the run-time cache behavior. Previous cache set balancing techniques are either energy-inefficient or require serial tag and data array access. In this paper an adaptive hybrid cache is proposed to dynamically remap SPM blocks from high-demand cache sets to low-demand cache sets. This achieves 19%, 25%, 18% and 18% energy-runtime-production reductions over four previous representative techniques on a wide range of benchmarks.
field-programmable custom computing machines | 2011
Alexandros Papakonstantinou; Yun Liang; John A. Stratton; Karthik Gururaj; Deming Chen; Wen-mei W. Hwu; Jason Cong
Recent progress in High-Level Synthesis (HLS) techniques has helped raise the abstraction level of FPGA programming. However implementation and performance evaluation of the HLS-generated RTL, involves lengthy logic synthesis and physical design flows. Moreover, mapping of different levels of coarse grained parallelism onto hardware spatial parallelism affects the final FPGA-based performance both in terms of cycles and frequency. Evaluation of the rich design space through the full implementation flow - starting with high level source code and ending with routed net list - is prohibitive in various scientific and computing domains, thus hindering the adoption of reconfigurable computing. This work presents a framework for multilevel granularity parallelism exploration with HLS-order of efficiency. Our framework considers different granularities of parallelism for mapping CUDA kernels onto high performance FPGA-based accelerators. We leverage resource and clock period models to estimate the impact of multi-granularity parallelism extraction on execution cycles and frequency. The proposed Multilevel Granularity Parallelism Synthesis (ML-GPS) framework employs an efficient design space search heuristic in tandem with the estimation models as well as design layout information to derive a performance near-optimal configuration. Our experimental results demonstrate that ML-GPS can efficiently identify and generate CUDA kernel configurations that can significantly outperform previous related tools whereas it can offer competitive performance compared to software kernel execution on GPUs at a fraction of the energy cost.
international conference on computer aided design | 2008
Jason Cong; Karthik Gururaj; Guoling Han; Adam Kaplan; Mishali Naik; Glenn Reinman
The ability to integrate diverse components such as processor cores, memories, custom hardware blocks and complex network-on-chip (NoC) communication frameworks onto a single chip has greatly increased the design space available for system-on-chip (SoC) designers. Efficient and accurate performance estimation tools are needed to assist the designer in making design decisions. In this paper, we present MC-Sim, a heterogeneous multi-core simulator framework which is capable of accurately simulating a variety of processor, memory, NoC configurations and application specific coprocessors. We also describe a methodology to automatically generate fast, cycle-true behavioral, C-based simulators for coprocessors using a high-level synthesis tool and integrate them with MC-Sim, thus augmenting it with the capacity to simulate coprocessors. Our C-based simulators provide on an average 45times improvement in simulation speed over that of RTL descriptions. We have used this framework to simulate a number of real-life applications such as the MPEG4 decoder and litho-simulation, and experimented with a number of design choices. Our simulator framework is able to accurately model the performance of these applications (only 7% off the actual implementation) and allows us to explore the design space rapidly and achieve interesting design implementations.
ACM Transactions in Embedded Computing Systems | 2013
Alexandros Papakonstantinou; Karthik Gururaj; John A. Stratton; Deming Chen; Jason Cong; Wen-mei W. Hwu
The rise of multicore architectures across all computing domains has opened the door to heterogeneous multiprocessors, where processors of different compute characteristics can be combined to effectively boost the performance per watt of different application kernels. GPUs, in particular, are becoming very popular for speeding up compute-intensive kernels of scientific, imaging, and simulation applications. New programming models that facilitate parallel processing on heterogeneous systems containing GPUs are spreading rapidly in the computing community. By leveraging these investments, the developers of other accelerators have an opportunity to significantly reduce the programming effort by supporting those accelerator models already gaining popularity. In this work, we adapt one such language, the CUDA programming model, into a new FPGA design flow called FCUDA, which efficiently maps the coarse- and fine-grained parallelism exposed in CUDA onto the reconfigurable fabric. Our CUDA-to-FPGA flow employs AutoPilot, an advanced high-level synthesis tool (available from Xilinx) which enables high-abstraction FPGA programming. FCUDA is based on a source-to-source compilation that transforms the SIMT (Single Instruction, Multiple Thread) CUDA code into task-level parallel C code for AutoPilot. We describe the details of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the resulting customized FPGA multicore accelerators. To the best of our knowledge, this is the first CUDA-to-FPGA flow to demonstrate the applicability and potential advantage of using the CUDA programming model for high-performance computing in FPGAs.
field-programmable custom computing machines | 2009
Jason Cong; Karthik Gururaj; Bin Liu; Chunyue Liu; Zhiru Zhang; Sheng Zhou; Yi Zou
Precision analysis and optimization is very important when transforming a floating-point algorithm into fixed-point hardware implementations. The core analysis techniques are either based on dynamic analysis or static analysis. We believe in static error analysis, as it is the only technique that can guarantee the desired worst-case accuracy. In this paper we study various underlying arithmetic candidates that can be used in static error analysis and compare their computed sensitivities. The approaches studied include Affine Arithmetic(AA), General Interval Arithmetic (GIA) and Automatic Differentiation (Symbolic Arithmetic). Our study shows that symbolic method is preferred for expressions with higher order cancelation. For programs without strong cancelation, any method works fairly well and GIA slightly outperforms others. We also study the impact of program transformations on these arithmetics.
field programmable gate arrays | 2009
Jason Cong; Karthik Gururaj; Guoling Han
Reconfigurable high-performance computing systems (RHPC) have been attracting more and more attention over the past few years. RHPC systems are a promising solution for accelerating system performance, lowering power consumption and minimizing operation cost. In order to achieve high performance on this hybrid system, it is important to effectively explore the design space, which includes accelerator synthesis, resource allocation and job scheduling. In this paper we propose novel algorithms for reconfigurable resource allocation and job scheduling to optimize performance of multicore RHPC systems. Specifically, we first propose an interesting approximation algorithm to assign jobs to processors with consideration of coprocessors at the global optimization step. Then we present an optimal solution for coprocessor selection in the local optimization step. In this paper we also demonstrate that designers can quickly explore a large number of accelerator design choices with the help of high-level synthesis tools. Experiments show that our proposed techniques provide efficient solutions for real-life benchmarks and generate higher quality of results. When compared to other heuristic algorithms, our results can achieve up to 47% performance improvement.
international conference on supercomputing | 2009
Alexandros Papakonstantinou; Karthik Gururaj; John A. Stratton; Deming Chen; Jason Cong; Wen-mei W. Hwu
In this work, we propose a new FPGA design flow that combines the CUDA programming model from Nvidia with the state of the art high-level synthesis tool AutoPilot from AutoESL, to efficiently map the exposed parallelism in CUDA kernels onto reconfigurable devices. The use of the CUDA programming model offers the advantage of a common programming interface for exploiting parallelism on two very different types of accelerators -- FPGAs and GPUs. Moreover, by leveraging the advanced synthesis capabilities of AutoPilot we enable efficient exploitation of the FPGA configurability for application specific acceleration. Our flow is based on a compilation process that transforms the SPMD CUDA thread blocks into high-concurrency AutoPilot-C code. We provide an overview of our CUDA-to-FPGA flow and demonstrate the highly competitive performance of the generated multi-core accelerators.