Kingshuk Karuri
RWTH Aachen University
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Kingshuk Karuri.
design, automation, and test in europe | 2006
Torsten Kempf; Kingshuk Karuri; Stefan Wallentowitz; Gerd Ascheid; Rainer Leupers; Heinrich Meyr
The increasing demands of high-performance in embedded applications under shortening time-to-market has prompted system architects in recent time to opt for multi-processor systems-on-chip (MP-SoCs) employing several programmable devices. The programmable cores provide a high amount of flexibility and reusability, and can be optimized to the requirements of the application to deliver high-performance as well. Since application software forms the basis of such designs, the need to tune the underlying SoC architecture for extracting maximum performance from the software code has become imperative. In this paper, we propose a framework that enables software development, verification and evaluation from the very beginning of MP-SoC design cycle. Unlike traditional SoC design flows where software design starts only after the initial SoC architecture is ready, our framework allows a co-development of the hardware and the software components in a tightly coupled loop where the hardware can be refined by considering the requirements of the software in a stepwise manner. The key element of this framework is the integration of a fine-grained software instrumentation tool into a system-level-design (SLD) environment to obtain accurate software performance and memory access statistics. The accuracy of such statistics is comparable to that obtained through instruction set simulation (ISS), while the execution speed of the instrumented software is almost an order of magnitude faster than ISS. Such a combined design approach assists system architects to optimize both the hardware and the software through fast exploration cycles, and can result in far shorter design cycles and high productivity. We demonstrate the generality and the efficiency of our methodology with two case studies selected from two most prominent and computationally intensive embedded application domains
design automation conference | 2005
Kingshuk Karuri; M.A. Al Faruque; Stefan Kraemer; Rainer Leupers; Gerd Ascheid; Heinrich Meyr
Current application specific instruction set processor (ASIP) design methodologies are mostly based on iterative architecture exploration that uses Architecture Description Languages (ADLs) and retargetable software development tools. However, for improved design efficiency, additional pre-architecture exploration tools are required to help narrow-down the huge design space and making coarse-grained instruction set architecture (ISA) decisions before detailed ADL modeling. Extensive application code profiling is the key in such early design stages. Based on a novel code instrumentation technology, we present a micro-profiling approach that fills the current gap between source-level and instruction-level profilers and combines their advantages w.r.t. speed and accuracy. We show how the micro-profiler is embedded into an advanced ASIP design flow and justify its use in a case study to design an MP3 decoder ASIP.
design, automation, and test in europe | 2004
Manuel Hohenauer; Hanno Scharwaechter; Kingshuk Karuri; Oliver Wahlen; Tim Kogel; Rainer Leupers; Gerd Ascheid; Heinrich Meyr; Gunnar Braun; Hans van Someren
Retargetable C compilers are key tools for efficient architecture exploration for embedded processors. In this paper we describe a novel approach to retargetable compilation based on LISA, an industrial processor modeling language for efficient ASIP design. In order to circumvent the well-known trade-off between flexibility and code quality in retargetable compilation, we propose a user-guided, semiautomatic methodology that in turn builds on a powerful existing C compiler design platform. Our approach allows to include generated C compilers into the ASIP architecture exploration loop at an early stage, thereby allowing for a more efficient design process and avoiding application/architecture mismatches. We present the corresponding methodology and tool suite and provide experimental data for two real-life embedded processors that prove the feasibility of the approach.
design automation conference | 2008
Lei Gao; Kingshuk Karuri; Stefan Kraemer; Rainer Leupers; Gerd Ascheid; Heinrich Meyr
With the growing number of programmable processing elements in todays Multiprocessor System-on-Chip (MPSoC) designs, the synergy required for the development of the hardware architecture and the software running on them is also increasing. In MPSoC development environment, changes in the hardware architecture can bring in extensive re-partitioning or re-parallelization of the software architecture. Fast and accurate functional simulation and performance estimation techniques are needed to cope with this co-design problem at the early phases of MPSoC design space exploration. The current paper addresses this issue by introducing a framework which combines hybrid simulation, cache simulation and online trace-driven replay techniques to accurately predict performance of programmable elements in an MPSoC environment. The resulting simulation technique can easily cope with the continuous re-organizations of software architectures during an Instruction Set Simulator (ISS) based design process. Experimental results show that this framework can improve system simulation speed by 3-5X on average while achieving accuracy closely comparable to traditional ISSes.
IEEE Transactions on Very Large Scale Integration Systems | 2008
Kingshuk Karuri; Anupam Chattopadhyay; Xiaolin Chen; David Kammler; Ling Hao; Rainer Leupers; Heinrich Meyr; Gerd Ascheid
During the last years, the growing application complexity, design, and mask costs have compelled embedded system designers to increasingly consider partially reconfigurable application-specific instruction set processors (rASIPs) which combine a programmable base processor with a reconfigurable fabric. Although such processors promise to deliver excellent balance between performance and flexibility, their design remains a challenging task. The key to the successful design of a rASIP is combined architecture exploration of all the three major components: the programmable core, the reconfigurable fabric, and the interfaces between these two. This work presents a design flow that supports fast architecture exploration for rASIPs. The design flow is centered around a unified description of an entire rASIP in an architecture description language (ADL). This ADL description facilitates consistent modeling and exploration of all three components of a rASIP through automatic generation of the software tools (compiler tool chain and instruction set simulator) and the RTL hardware model. The generated software tools and the RTL model can be used either for final implementation of the rASIP or can serve as a preoptimized starting point for implementation that can be hand optimized afterward. The design flow is further enhanced by a number of automatic application analysis tools, including a fine-grained application profiler, an instruction set extension (ISE) generator, and a data path mapper for coarse grained reconfigurable architectures (CGRAs). We present some case studies on embedded benchmarks to show how the design space exploration process helps to efficiently design an application domain specific rASIP.
international conference on computer aided design | 2007
Kingshuk Karuri; Anupam Chattopadhyay; Manuel Hohenauer; Rainer Leupers; Gerd Ascheid; Heinrich Meyr
The conflicting requirements of performance and flexibility in today s embedded system market are forcing system designers to use more and more of the so called configurable or customizable processor cores. Such processors tend to meet the demanding performance constraints by accommodating application specific instruction set extensions (ISEs) which have, naturally, become a vital component of current processor customization flows. One major bottleneck in maximizing ISE performance is the limitation on the data-bandwidth between the general purpose register (GPR) file and the ISEs. For improved performance, it is desirable to have a large data-bandwidth from the GPRs to ISEs. However, the tight area constraints of modern embedded processors often restrict the GPR I/O of ISEs to save port area of the register files. This paper presents a novel approach to increase the GPR I/O of ISEs without significantly increasing the size of the GPR files. This is achieved by applying the concept of register clustering, common in many VLIW architectures, to single-issue processors with high performance ISEs. Such clustering often causes extra register moves in compiled code. This work also presents an algorithm to minimize such register moves. The benchmark results presented in this paper show that our solution can significantly reduce the area overhead of many-port GPR files without sacrificing the performance improvements through ISEs.
design, automation, and test in europe | 2006
Kingshuk Karuri; Rainer Leupers; Gerd Ascheid; Heinrich Meyr; Monu Kedia
Multimedia and communication algorithms from embedded system domain often make extensive use of floating-point arithmetic. Due to the complexity and expense of the floating-point hardware, final implementations of these algorithms are usually carried out using floating-point emulation in software, or conversion (manually or automatically) of the floating-point operations to fixed point operations. Such strategies often lead to semi-optimal and imprecise software implementation. This paper presents the design and implementation of a floating-point unit (FPU) for an application specific instruction set processor (ASIP) suitable for embedded systems domain. Using a state-of-the-art architecture description language (ADL) based ASIP design framework, the FPU is implemented in such a modular way that it can be easily adapted to any other RISC like processor. The implemented operations are fully compliant to the IEEE 754 standard which facilitates portable software development. The benchmarking, in terms of energy, area and speed, of the designed FPU highlights the trade-offs of having a hardware FPU w.r.t. software emulation of floating-point operations
ACM Transactions in Embedded Computing Systems | 2007
Hanno Scharwaechter; David Kammler; Andreas Wieferink; Manuel Hohenauer; Kingshuk Karuri; Jianjiang Ceng; Rainer Leupers; Gerd Ascheid; Heinrich Meyr
Application-Specific Instruction-Set Processors (ASIPs) are becoming increasingly popular in the world of customized, application-driven System-on-Chip (SoC) designs. Efficient ASIP design requires an iterative architecture exploration loop---gradual refinement of the processor architecture starting from an initial template. To accomplish this task, design automation tools are used to detect bottlenecks in embedded applications, to implement application-specific processor instructions, and to automatically generate the required software tools (such as instruction-set simulator, C-compiler, assembler, and profiler), as well as to synthesize the hardware. This paper describes an architecture exploration loop for an ASIP coprocessor that implements common encryption functionality used in symmetric block cipher algorithms for internet protocol security (IPSec). The coprocessor is accessed via shared memory and, as a consequence, our approach is easily adaptable to arbitrary main processor architectures. This paper presents the extended version of our case study that has been already published on the SCOPES conference in 2004. In both papers, a MIPS architecture is used as the main processor and Blowfish as encryption algorithm.
international conference / workshop on embedded computer systems: architectures, modeling and simulation | 2009
Kingshuk Karuri; Rainer Leupers; Gerd Ascheid; Heinrich Meyr
Instruction-Set Extensions (ISEs) have gained prominence in the past few years as a useful method for tailoring the ISAs (Instruction-Set Architectures) of ASIPs (Application Specific Instruction-Set Processors) to the computational requirements of various embedded applications. This work presents a generic and easily adaptable flow for application oriented ISE design that supports both of the prevalent ASIP design paradigms - complete ISA design from scratch through an extensive design-space exploration, or limited ISA adaptation for a pre-designed and pre-verified base-processor core. The broad applicability of this design flow is demonstrated using ISA customization case studies for both of these two design philosophies.
symposium/workshop on electronic design, test and applications | 2006
Kingshuk Karuri; Christian Huben; Rainer Leupers; Gerd Ascheid; Heinrich Meyr
The memory subsystem is the major performance bottleneck in terms of speed and power consumption in todays digital systems. This is especially true for application specific embedded systems where power consumption due to memory traffic, memory latency and size of the on-chip caches have a significant role in overall system performance, energy efficiency and cost. There is an urgent need of tools that help designers take informed decisions about memory subsystems for embedded applications. This paper presents a novel, fine-grained memory profiling technique that provides the designer with valuable information such as the total amount of dynamic memory requirement of an application, the most heavily accessed source level data objects, the most memory intensive portions of an application etc. Such information can aid designers to take decisions about the overall memory subsystem comprising of a number of different cache levels, scratch-pad memories and main memory. It can also be used by a compiler to perform advanced compiler controlled memory assignment techniques, and by the application programmer to optimize an application. Case studies at the end of this paper demonstrate the accuracy of our profiling technique and provide some example usage scenarios of it.