
Publication


Featured research published by Arun Kejariwal.


Languages and Compilers for Parallel Computing | 2008

Compiler-Driven Dependence Profiling to Guide Program Parallelization

Peng Wu; Arun Kejariwal; Călin Caşcaval

As hardware systems move toward multicore and multithreaded architectures, programmers increasingly rely on automated tools to help with both the parallelization of legacy codes and effective exploitation of all available hardware resources. Thread-level speculation (TLS) has been proposed as a technique to parallelize the execution of serial codes or serial sections of parallel codes. One of the key aspects of TLS is task selection for speculative execution. In this paper we propose a cost model for compiler-driven task selection for TLS. The model employs profile-based analysis of may-dependences to estimate the probability of successful speculation. We discuss two techniques to eliminate potential inter-task dependences, thereby improving the rate of successful speculation. We also present a profiling tool, DProf, that is used to provide run-time information about may-dependences to the compiler and map dynamic dependences to the source code. This information is also made available to the programmer to assist in code rewriting and/or algorithm redesign. We used DProf to quantify the potential of this approach and we present results on selected applications from the SPEC CPU2006 and SEQUOIA benchmarks.
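The cost model itself is not reproduced in this abstract. As a minimal sketch, a profile-based may-dependence model might score speculation candidates along the following lines; every name, formula, and number below is an illustrative assumption, not DProf's actual interface:

```python
# Hypothetical sketch of a profile-based cost model for TLS task
# selection. Names and formulas are illustrative, not the paper's.

def speculation_success_probability(dep_frequencies):
    """Estimate the probability that a speculative task commits.

    dep_frequencies: for each may-dependence in the candidate region,
    the profiled fraction of executions in which it actually
    manifested. Assuming independent dependences, the task succeeds
    only when none of them occurs.
    """
    p = 1.0
    for f in dep_frequencies:
        p *= (1.0 - f)
    return p

def expected_gain(region_time, dep_frequencies, misspec_penalty):
    """Expected cycles saved by speculating: on success the region is
    overlapped (saving region_time); on failure we pay a re-execution
    penalty on top of the original serial time."""
    p = speculation_success_probability(dep_frequencies)
    return p * region_time - (1.0 - p) * misspec_penalty

# A region whose may-dependences almost never manifest is a good
# candidate; a single frequently-manifesting dependence ruins it.
good = expected_gain(1000, [0.01, 0.02, 0.01], misspec_penalty=1200)
bad = expected_gain(1000, [0.9], misspec_penalty=1200)
```

Under this model the compiler would select only regions whose expected gain is positive, which is why eliminating even a few frequent inter-task dependences can flip a region from unprofitable to profitable.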


International Conference on VLSI Design | 2004

Synthesis-driven exploration of pipelined embedded processors

Prabhat Mishra; Arun Kejariwal; Nikil D. Dutt

Recent advances in language-based software toolkit generation enable performance-driven exploration of embedded systems by exploiting the application behavior. There is a need for automatic generation of hardware to determine the required silicon area, clock frequency, and power consumption of the candidate architectures. In this paper, we present a language-based exploration framework that automatically generates synthesizable RTL models for pipelined processors. Our framework allows varied micro-architectural modifications, such as the addition of pipeline stages, pipeline paths, opcodes, and new functional units. The generated RTL is synthesized to determine the area, power, and clock frequency of the modified architectures. Our exploration results demonstrate the power of reuse in composing heterogeneous architectures using functional abstraction primitives, allowing for a reduction in the time for specification and exploration by at least an order of magnitude.


Rapid System Prototyping | 2003

Rapid exploration of pipelined processors through automatic generation of synthesizable RTL models

Prabhat Mishra; Arun Kejariwal; Nikil D. Dutt

As embedded systems continue to face increasingly higher performance requirements, deeply pipelined processor architectures are being employed to meet desired system performance. System architects critically need modeling techniques to rapidly explore and evaluate candidate architectures based on area, power, and performance constraints. We present an exploration framework for pipelined processors. We use the EXPRESSION Architecture Description Language (ADL) to capture a wide spectrum of processor architectures. The ADL has been used to enable performance-driven exploration by generating a software toolkit from the ADL specification. In this paper, we present a functional abstraction technique to automatically generate synthesizable RTL from the ADL specification. Automatic generation of RTL enables rapid exploration of candidate architectures under given design constraints such as area, clock frequency, power, and performance. Our exploration results demonstrate the power of reuse in composing heterogeneous architectures using functional abstraction primitives, allowing for a reduction in the time for specification and exploration by at least an order of magnitude.


International Conference on Supercomputing | 2006

On the performance potential of different types of speculative thread-level parallelism

Arun Kejariwal; Xinmin Tian; Wei Li; Milind Girkar; Sergey Kozhukhov; Hideki Saito; Utpal Banerjee; Alexandru Nicolau; Alexander V. Veidenbaum; Constantine D. Polychronopoulos

Recent research in thread-level speculation (TLS) has proposed several mechanisms for optimistic execution of difficult-to-analyze serial codes in parallel. Though it has been shown that TLS helps to achieve higher levels of parallelism, evaluation of the unique performance potential of TLS, i.e., the performance gain that can be achieved only through speculation, has not received much attention. In this paper, we evaluate this aspect by separating the speedup achievable via true TLP (thread-level parallelism) and TLS, for the SPEC CPU2000 benchmarks. Further, we dissect the performance potential of each type of speculation --- control speculation, data dependence speculation and data value speculation. To the best of our knowledge, this is the first dissection study of its kind. Assuming an oracle TLS mechanism --- which corresponds to perfect speculation and zero threading overhead --- whereby the execution time of a candidate program region (for speculative execution) can be reduced to zero, our study shows that, at the loop level, the upper bound on the arithmetic mean and geometric mean speedup achievable via TLS across SPEC CPU2000 is 39.16% (standard deviation = 31.23) and 18.18%, respectively.
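The arithmetic and geometric means quoted above summarize the same per-benchmark speedups differently. A small worked example, using made-up speedup percentages rather than the paper's data, shows the computation and why the geometric mean is always the smaller figure:

```python
# Illustrative computation of the two summary statistics quoted above.
# The per-benchmark speedup percentages are placeholders, not the
# paper's actual measurements.
import math

speedups = [5.0, 12.0, 40.0, 95.0]  # hypothetical % speedup per benchmark

arithmetic_mean = sum(speedups) / len(speedups)

# The geometric mean is taken over the speedup factors (1 + s/100) and
# converted back to a percentage; it damps the effect of outliers.
factors = [1.0 + s / 100.0 for s in speedups]
geometric_mean = (math.prod(factors) ** (1.0 / len(factors)) - 1.0) * 100.0
```

By the AM-GM inequality the geometric mean can never exceed the arithmetic mean, which is consistent with the paper reporting 18.18% alongside the larger 39.16% figure: a few loops with very large oracle speedups inflate the arithmetic mean (note the large standard deviation) far more than the geometric one.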


ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming | 2007

Tight analysis of the performance potential of thread speculation using SPEC CPU2006

Arun Kejariwal; Xinmin Tian; Milind Girkar; Wei Li; Sergey Kozhukhov; Utpal Banerjee; Alexandru Nicolau; Alexander V. Veidenbaum; Constantine D. Polychronopoulos

Multi-cores, such as the Intel® Core™ 2 Duo processor, facilitate efficient thread-level parallel execution of ordinary programs, wherein the different threads-of-execution are mapped onto different physical processors. In this context, several techniques have been proposed for auto-parallelization of programs. Recently, thread-level speculation (TLS) has been proposed as a means to parallelize difficult-to-analyze serial codes. In general, more than one technique can be employed for parallelizing a given program. The overlapping nature of the applicability of the various techniques makes it hard to assess the intrinsic performance potential of each. In this paper, we present a tight analysis of the (unique) performance potential of both: (a) TLS in general and (b) specific types of thread-level speculation, viz., control speculation, data dependence speculation and data value speculation, for the SPEC CPU2006 benchmark suite in light of the various limiting factors such as the threading overhead and misspeculation penalty. To the best of our knowledge, this is the first evaluation of TLS based on SPEC CPU2006 that accounts for the aforementioned real-life constraints. Our analysis shows that, at the innermost loop level, the upper bound on the speedup uniquely achievable via TLS with state-of-the-art thread implementations for both SPEC CINT2006 and CFP2006 is of the order of 1%.


IEEE Transactions on Very Large Scale Integration (VLSI) Systems | 2006

Energy efficient watermarking on mobile devices using proxy-based partitioning

Arun Kejariwal; Sumit Gupta; Alexandru Nicolau; Nikil D. Dutt; Rajesh K. Gupta

Digital watermarking embeds an imperceptible signature or watermark in a digital file containing audio, image, text, or video data. The watermark can be used to authenticate the data file and for tamper detection. It is particularly valuable in the use and exchange of digital media, such as audio and video, on emerging handheld devices. However, watermarking is computationally expensive and adds to the drain of the available energy in handheld devices. In this paper, we first analyze the energy profile of various watermarking algorithms. We also study the impact of security and image quality on energy consumption. Second, we present an approach in which we partition the watermark embedding and extraction algorithms and migrate some tasks to a proxy server. This leads to lower energy consumption on the handheld without compromising the security of the watermarking process. Experimental results show that executing the watermarking tasks partitioned between the proxy and the handheld device reduces the total energy consumed by 80% and improves performance by two orders of magnitude compared to running the application on the handheld device alone.
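The core trade-off behind proxy-based partitioning can be captured in a back-of-the-envelope energy model. The function and all constants below are hypothetical illustrations; the paper's actual partitioning algorithm and measurements are not reproduced here:

```python
# Hypothetical energy model for the offloading decision: migrate a
# watermarking task to the proxy only when the handheld spends less
# energy on communication and waiting than on computing locally.

def offload_saves_energy(local_compute_j, bytes_transferred,
                         radio_j_per_byte, idle_j_during_wait):
    """Return True when offloading the task costs the handheld less
    energy than executing it locally (all quantities in joules)."""
    offload_cost = bytes_transferred * radio_j_per_byte + idle_j_during_wait
    return offload_cost < local_compute_j

# A compute-heavy transform stage with a small payload is worth
# offloading; a cheap stage with a large payload is not.
heavy = offload_saves_energy(5.0, 100_000, 1e-6, 0.5)
light = offload_saves_energy(0.2, 2_000_000, 1e-6, 0.5)
```

A real partitioner must additionally keep security-sensitive steps (e.g., handling the watermark key) on the handheld, which constrains which tasks are even eligible for migration.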


International Conference on Hardware/Software Codesign and System Synthesis | 2006

Challenges in exploitation of loop parallelism in embedded applications

Arun Kejariwal; Alexander V. Veidenbaum; Alexandru Nicolau; Milind Girkar; Xinmin Tian; Hideki Saito

Embedded processors have been increasingly exploiting hardware parallelism. Vector units, multiple processors or cores, hyper-threading, special-purpose accelerators such as DSPs or cryptographic engines, or a combination of the above have appeared in a number of processors. They serve to address the increasing performance requirements of modern embedded applications. How this hardware parallelism can be exploited by applications is directly related to the amount of parallelism inherent in a target application. In this paper we evaluate the performance potential of different types of parallelism, viz., true thread-level parallelism, speculative thread-level parallelism and vector parallelism, when executing loops. Applications from the industry-standard EEMBC 1.1, EEMBC 2.0 and the MiBench embedded benchmark suites are analyzed using the Intel C compiler. The results show what can be achieved today, provide upper bounds on the performance potential of different types of thread parallelism, and point out a number of issues that need to be addressed to improve performance. The latter include parallelization of libraries such as libc and design of parallel algorithms to allow maximal exploitation of parallelism. The results also point to the need for developing new benchmark suites more suitable to parallel compilation and execution.


ACM Transactions on Embedded Computing Systems | 2008

Improving SDRAM access energy efficiency for low-power embedded systems

Jelena Trajkovic; Alexander V. Veidenbaum; Arun Kejariwal

DRAM (dynamic random-access memory) energy consumption in low-power embedded systems can be very high, exceeding that of the data cache or even that of the processor. This paper presents and evaluates a scheme for reducing the energy consumption of SDRAM (synchronous DRAM) memory access by a combination of techniques that take advantage of SDRAM energy efficiencies in bank and row access. This is achieved by using small, cachelike structures in the memory controller to prefetch an additional cache block(s) on SDRAM reads and to combine block writes to the same SDRAM row. The results quantify the SDRAM energy consumption of MiBench applications and demonstrate significant savings in SDRAM energy consumption, 23%, on average, and reduction in the energy-delay product, 44%, on average. The approach also improves performance: the CPI is reduced by 26%, on average.
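The two controller-side techniques described above can be sketched as a toy model. The block/row geometry, buffer organization, and energy proxy below are illustrative assumptions, not the paper's configuration:

```python
# Toy model of the scheme described above: a small cache-like
# structure in the memory controller prefetches the adjacent block on
# an SDRAM read, and a write buffer combines writes to the same row.
# Sizes and the access-count energy proxy are assumptions.

BLOCK = 32    # bytes per cache block (assumed)
ROW = 1024    # bytes per SDRAM row (assumed)

class MemoryController:
    def __init__(self):
        self.prefetch_buf = {}    # small structure: block addr -> data
        self.write_buf = {}       # row index -> dirty block addresses
        self.sdram_accesses = 0   # row activations (proxy for energy)

    def read(self, addr):
        addr -= addr % BLOCK
        if addr in self.prefetch_buf:      # hit: no SDRAM access needed
            return self.prefetch_buf.pop(addr)
        self.sdram_accesses += 1           # open the row once...
        nxt = addr + BLOCK
        if nxt // ROW == addr // ROW:      # ...and prefetch the neighbour
            self.prefetch_buf[nxt] = f"data@{nxt}"
        return f"data@{addr}"

    def write(self, addr):
        addr -= addr % BLOCK
        self.write_buf.setdefault(addr // ROW, set()).add(addr)

    def drain_writes(self):
        # One row activation per row, however many blocks were combined.
        self.sdram_accesses += len(self.write_buf)
        self.write_buf.clear()

mc = MemoryController()
mc.read(0)                   # miss: one row activation, prefetches block 32
mc.read(32)                  # served from the prefetch buffer, no SDRAM access
mc.write(0); mc.write(64)    # two blocks in the same row
mc.drain_writes()            # combined into a single row access
```

The savings come from the SDRAM's cost structure: opening a row is expensive, while transferring additional data from an already-open row is comparatively cheap, so both techniques convert multiple row activations into one.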


ACM Transactions on Embedded Computing Systems | 2009

On the exploitation of loop-level parallelism in embedded applications

Arun Kejariwal; Alexander V. Veidenbaum; Alexandru Nicolau; Milind Girkar; Xinmin Tian; Hideki Saito

Advances in silicon technology have enabled increasing support for hardware parallelism in embedded processors. Vector units, multiple processors/cores, multithreading, special-purpose accelerators such as DSPs or cryptographic engines, or a combination of the above have appeared in a number of processors. They serve to address the increasing performance requirements of modern embedded applications. To what extent the available hardware parallelism can be exploited is directly dependent on the amount of parallelism inherent in the given application and the congruence between the granularity of hardware and application parallelism. This paper discusses how loop-level parallelism in embedded applications can be exploited in hardware and software. Specifically, it evaluates the efficacy of automatic loop parallelization and the performance potential of different types of parallelism, viz., true thread-level parallelism (TLP), speculative thread-level parallelism and vector parallelism, when executing loops. Additionally, it discusses the interaction between parallelization and vectorization. Applications from both the industry-standard EEMBC® 1.1 and EEMBC 2.0 suites and the academic MiBench embedded benchmark suite are analyzed using the Intel® C compiler. The results show the performance that can be achieved today on real hardware and using a production compiler, provide upper bounds on the performance potential of the different types of thread-level parallelism, and point out a number of issues that need to be addressed to improve performance. The latter include parallelization of libraries such as libc and design of parallel algorithms to allow maximal exploitation of parallelism. The results also point to the need for developing new benchmark suites more suitable to parallel compilation and execution.


International Conference on Embedded Computer Systems: Architectures, Modeling, and Simulation | 2008

Comparative architectural characterization of SPEC CPU2000 and CPU2006 benchmarks on the Intel® Core™ 2 Duo processor

Arun Kejariwal; Alexander V. Veidenbaum; Alexandru Nicolau; Xinmin Tian; Milind Girkar; Hideki Saito; Utpal Banerjee

SPEC CPU benchmarks are commonly used by compiler writers and architects of general-purpose processors for performance evaluation. Since the release of the CPU89 suite, the SPEC CPU benchmark suites have evolved, with applications removed, added, or upgraded. This influences the design decisions for the next generation of compilers and microarchitectures. In view of the above, it is critical to characterize the applications in the new suite - SPEC CPU2006 - to guide the decision-making process. Although similar studies using the retired SPEC CPU benchmark suites have been done in the past, to the best of our knowledge, a thorough performance characterization of CPU2006 and its comparison with CPU2000 has not been done so far. In this paper, we present the above. For this, we compiled the applications in CPU2000 and CPU2006 using the Intel® Fortran/C++ optimizing compiler and executed them, using the reference data sets, on the state-of-the-art Intel® Core™ 2 Duo processor. The performance information was collected using the Intel® VTune™ performance analyzer, which takes advantage of the built-in hardware performance counters to obtain accurate information on program behavior and its use of processor resources. The focus of this paper is on branch and memory access behavior, the well-known reasons for program performance problems. By analyzing and comparing the L1 data and L2 cache miss rates, branch prediction accuracy, and resource stalls, the performance impact in each suite is indirectly determined and described. Not surprisingly, the CPU2006 codes are larger, more complex, and have larger data sets. This leads to higher average L2 cache miss rates and a slight reduction in average IPC compared to the CPU2000 suite. Similarly, the average branch behavior is slightly worse in the CPU2006 suite. However, based on processor stall counts, branches are much less of a problem.
The results presented here are a step towards understanding the SPEC CPU2006 benchmarks and will aid compiler writers in understanding the impact of currently implemented optimizations and in the design of new ones to address the new challenges presented by SPEC CPU2006. Similar opportunities exist for architecture optimization.
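The metrics compared in this study (IPC, cache miss rates, branch prediction accuracy) are ratios of raw hardware-counter values. A sketch of the derivation follows, with generic counter names and made-up sample numbers; these are not VTune's actual event names nor the paper's measurements:

```python
# Illustrative derivation of the characterization metrics from raw
# hardware-counter values. Counter names and the sample numbers are
# placeholders, not actual VTune events or measured data.

def derive_metrics(c):
    return {
        "IPC": c["instructions_retired"] / c["cpu_cycles"],
        "L1D_miss_rate": c["l1d_misses"] / c["l1d_accesses"],
        "L2_miss_rate": c["l2_misses"] / c["l2_accesses"],
        "branch_mispredict_rate": c["branch_mispredicts"] / c["branches_retired"],
    }

sample = {
    "instructions_retired": 2_000_000_000,
    "cpu_cycles": 1_600_000_000,
    "l1d_accesses": 700_000_000,
    "l1d_misses": 21_000_000,
    "l2_accesses": 21_000_000,   # L2 is accessed on L1D misses
    "l2_misses": 4_200_000,
    "branches_retired": 300_000_000,
    "branch_mispredicts": 9_000_000,
}
metrics = derive_metrics(sample)
```

Note that the L2 miss rate is computed relative to L2 accesses (i.e., L1 misses), so a benchmark can have a high L2 miss rate yet few absolute L2 misses; this is one reason such comparisons are only an indirect indicator of performance impact, as the abstract notes.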

Collaboration


Dive into Arun Kejariwal's collaborations.

Top Co-Authors

Nikil D. Dutt

University of California
