Ronak Singhal | Researchain

Archive Network Publication Hotspot Collaboration

Network

Latest external collaboration on country level. Dive into details by clicking on the dots.

Explore More

Hotspot

Dive into the research topics where Ronak Singhal is active.

Explore More

Publication

Featured researches published by Ronak Singhal.

international symposium on computer architecture | 2010

Debunking the 100X GPU vs. CPU myth: an evaluation of throughput computing on CPU and GPU

Victor W. Lee; Changkyu Kim; Jatin Chhugani; Michael E. Deisher; Daehyun Kim; Anthony D. Nguyen; Nadathur Satish; Mikhail Smelyanskiy; Srinivas Chennupaty; Per Hammarlund; Ronak Singhal; Pradeep Dubey

Recent advances in computing have led to an explosion in the amount of data being generated. Processing the ever-growing data in a timely manner has made throughput computing an important aspect for emerging applications. Our analysis of a set of important throughput computing kernels shows that there is an ample amount of parallelism in these kernels which makes them suitable for todays multi-core CPUs and GPUs. In the past few years there have been many studies claiming GPUs deliver substantial speedups (between 10X and 1000X) over multi-core CPUs on these kernels. To understand where such large performance difference comes from, we perform a rigorous performance analysis and find that after applying optimizations appropriate for both CPUs and GPUs the performance gap between an Nvidia GTX280 processor and the Intel Core i7-960 processor narrows to only 2.5x on average. In this paper, we discuss optimization techniques for both CPU and GPU, analyze what architecture features contributed to performance differences between the two architectures, and recommend a set of architectural features which provide significant improvement in architectural efficiency for throughput kernels.

IEEE Micro | 2014

Haswell: The Fourth-Generation Intel Core Processor

Per Hammarlund; Alberto J. Martinez; Atiq Bajwa; David L. Hill; Erik G. Hallnor; Hong Jiang; Martin G. Dixon; Michael N. Derr; Mikal C. Hunsaker; Rajesh Kumar; Randy B. Osborne; Ravi Rajwar; Ronak Singhal; Reynold V. D'Sa; Robert S. Chappell; Shiv Kaushik; Srinivas Chennupaty; Stephan J. Jourdan; Steve H. Gunther; Thomas A. Piazza; Ted Burton

Haswell, Intels fourth-generation core processor architecture, delivers a range of client parts, a converged core for the client and server, and technologies used across many products. It uses an optimized version of Intel 22-nm process technology. Haswell provides enhancements in power-performance efficiency, power management, form factor and cost, core and uncore microarchitecture, and the cores instruction set.

field programmable gate arrays | 2009

Intel nehalem processor core made FPGA synthesizable

Graham Schelle; Jamison D. Collins; Ethan Schuchman; Perry H. Wang; Xiang Zou; Gautham N. Chinya; Ralf Plate; Thorsten Mattner; Franz Olbrich; Per Hammarlund; Ronak Singhal; Jim Brayton; Sebastian Steibl; Hong Wang

We present an FPGA-synthesizable version of the Intel Atom processor core, synthesized to a Virtex-5 based FPGA emulation system. To make the production Atom design in SystemVerilog synthesizable through industry standard EDA tool flow, we transformed and mapped latches in the design, converted clock gating, and replaced nonsynthesizable constructs with FPGA-synthesizable counterparts. Additionally, as the target FPGA emulator is hosted on a PC platform with the Pentium-based CPU socket that supports a significantly different front side bus (FSB) protocol from that of the Atom processor, we replaced the existing bus control logic in the Atom core with an alternate FSB protocol to communicate with the rest of the PC platform. With these efforts, we succeeded in synthesizing the entire Atom processor core to fit within a single Virtex-5 LX330 FPGA. The synthesizable Atom core runs at 50Mhz on the Pentium PC motherboard with fully functional I/O peripherals. It is capable of booting off-the-shelf MS-DOS, Windows XP and Linux operating systems, and executing standard x86 workloads.

high-performance computer architecture | 2016

Cache QoS: From concept to reality in the Intel® Xeon® processor E5-2600 v3 product family

Andrew J. Herdrich; Edwin Verplanke; Priya Autee; Ramesh Illikkal; Chris Gianos; Ronak Singhal; Ravi R. Iyer

Over the last decade, addressing quality of service (QoS) in multi-core server platforms has been growing research topic. QoS techniques have been proposed to address the shared resource contention between co-running applications or virtual machines in servers and thereby provide better isolation, performance determinism and potentially improve overall throughput. One of the most important shared resources is cache space. Most proposals for addressing shared cache contention are based on simulations and analysis and no commercial platforms were available that integrated such techniques and provided a practical solution. In this paper, we will present the first set of shared cache QoS techniques designed and implemented in state-of-the-art commercial servers (the Intel® Xeon® processor E5-2600 v3 product family). We will describe two key technologies: (i) Cache Monitoring Technology (CMT) to enable monitoring of shared cache usage by different applications and (ii) Cache Allocation Technology (CAT) which enables redistribution of shared cache space between applications to address contention. This is the first paper to describing these techniques as they moved from concept to reality, starting from early research to product implementation. We will also present case studies highlighting the value of these techniques using example scenarios of multi-programmed workloads, virtualized platforms in datacenters and communications platforms. Finally, we will describe initial software infrastructure and enabling for industry practitioners and researchers to take advantage of these technologies for their QoS needs.

international symposium on computer architecture | 2014

Improving the energy efficiency of big cores

Kenneth Czechowski; Victor W. Lee; Ed Grochowski; Ronny Ronen; Ronak Singhal; Richard W. Vuduc; Pradeep Dubey

Traditionally, architectural innovations designed to boost single-threaded performance incur overhead costs which significantly increase power consumption. In many cases the increase in power exceeds the improvement in performance, resulting in a net increase in energy consumption. Thus, it is reasonable to assume that modern attempts to improve single-threaded performance will have a negative impact on energy efficiency. This has led to the belief that “Big Cores” are inherently inefficient. To the contrary, we present a study which finds that the increased complexity of the core microarchitecture in recent generations of the Intel® Core™ processor have reduced both the time and energy required to run various workloads. Moreover, taking out the impact of process technology changes, our study still finds the architecture and microarchitecture changes -such as the increase in SIMD width, addition of the frontend caches, and the enhancement to the out-of-order execution engine- account for 1.2x improvement in energy efficiency for these processors. This paper provides real-world examples of how architectural innovations can mitigate inefficiencies associated with “Big Cores” -for example, micro-op caches obviate the costly decode of complex x86 instructions- resulting in a core architecture that is both high performance and energy efficient. It also contributes to the understanding of how microarchitecture affects performance, power and energy efficiency by modeling the relationship between them.

computing frontiers | 2011

AstroLIT: enabling simulation-based microarchitecture comparison between Intel® and Transmeta designs

Guilherme Ottoni; Gautham N. Chinya; Gerolf F. Hoflehner; Jamison D. Collins; Amit Kumar; Ethan Schuchman; David R. Ditzel; Ronak Singhal; Hong Wang

While the out-of-order engine has been the mainstream micro-architecture-design paradigm to achieve high performance, Transmeta took a different approach using dynamic binary translation (BT). To enable detailed comparison of these two radically different processor-design approaches, it is natural to leverage well-established simulation-based methodologies. However, BT-based processor designs pose new challenges to standard sampling-based simulation methodologies. This paper describes these challenges, and it also introduces the AstroLIT methodology to address them.

Archive | 2004