Niket Kumar Choudhary

Publication

Featured research published by Niket Kumar Choudhary.

international conference on parallel architectures and compilation techniques | 2009

Core-Selectability in Chip Multiprocessors

Hashem Hashemi Najaf-abadi; Niket Kumar Choudhary; Eric Rotenberg

The centralized structures necessary for the extraction of instruction-level parallelism (ILP) are consuming progressively smaller portions of the total die area of chip multiprocessors (CMP). The reason for this is that scaling these structures does not enhance general performance as much as scaling the cache and interconnect. However, the fact that these structures now consume less proportional die area opens an avenue to enhancing their performance through truly overcoming the one-size-fits-all approach to their design. This paper proposes core-selectability – incorporating differently-designed cores that can be toggled into active employment. This enables differently customized ILP-extracting structures to be at hand in the system while not dramatically adding to the interconnect complexity. The design verification effort is minimized by separating the complexity of different core designs. Moreover, contrary to alternative approaches, the performance and power efficiency of the core designs are not compromised. Evaluation results are presented that show that, even when limiting the diversity between core designs to only the sizing of microarchitectural structures, core-selectability has the potential to provide notable performance enhancement (with an average of 10%) to scalable multithreaded applications, without increased concurrency. In addition, it can provide significantly greater throughput to multiprogrammed workloads by providing the potential for the system to transform into a heterogeneous design.

international symposium on microarchitecture | 2012

FabScalar: Automating Superscalar Core Design

Niket Kumar Choudhary; Salil V. Wadhavkar; Tanmay A. Shah; Hiran Mayukh; Jayneel Gandhi; Brandon H. Dwiel; Sandeep Navada; Hashem Hashemi Najaf-abadi; Eric Rotenberg

Providing multiple superscalar core types on a chip, each tailored to different classes of instruction-level behavior, is an exciting direction for increasing processor performance and energy efficiency. Unfortunately, processor design and verification effort increases with each additional core type, limiting the microarchitectural diversity that can be practically implemented. FabScalar aims to automate superscalar core design, opening up processor design to microarchitectural diversity and its many opportunities.

international conference on parallel architectures and compilation techniques | 2013

A unified view of non-monotonic core selection and application steering in heterogeneous chip multiprocessors

Sandeep Navada; Niket Kumar Choudhary; Salil V. Wadhavkar; Eric Rotenberg

A single-ISA heterogeneous chip multiprocessor (HCMP) is an attractive substrate to improve single-thread performance and energy efficiency in the dark silicon era. We consider HCMPs comprised of non-monotonic core types where each core type is performance-optimized to different instruction-level behavior and hence cannot be ranked - different program phases achieve their highest performance on different cores. Although non-monotonic heterogeneous designs offer higher performance potential than either monotonic heterogeneous designs or homogeneous designs, steering applications to the best-performing core is challenging due to performance ambiguity of core types.

high performance embedded architectures and compilers | 2012

Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era

George Patsilaras; Niket Kumar Choudhary; James Tuck

Extracting high memory-level parallelism (MLP) is essential for speeding up single-threaded applications that are memory-bound. At the same time, the projected amount of dark silicon (the fraction of the chip powered off) on a chip is growing. Hence, Asymmetric Multicore Processors (AMP) offer a unique opportunity to integrate many types of cores, each powered at different times, in order to optimize for different regions of execution. In this work, we quantify the potential for exploiting core customization to speed up programs during regions of high MLP. Based on a careful design space exploration, we discover that an AMP that includes a narrow and fast specialized core has the potential to efficiently exploit MLP. Using the results of our analysis, we design an AMP with both an MLP-specialized and an ILP-specialized core, and we propose a hardware-level application steering mechanism called Symbiotic Core Execution (SCE). SCE detects MLP phases by monitoring the L2 miss rate of the application, and it uses that information to steer the application to the best core. Interestingly, we show that L2 miss rates are important for deciding when an MLP region begins and when it ends. As a program runs, its execution migrates to a core customized for MLP during regions of high MLP; when the region ends, it is re-scheduled on the core that fits the application characteristics. Compared to a monolithic core optimized for both modes of operation, our AMP design provides a harmonic mean performance improvement of 5.3% and 6.6% for SPEC2000 and SPEC2006, respectively, with a maximum speedup of 14.5%. For the same study, it achieves an 18.3% and 21.1% energy-delay² reduction for SPEC2000 and SPEC2006, respectively. Our findings yield an important message for designing AMPs with specialized cores: core customization enables efficient exploitation of MLP, and application steering mechanisms for MLP are simple to implement and effective.
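
The steering behavior described in this abstract can be illustrated with a minimal sketch. It is not the authors' hardware mechanism: the abstract states only that SCE monitors the L2 miss rate to decide when an MLP region begins and ends. The two thresholds, the hysteresis gap between them, the per-interval sampling, the misses-per-instruction unit, and all names (`Steering`, `steer`, `enter_mlp`, `exit_mlp`) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Steering:
    # Hypothetical thresholds, in L2 misses per instruction (not from the paper).
    enter_mlp: float = 0.02  # above this, assume an MLP region has begun
    exit_mlp: float = 0.01   # below this, assume the MLP region has ended
    core: str = "ILP"        # core the application currently runs on

    def step(self, l2_misses: int, instructions: int) -> str:
        """Consume one sampling interval and return the core chosen for the next one."""
        rate = l2_misses / instructions if instructions else 0.0
        if self.core == "ILP" and rate > self.enter_mlp:
            self.core = "MLP"   # migrate to the MLP-specialized core
        elif self.core == "MLP" and rate < self.exit_mlp:
            self.core = "ILP"   # region ended: return to the ILP-specialized core
        return self.core


def steer(samples):
    """Apply the steering rule to a sequence of (l2_misses, instructions) intervals."""
    steering = Steering()
    return [steering.step(misses, insts) for misses, insts in samples]
```

Using two thresholds instead of one is a design choice of this sketch: it keeps the application from migrating back and forth when the miss rate hovers near a single cutoff.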

international symposium on performance analysis of systems and software | 2012

FPGA modeling of diverse superscalar processors

Brandon H. Dwiel; Niket Kumar Choudhary; Eric Rotenberg

There is increasing interest in using Field Programmable Gate Arrays (FPGAs) as platforms for computer architecture simulation. This paper is concerned with modeling superscalar processors with FPGAs. To be transformative, the FPGA modeling framework should meet three criteria. (1) Configurable: The framework should be able to model diverse superscalar processors, like a software model. In particular, it should be possible to vary superscalar parameters such as fetch, issue, and retire widths, depths of pipeline stages, queue sizes, etc. (2) Automatic: The framework should be able to automatically and efficiently map any one of its superscalar processor configurations to the FPGA. (3) Realistic: The framework should model a modern superscalar microarchitecture in detail, ideally with prototype quality, to enable a new era and depth of microarchitecture research. A framework that meets these three criteria will enjoy the convenience of a software model, the speed of an FPGA model, and the experience of a prototype. This paper describes FPGA-Sim, a configurable, automatically FPGA-synthesizable, and register-transfer-level (RTL) model of an out-of-order superscalar processor. FPGA-Sim enables FPGA modeling of diverse superscalar processors out-of-the-box. Moreover, its direct RTL implementation yields the fidelity of a hardware prototype.

international conference on parallel architectures and compilation techniques | 2010

Criticality-driven superscalar design space exploration

Sandeep Navada; Niket Kumar Choudhary; Eric Rotenberg

It has become increasingly difficult to perform design space exploration (DSE) of computer systems with a short turnaround time because of exploding design spaces, increasing design complexity and long-running workloads. Researchers have used classical search/optimization techniques like simulated annealing, genetic algorithms, etc., to accelerate the DSE. While these techniques are better than an exhaustive search, a substantial amount of time must still be dedicated to DSE. This is a serious bottleneck in reducing research/development time. These techniques do not perform the DSE quickly enough, primarily because they do not leverage any insight as to how the different design parameters of a computer system interact to increase or degrade performance at a design point and treat the computer system as a “black-box”.

2012 IEEE/IFIP 20th International Conference on VLSI and System-on-Chip (VLSI-SoC) | 2012

A physical design study of FabScalar-generated superscalar cores

Niket Kumar Choudhary; Brandon H. Dwiel; Eric Rotenberg

FabScalar is a recently published tool for automatically generating superscalar cores of different pipeline widths, depths, and sizes. The output of FabScalar is a synthesizable register-transfer-level (RTL) description of the desired core. While this capability makes sophisticated cores more accessible to designers and researchers, meaningful applications require reducing RTL descriptions to physical designs. This paper presents the first systematic physical design study of FabScalar-generated superscalar cores.

international symposium on quality electronic design | 2013

Hetero² 3D integration: A scheme for optimizing efficiency/cost of Chip Multiprocessors

Shivam Priyadarshi; Niket Kumar Choudhary; Brandon H. Dwiel; Ankita Upreti; Eric Rotenberg; Rhett Davis; Paul D. Franzon

Timing the transition of a processor design to a new technology poses a provocative tradeoff. On the one hand, transitioning as early as possible offers a significant competitive advantage, by bringing improved designs to market early. On the other hand, an aggressive strategy may prove to be unprofitable, due to the low manufacturing yield of a technology that has not had time to mature. We propose exploiting two complementary forms of heterogeneity to profitably exploit an immature technology for Chip Multiprocessors (CMP). First, 3D integration facilitates a technology alloy. The CMP is split across two dies, one fabricated in the old technology and the other in the new technology. The alloy derives benefit from the new technology while limiting cost exposure. Second, to compensate for lower efficiency of old-technology cores, we exploit application and microarchitectural heterogeneity: applications which gain less from technology scaling are scheduled on old-technology cores; moreover, these cores are retuned to optimize this class of application. For a defect density ratio of 200 between 45nm and 65nm, Hetero² 3D gives 3.6× and 1.5× higher efficiency/cost compared to 2D and 3D homogeneous implementations, respectively, with only 6.5% degradation in efficiency. We also present a sensitivity analysis by sweeping the defect density ratio. The analysis reveals the defect density break-even points, where homogeneous 2D and 3D designs in 45nm achieve the same efficiency/cost as Hetero² 3D, marking significant points in the maturing of the technology.

international conference on computer design | 2014

Design-effort alloy: Boosting a highly tuned primary core with untuned alternate cores

Elliott Forbes; Niket Kumar Choudhary; Brandon H. Dwiel; Eric Rotenberg

A commercial flagship superscalar core is a highly tuned machine. Designers spend significant effort to tune the register-transfer-level (RTL) model, circuits, and layout to optimize performance and power. Nonetheless, the one-size-fits-all microarchitecture still suffers from suboptimal performance and power on individual applications. A single-ISA heterogeneous multi-core, with its multiple diverse core designs, has potential to exploit application diversity. However, tuning multiple core types will incur insurmountable design effort. This paper proposes a new class of single-ISA heterogeneous multi-core processor, called design-effort alloy (DEA). Only one of the core types, called the high-effort core (HEC), is tuned using a high-effort design flow. Much less effort is spent on tuning other core types, called low-effort cores (LECs). We begin with synthesizable RTL designs of a palette of out-of-order superscalar core types. An LEC and an HEC are designed for each core type: the LEC is based on design automation and the HEC is derived from its LEC counterpart, using frequency and energy scaling factors that account for RTL, circuit, and layout optimizations. The resulting HECs have more than a 2× frequency advantage with only a 1.3× increase in energy consumption compared to their corresponding LECs. From the palette of core types, we find the best 4-core-type DEA processor for 179 SPEC SimPoints (program phases). Our study yielded the following key results: 1) The DEA processor's HEC is the same core type as in the best high-effort homogeneous multi-core, owing to most program phases demonstrating “average” instruction-level behavior and favoring this balanced core. 2) The DEA processor yields a speedup in BIPS³/W of 1%-87%, and a geometric-mean speedup of 25%, on 20 out of 179 SimPoints over the best high-effort homogeneous multi-core. Thus, untuned LECs operating at less than half the frequency of the HEC nonetheless accelerate program phases with “outlier” instruction-level behavior.
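
The BIPS³/W metric quoted in this abstract can be computed as in the sketch below, assuming its conventional reading: billions of instructions per second, cubed, divided by average power in watts. The function names and the idea of comparing two measured runs are hypothetical and are not taken from the paper.

```python
def bips3_per_watt(instructions: float, seconds: float, watts: float) -> float:
    """Return BIPS^3/W for one run (higher is better)."""
    bips = instructions / seconds / 1e9  # billions of instructions per second
    return bips ** 3 / watts


def metric_speedup(candidate: float, baseline: float) -> float:
    """Ratio of two BIPS^3/W values; 1.25 would correspond to a 25% speedup."""
    return candidate / baseline
```

Because throughput is cubed, the metric weights performance far more heavily than power. For a fixed instruction count it is proportional to the inverse of energy-delay².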

international symposium on computer architecture | 2011

FabScalar: composing synthesizable RTL designs of arbitrary cores within a canonical superscalar template

Niket Kumar Choudhary; Salil V. Wadhavkar; Tanmay A. Shah; Hiran Mayukh; Jayneel Gandhi; Brandon H. Dwiel; Sandeep Navada; Hashem Hashemi Najaf-abadi; Eric Rotenberg
