Siddharth Garg
New York University
Publication
Featured research published by Siddharth Garg.
design automation conference | 2014
Muhammad Shafique; Siddharth Garg; Jörg Henkel; Diana Marculescu
Technology scaling has resulted in smaller and faster transistors in successive technology generations. However, transistor power consumption no longer scales commensurately with integration density and, consequently, it is projected that in future technology nodes it will only be possible to simultaneously power on a fraction of cores on a multi-core chip in order to stay within the power budget. The part of the chip that is powered off is referred to as dark silicon and brings new challenges as well as opportunities for the design community, particularly in the context of the interaction of dark silicon with thermal, reliability and variability concerns. In this perspectives paper we describe these new challenges and opportunities, and provide preliminary experimental evidence in their support.
design, automation, and test in europe | 2013
Bharathwaj Raghunathan; Yatish Turakhia; Siddharth Garg; Diana Marculescu
It is projected that increasing on-chip integration with technology scaling will lead to the so-called dark silicon era in which more transistors are available on a chip than can be simultaneously powered on. It is conventionally assumed that the dark silicon will be provisioned with heterogeneous resources, for example dedicated hardware accelerators. In this paper we challenge the conventional assumption and build a case for homogeneous dark silicon CMPs that exploit the inherent variations in process parameters that exist in scaled technologies to offer increased performance. Since process variations result in core-to-core variations in power and frequency, the idea is to cherry pick the best subset of cores for an application so as to maximize performance within the power budget. To this end, we propose a polynomial time algorithm for optimal core selection, thread mapping and frequency assignment for a large class of multi-threaded applications. Our experimental results based on the Sniper multi-core simulator show that up to 22% and 30% performance improvement is observed for homogeneous CMPs with 33% and 50% dark silicon, respectively.
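The cherry-picking idea in the abstract above can be illustrated with a small sketch. This is not the authors' algorithm: it assumes a deliberately simplified model in which each core runs at a single fixed frequency with a fixed power draw, every thread gets its own core, and application performance is set by the slowest selected core (as in barrier-synchronized workloads). The function name `pick_cores` and the `(frequency, power)` tuple layout are hypothetical.

```python
from typing import List, Optional, Tuple

# Hypothetical per-core characterization: (frequency in GHz, power in W).
Core = Tuple[float, float]


def pick_cores(cores: List[Core], n_threads: int,
               power_budget: float) -> Optional[List[int]]:
    """Pick n_threads cores that maximize the slowest selected core's
    frequency while keeping total power within power_budget.

    Returns the sorted indices of the chosen cores, or None if no
    selection fits the budget. Simplified model, for illustration only.
    """
    # Try each distinct frequency as the bottleneck, fastest first.
    for f_min in sorted({f for f, _ in cores}, reverse=True):
        eligible = [i for i, (f, _) in enumerate(cores) if f >= f_min]
        if len(eligible) < n_threads:
            continue
        # Among cores at least this fast, the cheapest n_threads cores
        # give the lowest possible total power for this bottleneck.
        chosen = sorted(eligible, key=lambda i: cores[i][1])[:n_threads]
        if sum(cores[i][1] for i in chosen) <= power_budget:
            return sorted(chosen)
    return None


if __name__ == "__main__":
    # Four nominally identical cores that differ due to process variation.
    chip = [(3.0, 10.0), (2.8, 6.0), (2.5, 4.0), (2.9, 9.0)]
    print(pick_cores(chip, n_threads=2, power_budget=15.0))
```

Because the candidate bottleneck frequencies are scanned from fastest to slowest, the first feasible selection is also the best one under this model. Per-core frequency scaling and thread-to-core mapping, which the abstract also mentions, are left out of the sketch.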
design automation conference | 2013
Yatish Turakhia; Bharathwaj Raghunathan; Siddharth Garg; Diana Marculescu
In this paper, we propose an efficient iterative optimization based approach for architectural synthesis of dark silicon heterogeneous chip multi-processors (CMPs). The goal is to determine the optimal number of cores of each type to provision the CMP with, such that the area and power budgets are met and the application performance is maximized. We consider general-purpose multi-threaded applications with a varying degree of parallelism (DOP) that can be set at run-time, and propose an accurate analytical model to predict the execution time of such applications on heterogeneous CMPs. Our experimental results illustrate that the synthesized heterogeneous dark silicon CMPs provide between 19% and 60% performance improvements over conventional homogeneous designs for variable and fixed DOP scenarios, respectively.
international conference on hardware/software codesign and system synthesis | 2014
Muhammad Shafique; Siddharth Garg; Tulika Mitra; Sri Parameswaran; Jörg Henkel
Dark Silicon refers to the observation that in future technology nodes, it may only be possible to power-on a fraction of on-chip resources (processing cores, hardware accelerators, cache blocks and so on) in order to stay within the power budget and safe thermal limits, while the other resources will have to be kept powered-off or “dark”. In other words, chips will have an abundance of transistors, i.e., more than the number that can be simultaneously powered-on. Heterogeneous computing has been proposed as one way to effectively leverage this abundance of transistors in order to increase performance, energy efficiency and even reliability within power and thermal constraints. However, several critical challenges remain to be addressed including design, automated synthesis, design space exploration and run-time management of heterogeneous dark silicon processors. The hardware/software co-design and synthesis community has potentially much to contribute in solving these new challenges introduced by dark silicon and, in particular, heterogeneous computing. In this paper, we identify and highlight some of these critical challenges, and outline some of our early research efforts in addressing them.
international symposium on quality electronic design | 2009
Siddharth Garg; Diana Marculescu
3D Integrated Circuits (ICs) have been recently proposed as a solution to the increasing wire delay concerns in scaled technologies. At the same time, technology scaling leads to increasing variability in manufacturing process parameters, making it imperative to quantify the impact of these variations on performance. In this work, we take, to the best of our knowledge, the first step towards formally modeling the impact of process variations on the clock frequency of fully-synchronous (FS) 3D ICs. The proposed analytical models demonstrate theoretically and experimentally that 3D designs behave very differently under the impact of process variations as compared to equivalent 2D designs. In particular, for the same number of critical paths, we show that a 3D design is always less likely to meet a pre-defined frequency target compared to its 2D counterpart. Furthermore, as opposed to models for 2D ICs, the 3D models need to accurately account for not only within-die (WID) critical paths, i.e., paths that lie entirely within one of the die layers, but also die-to-die (D2D) critical paths that use through-silicon vias (TSVs) to span across multiple dies in the 3D stack. Finally, we show, theoretically and experimentally, that the mapping of critical paths to the die layers of a 3D IC can also affect the timing yield of a design, while the mapping issue does not arise in the 2D case since there is only a single die layer in a 2D IC. The accuracy of the proposed models is experimentally verified and found to be in excellent agreement with detailed SPICE and gate-level Monte Carlo (MC) simulations.
IEEE Transactions on Very Large Scale Integration Systems | 2012
Sebastian Herbert; Siddharth Garg; Diana Marculescu
Fine-grained dynamic voltage/frequency scaling (DVFS) is an important tool in managing the balance between power and performance in chip-multiprocessors. Although manufacturing process variations are giving rise to significant core-to-core variations in power and performance, traditional DVFS controllers are unaware of these variations. Exploiting the different power profiles of the cores can significantly improve energy efficiency. Process variations do not significantly affect dynamic power, so less-leaky processing units are more energy-efficient than their leakier counterparts at a given supply voltage and frequency. Taking advantage of this observation, three existing DVFS control algorithms are modified to shift work from inefficient, leaky processing units to efficient, less leaky ones, maintaining performance while reducing total power consumption. This work-shifting is carried out both between dies in a given speed bin and between voltage/frequency islands on a given die. The gains enabled by incorporating variability-awareness into the three DVFS algorithms are demonstrated on both multithreaded and multiprogrammed workloads. For a baseline 16-core design with per-core voltage/frequency islands (VFIs) and a 4×4 mesh on-chip network, the aggregate power per squared throughput (power/throughput² or P/T²) over all fabricated dies is reduced by 9.2%, 5.7%, and 7.7% for the three controllers. Chip multiprocessor designs using other VFI granularities and network topologies are also examined.
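For readers unfamiliar with the power per squared throughput metric quoted above, the following minimal sketch shows how it can be computed for a single operating point and how a percentage reduction between two designs follows from it. The function names and the numbers in the usage lines are hypothetical, and the abstract does not state how the metric is aggregated over fabricated dies, so no aggregation is attempted here.

```python
def power_per_throughput_sq(power_w: float, throughput: float) -> float:
    """Power divided by squared throughput; lower is better.

    Squaring the throughput weights performance more heavily than power.
    Units depend on how throughput is measured (an assumption left to
    the caller, e.g. instructions per second).
    """
    if throughput <= 0:
        raise ValueError("throughput must be positive")
    return power_w / (throughput ** 2)


def percent_reduction(baseline: float, improved: float) -> float:
    """Relative reduction of a lower-is-better metric, in percent."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    # Made-up numbers: same throughput, 10% less power.
    base = power_per_throughput_sq(power_w=100.0, throughput=10.0)
    tuned = power_per_throughput_sq(power_w=90.0, throughput=10.0)
    print(f"reduction: {percent_reduction(base, tuned):.1f}%")
```

At equal throughput, any reduction in this metric comes entirely from reduced power, which matches the abstract's description of maintaining performance while lowering total power consumption.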
design automation conference | 2009
Siddharth Garg; Diana Marculescu; Radu Marculescu; Umit Y. Ogras
In this paper, we consider the case of network-on-chip (NoC) based multiple processor systems-on-chip (MPSoCs) implemented using multiple voltage and frequency islands (VFIs) that rely on fine grained dynamic voltage and frequency scaling (DVFS) for run time control of the system power dissipation. Specifically, we present a framework to compute theoretical bounds on the performance of DVFS controllers for such systems under the impact of three important technology driven constraints: reliability and temperature driven upper limits on the maximum supply voltage; inductive noise driven constraints on the maximum rate of change of voltage/frequency; and increasing manufacturing process variations. Our experimental results show that, for the benchmarks considered, any DVFS control algorithm will lose up to 87% performance, measured in terms of the number of steps required to reach a reference steady state, in the presence of maximum frequency and maximum frequency increment constraints. In addition, increasing process variations can lead to up to 60% of fabricated chips being unable to meet the specified DVFS control specifications, irrespective of the DVFS algorithm used.
international conference on hardware/software codesign and system synthesis | 2013
Da Cheng Juan; Siddharth Garg; Jinpyo Park; Diana Marculescu
Near-Threshold Computing (NTC) has emerged as a solution that promises to significantly increase the energy efficiency of next-generation multi-core systems. This paper evaluates and analyzes the behavior of dynamic voltage and frequency scaling (DVFS) control algorithms for multi-core systems operating under near-threshold, nominal, or turbo-mode conditions. We adapt the model selection technique from machine learning to learn the relationship between performance and power. The theoretical results show that the resulting models satisfy convexity properties essential to efficiently determining optimal voltage/frequency operating points for minimizing energy consumption under throughput constraints or maximizing throughput under a given power budget. Our experimental results show that, compared with DVFS in the conventional operating range, extended range DVFS control including turbo-mode and near-threshold operation achieves an additional (1) 13.28% average energy reduction under iso-performance conditions, and (2) 7.54% average throughput increase under iso-power conditions.
international conference on computer aided design | 2006
Diana Marculescu; Siddharth Garg
The problem of determining bounds for application completion times running on generic systems comprised of single or multiple voltage-frequency islands (VFIs) with arbitrary topologies is addressed in the context of manufacturing-driven variability. The approach provides an exact solution for the system-level timing yield in single clock, single voltage (SSV) and VFI systems with an underlying tree-based topology, and a tight upper bound for generic, non-tree based topologies. The results show that: (a) timing yield for overall source-to-sink completion time for generic systems can be modeled in an exact manner for both SSV and VFI systems; and (b) multiple VFI, latency-constrained systems can achieve 11-90% higher timing yield than their SSV counterparts. The results are proven formally and supported by experimental results on two embedded applications, namely software defined radio and MPEG2 encoder.
international symposium on low power electronics and design | 2010
Siddharth Garg; Diana Marculescu; Radu Marculescu
In this paper, we propose Custom Feedback Control, a new dynamic voltage and frequency control architecture for MPSoC designs that bridges the gap between the two extreme points on the performance versus implementation cost tradeoff curve, i.e., fully-centralized and fully-decentralized control architectures. We outline a methodology to efficiently explore the vast design space of Custom Feedback Control architectures, enabling designers to synthesize controllers that meet both the performance and implementation cost criteria. Our experimental results on an MPSoC platform running a video-encoding application demonstrate that, for the same energy dissipation, Custom Feedback Control can achieve within 5% of the performance of a fully-centralized controller with only 17% of the implementation cost. In contrast, the performance of a fully-decentralized controller can be up to 2.5X worse than that of the fully-centralized controller.