Bharan Giridhar
University of Michigan
Network
Latest external collaboration on country level. Dive into details by clicking on the dots.
Publication
Featured researches published by Bharan Giridhar.
international solid-state circuits conference | 2012
David Fick; Ronald G. Dreslinski; Bharan Giridhar; Gyouho Kim; Sangwon Seo; Matthew Fojtik; Sudhir Satpathy; Yoonmyung Lee; Daeyeon Kim; Nurrachman Liu; Michael Wieckowski; Gregory K. Chen; Trevor N. Mudge; Dennis Sylvester; David T. Blaauw
Recent high performance IC design has been dominated by power density constraints. 3D integration increases device density even further, and these devices will not be usable without viable strategies to reduce power consumption. This paper proposes the use of near-threshold computing (NTC) to address this issue in a stacked 3D system. In NTC, cores are operated near the threshold voltage (~200mV above Vth) to optimally balance power and performance [1]. In Centip3De, we operate cores at 650mV, as opposed to the wear-out limited supply voltage of 1.5V. This improves measured energy efficiency by 5.1×. The dramatically lower power consumption of NTC makes it an attractive match for 3D design, which has limited power dissipation capabilities, but also has improved innate power and performance compared to 2D design.
international solid-state circuits conference | 2013
Suyoung Bang; Allan Wang; Bharan Giridhar; David T. Blaauw; Dennis Sylvester
Ultra-low power microsystems are gaining more popularity due to their applicability in critical areas of societal need. Power management in these microsystems is a major challenge as a relatively high battery voltage (e.g., 4V) must be down-converted to several low supplies, such as 0.6V for near-threshold digital circuits and 1.2V for analog circuits [1]. Furthermore, the small form factors of such systems rule out the use of external inductors, making switched-capacitor (SC) DC-DC converters the favored topology [2-4].
high-performance computer architecture | 2013
Nilmini Abeyratne; Reetuparna Das; Qingkun Li; Korey Sewell; Bharan Giridhar; Ronald G. Dreslinski; David T. Blaauw; Trevor N. Mudge
In this paper, we explore the challenges in scaling on-chip networks towards kilo-core processors. Current low-radix topologies optimize for fast local communication, but do not scale well to kilo-core systems because of the large number of routers required. These increase both power and hop count. In contrast, symmetric high-radix topologies optimize for global communication with fewer hop counts, but degrade local communication with their large, slow routers. To address both local and global communication optimizations independently, we decouple the interconnect design using asymmetric high-radix topologies. By setting a design goal of matching rauter speed with wire speed, our praposed topologies use fast medium-radix rauters to optimize for local communication and a few slow high-radix rauters that reduce hop count to optimize for global communication. Our asymmetric high-radix designs are enabled by recently praposed SwizzleSwitches, which allow us to achieve peiformance scalability within realistic power budgets. We prapose and evaluate two asymmetric high-radix topologies: Super-Star (asymmetric folded Clos) and Super-StarX (asymmetric folded Clos with superimposed mesh). Our evaluations show that the best performing asymmetric high-radix topology improves average network latency over a mesh by 45% while reducing the power consumption by 40%. When compared to symmetric high-radix topologies network thraughput is improved by 2.9× while still praviding similar latency benefits and power ejficiency.
international solid-state circuits conference | 2011
Yoonmyung Lee; Bharan Giridhar; Zhiyoong Foo; Dennis Sylvester; David T. Blaauw
Recent work in ultra-low-power sensor platforms has enabled a number of new applications in medical, infrastructure, and environmental monitoring. Due to their limited energy storage volume, these sensors operate with long idle times and ultra-low standby power ranging from 10s of nW down to 100s of pW [1–2]. Since radio transmission is relatively expensive, even at the lowest reported power of 0.2mW [3], wireless communication between sensor nodes must be performed infrequently. Accurate measurement of the time interval between communication events (i.e. the synchronization cycle) is of great importance. Inaccuracy in the synchronization cycle time results in a longer period of uncertainty where sensor nodes are required to enable their radios to establish communication (Fig. 2.7.1), quickly making radios dominate the energy budget. Quartz crystal oscillators and CMOS harmonic oscillators exhibit very small sensitivity to supply voltage and temperature [4] but cannot be used in the target application space since they operate at very high frequencies and exhibit power consumption that is several orders of magnitude larger (>300nW) than the needed idle power. A gate-leakage-based timer was proposed [5] that leveraged small gate leakage currents to achieve power consumption within the required budget (< 1nW). However, this timer incurs high RMS jitter (1400ppm) and temperature sensitivity (0.16%/ºC). A 150pW program-and-hold timer was proposed [6] to reduce temperature sensitivity but its drifting clock frequency limits its use for synchronization. The quality of a timer is not captured well by RMS jitter since it ignores the averaging of jitter over multiple timer clock periods in a single synchronization cycle. Instead, we propose the uncertainty in a single synchronization cycle of length T as new metric and use this synchronization uncertainty (SU) to evaluate different timer approaches. The timer period is a random variable X(n), with mean and sigma, μ and σ. Given a synchronization cycle time T, consisting of N timer periods, we define SU as the standard deviation of T as given by √T/μ × σ, assuming X(n) is Gaussian. Note that a smaller clock period increases N and results in more averaging and a lower SU with fixed jitter (σ/μ).
ieee international conference on high performance computing data and analytics | 2013
Bharan Giridhar; Michael Cieslak; Deepankar Duggal; Ronald G. Dreslinski; Hsing Min Chen; Robert Patti; Betina Hold; Chaitali Chakrabarti; Trevor N. Mudge; David T. Blaauw
The power target for exascale supercomputing is 20MW, with about 30% budgeted for the memory subsystem. Commodity DRAMs will not satisfy this requirement. Additionally, the large number of memory chips (>10M) required will result in crippling failure rates. Although specialized DRAM memories have been reorganized to reduce power through 3D-stacking or row buffer resizing, their implications on fault tolerance have not been considered. We show that addressing reliability and energy is a co-optimization problem involving tradeoffs between error correction cost, access energy and refresh power-reducing the physical page size to decrease access energy increases the energy/area overhead of error resilience. Additionally, power can be reduced by optimizing bitline lengths. The proposed 3D-stacked memory uses a page size of 4kb and consumes 5.1pJ/bit based on simulations with NEK5000 benchmarks. Scaling to 100PB, the memory consumes 4.7MW at 100PB/s which, while well within the total power budget (20MW), is also error-resilient.
architectural support for programming languages and operating systems | 2014
Anthony Gutierrez; Michael Cieslak; Bharan Giridhar; Ronald G. Dreslinski; Luis Ceze; Trevor N. Mudge
Key-value stores, such as Memcached, have been used to scale web services since the beginning of the Web 2.0 era. Data center real estate is expensive, and several industry experts we have spoken to have suggested that a significant portion of their data center space is devoted to key value stores. Despite its wide-spread use, there is little in the way of hardware specialization for increasing the efficiency and density of Memcached; it is currently deployed on commodity servers that contain high-end CPUs designed to extract as much instruction-level parallelism as possible. Out-of-order CPUs, however have been shown to be inefficient when running Memcached. To address Memcached efficiency issues, we propose two architectures using 3D stacking to increase data storage efficiency. Our first 3D architecture, Mercury, consists of stacks of ARM Cortex-A7 cores with 4GB of DRAM, as well as NICs. Our second architecture, Iridium, replaces DRAM with NAND Flash to improve density. We explore, through simulation, the potential efficiency benefits of running Memcached on servers that use 3D-stacking to closely integrate low-power CPUs with NICs and memory. With Mercury we demonstrate that density may be improved by 2.9X, power efficiency by 4.9X, throughput by 10X, and throughput per GB by 3.5X over a state-of-the-art server running optimized Memcached. With Iridium we show that density may be increased by 14X, power efficiency by 2.4X, and throughput by 5.2X, while still meeting latency requirements for a majority of requests.
IEEE Journal of Solid-state Circuits | 2013
David Fick; Ronald G. Dreslinski; Bharan Giridhar; Gyouho Kim; Sangwon Seo; Matthew Fojtik; Sudhir Satpathy; Yoonmyung Lee; Daeyeon Kim; Nurrachman Liu; Michael Wieckowski; Gregory K. Chen; Trevor N. Mudge; David T. Blaauw; Dennis Sylvester
We present Centip3De, a large-scale 3D CMP with a cluster-based near-threshold computing (NTC) architecture. Centip3De uses a 3D stacking technology in conjunction with 130 nm CMOS. Measured results for a two-layer, 64-core system are discussed, with the system achieving 3930 DMIPS/W energy efficiency, which is >; 3x improvement over traditional operation at full supply voltage. This project demonstrates the feasibility of large-scale 3D design, a synergy between 3D and NTC architectures, a unique cluster-based NTC cache design, and how to maximize performance in a thermally-constrained design.
IEEE Micro | 2013
Ronald G. Dreslinski; David Fick; Bharan Giridhar; Gyouho Kim; Sangwon Seo; Matthew Fojtik; Sudhir Satpathy; Yoonmyung Lee; Daeyeon Kim; Nurrachman Liu; Michael Wieckowski; Gregory K. Chen; Dennis Sylvester; David T. Blaauw; Trevor N. Mudge
Centip3De uses the synergy between 3D integration and near-threshold computing to create a reconfigurable system that provides both energy-efficient operation and techniques to address single-thread performance bottlenecks. The original Centip3De design is a seven-layer 3D stacked design with 128 cores and 256 Mbytes of DRAM. Silicon results show a two-layer, 64-core system in 130-nm technology, which achieved an energy efficiency of 3,930 DMIPS/W.
IEEE Journal of Solid-state Circuits | 2013
Yoonmyung Lee; Bharan Giridhar; Zhiyoong Foo; Dennis Sylvester; David Blaauw
Accurate measurement of synchronization cycle time is required for ultra-low power wireless sensor nodes with stringent power budgets. A multi-stage gate-leakage-based timer with boosted charging is proposed to address the high jitter of prior-art gate-leakage-based timers. The key approaches are faster load capacitor charging, wider voltage swing, and an improved gain sensing inverter. The proposed timer reduces RMS jitter by 8.1× and synchronization uncertainty by 4.1×, which allows hourly tracking with 200 ms uncertainty while consuming 660 pW. A novel closed-loop temperature compensation scheme with dynamic leakage adjustment is also proposed to achieve temperature sensitivity of 31 ppm/°C.
international solid-state circuits conference | 2014
Bharan Giridhar; Nathaniel Ross Pinckney; Dennis Sylvester; David T. Blaauw
This work presents an area-efficient and variation-tolerant small-signal differential sensing (VTS) scheme that modifies the conventional SA circuit to include: 1) a structure for on-the-fly, auto-zeroing offset compensation, 2) pre-amplification of bitline differential by reconfiguring the SA inverter pair as amplifiers, and 3) latching of the amplified voltage differential by returning the SA to its conventional cross-coupled configuration. The approach is demonstrated to improve SA robustness over conventional sensing at isosensing time without area overhead (Fig. 13.7.1). Conversely, sensing time can be reduced at iso-robustness and area. Measurements of a 28nm CMOS test chip show that an iso-area VTS scheme improves offset noise tolerance by -1.2σVth or sensing speed by up to 42% at iso-robustness (<;0.3% failure rate).