Himanshu Kaul
Intel
                                 Network
                            
                            Latest external collaboration on country level. Dive into details by clicking on the dots.
                                 Publication
                            
                            Featured researches published by Himanshu Kaul.
design automation conference | 2012
Himanshu Kaul; Mark A. Anders; Steven K. Hsu; Amit Agarwal; Ram K. Krishnamurthy; Shekhar Borkar
Moores Law will continue providing abundance of transistors for integration, only to be limited by the energy consumption. Near threshold voltage (NTV) operation has potential to improve energy efficiency by an order of magnitude. We discuss design techniques necessary for reliable operation over a wide range of supply voltage - from nominal down to subthreshold region. The system designed for NTV can dynamically select modes of operation, from high performance, to high energy efficiency, to the lowest power.
international solid-state circuits conference | 2008
Himanshu Kaul; Mark A. Anders; Sanu K. Mathew; Steven K. Hsu; Amit Agarwal; Ram K. Krishnamurthy; Shekhar Borkar
Motion estimation for compressing inter-frame redundancies is the most performance and power-critical operation in video encoding applications, where a wide range of throughput and power constraints are required to handle a variety of video resolution, frame rate and application specifications. A motion estimation engine targeted for special-purpose on-die acceleration of sum of absolute difference (SAD) computation in real-time video encoding workloads on power-constrained mobile microprocessors is fabricated in 65nm CMOS.
international solid-state circuits conference | 2014
Sanu K. Mathew; Sudhir K. Satpathy; Mark A. Anders; Himanshu Kaul; Steven K. Hsu; Amit Agarwal; Gregory K. Chen; Rachael J. Parker; Ram K. Krishnamurthy; Vivek De
Physically unclonable function (PUF) circuits are low-cost cryptographic primitives used for generation of unique, stable and secure keys or chip IDs for device authentication and data security in high-performance microprocessors [1][2][3][7]. The volatile nature of PUFs provides a high level of security and tamper resistance against invasive probing attacks compared to conventional fuse-based key storage technologies [4]. A process-voltage-temperature (PVT) variation-tolerant all-digital PUF array targeted for on-die generation of 100% stable, device-specific, high-entropy keys is fabricated in 22nm tri-gate high-κ metal-gate CMOS technology [5], featuring: i) a hybrid delay/cross-coupled PUF circuit where interaction of 16 minimum-sized, variation-impacted transistors determines resolution dynamics, ii) a temporal majority voting (TMV) circuit to stabilize occasionally unstable bits, resulting in 53% reduction in instability, iii) burn-in hardening to reinforce manufacturing-time PUF bias, resulting in 22% reduction in bit-errors, iv) soft dark bits for run-time identification and sequestration of highly unstable bits during field operation, resulting in 78% lower bit-errors, v) 19× separation between inter- and intra-PUF Hamming distance, enabling die-specific keys, vi) autocorrelation factor≈0 and entropy=0.9997, while passing NIST randomness tests, vii) high tolerance to voltage and temperature variation with 82% reduction in average Hamming-distance using a 100-cycle dark bit window, viii) in-situ PUF hardening by leveraging directed NBTI aging to improve stability during field operation, and ix) ultra-low energy consumption of 0.19pJ/b with compact bitcell layout of 4.66μm2 (Fig. 16.2.7a).
IEEE Journal of Solid-state Circuits | 2011
Sanu K. Mathew; Farhana Sheikh; Michael E. Kounavis; Shay Gueron; Amit Agarwal; Steven K. Hsu; Himanshu Kaul; Mark A. Anders; Ram K. Krishnamurthy
Abstract-This paper describes an on-die, reconfigurable AES encrypt/decrypt hardware accelerator fabricated in 45 nm CMOS, targeted for content-protection in high-performance microprocessors. 100% round computation in native GF(24)2 composite-field arithmetic, unified reconfigurable datapath for encrypt/decrypt, optimized ground & composite-field polynomials, integrated affine/bypass multiplexer circuits, fused Mix/InvMixColumn circuits and a folded ShiftRow datapath enable peak 2.2 Tbps/Watt AES-128 energy efficiency with a dense 2-round layout occupying 0.052 mm2, while achieving: (i) 53/44/38 Gbps AES-128/192/256 performance, 125 mW, measured at 1.1 V, 50 °C, (ii) scalable AES-128 performance up to 66 Gbps, measured at 1.35 V, 50 °C, (iii) wide operating supply voltage range with robust subthreshold voltage performance of 800 Mbps, 409 μW, measured at 320 mV, 50 °C (iv) 37% Sbox delay reduction and 25% area reduction with a compact Sbox layout occupying 759 μm2 (v) 67% reduction in worst-case interconnect length and 33% reduction in ShiftRow wiring tracks and (vi) 43 % reduction in Mix/InvMixColumn area with no performance penalty.
international solid-state circuits conference | 2009
Himanshu Kaul; Mark A. Anders; Sanu K. Mathew; Steven K. Hsu; Amit Agarwal; Ram K. Krishnamurthy; Shekhar Borkar
This paper describes a reconfigurable 4-way SIMD engine fabricated in 45 nm high-k/metal-gate CMOS, targeted for on-die acceleration of vector processing in power-constrained mobile microprocessors. The SIMD accelerator is reconfigured to perform 4-way 16b × 16b multiplies, 32b × 32b multiply, 4-way 16b additions, 2-way 32b additions or 72b addition with single-cycle throughput and wide supply voltage range of operation (1.3 V-230 mV). A reconfigurable 2 × 2 tile of signed 2s complement 16b multipliers, with conditional carry gating in the 72b sparse tree adder, dual-supplies for voltage hopping, and fine-grained power-gating enables peak energy efficiency of 494GOPS/W (measured at 300 mV, 50°C) with a dense layout occupying 0.081 mm2 while achieving: (i) scalable performance up to 2.8 GHz, 278 mW measured at 1.3 V; (ii) fast single-cycle switching between any operating/idle mode; (iii) configuration-dependent power reduction of up to 41% in total power and 6.5× in active leakage power; (iv) 10× standby leakage reduction during idle mode; (v) deep subthreshold operation measured at 230 mV, 8.8 MHz, 87 ¿W; and (vi) compensation for up to 3× performance variation in ultra-low voltage mode.
great lakes symposium on vlsi | 2002
Himanshu Kaul; Dennis Sylvester; David T. Blaauw
A new shielding scheme, active shielding, is proposed for reducing delays on interconnects. As opposed to conventional (passive) shielding, the active shielding approach helps to speed up signal propagation on a wire by ensuring in-phase switching of adjacent nets. Results show that the active shielding scheme improves performance by up to 16% compared to passive shields and up to 29% compared to unshielded wires. When signal slopes at the end of the line are compared, savings of up to 38% and 27% can be achieved when compared to passive shields and unshielded wires, respectively.
IEEE Design & Test of Computers | 2001
Dennis Sylvester; Himanshu Kaul
Addressing fundamental challenges to designing high-performance ICs in nanometer-scale technologies, the authors advocate a flexible approach to limiting both dynamic and static power. They recommend global-signaling strategies to curb communication power requirements and thermal management techniques to ease the burden on packaging.
design automation conference | 2001
Dennis Sylvester; Himanshu Kaul
We highlight several fundamental challenges to designing high-performance integrated circuits in nanometer-scale technologies (i.e. drawn feature sizes <100 nm). Dynamic power scaling trends lead to major packaging problems. To alleviate these concerns, thermal monitoring and feedback mechanisms can limit worst-case dissipation and reduce costs. Furthermore, a flexible multi-V/sub dd/+multi-V/sub th/+re-sizing approach is advocated to leverage the inherent properties of ultrasmall MOSFETs and limit both dynamic and static power. Alternative global signaling strategies such as differential and low-swing drivers are recommended in order to curb the power requirements of cross-chip communication. Finally, potential power delivery challenges are addressed with respect to ITRS packaging predictions.
IEEE Journal of Solid-state Circuits | 2012
Sanu K. Mathew; Suresh Srinivasan; Mark A. Anders; Himanshu Kaul; Steven K. Hsu; Farhana Sheikh; Amit Agarwal; Sudhir K. Satpathy; Ram K. Krishnamurthy
This paper describes an all-digital PVT-variation tolerant true-random number generator (TRNG), fabricated in 45 nm high-k/metal-gate CMOS, targeted for on-die entropy generation in high-performance microprocessors. The TRNG harvests differential thermal-noise at the diffusion nodes of a pre-charged cross-coupled inverter pair to resolve out of metastability, generating one random bit/cycle. A self-calibrating 2-step tuning mechanism using coarse-grained configurable inverters and fine-grained programmable clock delay generators, along with an entropy-tracking feedback loop provide tolerance to 20% PVT variation-induced device mismatches, enabling lowest-reported energy-consumption of 2.9 pJ/bit with a dense layout occupying 4004 μm2, while achieving: (i) 2.4 Gbps random bit throughput, 7 mW total power consumption with 0.7 mW leakage power component, measured at 1.1 V, 50°C, (ii) random bitstreams that passes all NIST RNG tests with raw entropy/bit measured up to 0.9999999993, (iii) good distribution of 1s with 4-bit entropy of 3.97996 and high-entropy pattern probability of 0.066 (iv) wide operating supply voltage range with robust sub-threshold voltage performance of 14 Mbps, 5.6 μW, measured at 280 mV, 50°C, (v) 12 fine-grained high-entropy settings for the TRNG to dither in during steady-state operation, (vi) <;3% error while using an analytical ergodic Markov chain model for predicting pattern probabilities and (vii) 200x higher throughput and 9x higher energy-efficiency than previously reported implementations. Design modifications for robust operation in 22 nm high-volume manufacturing in the presence of 3σ process variations demonstrate scalability of the all-digital design to future technologies.
IEEE Journal of Solid-state Circuits | 2009
Himanshu Kaul; Mark A. Anders; Sanu K. Mathew; Steven K. Hsu; Amit Agarwal; Ram K. Krishnamurthy; Shekhar Borkar
This paper describes a motion estimation engine fabricated in 65 nm CMOS, targeted for special-purpose on-die acceleration of sum of absolute difference (SAD) computation in real-time video encoding workloads on power-constrained mobile microprocessors. Four-way speculative difference computation using dual 4:2 compressors, optimal reuse of sum XOR min-terms in static 4:2 compressor carry gates, distributed accumulation of input carries for efficient negation and robust ultra-low voltage optimized circuits enable peak SAD efficiency of 12.8 macro-block SADs/nJ within a dense layout occupying 0.089 mm2 while achieving: (i) scalable performance up to 2.4 GHz, 82 mW measured at 1.4 V, 50degC , (ii) deep subthreshold operation measured at 230 mV while operating down to 4.3 MHz and consuming 14.4 muW , (iii) maximum energy efficiency of 411 GOPS/Watt by operating at 320 mV, 23 MHz and consuming 56 muW (9.6x higher efficiency than nominal 1.2 V operation), (iv) 20% higher energy efficiency for up-conversion of ultra-low voltage signals using a two-stage cascaded split-output level shifter, and (v) tolerance of up to plusmn2x process and temperature induced performance variation using supply voltage compensation of plusmn50 mV.
